Present: Bruce Barnett, Ian Brawn, Adam Davis, James Edwards, Eric Eisenhandler (chair), Norman Gee, Tony Gillman, Steve Hillier, Murrough Landon, Gilles Mahout, Adrian Mirea, Tamsin Moye, Viraj Perera, Weiming Qian, Dave Sankey, Richard Staley, Dimitrios Typaldos, Jürgen Thomas, Peter Watkins
Subsystem tests
  Overview of subsystem tests: Bruce
  Cluster Processor testing: Gilles
  Energy-sum testing: Jürgen
Processor hardware and firmware
  CPM 1.0 and 1.5, and LSM, status: Richard
  PCB manufacture and assembly: Tony
  CMM status: Ian
  What can we learn from Stockholm?: Ian
  CANbus: Adam
  G-link readout, VME--, and TTC: Weiming
  9U ROD specification status: Norman
  9U ROD design: Viraj
  Crate orders for 2004: Eric
Trigger interfaces
  TileCal patch panels: Tony
  Summary of CTP, LTP and Busy Module reviews: Norman
Schedule
  Hardware schedule: Tony
  Slice test and testbeam planning: Norman
Online and offline software, etc.
  Offline simulation: Steve for Alan
  Online software summary: Murrough
Any other business
  Testbeam open discussion
  Commissioning open discussion
  Firmware working group
Date of next UK meeting
Bruce presented an overview of subsystem testing that has been taking place in Lab 12 of R1. Many functional features of CPM, JEM and CMM have been explored (see the talks of Gilles and Jürgen), as have their readout (ROD and ROS), backplane and TTC interfaces.
The test infrastructure has advanced in turn. An ongoing problem is that testing competes for time with computing maintenance and software development. There is also a need to ensure that individual testing weeks flow from one to the next and that in the long term the relevant test phase-space is explored.
A number of areas require attention, in particular the definition of the precise objectives imposed by upcoming FDRs and subsequent module production. These must start to shape the programme in a more time-bounded way. In the future there will be new modules and firmware, which will tend to pull the work back towards a local view. A view of the tests from the perspective of the test beam, and then of installation and commissioning, needs to be elaborated. We need to be able to test CP and JEP together when we have PPMs.
The subsystem test of CPMs at RAL consists of three CPMs running next to each other. The timing of individual boards needs to be adjusted so that the data shared between CPMs are error-free. The timing scan is performed using the scanpath firmware, and the usual settings are to delay the internal clock (deskew1) by 104 ps for the CPM on the left, and by 1 ns for the CPM on the right. It has been discovered that parity errors can be generated even when the data agree with the simulation. The 1 ns error-free timing windows measured previously are therefore reduced to 400 ps if parity errors are also to be excluded. Once timing adjustments are made, data are correctly processed through the three boards.
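As a rough illustration of how an error-free window might be extracted from such a scan, the sketch below takes a list of delay settings with error counts and returns the widest contiguous clean window and its midpoint. The function name, the data layout and the example numbers are hypothetical and are not taken from the scanpath firmware or the existing test software; only the 104 ps step comes from the text above.

```python
STEP_PS = 104  # deskew step quoted in the minutes; everything else is illustrative


def widest_clean_window(scan):
    """Return (start_ps, end_ps, centre_ps) of the widest run of clean delays.

    `scan` is a list of (delay_ps, data_errors, parity_errors) tuples sorted
    by delay. A setting counts as clean only if BOTH error counts are zero.
    Returns None if no setting is clean.
    """
    best = None
    run_start = None
    for delay, data_err, parity_err in scan:
        if data_err == 0 and parity_err == 0:
            if run_start is None:
                run_start = delay
            if best is None or delay - run_start > best[1] - best[0]:
                best = (run_start, delay)
        else:
            run_start = None  # an error ends the current clean run
    if best is None:
        return None
    return best[0], best[1], (best[0] + best[1]) / 2


# Hypothetical scan: data errors at the edges, parity errors over a wider range
scan = [
    (i * STEP_PS, 1 if i < 2 or i > 9 else 0, 1 if i < 4 or i > 7 else 0)
    for i in range(12)
]
window = widest_clean_window(scan)
```

Requiring zero parity errors as well as zero data errors is what narrows the window in this toy model, in the same spirit as the reduction from 1 ns to 400 ps reported above.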
The readout path has also been tested successfully by feeding RoI and DAQ data to a ROD. The CPM firmware has been modified to add a minimum time of five ticks between two successive DAV signals, and the BCN number is now kept constant between slices. The CPM has also been changed to start/stop on reception of a broadcast command identical to that of the DSS. Although runs of 1 h have been performed successfully for the DAQ, the readout path is not stable and the G-link loses its lock from time to time. The RoI path seems more sensitive, possibly due to a different G-link driver. Measurement of the G-link output on the scope shows that too much jitter could be the source of the problem.
Some VME misbehaviour on CPM5 has been observed, with DTACK signals generated despite the fact that the CPM5 VME space was not addressed. Back at Birmingham this behaviour has not been seen, and at RAL it now seems that the problem has disappeared too. On the other hand, a new problem has been observed: on reception of a "stop" command, all CP chips belonging to CPM5 seem to lose all their delay settings. This is new and needs investigating. Full readout tests have been performed and work very well at different L1A rates, and with up to five slices in the case of the DAQ. A problem does occur when two L1As separated by five ticks are used: the BCN number is not refreshed for the second L1A. This is a firmware issue that needs to be corrected, but at least readout works with six ticks between two L1As.
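The BCN symptom described above lends itself to a simple offline check on read-out data. The sketch below is only an illustration: the event representation, the function name and the use of 3564 bunch crossings per LHC orbit as the BCN wrap-around value are assumptions, not a description of the CPM firmware or of the existing test software.

```python
ORBIT_LENGTH = 3564  # bunch crossings per LHC orbit (assumed BCN wrap-around)


def stale_bcn_events(events):
    """Return indices of events whose BCN does not match the L1A spacing.

    `events` is a list of (l1a_tick, bcn) pairs in arrival order, where
    l1a_tick is a free-running count of bunch-crossing ticks at which the
    L1A was issued (hypothetical test-bench bookkeeping). The first event
    is taken as the reference and is assumed to be correct.
    """
    if not events:
        return []
    ref_tick, ref_bcn = events[0]
    bad = []
    for i, (tick, bcn) in enumerate(events[1:], start=1):
        expected = (ref_bcn + (tick - ref_tick)) % ORBIT_LENGTH
        if bcn != expected:
            bad.append(i)
    return bad


# Hypothetical run: the L1A at tick 205 follows the previous one by five
# ticks and carries the previous BCN instead of a refreshed one.
events = [(100, 100), (106, 106), (200, 200), (205, 200), (3600, 36)]
flagged = stale_bcn_events(events)
```

Comparing every event against the first one, rather than against its predecessor, keeps a single stale BCN from also flagging the event that follows it.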
Integration testing of the JEM resumed during 12–16 January, with Uli, Cano and Stefan visiting and bringing along JEM 0.2 and 0.1. The G-Link problem has been fixed, and updated firmware and simulation were used. Two CMMs were available with energy-sum firmware, and 32 LVDS channels from two DSSs. As it was not possible to synchronise playback memory runs because the TTC short command was not seen, the LVDS inputs were used. The jet multiplicities were not looked at, but were overwritten by energy results.
The system produced the readout data as predicted, including for multi-slice readout. An overnight run of 300,000 events with two slices was performed. The system eventually failed during the night for unknown reasons. CMM tests continued with one CMM set to be a system merger and the other a crate merger, connected via cables. The JEMs send energy sums to both mergers. The CMMs were tested using the two JEMs as data input. A simple program dumped the various spy memory contents. The filling of the spy memories was very sensitive to the input clock settings. This was due to a firmware problem now fixed by Ian. Putting the JEMs into different quadrants caused sum-Ex and sum-Ey to cancel as expected. The final CTP bits have not been checked, partly because the sum-ET threshold could not be set. This problem was not seen in Ian's test setup.
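The cancellation can be illustrated with a simple vector sum. The sketch below is a floating-point toy model, assuming each JEM contributes a single ET value at one azimuthal angle and that the two JEMs sit in opposite quadrants; the real modules work with integer sums, and the function name, energies and angles are hypothetical.

```python
import math


def energy_sums(deposits):
    """Sum (et, phi) deposits into (sum_et, sum_ex, sum_ey).

    phi is the azimuthal angle in radians; Ex = ET*cos(phi), Ey = ET*sin(phi).
    """
    sum_et = sum(et for et, _ in deposits)
    sum_ex = sum(et * math.cos(phi) for et, phi in deposits)
    sum_ey = sum(et * math.sin(phi) for et, phi in deposits)
    return sum_et, sum_ex, sum_ey


phi = math.pi / 4
# Two equal deposits in opposite quadrants (phi and phi + pi)
opposite = energy_sums([(50.0, phi), (50.0, phi + math.pi)])
# The same two deposits placed in the same quadrant
same = energy_sums([(50.0, phi), (50.0, phi)])
```

In the opposite-quadrant case the scalar sum-ET is unchanged while sum-Ex and sum-Ey vanish to within rounding; in the same-quadrant case all three sums are non-zero.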
The next tests on 9–13 February will focus on matching the jet algorithm with the simulation, and on trying a complete chain with readout using new generator variants and the neutral ROD format for the CMM. Work has started on producing physics test vectors from Atlfast-Athena; these can now be read using a JEM generator option.
Norman asked about sending the signs of Ex and Ey to the CTP.
Present CPM (V1.0): Five CPMs have now been made; three are fully working but two have assembly defects with a number of BGA packages that remain even after being re-worked.
CPM#5 showed a fault with its VME interface when plugged into the RAL crate, but not when used back at Birmingham. The associated CPLD has been reprogrammed and the fault has yet to re-appear at RAL.
For all CPMs, the onboard G-link transmitters became unstable when CPMs were adjacent. Under certain load conditions, the termination-voltage supply circuit became unstable. This has been fixed, but the G-link receivers occasionally lose lock, more so when L1As are sent over the TTC link. Richard suspects this problem with the link is caused by excessive jitter on the clock derived from the onboard TTCdec. With excessive TTC activity, i.e. when the DSS removes the L1A signal, the jitter on the clock from the TTCdec will even cause the CP chip DLL/calibration circuit to fail.
As shown on the slide, the signal into the TTCdec looks fine, and although the cycle-to-cycle jitter seems acceptable, there appears to be some mid-frequency modulation of the clock frequency. The CPM PLLs pass on the clock without any reduction in jitter. The G-link transmitter fortunately does reduce the jitter, but not enough. This jitter worsens with increased TTC activity. One reason for the poor performance of the TTCdec could be the analogue supply connection of the TTCrx. Pin F5 should be tied to A_VDD and not the digital supply. The latest version of TTCdec has lower jitter, but needs a different connector so it will not fit on the current CPMs. A solution needs to be found, as the present TTCdec is unusable.
Birmingham is investigating the purchase of jitter analysis software for their Agilent scopes.
Next CPM (v 1.5): A reminder that this is still a prototype, and time is running out. Also, eventually, the CP chip firmware will need updating to decouple the clock that samples the backplane data from the module clock that outputs the Hit data.
Two PCBs of the new design will be made, having gold finish. The past problems with gold are now thought to be due to the inexperience of the assembly company used. Viraj drew up a contract for responsibility and obtained quotes from three other companies (see following talk by Tony). The order has been placed with DDi, who will deliver two modules by mid-March.
LVDS Source Module: Reminder, this module has 6U VME format, using ALTERA FPGAs to serialise and drive the LVDS data at 480 Mb/s. Richard would like the RAL drawing office to lay out the PCB in March or April. The LVDS serialisation firmware has been written and simulated. An updated specification will be placed on the web for comment. (So far only one reply.)
The saga of the last two years concerning 9U board manufacture and assembly has been very mixed. Two types of surface finish have been tried – Au on Ni, and Sn – and ATLAS trigger and CMS tracker have had good and bad experiences with both. The cost of this empiricism is too high, in time, money and effort, so we need expert advice from organisations such as NPL or TWI, and possibly consultants.
The use of a "one-stop shop" company avoids the problem of an endless blame-shifting loop between PCB and assembly companies. Three promising such companies were short-listed and then visited as part of a fact-finding exercise. All were found to have impressive capabilities and very comprehensive diagnostic equipment (Ersascope, 3-d X-ray, Automatic Optical Inspection, ...), which we believe is the key to achieving good yield, when coupled with high quality QA.
The unanimous advice we received was to use Au on Ni surface finish for high reliability, as with 60% of current surface-mount boards. The CPM1.5 layout was assessed as being sound and not presenting any problems for PCB manufacture or assembly.
One-stop shop companies will guarantee working boards, at their expense in case of initial failures, given that they also procure the components and, in some companies, carry out the JTAG testing.
Mike Johnson, ID Director, is chairing a working group from ID, CMS and ATLAS to study longer-term issues of manufacture, assembly and QA of complex PCBs.
Bruce asked about plans for burn-in of production modules. At least one of the companies offers this as an extra-cost service.
Ian reported on developments since the last meeting. A bug which corrupted VME access to the Crate Output RAM in the energy-summing firmware has been fixed. A problem reported with access to the ET thresholds cannot be reproduced outside the trigger lab; it is not caused by a bug in the CMM hardware or firmware. The fifth CMM has returned from having some faulty components replaced and awaits commissioning.
Ian congratulated Stockholm on the increased performance they have obtained from their firmware in the past 12 months. He examined the changes in design philosophy and method that have led to these improvements to see if they could be applied to the UK branch of the collaboration. He concluded that, as the UK and Stockholm methods of design are now very similar, there is little the two groups can teach each other at the moment.
Bruce asked if the CP chip would benefit from being reworked in the style of the jet firmware, with extensive use of generic parameters to increase flexibility. Ian thought this would not improve latency and the work would only be justified if the CP algorithm was liable to require frequent alteration in the future.
It is clear that a document is required to outline the requirements of the monitoring system, and a draft specification is being developed. This document outlines the minimum requirements and a proposal for the alert messaging scheme, and specifies the use of message buffers within the distributed embedded system. It is an essential document that needs urgent attention so that neither Adam Davis nor Andrey Belkin spends time developing unusable code.
Eric made it clear that it is now our responsibility to control the crate fans and crate shutdown if necessary, contradicting the comments made in the last meeting. It is therefore necessary to obtain further documentation on how to control the crate, using the CAN, as the current documentation does not cover this.
In discussion, the need for some sort of overall DCS design was stressed, probably to be done by a working group.
The G-Link DAV gap requirement for ROD firmware has been investigated. Various G-Link data formats with various DAV gaps were simulated. There are three factors affecting the G-Link DAV gap:
A VME-- signal integrity problem is observed during the 9U custom backplane VME scan test. This problem originates from the current VMM design. Many proposals to improve VMM were presented, including:
Norman reported on progress with the 9U ROD specification. The document has been further corrected with help from Eric. Except for some tidying, the register model is also complete with help from Panagiotis. Details of data formats, data rates, and some details of register bits have still to be added.
In manufacture, one module will be populated, then two more for tests. The issues of firmware sharing and the firmware catalogue still await solutions. It is proposed to update the 6U ROD data formats for compatibility with the new 9U formats. This will need careful scheduling.
There was some discussion on how to mix 6U and 9U RODs, with suggestions about using the neutral data format and/or a flag in the header indicating whether the data came from a 6U or 9U ROD.
The schematics are almost complete now. A few items such as allocation of some signals (GA5, 3V3 to rear S-Link card) need finalising.
Meanwhile we have started checking the schematics. The schedule was to hand over the design to the Drawing Office last week; however, unavoidable last-minute changes have delayed this. The main one was the power distribution, due to the change from VME64xP (J0 has a few more +5V pins) to VME64x. This meant we needed to use 48V supplies from the J1 connector with DC-DC converters to generate the additional power that we need.
While the checking goes on, the D.O. can start some of the layout (already checked) such as the optical + electrical modules (9 off) and G-links + input FPGAs (5 off) modules; this is step and repeat once one module has been placed. There is a long waiting time for Xilinx parts, hence we need to place the order now!
Lastly, Panagiotis, who has been working on the ATLAS level-1 trigger for over two years and is currently doing the schematics for the 9U ROD, has resigned from RAL and will be leaving at the end of February.
We had been asked to place all orders for ATLAS-standard crates for 2004 by December. Eric reported on some details and the status of the orders. We need four different types of crate, with a further subdivision into crates for the trigger system (plus spares), and crates for test rigs.
TileCal receiver crates: Identical to LAr receiver crates, so Pittsburgh order has been copied. No 6U slots, custom non-VME backplane. These are low-ish power so can be air-cooled. Two crates ordered.
CP and JEP crates: Higher power, so except for test rigs they will be water-cooled. Two 6U slots for CPU, custom backplane. Seven ordered for system, two for test rigs (we already have another as well).
ROD crates: Again, power consumption dictates water cooling for all but test rigs. Two 6U slots for CPU. There are some problems, due to limitations and confusion concerning power-handling capabilities of pins. We had asked for VME64xP as a result, but were advised to avoid this due to possible but unlikely J0 conflicts. A solution has been found for VME64x, but this demands some 48V power that will be converted down to 5V. Three crates ordered for system (PPr, CP, JEP), and one for a test rig.
Preprocessor crates: Here the power needed is higher than is provided in any ATLAS-standard option. And again there is confusion regarding the use of VME64xP vs. VME64x. Heidelberg will continue direct discussions with Wiener. Nine crates needed for system, and one for a test rig.
The function of the TileCal Patch-Panel (TCPP) is to peel off the TileCal muon trigger signals from the calorimeter trigger signals which share the same cables. It consists of 64 unpowered 9U modules in four crates, each receiving four 50-way cables from the TileCal and re-ordering the signals on to four 37-way cables.
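The cable counts implied by these numbers can be checked with a few lines of arithmetic. This is a sketch only: it assumes the 64 modules are spread evenly over the four crates and that "each" above refers to each module, neither of which is stated explicitly; the variable names are illustrative.

```python
MODULES = 64
CRATES = 4
CABLES_IN_PER_MODULE = 4   # 50-way cables from the TileCal
CABLES_OUT_PER_MODULE = 4  # 37-way cables after re-ordering

# Assumes an even split of modules across crates
modules_per_crate = MODULES // CRATES
input_cables = MODULES * CABLES_IN_PER_MODULE
output_cables = MODULES * CABLES_OUT_PER_MODULE

print(modules_per_crate, input_cables, output_cables)
```

Under these assumptions each crate holds 16 modules, and the full patch panel handles 256 incoming 50-way cables and 256 outgoing 37-way cables.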
The design and manufacture of three prototype modules was done by CERN, with assembly by Birmingham. A Test Plan is being prepared to check for appropriate connectivity, grounding, crosstalk effects, etc.
Norman chaired a CTP review at CERN on 16 January. Planned as an FDR, it was changed to an interim review with the FDR to be held after the test beam. The CTP has expanded from one board to a subsystem of eleven modules of six types, with three backplane buses. Prototypes exist for the monitoring module and backplanes, but three modules have no documentation, and the other two modules are at the design stage and have limited documentation. The CORE module is complex, and a reduced prototype will be considered for the beam. The reviewers have asked for additional information. The review also received the URD (which needs modest updating) and a document on DAQ, which has some consequences for L1Calo.
The ROD-BUSY and LTP passed their PDRs at the same review. Central software support has been requested.
The Pre-PRR schedule has been updated to reflect the further slippages in some areas. The full Slice Tests cannot proceed until the first PPM is available in April. In May, the first CPM1.5 and JEM1.0 should be available, when the Slice Tests will continue with a hybrid mix of old and new module designs. The critical item is the 9U ROD, of which we will need two tested modules to replace all the 6U RODs in the Slice Tests. These will probably not be available at least until September, when the Slice Test system should consist entirely of the new-generation modules. The Combined Test-Beam run in August/September will impose a two-month break in the Slice test programme, which is unlikely to be complete by August. Although the main review process is scheduled to start in June, there could be serious overload from competing demands on our time, so it is proposed to shorten the 12-month sub-system commissioning period in 2005/6 to allow time to carry out the Slice Tests more thoroughly, and therefore to delay the FDR and PRR period until Q4 2004.
Norman noted the huge amount of work from many people going into lab tests and software. There is also quite a lot of travelling and quite a lot of late evenings. Many more of the interfaces have been tested.
We need a list of detailed checks to be made during tests, to ensure systematic coverage. The list should include evidence needed for FDRs.
Major missing areas include histogramming and test beam analysis software. A group is starting work. Big decisions still to be made include the timing of the move to ROD-Crate DAQ and the choice of conditions database.
Alan is now formally in charge of the offline simulation. It has been converted to "ATLAS" units, which are MeV and mm. The trigger-tower simulation by the calorimeter people is progressing, but slowly.
A most welcome development in the L1Calo online software has been the inclusion of software for the PPM. There have also been developments in the JEM and CMM areas connected with firmware and simulation changes and addition of test vector generators.
In the near future we will need to move to a new version of the ATLAS online software. This requires changes to our database code (mostly done) and also to module services as the run state model has changed. We hope to make this migration in the week 16–20 February. We will also probably have to move to the new ROD crate DAQ when that is released.
We still have some major gaps in our software. One of these is monitoring, either in the online environment or in ATHENA (or both). We are starting to address this. We also need to think about the conditions database – however we have done no work in this area at all.
Testbeam open discussion and commissioning open discussion: Eric reminded everyone that on 10 Feb. there would be an open meeting at CERN, with people invited to phone in and make comments; the agenda aims for discussion and is not packed with talks. The following day a similar meeting is aimed at addressing commissioning.
Firmware working group: For some time we have been aware of a need for an archive and version control system for our large body of firmware. Eric announced formation of a working group to seek a solution. It will consist of Steve Hillier (chair), Ian Brawn, Tony Gillman, Dave Sankey, and Sam Silverstein.
Date not yet fixed; will be after the March Joint Meeting at Heidelberg.
Eric Eisenhandler, 25 February 2004