

#### FPGA-based Compute Node for Data Acquisition and Trigger in HADES and PANDA

Under the collaboration of

JLU (Giessen) and IHEP (Beijing)

JUSTUS-LIEBIG-UNIVERSITÄT GIESSEN



<u>Ming Liu<sup>1</sup></u>, Johannes Lang<sup>1</sup>, Zhen'an Liu<sup>2</sup>, Hao Xu<sup>2</sup>, Qiang Wang<sup>1,2</sup>, Dapeng Jin<sup>2</sup>, Soeren Lange<sup>1</sup>, Johannes Roskoss<sup>1</sup>, Andreas Kopp<sup>1</sup>, David Muenchow<sup>1</sup>, Wolfgang Kuehn<sup>1</sup>

Acknowledgement: BMBF&GSI 06GI179 06GI180, FZ-Juelich COSY-099 41821475



### Outline

- Physics background of HADES & PANDA
- ATCA platform architecture
- Compute Node (CN) HW design
- HW/SW co-design on FPGAs
- Detector-specific algorithm development
- Current status and outlook



### **Physics Background**

Modern experiments such as HADES and PANDA at FAIR require a data acquisition and trigger system with the following features:

- High reaction rate and high data rate (PANDA, 10-20 MHz, >200 GB/s)
- Large channel count from detectors (>10<sup>5</sup> channels)
- General-purpose use for multiple experiments
- A design methodology for easy application development
- Scalability for new detectors and higher data rate


Motivation: a **powerful**, **scalable**, and **universal** platform for DAQ and triggering.



### **Computation Platform Architecture**



- Pattern recognition algorithms implemented for triggering
- Multiple CNs for algorithm partition and parallel/pipelined processing
- CNs internally interconnected by the full-mesh ATCA backplane
- The number of crates to be decided according to the incoming data rate and computation needs
- External interconnections:
  - Optical links
  - Gigabit Ethernet



### ATCA Full-mesh Backplane

- Full-mesh backplane network
- High flexibility to correlate results from different algorithms
- High performance
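
As a rough illustration of what "full mesh" implies, the sketch below counts the point-to-point links needed so that every node reaches every other node in a single hop. The slot count of 14 is an assumption for illustration only (the slide does not state it), and `full_mesh_links` is a hypothetical helper name.

```python
def full_mesh_links(n_nodes: int) -> int:
    """Number of bidirectional point-to-point links in a full mesh of n_nodes."""
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    # Every pair of nodes gets its own direct link: n choose 2.
    return n_nodes * (n_nodes - 1) // 2


ASSUMED_SLOTS = 14  # assumption: slot count of the crate, not given on the slide
links_in_crate = full_mesh_links(ASSUMED_SLOTS)
print(f"{ASSUMED_SLOTS} slots need {links_in_crate} direct links")
```

The quadratic growth in link count is the price paid for being able to correlate results between any two CNs without an intermediate hop.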







### **Compute Node**




- Prototype board with 5 Xilinx Virtex-4 FX60 FPGAs
- 4 FPGAs as algorithm processors
- 1 FPGA as a switch
- 2 GB DDR2 per FPGA
- Full-mesh communication onboard
- IPMC, Flash, CPLD, ...
- External links:
  - Optical links
  - Gigabit Ethernet





## HW/SW Co-design on FPGAs

Aim: to ease and accelerate development on CNs for different experiments & algorithms

#### Partitioning strategy:

- Computation-intensive algorithms implemented in the FPGA fabric for high performance and real-time features (parallel & pipelined processing in HW)
- Slow controls in SW (OS + Applications):
  - To remotely upgrade the HW and SW designs
  - Network test and measurements
  - To display and adjust experimental parameters
  - ...

- Communication stack processing (TCP/IP) handled in SW by the Linux OS



### HW Design on FPGAs



- A uniform system design for all applications (MPMC-based)
- Customized processing modules for different algorithms
- Easy system integration with the guarantee of high performance



## SW Design

- Open-source Linux on embedded PowerPCs
- Physicists' favorite OS; easy to operate and program
- Device drivers:
  - For Ethernet, UART, Flash memory, etc.
  - For customized processing units
- Applications for slow controls:
  - High level scripts
  - C/C++ programs
  - Webpages on Apache server
  - Java program on VM
  - ...
- Many tools provided: NFS, telnet, ...
- Software cost: almost zero





### Remote Reconfigurability

- Remote reconfigurability is provided to cope with the spatial constraints in experiments.
- Both the OS kernel and the FPGA bitstreams are stored in NOR flash memory.
- With the support of the MTD driver, the bitstreams and the kernel can be overwritten and upgraded from within Linux.
- Commands are issued remotely over the network.
- A backup mechanism guarantees that the system stays alive.
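
The idea of an upgrade protected by a backup can be sketched in software as follows. This is only an illustration under stated assumptions, not the mechanism used on the CN: regular files stand in for the flash partitions (a real NOR MTD partition must be erased before it is written), and `safe_update`, the file names, and the SHA-256 read-back check are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path) -> str:
    """Checksum of a whole file (stand-in for reading back a flash partition)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def safe_update(new_image, target, backup) -> bool:
    """Overwrite target with new_image; restore the backup if the read-back fails.

    Returns True if the new image is in place, False if the old one was restored.
    """
    shutil.copyfile(target, backup)  # keep the known-good image first
    try:
        shutil.copyfile(new_image, target)
        ok = sha256(target) == sha256(new_image)  # read-back verification
    except OSError:
        ok = False
    if not ok:
        shutil.copyfile(backup, target)  # roll back so the system stays bootable
    return ok


# Demo with temporary files standing in for the kernel partition (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "kernel.new").write_bytes(b"new kernel image")
    (tmp / "kernel.img").write_bytes(b"old kernel image")
    updated = safe_update(tmp / "kernel.new", tmp / "kernel.img", tmp / "kernel.bak")
    installed = (tmp / "kernel.img").read_bytes()
    kept = (tmp / "kernel.bak").read_bytes()
```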





## Algorithm Development

#### Example: HADES track reconstruction (inner)



- Particle tracks bent in the magnetic field between the coils
- Tracks are approximately straight lines before and after the coils
- Inner and outer track segments point to the RICH and TOF detectors, respectively, and help them find patterns (correlation)
- Similar principle for inner and outer segments; only the inner part is discussed here
- The particle track reconstruction algorithm for HADES was previously implemented in SW because of its complexity.
- Now implemented and investigated as a case study in HW



#### **Basic Principle**










### HADES Track Reconstruction



- PLB slave interface (PLB IPIF) for system control
- LocalLink master interface for data movement from/to memory
- Algorithm processor (tracking processor)



### Modular Design



- TPU for track reconstruction computation
- Input: fired wire numbers
- Output: positions of track candidates on the projection plane
- Sub-modules:
  - Wire No. Wr. FIFO
  - Proj. LUT & Addr. LUT
  - Bus master
  - Accumulate unit
  - Peak finder
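
The data flow through the sub-modules listed above (look-up table, accumulate unit, peak finder) can be sketched in software as follows. This is a minimal model under assumptions, not the HW implementation: the toy LUT contents, the 2D cell coordinates, the threshold, and the 8-neighbour peak criterion are all invented for illustration.

```python
from collections import Counter


def accumulate(fired_wires, proj_lut):
    """Accumulate unit: add one count to every projection-plane cell
    that the LUT associates with a fired wire."""
    hist = Counter()
    for wire in fired_wires:
        for cell in proj_lut.get(wire, ()):
            hist[cell] += 1
    return hist


def find_peaks(hist, threshold):
    """Peak finder: cells whose count reaches the threshold and is not
    exceeded by any of the 8 neighbouring cells."""
    peaks = []
    for (x, y), count in hist.items():
        if count < threshold:
            continue
        neighbours = [
            hist.get((x + dx, y + dy), 0)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        if all(count >= n for n in neighbours):
            peaks.append((x, y))
    return sorted(peaks)


# Toy LUT (assumption): wire number -> projection-plane cells it covers.
toy_lut = {
    0: [(1, 1), (1, 2)],
    1: [(1, 1), (2, 1)],
    2: [(1, 1), (0, 1)],
    3: [(5, 5)],
}
hist = accumulate([0, 1, 2, 3], toy_lut)
candidates = find_peaks(hist, threshold=3)
print(candidates)
```

In this toy case three wires overlap in one cell, which becomes the only track candidate; the isolated hit from wire 3 stays below the threshold.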



#### Implementation Results

| Resources        | MPMC-based FPGA system design (no application processor) | TPU module                | MPMC-based system with the TPU |
|------------------|-----------------------------------------------------------|---------------------------|--------------------------------|
| 4-input LUTs     | 10008 out of 50560 (19.8%)                                | 6210 out of 50560 (12.3%) | 16218 out of 50560 (32.1%)     |
| Slice Flip-Flops | 8440 out of 50560 (16.7%)                                 | 2966 out of 50560 (5.9%)  | 11406 out of 50560 (22.6%)     |
| Block RAMs       | 53 out of 232 (22.8%)                                     | 45 out of 232 (19.4%)     | 98 out of 232 (42.2%)          |
| DSP Slices       | 0                                                         | 0                         | 0                              |

- Resource utilization of the TPU on a Virtex-4 FX60: <1/5 of the FPGA (acceptable!)
- Timing limit: 125 MHz achievable without optimization effort
- Clock frequency fixed at 100 MHz to match the PLB speed
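
A back-of-the-envelope estimate of how many TPU cores might fit on one FX60 can be derived from the table. The sketch takes the cost of one core as the difference between the "system with the TPU" and "system only" columns and assumes that every further core costs the same; it ignores routing, timing closure, and memory bandwidth, so the result is an optimistic bound rather than a design figure. The helper names are hypothetical.

```python
# Figures from the table above (Virtex-4 FX60).
TOTAL = {"luts": 50560, "flip_flops": 50560, "brams": 232}
BASE_SYSTEM = {"luts": 10008, "flip_flops": 8440, "brams": 53}
SYSTEM_WITH_TPU = {"luts": 16218, "flip_flops": 11406, "brams": 98}


def per_core_cost(base, with_one_core):
    """Cost of one TPU core, taken as the difference of the two system columns."""
    return {name: with_one_core[name] - base[name] for name in base}


def max_cores(total, base, core):
    """Cores that fit per resource once the base system is placed (naive bound)."""
    return {name: (total[name] - base[name]) // core[name] for name in total}


tpu = per_core_cost(BASE_SYSTEM, SYSTEM_WITH_TPU)
fit = max_cores(TOTAL, BASE_SYSTEM, tpu)
limiting = min(fit, key=fit.get)
print(tpu, fit, limiting)
```

Under these assumptions the Block RAMs are the first resource to run out.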



### **Performance Evaluation**

#### Experimental setup:

- A C program running on a 2.4 GHz Xeon computer as the software reference
- Measurement points at different wire multiplicities (10, 30, 50, 200, 400 fired wires out of 2110)
- Speedup of 10.8 to 24.3 times per module compared to the software solution
- Multiple cores integrated on each FPGA for parallel processing (performance speedup of more than two orders of magnitude expected for each CN)
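
The expectation of more than two orders of magnitude per CN can be checked with simple arithmetic. The sketch assumes ideal linear scaling over cores and over the 4 algorithm FPGAs of a CN, which real systems will not fully reach (shared memory and I/O bandwidth); `cn_speedup` and `cores_needed` are hypothetical helper names.

```python
import math

ALGO_FPGAS_PER_CN = 4  # from the CN slide: 4 of the 5 FPGAs are algorithm processors


def cn_speedup(per_module, cores_per_fpga, fpgas=ALGO_FPGAS_PER_CN):
    """Speedup of a whole CN, assuming ideal linear scaling (an assumption)."""
    return per_module * cores_per_fpga * fpgas


def cores_needed(target, per_module, fpgas=ALGO_FPGAS_PER_CN):
    """Smallest number of cores per FPGA that reaches the target CN speedup."""
    return math.ceil(target / (per_module * fpgas))


for per_module in (10.8, 24.3):  # measured range per module
    n = cores_needed(100, per_module)
    print(f"{per_module}x per module: {n} cores per FPGA "
          f"give about {cn_speedup(per_module, n):.1f}x per CN")
```

With the lowest measured per-module speedup, 3 cores per FPGA would already exceed a factor of 100 under this idealized model.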





### Other Algorithms for HADES & PANDA

In addition to the HADES MDC tracking, other algorithms are being developed for HADES and PANDA:

- HADES ring recognition for RICH (Johannes Roskoss, HK 67.105)
- HADES shower recognition for Electromagnetic Shower (Andreas Kopp, HK 67.105)
- PANDA tracking for Straw Tube Tracker (David Muenchow, HK 67.101)

All algorithms are to be implemented on CNs for HW processing.



#### **Current Status**

- The first-version CN PCB has been tested:
  - Optical links (2 Gbps to TRB2, 0 bit errors in a 150-hour test)
  - Gigabit Ethernet (UDP/IP: ~400 Mbps, TCP/IP: ~300 Mbps)
  - JTAG chain
  - CPLD+Flash system start-up mechanism and remote reconfigurability
  - DDR2 SDRAM
  - Other peripherals
- Algorithms under development & implementation



### Outlook

- The next version PCB will be produced soon.
- More than 3 boards for network investigation.
- All algorithms to be implemented.
- Network parallel/pipelined processing investigation with multiple CNs.
- At the end of 2009: one running ATCA crate for the HADES upgrade.
- PANDA in the future ...



# Thanks for your attention!



## Algorithm Development

Example 2: A universal event selector





#### **Event Selector**



Measurement results:

- Processing capability of data flow
- Event selection rates of 100% & 25%
- Different FIFO sizes (DMA sizes)
- Processing throughput of ~150 & ~100 MB/s (could be higher)