



## A Reconfigurable Design Framework for FPGA Adaptive Computing

**Ming Liu**<sup>+‡</sup>, Wolfgang Kuehn<sup>+</sup>, Zhonghai Lu<sup>‡</sup>, Shuo Yang<sup>+</sup>, Axel Jantsch<sup>‡</sup>

<sup>+</sup> Justus-Liebig-University Giessen (JLU), Germany
<sup>‡</sup> Royal Institute of Technology (KTH), Sweden

# Outline

- Introduction & Motivation
- Reconfigurable Framework for Adaptive Computing
  - HW infrastructure
  - OS, device drivers & scheduler SW
  - Context saving and restoring
- A Case Study
- Technical Perspectives in Applications
- Conclusion & Future Work

# **Introduction & Motivation**

- Adaptive computing: algorithms adapted to ambient conditions during system run-time.
- Benefits:
  - Higher performance
  - Lower power consumption
  - Multitasking on limited resources
  - ...
- Conventional adaptive multitasking on general-purpose CPUs + OSes: well established, thanks to mature OSes & schedulers
  - Computing resources (CPU) intelligently and efficiently utilized
- In the FPGA world?
  - Static designs? Not adaptive
  - Partial Reconfiguration (PR)? Provides the technical support
- Motivation: a complete design framework for more efficient hardware resource management, based on FPGA PR technology.

# Framework (HW Infrastructure)

- Xilinx PR design flow [1]
- Modular algorithm designs (A1, A2, ...)
- PR Region (PRR)
- PR communication interface (BMs)
- System manager (GP CPU)
- Peripherals, memories, …
- ICAP design [2]
  - mst\_hwicap
  - Conf. speed: ~235 MB/s
  - Conf. overhead: xx µs to xxx µs

[1] Xilinx Inc., "Early Access Partial Reconfiguration User Guide for ISE 8.1.01i", UG208 (v1.1), Mar. 2006.

[2] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, "Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration", *In Proc. of the International Conference on Field Programmable Logic and Applications*, Aug. 2009.



# Framework (OS, Drivers & Scheduler)

- Embedded OS or Standalone
- Device drivers for algorithm modules
  - Software control registers
  - Interrupts
- Algorithm scheduler
  - Application programs (flexible & portable)
  - Monitors ambient conditions and triggers algorithm switching
  - HW processes are preemptable and comply with the scheduler
  - Flexible disciplines
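The scheduler's role can be illustrated with a small simulation: it samples an ambient condition and requests an algorithm switch only when the preferred module changes. This is a minimal sketch under assumptions. The `load` metric, the threshold discipline, and the `reconfigure` callback are hypothetical and not taken from the framework; the module names `A1` and `A2` are borrowed from the HW infrastructure slide purely as labels.

```python
from typing import Callable, Iterable, List


def choose_module(load: float, threshold: float = 0.5) -> str:
    """Hypothetical discipline: prefer A2 under high load, otherwise A1."""
    return "A2" if load > threshold else "A1"


def run_scheduler(samples: Iterable[float],
                  reconfigure: Callable[[str], None],
                  initial: str = "A1") -> List[str]:
    """Walk through sampled conditions and switch modules only on change.

    Returns the module resident in the PR region after each sample.
    """
    current = initial
    history: List[str] = []
    for load in samples:
        wanted = choose_module(load)
        if wanted != current:
            # Stands in for the whole PR switching sequence (assumption).
            reconfigure(wanted)
            current = wanted
        history.append(current)
    return history
```

Because the discipline is isolated in `choose_module`, a different policy can be swapped in without touching the switching loop, which is one way to read "flexible disciplines".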



# Framework (Context Switching)

#### Context

- Control registers
- Buffered incoming data
- Intermediate calculation results
- To be saved and restored for algorithm modules in many cases
- Concrete approaches [3][4]:
  - Register read and write
  - Bitstream readout and analysis
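The register read/write approach can be sketched in simulation as follows. This is a minimal sketch, not the implementation from [3] or [4]: `ModuleRegisters` is a hypothetical stand-in for a module's software-accessible registers, and the 32-bit register width is an assumption.

```python
from typing import Dict, List


class ModuleRegisters:
    """Hypothetical stand-in for a module's software-accessible registers."""

    def __init__(self, size: int) -> None:
        self._regs = [0] * size

    def read(self, index: int) -> int:
        return self._regs[index]

    def write(self, index: int, value: int) -> None:
        # 32-bit registers are an assumption of this sketch.
        self._regs[index] = value & 0xFFFFFFFF


def save_context(module: ModuleRegisters, indices: List[int]) -> Dict[int, int]:
    """Read out the registers that make up the module's context."""
    return {i: module.read(i) for i in indices}


def restore_context(module: ModuleRegisters, context: Dict[int, int]) -> None:
    """Write a previously saved context back into a freshly loaded module."""
    for index, value in context.items():
        module.write(index, value)
```

Only the registers listed in `indices` are saved, which reflects that the designer decides which state belongs to the context.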

[3] H. Kalte and M. Porrmann, "Context Saving and Restoring for Multitasking in Reconfigurable Systems", *In Proc. of the International Conference on Field Programmable Logic and Applications*, Aug. 2005.

[4] C. Huang and P. Hsiung, "Software-controlled Dynamically Swappable Hardware Design in Partially Reconfigurable Systems", *EURASIP Journal on Embedded Systems*, Jan. 2008.

# A Case Study

- Case study: switching between a NOR flash memory controller and an SRAM controller (Virtex-4 FX20)
- A pre-verification for algorithm switching in real applications
  - Existing IP cores, used without modification
  - Same connection interfaces (PLB, I/Os)
  - Saves I/O pins and resources on the FPGA

## Motivation:

- NOR flash memory: embedded Linux kernel
- SRAM: LUT storage for application-specific computation
- To share FPGA resources and access different memories according to system requirements

# A Case Study (HW Design)

- PR Region (PRR) reserved
- Flash and SRAM controllers to be loaded: standard IP cores, used without modification
- Shared I/O pins to external devices
- Customized BM interfaces
  - To lock signal routing between static & PR designs
  - BM\_out\_en to isolate unpredictable outputs during reconfiguration
  - Reset to reset only the newly loaded core after reconfiguration
- GPIO controls
  - BM\_out\_en
  - Reset



# A Case Study (Operation Flow)

- Operations in Linux
  - a. Context saving
  - b. Remove device driver
  - c. Disable BM outputs
  - d. Module reconfiguration
  - e. Reset the module
  - f. Re-enable BM outputs
  - g. Insert device driver
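The ordering of these steps matters: BM outputs are disabled before reconfiguration starts and re-enabled only after the new module has been reset. The sketch below records the sequence in simulation. It is a minimal sketch under assumptions: `LoggingPlatform` and `switch_module` are hypothetical names, and no real driver, GPIO, or ICAP calls are made.

```python
from typing import List


class LoggingPlatform:
    """Hypothetical platform layer that only records the requested steps."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def step(self, name: str) -> None:
        self.log.append(name)


def switch_module(platform: LoggingPlatform, old: str, new: str) -> None:
    """Replay the operation flow for swapping `old` out and `new` in."""
    platform.step(f"save context of {old}")
    platform.step(f"remove driver of {old}")
    platform.step("disable BM outputs")        # isolate the PR region
    platform.step(f"reconfigure PRR with {new}")
    platform.step(f"reset {new}")              # reset only the new core
    platform.step("enable BM outputs")
    platform.step(f"insert driver of {new}")
```

For example, `switch_module(LoggingPlatform(), "flash", "sram")` replays the seven steps for swapping the flash controller out and the SRAM controller in.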

# A Case Study (Results)

#### Evaluation

- $L_{static} = L_1 + L_2 + \dots + L_n$ (static LUTs)
- $F_{static} = F_1 + F_2 + \dots + F_n$ (static flip-flops)
- $L_{PR} = L_{PRR} + \frac{1}{2} L_{BM}$ (PR LUTs)
- $F_{PR} = F_{PRR}$ (PR flip-flops)
- $L_{PRR} = F_{PRR} = \max\left[\left(L_1 + \frac{1}{2} L_{BM}\right), \dots, \left(L_n + \frac{1}{2} L_{BM}\right), F_1, \dots, F_n\right] + R_{margin}$ (to reserve the PR region)

Resource utilization benefits with PR

|           | Static design | PR design            | Utilization factor |
|-----------|---------------|----------------------|--------------------|
| I/O pins  | 56(s)+61(f)   | 61                   | 52.1%              |
| 4-LUTs    | 954(s)+923(f) | 1624 (PRR + 1/2 BMs) | 86.5%              |
| Slice FFs | 728(s)+867(f) | 1296 (PRR)           | 81.3%              |

- Reconfiguration overhead: 299 µs for a 71 KB bitstream
- More benefits foreseen when more and larger algorithm modules multiplex the PR region
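The figures above can be checked with a few lines of arithmetic. In this sketch the utilization factor is taken to be the PR design's resource count divided by the sum of the two static designs; that definition is inferred from the numbers in the table rather than stated on the slide. The throughput check assumes 1 KB = 1000 bytes, which is also an assumption.

```python
from typing import Sequence


def utilization_factor(static_counts: Sequence[int], pr_count: int) -> float:
    """PR resource count as a percentage of the summed static designs."""
    return 100.0 * pr_count / sum(static_counts)


def throughput_mb_per_s(size_kb: float, time_us: float) -> float:
    """Configuration throughput; bytes per microsecond equals MB/s."""
    return size_kb * 1000.0 / time_us


# (static design counts, PR design count), values copied from the table.
rows = {
    "I/O pins": ((56, 61), 61),
    "4-LUTs": ((954, 923), 1624),
    "Slice FFs": ((728, 867), 1296),
}

factors = {name: round(utilization_factor(static, pr), 1)
           for name, (static, pr) in rows.items()}
# factors == {"I/O pins": 52.1, "4-LUTs": 86.5, "Slice FFs": 81.3}

speed = throughput_mb_per_s(71, 299)  # roughly 237 MB/s
```

Under these assumptions, the measured 299 µs for a 71 KB bitstream is in line with the ~235 MB/s configuration speed quoted for mst\_hwicap.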

# **Application Perspectives**

- To be applied and verified in nuclear and particle physics experiments (HADES, PANDA, WASA, etc.)
- Large-scale massive computing for online data acquisition (DAQ) and triggering based on FPGA clusters

Motivation:

- Multiple pattern recognition algorithms
- Multiple cores for parallel processing
- Different computation features for algorithms (computation-bound, memory-bound, ...)
- Conventional approach: algorithm partitions are statically distributed on FPGA nodes by designers
- Too complicated to manage and modify the design

# **Application Perspectives**

#### Expected benefits from adaptive computing:

- Easy dynamic design management
- Efficient resource utilization for higher performance
- Reduced FPGA size/count (costs)



#### Adaptive design:



- A uniform design in adaptive computing makes system designs easy to maintain
- No data distribution requirements for optical hubs (all kinds of sub-events fed into all FPGAs)
- Balanced computing and more efficient FPGA resource utilization [5]

[5] M. Liu, Z. Lu, W. Kuehn, and A. Jantsch, "FPGA-based Adaptive Computing Framework for Correlated Multi-stream Processing", In Proc. of the Design, Automation & Test in Europe Conference 2010, to appear.

Dec. 09, 2009

[Figure: static design]

# **Conclusion and Future Work**

Conclusion:

- A design framework for FPGA-based adaptive computing
- Key aspects discussed
- A case study using general memory controllers
- Technical perspectives in target applications

#### Future Work:

- In-depth research into individual aspects of the framework
- Verification with real algorithms for physics experiments





# Thank You!

# Framework (PR Tech. Support)

- ICAP for FPGA configuration
- ICAP designs [2]
  - Xilinx opb\_hwicap
  - Xilinx xps\_hwicap
  - Improved mst\_hwicap
  - Improved bram\_hwicap
- Practical mst\_hwicap
- Conf. speed: ~235 MB/s (from DDR memory)
- Conf. overhead: xx µs to xxx µs



