FPGA ML Accelerator
A custom hardware accelerator for convolutional neural network (CNN) inference on Xilinx UltraScale+ FPGAs.
Overview
The accelerator targets high-performance CNN inference on Xilinx UltraScale+ FPGAs. It achieves 2.4 TOPS at 200 MHz, enabling real-time image classification at the edge without a GPU.
Problem
Deploying deep learning models at the edge is constrained by power budgets and latency requirements. GPUs are power-hungry, and CPUs are too slow for real-time inference. FPGAs offer a middle ground: programmable hardware that can be tuned for specific workloads.
The challenge is to design an accelerator that is fast enough for real-time inference yet flexible enough to support multiple CNN architectures without re-synthesis.
Solution
The accelerator is built around a systolic array with configurable dataflow and consists of four blocks:
- Compute Core — 16×16 systolic array of multiply-accumulate (MAC) units operating in INT8 precision.
- On-Chip Buffer — Dual-port BRAM-based buffer with prefetching to hide memory latency.
- DMA Engine — Custom AXI4-Stream DMA controller for efficient data movement between DDR4 and the compute fabric.
- Control Processor — Lightweight RISC-V core for layer sequencing and configuration.
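To make the compute core concrete, here is a minimal cycle-level model of a systolic array in Python. It is a sketch, not the project's RTL. It assumes an output-stationary dataflow, where each processing element (PE) keeps its own accumulator, which is only one of the dataflows the design could be configured for. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Multiply a (N x K) by b (K x M) on a simulated N x M systolic array.

    Returns (product, cycles). Assumed dataflow: output-stationary, so
    PE(i, j) accumulates C[i][j]; A operands move right, B operands move down.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    assert all(len(row) == k_dim for row in a), "inner dimensions must match"
    for row in a + b:
        assert all(-128 <= v <= 127 for v in row), "operands must fit in INT8"

    acc = [[0] * m for _ in range(n)]    # per-PE accumulator (wide, e.g. INT32)
    a_reg = [[0] * m for _ in range(n)]  # A operand latched in each PE
    b_reg = [[0] * m for _ in range(n)]  # B operand latched in each PE

    # The last PE (n-1, m-1) sees its final operand pair at t = n + m + K - 3.
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        # Walk from the far corner back to the origin so every PE reads the
        # value its neighbour latched on the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    k = t - i  # row i of A is fed in skewed by i cycles
                    a_in = a[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j  # column j of B is fed in skewed by j cycles
                    b_in = b[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in  # one MAC per PE per cycle
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

In this model the skewed inputs mean a full tile takes N + M + K - 2 cycles, so larger K amortizes the fill and drain overhead of the array.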
Architecture
The design is parameterized in SystemVerilog, allowing synthesis-time configuration of:
- Array dimensions (8×8, 16×16, 32×32)
- Precision (INT4, INT8, FP16)
- Buffer depth and tiling strategy
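The effect of these parameters can be estimated with a roofline-style calculation before synthesis. The sketch below counts each MAC as two operations and uses the 200 MHz clock from the overview. The DRAM bandwidth of 19.2 GB/s (one 64-bit DDR4-2400 channel) and the 4-byte output size are assumptions, not figures from this project, and all function names are hypothetical.

```python
CLOCK_HZ = 200e6          # clock frequency stated in the overview
DDR_BYTES_PER_S = 19.2e9  # ASSUMPTION: one 64-bit DDR4-2400 channel


def peak_ops_per_s(rows, cols, clock_hz=CLOCK_HZ):
    """Peak throughput of one rows x cols MAC array (1 MAC = 2 ops)."""
    return 2 * rows * cols * clock_hz


def gemm_intensity(tile_m, tile_n, k, in_bytes=1, out_bytes=4):
    """Ops per DRAM byte for one output tile of an (M x K) @ (K x N) product.

    Assumes each operand tile is read once and the output tile is written
    once; in_bytes=1 models INT8 inputs, out_bytes=4 is an assumed INT32 output.
    """
    ops = 2 * tile_m * tile_n * k
    traffic = (tile_m * k + k * tile_n) * in_bytes + tile_m * tile_n * out_bytes
    return ops / traffic


def attainable_ops_per_s(rows, cols, intensity, bandwidth=DDR_BYTES_PER_S):
    """Roofline: the lower of the compute peak and the bandwidth-limited rate."""
    return min(peak_ops_per_s(rows, cols), intensity * bandwidth)


for dim in (8, 16, 32):
    intensity = gemm_intensity(dim, dim, k=64)
    rate = attainable_ops_per_s(dim, dim, intensity)
    bound = "compute" if rate == peak_ops_per_s(dim, dim) else "memory"
    print(f"{dim}x{dim}: peak {peak_ops_per_s(dim, dim) / 1e9:.1f} GOPS, "
          f"attainable {rate / 1e9:.1f} GOPS ({bound}-bound)")
```

Under this counting, a single 16×16 array at 200 MHz peaks at 102.4 GOPS and a 32×32 array at 409.6 GOPS. This description does not state which configuration or counting convention the 2.4 TOPS figure refers to.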
A Python-based compiler maps TensorFlow Lite models to the hardware instruction set.
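The core lowering step of such a compiler might look like the sketch below. It treats a convolution as a matrix multiply (the im2col view) and tiles it over the array. The instruction names (`LOAD_WEIGHTS`, `LOAD_ACTS`, `MATMUL`, `STORE`), the `Conv2D` record, and the weight-reuse loop order are all hypothetical; the real instruction set is not described here, and parsing of TensorFlow Lite files is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2D:
    """Hypothetical layer record; a real compiler would read this from the model."""
    in_h: int
    in_w: int
    in_c: int
    out_c: int
    kernel: int
    stride: int = 1
    padding: int = 0


def gemm_shape(layer):
    """Return (M, K, N) of the matrix multiply equivalent to the convolution."""
    out_h = (layer.in_h + 2 * layer.padding - layer.kernel) // layer.stride + 1
    out_w = (layer.in_w + 2 * layer.padding - layer.kernel) // layer.stride + 1
    m = out_h * out_w                             # one row per output pixel
    k = layer.kernel * layer.kernel * layer.in_c  # one column per filter tap
    n = layer.out_c                               # one column per output channel
    return m, k, n


def lower(layer, array=16):
    """Emit a tiled instruction list for an array x array compute core."""
    m, k, n = gemm_shape(layer)
    program = []
    for n0 in range(0, n, array):
        cols = min(array, n - n0)
        # Weights are loaded once per column tile and reused for every row tile.
        program.append(("LOAD_WEIGHTS", n0, cols, k))
        for m0 in range(0, m, array):
            rows = min(array, m - m0)
            program.append(("LOAD_ACTS", m0, rows, k))
            program.append(("MATMUL", rows, cols, k))
            program.append(("STORE", m0, n0))
    return program


layer = Conv2D(in_h=8, in_w=8, in_c=3, out_c=16, kernel=3, padding=1)
for instruction in lower(layer)[:4]:
    print(instruction)
```

Because the program is data consumed by the control processor, switching to a different network in this scheme means emitting a new instruction list rather than re-synthesizing the hardware.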
Key Learnings
- Quantization is critical. Moving from FP32 to INT8 reduced compute area by 4× with less than 1% accuracy loss on ResNet-50.
- Memory bandwidth is the bottleneck. Even with on-chip buffering, DDR4 bandwidth limits throughput for large models. Careful tiling and data reuse are essential.
- Verification takes longer than design. The UVM testbench ended up being 3× the size of the RTL itself.
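As an illustration of the quantization point, the sketch below applies symmetric per-tensor INT8 quantization and shows how a dot product can be computed with integer MACs and rescaled once at the end. The scheme is an assumption (the quantization method used in this project is not stated), the function names are hypothetical, and the example does not reproduce the ResNet-50 result.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so the range stays symmetric around zero.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [x * scale for x in q]


def int8_dot(qa, qb):
    """Integer-only dot product, as a row of MAC units would compute it."""
    return sum(x * y for x, y in zip(qa, qb))


weights = [0.5, -1.0, 0.25]
activations = [1.0, 0.5, -0.5]

qw, w_scale = quantize_int8(weights)
qa, a_scale = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(qw, qa) * w_scale * a_scale  # single rescale after accumulation
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The multipliers only ever see 8-bit integers, and the floating-point scales are applied once per output rather than once per MAC, which is what keeps the datapath small.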
Tech Stack
- SystemVerilog, Xilinx Vivado 2023.2
- UVM for verification
- Python (TensorFlow Lite, model compiler)
- AXI4/AXI4-Stream interfaces
- Xilinx Alveo U250 development board