Active

FPGA ML Accelerator

A custom hardware accelerator for convolutional neural network inference on Xilinx UltraScale+ FPGAs.

Overview

This high-performance CNN inference accelerator targets Xilinx UltraScale+ FPGAs. The design achieves 2.4 TOPS at 200 MHz, enabling real-time image classification at the edge without a GPU.

Problem

Deploying deep learning models at the edge is constrained by power budgets and latency requirements. GPUs are power-hungry, and CPUs are too slow for real-time inference. FPGAs offer a middle ground: programmable hardware that can be tuned for specific workloads.

The challenge: designing an accelerator that is both fast enough for real-time inference and flexible enough to support multiple CNN architectures without re-synthesis.

Solution

The accelerator uses a systolic array architecture with configurable dataflow:

Compute Core — 16×16 systolic array of multiply-accumulate (MAC) units operating in INT8 precision.
On-Chip Buffer — Dual-port BRAM-based buffer with prefetching to hide memory latency.
DMA Engine — Custom AXI4-Stream DMA controller for efficient data movement between DDR4 and the compute fabric.
Control Processor — Lightweight RISC-V core for layer sequencing and configuration.
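
To make the compute core more concrete, here is a minimal functional model of a systolic matrix multiply in Python. It is a sketch, not the project's RTL: it assumes an output-stationary dataflow (the source only says the dataflow is configurable), INT8 operands, and 32-bit accumulators. The function name `systolic_matmul` is hypothetical. Operands are skewed so that `a[i][k]` and `b[k][j]` meet at processing element (i, j) on cycle `k + i + j`.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MAX = 2**31 - 1


def systolic_matmul(a, b):
    """Multiply a (rows x depth) by b (depth x cols) on a modeled PE grid.

    Assumes an output-stationary dataflow: each PE keeps its own
    accumulator while operands flow right (from a) and down (from b).
    Returns (result, cycles).
    """
    rows, depth, cols = len(a), len(b), len(b[0])
    if any(len(row) != depth for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    for matrix in (a, b):
        for row in matrix:
            if any(not INT8_MIN <= v <= INT8_MAX for v in row):
                raise ValueError("operands must fit in INT8")

    acc = [[0] * cols for _ in range(rows)]
    # The last operand pair reaches the far-corner PE on cycle
    # (depth - 1) + (rows - 1) + (cols - 1), so this many cycles are needed.
    cycles = depth + rows + cols - 2
    for t in range(cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # index of the operand pair arriving at PE (i, j)
                if 0 <= k < depth:
                    acc[i][j] += a[i][k] * b[k][j]
                    # Illustrative check that a 32-bit accumulator suffices.
                    assert abs(acc[i][j]) <= INT32_MAX
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Under these assumptions, a 16×16 array processing a reduction of depth K needs K + 30 cycles per tile, so the fill and drain overhead matters most for layers with a short reduction dimension.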

Architecture

The design is parameterized in SystemVerilog, allowing synthesis-time configuration of:

Array dimensions (8×8, 16×16, 32×32)
Precision (INT4, INT8, FP16)
Buffer depth and tiling strategy
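
The effect of these parameters can be sketched with a small Python model. This is not the project's SystemVerilog parameter package: the class `AcceleratorConfig`, its field names, and the default buffer depth are hypothetical, and only the listed array sizes and precisions come from the description above.

```python
from dataclasses import dataclass

SUPPORTED_DIMS = (8, 16, 32)
PRECISION_BITS = {"INT4": 4, "INT8": 8, "FP16": 16}


@dataclass(frozen=True)
class AcceleratorConfig:
    """Hypothetical mirror of the synthesis-time parameters."""

    array_dim: int = 16        # square array: array_dim x array_dim MACs
    precision: str = "INT8"
    buffer_depth: int = 1024   # illustrative default, not from the source

    def __post_init__(self):
        if self.array_dim not in SUPPORTED_DIMS:
            raise ValueError(f"array_dim must be one of {SUPPORTED_DIMS}")
        if self.precision not in PRECISION_BITS:
            raise ValueError(f"precision must be one of {sorted(PRECISION_BITS)}")
        if self.buffer_depth <= 0:
            raise ValueError("buffer_depth must be positive")

    @property
    def mac_units(self) -> int:
        """Number of MAC units in the array."""
        return self.array_dim ** 2

    @property
    def operand_tile_bytes(self) -> int:
        """Bytes needed to hold one operand per MAC unit."""
        return self.mac_units * PRECISION_BITS[self.precision] // 8


for cfg in (AcceleratorConfig(8, "FP16"), AcceleratorConfig(), AcceleratorConfig(32, "INT4")):
    print(cfg.array_dim, cfg.precision, cfg.mac_units, cfg.operand_tile_bytes)
```

Doubling the array dimension quadruples the MAC count, while halving the precision halves the storage per operand tile, which is why the two settings are usually chosen together against the available DSP and BRAM resources.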

A Python-based compiler maps TensorFlow Lite models to the hardware instruction set.
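
The compiler's intermediate representation and instruction set are not described here, so the following is only a sketch of how such a mapping might begin. It assumes each convolution is lowered to a matrix multiply (the im2col view) and then split into output tiles that fit the array. The names `conv_as_gemm`, `tile_gemm`, and `Tile` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Tile(NamedTuple):
    """One output tile, sized to fit the systolic array."""
    row_start: int
    row_count: int
    col_start: int
    col_count: int


def conv_as_gemm(out_h: int, out_w: int, out_ch: int,
                 k_h: int, k_w: int, in_ch: int) -> Tuple[int, int, int]:
    """Return (M, K, N) for a convolution viewed as an M x K by K x N multiply."""
    m = out_h * out_w          # one row per output pixel
    k = k_h * k_w * in_ch      # one column per kernel weight
    n = out_ch                 # one output column per filter
    return m, k, n


def tile_gemm(m: int, n: int, array_dim: int = 16) -> List[Tile]:
    """Split an M x N output into tiles of at most array_dim x array_dim."""
    tiles = []
    for r in range(0, m, array_dim):
        for c in range(0, n, array_dim):
            tiles.append(Tile(r, min(array_dim, m - r), c, min(array_dim, n - c)))
    return tiles


# First layer of ResNet-50: 7x7 kernel, 3 input channels, 64 filters,
# 112x112 output feature map.
m, k, n = conv_as_gemm(112, 112, 64, 7, 7, 3)
print(m, k, n, len(tile_gemm(m, n, array_dim=16)))
```

A real compiler would also order the tiles to maximize weight or activation reuse in the on-chip buffer and emit DMA and compute instructions for each tile. Those steps depend on details the source does not give.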

Key Learnings

Quantization is critical. Moving from FP32 to INT8 reduced compute area by 4× with less than 1% accuracy loss on ResNet-50.
Memory bandwidth is the bottleneck. Even with on-chip buffering, DDR4 bandwidth limits throughput for large models. Careful tiling and data reuse are essential.
Verification takes longer than design. The UVM testbench ended up being 3× the size of the RTL itself.
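
As an illustration of the quantization point, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. The scheme is an assumption: the source does not say which one the project uses, and TensorFlow Lite's own INT8 scheme additionally uses zero points for activations. The function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.91, -0.55]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, worst)
```

The rounding error per value is bounded by half the scale, so tensors with a few large outliers lose more precision than tensors with a tight value range. This is one reason per-channel scales are often preferred for convolution weights.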

Tech Stack

SystemVerilog, Xilinx Vivado 2023.2
UVM for verification
Python (TensorFlow Lite, model compiler)
AXI4/AXI4-Stream interfaces
Xilinx Alveo U250 accelerator card
