Active

FPGA ML Accelerator

A custom hardware accelerator for convolutional neural network inference on Xilinx UltraScale+ FPGAs.

Overview

The design achieves 2.4 TOPS at a 200 MHz clock on Xilinx UltraScale+ FPGAs, enabling real-time image classification at the edge without a GPU.

Problem

Deploying deep learning models at the edge is constrained by power budgets and latency requirements. GPUs are power-hungry and CPUs are too slow for real-time inference. FPGAs offer a middle ground — programmable hardware that can be tuned for specific workloads.

The challenge: designing an accelerator that is both fast enough for real-time inference and flexible enough to support multiple CNN architectures without re-synthesis.

Solution

The accelerator is built around a systolic array with configurable dataflow and has four main components:

  1. Compute Core — 16×16 systolic array of multiply-accumulate (MAC) units operating in INT8 precision.
  2. On-Chip Buffer — Dual-port BRAM-based buffer with prefetching to hide memory latency.
  3. DMA Engine — Custom AXI4-Stream DMA controller for efficient data movement between DDR4 and the compute fabric.
  4. Control Processor — Lightweight RISC-V core for layer sequencing and configuration.
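
To make the compute core concrete, the following is a minimal functional sketch of how a systolic array accumulates a matrix product. It assumes an output-stationary dataflow, where each processing element keeps its own accumulator. The write-up only says the dataflow is configurable, so this choice, the function name systolic_matmul, and the cycle count model are illustrative assumptions rather than a description of the actual RTL.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is rows x depth, b is depth x cols; entries must be INT8 values.
    Returns (accumulators, cycles).
    """
    rows, depth, cols = len(a), len(b), len(b[0])
    for matrix in (a, b):
        for row in matrix:
            if any(not -128 <= v <= 127 for v in row):
                raise ValueError("operands must fit in INT8")

    # One accumulator per processing element (PE). Python integers do not
    # overflow; hardware would use a wider accumulator than the operands.
    acc = [[0] * cols for _ in range(rows)]

    # Inputs are skewed: row i of `a` enters i cycles late from the left and
    # column j of `b` enters j cycles late from the top, so at cycle t the
    # PE at (i, j) sees the operand pair with index k = t - i - j.
    cycles = rows + cols + depth - 2
    for t in range(cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j
                if 0 <= k < depth:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, cycles
```

For a 2×2 example, systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) returns ([[19, 22], [43, 50]], 4): the product matrix plus the number of cycles needed for the skewed operands to pass through every PE.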

Architecture

The design is parameterized in SystemVerilog, allowing synthesis-time configuration of:

  • Array dimensions (8×8, 16×16, 32×32)
  • Precision (INT4, INT8, FP16)
  • Buffer depth and tiling strategy
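
As a rough illustration of how these parameters interact, the sketch below models a configuration in Python and estimates the theoretical peak throughput of one array. The class name AcceleratorConfig, its fields, and the assumption of one MAC per processing element per cycle (counted as two operations) are hypothetical and not taken from the actual design.

```python
from dataclasses import dataclass

SUPPORTED_DIMS = {(8, 8), (16, 16), (32, 32)}
SUPPORTED_PRECISIONS = {"INT4", "INT8", "FP16"}


@dataclass(frozen=True)
class AcceleratorConfig:
    """Hypothetical mirror of the synthesis-time parameters."""
    rows: int = 16
    cols: int = 16
    precision: str = "INT8"
    clock_hz: float = 200e6

    def __post_init__(self):
        if (self.rows, self.cols) not in SUPPORTED_DIMS:
            raise ValueError(f"unsupported array size {self.rows}x{self.cols}")
        if self.precision not in SUPPORTED_PRECISIONS:
            raise ValueError(f"unsupported precision {self.precision}")

    @property
    def mac_units(self):
        return self.rows * self.cols

    def peak_ops_per_second(self, macs_per_pe_per_cycle=1):
        # One MAC is counted as two operations (a multiply and an add).
        return 2 * self.mac_units * macs_per_pe_per_cycle * self.clock_hz
```

Under these assumptions a 16×16 array at 200 MHz peaks at 102.4 GOPS and a 32×32 array at 409.6 GOPS. These are theoretical single-array peaks, not measured results, and they are not a restatement of the throughput figure quoted in the Overview.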

A Python-based compiler maps TensorFlow Lite models to the hardware instruction set, so switching networks means recompiling the model rather than re-synthesizing the design.
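
The compiler's instruction set is not documented here, so the following is only a sketch of what lowering one layer could look like. It treats a convolution as a matrix multiply and splits it into tiles that fit the array. The ConvLayer and Instr types and the opcodes LOAD_WEIGHTS, LOAD_ACT, COMPUTE, and STORE are invented for illustration, and the sketch works on a plain layer description rather than reading a real TensorFlow Lite file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConvLayer:
    in_h: int
    in_w: int
    in_ch: int
    out_ch: int
    kernel: int
    stride: int = 1
    padding: int = 0

    def gemm_shape(self):
        """Return (M, K, N) of the equivalent matrix multiply."""
        out_h = (self.in_h + 2 * self.padding - self.kernel) // self.stride + 1
        out_w = (self.in_w + 2 * self.padding - self.kernel) // self.stride + 1
        return out_h * out_w, self.kernel * self.kernel * self.in_ch, self.out_ch


@dataclass(frozen=True)
class Instr:
    op: str                # hypothetical opcode
    m_tile: Optional[int]  # output-pixel tile index, None for weight loads
    n_tile: int            # output-channel tile index


def lower_conv(layer, array_rows=16, array_cols=16):
    """Emit a tile-by-tile instruction list for one convolution layer."""
    m, _, n = layer.gemm_shape()
    m_tiles = (m + array_rows - 1) // array_rows
    n_tiles = (n + array_cols - 1) // array_cols
    program = []
    for nt in range(n_tiles):
        # Weights for this group of output channels are loaded once and
        # reused for every activation tile that follows.
        program.append(Instr("LOAD_WEIGHTS", None, nt))
        for mt in range(m_tiles):
            program.append(Instr("LOAD_ACT", mt, nt))
            program.append(Instr("COMPUTE", mt, nt))
            program.append(Instr("STORE", mt, nt))
    return program
```

For an 8×8×3 input with a 3×3 kernel, padding 1, and 32 output channels, the equivalent multiply has shape (64, 27, 32), which splits into 4 × 2 tiles on a 16×16 array and yields 26 instructions. Keeping the weight load in the outer loop is one simple way to get reuse; a real compiler would choose the loop order per layer.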

Key Learnings

  1. Quantization is critical. Moving from FP32 to INT8 reduced compute area by 4× with < 1% accuracy loss on ResNet-50.
  2. Memory bandwidth is the bottleneck. Even with on-chip buffering, DDR4 bandwidth limits throughput for large models. Careful tiling and data reuse are essential.
  3. Verification takes longer than design. The UVM testbench ended up being 3× the size of the RTL itself.
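
The first two points can be illustrated with a few lines of arithmetic. The sketch below assumes symmetric per-tensor quantization, since the write-up does not say which scheme was used, and a simplified traffic model in which every operand and result is one byte and is moved once per output tile. The function names are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to INT8 codes."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


def tile_arithmetic_intensity(tile, depth):
    """Operations per byte for one tile x tile output block of a matmul.

    Assumes 1-byte operands and outputs, each moved once per tile.
    """
    ops = 2 * tile * tile * depth                  # multiply + add per MAC
    bytes_moved = 2 * tile * depth + tile * tile   # two input panels + output
    return ops / bytes_moved
```

Each value is reconstructed to within half a quantization step (scale / 2), which is why well-conditioned tensors lose little accuracy. Doubling the tile size from 16 to 32 at a depth of 256 roughly doubles the operations performed per byte fetched, which is the data reuse that eases pressure on DDR4 bandwidth.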

Tech Stack

  • SystemVerilog, Xilinx Vivado 2023.2
  • UVM for verification
  • Python (TensorFlow Lite, model compiler)
  • AXI4/AXI4-Stream interfaces
  • Xilinx Alveo U250 accelerator card

Tags: systemverilog, fpga, machine-learning, vivado, python