GPUs and Accelerators
May 16, 2026·14 min read·advanced
The CPU is no longer the only major compute engine in a typical system. GPUs, NPUs (neural processing units), DSPs, video codecs, cryptographic accelerators, and specialized AI engines all sit alongside CPUs, often consuming more silicon area and more power than the CPU itself. This chapter, referenced from Chapter 29 (data-level parallelism), covers the architectural principles of these accelerators — what makes them different from CPUs, why they exist, and how they integrate into systems.
We focus on three broad categories: GPUs (general-purpose throughput accelerators), neural / matrix accelerators (specialized for AI workloads), and fixed-function accelerators (codecs, crypto, networking).
01. Why Accelerators Exist
A CPU is a generalist. It must handle branchy, irregular code; small data sets with unpredictable access patterns; varied instruction mixes; tight latency requirements. The architectural cost of being good at this is high: deep pipelines, sophisticated branch predictors, large caches, out-of-order execution, complex schedulers. A modern CPU spends most of its silicon area and power budget on infrastructure to extract performance from sequential code, not on the actual arithmetic.
Many important workloads have different characteristics:
- Highly parallel: thousands or millions of independent operations.
- Regular access patterns: sequential, strided, or known in advance.
- Throughput-oriented: latency of any individual operation doesn't matter.
- Limited control flow diversity: same code applied to many data items.
For these workloads, a CPU is overbuilt. An accelerator can deliver 10-100× more throughput per watt by removing the generalist machinery and packing in arithmetic units. The cost: poor performance on workloads that don't fit the accelerator's pattern.
The economic logic: spend silicon where the workload is, not where the workload isn't.
02. GPU Architecture
A modern GPU (NVIDIA H100, AMD MI300, Intel Data Center GPU Max, Apple M3 Pro/Max GPU, the integrated GPUs in mobile SoCs) shares a common architectural shape, despite differences in vocabulary across vendors.
SIMT Execution Model
GPUs use SIMT (Single Instruction, Multiple Threads) execution. The basic unit is a warp (NVIDIA, 32 threads) or wavefront (AMD, 32 or 64 threads depending on the architecture): a group of threads that execute the same instruction in lockstep across different data. Conceptually, a warp resembles a SIMD vector with one thread per lane, except that each lane can be masked off independently and, in current NVIDIA architectures, keeps its own program counter.
When threads in a warp take different branches (warp divergence), the hardware serializes the divergent paths: it executes the "true" path with only the true threads active, then the "false" path with only the false threads active, then reconverges. Throughput drops roughly in proportion to the number of distinct paths taken.
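The masking behavior described above can be sketched in plain Python. This is a toy model, not how any real scheduler is implemented: the function name `run_warp` and the even/odd branch are made up for illustration, and the "passes" counter simply counts how many serialized paths the warp had to execute.

```python
def run_warp(values, warp_size=32):
    """Toy SIMT model of: if x is even, x = x * 2; else x = x + 1."""
    assert len(values) == warp_size
    # One predicate bit per lane, like a hardware active mask.
    mask_true = [v % 2 == 0 for v in values]
    out = list(values)
    passes = 0

    # Pass over the "true" path: only lanes with the predicate set are active.
    if any(mask_true):
        passes += 1
        for lane in range(warp_size):
            if mask_true[lane]:
                out[lane] = values[lane] * 2

    # Pass over the "false" path: the complementary lanes are active.
    if not all(mask_true):
        passes += 1
        for lane in range(warp_size):
            if not mask_true[lane]:
                out[lane] = values[lane] + 1

    # Reconvergence: all lanes continue together from here.
    return out, passes


uniform_out, uniform_passes = run_warp([2] * 32)            # every lane takes the same path
divergent_out, divergent_passes = run_warp(list(range(32)))  # lanes split across both paths
```

In this model a uniform warp needs one pass and a divergent warp needs two, which is the halving of throughput the text describes for a two-way branch.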
Above warps, GPUs group threads into thread blocks (NVIDIA) or workgroups (AMD/OpenCL/Vulkan), typically a few hundred threads and at most 1,024 per block in CUDA. A thread block is the unit of cooperation: threads in a block share shared memory (a small fast SRAM scratchpad) and can synchronize via barriers.
Above thread blocks, the application launches a grid of blocks. The grid can have millions of threads.
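The grid/block/thread hierarchy maps onto data through simple index arithmetic. The sketch below mimics that arithmetic sequentially in Python; the helper names (`launch_config`, `global_index`, `simulate_grid`) and the block size of 256 are illustrative choices, not part of any GPU API.

```python
def launch_config(n_items, threads_per_block=256):
    """Round up so that blocks * threads_per_block covers every item."""
    blocks = (n_items + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_index(block_idx, thread_idx, threads_per_block):
    """Flat index of one thread within a 1-D grid."""
    return block_idx * threads_per_block + thread_idx


def simulate_grid(data, threads_per_block=256):
    """Run a 'double each element' kernel once per (block, thread) pair."""
    n = len(data)
    blocks, tpb = launch_config(n, threads_per_block)
    out = [None] * n
    for b in range(blocks):
        for t in range(tpb):
            i = global_index(b, t, tpb)
            if i < n:  # bounds guard: the last block may be partly idle
                out[i] = data[i] * 2
    return out
```

On a real GPU the two loops do not exist: every (block, thread) pair runs concurrently, and only the body with its bounds guard appears in the kernel.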
Streaming Multiprocessors
The execution unit is the SM (Streaming Multiprocessor, NVIDIA) or CU (Compute Unit, AMD) or Xe-core (Intel). An SM contains:
- Several warp schedulers (typically 4 in modern NVIDIA SMs).
- Multiple SIMD execution units (FP32, FP64, INT, special-function, tensor).
- A register file (very large — 256 KB on Hopper SMs).
- Shared memory / L1 cache (~256 KB on Hopper, configurable split).
- Texture units (in graphics-capable GPUs).
A modern flagship has 100-200 SMs. The H100 (SXM) has 132 SMs, as does the H200; the B200 has about 148 across its two dies. Each SM can keep on the order of two thousand threads in flight (occupancy depends on register and shared-memory usage per thread).
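How register usage limits occupancy can be estimated with a few lines of arithmetic. The sketch below takes the 256 KB register file from the list above and assumes 32-bit registers, 32-thread warps, and a cap of 64 resident warps per SM; the cap is an assumption for illustration, and real hardware also allocates registers in granules and applies shared-memory and block-count limits that this ignores.

```python
def resident_warps(regs_per_thread,
                   regfile_bytes=256 * 1024,  # per-SM register file, from the text
                   warp_size=32,
                   max_warps=64):             # assumed hardware cap per SM
    """Warps that fit in the register file, ignoring all other limits."""
    total_regs = regfile_bytes // 4            # 32-bit registers
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps, total_regs // regs_per_warp)


def occupancy(regs_per_thread, max_warps=64):
    """Fraction of the assumed warp slots that can be filled."""
    return resident_warps(regs_per_thread, max_warps=max_warps) / max_warps
```

Under these assumptions a kernel using 32 registers per thread fills every warp slot, while one using 128 registers per thread fills only a quarter of them, leaving fewer warps available to cover stalls.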
Latency Hiding through Massive Multithreading
The SM's central trick: when one warp stalls (memory access, dependency), another warp runs. With dozens of warps resident per SM, there's almost always a runnable warp — memory latency that would crush a CPU is hidden behind compute from other warps.
The cost: the register file must hold state for all those warps. NVIDIA Hopper SMs have a 256 KB register file per SM, which is larger than the L1 data cache of most CPU cores.
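The benefit: GPUs sustain high arithmetic throughput even when individual memory accesses take hundreds of cycles. Where a CPU would stall waiting for DRAM, the GPU computes on other warps. This hides latency, not bandwidth: a kernel that saturates HBM is still limited by memory bandwidth.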
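A back-of-the-envelope model shows how many warps it takes to hide a given latency. It assumes each warp alternates between a fixed burst of issue cycles and a fixed memory wait, and that the scheduler switches warps at no cost; the cycle counts in the example are hypothetical, not measurements.

```python
import math


def sm_utilization(n_warps, compute_cycles, memory_latency):
    """Fraction of cycles in which some warp can issue, under the toy model."""
    period = compute_cycles + memory_latency  # one warp's compute + wait cycle
    return min(1.0, n_warps * compute_cycles / period)


def warps_to_hide(compute_cycles, memory_latency):
    """Smallest warp count that keeps the issue slots fully busy."""
    return math.ceil((compute_cycles + memory_latency) / compute_cycles)


# Hypothetical kernel: 10 cycles of arithmetic per 400-cycle memory access.
needed = warps_to_hide(10, 400)
single = sm_utilization(1, 10, 400)
full = sm_utilization(needed, 10, 400)
```

With these numbers a single warp keeps the SM busy only about 2% of the time, and it takes 41 resident warps to reach full utilization, which is why large register files and dozens of resident warps go together.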
Memory Hierarchy
GPU memory is hierarchical, with different levels and characteristics than a CPU's:
- Registers: per-thread; fastest.
- Shared memory / L1: per-SM; ~100 KB to 256 KB; programmer-visible scratchpad with very low latency.
- L2 cache: chip-wide; tens of MB on flagships (H100 has 50 MB).
- HBM: stacked DRAM via interposer; multi-TB/s aggregate bandwidth.
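The gaps are large but uneven. HBM bandwidth on an H100 is about 3 TB/s, while DDR5 on a server CPU socket delivers roughly 300-500 GB/s, a ratio of about 6-10×. On compute, an H100 delivers tens of TFLOPS of FP32 on its general-purpose units and hundreds to thousands of TFLOPS on tensor cores at reduced precision, while a server CPU peaks at a few TFLOPS of FP32. The GPU's advantage is therefore roughly an order of magnitude in bandwidth and one to two orders of magnitude in arithmetic, depending on precision.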
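Whether a kernel is limited by bandwidth or by arithmetic can be estimated with the roofline model, a standard technique that the text does not name. In the sketch below, the 3 TB/s bandwidth comes from the paragraph above, while the 60 TFLOPS peak is a round placeholder rather than a quoted specification.

```python
def attainable_flops(intensity_flop_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity_flop_per_byte * bandwidth_bytes_per_s)


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


PEAK = 60e12       # placeholder peak FP32 rate, FLOP/s (assumption)
BANDWIDTH = 3e12   # HBM bandwidth from the text, bytes/s

# FP32 vector add c = a + b: 1 FLOP per 12 bytes moved (two loads, one store).
vector_add = attainable_flops(1 / 12, PEAK, BANDWIDTH)
# A well-tiled matrix multiply can reach far higher intensity, e.g. 100 FLOP/byte.
matmul = attainable_flops(100, PEAK, BANDWIDTH)
```

Under these assumptions the vector add tops out near 0.25 TFLOPS regardless of how many arithmetic units the chip has, while the matrix multiply reaches the compute ceiling.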
Tensor Cores
A more recent addition: dedicated tensor cores (NVIDIA) or matrix engines (AMD, Intel) inside each SM. These are SIMD multipliers performing matrix-multiply-accumulate (MMA) on small tiles of input matrices.
A tensor core might compute a 16×16 matrix multiply per cycle, in formats like FP16 × FP16 → FP32, BF16, INT8, FP8, or FP4. The throughput of tensor cores is 5-20× the throughput of general FP arithmetic — at the cost of being applicable only to matrix-multiply-shaped problems.
This shape covers most of deep learning: convolutions, fully-connected layers, attention. Tensor cores are why modern GPUs train neural networks so fast.
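The matrix-multiply-accumulate operation can be emulated in plain Python to show its numerical shape: inputs rounded to FP16, products accumulated at higher precision. The 4×4 tile is an illustrative size, `to_fp16` and `mma_tile` are hypothetical helper names, and the accumulator here is a Python float (FP64), which is wider than the FP32 accumulator the text describes.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mma_tile(a, b, c):
    """D = A x B + C on one small tile, with FP16 inputs and a wide accumulator."""
    m, k, n = len(a), len(b), len(b[0])
    d = [row[:] for row in c]
    for i in range(m):
        for j in range(n):
            acc = c[i][j]  # accumulator starts from C
            for p in range(k):
                acc += to_fp16(a[i][p]) * to_fp16(b[p][j])
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b_tile = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
d_tile = mma_tile(identity, b_tile, ones)
```

A hardware tensor core performs all of the multiply-adds for a tile in parallel rather than in nested loops; larger matrices are handled by sweeping tiles like this one across the operands.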
Successive generations have added more tensor formats and higher density:
- V100 (2017): first tensor cores, FP16 → FP32.
- A100 (2020): TF32, BF16, INT8, FP64, sparsity support.
- H100 (2022): FP8 (E5M2 and E4M3); transformer engine.
- B200 (2024): FP6, FP4; second-gen transformer engine; massive throughput.
Programming Model
GPUs are programmed through APIs:
- CUDA (NVIDIA): the dominant ecosystem. C++-like with kernel launches, threads, blocks, grids.
- HIP (AMD): a CUDA-like API, source-portable.
- OpenCL: cross-vendor but losing relevance.
- SYCL: C++ single-source heterogeneous programming, gaining traction in HPC.
- Vulkan / DirectX / Metal: graphics APIs with compute capabilities.
- Triton (OpenAI): Python-based GPU programming for ML kernels.
- PyTorch / TensorFlow / JAX: high-level Python frameworks that lower to GPU kernels.
CUDA's dominance is a key enabler of NVIDIA's market position: years of ecosystem, libraries (cuDNN, cuBLAS, CUTLASS, NCCL), and developer familiarity. Competitors must either replicate CUDA (HIP / ROCm), abstract above it (Triton, MLIR), or accept the friction.
03. Neural Processing Units
An NPU is an accelerator specialized for neural network inference (and sometimes training). Where a GPU is a general-purpose throughput machine adapted to AI, an NPU is purpose-built.
NPUs trade flexibility for efficiency:
- Fixed dataflow: data movement patterns are baked in (e.g., convolution-friendly).
- Reduced precision: INT8, FP8, INT4 are first-class.
- Massive on-chip SRAM: keep weights and activations on-chip.
- Specialized memory hierarchies: optimized for NN access patterns.
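Reduced precision usually means quantizing weights and activations before they reach the arithmetic units. The sketch below shows one common scheme, symmetric per-tensor INT8 quantization; it is an illustration of the idea, not the method of any particular NPU, and the function names are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q_weights, q_scale = quantize_int8(weights)
restored = dequantize(q_weights, q_scale)
```

Each value is stored in one byte instead of four, and the rounding error per element is bounded by half the scale factor, which is the trade the list above describes.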
Mobile NPUs
Every modern smartphone SoC has an NPU:
- Apple Neural Engine (since A11, 2017): 18 TOPS in M3, up to 38 TOPS in M4.
- Qualcomm Hexagon AI (Snapdragon platforms).
- Google Tensor SoCs (in Pixel phones), which include an on-chip TPU.
- Samsung Exynos NPU.
- MediaTek APU.
These run on-device AI: face recognition, speech recognition, photo enhancement, real-time translation. Power budget: 1-5W. Performance: 5-50 TOPS at INT8 / FP16.
Data Center AI Accelerators
At the high end:
- Google TPU (v1 through v5, then Trillium as the sixth generation): the original wide-and-shallow systolic-array AI accelerator. Trained PaLM and Gemini.
- AWS Trainium / Inferentia: Amazon's custom AI accelerators.
- Cerebras WSE: a wafer-scale processor; the WSE-2 puts 850,000 cores on a single wafer-sized chip.
- Graphcore IPU: massively parallel cores, each paired with local on-chip SRAM.
- SambaNova: dataflow architecture for neural nets.
- Groq LPU (Language Processing Unit): deterministic, highly pipelined inference accelerator.
- Tenstorrent: RISC-V-based AI chips with tile-based architecture.
- Intel Gaudi: AI training accelerator.
Each takes a different design path. TPUs are built around systolic arrays; Groq uses a statically scheduled, deterministic pipeline of functional units; Cerebras uses massive parallelism on a wafer-scale chip; Graphcore uses many small processors with high inter-processor bandwidth; Tenstorrent uses an array of programmable tiles.
These accelerators are the reason AI is economically feasible at scale. A few hundred thousand H100s or TPU v5s power most of the world's large-model training and inference. The capital deployed in AI accelerators in 2024-2025 dwarfs investment in any other compute category.
Systolic Arrays
A common pattern in NPUs: the systolic array. Originally proposed by Kung and Leiserson (1978), it's a grid of small processing elements where data flows through in lock-step waves. Each element multiplies and adds, then passes data to neighbors.
For matrix multiplication, a systolic array streams one matrix horizontally and another vertically; results accumulate at each cell. The TPU v1's matrix unit is a 256×256 systolic array of INT8 multiply-accumulate cells.
Systolic arrays are extremely efficient: minimal control overhead per operation, excellent data reuse, simple structure. The trade-off: the data layout must match the array's expected pattern. General-purpose computation doesn't fit; pure matrix-multiply does.
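The wave-like data movement can be made concrete with a cycle-by-cycle simulation. The sketch below models an output-stationary array, one of several possible dataflows and not necessarily the one used by any specific TPU: each processing element (PE) keeps one output value, A values move one hop right per cycle, and B values move one hop down, with inputs skewed so that matching operands meet at the right PE.

```python
def systolic_matmul(a, b):
    """Multiply a (m x k) by b (k x n) on a simulated m x n systolic array."""
    m, k, n = len(a), len(a[0]), len(b[0])
    acc = [[0] * n for _ in range(m)]        # one accumulator per PE
    a_reg = [[None] * n for _ in range(m)]   # A operand currently held by each PE
    b_reg = [[None] * n for _ in range(m)]   # B operand currently held by each PE
    cycles = m + n + k - 2                   # time for the last operands to meet

    for t in range(cycles):
        # A values shift one PE to the right; row i is fed with a delay of i cycles.
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = a[i][s] if 0 <= s < k else None
        # B values shift one PE down; column j is fed with a delay of j cycles.
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = b[s][j] if 0 <= s < k else None
        # Every PE multiplies the operands it holds and adds to its accumulator.
        for i in range(m):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


product, cycles_used = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Note that no PE ever computes an address or fetches an instruction: operands simply arrive from neighbors, which is where the low control overhead comes from.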
04. Specialized Accelerators
Beyond GPUs and NPUs, modern SoCs include many fixed-function accelerators:
Video codecs. H.264, H.265 (HEVC), AV1, VP9 — all standardized; encode and decode in hardware. Phone camera apps wouldn't be feasible without dedicated codec ASICs. A modern phone SoC has separate encode and decode units; high-end SoCs support multiple streams simultaneously.
Image signal processors (ISPs). Process raw sensor data into images: demosaic, white balance, denoising, sharpening, HDR fusion, lens correction. Phone cameras run sophisticated computational photography pipelines on dedicated ISPs that run for milliseconds while consuming watts (vs. seconds and tens of watts on a CPU).
Audio DSPs. Always-on listening, noise cancellation, beamforming microphone arrays. Run continuously at very low power.
Cryptographic accelerators. AES, SHA, RSA, ECC primitives in hardware. AES-NI (Intel) and AArch64's Crypto Extensions accelerate symmetric crypto in the CPU itself; dedicated accelerators handle TLS termination at scale.
Networking offload. Smart NICs and DPUs (Mellanox/NVIDIA ConnectX, Intel IPU, AMD Pensando) terminate TCP, perform RDMA, and run flow tables in hardware; some switch fabrics embed similar offload engines. Offloading this work from the CPU is essential for 100/400/800 Gbps networking.
Storage offload. NVMe controllers handle their own queues and DMAs; SSDs have powerful internal processors managing wear-leveling, garbage collection, and increasingly computational storage features.
Display engines. Compositing layers, scaling, color management — done in hardware on every modern device.
The pattern: any sufficiently common operation, run frequently enough at scale, becomes a candidate for hardware specialization.
05. Integration: Coherent vs. Non-Coherent
How accelerators integrate with the CPU varies:
Discrete via PCIe. The traditional model. The accelerator has its own memory; CPU and accelerator communicate through DMA over PCIe. Examples: most discrete GPUs in workstations and servers.
Pros: simple integration, accelerator can be upgraded independently, large memory possible. Cons: data movement is expensive (a PCIe Gen 5 x16 link moves about 64 GB/s in each direction, far less than the accelerator's internal memory bandwidth).
Coherent attached (CXL, NVLink-C2C). Cache-coherent links between CPU and accelerator. The accelerator can access CPU memory coherently, and vice versa.
Pros: simplifies programming, reduces data-movement overhead. Cons: limited bandwidth compared to on-package, coherence imposes overhead.
Integrated on-chip. SoC-style integration: CPU and accelerator share memory via the on-chip interconnect.
Pros: lowest latency, easy programming, best for small accelerators. Cons: silicon-area constrained, accelerator size limited by system design.
Examples: every smartphone SoC; Apple Silicon's unified memory architecture (M1/M2/M3); AMD APUs.
On-package (chiplets, advanced packaging). The new middle ground. CPU and accelerator are separate dies in one package, connected by very high-bandwidth links.
Examples: NVIDIA Grace Hopper Superchip (Grace CPU + Hopper GPU linked by NVLink-C2C); Apple's M1 Ultra and M2 Ultra (two dies joined by the UltraFusion interconnect); AMD MI300A APU (CPU + GPU + HBM in one package).
Advanced packaging (Chapter 55) is what makes this option practical.
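For the discrete, PCIe-attached case, the cost of data movement decides whether offloading pays off at all. The sketch below is a minimal break-even check; the function names are invented, and the link throughput and timings in the example are hypothetical placeholders rather than measured values or specifications.

```python
def offload_time_s(bytes_moved, link_bytes_per_s, kernel_s):
    """Total time to copy data across the link and run the kernel."""
    return bytes_moved / link_bytes_per_s + kernel_s


def worth_offloading(bytes_moved, link_bytes_per_s, cpu_s, accel_kernel_s):
    """True if copy + accelerator kernel beats running the work on the CPU."""
    return offload_time_s(bytes_moved, link_bytes_per_s, accel_kernel_s) < cpu_s


# Hypothetical job: 1 GB of data, 25 GB/s achieved link throughput,
# and a kernel that takes 5 ms on the accelerator.
slow_cpu = worth_offloading(1e9, 25e9, cpu_s=0.050, accel_kernel_s=0.005)
fast_cpu = worth_offloading(1e9, 25e9, cpu_s=0.030, accel_kernel_s=0.005)
```

In this example the copy alone takes 40 ms, so a kernel that is many times faster on the accelerator still loses when the CPU can finish in 30 ms. Coherent and on-package integration attack exactly this term.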
06. Programming Models for Accelerators
Heterogeneous programming has converged toward a few patterns:
Kernel offload. Host CPU runs the application; specific compute kernels are offloaded to the accelerator. Data must be moved (or accessed via shared coherent memory). CUDA and HIP follow this model.
Unified memory. CPU and accelerator share a memory space (logically, sometimes also physically). The runtime moves data behind the scenes. Apple's unified memory architecture; CUDA's Unified Memory feature.
Task graphs. The application builds a DAG of tasks; the runtime schedules them across CPUs and accelerators. Used by runtimes such as Legion (Stanford) and StarPU (Inria).
Compile-time targeting. The compiler decides what runs where based on annotations or auto-vectorization. SYCL's design; Intel oneAPI's vision.
Domain-specific frameworks. PyTorch, TensorFlow, JAX abstract over hardware; user code is high-level; the framework's runtime dispatches to GPU/TPU/CPU automatically. The dominant pattern in AI.
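The task-graph pattern described above can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is only a sketch of the dependency-ordering step: the task names and device labels are hypothetical, and a real runtime would also overlap independent tasks and manage data transfers.

```python
from graphlib import TopologicalSorter


def schedule(devices, deps):
    """Return (task, device) pairs in an order that respects dependencies.

    devices: task name -> device label
    deps:    task name -> set of tasks that must finish first
    """
    order = TopologicalSorter(deps).static_order()
    return [(task, devices[task]) for task in order]


# Hypothetical inference pipeline: CPU stages around one GPU stage.
devices = {"load": "cpu", "preprocess": "cpu", "matmul": "gpu", "postprocess": "cpu"}
deps = {
    "preprocess": {"load"},
    "matmul": {"preprocess"},
    "postprocess": {"matmul"},
}
plan = schedule(devices, deps)
```

The application only declares what depends on what; deciding when each task runs, and on which engine, is left to the runtime.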
07. Performance and Cost
For a representative workload comparison: training a large transformer model.
- CPU only: feasible for tiny models; impractical for production-scale.
- Single high-end GPU: trains small models; hits memory limits for large ones.
- 8-GPU server (e.g., DGX/HGX H100): trains or fine-tunes models up to ~70B parameters with distributed strategies.
- GPU cluster (1000s of H100s): trains frontier models (GPT-4-class, Gemini Ultra-class).
Costs scale: a single H100 retails for up to roughly $40,000, a fully loaded 8-GPU H100 server costs several times that, and at cluster scale the bill reaches roughly $50M-$500M in compute alone. The economics of accelerator deployment are now a major consideration in AI strategy.
Efficiency: tensor-core FLOPS per watt has improved dramatically.
- A100 FP16: ~312 TFLOPS (peak, dense) at 400W → ~0.8 TFLOPS/W.
- H100 FP8: ~2000 TFLOPS (peak, dense) at 700W → ~2.8 TFLOPS/W.
- B200 FP4: ~9000 TFLOPS (peak, dense) at 1000W → ~9 TFLOPS/W.
The gains come from process improvements, lower-precision math, and architectural improvements. Total system efficiency (including memory, networking, cooling) is lower; data centers measure PUE (power usage effectiveness) of 1.1-1.5×.
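The step from chip-level to facility-level efficiency is simple arithmetic once PUE is read as total facility power divided by IT power. The sketch below applies that definition; the chip figures in the example are round hypothetical numbers, and the model ignores host CPUs, memory, and networking, which also draw IT power.

```python
def facility_power_w(it_power_w, pue):
    """Total facility power implied by a given IT load and PUE."""
    return it_power_w * pue


def effective_tflops_per_watt(peak_tflops, chip_power_w, pue, utilization=1.0):
    """Delivered TFLOPS per facility watt for one accelerator."""
    delivered = peak_tflops * utilization
    return delivered / facility_power_w(chip_power_w, pue)


# Hypothetical accelerator: 1000 peak TFLOPS at 500 W, run at 50% utilization
# in a facility with a PUE of 1.25.
chip_level = 1000 / 500
facility_level = effective_tflops_per_watt(1000, 500, pue=1.25, utilization=0.5)
```

In this example the datasheet figure of 2 TFLOPS/W shrinks to 0.8 TFLOPS per facility watt, which is why utilization and cooling overhead matter as much as the peak number.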
08. Future Directions
Several trends will shape accelerator design:
Lower-precision formats. FP4, FP3, even ternary or binary formats. Halving the operand width roughly doubles throughput per unit of area and power.
Sparse computation. Skipping zero or near-zero values; structured sparsity patterns. NVIDIA tensor cores have supported 2:4 structured sparsity since Ampere, and Hopper and Blackwell retain it.
Dataflow architectures. Rather than instruction-driven, reconfigurable dataflows that compile to specific kernels. SambaNova, Cerebras pursue this.
Memory-compute integration. PIM (processing in memory): compute on or very close to memory. Samsung's HBM-PIM; SK hynix's GDDR6-AiM; Mythic's analog compute-in-memory.
Optical interconnect. Photonic links between chips, dies, even within chips. Lightmatter, Ayar Labs.
Specialized for new model architectures. Mixture-of-experts, state-space models, neural ODEs may favor different microarchitectures than dense transformers.
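Of the trends above, structured sparsity is the easiest to show in code. The sketch below assumes a 2:4 pattern (keep the two largest-magnitude weights in every group of four), chosen here for illustration; `prune_2_of_4` is a made-up name, and real pipelines usually fine-tune the model after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    assert len(weights) % 4 == 0
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(group[i] if i in keep else 0.0 for i in range(4))
    return out


dense = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.05]
sparse = prune_2_of_4(dense)
```

Because exactly half of every group is zero, hardware can store only the surviving values plus a small index and skip the rest, which is what makes the pattern cheap to exploit compared with unstructured sparsity.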
The accelerator landscape will not stabilize soon. AI workloads are moving fast, and hardware that takes 3-5 years to design has to anticipate where the workloads will be in 2030.
09. Implications for the CPU
Accelerators don't make the CPU irrelevant. They redefine its role:
- The CPU runs orchestration, control flow, OS, and the parts of workloads that don't fit the accelerator.
- The CPU manages data flow between accelerators, storage, and network.
- The CPU handles security boundaries, scheduling, accounting.
In a typical AI inference server, the CPU is a small fraction of total compute but is responsible for receiving requests, formatting them, dispatching to the GPU, and returning results. CPU bottlenecks here can starve a million-dollar GPU.
In an autonomous vehicle, the CPU runs the operating system, mediates between sensors and accelerators, executes safety-critical control loops. Specialized accelerators handle perception (vision NPU), localization (specific compute units), planning (often back on CPU).
In a smartphone, the CPU runs the OS, the apps' main threads, and the parts of UI and logic that don't fit other engines. The GPU draws the screen. The NPU runs always-on AI features. The ISP processes camera frames. The audio DSP handles voice. All cooperatively.
The CPU isn't going away. It's becoming the conductor, while the orchestra grows more diverse.
10. Summary
Accelerators specialize for workloads where the CPU's general-purpose machinery is overkill. GPUs are throughput-oriented SIMT engines with massive multithreading hiding memory latency; NPUs are AI-specialized with reduced-precision arithmetic and dataflow optimization; fixed-function blocks (codecs, ISPs, crypto, networking) handle specific operations more efficiently than any programmable engine.
Integration spans from PCIe-attached discrete cards to coherent on-package chiplets. Programming models range from low-level (CUDA) to high-level (PyTorch). The economics of accelerators have become a major driver of computing investment, especially in AI.
For the CPU, accelerators are both a partner and a competitor: more workloads move off the CPU, but the CPU's role as orchestrator becomes more central and demands continued evolution.
The next chapter looks at the other end of the spectrum: embedded and real-time systems, where deterministic latency, low power, and reliability are the dominant concerns rather than raw throughput.