Part VII: Advanced and Frontier

Modern Packaging

May 16, 2026 · 13 min read · advanced

For most of computing history, a "chip" meant a single die in a single package. Moore's Law shrank the transistors; packaging stayed simple. That era is ending. Modern flagship CPUs and GPUs increasingly comprise multiple dies — chiplets — connected through advanced packaging. High-bandwidth memory is stacked vertically and integrated next to the compute die. Substrates have become active circuits. Packaging is now a first-class architectural concern.

This chapter covers the why and how of modern packaging: the economics that drove the industry to chiplets, the interconnect technologies that make them practical, the 3D-stacked memory now standard in AI accelerators, and the consequences for system architecture.

01. Why Chiplets

Reticle limits cap the maximum die size on any given process at roughly 850 mm². Going larger requires stitching across exposures, which is costly and yield-limiting. Even below the reticle limit, yield is the bigger issue: a defect density of (say) 0.1 defects/cm² on an 800 mm² die means an average of 0.8 defects per die, so more than half of all dies carry at least one defect and yield is poor.

Chiplets break a large design into multiple smaller dies. Yield improves dramatically — defects in one chiplet don't ruin the whole design. Different chiplets can use different process nodes — analog or I/O blocks on a cheaper, mature process; high-density logic on a leading-edge process. Different dies can be designed independently and reused across products.

Fixed costs on a leading-edge process are heavily weighted toward design and mask costs (tens of millions of dollars for a full mask set), so reusing chiplet designs across products amortizes them. AMD's chiplet strategy has been a driver of its cost competitiveness against Intel's monolithic designs.

02. Multi-Chip Packages

Before "chiplets" was a buzzword, multi-chip modules (MCMs) were already in use. Two or more dies side by side in one package, connected through the package substrate.

Examples spanning decades:

  • Pentium Pro (1995): two dies — CPU and L2 cache — in one package.
  • Core 2 Quad (2007): two dual-core dies in one package.
  • AMD Zen architectures (2017+): chiplets connected via Infinity Fabric.

The package substrate is essentially a small PCB with the chips attached. Substrate routing supports modest bandwidth: up to hundreds of GB/s in total, depending on signal pin count and signaling frequency.

03. AMD's Chiplet Approach

AMD pioneered the chiplet approach in modern x86 with Zen. The Zen 2 / Zen 3 / Zen 4 client and server processors comprise:

  • CCDs (Core Complex Dies): each containing 8 cores plus L3 cache. CCDs are made on the leading-edge process (TSMC N7, N5, N4).
  • IOD (I/O Die): containing the memory controllers, PCIe controllers, Infinity Fabric, and other I/O. Made on a more mature, cheaper process (GlobalFoundries 14 nm / 12 nm for Zen 2 and Zen 3, TSMC N6 for Zen 4).

A 16-core Ryzen 9 has two CCDs plus an IOD. A 64-core EPYC has 8 CCDs plus an IOD. The IOD is a hub: every memory access from any CCD goes through the IOD's memory controllers.
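The composition rule above (8-core CCDs around a single hub IOD) can be written down as a small model. This is a minimal sketch for illustration only: the `ChipletPackage` class and its method names are hypothetical, and the only facts it encodes are the 8 cores per CCD and the CCD-to-IOD-to-DRAM path described in the text.

```python
from dataclasses import dataclass
from math import ceil

CORES_PER_CCD = 8  # from the text: each CCD contains 8 cores


@dataclass(frozen=True)
class ChipletPackage:
    """Hypothetical description of a CCD + IOD package (illustrative only)."""
    name: str
    cores: int

    @property
    def ccds(self) -> int:
        # Number of 8-core CCDs needed to supply the requested core count.
        return ceil(self.cores / CORES_PER_CCD)

    @property
    def dies(self) -> int:
        # All CCDs plus the single I/O die that acts as the hub.
        return self.ccds + 1

    def memory_path(self, ccd_index: int) -> list[str]:
        # Every DRAM access leaves the CCD, crosses the fabric to the IOD,
        # and only then reaches the memory controllers and DRAM.
        if not 0 <= ccd_index < self.ccds:
            raise IndexError("no such CCD in this package")
        return [f"CCD{ccd_index}", "IOD", "DRAM"]


ryzen = ChipletPackage("16-core Ryzen 9", 16)
epyc = ChipletPackage("64-core EPYC", 64)
print(ryzen.ccds, epyc.ccds)   # CCD counts for the two parts named in the text
print(epyc.memory_path(3))     # hub topology: CCD -> IOD -> DRAM
```

The hub shape is what produces both the cost advantage (one CCD design, many products) and the latency cost (every memory access takes the extra hop through the IOD).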

Tradeoffs:

  • Pro: cost. Smaller dies, higher yield, optimal process per function.
  • Pro: scalability. Same CCD design used in 8-core consumer parts and 96-core servers.
  • Con: latency. Memory access from a CCD goes through the IOD; cross-CCD cache coherence requires multiple hops.
  • Con: power for inter-chiplet links. Infinity Fabric traffic consumes significant power.

The latency cost is real: AMD CCD-to-DRAM latency is higher than Intel's monolithic equivalents by 10-20 ns. AMD compensates with larger caches and aggressive prefetching.

04. Intel and Modular Packaging

Intel was slower to embrace chiplets but committed substantially with the Sapphire Rapids generation (launched in early 2023) and beyond:

  • Meteor Lake (2023), mobile: "tile" architecture with separate compute, GPU, SoC, and I/O tiles mounted on a base tile using Foveros (3D packaging).
  • Ponte Vecchio GPU: 47 active tiles using EMIB and Foveros.
  • Sapphire Rapids Xeon: four-tile arrangement using EMIB.

Intel's terminology is "tile" rather than "chiplet" — same concept. The distinction is implementation: tiles can be 3D stacked (Foveros) where chiplets are usually 2D side-by-side.

05. Interconnect Standards: UCIe

A key barrier to mixing chiplets from different vendors has been the lack of a standard interconnect. UCIe (Universal Chiplet Interconnect Express) is a 2022 standard from Intel, AMD, ARM, Samsung, TSMC, Google, Microsoft, Meta, and Qualcomm. UCIe defines:

  • A physical layer (high-speed parallel signaling).
  • A die-to-die link layer (flow control, retransmission).
  • A protocol layer compatible with PCIe, CXL, or arbitrary streaming.
  • Multiple form factors: standard package (organic substrate), advanced package (silicon interposer or EMIB).

Bandwidth in advanced package mode is very high: at the top data rates, on the order of a terabyte per second per millimeter of die edge, far exceeding what a PCIe link provides.

UCIe is not yet widely deployed in shipping products but is gaining traction. The vision is a heterogeneous chiplet ecosystem: pick a CPU chiplet from one vendor, a GPU chiplet from another, an I/O chiplet from a third, integrate them into a custom package.

06. High-Bandwidth Memory (HBM)

DRAM is fundamentally limited by package pin count: a typical DDR5 channel has ~80 pins delivering ~50 GB/s. Multi-channel parts hit hundreds of GB/s; HBM goes to multiple TB/s.

HBM is DRAM stacked vertically — typically 8 to 12 dies — and connected to the host through thousands of fine-pitch microbumps via a silicon interposer. Each HBM stack delivers hundreds of GB/s; multiple stacks deliver TB/s.

  • HBM2 (2016): up to 256 GB/s per stack.
  • HBM2E (2019): 460 GB/s per stack.
  • HBM3 (2022): 819 GB/s per stack.
  • HBM3E (2024): 1.2 TB/s per stack.
  • HBM4 (announced for 2025-2026): 1.5+ TB/s per stack.
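The per-stack figures follow from simple arithmetic: bus width times per-pin data rate. The sketch below assumes a 1024-bit interface per stack and per-pin rates of 2.0, 3.6, 6.4, and 9.6 Gb/s for HBM2 through HBM3E; those rates are assumptions chosen to be consistent with the bandwidths listed above. The function name `bus_bandwidth_gb_s` is hypothetical.

```python
def bus_bandwidth_gb_s(width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s of a parallel bus: width * per-pin rate / 8."""
    return width_bits * gbit_per_pin / 8


HBM_BUS_BITS = 1024  # assumed interface width per HBM stack

# Assumed per-pin data rates in Gb/s for each generation.
PIN_RATES_GBIT = {"HBM2": 2.0, "HBM2E": 3.6, "HBM3": 6.4, "HBM3E": 9.6}

for generation, rate in PIN_RATES_GBIT.items():
    per_stack = bus_bandwidth_gb_s(HBM_BUS_BITS, rate)
    print(f"{generation}: {per_stack:.0f} GB/s per stack")

# For contrast: a 64-bit DDR5 data bus at an assumed 6.4 GT/s.
ddr5_channel = bus_bandwidth_gb_s(64, 6.4)
print(f"DDR5-6400: {ddr5_channel:.1f} GB/s")
```

The per-pin rates are in the same range for both technologies; the gap comes almost entirely from the sixteen-fold difference in bus width, which is why the fine-pitch interposer wiring matters.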

A typical AI accelerator (NVIDIA H100, AMD MI300, Intel Gaudi 3) has 4-8 HBM stacks, totaling roughly 3-5 TB/s of memory bandwidth, about an order of magnitude beyond traditional DDR-based systems.

The cost: HBM is expensive. Each stack costs more than an entire DDR5 DIMM; the silicon interposer adds significant area; assembly is more complex. HBM is reserved for products where memory bandwidth is the binding constraint — data center AI, high-end GPUs, HPC accelerators.

For consumer parts, HBM hasn't replaced GDDR. GDDR6 / GDDR6X / GDDR7 deliver less bandwidth (still hundreds of GB/s per part) at much lower cost. NVIDIA's RTX gaming cards use GDDR6/7; the data center H100/H200/B200 use HBM.

07. CoWoS and Other Interposer Technologies

The dominant advanced-packaging technology for HBM and large GPUs is CoWoS (Chip-on-Wafer-on-Substrate, TSMC). A silicon interposer the size of multiple chips is fabricated on a wafer; chiplets are bonded face-down to the interposer; the assembly is then mounted on a package substrate.

Silicon interposers carry many thousands of fine-pitch traces (microns wide, microns apart) — far more than a package substrate can route. This is what makes HBM's thousand-bit-wide bus possible.

Variants:

  • CoWoS-S: silicon interposer (most common).
  • CoWoS-R: redistribution layer (cheaper but lower bandwidth).
  • CoWoS-L: local silicon interconnects (efficient for select connections).

Intel has a similar technology: EMIB (Embedded Multi-Die Interconnect Bridge). Instead of a full interposer, EMIB embeds small silicon bridges in the package substrate at chiplet-to-chiplet boundaries. Cheaper than full interposer; targeted bandwidth where needed.

Capacity limits: a silicon interposer patterned in a single exposure is bounded by the reticle field (roughly 26 mm × 33 mm, about 850 mm²), so the total chiplet area is bounded. Stitched multi-reticle interposers and CoWoS-L, which combines local silicon bridges with a redistribution layer, extend this to several reticles. Future designs target interposers several times larger still.

08. 3D Stacking

Beyond 2D side-by-side and 2.5D interposer-based integration, 3D stacking places dies directly on top of one another. Signals and power pass vertically through Through-Silicon Vias (TSVs), and the dies are joined with microbumps or with hybrid (direct copper-to-copper) bonding.

Examples:

AMD 3D V-Cache. Stacks a 64 MB SRAM die with a Zen 3, Zen 4, or Zen 5 CCD, tripling L3 cache from 32 MB to 96 MB. Used in Ryzen 7 5800X3D, 7800X3D, 9800X3D, and EPYC parts. Performance gains in cache-sensitive games and workloads are dramatic, sometimes 20%+ over the non-stacked version.

Intel Foveros. 3D stacking used in Lakefield and Meteor Lake. A "base tile" provides power and clock distribution; logic tiles stack on top.

HBM itself. HBM stacks 8-12 DRAM dies plus a base logic die.

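3D stacking introduces thermal challenges: heat from a lower die has to travel through the dies above it to reach the cooler. In the Zen 3 and Zen 4 generations, the 3D V-Cache die sits on top of the CCD, over the L3 region rather than over the cores, because SRAM dissipates far less power than the active cores; even so, those parts shipped with lower clock and voltage limits than their non-stacked siblings. The 9800X3D generation moves the cache die underneath the CCD so that the cores sit closer to the cooler.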

Future 3D logic-on-logic stacking is an active research area. The thermal challenge is more severe: getting heat out of the lower die requires high-conductivity bond materials, integrated microfluidic cooling, or constrained power densities.

09. CXL and Disaggregation

A different angle: CXL (Compute Express Link) is a high-bandwidth, low-latency cache-coherent interconnect built on PCIe physical layers. CXL 1.0 supports memory expansion: an add-in card with DRAM can be accessed by the CPU as memory. CXL 2.0 / 3.0 add memory pooling, switching, and richer coherence options.

CXL doesn't replace direct memory channels but supplements them: when a server needs more memory than its DDR slots provide, CXL can add capacity. When multiple servers in a rack need to share memory, CXL switches enable it.

Combined with chiplets: future systems may have "memory chiplets" connected by CXL or UCIe, scaling memory capacity independently of compute. Disaggregating memory from CPU enables flexible system composition.

NVIDIA's NVLink is a cache-coherent inter-GPU and CPU-GPU interconnect. NVLink 4 / 5 (Hopper / Blackwell generation) delivers 900 GB/s to 1.8 TB/s per GPU. Eight GPUs in a server connect through NVSwitches to provide all-to-all bandwidth.

Apple's M-series uses an internal fabric (UltraFusion, in M1 Ultra and successors) that connects two M-Max dies with roughly 2.5 TB/s of bandwidth, presenting them as one unified system with shared memory and shared interconnects. The user sees one chip; underneath it's two.

Both technologies illustrate how on-package interconnect now matters as much as off-package interconnect. The link between the two GPU dies in a Blackwell B200 carries on the order of 10 TB/s, far more than the socket-to-socket links in a multi-socket CPU server.

11. Ponte Vecchio: A Case Study

Intel's GPU Max (Ponte Vecchio) deserves a special mention as one of the most complex packages ever built:

  • 47 active tiles total.
  • Mix of TSMC N5, TSMC N7, Intel 7, Samsung 11LPP processes.
  • 8 HBM2E stacks.
  • Foveros (3D) and EMIB (2.5D) packaging.
  • Over 100 billion transistors total.

Ponte Vecchio shipped in the Aurora supercomputer at Argonne National Lab. The complexity meant lengthy bring-up; the chip taught Intel's packaging team a great deal that informed Meteor Lake and beyond.

12. Power Delivery in Advanced Packages

A modern server CPU dissipates 350-500 W; an AI accelerator can exceed 1000 W. All this power must be delivered to the silicon at low voltage (~1 V), meaning currents from several hundred to over a thousand amps through the package.

Solutions:

  • Many small VRMs placed close to the silicon, each delivering modest current to a local domain.
  • Higher input voltages (12V or 48V into the package, stepped down on-package).
  • Vertical power delivery through TSVs (separate dedicated paths for power vs. signals).
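The motivation for bringing a higher voltage into the package falls out of two formulas, I = P / V and P_loss = I²R. The sketch below uses a hypothetical 0.1 mΩ delivery-path resistance purely for illustration; the function names and the resistance value are assumptions, not measurements of any real package.

```python
def current_amps(power_w: float, voltage_v: float) -> float:
    """Current needed to deliver a given power at a given voltage (I = P / V)."""
    return power_w / voltage_v


def conduction_loss_w(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss in the delivery path (P_loss = I^2 * R)."""
    current = current_amps(power_w, voltage_v)
    return current * current * resistance_ohm


LOAD_W = 1000.0               # accelerator-class load, as in the text
PATH_RESISTANCE_OHM = 0.0001  # hypothetical 0.1 milliohm delivery path

for volts in (1.0, 12.0, 48.0):
    amps = current_amps(LOAD_W, volts)
    loss = conduction_loss_w(LOAD_W, volts, PATH_RESISTANCE_OHM)
    print(f"{volts:>4.0f} V: {amps:7.1f} A, {loss:7.3f} W lost in the path")
```

Because loss scales with the square of the current, delivering the same power at 12 V or 48 V and converting close to the die cuts conduction loss in the long part of the path by two to three orders of magnitude under these assumptions.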

Intel's PowerVia (backside power delivery) routes power on the back side of the silicon, freeing the front side entirely for signal routing. It was developed for the Intel 20A and 18A processes and is expected to deliver meaningful efficiency and density gains.

13. Substrates and Connections

The package substrate itself has evolved:

  • Organic substrates (most common for CPUs and GPUs): modest layer count, modest density.
  • Glass substrates (announced by Intel and others, 2023-2024): higher density, better dimensional stability, potential for very large packages.
  • Silicon interposers (CoWoS): highest density but most expensive.

Future packages will likely combine these: glass for the main substrate, silicon for high-density bridges, perhaps with embedded passive components.

14. Implications for Architecture

Modern packaging changes architectural choices:

Memory-attached compute (PIM, Processing in Memory): some HBM variants, such as Samsung's HBM-PIM, place simple compute (data movement, basic ops) inside the HBM stack, bypassing the host. Not universally deployed but emerging.

Latency tolerance. Off-die accesses can be fast in advanced packaging (in the best cases approaching on-die L3 latency). Some architectural choices are easier when "local" includes the whole package.

Scale-up vs. scale-out. With UCIe and high-bandwidth packaging, more compute fits in one logical "node" before scaling out to multiple servers. NVIDIA's GH200 / GB200 NVLink domains span dozens of GPUs in one coherence domain.

Heterogeneity. Chiplets enable mixing CPU, GPU, NPU, FPGA, custom accelerators in one package. Apple's M-series exemplifies this: CPU, GPU, NPU, media engine, Secure Enclave, all together.

Disaggregation. Conversely, what used to be on-die can move off-die when bandwidth is sufficient. Memory pooled via CXL; storage tiered to NVMe; specialized accelerators added per-job.

The line between "chip" and "system" is blurring.

15. Cost and Yield Economics

A simple yield model: for D defects/cm² on dies of area A cm², yield is approximately $e^{-DA}$. An 800 mm² die at 0.1 defects/cm² yields ~45%; an 80 mm² die at the same defect rate yields ~92%. Splitting one die into ten smaller dies dramatically improves yield.
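The numbers above can be reproduced directly from the Poisson model. This is a minimal sketch: the function names are hypothetical, and the model ignores defect clustering, edge loss, and assembly yield.

```python
import math


def poisson_yield(defects_per_cm2: float, area_mm2: float) -> float:
    """Fraction of defect-free dies under the simple model Y = exp(-D * A)."""
    area_cm2 = area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return math.exp(-defects_per_cm2 * area_cm2)


def silicon_per_good_mm2(defects_per_cm2: float, die_mm2: float, total_mm2: float) -> float:
    """Wafer area consumed to obtain total_mm2 of good silicon from dies of die_mm2."""
    dies_needed = total_mm2 / die_mm2
    return dies_needed * die_mm2 / poisson_yield(defects_per_cm2, die_mm2)


D = 0.1  # defects per cm^2, as in the text

print(f"800 mm^2 die yield: {poisson_yield(D, 800):.1%}")  # about 45%
print(f" 80 mm^2 die yield: {poisson_yield(D, 80):.1%}")   # about 92%

# Wafer area spent per 800 mm^2 of working logic, monolithic vs. small dies.
print(f"monolithic: {silicon_per_good_mm2(D, 800, 800):.0f} mm^2")
print(f"80 mm^2 dies: {silicon_per_good_mm2(D, 80, 800):.0f} mm^2")
```

Under these assumptions the monolithic design consumes roughly twice as much wafer area per working product, which is the raw yield advantage that chiplet overheads then eat into.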

But: chiplets pay overhead, since the inter-chiplet links require extra signaling, area, and power. The crossover point where chiplets win depends on die size, defect density, and process maturity. For leading-edge nodes, the crossover can fall at a few hundred mm² or less, which makes chiplets advantageous for most flagship designs.
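A toy model makes the crossover concrete. The sketch below compares wafer area consumed per good product for a monolithic die against the same logic split into n chiplets, each inflated by a hypothetical area overhead for die-to-die links. It assumes the Poisson yield model, chiplets that are tested before assembly, and perfect assembly yield, and it ignores packaging cost entirely, so the result is illustrative rather than predictive. All function names and parameter values are assumptions.

```python
import math


def cost_monolithic(area_cm2: float, d: float) -> float:
    """Wafer area (cm^2) consumed per good monolithic die."""
    return area_cm2 * math.exp(d * area_cm2)


def cost_chiplets(area_cm2: float, d: float, n: int, overhead: float) -> float:
    """Wafer area (cm^2) consumed per good set of n chiplets.

    Each chiplet carries 1/n of the logic plus a fractional area overhead
    for inter-chiplet links; only known-good chiplets are assembled.
    """
    chiplet_area = area_cm2 * (1.0 + overhead) / n
    return n * chiplet_area * math.exp(d * chiplet_area)


def crossover_area_cm2(d: float, n: int, overhead: float,
                       step: float = 0.01, limit: float = 10.0):
    """Smallest design area (cm^2) at which chiplets consume less wafer area."""
    steps = int(limit / step)
    for i in range(1, steps + 1):
        area = i * step
        if cost_chiplets(area, d, n, overhead) < cost_monolithic(area, d):
            return area
    return None  # no crossover below the search limit


# Hypothetical parameters: 0.1 defects/cm^2, 10% link overhead.
for n in (2, 4, 8):
    area = crossover_area_cm2(d=0.1, n=n, overhead=0.10)
    print(f"{n} chiplets: crossover near {area * 100:.0f} mm^2")
```

Raising the defect density (an immature process) or lowering the overhead pulls the crossover toward smaller designs; adding a fixed packaging cost, which this sketch omits, pushes it the other way.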

For mature processes or smaller designs, monolithic remains attractive. Mid-range CPUs (e.g., laptop chips with 4-8 cores) often stay monolithic.

16. Looking Forward

Several trends point ahead:

Multi-reticle interposers: packages spanning multiple reticle exposures, integrating tens of chiplets. Already in some research designs.

Optical I/O integration: photonic interconnects integrated into packages. Lightmatter, Ayar Labs, and several research groups are pushing this. Promises bandwidth and energy efficiency far beyond electrical I/O for very long links.

Backside power delivery: industry-wide migration from front-side to back-side power, freeing front-side metal for signals.

Compute-in-memory: not just data movement but actual computation in HBM and other memories. Samsung's HBM-PIM, SK hynix's GDDR6-AiM, and several research efforts.

Substrate-as-circuit: glass substrates with embedded passives, antennas, or even active devices.

Thermal innovation: microfluidic cooling integrated into packages; advanced TIMs; vapor chambers built into package lids.

The pace of packaging innovation has accelerated as Moore's Law has slowed. When you can't shrink transistors much further, you integrate more of them — and more diverse functions — into the package.

17. Summary of Part XI

Part XI has covered advanced topics:

  • Chapter 50 went deep on cache: non-blocking caches, prefetching, victim caches, inclusion policies, coherence protocols beyond MESI, cache QoS.
  • Chapter 51 examined modern branch prediction (TAGE, perceptron, hybrids) and the speculative-execution attack family — Spectre, Meltdown, MDS, and successors — along with mitigations.
  • Chapter 52 was the physics: CMOS power scaling, DVFS, thermal design, manufacturing processes, and the end of Dennard scaling.
  • Chapter 53 covered reliability and validation: error sources, ECC, RAS, validation methodology, post-silicon test, and silent data corruption.
  • Chapter 54 discussed performance analysis: top-down methodology, performance counters, roofline, profiling tools, microbenchmarking pitfalls.
  • Chapter 55 is modern packaging: chiplets, HBM, advanced interposers, 3D stacking, CXL, and the implications.

Together, these chapters cover the parts of CPU design that don't fit neatly into "instruction set" or "microarchitecture" but are critical to how modern systems actually work and perform.

The final part, Part XII, looks beyond the CPU itself: GPUs and accelerators, embedded and real-time computing, and reconfigurable / emerging architectures. The CPU is no longer the only — or even the primary — compute device in many systems; understanding what surrounds it is essential to understanding modern computing.
