Part VII·Advanced and Frontier·Chapter 50 of 62

Advanced Cache

May 16, 2026·12 min read·advanced

Chapter 17 introduced caches: associativity, replacement, write policies, the basic three-level hierarchy. This chapter takes a deeper look at modern cache design — the techniques used in shipping silicon today that go beyond the introductory model. We cover non-blocking caches, prefetching in detail, victim caches and other helper structures, inclusion policies and their consequences, partitioning and quality-of-service, transactional memory's interaction with caches, and the cache aspects of cache-coherent interconnects in many-core systems.

This chapter is referenced from Chapter 17 (basic cache design) and Chapter 26 (out-of-order load/store handling).

01.Non-Blocking Caches

A blocking cache stalls on every miss until the missing line returns. A non-blocking (or "lockup-free") cache continues to serve hits and accept new requests while misses are outstanding. Modern caches are aggressively non-blocking: high-end L1 caches can track 16 or more outstanding misses, L2 caches 32 or more, and L3 caches 64 or more.

The core mechanism is the MSHR (Miss Status Holding Register). Each MSHR tracks one outstanding miss: the missing address, the requesting transactions waiting on it, and the eventual data destination. A typical entry:

Block address being fetched.
Source level (where the request was sent: L2, L3, memory controller).
Pending request list: the loads/stores/prefetches that touched this line.
State (issued, returning, complete).

When a load misses and an MSHR exists for the same line, it is a "secondary miss" — the new load attaches to the existing MSHR and waits without issuing a new fetch. This is MSHR coalescing: many misses to the same line cost only one fetch.
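The allocation and coalescing rules above can be sketched in a few lines of C. This is a minimal software model, not any vendor's design: the entry count, the number of waiter slots, and all names (`mshr_file`, `mshr_on_miss`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MSHRS   8   /* hypothetical MSHR count */
#define MAX_WAITERS 4   /* hypothetical pending-request slots per entry */
#define LINE_SHIFT  6   /* 64-byte lines */

typedef enum { MISS_PRIMARY, MISS_SECONDARY, MISS_STALL } miss_result;

typedef struct {
    bool     valid;
    uint64_t block_addr;            /* line address being fetched */
    int      waiters[MAX_WAITERS];  /* IDs of requests waiting on the line */
    int      num_waiters;
} mshr_entry;

typedef struct {
    mshr_entry entry[NUM_MSHRS];
    int        fetches_issued;      /* requests sent to the next level */
} mshr_file;

/* Handle a miss by request `req_id` to byte address `addr`. */
static miss_result mshr_on_miss(mshr_file *f, uint64_t addr, int req_id)
{
    assert(f != NULL);
    uint64_t block = addr >> LINE_SHIFT;
    mshr_entry *free_slot = NULL;

    for (int i = 0; i < NUM_MSHRS; i++) {
        mshr_entry *e = &f->entry[i];
        if (e->valid && e->block_addr == block) {
            /* Secondary miss: attach to the existing entry, no new fetch. */
            if (e->num_waiters == MAX_WAITERS)
                return MISS_STALL;
            e->waiters[e->num_waiters++] = req_id;
            return MISS_SECONDARY;
        }
        if (!e->valid && free_slot == NULL)
            free_slot = e;
    }
    if (free_slot == NULL)
        return MISS_STALL;          /* every MSHR is busy: the request waits */

    /* Primary miss: allocate an entry and issue exactly one fetch. */
    free_slot->valid = true;
    free_slot->block_addr = block;
    free_slot->waiters[0] = req_id;
    free_slot->num_waiters = 1;
    f->fetches_issued++;
    return MISS_PRIMARY;
}
```

Two misses to 0x1000 and 0x1008 fall in the same 64-byte line, so the second returns `MISS_SECONDARY` and `fetches_issued` stays at 1. A real MSHR also frees the entry and wakes its waiters when the fill returns, which this sketch omits.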

The number of MSHRs limits memory-level parallelism (MLP, Chapter 25). If 16 loads are independent and would each miss, but the cache has 8 MSHRs, only 8 can be in flight at a time; the rest stall. Modern high-performance cores push MSHR counts up to keep MLP high.
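The MSHR limit can be turned into a rough bandwidth bound using Little's law. The figures below (64-byte lines, a 100 ns miss latency) are illustrative assumptions, not measurements of any particular core:

```latex
\[
\text{BW}_{\max} \approx \frac{N_{\text{MSHR}} \times \text{line size}}{\text{miss latency}},
\qquad
\frac{8 \times 64\,\text{B}}{100\,\text{ns}} = 5.12\,\text{GB/s},
\qquad
\frac{16 \times 64\,\text{B}}{100\,\text{ns}} = 10.24\,\text{GB/s}.
\]
```

Under these assumptions, doubling the MSHR count doubles the memory bandwidth a single core can sustain from demand misses alone.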

MSHR variants:

Per-line MSHRs (most common): one entry tracks one cache line.
Sub-block MSHRs: track byte ranges within a line — useful when lines are long and partial overlap is common.
Hierarchical MSHRs: L1 MSHRs in the load/store unit; L2 MSHRs in the L2 controller; etc.

MSHRs also interact closely with the OoO load queue. A load in the load queue that is waiting on an MSHR doesn't block the rest of the queue: younger loads can issue, hit in the cache, or allocate MSHRs of their own while the older one waits.

02.Prefetching

Hardware prefetching is one of the highest-impact optimizations in modern caches. Workloads that touch memory in regular patterns benefit enormously when prefetching anticipates the next accesses and pulls them in early. Software prefetching (using prefetch instructions explicitly) helps where hardware can't predict.

Stream / Stride Prefetchers

The simplest hardware prefetcher detects sequential access. If a program reads addresses A, A+64, A+128, ... (sequential cache lines), the prefetcher fetches A+192, A+256, ... ahead.

A stream buffer holds prefetched lines until the demand stream reaches them. The number of streams a prefetcher can track is finite — modern L1 prefetchers track 4-8 streams; L2 prefetchers can track 16-32.

A stride prefetcher generalizes: detect any constant stride between accesses (every 256 bytes, every 1 KiB, etc.). Useful for matrix code accessing rows or columns with non-unit stride.
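One common way to detect a constant stride is a small table indexed by the load's program counter. The sketch below assumes that organization; the table size, confidence threshold, and names (`stride_entry`, `stride_train`) are hypothetical, and it issues only one prefetch per access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE     64   /* hypothetical number of tracked load PCs */
#define CONF_THRESHOLD 2    /* repeated strides needed before prefetching */

typedef struct {
    int      valid;
    uint64_t pc;          /* load instruction that owns this entry */
    uint64_t last_addr;   /* address of its previous access */
    int64_t  stride;      /* last observed address delta */
    int      confidence;  /* how many times the delta has repeated */
} stride_entry;

/* Train on one demand access. Returns the address to prefetch, or 0 if none. */
static uint64_t stride_train(stride_entry *table, uint64_t pc, uint64_t addr)
{
    assert(table != NULL);
    stride_entry *e = &table[pc % TABLE_SIZE];

    if (!e->valid || e->pc != pc) {      /* new or conflicting PC: restart */
        e->valid = 1;
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confidence = 0;
        return 0;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->confidence < CONF_THRESHOLD)
            e->confidence++;
    } else {                             /* pattern changed: relearn */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;

    return (e->confidence >= CONF_THRESHOLD) ? addr + (uint64_t)e->stride : 0;
}
```

With accesses at 0x1000, 0x1100, 0x1200, 0x1300 from the same PC, the first three calls return 0 while the entry trains, and the fourth returns 0x1400.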

Region / Spatial Prefetchers

Many workloads access multiple lines within a region (say, a 4 KiB block) but in irregular order. A spatial pattern prefetcher learns: when a program touches address A in a region, it also typically touches addresses A+B and A+C. The prefetcher records this learned pattern and, on the next access to a similar region, prefetches the additional lines.

Variants:

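SMS (Spatial Memory Streaming): records per-region access bitmaps and associates each pattern with the instruction (PC) and offset that first touched the region.
AMPM (Access Map Pattern Matching): keeps per-region access maps and looks for stride patterns within them, without relying on the PC.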
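The per-region bitmap idea can be illustrated with a simplified sketch. It assumes 4 KiB regions and 64-byte lines, so one 64-bit word covers a region; the function names are hypothetical, and real designs also decide which learned pattern applies to a new region, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_SHIFT     12   /* 4 KiB regions */
#define LINE_SHIFT       6    /* 64-byte lines */
#define LINES_PER_REGION 64   /* 4096 / 64 */

/* Record that `addr` was touched; `*bitmap` has one bit per line in the region. */
static void region_record(uint64_t *bitmap, uint64_t addr)
{
    assert(bitmap != NULL);
    unsigned line = (unsigned)(addr >> LINE_SHIFT) & (LINES_PER_REGION - 1);
    *bitmap |= 1ULL << line;
}

/* Replay a learned bitmap in the region containing `trigger`.
 * Writes up to `max_out` line addresses to `out` and returns how many. */
static int region_replay(uint64_t bitmap, uint64_t trigger,
                         uint64_t *out, int max_out)
{
    assert(out != NULL);
    uint64_t base = (trigger >> REGION_SHIFT) << REGION_SHIFT;
    unsigned trigger_line =
        (unsigned)(trigger >> LINE_SHIFT) & (LINES_PER_REGION - 1);
    int n = 0;

    for (unsigned line = 0; line < LINES_PER_REGION && n < max_out; line++) {
        if (line == trigger_line)
            continue;                 /* the demand access fetches this one */
        if (bitmap & (1ULL << line))
            out[n++] = base + ((uint64_t)line << LINE_SHIFT);
    }
    return n;
}
```

If a program touched offsets 0x000, 0x040, and 0x400 in one region, replaying that bitmap when it first touches 0x9000 yields prefetches for 0x9040 and 0x9400.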

Indirect / Pointer Prefetchers

Linked data structures (linked lists, trees) defeat stream prefetchers because the next address depends on data fetched from the previous address. Some research designs detect this dependence and prefetch the next pointers in the chain.

In practice, indirect prefetching is hard to do well; most production prefetchers don't include it. Software prefetching with explicit __builtin_prefetch calls is the typical solution.

Page-Aware Prefetching

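Prefetchers that work on physical addresses must respect page boundaries: the next virtual page may map to an unrelated physical frame, so crossing a boundary needs a new translation, and a speculative prefetch cannot be allowed to raise a page fault. Most prefetchers therefore stop at page boundaries by default and restart on the next demand miss in the new page.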

Adaptive Prefetching

Prefetchers can pollute the cache with unused data, evicting useful lines. Modern designs adapt based on accuracy:

Track which prefetches were used vs. evicted unused.
If accuracy drops, throttle aggressiveness.
If accuracy is high, prefetch further ahead.
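A feedback loop of this kind can be sketched as a per-epoch adjustment of the prefetch degree. The thresholds (40% and 75%), the degree limits, and the names are hypothetical values chosen for illustration, not those of any shipping prefetcher.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_DEGREE 1   /* hypothetical: at least one line ahead */
#define MAX_DEGREE 8   /* hypothetical: at most eight lines ahead */

typedef struct {
    unsigned issued;   /* prefetches issued in this epoch */
    unsigned useful;   /* prefetched lines hit by a demand access */
    int      degree;   /* how many lines ahead to prefetch */
} pf_throttle;

/* Called at the end of each epoch: adjust aggressiveness from accuracy. */
static void throttle_epoch_end(pf_throttle *t)
{
    assert(t != NULL && t->useful <= t->issued);

    if (t->issued > 0) {
        unsigned accuracy_pct = 100u * t->useful / t->issued;
        if (accuracy_pct < 40 && t->degree > MIN_DEGREE)
            t->degree /= 2;          /* polluting the cache: back off */
        else if (accuracy_pct > 75 && t->degree < MAX_DEGREE)
            t->degree *= 2;          /* accurate: run further ahead */
    }
    t->issued = 0;                   /* start a fresh epoch */
    t->useful = 0;
}
```

An epoch with 90 useful prefetches out of 100 doubles the degree; one with 10 out of 100 halves it; anything in between leaves it unchanged.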

The Intel "L2 Streamer", AMD's L1/L2 prefetchers, ARM's CMC and SMC prefetchers — all incorporate adaptivity.

Software Prefetching

When the hardware can't predict, software can:

__builtin_prefetch(&array[i + 16], 0, 0);

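The hint asks the cache to fetch the line. The second argument is the read/write hint (0 for read, 1 for write) and the third is a temporal-locality hint from 0 (none) to 3 (high, the default); the call above requests a read with no temporal locality. On x86, the locality levels typically map to PREFETCHNTA / PREFETCHT2 / PREFETCHT1 / PREFETCHT0, which target different cache levels. On ARM, the hint maps to PRFM. On RISC-V, the Zicbop extension provides comparable prefetch instructions, though support is still uneven.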

Software prefetching is most useful for irregular access patterns where the program knows the next access addresses ahead of time. Compilers like LLVM can auto-insert prefetches for some loop patterns; explicit programmer use is more reliable.
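An indexed gather is a typical case: the addresses are irregular, but the index array reveals them well in advance. The sketch below assumes GCC or Clang (for `__builtin_prefetch`); the function name and the prefetch distance of 16 iterations are hypothetical and would need tuning on a real machine.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical: iterations to run ahead */

/* Sum data[idx[i]] for i in [0, n), prefetching future elements. */
static double gather_sum(const double *data, const size_t *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* idx[] is read sequentially, so the hardware prefetcher covers it;
         * the data[] access is the irregular one that needs help. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```

The bounds check keeps the prefetch from reading `idx` past the end of the array. The prefetch itself never faults, but the index load that feeds it is an ordinary load.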

03.Victim Caches

A victim cache is a small, fully-associative buffer that holds recently evicted lines. When a line is evicted from the main cache, it goes to the victim cache instead of being discarded. On the next miss, both the main cache and the victim cache are checked; a hit in the victim cache pulls the line back into the main cache without going to the next level.
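The lookup and swap behavior can be modeled with a direct-mapped main cache and a tiny FIFO victim buffer. The sizes and names below are hypothetical and chosen only to keep the sketch small; it tracks tags, not data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS       4   /* hypothetical direct-mapped main cache */
#define VICTIM_ENTRIES 2   /* hypothetical fully-associative victim buffer */

typedef enum { HIT_MAIN, HIT_VICTIM, MISS_BOTH } access_result;

typedef struct { bool valid; uint64_t block; } line_t;

typedef struct {
    line_t main_cache[NUM_SETS];
    line_t victim[VICTIM_ENTRIES];
    int    next_victim;            /* FIFO replacement pointer */
} cache_t;

static access_result cache_access(cache_t *c, uint64_t block)
{
    assert(c != NULL);
    line_t *slot = &c->main_cache[block % NUM_SETS];

    if (slot->valid && slot->block == block)
        return HIT_MAIN;

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        line_t *v = &c->victim[i];
        if (v->valid && v->block == block) {
            /* Swap: the victim line returns to the main cache and the
             * line it displaces takes its place in the victim buffer. */
            line_t displaced = *slot;
            *slot = *v;
            *v = displaced;
            return HIT_VICTIM;
        }
    }

    /* Miss in both: fill from the next level, keep the evicted line. */
    if (slot->valid) {
        c->victim[c->next_victim] = *slot;
        c->next_victim = (c->next_victim + 1) % VICTIM_ENTRIES;
    }
    slot->valid = true;
    slot->block = block;
    return MISS_BOTH;
}
```

Blocks 0 and 4 conflict in set 0 of this model. Alternating between them misses twice and then hits in the victim buffer on every later access instead of going to the next level.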

Originally proposed by Norm Jouppi (1990) for direct-mapped caches, where conflict misses were severe. With set-associativity now standard, dedicated victim caches are less common, but the idea persists in exclusive and non-inclusive hierarchies (see below), where lines evicted from an inner cache are placed in the next level out rather than discarded.

04.Inclusion Policies

When a system has multiple cache levels, three policies relate them:

Inclusive. Every line in L1 is also in L2; every line in L2 is also in L3. Each outer (larger) cache holds a superset of the caches inside it.

Pro: simplifies coherence — to know if any cache has a line, just check the largest cache.
Pro: snoop filtering — external snoops can be filtered at the largest cache.
Con: capacity — the smaller cache's contents are duplicated in the larger.
Con: cross-eviction — when a line is evicted from L3, it must also be evicted from L2 and L1 (back-invalidation).

Intel x86 traditionally used inclusive L3 caches: server parts through Broadwell, and client parts for several generations after that. The back-invalidation cost made this less attractive as L3 grew.

Non-Inclusive. No inclusion guarantee, but no exclusion guarantee either. A line might be in L1 only, or in L1 and L2, or only in L3. Coherence requires a separate directory or snoop mechanism.

Pro: better effective capacity than inclusive.
Pro: no back-invalidation pressure.
Con: coherence is more complex.

Intel switched to non-inclusive L3 starting with Skylake-X / Server, persisting through current generations. ARM cores typically use non-inclusive L2-L3.

Exclusive. A line is in exactly one level. AMD's Zen cores come close: the L3 is filled by lines evicted from the L2s (a victim cache) and is mostly, though not strictly, exclusive of them.

Pro: maximum effective capacity (sum of levels).
Con: every L1 miss that hits L2 must move the line up and allocate the replacement line down — more bandwidth between levels.

The choice of inclusion policy is a major architectural decision; it affects coherence protocol, snoop filtering, capacity efficiency, and the bandwidth between levels.
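As a rough capacity comparison, take 8 cores with 1 MiB of private L2 each sharing a 32 MiB L3. These are assumed figures, and the estimate ignores L1 and any lines shared between cores:

```latex
\[
C_{\text{inclusive}} \approx C_{L3} = 32\,\text{MiB},
\qquad
C_{\text{exclusive}} \approx 8 \times C_{L2} + C_{L3} = 8 \times 1\,\text{MiB} + 32\,\text{MiB} = 40\,\text{MiB}.
\]
```

A non-inclusive design lands somewhere between the two, depending on how much duplication actually occurs.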

05.Cache Coherence Beyond MESI

Chapter 31 introduced MESI. Real cache coherence protocols are richer:

MOESI (with Owned state): a dirty line can be shared without first being written back to memory. One cache holds the line in the Owned state and is responsible for supplying it to requesters and for the eventual writeback, while other caches hold Shared copies. AMD uses MOESI variants in their CCX/CCD interconnect.
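The role of the Owned state shows up most clearly in what a cache does when another cache reads one of its lines. The sketch below follows the textbook MOESI transitions; real protocols differ in detail (for example, in whether an Exclusive holder forwards data), and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ST_I, ST_S, ST_E, ST_O, ST_M } moesi_state;

/* New state of a local line after a remote cache reads it.
 * *supplies_data is set when this cache must forward the dirty line. */
static moesi_state moesi_on_remote_read(moesi_state s, bool *supplies_data)
{
    assert(supplies_data != NULL);
    switch (s) {
    case ST_M:                       /* dirty and exclusive */
        *supplies_data = true;
        return ST_O;                 /* stays dirty, now shared: no writeback */
    case ST_O:                       /* already the owner of a shared dirty line */
        *supplies_data = true;
        return ST_O;
    case ST_E:                       /* clean and exclusive */
        *supplies_data = false;
        return ST_S;
    case ST_S:
        *supplies_data = false;
        return ST_S;
    default:                         /* ST_I: nothing to do */
        *supplies_data = false;
        return ST_I;
    }
}
```

In plain MESI the Modified case would force a writeback to memory before the line could be shared; here the M to O transition defers it.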

MESIF (with Forward state): one cache holds the "F" copy of a shared line; that cache is responsible for forwarding the line to a requester, eliminating ambiguity when multiple caches share. Intel uses MESIF in their inter-socket QPI/UPI protocol.

Directory protocols. In many-core systems, snoop-based protocols don't scale (broadcasting every miss to every cache). A directory protocol maintains a per-line record of which caches share the line. On a miss, the directory is consulted, and only the relevant caches are notified. Most multi-socket systems and large many-core chips use directory protocols at the socket-to-socket level.
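A directory entry can be as simple as one sharer bit per cache. The sketch below assumes a full bit-vector directory with at most 64 caches; the names are hypothetical, and ownership state and transient states are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHES 64   /* one bit per cache in a 64-bit sharer vector */

typedef struct {
    uint64_t sharers;   /* bit i set: cache i may hold the line */
} dir_entry;

/* Read miss: the requester becomes one more sharer. */
static void dir_read_miss(dir_entry *d, int requester)
{
    assert(d != NULL && requester >= 0 && requester < MAX_CACHES);
    d->sharers |= 1ULL << requester;
}

/* Write miss: returns the set of caches that must be invalidated,
 * then records the requester as the only holder. */
static uint64_t dir_write_miss(dir_entry *d, int requester)
{
    assert(d != NULL && requester >= 0 && requester < MAX_CACHES);
    uint64_t self = 1ULL << requester;
    uint64_t to_invalidate = d->sharers & ~self;
    d->sharers = self;
    return to_invalidate;
}
```

After reads by caches 0, 3, and 5, a write by cache 3 sends invalidations only to caches 0 and 5 rather than broadcasting to every cache in the system.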

Snoop filters. On-chip caches use snoops (broadcast queries) to maintain coherence within a socket. To avoid wasting bandwidth on snoops to caches that don't have the line, a snoop filter tracks which caches have which lines. The filter is queried first; only caches that might have the line are snooped.

Cluster organization. Modern many-core chips group cores into clusters with cluster-private caches. Coherence within a cluster is fast; between clusters, a higher-level interconnect mediates. ARM's CMN-700 and CMN-S3 interconnects, AMD's Infinity Fabric, Intel's Mesh, all reflect this.

06.Cache Compression

Some research and a few production designs compress cache lines. A 64-byte line might compress to 32 or 16 bytes; the cache packs multiple compressed lines into one physical slot.

Approaches include:

Variable-line compression: each compressed line uses as much space as it needs.
Compressed sharing: two cache lines with similar contents share a physical slot via XOR-encoding.
Pattern- and dictionary-based compression: BPC (Bit-Plane Compression), DISH (Dictionary Sharing), etc.

In production, compression is more common in main memory (e.g., NVIDIA GPUs' delta-color compression for framebuffers) than in CPU caches. The complexity vs. benefit ratio in CPU caches has been judged unfavorable so far.

07.Cache Quality of Service

In multi-tenant systems (cloud servers running many VMs), one VM's cache-heavy workload can evict another VM's working set, causing unpredictable performance. Cache QoS mechanisms address this:

Intel CAT (Cache Allocation Technology): partition the L3 cache into ways; assign each VM a subset. CAT was introduced in Broadwell-EP and refined in subsequent generations.

AMD's L3 cache partitioning: similar functionality via the Platform QoS extensions, available on EPYC parts since Zen 2.

ARM MPAM (Memory Partitioning and Monitoring): a more general framework for cache and memory bandwidth allocation. Implemented in Neoverse N2 and beyond.

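These mechanisms typically partition by way: a capacity bitmask selects which ways a class of service may allocate into, so a 16-way cache can be split into at most 16 non-overlapping partitions. The downside: each partition sees lower associativity, which can hurt hit rate.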
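Way partitioning comes down to bitmask arithmetic. The helper below builds a contiguous way mask (Intel CAT requires the set bits to be contiguous) and checks that two partitions do not overlap. The function names and the 12/4 split are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a mask covering `num_ways` ways starting at `first_way`. */
static uint32_t way_mask(unsigned first_way, unsigned num_ways)
{
    assert(num_ways >= 1 && first_way + num_ways <= 32);
    uint32_t ones = (num_ways == 32) ? UINT32_MAX : ((1u << num_ways) - 1u);
    return ones << first_way;
}

/* True when two partitions share no ways. */
static bool masks_disjoint(uint32_t a, uint32_t b)
{
    return (a & b) == 0;
}
```

For a 16-way L3, `way_mask(0, 12)` gives 0x0FFF for a latency-sensitive tenant and `way_mask(12, 4)` gives 0xF000 for a noisy neighbor. The second tenant can then evict lines only within its own four ways. On Linux, masks of this form are applied through the resctrl filesystem.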

Bandwidth QoS is a sibling concept: limit how much memory bandwidth a workload can consume. Helps prevent one tenant from saturating shared memory channels.

08.Persistent Memory and Cache

The brief era of Intel Optane DC Persistent Memory (3D XPoint) raised cache-related challenges. Persistent memory is byte-addressable like DRAM but persistent like flash. To make persistent stores actually durable, the CPU's cached data must be flushed to the persistent layer.

x86 added new instructions:

CLWB (Cache Line Write-Back): write the modified line back to memory, but keep it in cache. Faster than CLFLUSH, which evicts as well.
CLFLUSHOPT: optimized flush (relaxed ordering vs. CLFLUSH).
PCOMMIT: commit pending writes to memory (deprecated; replaced by platform mechanisms such as ADR and eADR).

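ARM provides analogous instructions for persistent-memory-aware code: DC CVAP (clean to the Point of Persistence) and DC CVADP (clean to the Point of Deep Persistence), alongside the older DC CVAC (clean to the Point of Coherency).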
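In code, making a range durable means writing back every cache line it touches and then fencing. The sketch below uses the x86 intrinsics `_mm_clwb`, `_mm_clflush`, and `_mm_sfence`; it picks CLWB only when the compiler targets it (for example with `-mclwb`), falls back to CLFLUSH otherwise, and does nothing useful on other architectures. The function names are hypothetical, and production code would normally rely on a library such as PMDK instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

#define CACHE_LINE 64   /* assumed line size */

static void flush_line(const void *p)
{
#if defined(__x86_64__) && defined(__CLWB__)
    _mm_clwb((void *)p);     /* write back; the line may stay cached */
#elif defined(__x86_64__)
    _mm_clflush(p);          /* fallback: write back and evict */
#else
    (void)p;                 /* placeholder: Arm code would use DC CVAP here */
#endif
}

/* Write back every line overlapping [addr, addr + len), then fence.
 * Returns the number of lines flushed. */
static size_t persist_range(const void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE, lines++)
        flush_line((const void *)p);
#if defined(__x86_64__)
    _mm_sfence();            /* order the write-backs before later stores */
#endif
    return lines;
}
```

The fence matters because CLWB and CLFLUSHOPT are weakly ordered. Without it, a later store (such as a "commit" flag) could become durable before the data it guards.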

With Optane DC's discontinuation in 2022, the urgency dropped, but the instructions remain. CXL-attached memory and emerging persistent media may revive interest.

09.Cache Hierarchy in Modern Chips

A walk through a representative modern chip — say, an AMD Zen 4 (e.g., Ryzen 9 7950X):

L1: 32 KiB instruction + 32 KiB data per core, 8-way, ~4-cycle latency.
L2: 1 MiB per core, 8-way, ~14-cycle latency.
L3: 32 MiB per CCD (chiplet of 8 cores), 16-way, ~50-cycle latency. Two CCDs in a 16-core part; cross-CCD L3 access goes via the Infinity Fabric (much higher latency).
DRAM: ~100 ns, or roughly 500 cycles at 5 GHz.

Apple M3 P-cores have larger L1 (192 KiB I + 128 KiB D) and L2 (16 MiB shared per cluster of P-cores), no L3 in the traditional sense — the SLC (System Level Cache) sits between cores and memory and is shared with the GPU and other accelerators.

ARM Neoverse V2 has a similar structure: 64 KiB L1, 1-2 MiB L2, large shared SLC.

The trend is toward bigger L2 (private), bigger L3 (shared), and SLCs that include accelerator traffic. The total on-chip cache approaches or exceeds 100 MiB on flagship server chips.

10.Cache Effects in Software

For software performance, several cache properties matter beyond what Chapter 17 covered:

Cache associativity and conflict misses. With a 16-way cache, cycling through 17 hot lines that all map to the same set generates conflict misses even if the rest of the cache is empty. This shows up with strided access patterns where the stride is a multiple of the way size (number of sets × line size), such as a large power of two.
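The set an address maps to follows directly from the line size and set count. The sketch below assumes a simple modulo index with no hashing; large shared caches often hash the index, so it applies most directly to L1 and L2. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Set index under simple modulo indexing. */
static unsigned cache_set_index(uint64_t addr, unsigned line_bytes,
                                unsigned num_sets)
{
    assert(line_bytes > 0 && num_sets > 0);
    return (unsigned)((addr / line_bytes) % num_sets);
}

/* Distance in bytes between addresses that land in the same set. */
static uint64_t conflict_stride(unsigned line_bytes, unsigned num_sets)
{
    return (uint64_t)line_bytes * num_sets;
}
```

For a 32 KiB, 8-way cache with 64-byte lines there are 64 sets, so `conflict_stride(64, 64)` is 4096 bytes. Walking a column of a matrix whose rows are 4096 bytes apart puts every access in one set, and the ninth distinct line evicts the first. Padding each row by one cache line is a common fix.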

Way prediction: some L1 caches predict which way to read, accessing only that way to save power. Mispredictions are corrected (re-access the correct way). Software is typically unaware.

Cache-line bouncing. Two cores writing to the same cache line cause the line to bounce between them — a serious anti-pattern. The classic example: per-CPU statistics counters on the same line.

False sharing. Two cores writing to different variables on the same cache line cause the same bouncing, even though the variables are unrelated. Mitigation: align hot per-thread data to cache lines.
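In C11 the mitigation is a one-line alignment specifier. The sketch below assumes 64-byte lines (some cores use 128-byte lines or prefetch adjacent line pairs); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */

/* Two hot counters packed together: likely to share a line. */
struct packed_counters {
    long a;
    long b;
};

/* One counter per cache line: alignas pads the struct to 64 bytes. */
typedef struct {
    alignas(CACHE_LINE) long value;
} padded_counter;

static_assert(sizeof(padded_counter) == CACHE_LINE,
              "each padded counter should fill exactly one line");

/* True when two addresses fall in the same cache line. */
static int same_line(const void *x, const void *y)
{
    return ((uintptr_t)x / CACHE_LINE) == ((uintptr_t)y / CACHE_LINE);
}
```

With an array `padded_counter per_thread[N]`, each thread updates `per_thread[id].value` without invalidating its neighbors' lines. The cost is memory: 64 bytes per counter instead of 8.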

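Cache coloring: in physically indexed caches whose way size exceeds the page size, the virtual-to-physical mapping determines which cache sets a page occupies. Page coloring had the OS choose physical frames so that each process got consistent cache occupancy. It is less relevant in modern systems, where L1s are VIPT with a way size no larger than a page and the outer PIPT caches are highly associative and often hash their index.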

NUCA effects. Non-Uniform Cache Access in large shared caches: accessing a slice on the far side of the chip is slower than the near side. Programmers can sometimes affect this via thread placement.

11.Summary

Modern caches are far more sophisticated than the introductory model. Non-blocking operation supports tens of in-flight misses; prefetchers anticipate future access patterns; inclusion policies trade capacity for coherence simplicity; cache QoS partitions resources among tenants; coherence protocols scale to many-core via directories and snoop filters. The cache hierarchy is one of the most dynamic and tuned parts of a modern CPU, with deep interaction with the load/store unit (Chapter 26), branch behavior (Chapter 51), and memory consistency (Chapter 31).

The next chapter, Chapter 51, covers the frontier of branch prediction and the speculative-execution attacks that exposed how leaky modern cores can be: Spectre, Meltdown, and the broader family of side-channel attacks.
