Part III·Memory and Storage·Chapter 16 of 62


Memory Hierarchy

May 16, 2026·30 min read·intermediate

If a computer architecture book has one big secret, it is this: real memory is nothing like the flat, instantaneous store that introductory programming pretends. The C programmer's mental model has every byte at every address equally accessible, with a load taking the same time as any other load. The architectural truth is that "memory" is a multi-level system in which the fastest level holds a few hundred bytes and answers in a fraction of a nanosecond, while the slowest level holds terabytes and answers in milliseconds: a latency ratio of tens of millions to one. Almost everything interesting in modern processor design is, at heart, an attempt to hide that ratio from software.

This chapter introduces the memory hierarchy as a whole. It explains why the hierarchy exists, what each layer is good for, what costs and benefits each layer brings, and the principle — locality of reference — that makes the whole arrangement work. The four chapters after this one zoom in: caches in Chapter 17, DRAM in Chapter 18, virtual memory in Chapter 19, storage in Chapter 20. The point of this chapter is to step back and see the system as a stack.

01.Why a Hierarchy at All

A natural question for a beginner is why we do not simply build memory that is fast, large, and cheap all at once. The short answer is that no single technology delivers all three, and the gaps between what the available technologies deliver are too large for any one of them to serve as a middle ground.

Memory technologies trade three quantities against one another:

Speed — how quickly the memory responds to a request. We measure speed in latency (time per access) and bandwidth (bytes per second).
Capacity — how many bytes the memory can hold.
Cost — how many dollars per bit, including silicon area, power, and packaging.

For any given technology, you can have two of these but not all three.

A few flip-flops on the processor die — what we call a register or an SRAM cell — answer in less than a nanosecond, but each cell takes six transistors and consumes power even at rest. Filling a chip with such cells gives a small, very fast memory; filling a server with them is impossible. SRAM is the natural choice for registers and caches, where capacity is small (kilobytes to tens of megabytes) and speed is everything.

A different kind of cell — DRAM, the dynamic memory cell — uses a single transistor and a tiny capacitor. Each cell is roughly one-tenth the size of an SRAM cell, so DRAM gives ten times the capacity per unit area. The price is speed: a DRAM cell loses its charge over time, must be refreshed periodically, and has substantial access latency (tens of nanoseconds). DRAM is the natural choice for main memory in the gigabytes range.

A third kind of memory — Flash, the non-volatile cell used in SSDs — stores charge on an isolated gate that retains state for years. Flash gives hundreds of gigabytes or terabytes per device, at very low cost per bit, but with access times in the tens of microseconds and limited endurance (each cell can be erased and rewritten only some thousands or tens of thousands of times). Flash is the natural choice for secondary storage.

A fourth — the spinning magnetic disk, the hard drive — was the standard for decades. It gives terabytes per drive at the lowest cost per bit, with millisecond access times due to mechanical seek and rotational latency. Its role today is mostly archival.

The differences are not small. Latency from registers to disk spans roughly seven to eight orders of magnitude (well under a nanosecond to about 10 ms). Capacity spans roughly ten orders of magnitude. Cost per bit spans about six orders of magnitude. There is no plausible engineering of any single technology that bridges these gaps. The system has to be built as a hierarchy in which fast-but-small layers stand close to the processor and slow-but-large layers stand far away.

A simple way to picture it:

Figure: Memory hierarchy from registers down to network storage, with each tier annotated by capacity and access latency, growing in size and slowing as you descend

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round]
  \draw[thick] (0, 0)    rectangle (3.5, -0.7); \node at (1.75, -0.35) {registers};         \node[anchor=west] at (3.8, -0.35) {few hundred bytes, $<$ 1 ns, extremely costly};
  \draw[thick] (0, -0.7) rectangle (3.5, -1.4); \node at (1.75, -1.05) {L1 cache};          \node[anchor=west] at (3.8, -1.05) {tens of KB, $\sim$1 ns, very costly};
  \draw[thick] (0, -1.4) rectangle (3.5, -2.1); \node at (1.75, -1.75) {L2 cache};          \node[anchor=west] at (3.8, -1.75) {hundreds of KB, $\sim$3 ns};
  \draw[thick] (0, -2.1) rectangle (3.5, -2.8); \node at (1.75, -2.45) {L3 cache};          \node[anchor=west] at (3.8, -2.45) {tens of MB, $\sim$10 ns};
  \draw[thick] (0, -2.8) rectangle (3.5, -3.5); \node at (1.75, -3.15) {main memory};       \node[anchor=west] at (3.8, -3.15) {gigabytes, $\sim$80 ns};
  \draw[thick] (0, -3.5) rectangle (3.5, -4.2); \node at (1.75, -3.85) {storage};           \node[anchor=west] at (3.8, -3.85) {terabytes, $\sim$50 $\mu$s (SSD)};
  \draw[thick] (0, -4.2) rectangle (3.5, -4.9); \node at (1.75, -4.55) {network / archives}; \node[anchor=west] at (3.8, -4.55) {effectively unlimited, ms or more};
\end{tikzpicture}

The general design rule is that each level is roughly an order of magnitude slower and an order of magnitude larger than the level above it.

02.Registers

The top of the hierarchy, closest to the CPU, is the register file. Registers are typically built from very fast SRAM cells with extra ports so that several registers can be read and written in the same cycle. We met them already in Chapter 7; here the relevant point is their place in the hierarchy.

A modern processor has a small number of architecturally visible general-purpose registers (16 in x86-64, 31 in AArch64, 32 in RISC-V) and a much larger pool of physical registers behind the scenes. The architectural registers are what the programmer sees; the physical pool is used for register renaming, which we will encounter in Chapter 25. On load-store architectures such as AArch64 and RISC-V, registers are the only storage an arithmetic instruction can name as an operand; x86-64 also accepts memory operands, but the value still has to be loaded before the ALU can use it.

Two things make registers fast. First, they sit physically next to the ALU; the wires between them are short. Second, they are addressed by very small register numbers (5 bits for 32 registers), so the address-decoding step is essentially trivial. A read or write of a register is a single cycle, often less.

The cost is capacity. A register file with 32 entries of 64 bits each is only 256 bytes. Even the larger physical register files in modern out-of-order chips are kilobytes at most. Anything larger is stored elsewhere.

Compilers therefore work hard at register allocation: deciding which variables in a program live in registers and which spill to memory. A program that fits its hot working set into registers runs at the maximum speed the processor can deliver; a program that constantly spills runs much slower.

03.Caches

Below registers come caches: small, fast memories that sit between the CPU and main memory and hold copies of recently used data. Caches are the central object of the next chapter, but a few high-level points belong here.

The cache exists because main memory is too slow. A modern CPU running at, say, 4 GHz has a clock period of 0.25 nanoseconds; a DRAM access takes around 80 nanoseconds. If every memory operation went to DRAM, the CPU would spend over 300 cycles waiting for each load, and the whole pipeline would stall most of the time. A cache hit, by contrast, returns in a few cycles. As long as most loads and stores hit in the cache, the average memory latency stays low and the CPU runs near its peak throughput.

Caches are organized into levels — typically L1, L2, and L3 — each successively larger and slower. A modern desktop CPU has, per core, an L1 of about 32 KB instructions plus 32 KB data answering in roughly 4 cycles, an L2 of 1 MB answering in 12 to 15 cycles, and a shared L3 of tens of megabytes answering in 30 to 50 cycles. The numbers shift from chip to chip, but the pattern is universal.

A cache decides what to hold based on what the program has used recently. The hardware keeps no semantic understanding of the data; it simply caches whatever address is touched, in fixed-size chunks called cache lines (typically 64 bytes). If the program asks for an address whose line is in the cache, that is a cache hit; if not, a cache miss, and the line is fetched from the next level.
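To make the line granularity concrete, here is a minimal sketch of how an address maps onto a cache line, assuming 64-byte lines. The names `LINE_SIZE`, `line_base`, `line_offset`, and `same_line` are illustrative and not taken from any real API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed line size; 64 bytes is typical but not universal. */
#define LINE_SIZE 64u

/* The masking below only works when the line size is a power of two. */
static_assert((LINE_SIZE & (LINE_SIZE - 1)) == 0,
              "line size must be a power of two");

/* Address of the first byte of the line that contains addr. */
static inline uintptr_t line_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(LINE_SIZE - 1);
}

/* Position of addr within its line, in the range 0..LINE_SIZE-1. */
static inline unsigned line_offset(uintptr_t addr) {
    return (unsigned)(addr & (LINE_SIZE - 1));
}

/* Nonzero if both addresses are brought in by the same line fill. */
static inline int same_line(uintptr_t a, uintptr_t b) {
    return line_base(a) == line_base(b);
}
```

Under these assumptions, address 0x1234 lies at offset 0x34 of the line starting at 0x1200, and a miss on any byte of that line makes all 64 bytes resident.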

Why this strategy works at all is the subject of the section on locality of reference later in this chapter.

04.Main Memory

Below the caches is main memory, almost always implemented in DRAM. Capacities run from gigabytes in mobile devices to hundreds of gigabytes or even terabytes in servers. DRAM access latency in real systems — not the published cell timing, but the latency the CPU actually sees — is roughly 50 to 100 nanoseconds, and bandwidth is tens to hundreds of gigabytes per second on a single chip's memory channels.

A memory request that misses every level of cache becomes a request to DRAM. The path goes through the on-chip memory controller, out across the package's pins, into the DIMMs, into the targeted DRAM chip, into the targeted bank, and eventually back. Chapter 18 traces this path in detail. The number to remember now is 80 nanoseconds: a typical DRAM access takes about that long, plus or minus a factor of two depending on the system.

DRAM is not just slow in latency; it is also granular. The smallest unit a DRAM device can deliver efficiently is a burst — typically 64 bytes — because of the way internal sense amplifiers and column accesses work. This granularity matches the cache-line size for a reason: a cache line is the natural unit of transfer between the cache and DRAM.

Main memory has another property that registers and per-core caches do not: it is shared. Every core on a multi-core processor sees the same DRAM. This sharing makes inter-core communication possible (one core writes a value; another reads it) but introduces the coherence problem we will face in Chapter 31.

05.Secondary Storage

Below main memory lies secondary storage: persistent, block-addressable media that hold data the CPU cannot keep in DRAM. Modern systems use either solid-state drives (SSDs, built from Flash) or, increasingly rarely, hard disk drives (HDDs, built from rotating magnetic platters).

The defining property of secondary storage is persistence. RAM loses its state when power is removed; disks and SSDs do not. Files, programs, operating systems, databases — anything that should survive a reboot — lives on secondary storage.

The other defining property is slowness, by the standards of registers and caches. An SSD answers a 4 KB read in roughly 50 to 100 microseconds, which is a thousand times longer than a DRAM access and a hundred thousand times longer than an L1 hit. A hard disk is slower still: about 10 milliseconds for a random access, dominated by mechanical seek and rotation.

Storage is generally accessed in blocks of 4 KB or larger, not in individual bytes. The operating system's file system reads and writes blocks; the application's view of byte-addressable files is an illusion built on top.

The role of storage in the hierarchy is twofold. First, it is the long-term home of data. Second, it backs virtual memory (Chapter 19): the operating system can pretend the system has more memory than the physical DRAM allows by paging cold data out to storage and bringing it back as needed. Demand paging hides the size limit of DRAM at the cost of occasional very slow accesses for data that has been swapped out.

06.Locality of Reference

The reason a hierarchy works is a deep empirical fact about real programs called locality of reference. The fact is observed, not derived: programs do not access memory uniformly; they access it in patterns. Two patterns dominate.

Temporal locality is the tendency of a program to access the same memory location repeatedly within a short interval. Consider a loop that sums the elements of an array:

for (int i = 0; i < N; i++) {
    sum += a[i];
}

The variable sum is read and written on every iteration. The variable i is read three times per iteration. The first time sum is touched, it must be fetched (from wherever it lives). Every subsequent access can be answered from a much faster level — registers, ideally, or at worst the L1 cache. Recently used data is likely to be used again.

Spatial locality is the tendency of a program to access memory locations near each other. The same loop reads a[0], a[1], a[2], and so on: successive addresses, four or eight bytes apart. If the line is 64 bytes, the elements are 4 bytes, and a[0] starts on a line boundary, the cache line that holds a[0] also holds a[1] through a[15]. Fetching the line on the access to a[0] makes the next fifteen accesses cache hits. Memory near recently used memory is likely to be used soon.
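The effect of spatial locality on a simple traversal can be counted directly. The sketch below is an idealized model, not a measurement: it assumes the array starts on a line boundary and that elements never straddle lines. The function name `lines_touched` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct cache lines touched when reading n elements of
   elem_size bytes, taking every stride-th element, starting at a line
   boundary. Assumes elem_size divides line_size, so no element
   straddles two lines. */
size_t lines_touched(size_t n, size_t elem_size, size_t stride,
                     size_t line_size) {
    assert(elem_size > 0 && stride > 0 && line_size > 0);
    if (n == 0)
        return 0;
    size_t step = elem_size * stride;      /* bytes between accesses */
    if (step >= line_size)
        return n;                          /* each access lands in a new line */
    size_t last = (n - 1) * step;          /* byte offset of the last access */
    return last / line_size + 1;           /* no line in between is skipped */
}
```

For 1024 four-byte elements read sequentially with 64-byte lines, the model gives 64 lines, so only 64 of the 1024 accesses can be compulsory misses. Reading every sixteenth element instead touches a new line on every access, and the benefit disappears.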

These two forms of locality are not laws; programs that violate them do exist. A random walk over a large data structure has poor spatial locality. A streaming computation that touches each location only once has poor temporal locality. But a remarkable fraction of real software shows strong locality of both kinds, and the cache hierarchy is engineered to exploit it.

A simple quantitative statement of why this matters: suppose a program issues memory requests, of which a fraction $h$ hit in a fast cache with latency $t_{\text{cache}}$ and a fraction $1 - h$ miss and go to main memory with latency $t_{\text{mem}}$ . The average memory access time is

$\text{AMAT} = h \cdot t_{\text{cache}} + (1 - h) \cdot t_{\text{mem}}.$

With $t_{\text{cache}} = 1$ ns, $t_{\text{mem}} = 80$ ns, and a hit rate of $h = 0.95$ , the AMAT is

$0.95 \cdot 1 + 0.05 \cdot 80 = 0.95 + 4 = 4.95 \text{ ns}.$

That is roughly a sixteenfold improvement over the raw memory latency. With a hit rate of $h = 0.99$ , the AMAT drops to $0.99 + 0.8 = 1.79$ ns, an improvement of more than fortyfold. With $h = 0.5$ , on the other hand, the AMAT is 40.5 ns, barely a twofold improvement.

The lesson is that hit rates must be high. A cache that hits 99 % of the time is wonderful; one that hits 80 % is mediocre; one that hits 50 % is barely worth having. Most of the design effort that goes into caches — size, associativity, replacement policy, prefetching — is aimed at pushing the hit rate as close to 100 % as possible.
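The arithmetic above is easy to reproduce. The following is a small sketch of the chapter's single-level formula; the function name `amat` and its parameter names are chosen here for illustration.

```c
#include <assert.h>

/* Average memory access time for one cache in front of main memory,
   using the formula from the text: h * t_cache + (1 - h) * t_mem.
   Latencies are in nanoseconds. */
double amat(double hit_rate, double t_cache_ns, double t_mem_ns) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_mem_ns;
}
```

With a 1 ns cache and 80 ns memory, `amat(0.95, 1.0, 80.0)` evaluates to 4.95 ns and `amat(0.80, 1.0, 80.0)` to 16.8 ns, which is why an 80 % hit rate counts as mediocre: a fifth of the accesses paying full price more than triples the average.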

The same equation, recursively applied, generalizes to multi-level hierarchies. A miss in the L1 goes to the L2, where it may hit or miss; an L2 miss goes to the L3; an L3 miss goes to DRAM; a "DRAM miss" (in the virtual-memory sense) goes to disk. Each level has its own hit rate, and the average access time accumulates accordingly.
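The recursion can be written down in a few lines. This sketch applies the same formula level by level, treating each hit rate as local (the fraction of requests reaching that level that it satisfies). The `struct level` type and the function `amat_multilevel` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One level of the hierarchy: its latency in nanoseconds and the
   fraction of requests reaching it that it satisfies. */
struct level {
    double latency_ns;
    double hit_rate;
};

/* Apply AMAT = h * t + (1 - h) * (AMAT of the levels below), working
   from the bottom level up. The last level must satisfy every request
   that reaches it, so its hit_rate must be 1.0. */
double amat_multilevel(const struct level *levels, size_t n) {
    assert(n > 0 && levels[n - 1].hit_rate == 1.0);
    double below = 0.0;
    for (size_t i = n; i-- > 0; ) {
        below = levels[i].hit_rate * levels[i].latency_ns
              + (1.0 - levels[i].hit_rate) * below;
    }
    return below;
}
```

As a usage example with assumed hit rates, take the latencies from the figure (1, 3, 10, and 80 ns) with local hit rates of 0.95, 0.80, and 0.75 for L1, L2, and L3. The L3-and-below average is 27.5 ns, the L2-and-below average is 7.9 ns, and the overall result is 1.345 ns.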

07.The Inclusion Property and Working Sets

A useful concept that makes hierarchy reasoning easier is the working set of a program: the collection of memory locations the program is likely to access in the near future. If the working set fits in the L1 cache, the L1 hit rate is essentially 100 %. If it fits in the L2 but not the L1, the L1 misses constantly while the L2 stays hot. If it fits nowhere on chip, every access goes to DRAM.

Working sets are not constant. A program in its initialization phase may have a small working set; a program in its main computation may have a large one; the working set may shift as the program transitions between phases. Real workloads exhibit dramatic working-set changes, which is why a fixed-size cache cannot serve all phases equally well.

A related property, often deliberately enforced by hardware, is inclusion: the contents of a smaller, faster level are a subset of the contents of a larger, slower level. If a line is in L1, it is also in L2; if it is in L2, it is also in L3. Inclusion makes coherence simpler — a write that needs to invalidate copies in other caches has to look only at the largest level — but at a cost: an L3 of, say, 32 MB cannot effectively hold 32 MB of unique data, because some of it has to be in L2 and L1 simultaneously. Modern designs use a mix of inclusive, non-inclusive, and exclusive policies; we will return to these in Chapter 50.

08.Memory-Level Parallelism

The AMAT formula from the locality section measures latency as if memory operations were issued one at a time, with each one finishing before the next begins. This is the right model for a simple in-order processor running a serial-dependency workload. It is dramatically wrong for a modern out-of-order processor running ordinary code.

At any given moment, a high-performance CPU may have dozens of memory operations in flight simultaneously: loads waiting on the L1, loads forwarded from earlier stores, loads issued speculatively past unresolved branches, prefetches initiated by the hardware on its own. The metric for what the system can do in parallel is memory-level parallelism, or MLP: the average number of outstanding memory accesses at any time. MLP is what lets the system absorb long DRAM latencies without stalling the pipeline.

Little's Law from Chapter 10 gives the precise relationship. If $\lambda$ is the rate at which the program issues memory operations, $W$ is the average latency per operation, and $L$ is the average number in flight, then $L = \lambda W$ . To deliver bandwidth $B$ on a system with average latency $W = 80$ ns and 64-byte cache lines, the program needs

$L = \lambda W = \frac{B}{64 \text{ B}} \cdot 80 \text{ ns} = \frac{B \cdot 80 \text{ ns}}{64 \text{ B}}.$

For $B = 50$ GB/s, this gives $L = 62.5$ , or roughly sixty: the system must keep about sixty line-sized memory accesses in flight to use the full bandwidth. Modern CPUs provide load buffers with dozens of entries and miss-status holding registers (MSHRs), each tracking one outstanding cache-line miss, at every cache level. Typical counts run from on the order of ten at the L1 to several dozen at the outer levels, so that this much MLP can be sustained across the cores of a chip.
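Little's Law is simple enough to evaluate in code. The sketch below computes the relationship in both directions; `required_mlp` and `achievable_bandwidth` are names introduced here for illustration.

```c
#include <assert.h>

/* Little's Law, L = lambda * W: the number of line requests that must
   be in flight to sustain a given bandwidth at a given latency. */
double required_mlp(double bandwidth_bytes_per_s, double latency_s,
                    double line_bytes) {
    assert(line_bytes > 0.0);
    double lambda = bandwidth_bytes_per_s / line_bytes;  /* requests per second */
    return lambda * latency_s;
}

/* The same law solved for bandwidth: what a fixed number of
   outstanding requests can deliver. */
double achievable_bandwidth(double outstanding, double latency_s,
                            double line_bytes) {
    assert(latency_s > 0.0);
    return outstanding * line_bytes / latency_s;
}
```

`required_mlp(50e9, 80e-9, 64.0)` evaluates to 62.5. Run in the other direction, a single outstanding access at 80 ns delivers only 64 B / 80 ns = 0.8 GB/s, a small fraction of what the memory channels can supply.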

The practical importance of MLP is that the latency of a single memory access is largely irrelevant to throughput-bound workloads; what matters is whether the program (or the hardware on its behalf) can find enough independent accesses to keep the memory system busy. Pointer-chasing benchmarks see almost the full DRAM latency on every access, because each load depends on the previous one and MLP collapses to one. Streaming benchmarks see almost none, because dozens of independent accesses are in flight at once. The architectural sophistication of out-of-order execution, hardware prefetching, and non-blocking caches all aims at the same target: raise MLP so that effective latency falls.

09.Non-Uniform Memory Access

The single-channel, single-controller picture we have been drawing breaks down on multi-socket servers. A two-socket server has two CPUs, each with its own integrated memory controller and its own bank of DIMMs. Memory attached to socket 0's controller is local to socket 0 and remote to socket 1; reaching it from socket 1 requires traversing the inter-socket interconnect (Intel's UPI, AMD's Infinity Fabric, ARM's CMN). A milder form of the same effect appears inside AMD's chiplet-based desktop and server processors: in current designs the memory controllers sit on a central I/O die (IOD), and the latency a core sees depends on the path from its chiplet to the controller that owns the address.

The result is Non-Uniform Memory Access (NUMA): the latency to physical memory depends on which CPU is asking. Local accesses on a modern server might take 80 ns; remote accesses 120–150 ns or more. Bandwidth is similarly asymmetric: each socket has its own channels, and the inter-socket link is shared and lower-bandwidth than the per-socket memory channels combined.

Operating systems and runtimes are NUMA-aware. The Linux kernel exposes the topology through /sys/devices/system/node, allocates memory by default on the local node of the thread that first touches a page (the first-touch policy), and tries to schedule threads near the memory they use. Programs can query and influence the placement directly through numactl, libnuma, the Windows VirtualAllocExNuma API, or platform-specific Java and .NET facilities. A NUMA-oblivious program on a large server can lose a substantial fraction of its potential performance to remote-memory traffic.

NUMA is not just a server phenomenon. Apple's M-series chips, AMD's chiplet desktop CPUs, and many recent ARM server designs all expose multiple internal memory controllers with different latencies to different cores. The effect is smaller than a two-socket server's, but it is real, and it is part of why modern operating systems treat memory placement as a first-class concern.

We will return to NUMA when we discuss cache coherence in Chapter 31; the architectural fact worth keeping for now is that flat memory is increasingly a polite fiction even within a single machine.

10.Storage-Class Memory and the New Tiers

The traditional hierarchy of caches → DRAM → SSD → HDD has a yawning latency gap between DRAM (around 80 ns) and even the fastest SSD (around 10 µs): more than two orders of magnitude. For decades, that gap has been bridged only by the operating system's page-cache and demand-paging machinery, which works well for cold data but adds substantial software overhead to every access that misses DRAM.

A new tier of memory technologies has been emerging to fill the gap. Storage-class memory (SCM, also called persistent memory or non-volatile memory) refers to byte-addressable, persistent media with latencies between DRAM and Flash. Intel's Optane, based on 3D XPoint, was the most prominent example, with latencies of a few hundred nanoseconds and per-byte access through the same memory bus as DRAM. Optane was discontinued in 2022, but the architectural niche it filled remains, and other technologies (MRAM, ReRAM, FeRAM, PCM) continue to be developed.

More recently, the CXL (Compute Express Link) interconnect has opened a different way to extend the hierarchy. CXL is a coherent, low-latency protocol layered on PCIe physical media, with three sub-protocols: CXL.io (legacy PCIe device access), CXL.cache (devices participate in CPU coherence), and CXL.mem (devices export memory that the CPU can access with cache-line granularity). A CXL.mem device can be a DRAM expansion module on a separate board, a pool of memory shared across multiple servers, or a tier of persistent media. The CPU sees it as ordinary memory, with somewhat higher latency than local DRAM (a few hundred nanoseconds) and access through the cache hierarchy as usual. The latency window is similar to remote-NUMA memory, and the operating-system handling is similar.

The practical effect is that the hierarchy is becoming richer rather than collapsing. A modern data-center node might have local DRAM at 80 ns, CXL-attached memory at 250 ns, persistent memory at 500 ns, NVMe SSD at 10 µs, and remote SSD across the network at 50 µs. Each tier has its own role: local DRAM for the hot working set, CXL for capacity expansion, persistent memory for durable structures that need to survive reboots without going through a file system, NVMe for cold data. Software increasingly has to be written with a specific tier in mind — the placement of data is a first-class architectural decision, not an implementation detail.

We will not pursue these technologies in detail here. The point is that the static picture of "DRAM and SSD" understates how layered the modern hierarchy actually is, and that the trend is for new technologies to fit into the gaps rather than to replace existing ones outright.

11.Software-Managed Memories and Scratchpads

Most of the hierarchy is transparent to software: caches decide for themselves what to hold, and the program experiences the result through the average access time. A few important systems take the opposite approach, exposing fast memory directly to software and asking the program to manage it explicitly. These are scratchpads or software-managed local memories.

The paradigmatic example is the shared memory of an Nvidia GPU streaming multiprocessor: a small (tens to hundreds of kilobytes), fast, software-managed buffer that threads in a block use cooperatively. The CUDA programmer explicitly copies tiles of data into shared memory, performs the computation, and writes results back. There is no hardware caching for shared memory; if the program does not put the data there, it is not there. The same memory hardware can be reconfigured at runtime as a hardware-managed L1 cache, illustrating that the architectural distinction is one of interface, not of physical structure.

Many DSPs (Texas Instruments C6x, Qualcomm Hexagon) and embedded processors offer similar tightly-coupled memories. Many microcontrollers provide an instruction tightly-coupled memory (ITCM) and a data tightly-coupled memory (DTCM), ranging from a few kilobytes to a few hundred kilobytes each, mapped at fixed addresses and used for hard-real-time code paths where cache miss latency would be unacceptable. Arm Cortex-R cores, which target real-time work, commonly include them.

The argument for scratchpads is predictability. A cache hit and a cache miss take wildly different times; for hard-real-time work, that variance is intolerable, and an explicitly placed scratchpad gives a guaranteed latency. The argument against is programmer effort: the program has to make decisions that hardware caches make automatically, and it has to make them well enough to outperform the hardware. For general-purpose workloads, hardware caches consistently win this argument; for specialized workloads (graphics, signal processing, certain real-time control), scratchpads remain alive and well.

The practical lesson is that the hierarchy of mainstream CPUs is not the only architectural option. When you read about a GPU's L1, an FPGA's block RAM, a DSP's L1P and L1D, or a cellular accelerator's internal memory, you should ask whether the structure is hardware-managed (a cache) or software-managed (a scratchpad); the question determines what the programmer needs to know to use it well.

12.Programming for the Hierarchy

The machinery we have described works in software's favour by default: programs with locality run fast, programs without it run slow. Programmers who care about performance, however, can do considerably better than the default by writing code with the hierarchy in mind. A few techniques recur often enough to be worth naming.

Cache-friendly data layout. A struct whose fields are accessed together should have those fields adjacent; a struct whose fields are accessed in different patterns should be split apart. The classic illustration is a linked list of objects each containing a 32-byte header and a 4-byte payload: traversing the list to read all the payloads brings in a 64-byte cache line for each node, of which only 4 bytes are wanted. Reorganizing the data into two parallel arrays, one of headers and one of payloads (the struct-of-arrays idiom), packs sixteen payloads into each line, so every byte fetched during the traversal is used.
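The layout difference can be sketched with an array-based variant of the same example. The type and function names below (`struct record`, `struct records_soa`, `sum_aos`, `sum_soa`) are hypothetical, and the byte counts in the comments assume a typical ABI in which `struct record` occupies 36 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Array-of-structs: each payload sits beside 32 bytes the loop never reads. */
struct record {
    uint8_t header[32];
    int32_t payload;
};

int64_t sum_aos(const struct record *r, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += r[i].payload;       /* uses 4 of every 36 bytes brought in */
    return sum;
}

/* Struct-of-arrays: payloads are contiguous, so a 64-byte line holds 16. */
struct records_soa {
    uint8_t (*headers)[32];        /* headers[i] is the header of record i */
    int32_t *payloads;             /* payloads[i] is the payload of record i */
    size_t   count;
};

int64_t sum_soa(const struct records_soa *r) {
    int64_t sum = 0;
    for (size_t i = 0; i < r->count; i++)
        sum += r->payloads[i];     /* every byte brought in is used */
    return sum;
}
```

Both functions compute the same result; the difference is in how many cache lines the traversal has to pull in to do so. The split form costs something when header and payload are needed together, so the choice depends on which access pattern dominates.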

Loop blocking (also called tiling) restructures nested loops so that an inner subset of iterations operates on data that fits in a cache. The canonical example is matrix multiplication: a naive triple-nested loop has poor cache behaviour for any matrix larger than the cache; blocking the loops to operate on $b \times b$ tiles of each matrix brings the inner loop's working set down to roughly $3b^2$ words, which can be tuned to fit in any specific level of the hierarchy. Modern dense-linear-algebra libraries (BLAS) achieve their performance largely through carefully tuned multi-level blocking that targets each level of the cache hierarchy in turn.
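A minimal sketch of a blocked matrix multiply follows, assuming square row-major matrices of doubles. The tile edge `TILE` is an illustrative, untuned value: with $b = 32$ the three tiles occupy about $3 \cdot 32^2 \cdot 8$ bytes, or 24 KB, which would fit in a 32 KB L1. Production BLAS libraries do considerably more than this.

```c
#include <stddef.h>

/* Tile edge b, chosen so that roughly 3*b*b doubles fit in the target
   cache level. 32 is an illustrative value, not a tuned one. */
#define TILE 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed tile by tile so
   that each inner loop nest reuses data that is still in the cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* One tile of C updated from one tile of A and one of B. */
                for (size_t i = ii; i < min_sz(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, n); k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The caller is expected to zero `C` first if a plain product is wanted. The arithmetic is identical to the naive triple loop; only the order in which elements are visited changes.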

Software prefetching. Where the access pattern is predictable but the hardware prefetcher is not catching it, an explicit prefetch (GCC and Clang's __builtin_prefetch, or an equivalent compiler intrinsic) can pull lines into the cache before they are needed. The right prefetch distance depends on the latency to be hidden and the rate at which the loop consumes data; getting it wrong wastes bandwidth without saving cycles, and getting it right takes some experimentation.
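An indexed gather is one pattern where this can pay off: the addresses are known in advance from the index array, but they look irregular to a stride-based hardware prefetcher. The sketch below uses the GCC/Clang builtin `__builtin_prefetch`; the function name `gather_sum` and the distance of 16 iterations are assumptions for illustration, and the distance would need tuning on real hardware.

```c
#include <stddef.h>

/* How many iterations ahead to prefetch; illustrative, needs tuning. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n). The indices are available ahead of
   time, so the line needed PREFETCH_DISTANCE iterations from now can
   be requested early. */
long gather_sum(const int *data, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for reading;
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result. Whether it helps depends on the machine and on how large `data` is relative to the caches.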

Avoiding pointer chasing. Traversing a linked structure keeps at most one miss outstanding at a time, because the address of each node comes from the load of the previous one; an array-based structure can have many misses in flight. Where data structures are read frequently and written rarely, the array form is almost always faster, even if it is logically less convenient.
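The dependency difference is visible in the code itself. In this sketch (with hypothetical names `struct node`, `sum_list`, and `sum_array`) both functions compute the same sum, but only the second gives the hardware independent addresses to work with.

```c
#include <stddef.h>

struct node {
    int          value;
    struct node *next;
};

/* Each load of p->next must complete before the next address is known,
   so misses are serialized. */
long sum_list(const struct node *p) {
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Every address is a + i, computable without waiting for any load,
   so the core and the prefetcher can overlap many misses. */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```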

Aligning hot data. A frequently-accessed datum that straddles two cache lines pays the cost of two line fetches per access. Aligning hot data to a 64-byte boundary, and padding small structures to occupy a whole line, eliminates this cost.
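In C11 the alignment can be requested with `alignas`. The following sketch assumes 64-byte lines; `struct hot_counters` and `straddles_line` are illustrative names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed cache-line size */

/* A hot structure placed on its own line: alignas raises the alignment
   of the struct, and the compiler pads sizeof up to a multiple of it. */
struct hot_counters {
    alignas(LINE_SIZE) uint64_t hits;
    uint64_t misses;
};

static_assert(sizeof(struct hot_counters) == LINE_SIZE,
              "hot_counters should fill exactly one line");

/* Nonzero if an object of size bytes at addr spans two lines. */
int straddles_line(uintptr_t addr, size_t size) {
    assert(size > 0 && size <= LINE_SIZE);
    return (addr / LINE_SIZE) != ((addr + size - 1) / LINE_SIZE);
}
```

An 8-byte value at offset 60 of a line straddles into the next one, while the same value at offset 56 or 64 does not. Heap allocations need an aligned allocator such as C11 `aligned_alloc` to get the same guarantee.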

Avoiding false sharing. Two threads writing to different variables that happen to lie on the same cache line cause that line to ping-pong between cores, even though there is no logical sharing. Padding the variables apart eliminates the contention. We will discuss this in Chapter 31; for the hierarchy, it is enough to know that the cache-line granularity of the system can produce surprises across thread boundaries.
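The padding fix is a layout change, which can be shown without any threading code. This sketch assumes 64-byte lines and four threads, each owning one counter; all names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE   64   /* assumed cache-line size */
#define NUM_THREADS 4    /* illustrative */

/* Packed: four 8-byte counters occupy 32 bytes, so they can all land
   in one line and writers on different cores contend for it. */
struct packed_counters {
    uint64_t count[NUM_THREADS];
};

/* Padded: each counter is aligned to, and therefore fills, its own line. */
struct padded_slot {
    alignas(LINE_SIZE) uint64_t count;
};
struct padded_counters {
    struct padded_slot slot[NUM_THREADS];
};

static_assert(sizeof(struct padded_slot) == LINE_SIZE,
              "each padded slot should fill exactly one line");

/* Byte distance between the counters of thread 0 and thread 1. */
size_t packed_spacing(void) {
    return offsetof(struct packed_counters, count[1]);
}
size_t padded_spacing(void) {
    return offsetof(struct padded_counters, slot[1]);
}
```

The packed layout places neighbouring counters 8 bytes apart; the padded layout places them 64 bytes apart, at the cost of 256 bytes of storage instead of 32.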

None of these techniques helps a program with no locality at all. But for the substantial majority of programs whose access patterns are nearly local with small disturbances, modest restructuring can produce factor-of-two or factor-of-ten speedups by aligning the program's behaviour with the hierarchy's expectations. Chapter 54 will return to performance analysis as a discipline; for now, the lesson is that the hierarchy is a friend to be programmed with, not a friction to be programmed against.

13.Hierarchy Tradeoffs

Several tradeoffs shape how a memory hierarchy is built.

Size versus speed. Bigger structures are slower because the access path is longer (more wire, more decoding, more sense amplification). A 32 KB L1 can be built to answer in 4 cycles; a 32 MB L3 cannot. Doubling the size of a level increases its access latency by some amount that depends on the technology and the layout, but is rarely zero. Designers therefore split a memory budget across multiple levels rather than putting all of it into one big level.

Latency versus bandwidth. A level can be optimized for low latency (short pipelines, simple lookup) or for high bandwidth (multiple ports, banking, queuing). The two often conflict. The L1 is usually latency-optimized; the L3 and DRAM are bandwidth-optimized. The mismatch is part of why the L1 is small: making it any larger would compromise its latency.

Hit rate versus access cost. A larger cache has a higher hit rate but a higher cost per access (more energy to read, more time, more area). A smaller cache costs less per access but misses more often. The right size depends on the program and the cost ratio between hits and misses.

Hardware versus software management. Caches are managed entirely by hardware; the program has no direct control over what is cached. Some systems also provide software-managed memories (called scratchpads in embedded systems and shared memory in GPUs) where the program explicitly moves data in and out. Hardware management is easier on the programmer; software management gives more predictable performance. Mainstream CPUs use hardware caches almost exclusively. GPUs and DSPs often use software-managed scratchpads. The two approaches have been argued back and forth for decades, and the current consensus is that for general-purpose workloads, hardware caches win — programmers cannot, in practice, manage memory better than the hardware can.

Coherence and consistency. As soon as a system has multiple caches that may hold copies of the same data — and every multi-core system does — the question of when changes propagate becomes architectural. We will leave this until Chapter 31, but it is worth flagging now: the hierarchy is not just a per-core feature. Across cores, the caches form an interlocked system that has to maintain a coherent view of memory.

Energy. Each level of the hierarchy has its own energy cost per access. An L1 access takes a few picojoules; an L3 access tens of picojoules; a DRAM access nanojoules, roughly a thousand times the cost of an L1 access. For mobile and battery-powered systems, where every joule matters, keeping data in the upper levels of the hierarchy is not just a performance question but an energy question. We will return to this in Chapter 52.

14.A Concrete Hierarchy

To make all of this less abstract, here is a representative breakdown for a current high-end desktop processor (numbers approximate, and changing every year):

Level	Size	Latency	Bandwidth	Notes
Architectural registers	~1 KB	< 1 cycle	very high	per-core; visible to ISA
Physical register file	~1–4 KB	~1 cycle	very high	per-core; renaming pool
L1 instruction cache	32 KB	4 cycles	~500 GB/s	per-core
L1 data cache	32 KB	4–5 cycles	~500 GB/s	per-core
L2 cache	1–2 MB	12–15 cycles	~200 GB/s	per-core
L3 cache	16–96 MB	30–60 cycles	~100 GB/s	shared across cores
DRAM	16–512 GB	~80 ns	~50–100 GB/s	shared across the system
SSD	0.5–8 TB	~50 µs	~3–7 GB/s	NVMe over PCIe
HDD	1–20 TB	~10 ms	~150 MB/s	rare on new systems

The numbers vary substantially across products, but the orders of magnitude are stable. The crucial fact about this table is the gap between DRAM and SSD: roughly six hundred to one in latency. Any access that misses DRAM and goes to disk is, from the CPU's perspective, very nearly forever. This gap is why virtual memory (which can convert a load into a disk access, transparently) is such a fragile performance feature: as long as the working set fits in DRAM, paging is invisible; once it does not, performance falls off a cliff.
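The cliff can be put in numbers with the same averaging used for AMAT. As a rough sketch, assume a fraction $f$ of the accesses that reach DRAM instead fault to an SSD, and take the table's approximate latencies of 80 ns and 50 µs. The symbols $f$ and $t_{\text{avg}}$ are introduced here for illustration, and the model ignores the operating system's own fault-handling time, which makes real paging worse.

$t_{\text{avg}} = (1 - f) \cdot 80 \text{ ns} + f \cdot 50{,}000 \text{ ns}.$

With $f = 0.001$ , this gives $79.92 + 50 = 129.92$ ns, about 1.6 times the DRAM latency. With $f = 0.01$ , it gives $79.2 + 500 = 579.2$ ns, roughly seven times the DRAM latency, from only one access in a hundred going to storage.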

15.Summary

The memory hierarchy exists because no single technology gives both speed and capacity at acceptable cost. From small and fast (registers, caches) to large and slow (DRAM, SSD, HDD), each level fills a niche, and the system as a whole gives the program an average memory access time that is much closer to the fast end than the slow end. The reason this strategy works at all is locality of reference: the empirical fact that real programs touch the same data repeatedly (temporal locality) and touch nearby data together (spatial locality). The arithmetic of average access time, $\text{AMAT} = h \cdot t_{\text{cache}} + (1 - h) \cdot t_{\text{mem}}$ , makes the importance of high hit rates concrete: in the worked example, the 5 % of accesses that miss account for about four-fifths of the average access time. Memory-level parallelism extends the picture from latency to throughput: keeping enough independent accesses in flight is what lets a high-bandwidth memory system actually deliver its peak. Multi-socket and chiplet systems make memory access non-uniform in ways the program can feel, and emerging tiers, such as storage-class memory and CXL-attached pools, are filling the historical gap between DRAM and SSD with new latencies and new programming models. Some specialized systems abandon hardware caches in favour of software-managed scratchpads, trading programmer effort for predictability. And programmers who care about performance can do considerably better than the hardware's default by laying out data, blocking loops, prefetching explicitly, and avoiding pointer chasing and false sharing.

This is the high-level picture. Chapter 17 zooms in on caches, where the design choices that determine those hit rates are made.
