Part III: Memory and Storage

Virtual Memory

May 16, 2026·30 min read·intermediate


So far, when we have spoken about memory addresses, we have been quietly conflating two different things: the addresses a program uses, and the addresses the hardware sends to DRAM. On any modern operating system this conflation hides one of the most important architectural mechanisms in the entire system: virtual memory, the mapping that translates the addresses a program sees into the physical addresses where data actually lives.

Virtual memory does several jobs at once. It gives every process the illusion of its own private, contiguous address space, simplifying the programmer's mental model. It isolates processes from each other, so that a bug in one cannot corrupt another. It enables the operating system to use more memory than physically exists by paging cold data to disk. It supports memory-mapped files, shared libraries, copy-on-write, and other higher-level features that modern systems take for granted. Almost everything an operating system does about memory rests on the virtual-memory mechanism.

This chapter explains how the mechanism works: virtual addresses and how they are translated, page tables and the structures used to organize them, page sizes and the tradeoffs they force, the TLB that makes translation fast, and the higher-level features (page faults, demand paging, copy-on-write, protection) that the mechanism supports.

01. Virtual versus Physical Addresses

A physical address is the actual coordinate of a byte in DRAM (or, more precisely, in the address space that the memory controllers and I/O subsystems together expose). Physical addresses are unique within a machine: address 0x1A37 refers to one specific byte in the system, no matter who is asking.

A virtual address is the address a program uses. Each process has its own virtual address space, typically much larger than the physical memory installed in the system. A 64-bit machine offers a 48-bit (or 57-bit) virtual address space — 256 terabytes or 128 petabytes — which is far more than any current system has in physical RAM. The program can use virtual addresses freely without worrying about how its data maps to physical memory.

The mapping from virtual to physical is maintained by the operating system and enforced by hardware called the memory management unit (MMU), which sits on the path between the CPU and the cache hierarchy. Every memory access — every load, every store, every instruction fetch — is translated through the MMU before reaching memory.

Figure: MMU on the access path: every CPU memory request hits the MMU, which translates the virtual address to a physical address before reaching the cache or DRAM

LaTeX
\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round, blk/.style={draw, thick, fill=white, minimum height=0.7cm}] \node[blk, minimum width=1.6cm] (cpu) at (1, -0.5) {CPU}; \node[blk, minimum width=2.6cm] (va) at (3.7, -0.5) {virtual address}; \node[blk, minimum width=1.6cm] (mmu) at (6.3, -0.5) {MMU}; \node[blk, minimum width=2.6cm] (pa) at (9, -0.5) {physical address}; \node[blk, minimum width=2.6cm] (mem) at (12, -0.5) {cache / DRAM}; \draw[->] (cpu) -- (va); \draw[->] (va) -- (mmu); \draw[->] (mmu) -- (pa); \draw[->] (pa) -- (mem); \end{tikzpicture}

The translation has several useful properties.

Two processes can use the same virtual address for different data. Process A's address 0x4000_0000 might map to one physical page; process B's identical address might map to a completely different one. The processes are isolated from each other by construction.

Two virtual addresses (in the same or different processes) can map to the same physical address. This is how shared memory and shared libraries work: the operating system creates the mapping deliberately so that multiple processes see the same data without copying.

The mapping can be incomplete. A virtual address may have no physical mapping at all, in which case accessing it triggers a page fault. The operating system can then decide what to do: bring in a page from disk, allocate a new zero page, kill the process for accessing invalid memory, etc.

The mapping can be protected. Each mapping carries permission bits: readable, writable, executable, accessible to user mode or only kernel mode. The hardware enforces these on every access, and a violation triggers a fault.

A great deal of operating-system functionality follows from these properties. Process isolation, shared libraries, memory-mapped files, demand paging, swap, copy-on-write, ASLR, JIT compilation safety, sandboxing — all are built on top of the virtual-memory mechanism.

02. Address Translation: Pages

Translation does not happen byte by byte. The address space is divided into fixed-size pages, typically 4 kilobytes each, and translation maps an entire virtual page to a corresponding physical frame (also 4 KB). Within a page, the byte offset is the same in virtual and physical addresses; only the page number is translated.

A 4 KB page uses 12 bits of offset ($\log_2 4096 = 12$). A 48-bit virtual address therefore splits into a 36-bit virtual page number and a 12-bit offset:

Figure: Virtual and physical address layouts for 4 KB pages: a 48-bit virtual address splits into a 36-bit VPN and a 12-bit offset; the physical address shares the same 12-bit offset

LaTeX
\begin{tikzpicture}[font=\footnotesize, line cap=round] \node[font=\small, anchor=west] at (0, 0) {virtual address (48 bits):}; \draw[thick] (0, -0.7) rectangle (5, 0); \node at (2.5, -0.35) {VPN (36 bits)}; \draw[thick] (5, -0.7) rectangle (7, 0); \node at (6, -0.35) {offset (12)}; \node[font=\small, anchor=west] at (0, -1.5) {physical address:}; \draw[thick] (0, -2.2) rectangle (5, -1.5); \node at (2.5, -1.85) {PFN (variable)}; \draw[thick] (5, -2.2) rectangle (7, -1.5); \node at (6, -1.85) {offset (12)}; \end{tikzpicture}

The MMU's job is to take the VPN (Virtual Page Number) and look up the PFN (Physical Frame Number) it maps to. The offset passes through unchanged.
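The split can be sketched in a few lines of Python. This is a minimal illustration assuming 4 KB pages; the helper names `split_va` and `make_pa` and the example PFN are made up for the sketch, not part of any real API.

```python
PAGE_SHIFT = 12                      # log2(4096): bits of page offset
PAGE_SIZE = 1 << PAGE_SHIFT          # 4096 bytes
OFFSET_MASK = PAGE_SIZE - 1          # low 12 bits set


def split_va(va: int) -> tuple[int, int]:
    """Split a virtual address into (VPN, offset)."""
    return va >> PAGE_SHIFT, va & OFFSET_MASK


def make_pa(pfn: int, offset: int) -> int:
    """Rebuild a physical address from a frame number and the unchanged offset."""
    return (pfn << PAGE_SHIFT) | offset


vpn, offset = split_va(0x7F12_3456_7ABC)
pa = make_pa(0x1A37, offset)         # 0x1A37 is an arbitrary example PFN
print(hex(vpn), hex(offset), hex(pa))
```

Only the upper bits change between the virtual and the physical address; the low 12 bits are copied across untouched.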

This page-granular mapping is a deliberate compromise. Mapping individual bytes would require enormous tables; mapping multi-megabyte regions would waste memory by forcing big rounding. Four kilobytes turns out to be a reasonable size: small enough that internal fragmentation is acceptable, large enough that translation tables are manageable.

03. Page Tables

The mapping from VPN to PFN is stored in a data structure called a page table. The page table lives in main memory, owned and managed by the operating system. Each entry — a page table entry (PTE) — describes one page's translation: the physical frame number it maps to, the permission bits, a present bit indicating whether the mapping is currently valid, and a few other status bits.

A naive page table would be a single flat array indexed by VPN. With 36-bit VPNs, that would require $2^{36}$ entries: roughly 69 billion (64 Gi), which at 8 bytes per entry comes to 512 GB for every process. Most of those entries would be unused, since most processes use only a small fraction of their virtual address space. A flat table is therefore impractical.

The standard solution is a multi-level page table: a tree of tables, in which higher-level entries point to lower-level tables, and only the parts of the tree that are actually used are allocated. The root of the tree is held in a dedicated CPU register (called CR3 on x86, TTBR0/TTBR1 on AArch64, satp on RISC-V), pointing at the top-level table for the current process.

A typical four-level page table on x86-64 with 4 KB pages and 48-bit virtual addresses divides the VPN into four 9-bit fields:

Plain Text
bits [47:39] PML4 index (top-level)
bits [38:30] PDPT index
bits [29:21] PD index
bits [20:12] PT index (lowest level)
bits [11:0] offset

Each level's table holds 512 entries ($2^9$), one per index value, and each entry is 8 bytes. So each table is exactly one page (4 KB), which is convenient: the operating system can allocate page tables in the same units it allocates everything else.

Translation is a page table walk: starting from the root pointer, the MMU uses each 9-bit slice of the VPN to index into the corresponding level's table, follow the entry to the next level, and eventually reach the PTE that gives the PFN.

Figure: Four-level x86-64 page-table walk: CR3 names the PML4, each 9-bit VPN slice indexes the next table down through PDPT and PD to the leaf PT entry holding the PFN

LaTeX
\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round, blk/.style={draw, thick, fill=white, minimum width=5cm, minimum height=0.7cm}] \node[blk, draw=none, fill=none] (cr3) at (0, -0.5) {CR3 (root pointer)}; \node[blk] (pml4) at (0, -1.7) {PML4: entry [PML4 index]}; \node[blk] (pdpt) at (1.5, -2.9) {PDPT: entry [PDPT index]}; \node[blk] (pd) at (3, -4.1) {PD: entry [PD index]}; \node[blk, minimum width=6cm] (pt) at (4.5, -5.3) {PT: entry $\to$ PFN, perms, present}; \draw[->] (cr3) -- (pml4); \draw[->] (pml4.south) -- (pdpt.west); \draw[->] (pdpt.south) -- (pd.west); \draw[->] (pd.south) -- (pt.west); \end{tikzpicture}

Each step is a memory access. A four-level page table walk costs four memory accesses to resolve a single translation. If those accesses miss in the cache and go to DRAM, a single translation can take hundreds of nanoseconds. This is unacceptable for ordinary loads and stores; we need a way to amortize the cost. That is what the TLB is for.
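To make the walk and its cost concrete, here is a small software model. It is only a sketch: nested Python dictionaries stand in for the 512-entry tables, the function names are invented for the example, and real entries also carry present and permission bits that are omitted here.

```python
ENTRY_BITS = 9            # 512 entries per table
PAGE_SHIFT = 12


def indices(va):
    """Return ((PML4, PDPT, PD, PT) indices, page offset) for a 48-bit VA."""
    mask = (1 << ENTRY_BITS) - 1
    idx = tuple((va >> shift) & mask for shift in (39, 30, 21, 12))
    return idx, va & ((1 << PAGE_SHIFT) - 1)


def walk(root, va):
    """Walk dict-based tables from the root; return (PA, memory reads)."""
    idx, offset = indices(va)
    table, reads = root, 0
    for i in idx:
        reads += 1                    # one memory access per level
        entry = table.get(i)
        if entry is None:             # not present: hardware would raise a page fault
            raise LookupError(f"page fault at {va:#x}")
        table = entry                 # next-level table, or the PFN at the leaf
    return (table << PAGE_SHIFT) | offset, reads


# Map a single page: the VA and PFN 0x1A37 are arbitrary example values.
va = 0x7F12_3456_7ABC
(i4, i3, i2, i1), _ = indices(va)
root = {i4: {i3: {i2: {i1: 0x1A37}}}}
pa, reads = walk(root, va)
print(hex(pa), reads)
```

The model reports four reads for one translation, which is exactly the overhead that the TLB exists to hide.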

Five-level paging, with 57-bit virtual addresses, is supported on recent x86-64 chips; it adds another layer of indirection and gives 128 PB of virtual address space. AArch64 has a comparable optional extension that widens virtual addresses to 52 bits. Few systems use either yet, but they are available.

04. Page Sizes

The 4 KB page size is the historical default and remains the most common. But several other sizes are supported by modern architectures.

Larger pages. Most architectures support pages of 2 MB or 4 MB ("large pages") and 1 GB ("huge pages") in addition to 4 KB. Large pages are useful when a single mapping covers a contiguous region of memory: they reduce the number of page table entries needed, save space in the TLB, and skip levels of page table walking.

The mechanism is straightforward. A page table entry at, say, the PD level can be marked as terminal — meaning "this 2 MB region is one mapping, do not descend further." The walk stops at that level and uses the entry's PFN as the PFN for the entire 2 MB. Similarly, a PDPT entry can describe a 1 GB mapping.
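The effect of a terminal entry on the address arithmetic can be sketched as follows. The table of page kinds and the `translate` helper are illustrative assumptions for x86-64 with four-level paging, not an actual hardware interface: a larger page simply means more offset bits and fewer levels walked.

```python
# Offset width and number of table levels read for each x86-64 page size.
PAGE_KINDS = {
    "4K": (12, 4),   # leaf entry at the PT level
    "2M": (21, 3),   # walk stops at a terminal PD entry
    "1G": (30, 2),   # walk stops at a terminal PDPT entry
}


def translate(va: int, frame_base: int, kind: str) -> int:
    """Combine a frame's physical base address with the in-page offset of va."""
    offset_bits, _levels = PAGE_KINDS[kind]
    size = 1 << offset_bits
    if frame_base % size != 0:
        raise ValueError("frame base must be aligned to the page size")
    return frame_base | (va & (size - 1))


va = 0x7F12_3456_7ABC
print(hex(translate(va, 0x1A37000, "4K")))
print(hex(translate(va, 0x4000_0000, "2M")))
print(hex(translate(va, 0x4000_0000, "1G")))
```

The alignment check reflects the fact that a large page must start on a boundary of its own size in physical memory.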

Why not always use large pages? Three reasons.

First, allocation. A 2 MB page must be physically contiguous in DRAM. As memory becomes fragmented over time, the system may not be able to find a contiguous 2 MB chunk.

Second, granularity. If a process needs to protect a small region differently from its surroundings, it cannot do so within a large page; the unit of protection is the page.

Third, efficiency. If a program touches only a few hundred bytes of a 2 MB page, the operating system has wasted nearly 2 MB of physical memory.

The right answer is workload-dependent. Database servers, virtual machines, and other consumers of huge contiguous regions benefit greatly from large pages. General-purpose desktop workloads do well with 4 KB pages. Most operating systems support both, with mechanisms (such as Linux's transparent huge pages) that promote regions to large pages opportunistically.

Smaller pages. Some embedded systems support smaller pages (1 KB, 512 B) for memory-constrained workloads. They are not common in general-purpose systems, since 4 KB is already small enough for most purposes.

05. TLBs

The page table walk is too expensive to perform on every memory access. The TLB (Translation Lookaside Buffer) is a cache of recent translations: a small, fast structure that holds the most recently used VPN-to-PFN mappings. When a memory access is issued, the MMU first checks the TLB; if the translation is there, the access proceeds in a cycle or two. If not, the MMU performs a page table walk to fill the TLB entry.

The TLB is, structurally, a cache. It has lines, sets, associativity, and a replacement policy, just like the data cache. The differences are size and contents.

A typical L1 TLB has 64 to 128 entries, fully associative or highly set-associative. An L2 TLB might have a few thousand entries with lower associativity. A TLB hit completes in a single cycle; a miss requires a walk that, even with help from the cache (page tables themselves can be cached), takes 5 to 30 cycles in the best case and hundreds in the worst.

A subtlety: TLBs are typically separate for instructions and data, and separate for different page sizes. The hardware partitions its TLB resources accordingly.

A simplified picture of how a load proceeds:

Plain Text
1. CPU issues a load with virtual address VA.
2. MMU looks up VPN(VA) in the TLB.
3. TLB hit:
   - PFN, permissions returned in 1 cycle.
   - Permissions checked. If OK, build PA = (PFN << 12) | offset.
   - Pass PA to the cache hierarchy.
4. TLB miss:
   - MMU initiates page-table walk.
   - Each level's PTE is read (possibly through the cache).
   - Final PTE gives PFN; TLB is filled.
   - Translation completes; load proceeds.
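The hit and miss paths above can be modelled with a toy TLB. This sketch assumes a fully associative structure with LRU replacement, uses a plain dictionary lookup in place of the real page-table walk, and leaves out permission checks; the class and its fields are invented for illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 12


class TLB:
    """Tiny fully associative TLB with LRU replacement."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # VPN -> PFN, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, va, page_table):
        vpn = va >> PAGE_SHIFT
        offset = va & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:                    # step 3: TLB hit
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:                                      # step 4: miss, walk, then fill
            self.misses += 1
            self.entries[vpn] = page_table[vpn]    # stands in for the page-table walk
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
        return (self.entries[vpn] << PAGE_SHIFT) | offset


page_table = {vpn: vpn + 0x100 for vpn in range(16)}   # hypothetical mapping
tlb = TLB(capacity=4)
for va in (0x0000, 0x0008, 0x1000, 0x0010, 0x2000, 0x1004):
    tlb.translate(va, page_table)
print(tlb.hits, tlb.misses)
```

Accesses that stay within an already-translated page hit; only the first touch of each page pays for a walk.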

The cost of a TLB miss is significant: tens to hundreds of cycles. For workloads with poor TLB locality — random access through a large data structure spanning many pages — TLB misses can dominate the cost of memory access. Large pages help here by extending the coverage of each TLB entry: a TLB entry for a 4 KB page covers 4 KB of virtual address space, while a TLB entry for a 2 MB page covers 512 times as much. A 64-entry TLB covers 256 KB with 4 KB pages but 128 MB with 2 MB pages — a huge difference for memory-intensive workloads.
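The coverage figures are simple multiplication, as this short sketch shows. The function name `tlb_reach` is just a label for the calculation.

```python
def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of virtual address space covered when every entry maps one page."""
    return entries * page_size


KB, MB = 1 << 10, 1 << 20
print(tlb_reach(64, 4 * KB) // KB, "KB")   # 64 entries of 4 KB pages
print(tlb_reach(64, 2 * MB) // MB, "MB")   # 64 entries of 2 MB pages
```

A workload whose hot data exceeds this reach will miss in the TLB regardless of how well it behaves in the data cache.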

The TLB is also a per-process structure, in the sense that translations belong to a specific address space. When the operating system context-switches between processes, the new process has different translations, and the old TLB entries are stale. Two strategies address this.

Flush on context switch. The simplest approach. Every context switch invalidates the entire TLB; the new process starts cold and refills as it runs. Costly, because TLB-miss costs accumulate before the new process gets going.

Address Space IDentifiers (ASIDs). Each TLB entry is tagged with an ASID, a small number that identifies the address space. The hardware compares both the VPN and the ASID; entries from one process do not match accesses from another. Modern x86 calls these PCIDs (Process-Context Identifiers); ARM and RISC-V use ASIDs. With ASIDs, a context switch does not flush the TLB; it just changes the current ASID register.

Most modern operating systems use ASIDs to avoid the cost of TLB flushes. There is some bookkeeping when ASIDs run out and have to be recycled, but the win is large.
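A minimal model of ASID tagging, assuming an unbounded TLB and ignoring ASID recycling, looks like this. The class and method names are hypothetical.

```python
class TaggedTLB:
    """TLB whose entries are keyed by (ASID, VPN) instead of VPN alone."""

    def __init__(self):
        self.entries = {}         # (asid, vpn) -> pfn
        self.current_asid = 0

    def switch_to(self, asid):
        self.current_asid = asid  # context switch: no flush needed

    def fill(self, vpn, pfn):
        self.entries[(self.current_asid, vpn)] = pfn

    def lookup(self, vpn):
        """Return the PFN on a hit, or None on a miss."""
        return self.entries.get((self.current_asid, vpn))


tlb = TaggedTLB()
tlb.switch_to(1)
tlb.fill(0x400, 0x1111)           # process A maps VPN 0x400
tlb.switch_to(2)
tlb.fill(0x400, 0x2222)           # process B maps the same VPN elsewhere
tlb.switch_to(1)
print(hex(tlb.lookup(0x400)))     # A's entry survived both switches
```

Because the tag is part of the match, process B can never hit on process A's entry even though both use the same virtual page number.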

06. The TLB Hierarchy and Page-Walk Caching

The single "TLB" of textbook diagrams is, on real hardware, a small hierarchy in its own right. A modern x86-64 or AArch64 core typically has:

  • A small L1 instruction TLB (iTLB), perhaps 64 entries, on the instruction-fetch path.
  • A small L1 data TLB (dTLB), perhaps 64–128 entries, in parallel with the L1 data cache.
  • A larger L2 TLB (sometimes called the STLB, shared TLB), perhaps 1024–4096 entries, that catches misses from both L1 TLBs.
  • Separate TLB partitions or structures for large pages (2 MB, 1 GB), since their tag and offset structure differs.

The hierarchy works like the data cache hierarchy: L1 hits are nearly free, L2 hits cost a few cycles, and only L2 misses force a page-table walk. The structure is invisible to software but visible to performance: a workload whose footprint exceeds the L1 TLB's coverage but fits in the L2 TLB sees a small, steady cost of L1 TLB misses; one whose footprint exceeds the L2 TLB falls into the much more expensive page-walk regime.

A page-table walk reads up to four page-table entries (five with 5-level paging) from memory to translate one virtual address. To make these reads fast, modern processors include a page-walk cache (also called a page-structure cache or paging-structure cache) that holds recently-used intermediate-level entries: PML4, PDPT, and PD entries on x86-64. When a walk needs an upper-level entry that is in the page-walk cache, it skips the corresponding memory access; in the best case, only the lowest-level PTE has to be fetched from the data-cache hierarchy.

The net effect is that even a TLB miss is usually fast — perhaps 10 to 30 cycles — thanks to the structures that cache every layer of the walk. Pathological cases (cold page-walk cache, memory pressure that has evicted page tables from the data cache) can take hundreds of cycles, but they are rare in steady state.

07. TLB Shootdown

The TLB caches translations, but it is not part of the coherence machinery that keeps data caches in sync. When the operating system changes a page table entry — unmapping a page, changing its permissions, swapping it out — every CPU's TLB that has a cached translation of the affected page must be told to invalidate it. Hardware does not do this automatically.

The protocol is called a TLB shootdown, and it is one of the more expensive operations in a multi-core system. The CPU that performs the page table change holds a lock on the page table, modifies the entry, and then sends an inter-processor interrupt (IPI) to every other CPU that might have the translation. Each receiving CPU runs an interrupt handler that executes a TLB-invalidate instruction (invlpg on x86, tlbi on AArch64, sfence.vma on RISC-V) for the affected virtual address. The originating CPU waits for all the IPIs to be acknowledged before completing the operation.

The round-trip cost on a busy system is on the order of microseconds — thousands of cycles — because the IPI has to traverse the inter-core interconnect and each receiving core has to interrupt whatever it was doing. Workloads that perform many mmap/munmap or mprotect calls on a multi-core system, or that fork/exec rapidly with many cores, can spend a substantial fraction of their time in TLB shootdowns.

Recent hardware reduces the cost. ARM's broadcast TLB invalidates (tlbi variants with the IS, inner-shareable, suffix) propagate the invalidation through the coherent interconnect without requiring an IPI; the originating CPU still waits for completion (dsb), but the receiving cores do not have to take an interrupt. AMD's newer INVLPGB instruction broadcasts a range invalidation to other cores in one operation. RISC-V's hardware-defined TLB invalidation extension is in active development. The architectural trend is toward making shootdowns a hardware operation rather than an OS-orchestrated one, but on most current systems the OS protocol is what runs.

The lesson for software is that page table changes are not free, and that VM-intensive workloads benefit substantially from batching them: one large mprotect is much cheaper than many small ones.
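The benefit of batching can be seen in a toy simulation of the IPI-based protocol. Everything here is a simplified assumption: four cores, one IPI per remote core per shootdown call, and no modelling of latency or acknowledgement.

```python
class Core:
    def __init__(self):
        self.tlb = {}          # vpn -> pfn
        self.interrupts = 0    # IPIs this core has handled

    def handle_ipi(self, vpns):
        self.interrupts += 1
        for vpn in vpns:
            self.tlb.pop(vpn, None)     # per-page invalidate, like invlpg


def shootdown(cores, initiator, vpns):
    """Invalidate vpns everywhere; one IPI per remote core per call."""
    for vpn in vpns:
        cores[initiator].tlb.pop(vpn, None)
    for i, core in enumerate(cores):
        if i != initiator:
            core.handle_ipi(vpns)


def total_ipis(batched, pages=8, ncores=4):
    cores = [Core() for _ in range(ncores)]
    for core in cores:
        core.tlb = {vpn: vpn for vpn in range(pages)}
    if batched:
        shootdown(cores, 0, list(range(pages)))    # one call covers all pages
    else:
        for vpn in range(pages):                   # one call per page
            shootdown(cores, 0, [vpn])
    assert all(not core.tlb for core in cores)     # every stale entry is gone
    return sum(core.interrupts for core in cores)


print(total_ipis(batched=False), total_ipis(batched=True))
```

Both strategies end with the same clean TLBs, but the unbatched version interrupts every remote core once per page rather than once in total.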

08. Inverted and Hashed Page Tables

The multi-level radix-tree page tables we have described are not the only design. An alternative, used in IBM's PowerPC and POWER architectures and (historically) in some MIPS variants, is the inverted page table: a single table with one entry per physical frame, each recording which virtual page is mapped to that frame, and searched by hashing the virtual page number.

The trade-offs are interesting. An inverted page table has size proportional to physical memory, not to the size of the virtual address space, which is a significant saving on systems where each process has a 64-bit address space but only a few gigabytes of physical memory. The walk is constant-time — a hash lookup — rather than logarithmic. The drawback is that finding a virtual-to-physical translation requires a hash lookup with collision handling, which is more complex than a fixed-depth radix walk; sharing pages across processes is awkward; and the structure must be searched on every TLB miss, complicating hardware page-walkers.
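The shape of the structure can be sketched as below. This is an illustrative model rather than any real architecture's format: a Python dictionary plays the role of the hash index (and hides the collision handling), and the class name and methods are invented for the example.

```python
class InvertedPageTable:
    """One entry per physical frame, plus a hash index for VA-to-frame lookups."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # frames[pfn] = (pid, vpn) or None
        self.index = {}                     # (pid, vpn) -> pfn

    def map(self, pid, vpn, pfn):
        if self.frames[pfn] is not None:
            raise ValueError(f"frame {pfn} is already in use")
        self.frames[pfn] = (pid, vpn)
        self.index[(pid, vpn)] = pfn

    def unmap(self, pfn):
        owner = self.frames[pfn]
        if owner is not None:
            del self.index[owner]
            self.frames[pfn] = None

    def lookup(self, pid, vpn):
        """Return the PFN, or None when there is no mapping (a page fault)."""
        return self.index.get((pid, vpn))


ipt = InvertedPageTable(num_frames=8)
ipt.map(pid=1, vpn=0x400, pfn=3)
ipt.map(pid=2, vpn=0x400, pfn=5)     # same VPN, different process, different frame
print(ipt.lookup(1, 0x400), ipt.lookup(2, 0x400), ipt.lookup(1, 0x401))
```

The table's size depends only on `num_frames`, and the single owner recorded per frame hints at why sharing a frame between processes is awkward in this design.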

Most current architectures use multi-level radix trees instead. Hashed page tables survive in older PowerPC systems and in IBM's POWER servers, but they are a shrinking minority now that multi-level paging has become the standard.

09. Nested Page Tables for Virtualization

A virtual machine adds another layer to translation. The guest operating system maintains its own page tables, mapping guest virtual addresses to guest physical addresses; the hypervisor then maps guest physical to host physical. Without hardware help, every memory access in the guest would require the hypervisor to intercept and translate, which is impossibly slow.

The hardware solution is nested page tables — also called two-stage translation, EPT (Intel's Extended Page Tables), NPT (AMD's Nested Page Tables), or Stage-2 (ARM). The MMU performs two walks in succession: the guest's page tables to translate GVA → GPA, and a second set of page tables (provided by the hypervisor) to translate GPA → HPA. The TLB caches the composed translation GVA → HPA, so once a translation is in the TLB, it is as fast as a single-stage translation.

The cost is in walks. A two-stage walk on x86-64 reads up to 24 memory locations in the worst case: each of the four guest levels costs a four-level host walk to locate the guest table plus one read of the guest entry itself, and the final guest physical address needs one more four-level host walk (4 × 5 + 4 = 24). The page-walk caches handle most of this in steady state, but cold walks are expensive, and TLB pressure is correspondingly higher under virtualization. We will return to the broader virtualization picture in Chapter 48; for the virtual-memory hierarchy, the architectural fact is that translation has become a layered operation, and the hardware structures (TLB, page-walk cache) are sized accordingly.
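The worst-case count can be derived with a few lines of arithmetic. The sketch assumes no TLB or page-walk cache hits at all, and the function name is only a label for the calculation.

```python
def nested_walk_reads(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Worst-case memory reads for a two-stage walk with no caching at all."""
    # Each guest table pointer is a guest-physical address: translating it costs
    # host_levels reads, then one more read fetches the guest entry itself.
    per_guest_level = host_levels + 1
    # The final guest-physical address of the data needs one more host walk.
    return guest_levels * per_guest_level + host_levels


print(nested_walk_reads())        # 4 * 5 + 4
print(nested_walk_reads(5, 5))    # five-level tables in both stages
```

The count grows roughly with the product of the two depths, which is why deeper tables make cold nested walks disproportionately more expensive.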

10. IOMMUs and DMA Translation

Devices that perform DMA into system memory issue physical addresses on the bus, bypassing the CPU's MMU. On a system without protection, a misbehaving or malicious device can read or write any physical address. The IOMMU (I/O Memory Management Unit, called VT-d by Intel, AMD-Vi by AMD, SMMU by ARM) extends MMU-style translation and protection to device-issued accesses.

The IOMMU sits between devices and the memory system. Each device (or device group) has a page table that maps the device virtual addresses the device issues into physical memory addresses. The IOMMU translates and permission-checks every DMA on the fly. The result is that a device sees only the memory the operating system has explicitly mapped for it; any other address produces a translation fault.

IOMMUs are used for two distinct purposes. The first is security and isolation: limiting what each device can touch, particularly important for hot-plug devices like Thunderbolt and PCIe peripherals from untrusted vendors. The second is device pass-through under virtualization: a virtual machine can be given direct access to a device without violating the hypervisor's memory protection, because the IOMMU enforces the guest's view of physical addresses.

We revisit IOMMUs from the I/O side in Chapter 20; here, the architectural point is that every DMA-capable device on a modern system is, in effect, performing virtual-memory translation for its own accesses, and the operating system has to keep its mappings consistent with the rest of the VM machinery.

11. Pinned Pages, Locked Memory, and Hugetlbfs

The normal VM machinery assumes pages can be moved or evicted at the operating system's discretion. A few use cases require the opposite: pages whose physical frame is fixed for as long as the application needs them.

The most common is DMA. A device performing DMA issues a physical address; if the kernel evicts or migrates the page in the middle of the transfer, the device writes to the wrong frame and corrupts memory. The kernel therefore pins the pages a DMA buffer occupies for the duration of the transfer (on Linux, through get_user_pages or the newer pin_user_pages), marking them ineligible for eviction or migration. This is a stronger guarantee than mlock, which keeps pages resident but still lets the kernel migrate them to a different frame.

A second use case is predictable performance. Page faults introduce unpredictable latency that real-time and high-frequency-trading workloads cannot tolerate. The mlock system call locks specific pages into RAM, guaranteeing they will not page-fault later. mlockall does the same for the entire address space.

A third use case is explicit huge pages. Linux's hugetlbfs is a pseudo-filesystem that allocates physical memory in 2 MB or 1 GB chunks reserved at boot, separate from the normal page allocator. Applications that map files from hugetlbfs get explicit huge-page mappings rather than waiting for transparent huge pages to opportunistically promote them. Database engines (PostgreSQL, Oracle, MySQL with InnoDB tuning) commonly use hugetlbfs to ensure their large shared buffers are backed by huge pages from the start.

These mechanisms are not part of the architectural specification, but they sit at the interface between the architecture's VM features (large pages, page tables) and the operating system's policies. A working knowledge of them is essential for understanding why production systems sometimes look configured very differently from a textbook setup.

12. Page Faults

When the MMU walks the page table and finds that the requested page is not present (the present bit is zero), or that the access violates the permissions, it raises a page fault exception. We met this in Chapter 15; here we look at what the operating system actually does.

The fault hands control to the kernel's page-fault handler, with information about the faulting virtual address, the kind of fault (read, write, instruction fetch), and the privilege level of the faulting access. The handler examines this information and decides what to do. Several outcomes are common.

The faulting page is part of the process's address space but has not yet been allocated. This happens when memory is allocated lazily: the operating system promised the process the address range but did not back it with physical memory until needed. The handler allocates a frame, possibly initializes it (zero-fills, or copies from a file), updates the page table to make the mapping valid, and resumes the faulting instruction.

The faulting page has been swapped out to disk. The process's working set was larger than physical memory, and the operating system reclaimed the page by writing it to a swap file or partition. The handler reads the page back from disk into a free frame, updates the page table, and resumes. This is enormously slow — a page-in from disk can take milliseconds — but invisible from the program's perspective.

The page is being loaded from a memory-mapped file. The process has mapped a file into its address space; the first access to a page of the mapping triggers a fault, which the handler resolves by reading the file's content into a frame and creating the mapping. Subsequent accesses to the page are fast.

The page is shared, and this is a write to a copy-on-write page. This is the next topic.

The access is invalid. The address is outside the process's mapped regions, or the access type violates the page's permissions (write to read-only, execute from non-executable, user-mode access to kernel page). The handler delivers a signal (SIGSEGV on Unix-like systems, an access violation on Windows) to the process, which usually terminates as a result.
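The outcomes listed above amount to a decision procedure, which the following sketch spells out. It is a deliberately simplified model: the `Region` record, the string-valued page states, and the returned action names are all hypothetical stand-ins for real kernel bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A mapped range of virtual pages (hypothetical stand-in for kernel bookkeeping)."""
    start: int          # first VPN in the region
    end: int            # one past the last VPN
    writable: bool
    backing: str        # "anon" or "file"


def handle_fault(regions, vpn, is_write, state):
    """Name the action for a fault on page vpn.

    state is "absent" (never populated), "swapped", or "cow" (shared read-only).
    """
    region = next((r for r in regions if r.start <= vpn < r.end), None)
    if region is None:
        return "SIGSEGV"                      # address outside every mapping
    if is_write and not region.writable:
        return "SIGSEGV"                      # write to a read-only region
    if state == "cow":
        if not is_write:
            raise ValueError("reads of a copy-on-write page do not fault")
        return "copy page"
    if state == "swapped":
        return "read from swap"
    if region.backing == "file":
        return "read from file"
    return "zero-fill new frame"


regions = [Region(0x10, 0x20, True, "anon"), Region(0x20, 0x30, False, "file")]
for args in [(0x11, True, "absent"), (0x25, False, "absent"),
             (0x11, False, "swapped"), (0x11, True, "cow"), (0x40, False, "absent")]:
    print(args, "->", handle_fault(regions, *args))
```

A real handler would go on to allocate frames, perform I/O, and update the page table; the sketch stops at choosing which of those paths applies.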

The page fault is the single most important hook the operating system has into the memory subsystem. Almost every interesting feature of a modern memory manager is implemented through page faults: lazy allocation, demand paging, swap, mmap, copy-on-write, JIT compilation, debugger watchpoints. The mechanism is delicate but powerful.

13. Demand Paging

Demand paging is the policy of bringing pages into memory only when they are first referenced, rather than at process start. Rather than load an entire program into memory before running it, the operating system loads only the data structures necessary to start: a stack, a few pages of code around the entry point, the loader's data structures. As the program runs and touches more memory, page faults bring in pages on demand.

The benefits are substantial. Programs start faster, because they do not need to wait for unneeded code and data to load. Memory is used efficiently, because pages that are never accessed are never brought in. The same mechanism allows the kernel to load shared libraries lazily, expand stacks dynamically, and back any of these with files on disk.

Demand paging works because real programs have working sets much smaller than their total size. A typical large application — a web browser, a game, a compiler — might have hundreds of megabytes of code and data on disk but reference only tens of megabytes during any given second of execution. Without demand paging, the system would have to keep the whole thing in memory; with it, only the working set is resident.

The tradeoff is that the first access to each page is slow (a page fault) and that under memory pressure, the system may swap pages out and back in, with substantial latency. Thrashing is the pathological condition in which the working set does not fit in physical memory and the system spends most of its time paging in and out without making forward progress. A thrashing system is essentially unusable; the operating system has policies (priority, quotas, the OOM killer on Linux) to avoid this state.
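A short simulation shows how sharply behaviour changes once the working set stops fitting. It assumes LRU replacement and a program that cycles through its pages in order, which is a worst case for LRU; the function name and parameters are chosen for the example.

```python
from collections import OrderedDict


def count_faults(accesses, num_frames):
    """Count page faults for a reference string under LRU replacement."""
    resident = OrderedDict()               # resident pages, least recently used first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)     # hit: mark as most recently used
            continue
        faults += 1                        # first touch, or evicted earlier: page in
        if len(resident) >= num_frames:
            resident.popitem(last=False)   # evict the least recently used page
        resident[page] = True
    return faults


loop = list(range(5)) * 20                 # cycle through a 5-page working set
print(count_faults(loop, num_frames=5))    # working set fits
print(count_faults(loop, num_frames=4))    # one frame short
```

With five frames the program faults only on first touch; with four, every single access faults, which is thrashing in miniature.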

14. Copy-on-Write

A particularly clever use of the page fault mechanism is copy-on-write (COW). When a process forks (creates a child by duplicating itself), the naive implementation would copy the entire address space, which on a large process could be gigabytes. COW avoids the copy.

When the parent calls fork(), the kernel does not copy the pages. Instead, it sets up the child's page table to share the parent's pages, marking them all as read-only in both page tables. Both processes can read all their pages without any further work. As long as neither process writes, no copy is needed.

When either process attempts to write to a shared page, the hardware raises a page fault (because the page is read-only). The handler recognizes the COW situation, allocates a new frame, copies the contents of the original page into it, updates the writing process's page table to point to the new frame and mark it read-write again, and resumes the write. The other process still sees the original page. The two have diverged on this single page; the rest of the address space remains shared.
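The fork-then-diverge sequence can be traced with a toy model. The sketch makes several simplifying assumptions: every page was writable before the fork, frames are Python bytearrays with a reference count, and the write path plays the role of the fault handler. A frame whose count has dropped to one is simply made writable again without copying.

```python
class Memory:
    """Physical frames with reference counts (a toy model, not kernel code)."""

    def __init__(self):
        self.frames = {}      # pfn -> bytearray contents
        self.refs = {}        # pfn -> number of page tables mapping it
        self.next_pfn = 0

    def alloc(self, data):
        pfn = self.next_pfn
        self.next_pfn += 1
        self.frames[pfn] = bytearray(data)
        self.refs[pfn] = 1
        return pfn


class Process:
    def __init__(self, mem):
        self.mem = mem
        self.table = {}       # vpn -> [pfn, writable]

    def fork(self):
        child = Process(self.mem)
        for vpn, entry in self.table.items():
            entry[1] = False                       # parent loses write permission
            child.table[vpn] = [entry[0], False]   # child shares the same frame
            self.mem.refs[entry[0]] += 1
        return child

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        if not entry[1]:                           # write to read-only page: "fault"
            pfn = entry[0]
            if self.mem.refs[pfn] > 1:             # still shared: copy first
                self.mem.refs[pfn] -= 1
                entry[0] = self.mem.alloc(self.mem.frames[pfn])
            entry[1] = True                        # sole owner now: re-enable writes
        self.mem.frames[entry[0]][offset] = value

    def read(self, vpn, offset):
        return self.mem.frames[self.table[vpn][0]][offset]


mem = Memory()
parent = Process(mem)
parent.table[0] = [mem.alloc(b"AAAA"), True]
parent.table[1] = [mem.alloc(b"BBBB"), True]
child = parent.fork()
child.write(0, 0, ord("Z"))
print(chr(parent.read(0, 0)), chr(child.read(0, 0)), len(mem.frames))
```

After the child's write, three frames exist instead of four: page 0 has been duplicated, while page 1 is still shared by both processes.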

The benefit is that fork() is fast, even on processes with huge address spaces, and that pages that are never modified after fork (which is most of them, if the child quickly calls exec() to replace itself with a different program) are never copied at all. Modern Unix-like systems rely heavily on COW to make process creation cheap.

The same mechanism powers many other features: efficient file system snapshots, container layered file systems, live migration of virtual machines, even some forms of garbage collection. Anywhere two parties share data and want to diverge cheaply on writes, COW is the answer.

15. Memory Protection

The page table's permission bits — readable, writable, executable, user-or-kernel — give the operating system fine-grained control over what each part of an address space can do. The hardware enforces these on every access.

The user/kernel distinction prevents user programs from accessing kernel pages. Pages mapped only at the kernel's privilege level (typically marked as supervisor-only) are invisible to user programs; an attempt to access them faults. This is what keeps the kernel's data safe from user-mode interference.

Read-only pages prevent modification. Code pages of programs are typically mapped read-only and executable, so that a buffer overflow in a stack page cannot rewrite the program's code. Constant data is also mapped read-only.

No-execute pages (sometimes called NX or DEP) prevent execution. Writable data pages are mapped non-executable, so that an attacker who manages to write attacker-controlled bytes into the stack or heap cannot then jump to them as code. NX is one of the most important hardening techniques in modern operating systems; it eliminates an entire category of exploits.

Per-page attributes for caching control let the operating system mark device-mapped pages as non-cacheable (so that MMIO accesses are not cached, as we discussed in Chapter 9), or write-combining (where writes are buffered and combined before reaching the device).

These permissions interact with the privilege model of the ISA. We saw in Chapter 15 that user mode and kernel mode are distinct privilege levels, and that the trap mechanism is the only path between them. The page table extends that boundary into memory: even within a single physical address space, different pages can be accessible to different privilege levels. The MMU enforces this on every access, with the same precise-exception semantics as any other fault.
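The per-access check described in this section can be sketched as a small function. This is a simplified model under stated assumptions: the `Perm` flags, `PageFault`, and `check_access` are hypothetical names, and real MMUs apply further rules (and encode the bits differently) that are not modeled here.

```python
from enum import Flag, auto


class Perm(Flag):
    R = auto()      # readable
    W = auto()      # writable
    X = auto()      # executable
    USER = auto()   # accessible from user mode


class PageFault(Exception):
    pass


def check_access(pte, access, user_mode):
    """Raise PageFault unless `access` (Perm.R, Perm.W, or Perm.X) is allowed."""
    if user_mode and not (pte & Perm.USER):
        raise PageFault("user-mode access to a supervisor-only page")
    if not (pte & access):
        raise PageFault(f"{access.name} access not permitted")


# Typical mappings from the text: read-only executable code, non-executable
# writable stack, and a kernel page with no USER bit.
code_page = Perm.R | Perm.X | Perm.USER
stack_page = Perm.R | Perm.W | Perm.USER
kernel_page = Perm.R | Perm.W


def faults(pte, access, user_mode):
    """Return True if the access would fault under this model."""
    try:
        check_access(pte, access, user_mode)
    except PageFault:
        return True
    return False
```

Under this model `faults(code_page, Perm.W, True)` and `faults(stack_page, Perm.X, True)` are both true, matching the read-only-code and no-execute-data policies, while the same kernel page is reachable only when `user_mode` is false.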

16. ASLR and Other Modern Tricks

A few features built on virtual memory are worth a brief mention.

Address Space Layout Randomization (ASLR) randomizes the virtual addresses at which a process's code, libraries, stack, and heap are loaded. Without ASLR, every running instance of a program has the same layout; an attacker who finds a vulnerability in one instance can exploit it predictably on every other. With ASLR, the addresses are different on every run, making exploitation much harder. ASLR is a property of the operating system's loader, but it relies on virtual memory: the kernel can place segments anywhere in the address space without affecting the program.
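One rough way to observe ASLR is to start the same program twice and compare where a buffer ends up. The sketch below assumes a system where Python's `ctypes` module is available; `sample_addresses` and `CHILD` are hypothetical names. Whether the two addresses differ depends on the operating system's ASLR settings, so no particular output is claimed.

```python
import subprocess
import sys

# Child program: allocate a small buffer and print its virtual address.
CHILD = (
    "import ctypes; "
    "buf = ctypes.create_string_buffer(64); "
    "print(ctypes.addressof(buf))"
)


def sample_addresses(runs=2):
    """Run the child `runs` times and collect the address each instance reports."""
    addrs = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", CHILD],
            capture_output=True, text=True, check=True,
        )
        addrs.append(int(result.stdout.strip()))
    return addrs


addrs = sample_addresses()
print([hex(a) for a in addrs])
```

With ASLR enabled the printed addresses typically differ from run to run; with it disabled they tend to repeat, which is exactly the predictability an attacker would rely on.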

Kernel ASLR (KASLR) does the same for the kernel, randomizing the virtual address at which kernel code and data are loaded.

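Memory protection keys are a relatively new feature on x86 and ARM that allow user-mode programs to change page permissions efficiently for groups of pages: each page is tagged with a small key, and a per-thread register holds the access rights for each key, so changing rights means updating that register rather than editing page tables. They are useful for sandboxing inside a process, for example in a JavaScript engine that wants to make some pages writable only briefly during code generation.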
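The idea behind protection keys can be illustrated with a toy model. It is loosely patterned on the x86 scheme, where a per-thread register holds two bits per key (access-disable and write-disable); the `Thread` class, `set_rights`, `check`, and the page addresses are hypothetical, and nothing here calls a real operating-system API.

```python
class PKeyFault(Exception):
    pass


class Thread:
    def __init__(self):
        # Rights register: two bits per key.
        # Bit 0 of each pair = access disabled, bit 1 = write disabled.
        self.rights = 0

    def set_rights(self, key, access_disable=False, write_disable=False):
        """Update one key's rights: a single register write, no page-table edits."""
        shift = 2 * key
        self.rights &= ~(0b11 << shift)
        self.rights |= (int(access_disable) | (int(write_disable) << 1)) << shift

    def check(self, page_key, is_write):
        bits = (self.rights >> (2 * page_key)) & 0b11
        if bits & 0b01:
            raise PKeyFault("access disabled for this key")
        if is_write and bits & 0b10:
            raise PKeyFault("write disabled for this key")


# Hypothetical JIT region: both pages are tagged with key 1 in their page-table entries.
jit_pages = {0x1000: 1, 0x2000: 1}
thread = Thread()
thread.set_rights(1, write_disable=True)   # normal state: JIT pages not writable


def can_write(page):
    try:
        thread.check(jit_pages[page], is_write=True)
    except PKeyFault:
        return False
    return True


locked = [can_write(p) for p in jit_pages]    # before the write window opens
thread.set_rights(1)                          # open the window for code generation
unlocked = [can_write(p) for p in jit_pages]
thread.set_rights(1, write_disable=True)      # close it again
```

The point of the design is that one register update flips the rights for every page carrying the key, which is why it suits the brief write windows the text mentions.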

Hardware page-table walks versus software-managed TLBs. Most modern architectures (x86, ARM, RISC-V) have hardware page-table walkers — the MMU itself walks the table on a TLB miss, with no software involvement. Older architectures (MIPS, original SPARC) had software-managed TLBs: a TLB miss raised an exception, and the operating system's handler walked the page table and inserted the translation. Hardware walkers are faster for the common case; software-managed TLBs are more flexible. Modern designs use hardware walkers exclusively for general-purpose memory.
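A software-managed TLB can be modeled as a small cache whose miss path is ordinary code. The sketch below is an illustration only: `SoftTLB`, its FIFO replacement policy, the four-entry capacity, and the dictionary used as a page table are all assumptions chosen for brevity.

```python
from collections import OrderedDict

PAGE_SHIFT = 12                       # 4 KB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class SoftTLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # vpn -> pfn, in whatever format the OS likes
        self.entries = OrderedDict()  # cached translations, oldest first
        self.capacity = capacity
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK
        if vpn not in self.entries:
            self._miss_handler(vpn)   # stands in for the TLB-miss exception
        return (self.entries[vpn] << PAGE_SHIFT) | offset

    def _miss_handler(self, vpn):
        """The OS handler: walk the page table and insert the translation."""
        self.misses += 1
        pfn = self.page_table[vpn]    # a KeyError here plays the role of a page fault
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry (FIFO)
        self.entries[vpn] = pfn


tlb = SoftTLB({0: 7, 1: 3})
first = tlb.translate(0x0123)         # miss: handler fills the entry for page 0
second = tlb.translate(0x0FFF)        # hit: same page, no handler call
third = tlb.translate(0x1004)         # miss: page 1
```

The flexibility the text mentions shows up in `_miss_handler`: the page-table format and the replacement policy are both software decisions, at the cost of running a handler on every miss.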

17. Summary

Virtual memory is the mechanism by which each process sees a private, contiguous, byte-addressable view of memory whose addresses are translated by the MMU into the physical addresses where data actually lives. The translation is page-granular (typically 4 KB pages, with optional 2 MB and 1 GB large pages); the mapping is held in a multi-level page table (sometimes inverted, on POWER) managed by the operating system; a hierarchy of TLBs together with a page-walk cache catches the great majority of translations without going to memory.

Maintaining the structure across cores requires explicit cooperation: TLB shootdowns synchronize translations after page-table changes, with hardware help on recent ARM and x86 to broadcast invalidations through the coherent interconnect. Virtualization adds a second stage of translation (EPT, NPT, Stage-2) so that guest virtual addresses traverse two page table layers before reaching host physical memory. IOMMUs extend the same translation and protection model to DMA-capable devices, isolating them and enabling safe device pass-through to virtual machines. Mechanisms like mlock, page pinning, and hugetlbfs let applications opt out of the normal eviction and large-page policies when their workloads demand it.

The mechanism enables a great deal beyond simple isolation. Demand paging keeps physical memory from being wasted on unused pages. Copy-on-write makes process creation cheap. Memory-mapped files give programs uniform access to file data. Page-level permissions enforce read-only code and non-executable data, eliminating large classes of security bugs. ASLR randomizes layouts to make exploitation harder. Each of these features is a comparatively thin layer of operating-system code on top of the hardware-supported page-fault mechanism, which is itself a small extension of the exception-handling machinery from Chapter 15.

A working programmer can usually pretend none of this exists. The operating system maintains the illusion of a flat address space well enough that most programs never see a page fault directly. But the costs of breaking the illusion — TLB misses, page-walk overhead, swap, thrashing — are real, and a programmer who understands the mechanism can write code that respects it: avoiding random access through huge data structures, packing related data into the same pages, allocating large pages for memory-intensive workloads. The hierarchy continues to reward locality, only now at the granularity of pages rather than cache lines.
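The locality advice can be made concrete by counting how many distinct pages an access pattern touches, since each distinct page needs its own translation. The sketch below is simple arithmetic on byte addresses; `pages_touched` and the two access patterns are hypothetical examples, and the 2 MB figure is the large-page size mentioned earlier in the chapter.

```python
PAGE_SIZE = 4096
LARGE_PAGE_SIZE = 2 * 1024 * 1024


def pages_touched(addresses, page_size=PAGE_SIZE):
    """Number of distinct pages covered by a sequence of byte addresses."""
    return len({addr // page_size for addr in addresses})


# 1024 eight-byte elements packed back to back: 8192 bytes in total.
packed = [i * 8 for i in range(1024)]

# 1024 elements spaced one 4 KB page apart, as in a scattered data structure.
scattered = [i * PAGE_SIZE for i in range(1024)]

print(pages_touched(packed))                       # 2
print(pages_touched(scattered))                    # 1024
print(pages_touched(scattered, LARGE_PAGE_SIZE))   # 2
```

The same 1024 accesses need 2 translations when the data is packed and 1024 when it is scattered; backing the scattered region with 2 MB pages brings the count back down to 2, which is the effect large pages are meant to have on memory-intensive workloads.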

Chapter 20 closes Part III by looking at the bottom of the hierarchy: the storage and I/O subsystems that hold persistent data and that virtual memory occasionally has to reach.
