x86-64 System Architecture
May 16, 2026·23 min read·advanced
The previous chapter looked at x86-64 as an application programmer sees it: registers, instructions, calling conventions, idioms. This chapter steps up to what the operating system kernel sees: the system-level architecture. The kernel is the code that initializes the processor, manages virtual memory, handles interrupts and exceptions, services system calls, switches between processes, and orchestrates everything below the application layer.
The system architecture is large. We will cover the most important pieces: virtual memory and paging in long mode, the descriptor tables (GDT, IDT, LDT), interrupt and exception handling, the system-call mechanism, control registers and MSRs, the boot sequence from reset to operating system, and a brief look at virtualization (VMX). The treatment is conceptual — implementing a kernel from scratch is a book in itself — but should give a clear picture of what an x86-64 OS interacts with.
01. Long Mode Memory Model
In long mode, x86-64 uses a flat 64-bit virtual address space. Segmentation, the cornerstone of 16-bit and 32-bit memory management, is mostly suppressed: the segment bases are forced to zero, the limits are ignored, and segments effectively span the entire address space. The exceptions are fs and gs, which retain their bases for thread-local storage and per-CPU data.
The 64-bit address space is not actually 64 bits wide. Current implementations support either:
- 48-bit canonical addresses: bits 47-63 must all be 0 or all be 1. Effective virtual address space: 2^48 bytes = 256 TiB.
- 57-bit canonical addresses (5-level paging, in newer Intel and AMD chips): bits 56-63 must all be 0 or all be 1. Effective: 2^57 bytes = 128 PiB.
The "canonical" requirement means there is a wide gap between the lower and upper halves of the address space. Conventionally, the lower half (0x0000_0000_0000_0000 to 0x0000_7FFF_FFFF_FFFF for 48-bit) is user space; the upper half (0xFFFF_8000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF) is kernel space. Non-canonical addresses (the gap) cause a #GP fault.
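The canonical rule is simple enough to check in a few lines. The following is a minimal Python sketch, not taken from any kernel; the helper names `is_canonical` and `sign_extend` are ours, chosen for illustration.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63 down to va_bits-1 are all 0 or all 1."""
    addr &= MASK64
    top = addr >> (va_bits - 1)                  # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def sign_extend(addr: int, va_bits: int = 48) -> int:
    """Canonical form of the low va_bits of addr (copy the top bit upward)."""
    low = addr & ((1 << va_bits) - 1)
    if low >> (va_bits - 1):                     # top implemented bit set
        low |= MASK64 ^ ((1 << va_bits) - 1)     # fill bits va_bits..63 with 1
    return low
```

With 48-bit addressing, `0x0000_8000_0000_0000` is the first address of the gap, and sign-extending it yields `0xFFFF_8000_0000_0000`, the first address of the upper half. The same value is canonical under 57-bit addressing.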
02. Paging: 4-Level (and 5-Level)
Long mode uses paging exclusively (segmentation is mostly disabled). The page-table structure is hierarchical, with 4 or 5 levels.
4-Level Paging (48-bit Virtual Addresses)
A 48-bit virtual address is split into five fields:
| PML4 | PDPT | PD | PT | Offset |
|---|---|---|---|---|
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
- PML4 (Page-Map Level 4): top-level table, 512 entries indexed by bits 47-39.
- PDPT (Page-Directory Pointer Table): 512 entries, bits 38-30.
- PD (Page Directory): 512 entries, bits 29-21.
- PT (Page Table): 512 entries, bits 20-12.
- Offset: byte within the 4 KiB page, bits 11-0.
Each level is a 4 KiB table containing 512 entries of 8 bytes each. Each entry points to the next-level table (a 4 KiB-aligned physical address) or, at lower levels, to a huge page directly.
Huge pages. Instead of mapping a 4 KiB page, an entry can map a larger contiguous region by setting the PS (page size) bit:
- A PD entry with PS=1 maps a 2 MiB page (skipping the PT level).
- A PDPT entry with PS=1 maps a 1 GiB page (skipping PD and PT).
Huge pages reduce TLB pressure (one TLB entry covers more memory) and skip page-walk levels. The kernel uses 2 MiB pages liberally for kernel mappings and page caches; 1 GiB pages for very large mappings (huge databases, mmap of giant files).
5-Level Paging (57-bit Virtual Addresses)
5-level paging adds the PML5 table on top:
| PML5 | PML4 | PDPT | PD | PT | Offset |
|---|---|---|---|---|---|
| 9 bits | 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
5-level paging targets very large memory systems (hundreds of TiB to many PiB). It is available in Intel Ice Lake and later, and in AMD's recent server chips. The OS opts in by setting the LA57 bit in CR4; older kernels and applications continue to work in 4-level mode.
Page Table Entry Format
Each PTE is 8 bytes (64 bits) with the following key bits:
- P (present): 1 if the entry is valid.
- R/W (read/write): 1 to allow writes.
- U/S (user/supervisor): 1 if user-mode can access; 0 for kernel-only.
- PWT, PCD (caching policy bits).
- A (accessed): hardware sets this on access.
- D (dirty): hardware sets this on write.
- PS (page size): 1 for huge page (where applicable).
- G (global): if set, TLB entry is preserved across CR3 reloads.
- PAT (page attribute table): one of three bits selecting a memory type.
- Bits 12-51: physical address of next level or page (52-bit physical addresses).
- NX (no-execute, bit 63): if set, fetching from this page faults.
The NX bit is critical for security: the kernel marks data pages NX so that an exploit cannot inject and execute shellcode there. AMD pioneered NX in its 64-bit chips; Intel followed.
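The bit layout can be sketched as a handful of constants. This is an illustrative Python model of a 4 KiB leaf entry, not code from a real kernel; `make_pte` and `decode_pte` are hypothetical helper names.

```python
PTE_P   = 1 << 0    # present
PTE_RW  = 1 << 1    # writable
PTE_US  = 1 << 2    # user-accessible
PTE_PWT = 1 << 3    # write-through
PTE_PCD = 1 << 4    # cache disable
PTE_A   = 1 << 5    # accessed (set by hardware)
PTE_D   = 1 << 6    # dirty (set by hardware)
# Bit 7 is PS in PD/PDPT entries and PAT in a 4 KiB PTE, so it is left out here.
PTE_G   = 1 << 8    # global
PTE_NX  = 1 << 63   # no-execute
PTE_ADDR_MASK = 0x000F_FFFF_FFFF_F000   # bits 12-51

FLAG_NAMES = {PTE_P: "P", PTE_RW: "R/W", PTE_US: "U/S", PTE_PWT: "PWT",
              PTE_PCD: "PCD", PTE_A: "A", PTE_D: "D", PTE_G: "G", PTE_NX: "NX"}


def make_pte(phys_addr: int, flags: int) -> int:
    """Combine a 4 KiB-aligned physical address with flag bits."""
    if phys_addr & ~PTE_ADDR_MASK:
        raise ValueError("address must be 4 KiB aligned and below 2^52")
    return phys_addr | flags


def decode_pte(pte: int):
    """Return (physical address, list of set flag names)."""
    names = [name for bit, name in FLAG_NAMES.items() if pte & bit]
    return pte & PTE_ADDR_MASK, names
```

A present, writable, non-executable kernel data page at physical address 0x12345000 encodes as `0x8000000012345003`.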
TLB and Page-Walk Caches
The TLB caches recent virtual-to-physical translations. We covered TLBs in Chapter 19. Modern x86-64 chips have:
- A small, fast L1 TLB (split into iTLB and dTLB), typically 64-128 entries each.
- A unified L2 TLB with 1024-4096 entries.
- Separate TLBs for 4 KiB, 2 MiB, and 1 GiB pages.
- Page-walk caches that hold the upper-level page-table entries to skip levels of the walk on a TLB miss.
A TLB miss triggers a hardware page walk. The page walker reads the four (or five) levels from memory, traversing the table hierarchy, and installs the result in the TLB. The walk can take 10-100 cycles depending on cache locality. For large working sets, page walks are a significant cost.
The OS invalidates TLB entries explicitly:
- INVLPG addr: invalidate the TLB entry for addr on this core.
- MOV cr3, ...: reload the page-table base (causes a full TLB flush except for global entries).
- TLB shootdown: cross-core IPI to invalidate the same address on other cores. The most expensive consequence of the kernel changing a mapping.
Process-Context Identifiers (PCID)
Without PCID, switching processes (via cr3 reload) invalidates the entire TLB, costing thousands of cycles in subsequent misses. PCID tags TLB entries with a 12-bit process identifier, so multiple processes' translations coexist in the TLB.
When the kernel switches processes, it loads CR3 with the new page-table base and the new PCID; any TLB entries still cached for the new process are immediately usable. This substantially speeds up context switches.
PCID is enabled by setting CR4.PCIDE. Intel has supported it since Westmere (2010); modern Linux kernels use it.
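As a rough sketch of what the kernel writes on a context switch with PCID enabled: the PCID occupies the low 12 bits of the CR3 value, and bit 63 of the value being loaded asks the CPU not to flush the entries tagged with that PCID. The function name `make_cr3` is ours; a real kernel would also track which PCIDs are still valid on each CPU.

```python
CR3_NOFLUSH = 1 << 63   # with CR4.PCIDE=1: keep TLB entries for this PCID


def make_cr3(pml4_phys: int, pcid: int, keep_tlb: bool = False) -> int:
    """Build the value loaded into CR3 when CR4.PCIDE is set."""
    if pml4_phys & 0xFFF:
        raise ValueError("top-level table must be 4 KiB aligned")
    if not 0 <= pcid < (1 << 12):
        raise ValueError("PCID is a 12-bit value")
    value = pml4_phys | pcid
    if keep_tlb:
        value |= CR3_NOFLUSH
    return value
```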
03. Descriptor Tables
Long mode dramatically simplifies the descriptor-table machinery from 32-bit protected mode, but the tables still exist.
Global Descriptor Table (GDT)
The GDT is an array of 8-byte (sometimes 16-byte) descriptors. In long mode, most segment descriptors are essentially fixed: code segments and data segments have base 0 and limit max. The GDT contains:
- A null descriptor (entry 0, required).
- A 64-bit code descriptor (kernel cs).
- A 64-bit data descriptor (kernel ds/ss).
- A user-mode 64-bit code descriptor.
- A user-mode 64-bit data descriptor.
- A TSS (Task State Segment) descriptor (16 bytes in long mode).
- Possibly: 32-bit code/data descriptors for compatibility-mode user code.
The CPU finds the GDT via the GDTR register, loaded by LGDT. The cs/ss/ds/es selectors index into the GDT.
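A selector is more than a bare index: the low two bits carry the requested privilege level (RPL), bit 2 selects GDT or LDT, and the remaining bits are the table index. The sketch below assumes, purely for illustration, that the descriptors sit in the order listed above (null, kernel code, kernel data, user code, user data); real kernels choose their own layout.

```python
def make_selector(index: int, rpl: int = 0, ldt: bool = False) -> int:
    """Encode a segment selector: index in bits 15-3, TI in bit 2, RPL in bits 1-0."""
    return (index << 3) | (int(ldt) << 2) | (rpl & 3)


def split_selector(sel: int):
    """Decode a selector into (index, uses_ldt, rpl)."""
    return sel >> 3, bool(sel & 4), sel & 3


# Hypothetical layout following the list above.
KERNEL_CS = make_selector(1)           # 0x08
KERNEL_DS = make_selector(2)           # 0x10
USER_CS = make_selector(3, rpl=3)      # 0x1B
USER_DS = make_selector(4, rpl=3)      # 0x23
```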
Interrupt Descriptor Table (IDT)
The IDT is an array of 16-byte gate descriptors, indexed by interrupt vector number. Each entry specifies:
- The handler's code segment (a selector into the GDT).
- The handler's offset (64-bit virtual address).
- The privilege level required to invoke this gate via software.
- The IST (Interrupt Stack Table) index, optionally selecting a per-vector kernel stack.
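The 16-byte gate scatters the handler address across three fields. A minimal Python sketch of the packing follows; `pack_idt_gate` is an illustrative helper, and the example handler address and selector are made up.

```python
import struct

GATE_INTERRUPT = 0xE   # clears IF on entry
GATE_TRAP = 0xF        # leaves IF unchanged


def pack_idt_gate(handler: int, selector: int, ist: int = 0,
                  dpl: int = 0, gate_type: int = GATE_INTERRUPT) -> bytes:
    """Pack one 64-bit IDT gate descriptor (16 bytes, little-endian)."""
    type_attr = 0x80 | ((dpl & 3) << 5) | gate_type   # present bit, DPL, type
    return struct.pack(
        "<HHBBHII",
        handler & 0xFFFF,               # offset bits 15-0
        selector,                       # code-segment selector
        ist & 0x7,                      # IST index (0 = no stack switch via IST)
        type_attr,
        (handler >> 16) & 0xFFFF,       # offset bits 31-16
        (handler >> 32) & 0xFFFFFFFF,   # offset bits 63-32
        0,                              # reserved
    )
```

For a kernel-only interrupt gate, `type_attr` comes out as 0x8E; a gate that user code may invoke with INT n (DPL 3) would use 0xEE.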
When an interrupt or exception occurs, the CPU:
- Looks up the vector in the IDT.
- Switches to the kernel stack (if coming from user mode), possibly using IST.
- Pushes a stack frame: ss, rsp, rflags, cs, rip, error code (for some vectors).
- Loads cs and rip from the IDT entry.
- Begins executing the handler.
The IDT has 256 entries, vectored as follows:
- 0-31: Architecturally defined exceptions (described below).
- 32-47: Traditional hardware IRQs (modern systems use APIC and assign more flexibly).
- 48-255: Available for OS use, including software interrupts (INT n), IPIs, MSI vectors.
Task State Segment (TSS)
In 32-bit protected mode, the TSS was a context-save region for hardware task switching. In long mode, hardware task switching is dropped, but the TSS still exists for one important purpose: it holds the kernel-mode stack pointers for ring transitions and the IST table.
When an interrupt or exception arrives while the CPU is running user-mode code, the CPU needs to switch to a kernel stack. The TSS provides:
- rsp0, rsp1, rsp2: stacks for ring 0, 1, 2 (only rsp0 used in practice).
- IST1-IST7: separate kernel stacks for specific exception vectors (NMI, double fault, machine check, etc.) so that a corrupted main kernel stack doesn't prevent handling those critical vectors.
The TSS descriptor in the GDT points to the TSS. LTR loads the TR register, telling the CPU which TSS to use.
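The long-mode TSS is a fixed 104-byte structure. The sketch below packs one in Python to show where the stack pointers live; `pack_tss` is an illustrative name, and the stack addresses in the usage note are invented.

```python
import struct

# reserved(4) | rsp0-rsp2 | reserved(8) | ist1-ist7 | reserved(8) | reserved(2) | iomap base(2)
TSS_FORMAT = "<I3QQ7QQHH"
TSS_SIZE = struct.calcsize(TSS_FORMAT)   # 104 bytes


def pack_tss(rsp0: int, ist=(), iomap_base: int = TSS_SIZE) -> bytes:
    """Pack a 64-bit TSS with rsp0 and up to seven IST stack pointers."""
    ist = list(ist)
    if len(ist) > 7:
        raise ValueError("the TSS has only IST1-IST7")
    ist_slots = ist + [0] * (7 - len(ist))
    return struct.pack(TSS_FORMAT, 0, rsp0, 0, 0, 0, *ist_slots, 0, 0, iomap_base)
```

In the packed image, rsp0 sits at byte offset 4 and IST1 at byte offset 36. Setting the I/O map base to the size of the TSS is a common way to say "no I/O permission bitmap".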
Local Descriptor Table (LDT)
The LDT was for per-process segment definitions. In long mode it is largely vestigial: most kernels do not use it. Some niche programs (Wine, certain embedded systems) use it for compatibility-mode code with custom segments.
04. Exceptions
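x86-64 reserves vectors 0-31 for architecturally defined exceptions. The main ones are listed below; vectors in this range that do not appear in the table are reserved or rarely encountered: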
| Vector | Mnemonic | Description |
|---|---|---|
| 0 | #DE | Divide error |
| 1 | #DB | Debug exception |
| 2 | NMI | Non-maskable interrupt |
| 3 | #BP | Breakpoint (INT3) |
| 4 | #OF | Overflow |
| 5 | #BR | Bound range exceeded |
| 6 | #UD | Undefined opcode |
| 7 | #NM | Device not available (FPU) |
| 8 | #DF | Double fault |
| 10 | #TS | Invalid TSS |
| 11 | #NP | Segment not present |
| 12 | #SS | Stack segment fault |
| 13 | #GP | General protection |
| 14 | #PF | Page fault |
| 16 | #MF | x87 FPU floating-point error |
| 17 | #AC | Alignment check |
| 18 | #MC | Machine check |
| 19 | #XF | SIMD floating-point exception |
| 21 | #CP | Control protection (CET) |
The most common in normal operation:
- #PF (page fault): the most frequent. Triggered on access to a non-present page (demand paging) or write to a read-only page (COW), or NX violation, etc. The error code on the stack distinguishes the cause; CR2 holds the faulting address.
- #GP: privileged instruction in user mode, segment limit violation, non-canonical address use, etc. Common cause of SIGSEGV in user programs.
- #UD: undefined opcode. Used by debuggers (BYE-style breakpoints), kernel panics (the UD2 instruction triggers it deliberately), and detection of unsupported instructions.
- #BP: software breakpoint (INT3, encoded as 0xCC). Debuggers replace instructions with 0xCC bytes to set breakpoints.
- NMI: non-maskable. Used for watchdog, hardware errors, profiling. Even with interrupts disabled (CLI), NMI fires.
- #MC: hardware error. Memory parity, ECC failure, internal CPU error. The OS typically panics or marks pages dead.
- #DF (double fault): an exception occurred while handling another exception. Very dangerous; if a double fault itself faults, the CPU triple-faults and resets.
Some exceptions push an error code on the stack; others don't. Handlers must know the convention for each vector.
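As an illustration of both points, the sketch below records which of the vectors in the table above push an error code and decodes the low bits of the page-fault error code. The names are ours; only the five lowest page-fault bits are modeled, and newer bits (protection keys, shadow stack, and so on) are ignored.

```python
# Vectors from the table above that push an error code:
# #DF, #TS, #NP, #SS, #GP, #PF, #AC, #CP.
PUSHES_ERROR_CODE = {8, 10, 11, 12, 13, 14, 17, 21}


def decode_pf(error_code: int) -> dict:
    """Decode the low five bits of a #PF error code."""
    return {
        "present": bool(error_code & 0x01),            # 0: page not present, 1: protection violation
        "write": bool(error_code & 0x02),              # 0: read, 1: write
        "user": bool(error_code & 0x04),               # 0: supervisor access, 1: user access
        "reserved_bit": bool(error_code & 0x08),       # reserved bit set in a paging entry
        "instruction_fetch": bool(error_code & 0x10),  # fault on instruction fetch (e.g. NX)
    }
```

A copy-on-write fault from user space, for example, arrives with error code 0x7 (present, write, user), while a demand-paging read arrives with 0x4.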
05. Interrupts and the APIC
Hardware interrupts come through the APIC (Advanced Programmable Interrupt Controller). Each core has a local APIC; the system has one or more I/O APICs. The local APIC handles inter-processor interrupts (IPIs), the local timer, the thermal monitor, and the device interrupts routed to it by the I/O APIC or delivered as MSIs.
The local APIC delivers interrupts to specific vectors in the IDT. Before returning, the handler signals completion by writing to the local APIC's EOI (end-of-interrupt) register. Without the EOI, further interrupts at the same or lower priority stay blocked.
x86-64 has MSI (Message Signaled Interrupts) as the dominant delivery mechanism. PCI devices generate interrupts by writing to a special memory address (the local APIC's interrupt window), which directly delivers the interrupt without traversing legacy interrupt lines.
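To give a feel for what the device actually writes, here is a hedged Python sketch of the basic MSI message format: a fixed-delivery, edge-triggered interrupt aimed at one local APIC in physical destination mode with an 8-bit APIC ID. `msi_message` is a made-up helper; interrupt remapping and x2APIC destinations change the format and are not modeled.

```python
MSI_ADDR_BASE = 0xFEE00000   # bits 31-20 of every MSI address are 0xFEE


def msi_message(dest_apic_id: int, vector: int, delivery_mode: int = 0):
    """Return the (address, data) pair a device is programmed to write."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("this sketch assumes an 8-bit APIC ID")
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector is 8 bits")
    address = MSI_ADDR_BASE | (dest_apic_id << 12)   # destination ID in bits 19-12
    data = vector | ((delivery_mode & 7) << 8)       # vector in bits 7-0, mode in bits 10-8
    return address, data
```

The OS writes these two values into the device's MSI capability registers; the device then raises the interrupt simply by performing that memory write.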
x2APIC is a more recent mode where the local APIC is accessed via MSRs rather than memory-mapped I/O, supporting more cores (up to 2^32 logical processors) and faster interaction. Modern systems prefer x2APIC.
06. System Calls
User mode invokes the kernel through system calls. x86-64 provides a fast mechanism: SYSCALL / SYSRET.
```asm
; user mode
mov rax, syscall_number
mov rdi, arg0
mov rsi, arg1
mov rdx, arg2
mov r10, arg3        ; r10, not rcx: syscall clobbers rcx
mov r8, arg4
mov r9, arg5
syscall
; rax has result
```
What SYSCALL does:
- Saves the address of the next instruction (rip) into rcx and rflags into r11.
- Loads cs from MSR IA32_STAR[47:32].
- Loads ss from IA32_STAR[47:32] + 8 (derived from the same field; there is no separate MSR for it).
- Sets rip to MSR IA32_LSTAR (the syscall entry point).
- Clears the rflags bits that are set in IA32_FMASK (typically including IF, so the handler starts with interrupts disabled).
- Switches privilege level to ring 0.
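The steps above can be modeled as a pure function over a tiny register file. This is a teaching sketch in Python, not an emulator: the `Cpu` class and `do_syscall` are hypothetical, and the MSR values in the usage note are illustrative rather than taken from any particular kernel.

```python
from dataclasses import dataclass, replace

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class Cpu:
    rip: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cs: int = 0
    ss: int = 0
    cpl: int = 3


def do_syscall(cpu: Cpu, next_rip: int, star: int, lstar: int, fmask: int) -> Cpu:
    """Apply the architectural effects of SYSCALL to a Cpu snapshot."""
    base = (star >> 32) & 0xFFFF                # IA32_STAR[47:32]
    return replace(
        cpu,
        rcx=next_rip,                           # return address
        r11=cpu.rflags,                         # saved flags
        cs=base & 0xFFFC,                       # selector with RPL forced to 0
        ss=(base + 8) & 0xFFFF,
        rip=lstar,                              # entry point
        rflags=cpu.rflags & ~fmask & MASK64,    # clear masked flags
        cpl=0,
    )
```

With IA32_STAR[47:32] = 0x10 and IA32_FMASK = 0x200 (the IF bit), a user thread with rflags 0x202 enters the kernel with cs 0x10, ss 0x18, rflags 0x002, and its old flags preserved in r11. Note that rsp is absent from the model because SYSCALL leaves it untouched.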
Note that SYSCALL does not switch stacks. The kernel's syscall entry must do so manually, typically by loading a per-CPU stack pointer that becomes reachable through the kernel gs base after SWAPGS.
```asm
; kernel syscall entry
swapgs                        ; swap gs to kernel gs
mov gs:[saved_rsp], rsp       ; save user rsp
mov rsp, gs:[kernel_rsp]      ; load kernel rsp
; ... save user registers, dispatch syscall ...
```
SYSRET reverses the process, returning to user mode. 64-bit Linux uses SYSCALL/SYSRET as its native syscall path; the legacy int 0x80 gate is still supported for compatibility but is slower.
07. Control Registers
Long mode defines five architectural control registers that software uses directly (the other control-register encodings are reserved), plus various CPU-specific ones.
- CR0: protection enable (PE), paging enable (PG), write protect (WP), and other mode bits.
- CR2: faulting linear address (set on #PF).
- CR3: page-table base address (and PCID in low 12 bits).
- CR4: extension enables (PAE, PSE, OSXSAVE, OSXMMEXCPT, LA57 for 5-level, SMEP, SMAP, UMIP, CET, etc.).
- CR8: task priority register (TPR), used by the local APIC for interrupt priority.
CR4 grew enormously over time; it now controls dozens of features. Modern security features show up as CR4 bits:
- SMEP: kernel cannot execute user-mode code.
- SMAP: kernel cannot read/write user-mode pages without explicit annotation (STAC/CLAC instructions to gate access).
- UMIP: user mode cannot execute the instructions that read certain system registers (SGDT, SIDT, etc.), which would otherwise leak kernel addresses.
- CET: Control-flow enforcement (shadow stacks, indirect-branch tracking).
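For reference, the bit positions of the CR4 flags mentioned in this section can be written down as a small table. The sketch below is illustrative Python; `cr4_set` and `cr4_features` are our own helper names, and only a subset of CR4 is listed.

```python
CR4_BITS = {
    "PSE": 4, "PAE": 5, "PGE": 7, "OSFXSR": 9, "OSXMMEXCPT": 10,
    "UMIP": 11, "LA57": 12, "VMXE": 13, "FSGSBASE": 16, "PCIDE": 17,
    "OSXSAVE": 18, "SMEP": 20, "SMAP": 21, "PKE": 22, "CET": 23,
}


def cr4_set(value: int, *names: str) -> int:
    """Return value with the named CR4 bits set."""
    for name in names:
        value |= 1 << CR4_BITS[name]
    return value


def cr4_features(value: int) -> list:
    """List the named CR4 bits that are set in value."""
    return [name for name, bit in CR4_BITS.items() if (value >> bit) & 1]
```

A kernel would build such a value and write it with MOV to CR4, after confirming through CPUID that each feature exists; setting an unsupported bit raises #GP.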
08. Model-Specific Registers (MSRs)
MSRs are 64-bit registers accessed via RDMSR (read) and WRMSR (write). There are hundreds. Critical ones include:
- EFER (0xC0000080): extended feature enable register. Setting its LME (Long Mode Enable) bit arms the transition into long mode, which takes effect once paging is enabled.
- STAR, LSTAR, CSTAR, FMASK: SYSCALL configuration.
- FS_BASE, GS_BASE, KERNEL_GS_BASE: thread-local segment bases.
- IA32_PAT: page attribute table for memory typing.
- IA32_TSC: time-stamp counter (read by RDTSC).
- Many performance-counter MSRs.
- Many vendor-specific MSRs for power management, microcode patching, debug, etc.
RDMSR/WRMSR are privileged. User mode cannot directly access MSRs; the rare exceptions are dedicated instructions such as those of the FSGSBASE extension, which read and write the fs/gs bases when the OS enables them.
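Two details are worth pinning down: each MSR has a 32-bit address placed in ecx, and the 64-bit value travels split across edx (high half) and eax (low half). The Python sketch below lists the addresses of the MSRs named above and models the split; the helper names are ours.

```python
MSR = {
    "IA32_TSC": 0x10,
    "IA32_PAT": 0x277,
    "EFER": 0xC0000080,
    "STAR": 0xC0000081,
    "LSTAR": 0xC0000082,
    "CSTAR": 0xC0000083,
    "FMASK": 0xC0000084,
    "FS_BASE": 0xC0000100,
    "GS_BASE": 0xC0000101,
    "KERNEL_GS_BASE": 0xC0000102,
}

EFER_SCE = 1 << 0    # enable SYSCALL/SYSRET
EFER_LME = 1 << 8    # long mode enable
EFER_LMA = 1 << 10   # long mode active (set by the CPU)
EFER_NXE = 1 << 11   # enable the NX bit in page tables


def split_edx_eax(value: int):
    """Split a 64-bit MSR value the way WRMSR expects it: (edx, eax)."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


def join_edx_eax(edx: int, eax: int) -> int:
    """Rebuild the 64-bit value that RDMSR returns in edx:eax."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
```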
09. Memory Types and Caching Attributes
Long mode supports several memory types, controlling how the CPU caches and orders accesses to specific physical-memory regions:
- WB (Write-Back): default. Cached, write-back, normal coherent memory. RAM is WB.
- WT (Write-Through): cached, but writes go through to the next level immediately. Rare.
- WC (Write-Combining): writes accumulate in a buffer and may merge before flushing. Used for video framebuffers and similar streaming write targets.
- WP (Write-Protected): cached for reads; writes go to memory but invalidate cache.
- UC (Uncached): no caching, no buffering, strict ordering. Used for memory-mapped I/O.
- UC- (Uncached, weak): like UC, except that an MTRR can override it to WC.
The memory type for a given page is determined by combining bits in the PTE (PAT, PCD, PWT) with the IA32_PAT MSR's lookup table and with MTRRs (Memory Type Range Registers) that define the type for ranges of physical memory.
The OS configures MTRRs at boot to match firmware settings (RAM = WB, MMIO = UC, framebuffer = WC) and uses page-level overrides for special cases (DMA buffers might be UC; certain device buffers might be WC).
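The page-level half of this lookup can be sketched in a few lines: the three PTE bits form an index into the eight one-byte entries of IA32_PAT. The Python below uses the MSR's power-on default value and covers only the PAT side; the subsequent combination with the MTRR type is deliberately left out, and `pat_type` is an illustrative name.

```python
PAT_DEFAULT = 0x0007040600070406   # power-on value of IA32_PAT

TYPE_NAMES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB", 7: "UC-"}


def pat_type(pat_msr: int, pat: int, pcd: int, pwt: int) -> str:
    """Memory type selected by a PTE's PAT, PCD, and PWT bits."""
    index = (pat << 2) | (pcd << 1) | pwt        # 0..7
    encoding = (pat_msr >> (8 * index)) & 0x7    # low 3 bits of entry `index`
    return TYPE_NAMES.get(encoding, "reserved")
```

With the default table, a PTE with all three bits clear selects WB and PCD=1, PWT=1 selects UC. No default entry is WC, so an OS that wants write-combining mappings has to reprogram one of the slots first.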
10. Boot Sequence
A modern x86-64 system boots roughly as follows.
Reset State
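On hardware reset, the CPU starts in real mode with cs:ip = F000:FFF0. The hidden cs base is initialized to 0xFFFF0000, so the first instruction is fetched from physical address 0xFFFFFFF0, 16 bytes below the 4 GiB boundary, rather than from the top of the 1 MiB real-mode address space that the selector value suggests. That address is the firmware ROM's reset vector, which jumps to the firmware initialization code.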
The CPU is in 16-bit real mode. Caches are disabled. Paging is off. Almost no devices are initialized.
Firmware (UEFI or Legacy BIOS)
The firmware:
- Initializes the CPU's basic state, switches early to 32-bit protected mode (or later, long mode).
- Initializes the memory controllers, runs DRAM training, and discovers the memory map.
- Initializes the chipset, the PCI/PCIe bus, and basic peripherals.
- Configures the local APIC and starts other cores (which initially park).
- Loads firmware drivers.
- UEFI: locates the EFI system partition on a disk, loads the bootloader (e.g., GRUB, Windows Boot Manager) from there, hands off control. UEFI provides services (memory map, console, file system) to the bootloader.
- Legacy BIOS: loads the first sector (MBR) from the boot disk and jumps to it, in real mode.
Modern systems are UEFI; legacy BIOS is mostly retired.
Bootloader
The bootloader:
- Loads the OS kernel and initial RAM disk from disk.
- Sets up basic page tables and switches to long mode if not already.
- Sets up command-line arguments, memory map, and other boot info.
- Jumps to the kernel's entry point.
Kernel Initialization
The kernel:
- Sets up its own GDT, IDT, page tables (replacing the bootloader's).
- Probes hardware (CPU features via CPUID, ACPI tables for system topology).
- Sets up the per-CPU data, scheduler, memory management, file systems.
- Starts the other cores (sending an INIT IPI followed by a STARTUP IPI to each).
- Mounts the root filesystem and starts user space (typically init or systemd).
The transition from firmware to kernel is heavily orchestrated, with multiple handoffs, mode changes, and configuration. By the time user space starts, the CPU has been through several mode transitions, and dozens of MSRs have been configured.
11. Multi-Core Initialization
A multi-core x86-64 system has one bootstrap processor (BSP) and several application processors (APs). At reset, the BSP runs the firmware and kernel boot path; the APs sit in a halted state.
The kernel starts each AP by sending the local APIC's INIT IPI (resets the AP) followed by a STARTUP IPI with a vector pointing to a trampoline page. The trampoline is real-mode code that:
- Switches to protected mode, then long mode.
- Loads the kernel's GDT and page tables.
- Sets up the AP's per-CPU stack and data.
- Jumps to the kernel's AP entry point.
Once the kernel takes control of an AP, it adds the AP to the scheduler's runnable-CPU set. From this point, the AP is just another available core.
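The link between the STARTUP IPI and the trampoline is just an 8-bit page number: vector V makes the AP start executing in real mode at physical address V × 4 KiB. The Python sketch below computes the two values a kernel would write to the low half of the local APIC's interrupt command register in xAPIC mode. The constants follow the commonly used encodings (INIT with level assert, then STARTUP); the function names and the 0x8000 trampoline address are made up for illustration.

```python
ICR_INIT = 0x00004500      # delivery mode INIT (5), level assert
ICR_STARTUP = 0x00004600   # delivery mode STARTUP (6), level assert


def sipi_vector(trampoline_phys: int) -> int:
    """Vector field of the STARTUP IPI for a given trampoline page."""
    if trampoline_phys & 0xFFF:
        raise ValueError("trampoline must be 4 KiB aligned")
    if trampoline_phys >= 0x100000:
        raise ValueError("trampoline must lie below 1 MiB (real-mode reachable)")
    return trampoline_phys >> 12


def startup_icr_low(trampoline_phys: int) -> int:
    """Low 32 bits of the ICR for the STARTUP IPI."""
    return ICR_STARTUP | sipi_vector(trampoline_phys)
```

For a trampoline copied to physical address 0x8000, the vector is 0x08 and the ICR value is 0x4608. The destination APIC ID goes in the high half of the ICR, and the required delays between the IPIs are not shown.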
12. Virtualization (VMX / SVM)
Both Intel (VT-x / VMX) and AMD (SVM) added hardware virtualization in 2005-2006. The basic idea: a hypervisor runs in a special privilege level (sometimes called ring -1) above ring 0, and guests believe they are running on bare metal.
Key features:
- VMCS (Intel) or VMCB (AMD): a control structure that captures the guest's state and the conditions under which control is returned to the hypervisor.
- VMX root mode vs. VMX non-root mode: hypervisor runs in root mode; guests in non-root mode.
- VM entry: hypervisor loads guest state from VMCS, jumps to guest. Guest runs at full speed for as long as the configured exit conditions don't trigger.
- VM exit: certain instructions or events (CPUID, RDMSR/WRMSR of certain MSRs, EPT violations, external interrupts, etc.) cause an exit back to the hypervisor.
- EPT (Extended Page Tables) / NPT (Nested Page Tables): a second-level page-table structure. The guest's page tables map guest-virtual to guest-physical; EPT maps guest-physical to host-physical. The hardware walks both on a TLB miss.
- VPID: tags TLB entries with a virtual processor ID, so guest entries don't collide.
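A back-of-the-envelope count shows why the two-dimensional walk under EPT/NPT is costly. The sketch below assumes the worst case, in which neither the TLB nor any paging-structure cache helps: every guest-physical address touched during the guest walk (each guest table, plus the final data address) needs its own full walk of the second-level tables. The function names are ours.

```python
def native_walk_refs(levels: int = 4) -> int:
    """Memory references for a page walk without virtualization."""
    return levels


def nested_walk_refs(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Worst-case memory references for a two-dimensional page walk."""
    translations = guest_levels + 1            # guest tables + final address
    return translations * host_levels + guest_levels
```

With 4-level tables on both sides, a miss can cost up to 24 memory references instead of 4; with 5-level tables on both sides the figure rises to 35. This is the main reason hypervisors favor huge pages in the second-level tables.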
Virtualization is now ubiquitous: nearly every modern x86 chip supports it, and nearly every cloud server uses it. KVM (Linux), Hyper-V (Windows), VMware ESXi, Xen, and others all use VMX/SVM.
We will not develop virtualization in detail; entire books are written on it. The point: x86-64's system architecture includes a complete second tier of privilege management for hypervisors.
13. Modern Security Features
Long mode has accumulated many security features over the years:
- NX (no-execute pages): mark data pages non-executable.
- SMEP (Supervisor-Mode Execution Prevention): kernel cannot execute user-mode pages.
- SMAP (Supervisor-Mode Access Prevention): kernel cannot access user pages without explicit annotation.
- CET (Control-Flow Enforcement Technology): shadow stacks (each call pushes a return address to a separate, hardware-managed stack; any mismatch on return triggers a fault) and indirect-branch tracking (indirect branches must land at instructions marked as valid targets, via the ENDBR64 instruction).
- MPK (Memory Protection Keys): per-page key tags allowing user-mode programs to mask off access to selected pages without TLB flushes.
- Intel CET, AMD Shadow Stack: implementations of shadow-stack ROP defense.
- MKTME / SME / SEV: memory encryption. Pages are encrypted in DRAM, optionally per-VM.
- TDX / SEV-SNP: confidential computing — full guest VM memory and state encrypted and integrity-protected against the hypervisor.
These features mostly require both hardware and OS cooperation. Modern Linux and Windows take advantage of most of them.
14. SMM and the Hidden Privilege Level
The ring-based privilege model described above (rings 0–3, plus VMX root/non-root) misses one further mode that has been part of x86 since the 386SL: System Management Mode (SMM). SMM is an even-more-privileged execution context, invisible to the operating system, used by firmware for power management, hardware emulation, and various platform-specific tasks.
The entry mechanism is a System Management Interrupt (SMI), a special interrupt signal driven by the chipset (today, the platform controller hub or SoC). On receipt, the CPU saves its full state into a region of memory called SMRAM (whose physical location is configured at boot and locked thereafter), switches to SMM, and starts executing the SMI handler that firmware placed there. SMM has its own address space, its own protection, and is not reachable by the OS in any normal way; the RSM (Resume from System Management Mode) instruction returns to the previous context.
SMRAM is hidden from the OS by chipset configuration: once the BIOS closes and locks SMRAM near the end of boot, the OS cannot read or write the region, even from ring 0. The locking is one-way until the next reset. This is the architectural basis on which SMM remains trustworthy even when the OS is fully compromised: an attacker with kernel privileges still cannot tamper with SMI handler code.
The practical uses of SMM include: legacy USB keyboard emulation as PS/2 (the firmware traps keyboard interrupts in SMM so a non-USB-aware OS sees a PS/2 device), thermal management (when temperature crosses a threshold, an SMI raises and the firmware adjusts fans), error logging (machine-check events sometimes route through SMM for platform-specific handling), and the recovery paths for various exotic hardware conditions. Modern hardware tries to minimize SMI duration because every SMI stalls the CPU for the entire handler's runtime, with no OS visibility; long SMIs cause unexplained latency spikes and break real-time guarantees.
From a security perspective, SMM is doubly important. SMI handlers run with full physical-memory access and effectively unrestricted hardware control, so a vulnerability in firmware can become a permanent rootkit. The 2009 Loic Duflot attacks and many subsequent exploits showed this in detail; modern firmware uses SMM lockbox and Intel Boot Guard / AMD Platform Secure Boot mechanisms to restrict who can write to SMRAM and to verify the firmware itself at reset. We will revisit firmware and the boot chain in Chapter 47.
A related, slightly less hidden mode is the Intel Management Engine (ME) and AMD's Platform Security Processor (PSP) — dedicated co-processors on the same package as the main CPU, running their own firmware and providing services like remote management (Intel AMT), DRM, and Trusted Platform Module emulation. These run independently of the main CPU and outside its execution environment entirely; they are out of scope for this chapter but worth being aware of.
15. Performance Monitoring
Every modern x86-64 chip has a rich performance-monitoring unit (PMU): dozens of programmable counters, fixed-function counters, and event types. Software (perf, VTune, AMD uProf) configures the PMU to count specific events (cache misses, branch mispredicts, retired instructions, port utilization, etc.) and read the counters.
The counters are accessed via MSRs, generally privileged but exposed through the OS.
Intel and AMD both define a small set of architectural performance events whose meanings are stable across generations (cycles, instructions retired, branch instructions, branch mispredictions, last-level-cache references, last-level-cache misses), accessed through IA32_PERF_GLOBAL_* and per-counter MSRs. A much larger set of non-architectural events varies by generation and is documented in vendor-specific manuals (Intel SDM Volume 3 chapter 19 and AMD's PPR documents). The RDPMC instruction reads counter values from user mode when the OS has enabled it; otherwise the OS reads the counters via RDMSR and exposes them through interfaces like Linux's perf_event_open and Windows' performance-counter API.
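As an illustration of what configuring a counter means, the Python sketch below builds the value for one of Intel's programmable event-select MSRs (IA32_PERFEVTSELx) from an event code and unit mask. The event encodings are Intel's architectural ones; AMD uses its own event numbers. The `evtsel` helper is a made-up name, and only the most common control bits are modeled.

```python
# (event select, unit mask) for Intel's architectural events.
ARCH_EVENTS = {
    "cycles": (0x3C, 0x00),
    "instructions": (0xC0, 0x00),
    "llc_references": (0x2E, 0x4F),
    "llc_misses": (0x2E, 0x41),
    "branches": (0xC4, 0x00),
    "branch_misses": (0xC5, 0x00),
}

EVTSEL_USR = 1 << 16   # count while in user mode
EVTSEL_OS = 1 << 17    # count while in kernel mode
EVTSEL_EN = 1 << 22    # enable the counter


def evtsel(event: int, umask: int, usr: bool = True, os: bool = True) -> int:
    """Build an event-select value with the counter enabled."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | EVTSEL_EN
    if usr:
        value |= EVTSEL_USR
    if os:
        value |= EVTSEL_OS
    return value
```

Counting last-level-cache misses in both user and kernel mode, for instance, corresponds to the value 0x43412E. In practice the OS performs this write on behalf of tools such as perf.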
A complementary facility is Last Branch Records (LBRs) on Intel and Branch Sampling on AMD: a small per-core ring buffer that records the last 16 to 32 taken branches, capturing source, target, and timing. Profilers use LBRs to reconstruct call paths from samples without walking the stack at runtime, which is essential for sampling-based profiling on long-running production workloads. Intel's Processor Trace (PT) goes further, emitting a compressed control-flow trace through dedicated hardware that can be decoded offline to reconstruct every executed instruction with cycle-level timing. We will return to performance analysis as a discipline in Chapter 54.
17. Summary
x86-64's system architecture is dense. Long mode replaces segmentation with paging — 4-level (or 5-level) page tables with 4 KiB, 2 MiB, or 1 GiB pages, NX and various security bits, PCID-tagged TLBs. The descriptor tables (GDT, IDT, TSS) are still present but simplified, the IDT being the most active in normal operation. Exceptions and hardware interrupts are vectored through the IDT; the local APIC delivers IPIs and most modern device interrupts. System calls use the fast SYSCALL/SYSRET mechanism with kernel state set up via MSRs and gs-base swapping.
Control registers and MSRs configure dozens of subsystems: paging, security, virtualization, performance, power management. The boot sequence walks from real-mode firmware through protected and long modes into the OS kernel, brings up other cores via APIC IPIs, and hands off to user space. Virtualization extensions add a second privilege tier for hypervisors, with VMCS-driven context switches and second-level page tables.
This is the substrate every x86-64 OS sits on. The next chapter goes back to the application side and looks at floating-point and SIMD, where x86-64's most performance-relevant instructions live.