Part V·ISA Case Studies·Chapter 32 of 62


x86-64 Overview

May 16, 2026·21 min read·advanced

The x86 architecture is the most commercially important ISA in computing history. It started as Intel's 8086 in 1978, a 16-bit general-purpose microprocessor. Almost fifty years later, its descendant, x86-64, dominates personal computers, laptops, and servers. Billions of devices have shipped running x86 in some form. The ISA's history is a story of accreted complexity: every revision added new instructions, new modes, and new register sets while preserving backward compatibility with everything before. The result is one of the most baroque and fascinating ISAs in computing.

This chapter is an orientation. It tells the historical story of how x86 grew, identifies the major operating modes, and surveys the structural features (instruction encoding, register set, addressing modes) that define what x86-64 is. The next four chapters treat x86-64 in detail: the programmer-visible model (Chapter 33), the system-level architecture (Chapter 34), floating-point and SIMD (Chapter 35), and the micro-architecture of modern x86 implementations (Chapter 36).

01.A Short History

The earliest ancestor of x86-64 is the Intel 8086 (1978). It was a 16-bit processor with 20-bit physical addressing (1 MiB of memory), 8 general-purpose registers, and a small instruction set. It used segment registers to extend the 16-bit logical addresses to 20-bit physical addresses. The IBM PC (1981) used the closely-related 8088 (8-bit external bus), which is what first made x86 a mass-market success.
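The segment arithmetic is simple enough to model directly. The following is a minimal Python sketch, assuming the standard 8086 rule (the 16-bit segment value shifted left by four bits, plus the 16-bit offset, wrapped to 20 bits); the function name is illustrative and not taken from any library.

```python
def real_mode_physical(segment: int, offset: int) -> int:
    """Form an 8086 physical address from a 16-bit segment and a 16-bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # The segment is scaled by 16; the 8086's 20 address lines wrap the sum.
    return ((segment << 4) + offset) & 0xFFFFF


# Many segment:offset pairs alias the same physical byte.
print(hex(real_mode_physical(0xB800, 0x0000)))  # 0xb8000
print(hex(real_mode_physical(0xB000, 0x8000)))  # 0xb8000
```

The aliasing shown in the last two lines is one reason real-mode pointers were awkward to compare: two different segment:offset pairs can name the same byte.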

The 80286 (1982) added a 24-bit physical address space and a new protected mode with hardware memory protection, separate privilege levels, and segment-based virtual memory. Real mode (the 8086's mode) remained available for backward compatibility.

The 80386 (1985) was the first 32-bit x86: 32-bit registers, 32-bit linear addresses, and a 32-bit protected mode that retained segmentation. It also added paging on top of segmentation and a virtual-8086 mode for running real-mode programs inside protected mode. Floating-point still required a separate coprocessor, the 80387. The 386's basic structure, which we now call IA-32, became the foundation that all subsequent 32-bit x86 chips built on.

The 80486 (1989) integrated the FPU on the same die for the first time (standard on the 486DX), added an on-chip L1 cache, and introduced pipelined execution. The instruction set grew slightly but the architecture was largely the same.

The Pentium (1993) added superscalar execution (two parallel pipelines), a larger floating-point unit, and the new MMX SIMD instructions in a 1997 refresh.

The Pentium Pro (1995) introduced out-of-order execution and the µop-based internal architecture that all subsequent x86 cores would use. The decoder turns x86 instructions into RISC-like internal µops; the back end is essentially a RISC machine. From this point on, x86's micro-architecture diverges sharply from its architecture.

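The Pentium III (1999) added SSE, the first 128-bit floating-point SIMD. The Pentium 4 (2000) added SSE2 (double-precision floating-point and 128-bit integer SIMD) and brought the famously deep NetBurst pipeline, a design direction Intel ultimately abandoned. Its later Prescott revision (2004) added SSE3; SSSE3 arrived with the Core 2 (2006).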

The AMD Opteron (2003) was the first x86-64 processor. AMD, not Intel, designed the 64-bit extension to x86. Intel's own 64-bit project (Itanium) had failed in the desktop and server market, and Intel adopted AMD's extension under the name Intel 64 (originally EM64T).

x86-64 added:

64-bit registers (rax, rbx, etc., extending the 32-bit eax, ebx).
8 new general-purpose registers (r8 through r15).
64-bit virtual addresses (with implementation limits, currently 48 or 57 bits).
A 64-bit instruction pointer (rip), which also serves as the base for RIP-relative addressing.
A new long mode in which the processor runs 64-bit code, with a compatibility submode for running 32-bit code under a 64-bit OS.

Subsequent additions:

SSE4 (2007), refining SSE.
AVX (2011): 256-bit SIMD with three-operand encoding.
AVX2 (2013): integer SIMD widened to 256 bits.
BMI1/BMI2 (2013): bit-manipulation instructions.
TSX (2013): hardware transactional memory (later disabled on most cores due to errata and security issues).
AVX-512 (2016): 512-bit SIMD with 32 registers, masking, and embedded broadcasts.
AMX (announced 2020, first shipped 2023): tile-based matrix instructions.
APX (in development): more registers, three-operand integer instructions, and other improvements.

Each addition layered on top of the previous, preserving backward compatibility. The result is an instruction set with thousands of distinct opcodes spread across many encoding styles.

02.Architectural Modes

x86-64 supports several operating modes, each with different semantics. The mode the processor is in determines which features are available, what instructions mean, and how memory is addressed.

Real mode. The original 8086 mode. 16-bit registers, segment-based 20-bit addressing, no memory protection. Modern processors boot into real mode for compatibility with legacy BIOS, then quickly switch to a more capable mode.

Protected mode. Introduced in 16-bit form with the 286 and extended to 32 bits with the 386. In its 32-bit form it provides 32-bit linear addresses, four privilege levels (rings 0-3), and segment-based memory protection plus optional paging. Most 32-bit operating systems (32-bit editions of Windows XP and Windows 7, 32-bit Linux) ran in protected mode.

Virtual-8086 mode. A submode of protected mode that runs real-mode programs in a sandbox. Used by old MS-DOS programs running under Windows or DOS extenders.

System Management Mode (SMM). A high-privilege mode entered via a system management interrupt. Used by firmware for power management, hardware emulation, and other low-level tasks. The OS is unaware of SMM activity; the firmware handles transitions transparently.

Long mode. The 64-bit mode introduced with x86-64. It has two submodes: 64-bit submode (full 64-bit registers and addresses) and compatibility submode (32-bit code running under a 64-bit OS, with 32-bit registers but the 64-bit OS context). Long mode dropped some legacy features: segment bases and limits are mostly ignored (segments are flat), virtual-8086 mode is unavailable, and hardware task switching is no longer supported.

A modern x86-64 system spends nearly all its life in long mode. Real and protected modes are visited only briefly during boot. Programmers writing modern application code see only long mode's 64-bit submode (or, for legacy 32-bit applications, compatibility submode).

The rest of this chapter and Part V focus on long mode, especially 64-bit submode, which is what modern programs run.

03.Register Set

Long mode's general-purpose register set consists of sixteen 64-bit registers:

64-bit	32-bit (low half)	16-bit (low quarter)	8-bit (lowest byte)	Conventional Use
rax	eax	ax	al / ah	accumulator, return value
rbx	ebx	bx	bl / bh	base, callee-saved
rcx	ecx	cx	cl / ch	counter
rdx	edx	dx	dl / dh	data, multiplier high
rsi	esi	si	sil	source index
rdi	edi	di	dil	destination index
rbp	ebp	bp	bpl	base pointer (frame ptr)
rsp	esp	sp	spl	stack pointer
r8	r8d	r8w	r8b	(no historical name)
r9	r9d	r9w	r9b
r10	r10d	r10w	r10b
r11	r11d	r11w	r11b
r12	r12d	r12w	r12b
r13	r13d	r13w	r13b
r14	r14d	r14w	r14b
r15	r15d	r15w	r15b

Each register has a 64-bit name (rax) and partial-width names that access subsets. Operations on the 32-bit name (eax) zero-extend the result into the full 64-bit register; this is a deliberate choice that simplifies the rename hardware. Operations on the 16-bit and 8-bit names leave the upper bits unchanged, which can cause partial-register stalls and other micro-architectural issues.

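The ah, bh, ch, and dh registers are the high bytes of the 16-bit names; they exist only for the original four registers (rax, rbx, rcx, rdx). They cannot be used in an instruction that carries a REX prefix: with REX present, the same four encodings select spl, bpl, sil, and dil instead, which is how the low byte of every one of the 16 registers (including r8b through r15b) becomes addressable.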
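The partial-width write rules can be captured in a few lines. The following is a minimal Python model, assuming only the behavior described above (32-bit writes zero the upper half; 16-bit and 8-bit writes preserve everything they do not touch). The helper name `write_reg` and the starting value are hypothetical.

```python
MASK64 = (1 << 64) - 1


def write_reg(old: int, value: int, width: int, high_byte: bool = False) -> int:
    """Return the new 64-bit register contents after a partial-width write."""
    if width == 64:
        return value & MASK64
    if width == 32:
        return value & 0xFFFF_FFFF                 # upper 32 bits are zeroed
    if width == 16:
        return (old & ~0xFFFF & MASK64) | (value & 0xFFFF)
    if width == 8 and high_byte:                   # ah, bh, ch, dh: bits 15..8
        return (old & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)
    if width == 8:
        return (old & ~0xFF & MASK64) | (value & 0xFF)
    raise ValueError("width must be 8, 16, 32, or 64")


rax = 0x1122_3344_5566_7788                        # made-up starting value
print(hex(write_reg(rax, 0xAABBCCDD, 32)))         # mov eax, ... -> 0xaabbccdd
print(hex(write_reg(rax, 0xBEEF, 16)))             # mov ax, ...  -> 0x112233445566beef
print(hex(write_reg(rax, 0x99, 8, high_byte=True)))  # mov ah, ... -> 0x1122334455669988
```

The 16-bit and 8-bit cases need the old value as an input, which is exactly the dependency that produces partial-register stalls; the 32-bit case does not.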

In addition to the GPRs:

rip — the instruction pointer (the program counter). Not directly accessible as a normal register, but PC-relative addressing makes it visible.
rflags — the flags register, holding condition codes (ZF, CF, OF, SF, PF, AF) and various control bits.

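XMM/YMM/ZMM registers. The SSE, AVX, and AVX-512 vector registers: 16 XMM and 16 YMM registers in 64-bit mode, and 32 ZMM registers with AVX-512 (which also extends the XMM and YMM sets to 32). We will see these in Chapter 35.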

FPU stack registers (st(0) through st(7)). 80-bit floating-point registers from the x87 FPU, organized as a stack. Mostly obsolete; modern code uses SSE/AVX for FP.

Segment registers (cs, ss, ds, es, fs, gs). Used heavily in protected mode for segmentation. In long mode, cs, ss, ds, and es are mostly ignored (the system uses a flat address space). fs and gs survive as bases for thread-local storage and per-CPU data; their hidden base addresses, set via MSRs, are used in addressing.

Control registers (cr0, cr2, cr3, cr4, and cr8; the other numbers are reserved). Configuration of the processor: paging enable, protection enable, page-table base, etc. Privileged.

Debug registers (dr0 through dr7). Hardware breakpoints and watchpoints.

Model-specific registers (MSRs). A large set of 64-bit registers accessed via RDMSR and WRMSR for configuration of various features. The exact set varies by model.

The original 8086 had four 16-bit general registers (ax, bx, cx, dx) and four 16-bit index and pointer registers (si, di, bp, sp). The 386 widened them to 32 bits. x86-64 widened them again to 64 bits and added 8 more (r8-r15). The accumulated naming is what we live with today.

04.Instruction Encoding

x86 instructions are variable length, from 1 to 15 bytes. The encoding has several optional components, most of which are present only when needed:

Plain Text

[Legacy prefixes] [REX prefix] [Opcode (1-3 bytes)] [ModR/M] [SIB] [Displacement (0-4 bytes)] [Immediate (0-8 bytes)]

Legacy prefixes. Up to 4 single-byte prefixes that modify operation: operand-size override (66), address-size override (67), segment overrides (26, 2e, 36, 3e, 64, 65), repeat prefix (f2, f3), lock prefix (f0).

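REX prefix. A single byte (40 to 4f), recognized only in 64-bit mode, that sits immediately before the opcode. Its W bit selects 64-bit operand size, and its R, X, and B bits each extend one register field (ModR/M reg, SIB index, and ModR/M r/m or SIB base) by one bit so that it can address r8-r15 and the upper SIMD registers. It is required for most instructions that use 64-bit operand size (a few, such as PUSH and near CALL, default to 64 bits) and for any instruction that names one of the new registers or the byte registers spl, bpl, sil, and dil.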

VEX and EVEX prefixes. Multi-byte prefixes used by AVX (VEX, two or three bytes) and AVX-512 (EVEX, four bytes). They subsume the REX prefix's role and add encoding for vector length, an extra source register, mask register, and other vector-specific fields. Cleaner than the legacy + REX combination; introduced because AVX needed too many fields to fit in the legacy encoding.

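Opcode. 1, 2, or 3 bytes specifying the operation. Some opcodes embed a register field in their low three bits (e.g., the eight 1-byte forms 50 through 57, PUSH rax through PUSH rdi); others use later bytes for register encoding. The 1-byte INC and DEC forms of 32-bit x86 (40 through 4f) do not exist in 64-bit mode, because those byte values became the REX prefixes.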

ModR/M. A single byte that specifies addressing mode and register operands. The byte is split into three fields: mod (2 bits, addressing mode), reg (3 bits, register), r/m (3 bits, register or memory). Combined with REX prefix bits, this gives access to all 16 GPRs.

SIB (scale-index-base). A second byte present when ModR/M selects an indexed memory addressing mode. SIB encodes the scale (1, 2, 4, or 8), index register, and base register for addresses of the form base + index*scale.

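Displacement. 0, 1, 2, or 4 bytes added to the computed address (2 bytes only with 16-bit addressing). The size is determined by the addressing mode encoded in ModR/M. The sole 8-byte case is the absolute-address (moffs) form of MOV to or from the accumulator, which has no ModR/M byte.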

Immediate. 0, 1, 2, 4, or 8 bytes of constant operand. Used by instructions that operate on a register and a constant.

The result: a single instruction can range from 1 byte (RET, NOP) to 15 bytes (a full MOV reg64, imm64 with prefixes). The decoder's job is to walk the byte stream, identify boundaries, and decode each instruction's components. We saw the implementation challenges in Chapter 27.
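To make the field layout concrete, here is a small Python sketch that decodes one narrow case: a REX.W-prefixed ADD r/m64, r64 (opcode 01 /r) with both operands in registers. It is an illustration of how the REX and ModR/M bits combine, not a general decoder; the function name is hypothetical, and prefixes, memory operands, SIB, displacement, and immediates are all deliberately out of scope.

```python
# Register numbering as used by the reg and r/m fields (0..15 once REX extends them).
GPR64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_add_rr(code: bytes) -> str:
    """Decode REX.W + 01 /r (ADD r/m64, r64), register-direct form only."""
    rex, opcode, modrm = code                  # exactly three bytes expected
    if rex & 0xF0 != 0x40:
        raise ValueError("expected a REX prefix (0x40-0x4f)")
    w, r, b = (rex >> 3) & 1, (rex >> 2) & 1, rex & 1   # X (bit 1) is unused without SIB
    if opcode != 0x01 or not w:
        raise ValueError("only REX.W + 01 /r is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b11:
        raise ValueError("memory operands are out of scope here")
    # REX.R extends the reg field and REX.B extends r/m to four bits each.
    src = GPR64[(r << 3) | reg]
    dst = GPR64[(b << 3) | rm]
    return f"add {dst}, {src}"


print(decode_add_rr(bytes.fromhex("4801d8")))  # add rax, rbx
print(decode_add_rr(bytes.fromhex("4d01c8")))  # add r8, r9
```

The two examples differ only in the REX byte and three ModR/M bits, which shows how little of the encoding had to change to reach the new registers.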

05.Addressing Modes

x86-64 has a rich set of memory addressing modes, generally of the form:

$\text{effective address} = \text{base} + \text{index} \times \text{scale} + \text{displacement}$

Each component is optional. Common forms:

[rax] — register indirect.
[rax + 8] — base + displacement.
[rax + rbx] — base + index.
[rax + rbx*4] — base + scaled index.
[rax + rbx*8 + 16] — full form.
[rip + label] — RIP-relative (PC-relative addressing). Essential for position-independent code in long mode.

Scale must be 1, 2, 4, or 8. Index register can be any GPR except rsp (rsp's encoding in the SIB index field means "no index"). Base register can be any GPR.
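The formula and its constraints translate directly into code. A minimal Python sketch, assuming 64-bit wraparound of the sum; the function name and the register values are hypothetical:

```python
MASK64 = (1 << 64) - 1


def effective_address(base: int = 0, index: int = 0, scale: int = 1, disp: int = 0) -> int:
    """Compute base + index*scale + disp with 64-bit wraparound."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK64


regs = {"rax": 0x1000, "rbx": 3}               # made-up register contents
# [rax + rbx*8 + 16]
print(hex(effective_address(regs["rax"], regs["rbx"], 8, 16)))  # 0x1028
```

Omitted components simply default to zero (or a scale of 1), which mirrors how each term of the hardware formula is optional.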

RIP-relative addressing is a long-mode addition. In 32-bit protected mode, code references absolute addresses, which is awkward for shared libraries. Long mode added RIP-relative addressing for nearly all instructions, making position-independent code easy.
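One detail worth seeing in code is what "relative to rip" means: the 32-bit signed displacement is added to the address of the next instruction, not the current one. The sketch below assumes an instruction whose final four bytes are the displacement (true when no immediate follows); the load address and displacement are made up, and the function name is hypothetical.

```python
import struct


def rip_relative_target(insn_addr: int, insn: bytes) -> int:
    """Resolve the address referenced by an instruction whose last four
    bytes are a signed 32-bit RIP-relative displacement."""
    (disp32,) = struct.unpack("<i", insn[-4:])
    next_rip = insn_addr + len(insn)           # rip points past this instruction
    return (next_rip + disp32) & ((1 << 64) - 1)


# 48 8b 05 disp32 encodes mov rax, [rip + disp32].
insn = bytes.fromhex("488b05") + struct.pack("<i", 0x2000)
print(hex(rip_relative_target(0x401000, insn)))  # 0x403007
```

Because only the distance between the instruction and its data is encoded, the same bytes work wherever the loader places the image, which is what makes the code position-independent.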

A famous instruction: LEA (Load Effective Address). Despite its name, it does not load anything from memory: it computes an effective address and writes the result to a register, just as if it were a memory access without the actual access. Useful for arithmetic:

Assembly

lea rax, [rbx + rcx*4 + 8] ; rax = rbx + rcx*4 + 8

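LEA is widely used by compilers as a fast multi-input integer addition that leaves the flags untouched and writes a destination distinct from its sources. Which execution resource it uses is implementation-specific: some cores run it on address-generation hardware, while many recent ones execute it on ordinary ALU ports, and the three-component form can have higher latency than the simpler forms.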

06.Two-Operand Form

x86 instructions traditionally take two operands, with the destination being one of the source operands:

Assembly

add rax, rbx ; rax = rax + rbx; rax is both source and dest

The destination overwrites one of the sources. This means that to compute c = a + b without overwriting a, the program must first move:

Assembly

mov rax, rbx       ; rax = rbx
add rax, rcx       ; rax = rax + rcx (which is rbx + rcx)

The two-operand form was fine in the 1980s but is awkward today. AVX (with its VEX prefix) introduced three-operand forms for SIMD instructions:

Assembly

vaddps ymm0, ymm1, ymm2 ; ymm0 = ymm1 + ymm2 (no overwrite)

x86-64 integer instructions, however, remain two-operand. The proposed APX extension (under development in 2024-2026) adds three-operand integer forms via a new prefix; this would close the gap, but APX is not yet shipped in mainstream CPUs.

The two-operand form is one reason modern cores implement move elimination (Chapter 27): mov instructions before two-operand operations are common, and the renamer can resolve many register-to-register moves at rename time without sending them to an execution unit.

07.Flags Register

The flags register (rflags) holds condition codes set by arithmetic and logical operations:

CF (carry): unsigned overflow.
PF (parity): parity of low byte of result.
AF (auxiliary carry): carry between bits 3 and 4 (used in BCD arithmetic).
ZF (zero): result was zero.
SF (sign): high bit of result was 1.
OF (overflow): signed overflow.
DF (direction): controls direction of string operations.
IF (interrupt enable): enables/disables interrupts.

Most arithmetic and logical instructions set the flags as a side effect. Conditional branches, conditional moves (CMOVcc), and conditional sets (SETcc) read them. The flags create implicit dependencies between adjacent instructions: a cmp followed by a branch, an add followed by a jc (jump if carry), and so on.

The flags register also holds various control bits used by the OS (interrupt flag, direction flag, I/O privilege level, alignment-check enable). User mode can read all flags but can only modify some.
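The six arithmetic flags can be computed from the operands and result of an addition. The following Python sketch models what ADD does to them at a given operand width; it follows the flag definitions listed above, and the function name is hypothetical.

```python
def add_flags(a: int, b: int, width: int = 64) -> dict:
    """Model the result and arithmetic flags of ADD at the given operand width."""
    mask = (1 << width) - 1
    a, b = a & mask, b & mask
    full = a + b
    res = full & mask
    sign = width - 1
    return {
        "result": res,
        "CF": int(full > mask),                           # carry out of the top bit
        "PF": int(bin(res & 0xFF).count("1") % 2 == 0),   # even parity of the low byte
        "AF": ((a ^ b ^ res) >> 4) & 1,                   # carry from bit 3 into bit 4
        "ZF": int(res == 0),
        "SF": (res >> sign) & 1,
        "OF": (((a ^ res) & (b ^ res)) >> sign) & 1,      # operands agree in sign, result differs
    }


signed_ovf = add_flags(0x7F, 0x01, width=8)      # 127 + 1: signed overflow, no carry
unsigned_ovf = add_flags(0xFF, 0x01, width=8)    # 255 + 1: carry, no signed overflow
print(signed_ovf["CF"], signed_ovf["OF"])        # 0 1
print(unsigned_ovf["CF"], unsigned_ovf["OF"])    # 1 0
```

The two sample additions show why both CF and OF exist: the same instruction serves signed and unsigned arithmetic, and the program picks the flag that matches its interpretation.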

08.Privilege Levels and Rings

x86 defines four privilege levels, called rings 0 through 3:

Ring 0 — most privileged. The OS kernel runs here.
Ring 1, 2 — intermediate. Rarely used; some hypervisors used them.
Ring 3 — least privileged. User-mode applications run here.

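Rings 1 and 2 are essentially abandoned in modern OSes. The kernel runs in ring 0; user code runs in ring 3. One reason is that paging distinguishes only supervisor (rings 0-2) from user (ring 3), so page-level protection cannot shield ring 0 data from code in rings 1 and 2. Hypervisors that use hardware virtualization run in what is informally called ring -1 (VMX root mode in Intel VT-x, host mode in AMD-V), which sits outside the four-ring hierarchy rather than inside it.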

Long mode preserves the four-ring model but in practice uses only rings 0 and 3.

Transitions between rings happen through specific instructions:

SYSCALL / SYSRET — fast user-to-kernel and back.
INT n — software interrupt; older mechanism.
SYSENTER / SYSEXIT — alternative fast mechanism (used in 32-bit mode mostly).

We will look at the system-call mechanism more in Chapter 34.

09.The x87 FPU

For historical reasons, x86 has two distinct floating-point models:

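x87 (1980s) — the original FPU. Eight 80-bit FP registers organized as a stack. Loads such as fld push onto the stack, and popping forms such as faddp combine the top two entries and pop one. The 80-bit extended-precision format carries a 64-bit significand, giving more precision than IEEE 754 double precision (53 bits).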

SSE/AVX — modern SIMD-based FP. 16 (or 32 in AVX-512) 128/256/512-bit registers, addressed directly. Single and double precision (no 80-bit support). The default for modern code.

Modern compilers emit SSE for scalar floating-point unless -mfpmath=387 or similar is specified. The x87 FPU is still implemented for backward compatibility but is a deprecated path. It remains fully available in long mode, but the 64-bit calling conventions pass floating-point values in SSE registers, so programs default to SSE.

The x87 FPU's stack-based design is unusual among mainstream instruction sets, nearly all of which use directly addressed registers. It was a clever way to encode FP operations in a small instruction set, but it makes register allocation hard for compilers and is mostly considered an artifact of late-1970s constraints.
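A toy model shows what stack-relative operands look like in practice. The Python class below is a sketch only: it models three instructions (fld, faddp, fmulp) with their default operands, uses Python floats (IEEE double) rather than the 80-bit format, and its class and method names are hypothetical.

```python
class X87Stack:
    """Toy model of the x87 register stack; st(0) is the top of the stack."""

    def __init__(self):
        self.regs = []                         # regs[-1] is st(0)

    def fld(self, value: float) -> None:       # push a value
        if len(self.regs) == 8:
            raise OverflowError("only st(0)..st(7) exist")
        self.regs.append(value)

    def faddp(self) -> None:                   # st(1) += st(0), then pop
        top = self.regs.pop()
        self.regs[-1] += top

    def fmulp(self) -> None:                   # st(1) *= st(0), then pop
        top = self.regs.pop()
        self.regs[-1] *= top

    def st(self, i: int) -> float:
        return self.regs[-1 - i]


# (1.5 + 2.5) * 3.0 in stack order.
fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.5)
fpu.faddp()
fpu.fld(3.0)
fpu.fmulp()
print(fpu.st(0))  # 12.0
```

Every push or pop renumbers the other entries (what was st(0) becomes st(1)), so a compiler has to track a moving frame of reference instead of fixed register names.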

10.Memory Model

x86 uses TSO (Total Store Ordering, Chapter 31). The memory model's ordering rules are:

Loads from one core are not reordered with that core's other loads.
Stores from one core are not reordered with that core's other stores.
Stores become visible to all other cores in a single order consistent with each core's program order.
The one relaxation: a core's load can pass its own older store to a different address (store-to-load reordering), because the store may still be waiting in the store buffer.

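Programs that need full sequential consistency insert MFENCE instructions (or use a LOCK-prefixed instruction, which has the same ordering effect) between the store and the later load. Most code does not need this; TSO is "strong enough" for typical lock-based and atomic-based programming idioms.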
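The classic store-buffering litmus test makes the one relaxation visible. The Python sketch below is a deliberately simplified model, not a description of any real core: each of two cores has a one-entry store buffer, "drain" is the moment a buffered store reaches shared memory, and a fence is modeled as "drain before the next load". It enumerates every interleaving and collects the possible outcomes. All names are hypothetical.

```python
from itertools import permutations


def outcomes(fenced: bool) -> set:
    """All (r0, r1) results of:  core 0: x = 1; r0 = y    core 1: y = 1; r1 = x"""
    events = [(core, ev) for core in (0, 1) for ev in ("store", "drain", "load")]
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        legal = all(
            pos[(c, "store")] < pos[(c, "load")]          # program order
            and pos[(c, "store")] < pos[(c, "drain")]     # a store drains after it issues
            and (not fenced or pos[(c, "drain")] < pos[(c, "load")])  # fence: drain first
            for c in (0, 1)
        )
        if not legal:
            continue
        mem, buf, regs = {"x": 0, "y": 0}, {}, {}
        for core, ev in order:
            mine, other = ("x", "y") if core == 0 else ("y", "x")
            if ev == "store":
                buf[core] = (mine, 1)                     # sits in the store buffer
            elif ev == "drain":
                var, val = buf.pop(core)
                mem[var] = val                            # now visible to the other core
            else:
                regs[core] = mem[other]                   # different address: no forwarding
        results.add((regs[0], regs[1]))
    return results


print(sorted(outcomes(fenced=False)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(fenced=True)))   # [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0), in which each core misses the other's store, appears only in the unfenced run; it is the result sequential consistency forbids and TSO allows.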

x86's TSO is actually one of the strongest memory models among modern architectures. ARM and RISC-V are weaker; programmers porting code from x86 to ARM often discover hidden dependencies on TSO that need explicit fences on the new platform.

11.Instruction Set Size

A modern x86-64 implementation supports:

The base 64-bit ISA (a few hundred core instructions).
x87 (about 80 instructions, mostly historical).
MMX (about 60 instructions, deprecated but still implemented).
SSE through SSE4.2 (about 200 instructions across versions).
AVX, AVX2 (about 250 additional instructions).
AVX-512 (varies by sub-extension, hundreds of additional instructions).
BMI1, BMI2, ADX, RDRAND, RDSEED (assorted bit-manipulation and crypto helpers).
VT-x, SVM (virtualization).
AES-NI, SHA, GFNI (cryptography acceleration).
AMX (matrix tile operations).
Various other small extensions.

The total count is well over 2000 distinct mnemonic forms, and many of those have multiple encoding variants. The instruction-set manual (Intel SDM Volume 2) is over 2,500 pages.

The complexity is one of the reasons the x86 front end is intricate. Decoding any instruction requires consulting (logically) a large table of formats; the decoder has to handle every legal combination of prefixes, opcode bytes, and addressing modes.

12.Compatibility and Persistence

x86's defining feature is its near-perfect backward compatibility. A binary compiled for the 80386 in 1985 still runs on a modern x86-64 chip in 32-bit compatibility mode. Real-mode 8086 code from 1981 runs in real mode at boot. The ISA has accumulated 45+ years of features, and almost nothing has been removed (the FPU's 80-bit format is still implemented; the segment registers still exist; the rep prefixes still work).

The cost of this compatibility is the ISA's complexity. Each preserved feature requires hardware to implement (or at least to honor). Each new feature has to coexist with the old. The decoder, the privileged-mode switching, the address-translation pipeline — all carry the weight of decades.

The benefit, of course, is that x86 software stays valuable. Operating systems, application binaries, and entire ecosystems built on x86 over decades continue to work. The vast Windows software base, the Linux distributions, the Steam game library, the entire enterprise software stack: all of it keeps working, generation after generation.

Newer ISAs (ARM, RISC-V) explicitly limit how much legacy they carry forward. They are simpler and (often) more efficient as a result. But x86's commercial success is rooted in the network effect of its software ecosystem, which is itself rooted in compatibility.

13.CPUID and Feature Discovery

With dozens of optional extensions and forty-plus years of accumulation, software running on x86-64 cannot assume any particular feature is present. The mechanism for asking is the CPUID instruction, introduced on late-486 and Pentium parts and extended in essentially every generation since.

CPUID takes a leaf number in EAX (and sometimes a sub-leaf in ECX) and returns up to four 32-bit values in EAX, EBX, ECX, EDX. Different leaves expose different facts: leaf 0 returns the maximum supported leaf and the vendor string (GenuineIntel, AuthenticAMD, etc.); leaf 1 returns the family/model/stepping plus a bitmap of common features (FPU, MMX, SSE, SSE2, MONITOR, ...); leaf 7 returns the more recent feature bits (AVX-512 sub-extensions, BMI, ADX, SHA, AMX); leaves 0x12 and 0x14 expose SGX and PT features; the extended leaves (0x80000000 and above) expose AMD-style and 64-bit-mode features. Several hundred feature bits are defined in total.
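Interpreting CPUID output is mostly bit-picking. The Python sketch below does not execute CPUID (Python has no direct way to); it assumes the register values have already been obtained by some other means and shows how the leaf-0 vendor string and a handful of leaf-1 feature bits are decoded. The bit positions are the architecturally documented ones; the function names, the dictionary labels, and the sample leaf-1 values are hypothetical.

```python
import struct

# A few leaf-1 feature bits (bit position within the register).
LEAF1_EDX = {"fpu": 0, "mmx": 23, "sse": 25, "sse2": 26}
LEAF1_ECX = {"sse3": 0, "monitor": 3, "ssse3": 9, "sse4_1": 19, "sse4_2": 20, "avx": 28}


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Leaf 0 returns the 12-byte vendor ID in EBX, EDX, ECX (in that order)."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def leaf1_features(ecx: int, edx: int) -> set:
    """Return the names of the known leaf-1 feature bits that are set."""
    found = {name for name, bit in LEAF1_EDX.items() if (edx >> bit) & 1}
    found |= {name for name, bit in LEAF1_ECX.items() if (ecx >> bit) & 1}
    return found


print(vendor_string(0x756E6547, 0x49656E69, 0x6C65746E))  # GenuineIntel

# Made-up leaf-1 values with FPU, SSE, SSE2, SSE3, and AVX set.
edx = (1 << 0) | (1 << 25) | (1 << 26)
ecx = (1 << 0) | (1 << 28)
print(sorted(leaf1_features(ecx, edx)))  # ['avx', 'fpu', 'sse', 'sse2', 'sse3']
```

Note the odd register order for the vendor string (EBX, EDX, ECX), a small quirk of the interface that trips up many first implementations.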

The Linux kernel's /proc/cpuinfo flags field is essentially a human-readable rendering of CPUID output; the cpuid and lscpu userland utilities expose more. Compilers consult CPUID through runtime function multi-versioning (__attribute__((target_clones(...))) in GCC) so that a single binary can ship multiple versions of a hot function and dispatch at startup to the one matching the host.

A closely related question is what to do when a feature is absent. Older code often assumed at compile time which extensions were present, producing binaries that crashed with #UD (invalid opcode) on older hardware. Modern compilers separate the baseline the binary requires (typically x86-64-v1 through x86-64-v4 psABI levels: v1 is original AMD64; v2 adds SSE3/SSSE3/SSE4; v3 adds AVX/AVX2/BMI/FMA; v4 adds AVX-512 baseline) from the optional dispatched paths. Linux distributions in 2024-2026 are increasingly shipping x86-64-v3-baseline binaries, leaving v1/v2 as fallback for very old hardware. The CPUID interface is the runtime mechanism that makes the strategy possible.
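The level scheme is cumulative, which makes classification a short loop. The Python sketch below assumes the feature sets listed in the x86-64 psABI for each level and takes an already-decoded set of feature names as input; the lowercase labels and the function name are this example's own, not an established API.

```python
# Features each level adds on top of the previous one.
LEVELS = [
    ("x86-64-v2", {"cmpxchg16b", "lahf_sahf", "popcnt", "sse3", "ssse3", "sse4_1", "sse4_2"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "lzcnt", "movbe", "osxsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def psabi_level(features: set) -> str:
    """Return the highest level whose requirements, and all lower ones, are met."""
    level = "x86-64-v1"                        # the original AMD64 baseline
    for name, required in LEVELS:
        if not required <= features:           # a missing feature stops the climb
            break
        level = name
    return level


v2 = LEVELS[0][1]
v3 = LEVELS[1][1]
print(psabi_level(v2))                 # x86-64-v2
print(psabi_level(v2 | v3))            # x86-64-v3
print(psabi_level(v3))                 # x86-64-v1 (AVX2 alone does not skip v2)
```

A loader or package manager can use this kind of check to pick the most specialized build of a library that the host can actually run.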

The instruction is also the vendor identification mechanism. Software that wants to special-case Intel versus AMD (a reasonable thing to do for performance tuning, an unreasonable one for compatibility) reads the leaf-0 vendor string. The mostly-irrelevant family/model/stepping triplet from leaf 1 is sometimes used to identify specific generations for the same purpose. Both Intel and AMD provide CPUID-feature whitepapers documenting the exact bit assignments for each generation.

14.Implementations and Manufacturers

x86-64 is implemented by several companies, primarily:

Intel. The original x86 designer. Major core families: P6 (Pentium Pro through Pentium III), Netburst (Pentium 4), Core (Core 2 onward), Core/Atom hybrids in recent generations. Modern Intel cores trace lineage to the P6 family.

AMD. The second-source-turned-rival. Major core families: K6, K7 (Athlon), K8 (Athlon 64, Opteron — first x86-64), K10, Bulldozer, Zen (Zen 1 through Zen 5). Modern AMD cores are based on the Zen lineage, redesigned from scratch in 2017.

Via, Centaur, Cyrix, Transmeta — historical alternatives, now largely defunct.

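Intel designs its chips and fabricates most of them in-house. AMD designs its chips but has been fabless since spinning off its fabs as GlobalFoundries in 2009; its leading CPU dies have been manufactured by TSMC since Zen 2 (2019). Microcode updates from both vendors can patch shipped chips.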

The two companies design independently, and their internal micro-architectures differ substantially. Both implement the same ISA; software written for either runs on both, provided it checks for optional extensions before using them. Performance and power characteristics differ (sometimes one is faster, sometimes the other), but compatibility is essentially perfect.

15.Looking Ahead

The next four chapters develop x86-64 in depth.

Chapter 33 covers the programmer-visible model: the registers, instruction categories, common idioms, addressing patterns, and how compilers use the ISA. This is the level at which an assembly programmer or compiler backend writer thinks.

Chapter 34 covers the system-level architecture: paging, virtual memory, system calls, interrupts and exceptions, control registers, MSRs, and the boot process. This is the level the operating system kernel works at.

Chapter 35 covers floating-point and SIMD: the x87 legacy, SSE through AVX-512, AMX, the math behind IEEE 754, and how SIMD code is written.

Chapter 36 covers the micro-architecture: how Intel and AMD have built fast x86-64 implementations, what the front-end and back-end pipelines look like, what makes modern x86 chips fast, and how their performance compares to ARM and RISC-V.

By the end of Part V, x86-64 should feel familiar, not just as a set of instructions but as a complete system, with its history, its quirks, and its strengths.

16.Summary

x86-64 is the dominant ISA on PCs and servers. Its history runs from the 1978 Intel 8086 through the 1985 i386 (which set the basic 32-bit shape) to the 2003 AMD Opteron (which extended the architecture to 64 bits) and to the present. Each generation added capabilities while preserving backward compatibility, leading to an ISA with multiple operating modes, an enormous instruction set, variable-length instruction encoding, and 16 general-purpose registers.

In long mode (64-bit operation), x86-64 has 16 GPRs, RIP-relative addressing, a flag register driving conditional branches, the x87 FPU and SSE/AVX SIMD, and a TSO memory model. Two-operand integer instructions, four privilege rings (mostly using only 0 and 3), and a baroque encoding tradition complete the picture. The next chapter looks at the programmer's view of all this in detail.
