x86-64 Programming Model
May 16, 2026·18 min read·advanced
This chapter is the programmer's-eye view of x86-64. The previous chapter sketched the ISA's history and structural shape; this one develops what the programmer actually sees: how instructions are…
This chapter is the programmer's-eye view of x86-64. The previous chapter sketched the ISA's history and structural shape; this one develops what the programmer actually sees: how instructions are categorized, how operands are specified, what idioms compilers emit, and how a typical program's machine code is organized. The treatment is concrete and example-driven, on the assumption that x86-64 is best learned by reading and writing it.
We will use AT&T syntax for some examples and Intel syntax for others, since both appear in real-world tools (GCC and the Linux kernel use AT&T; Windows tools and the Intel manuals use Intel). The key visual difference: AT&T puts the destination on the right (movq %rax, %rbx means rbx ← rax), while Intel puts the destination on the left (mov rbx, rax means rbx ← rax). Operands in AT&T have % register prefixes and $ immediate prefixes; Intel has none. Both are common; we will note which is which.
01. Instruction Categories
x86-64's many hundreds of instructions cluster into a handful of categories.
Data Movement
MOV. The basic load/store/copy. Many forms:
| mov rax, rbx ; register to register | |
| mov rax, [rbx] ; load from memory | |
| mov [rbx], rax ; store to memory | |
| mov rax, 0x1234 ; immediate to register | |
| mov rax, [rip + label] ; PC-relative load |
MOV is single-cycle on the back end and is often eliminated entirely at rename (Chapter 27).
PUSH / POP. Push or pop a 64-bit value on the stack:
| push rax ; rsp -= 8; [rsp] = rax | |
| pop rax ; rax = [rsp]; rsp += 8 |
LEA. Load Effective Address. Computes an address without dereferencing:
| lea rax, [rbx + rcx*4 + 16] ; rax = rbx + rcx*4 + 16 |
MOVZX / MOVSX. Move with zero or sign extension:
| movzx rax, byte ptr [rbx] ; load byte, zero-extend to 64 bits | |
| movsx rax, word ptr [rbx] ; load 16-bit, sign-extend to 64 bits |
CMOVcc. Conditional move; copies only if the condition is true. Avoids a branch:
| cmp rax, rbx | |
| cmovl rax, rcx ; if rax < rbx (signed), rax = rcx |
CMOVcc is widely used by compilers in branchless code. Modern OoO cores execute it as a normal data dependency without a branch.
XCHG. Atomic exchange. With a memory operand, it has implicit lock semantics — the bus is locked for the duration. Common in older lock implementations, mostly superseded by CMPXCHG.
Arithmetic and Logic
ADD, SUB, INC, DEC, NEG. Integer arithmetic. All set flags.
| add rax, rbx ; rax = rax + rbx | |
| sub rax, 1 ; rax = rax - 1 (or use 'dec rax') | |
| inc rax ; rax = rax + 1; sets ZF, SF, OF, but not CF | |
| neg rax ; rax = -rax |
MUL, IMUL. Multiplication. MUL is unsigned; IMUL is signed. The two-operand and three-operand forms of IMUL are common:
| imul rax, rbx ; rax = rax * rbx (signed, low 64 bits) | |
| imul rax, rbx, 7 ; rax = rbx * 7 (signed, low 64 bits) | |
| mul rbx ; rdx:rax = rax * rbx (unsigned, 128-bit result) |
The single-operand form (MUL rbx) writes a double-width result into rdx:rax, useful for big-integer arithmetic.
DIV, IDIV. Division. The single-operand form takes rdx:rax as the 128-bit dividend and produces a 64-bit quotient (rax) and remainder (rdx).
| xor rdx, rdx ; clear high half of dividend | |
| div rbx ; rax = rdx:rax / rbx; rdx = remainder |
Division is the slowest common arithmetic operation: 20-100 cycles depending on operand size and value, and not pipelined or only partially pipelined.
AND, OR, XOR, NOT. Bitwise logic.
| xor rax, rax ; rax = 0 (a common zeroing idiom) | |
| and rax, 0xff ; rax = low 8 bits |
The xor reg, reg zeroing idiom is recognized by every modern x86 decoder as a constant-zero generator with no input dependencies — a special case that avoids serializing on the prior value of the register.
SHL, SHR, SAR, ROL, ROR, RCL, RCR. Shifts and rotates.
| shl rax, 3 ; rax <<= 3 (multiply by 8) | |
| sar rax, 2 ; rax >>= 2 (signed right shift, divide by 4) |
SHR is logical (zero-fill); SAR is arithmetic (sign-fill).
BT, BTS, BTR, BTC. Bit test, set, reset, complement. Used for individual bit manipulation in flags or bitmaps.
POPCNT, LZCNT, TZCNT. Population count, count leading zeros, count trailing zeros. Useful for bit-twiddling.
BMI1/BMI2. Various advanced bit-manipulation: BLSR (clear lowest bit), PDEP/PEXT (parallel deposit/extract), SHLX/SHRX (shift without modifying flags), MULX (multiply without flags), and others.
Comparisons and Tests
CMP. Subtract without storing the result; sets flags based on the difference.
| cmp rax, rbx ; compute rax - rbx; set ZF, SF, OF, CF; discard result |
Followed by a conditional branch or conditional move that reads the flags.
TEST. AND without storing; sets flags based on the AND.
| test rax, rax ; sets ZF if rax is zero, SF based on high bit | |
| jz skip ; jump if rax was zero |
The test reg, reg pattern for "is this register zero?" is universal. It is one byte shorter than cmp reg, 0 and avoids the immediate.
Control Flow
JMP. Unconditional jump. Direct (immediate target) or indirect (register/memory target).
| jmp label ; direct | |
| jmp rax ; indirect through register | |
| jmp [rax] ; indirect through memory |
Jcc. Conditional jump. The "cc" is a condition code:
| Mnemonic | Meaning | Flags |
|---|---|---|
| JE / JZ | equal / zero | ZF=1 |
| JNE / JNZ | not equal | ZF=0 |
| JL / JNGE | less (signed) | SF≠OF |
| JLE / JNG | less or equal (signed) | ZF=1 or SF≠OF |
| JG / JNLE | greater (signed) | ZF=0 and SF=OF |
| JGE / JNL | greater or equal (signed) | SF=OF |
| JB / JC / JNAE | below (unsigned) | CF=1 |
| JA / JNBE | above (unsigned) | CF=0 and ZF=0 |
| JS | sign (negative) | SF=1 |
| JO | overflow | OF=1 |
Sixteen conditions in total. Compilers pick the right one based on the comparison's signedness and operator.
CALL / RET. Function call and return.
| call function ; push next-instr address, jump to function | |
| ret ; pop return address, jump to it |
LOOP. Decrement rcx and jump if not zero. Mostly historical; modern compilers prefer dec rcx; jnz.
SETcc. Set a byte to 1 or 0 based on a condition; useful for materializing booleans:
| cmp rax, rbx | |
| setl al ; al = 1 if rax < rbx else 0 |
String Operations
x86 has built-in instructions for memcpy/memset/strcmp-like loops:
MOVS(with rep prefix): copy memory.STOS(with rep prefix): fill memory with a value.CMPS(with repe/repne prefix): compare memory.SCAS(with repe/repne prefix): scan memory.
| mov rcx, length | |
| mov rdi, dest | |
| mov rsi, source | |
| rep movsb ; copy rcx bytes from rsi to rdi |
These instructions were classic on early x86. Modern implementations have enhanced REP MOVSB / STOSB, special microcoded paths that can be faster than scalar copy loops on large buffers. The compiler's memcpy may use them. For small or fixed-size copies, vectorized SSE/AVX loads/stores are usually faster.
Atomic and Synchronization
LOCK (prefix). Makes the next instruction atomic with respect to other cores. Adds bus-locking semantics for read-modify-write instructions.
| lock add [rax], 1 ; atomically increment [rax] | |
| lock cmpxchg [rax], rbx ; compare-and-swap |
The CAS form (lock cmpxchg) is the cornerstone of lock-free programming on x86: it atomically compares the destination with rax, and if equal, replaces it with the source operand.
PAUSE. A hint instruction used in spin-wait loops: tells the processor that this is a busy-wait, allowing it to back off and reduce wasted execution slots.
MFENCE / LFENCE / SFENCE. Memory fences. MFENCE is a full barrier; SFENCE orders stores; LFENCE orders loads (and acts as a speculation barrier in some contexts).
XACQUIRE / XRELEASE and the TSX family — historical hardware transactional memory; mostly disabled in current chips due to errata.
System Instructions
A small number of privileged or semi-privileged instructions: SYSCALL, SYSRET, RDTSC (read timestamp counter), RDPMC (read performance counter), CPUID (query feature flags), RDMSR, WRMSR, and various others. We will see these in Chapter 34.
02. Calling Conventions
A calling convention specifies how function arguments and return values are passed, and which registers are caller- vs. callee-saved.
System V AMD64 ABI (Linux, macOS, BSD)
The dominant convention on Unix-like systems.
Arguments. First six integer/pointer arguments in registers, in order:
| rdi, rsi, rdx, rcx, r8, r9 |
First eight floating-point arguments in xmm0-xmm7. Additional arguments on the stack.
Return value. Integer/pointer returns in rax (with rdx for second 64 bits if 128-bit return). FP returns in xmm0.
Caller-saved (volatile). rax, rcx, rdx, rsi, rdi, r8-r11, and all xmm registers. The caller must save these before a call if it needs them after.
Callee-saved (non-volatile). rbx, rbp, r12-r15. The callee must preserve these or save and restore them.
Stack. Grows downward. rsp must be aligned to 16 bytes at the call instruction (so 8 mod 16 inside the callee, before pushing rbp). The "red zone" — 128 bytes below rsp — can be used by leaf functions without adjusting rsp.
A typical function prologue/epilogue:
| function: | |
| push rbp | |
| mov rbp, rsp | |
| sub rsp, 32 ; allocate locals | |
| ; ... body ... | |
| leave ; mov rsp, rbp; pop rbp | |
| ret |
Modern compilers often skip the rbp save and use [rsp + offset] directly, freeing rbp as a general-purpose register. The push rbp / mov rbp, rsp pair is preserved when frame pointers are needed for debugging or unwinding.
Microsoft x64 (Windows)
Windows uses a different convention.
Arguments. First four integer/pointer arguments in: rcx, rdx, r8, r9. First four FP in xmm0-xmm3. Additional arguments on stack. The caller reserves "shadow space" (32 bytes) on the stack for the callee to spill those four register arguments to.
Return value. Integer in rax. FP in xmm0.
Caller-saved. rax, rcx, rdx, r8-r11, xmm0-xmm5.
Callee-saved. rbx, rbp, rdi, rsi, r12-r15, xmm6-xmm15.
The two conventions are incompatible. Code that crosses the boundary (e.g., calling a Windows DLL from Unix-style code, or vice versa) must adapt. Cross-platform libraries often have function-pointer wrappers that translate.
03. Common Idioms and Patterns
A few idioms appear in nearly every compiled binary.
Zeroing a Register
| xor eax, eax ; rax = 0 (zeroing idiom) |
The xor reg, reg form is recognized by the decoder as a zeroing idiom: it produces zero with no input dependency. Using mov rax, 0 would work but is longer (7 bytes vs. 2) and creates a dependency on the immediate.
Note: xor eax, eax zeros all 64 bits because writes to the 32-bit name zero-extend the upper 32 bits.
Negation
| neg rax ; rax = -rax | |
| not rax ; rax = ~rax (bitwise NOT) |
Absolute Value (branchless)
| mov rdx, rax | |
| sar rdx, 63 ; rdx = sign-extended (-1 if negative, 0 if non-negative) | |
| xor rax, rdx ; flip bits if negative | |
| sub rax, rdx ; add 1 if negative |
Compilers emit this trick for abs(x) on signed ints.
Branchless Min/Max
| cmp rax, rbx | |
| cmovg rax, rbx ; rax = max(rax, rbx) |
A conditional move avoids a branch and its potential mispredict cost.
Multiply by Constant
For multiplication by small constants, compilers often use LEA and shifts:
| ; rax * 5 | |
| lea rax, [rax + rax*4] ; rax = rax + 4*rax = 5*rax |
For non-trivial divisions by constants, compilers use the Magic Number or Reciprocal Multiplication technique: replace x / c with a multiplication by a precomputed constant followed by a shift. Vastly faster than DIV.
Stack Probing
Functions with large local variables sometimes "probe" the stack — touch each page — to avoid skipping the guard page. Windows and Linux differ in conventions here.
Tail Calls
A tail call (last operation in a function is a call to another function) can be optimized to a jump:
| ; Instead of: call other_func; ret | |
| jmp other_func |
The other function returns directly to the original caller. Saves a stack frame and a return.
04. Compiler Output Walk-Through
Consider this small C function:
| int sum_array(const int* a, size_t n) { | |
| int s = 0; | |
| for (size_t i = 0; i < n; i++) | |
| s += a[i]; | |
| return s; | |
| } |
A modern compiler with -O2 for x86-64 SysV ABI might produce:
sum_array:
test rsi, rsi ; n == 0?
je .return_zero
xor eax, eax ; s = 0; clear rcx for loop counter
xor ecx, ecx ; i = 0
.loop:
add eax, [rdi + rcx*4] ; s += a[i]
inc rcx
cmp rcx, rsi
jb .loop
ret
.return_zero:
xor eax, eax ; s = 0
retNotice:
- Argument
ais in rdi,nis in rsi (SysV). - Return value is in rax (eax).
- The accumulator s is in eax for narrowness; the upper 32 bits of rax are zeroed by writes to eax.
- The loop is tight: 4 instructions (add, inc, cmp, jb).
-O3or-O2 -ftree-vectorizewould vectorize this with SSE/AVX, processing 4 or 8 ints per iteration.
Reading compiler output (e.g., via gcc -O2 -S -masm=intel) is one of the best ways to learn x86-64. The compiler's idioms reveal both the ISA's strengths and the optimizer's tricks.
05. Position-Independent Code
Modern executables and shared libraries are position-independent (PIC/PIE): they can be loaded at any virtual address. x86-64 makes this easy via RIP-relative addressing.
Function call (within the same module):
| call function ; encoded as 32-bit RIP-relative offset |
Global variable access (within the same module):
| mov rax, [rip + global_var] |
Function call (across modules, through PLT):
| call function@plt ; jumps to PLT stub, which goes via GOT to the real address |
Global variable access (across modules, through GOT):
| mov rax, [rip + var@GOTPCREL] ; load var's address from GOT | |
| mov rbx, [rax] ; deref |
The GOT (Global Offset Table) and PLT (Procedure Linkage Table) are runtime structures the dynamic linker fills in when loading the binary.
Pre-x86-64 (i.e., 32-bit x86) PIC was much more painful: each function had to compute its own address and use it as a base for relative references. RIP-relative addressing in x86-64 made shared libraries about 30% faster on i386 → x86-64 ports.
06. Thread-Local Storage
Per-thread variables (declared __thread in C, thread_local in C++) are allocated in a TLS region. x86-64 uses the fs or gs segment register's hidden base to address this region:
| mov rax, fs:[var@TPOFF] ; access thread-local 'var' |
The OS (or thread library) sets the fs base for each thread to point at that thread's TLS area. Linux uses fs for user-mode TLS; Windows uses gs. The kernel uses gs for per-CPU data.
This is one of the few places where x86's segmentation has survived in long mode. The fs/gs base addresses are stored in MSRs (FS_BASE, GS_BASE) and can be read or written via the RDFSBASE/WRFSBASE/RDGSBASE/WRGSBASE instructions (added late in x86-64's history; previously you had to use a syscall).
07. Vector Code (Quick Glimpse)
Floating-point and SIMD have their own chapter (35), but here's a taste of how they appear in compiled code.
Scalar double-precision FP:
| movsd xmm0, qword ptr [rdi] ; load double from [rdi] into xmm0 | |
| addsd xmm0, qword ptr [rsi] ; add double from [rsi] | |
| movsd qword ptr [rdx], xmm0 ; store double to [rdx] |
Vectorized addition of 4 doubles:
| vmovapd ymm0, [rdi] ; load 4 doubles (256 bits) from [rdi] | |
| vaddpd ymm0, ymm0, [rsi] ; add 4 doubles from [rsi] | |
| vmovapd [rdx], ymm0 ; store 4 doubles |
Compilers auto-vectorize loops when alignment, dependence, and size allow. Programmers can use intrinsics (_mm256_add_pd and similar) for explicit control.
08. String and REP-Prefixed Instructions
A distinctive corner of the integer ISA is the string instructions and their REP prefixes. These predate the modern era — they go back to the 8086 — but they remain in active use because modern hardware optimizes them aggressively.
The basic string instructions operate on memory addressed by rsi (source) and rdi (destination), advancing or retreating the index registers by the operand size:
MOVS{B,W,D,Q}— copy[rsi]to[rdi], advance both.STOS{B,W,D,Q}— storeal/ax/eax/raxto[rdi], advance.LODS{B,W,D,Q}— load[rsi]intoal/ax/eax/rax, advance.CMPS{B,W,D,Q}— compare[rsi]to[rdi], set flags, advance both.SCAS{B,W,D,Q}— compareal/ax/eax/raxto[rdi], set flags, advance.
The direction (advance vs. retreat) is governed by the direction flag DF: 0 advances forward, 1 retreats. CLD clears it; STD sets it. Most code keeps DF clear; the SysV ABI requires it on function entry.
The REP/REPE/REPNE prefixes turn each into a loop:
| mov rcx, rdx ; count | |
| rep movsb ; copy rdx bytes from rsi to rdi |
This single line copies a buffer. On a modern Intel or AMD core, fast string operations (FSO) and enhanced REP MOVSB (ERMS, plus newer FSRM — Fast Short REP MOV — and FZRM) make rep movsb competitive with hand-tuned SIMD memcpy for medium-to-large copies; the microcode dispatches to specialized hardware that streams data through wide load/store paths, automatically handling alignment and prefetching. Both glibc and the Microsoft CRT use rep movsb as the inner loop of memcpy on processors that report ERMS in CPUID.
The equivalent rep stosb is the standard memset inner loop. repne scasb searches for a byte in memory (the heart of strlen, although modern implementations use SSE/AVX comparisons that are faster on long strings). repe cmpsb is the heart of memcmp.
A related family is the string-comparison SIMD instructions of SSE 4.2: PCMPESTRI, PCMPISTRI, PCMPESTRM, PCMPISTRM. Each takes two 16-byte vectors and compares them according to a small immediate-encoded operation (find any matching byte, find any byte in a range, find a substring), returning the position of the first match in rcx or a mask in xmm0. These instructions are used in some strchr/strstr/strspn implementations and in JSON and HTML parsers; the win over byte-by-byte loops on long strings is substantial.
The broader pattern is that x86's most ancient instructions have been kept alive and made fast because the use cases (string copies, comparisons, fills) are so common that microcoding them into highly-tuned hardware paths pays off. A simple rep movsb today executes at 32 or 64 bytes per cycle on modern cores, far better than a software loop could achieve.
09. Privileged vs. Unprivileged
User-mode code (ring 3) can run nearly all of the integer, FP, and SIMD instructions described above. A few instructions are privileged:
- Anything that loads or modifies control registers (cr0-cr8).
- I/O instructions (
IN,OUT) when the I/O privilege level is restrictive. HLT,INVD,WBINVD,INVLPG,LGDT,LIDT,LTR, etc.RDMSR,WRMSR(some MSRs are accessible from user mode but most are not).CLI,STI(interrupt flag — sometimes user-accessible if I/O privilege allows).SYSCALL,SYSRETare designed for transitions; user mode uses SYSCALL but never SYSRET.
User mode encountering a privileged instruction triggers a #GP (general protection) exception, which the OS converts to a SIGSEGV or equivalent.
10. Practical Tools
A few tools every x86-64 programmer should know:
objdump -d -M intel binary — disassemble a binary in Intel syntax.
gcc -S -O2 -masm=intel src.c — produce assembly from C.
Godbolt Compiler Explorer (godbolt.org) — interactive web tool to see compiler output for many languages and compilers. Indispensable for learning.
perf annotate (Linux) — annotate binaries with sample counts from a perf record profile, showing hot instructions.
Intel SDM Volume 2 — the canonical instruction reference. Heavy reading but the definitive source.
Felix Cloutier's x86 reference (felixcloutier.com/x86) — a more navigable web version of the SDM.
11. Summary
x86-64's programming model is a 16-register, two-operand integer ISA with rich addressing modes, RIP-relative addressing for position independence, integrated SIMD via SSE/AVX, and TSO memory ordering. Calling conventions (SysV on Unix, Microsoft on Windows) define how arguments, return values, and saved registers are passed. Common idioms — xor reg, reg for zeroing, branchless CMOVcc, LEA for arithmetic, RIP-relative for PIC — recur throughout compiled code.
Reading compiler output is the best way to internalize the ISA. Each idiom reveals a piece of how x86-64 fits together: the flag-driven branches, the implicit dependencies, the RISC-like internal structure beneath a CISC-like surface.
The next chapter steps up to the system level: paging and virtual memory, system calls and interrupts, control registers and MSRs, the boot sequence, and how the OS kernel wields all this machinery.