AArch64 Programming Model
May 16, 2026·23 min read·advanced
This chapter is the AArch64 programmer's-eye view. Where Chapter 33 walked through x86-64 as an application programmer sees it, this chapter does the same for AArch64. The treatment is parallel and concrete, with assembly examples and references to compiler output. Comparisons to x86-64 are made along the way to highlight where the two ISAs differ in spirit.
AArch64 is the 64-bit execution state of ARMv8-A and ARMv9-A. We use "AArch64" and "ARM64" interchangeably in this chapter (both terms are common; Apple favors "ARM64", ARM Ltd. favors "AArch64"; Linux uses both).
The chapter covers: the register file, instruction encoding, addressing modes, instruction categories (data movement, arithmetic, logic, comparisons, control flow, atomics), calling conventions, and common compiler-emitted patterns. Privileged and system-level features are deferred to Chapter 39.
01. Register File
AArch64 has 31 general-purpose 64-bit registers, named x0 through x30, plus the dedicated stack pointer sp.
Each register has two sizes:
- xN — 64-bit access. Operations on xN read/write all 64 bits.
- wN — 32-bit access. Operations on wN read the low 32 bits; a write to wN zeroes the upper 32 bits of xN.
So wN is to xN what eax is to rax on x86-64: a 32-bit alias that zero-extends. There are no 16-bit or 8-bit register names; narrower data is handled with sized loads and stores (ldrb, ldrh, strb, strh) and with extend or bitfield instructions operating on the 32-bit registers.
By convention:
| Register | Role |
|---|---|
| x0-x7 | First 8 argument registers; x0-x1 hold return values |
| x8 | Indirect result location (large struct return); also syscall number on Linux |
| x9-x15 | Caller-saved (volatile / temporary) |
| x16, x17 | IP0, IP1 — intra-procedure call scratch (used by linker stubs) |
| x18 | Platform register (reserved on Apple platforms; TEB pointer on Windows; an ordinary temporary on Linux unless the platform reserves it) |
| x19-x28 | Callee-saved (preserved across calls) |
| x29 | Frame pointer (FP) |
| x30 | Link register (LR) — return address |
| sp | Stack pointer |
xzr / wzr is the zero register: a virtual register that always reads as 0 and discards writes. Most instructions that take a general-purpose register operand accept xzr/wzr in its place; the exceptions are the operand positions where register number 31 means sp, described next. This removes the need for a dedicated zeroing idiom such as x86's xor eax, eax: mov x0, xzr (or mov x0, #0) zeroes a register, and rename hardware typically recognizes the zero source.
The stack pointer is not x31. There is no register named x31; the register-field value 31 means either xzr or sp depending on the instruction context. Specifically:
- In most arithmetic/logical instructions (the shifted-register forms), register number 31 means xzr.
- As the base register of a load/store, and in add/sub with an immediate or extended-register operand, register number 31 means sp.
- A few instructions can use sp as a general operand.
This dual interpretation saves an encoding bit at the cost of slight complexity. In practice, programmers (and compilers) use the assembler mnemonics (xzr, sp) and don't worry about the encoding.
Program Counter (PC). AArch64 has an architectural PC, but it is not a general-purpose register (unlike AArch32, where the PC was r15). Instructions cannot directly read or write PC. PC-relative loads and the ADR/ADRP instructions are the way to get PC's value or compute PC-relative addresses.
FP/SIMD registers (V0-V31). 32 128-bit registers used by NEON SIMD and scalar floating-point. Each has multiple aliased forms:
- vN.16b, vN.8h, vN.4s, vN.2d: as a vector of 16 bytes, 8 halfwords, 4 words, or 2 doublewords (NEON).
- qN: full 128-bit access (e.g., for moves).
- dN: low 64 bits as a double-precision FP scalar.
- sN: low 32 bits as a single-precision FP scalar.
- hN: low 16 bits as a half-precision FP scalar.
- bN: low 8 bits.
Writes to a sub-width view (sN, dN, etc.) zero the upper bits of the 128-bit register. This is consistent zero-extension semantics: there is no per-element preservation surprise.
SVE registers (Z0-Z31, P0-P15). When SVE/SVE2 is implemented, the V registers extend to scalable Z registers (typically 128, 256, 512, 1024, or 2048 bits depending on the implementation), and there are 16 predicate registers P0-P15. We treat these in Chapter 40.
System registers. Hundreds of system control registers, accessed via MRS (move from system reg) and MSR (move to system reg). Examples: TPIDR_EL0 (thread pointer), MIDR_EL1 (CPU ID), PMCCNTR_EL0 (cycle counter), CNTVCT_EL0 (virtual counter). Most are privileged; some are accessible from EL0.
02. Condition Flags
AArch64 has a PSTATE register holding processor state, including the four condition flags:
- N (negative): result was negative.
- Z (zero): result was zero.
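- C (carry): carry out of an unsigned addition. For a subtraction or compare the sense is inverted relative to x86: C = 1 means no borrow occurred (the first operand was unsigned greater-or-equal).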
- V (overflow): signed overflow.
Unlike x86 where almost every arithmetic instruction sets flags, AArch64 instructions set flags only when an explicit flag-setting variant is used:
| add x0, x1, x2 ; x0 = x1 + x2; flags unchanged | |
| adds x0, x1, x2 ; x0 = x1 + x2; flags updated |
The S suffix (adds, subs, ands, bics, and a few others) marks the flag-setting variant of the data-processing instructions that have one. Compare instructions (cmp, cmn, tst) always set flags: they are aliases of subs, adds, and ands with the zero register as destination, so they exist only for the flags.
This explicit flag-setting model has two advantages. First, it reduces serial dependencies between adjacent instructions: if a sequence of arithmetic doesn't need flags, the flag bits are never written and never become a renamed bottleneck. Second, it simplifies OoO execution; the rename logic for flags only fires when an instruction explicitly says it should.
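The flag definitions above can be made concrete with a small C model. This is a sketch for illustration only: the type nzcv_t and the functions model_adds and model_subs are hypothetical names, not part of any toolchain, and the model covers only the 64-bit register forms.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four PSTATE condition flags. */
typedef struct {
    bool n, z, c, v;
} nzcv_t;

/* Model of `adds xd, xn, xm`: returns the sum and fills in the flags. */
static uint64_t model_adds(uint64_t a, uint64_t b, nzcv_t *f)
{
    uint64_t r = a + b;                      /* wraps modulo 2^64 */
    f->n = (r >> 63) != 0;                   /* sign bit of the result */
    f->z = (r == 0);
    f->c = (r < a);                          /* carry out of bit 63 */
    /* signed overflow: operands share a sign that the result lacks */
    f->v = ((~(a ^ b) & (a ^ r)) >> 63) != 0;
    return r;
}

/* Model of `subs xd, xn, xm`, and therefore of `cmp xn, xm`. */
static uint64_t model_subs(uint64_t a, uint64_t b, nzcv_t *f)
{
    uint64_t r = a - b;
    f->n = (r >> 63) != 0;
    f->z = (r == 0);
    f->c = (a >= b);                         /* C = 1 means "no borrow" */
    /* signed overflow: operands differ in sign and the result's sign differs from a */
    f->v = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return r;
}
```

In this model, comparing 3 with 5 leaves C clear and N set, which is why the unsigned "lower" condition tests C = 0 and the signed "less than" condition tests N != V.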
03. Instruction Encoding
Every AArch64 instruction is exactly 32 bits long. Instructions are grouped into encoding categories (data-processing immediate, data-processing register, branches, loads/stores, etc.) by the high bits of the encoding. Within each category, sub-fields select the specific operation, register operands, and immediate values.
The fixed width has several advantages:
- Trivial instruction boundary detection. No prefix walking, no length-decoding. The decoder knows where each instruction starts simply by indexing.
- Parallel decode. All decoders see complete instructions as they fetch; there is no inter-instruction dependence in length determination. Wider decoders are easier.
- Branch prediction. Branch targets are 4-byte aligned, simplifying the BTB.
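- Disassembly. Instruction boundaries are never ambiguous: any 4-byte-aligned word decodes in exactly one way, so a disassembler cannot lose synchronization. (It can still mistake embedded data, such as a literal pool, for instructions.)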
The trade-off: less code density. AArch64 binaries tend to be ~10-20% larger than x86-64 binaries for equivalent functionality. Thumb mode in AArch32 was specifically designed to recover code density (16-bit instructions for common cases), but AArch64 dropped Thumb in favor of a single uniform encoding.
For mobile devices where storage and memory bandwidth matter, the larger code size is a real cost — but cache hierarchies and storage have grown to where it's manageable.
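To illustrate why fixed-width decoding is simple, here is a minimal C sketch that locates instruction number i by plain indexing and pulls register fields out of fixed bit positions. The helper names (fetch_insn, field_rd, field_rn, field_rm) are made up for this example, and the field positions shown apply to the data-processing (register) encodings, not to every instruction class.

```c
#include <stddef.h>
#include <stdint.h>

/* Read instruction number `index` from a code buffer.
   A64 instructions are stored little-endian, 4 bytes each. */
static uint32_t fetch_insn(const uint8_t *code, size_t index)
{
    const uint8_t *p = code + 4 * index;     /* boundary = index * 4, no scanning */
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Register fields sit at fixed bit positions in the data-processing
   (register) encodings: Rd = bits 4:0, Rn = bits 9:5, Rm = bits 20:16. */
static unsigned field_rd(uint32_t insn) { return insn & 0x1f; }
static unsigned field_rn(uint32_t insn) { return (insn >> 5) & 0x1f; }
static unsigned field_rm(uint32_t insn) { return (insn >> 16) & 0x1f; }
```

Fed the bytes of add x0, x1, x2 (the word 0x8b020020), the three field helpers return 0, 1, and 2. An x86 decoder has no equivalent of fetch_insn: it must decode instruction i-1 before it knows where instruction i begins.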
04. Instruction Categories
Data Movement
MOV — move register to register or immediate to register:
| mov x0, x1 ; x0 = x1 | |
| mov x0, #42 ; x0 = 42 (small immediate) | |
| mov x0, #0x1234 ; x0 = 0x1234 (16-bit immediate) |
For larger immediates, AArch64 uses MOVZ (move wide with zero), MOVN (move wide with NOT, which writes the bitwise inverse of the shifted immediate), and MOVK (move wide with keep), composing multi-instruction sequences:
| movz x0, #0x5678, lsl #0 ; x0 = 0x5678 | |
| movk x0, #0x1234, lsl #16 ; x0 = 0x12345678 (keep low 16, set bits 16-31) | |
| movk x0, #0xabcd, lsl #32 ; x0 = 0xabcd_12345678 (set bits 32-47) |
To load arbitrary 64-bit constants, the compiler may emit up to 4 instructions (one MOVZ plus up to three MOVKs, 16 bits each). PC-relative loading from a literal pool is an alternative:
| ldr x0, =0x123456789abcdef0 ; pseudo-instruction; assembler emits literal pool load |
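The MOVZ/MOVK composition described above can be sketched in C. This is a deliberately naive planner, not what any particular compiler does: it emits one instruction per non-zero 16-bit chunk and ignores MOVN and bitmask immediates. The names mov_step_t, plan_mov_imm, and run_mov_plan are hypothetical.

```c
#include <stdint.h>

/* One step of a hypothetical constant-materialization plan. */
typedef struct {
    int      is_movk;   /* 0 = movz (clears the other bits), 1 = movk (keeps them) */
    uint16_t imm16;     /* 16-bit payload */
    unsigned shift;     /* lsl amount: 0, 16, 32 or 48 */
} mov_step_t;

/* Plan a MOVZ + MOVK sequence for `value`; returns the number of steps (1..4).
   Naive strategy: skip all-zero chunks. */
static int plan_mov_imm(uint64_t value, mov_step_t out[4])
{
    int n = 0;
    for (unsigned shift = 0; shift < 64; shift += 16) {
        uint16_t chunk = (uint16_t)(value >> shift);
        if (chunk == 0)
            continue;
        out[n].is_movk = (n > 0);            /* first emitted step is the movz */
        out[n].imm16 = chunk;
        out[n].shift = shift;
        n++;
    }
    if (n == 0) {                            /* value == 0: a single movz #0 */
        out[0].is_movk = 0;
        out[0].imm16 = 0;
        out[0].shift = 0;
        n = 1;
    }
    return n;
}

/* Replay the plan the way the instructions would execute. */
static uint64_t run_mov_plan(const mov_step_t *steps, int n)
{
    uint64_t reg = 0;
    for (int i = 0; i < n; i++) {
        uint64_t field = (uint64_t)steps[i].imm16 << steps[i].shift;
        if (steps[i].is_movk)
            reg = (reg & ~((uint64_t)0xffff << steps[i].shift)) | field;
        else
            reg = field;
    }
    return reg;
}
```

For 0x0000abcd12345678 the planner produces three steps (movz at shift 0, then movk at shifts 16 and 32), matching the sequence shown earlier; a constant with four non-zero chunks needs all four.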
LDR / STR — load and store. The fundamental memory operations:
| ldr x0, [x1] ; x0 = [x1] (64-bit load) | |
| str x0, [x1] ; [x1] = x0 (64-bit store) | |
| ldr w0, [x1] ; 32-bit load with zero-extension to 64 bits | |
| ldrb w0, [x1] ; 8-bit load with zero-extension | |
| ldrsh x0, [x1] ; 16-bit load with sign-extension to 64 bits |
Load size and signedness are encoded in the mnemonic: ldr (full size), ldrb/ldrh (byte/halfword unsigned), ldrsb/ldrsh/ldrsw (signed byte/halfword/word).
Stores have only str/strb/strh (no signedness for stores, since they just write the bits).
LDP / STP — load pair / store pair. Two registers in one instruction, useful for prologue/epilogue:
| stp x29, x30, [sp, #-16]! ; push fp, lr; pre-decrement sp | |
| ldp x29, x30, [sp], #16 ; pop fp, lr; post-increment sp |
Function prologues nearly always use STP to save FP and LR together; this is more efficient than two separate stores.
Addressing Modes
AArch64's load/store instructions support several addressing modes:
| ldr x0, [x1] ; register indirect | |
| ldr x0, [x1, #8] ; base + immediate offset | |
| ldr x0, [x1, x2] ; base + register | |
| ldr x0, [x1, x2, lsl #3] ; base + scaled register (here scale = 8 = 2^3) | |
| ldr w0, [x1, w2, sxtw #2] ; base + sign-extended w-reg index, scaled by 4 (the scale must match the access size) | |
| ldr x0, [x1, #8]! ; pre-indexed: x1 += 8, then load from new x1 | |
| ldr x0, [x1], #8 ; post-indexed: load from x1, then x1 += 8 | |
| ldr x0, label ; PC-relative literal load (the assembler computes the offset from the label) |
Pre- and post-indexed forms are particularly useful for traversing arrays:
| loop: | |
| ldr x0, [x1], #8 ; load from x1, advance x1 by 8 | |
| cbnz x0, loop ; loop if non-zero |
The compiler emits these addressing modes idiomatically.
PC-relative addressing uses ADR (form an address relative to PC, ±1 MiB range) and ADRP (form a 4 KiB-aligned page address relative to PC, ±4 GiB range). Standard pattern for a global variable:
| adrp x0, mygvar ; x0 = page address of mygvar | |
| ldr x1, [x0, :lo12:mygvar] ; x1 = mygvar (low 12 bits within page) |
This gives PIC (position-independent code) without needing a separate GOT for nearby symbols. For symbols outside the ±4 GiB range, the linker arranges a different sequence.
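The arithmetic behind the ADRP plus :lo12: pair can be modeled in a few lines of C. This is a sketch under stated assumptions: model_adrp, model_lo12, and adrp_reachable are illustrative names, and addresses are treated as plain 64-bit integers.

```c
#include <stdint.h>

/* Model of `adrp xd, sym`: the 4 KiB page that contains `sym`, formed as
   (pc with its low 12 bits cleared) + (page delta << 12). */
static uint64_t model_adrp(uint64_t pc, uint64_t sym)
{
    uint64_t pc_page = pc & ~(uint64_t)0xfff;
    uint64_t page_delta = (sym >> 12) - (pc >> 12);   /* modular, may be "negative" */
    return pc_page + (page_delta << 12);
}

/* Model of the :lo12: operator: the offset of `sym` within its page. */
static uint64_t model_lo12(uint64_t sym)
{
    return sym & 0xfff;
}

/* The page delta must fit in a signed 21-bit immediate, giving the
   roughly +/-4 GiB reach quoted in the text. */
static int adrp_reachable(uint64_t pc, uint64_t sym)
{
    int64_t delta = (int64_t)(sym >> 12) - (int64_t)(pc >> 12);
    return delta >= -(INT64_C(1) << 20) && delta < (INT64_C(1) << 20);
}
```

Adding model_adrp(pc, sym) and model_lo12(sym) always reproduces sym, whatever the low 12 bits of pc are, which is why the pair works no matter where the code is loaded as long as code and data move together.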
Arithmetic and Logic
Add and subtract.
| add x0, x1, x2 ; x0 = x1 + x2 | |
| add x0, x1, #100 ; x0 = x1 + 100 | |
| sub x0, x1, x2 ; x0 = x1 - x2 | |
| neg x0, x1 ; x0 = -x1 (alias for sub x0, xzr, x1) | |
| adds x0, x1, x2 ; flag-setting add | |
| subs x0, x1, x2 ; flag-setting sub (used by cmp) |
ADD and SUB accept shifted register operands, allowing one instruction to compute x1 + (x2 << 4):
| add x0, x1, x2, lsl #4 ; x0 = x1 + (x2 << 4) | |
| add x0, x1, x2, lsr #4 ; x0 = x1 + (x2 >> 4) [logical] | |
| add x0, x1, x2, asr #4 ; x0 = x1 + (x2 >> 4) [arithmetic] |
This is the AArch64 equivalent of x86's LEA: a fast scaled addition. Compilers use it for index computations.
Multiply.
| mul x0, x1, x2 ; x0 = x1 * x2 (low 64 bits) | |
| umulh x0, x1, x2 ; x0 = high 64 bits of unsigned x1*x2 | |
| smulh x0, x1, x2 ; x0 = high 64 bits of signed x1*x2 | |
| madd x0, x1, x2, x3 ; x0 = x3 + x1*x2 (multiply-add) | |
| msub x0, x1, x2, x3 ; x0 = x3 - x1*x2 |
Multiply-accumulate is a single instruction (MADD/MSUB), useful in numerical code. There is no flag-setting variant: multiplication never updates the condition flags. Signedness matters only for the high half and for the widening forms, so it appears in the mnemonic (SMULH/UMULH, SMULL/UMULL) rather than in a flag.
Divide.
| udiv x0, x1, x2 ; x0 = x1 / x2 (unsigned) | |
| sdiv x0, x1, x2 ; x0 = x1 / x2 (signed) |
Division by zero produces 0 (no exception in AArch64). To get the remainder, compute q*d and subtract:
| sdiv x3, x1, x2 ; x3 = x1 / x2 | |
| msub x4, x3, x2, x1 ; x4 = x1 - x3*x2 = x1 mod x2 |
Compilers know this pattern and emit it for %.
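Because C leaves division by zero undefined, the architectural behavior is easiest to see in an explicit model. The following sketch uses hypothetical helper names (model_sdiv, model_msub, model_rem) and covers the 64-bit forms only.

```c
#include <stdint.h>

/* Model of `sdiv`: C leaves x / 0 and INT64_MIN / -1 undefined, so the two
   architecturally defined special cases are spelled out by hand. */
static int64_t model_sdiv(int64_t n, int64_t d)
{
    if (d == 0)
        return 0;                       /* no trap: the quotient is 0 */
    if (n == INT64_MIN && d == -1)
        return INT64_MIN;               /* overflow wraps, no trap */
    return n / d;                       /* truncates toward zero, like sdiv */
}

/* Model of `msub xd, xq, xm, xa` = xa - xq * xm, in wrapping arithmetic. */
static int64_t model_msub(int64_t q, int64_t m, int64_t a)
{
    return (int64_t)((uint64_t)a - (uint64_t)q * (uint64_t)m);
}

/* The sdiv + msub remainder idiom. */
static int64_t model_rem(int64_t n, int64_t d)
{
    return model_msub(model_sdiv(n, d), d, n);
}
```

One consequence worth noting: with a zero divisor the idiom yields the dividend itself as the "remainder" (5 mod 0 gives 5), since the quotient is 0. Code that must reject a zero divisor has to test for it explicitly.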
Bitwise.
| and x0, x1, x2 ; bitwise AND | |
| orr x0, x1, x2 ; bitwise OR (note: ORR not OR) | |
| eor x0, x1, x2 ; bitwise XOR (note: EOR not XOR) | |
| mvn x0, x1 ; bitwise NOT (alias for orn x0, xzr, x1) | |
| ands x0, x1, x2 ; flag-setting AND | |
| tst x1, x2 ; AND but discard result, set flags (alias for ands xzr, x1, x2) |
Bitwise instructions also accept shifted register operands (lsl, lsr, asr, and additionally ror), like add/sub; unlike add/sub, they have no extended-register form.
Shifts and bitfield.
| lsl x0, x1, #4 ; logical shift left | |
| lsr x0, x1, #4 ; logical shift right | |
| asr x0, x1, #4 ; arithmetic shift right | |
| ror x0, x1, #4 ; rotate right | |
| ubfx x0, x1, #4, #8 ; unsigned bitfield extract (extract 8 bits starting at bit 4) | |
| sbfx x0, x1, #4, #8 ; signed bitfield extract | |
| bfi x0, x1, #4, #8 ; bitfield insert | |
| ubfm, sbfm, bfm ; underlying generic forms |
The bitfield instructions are particularly powerful: UBFX/SBFX/BFI make bitfield manipulation a single instruction, which is awkward in x86 (requiring shift-mask-or sequences).
CLZ counts leading zeros; RBIT reverses bit order; REV/REV16/REV32 byte-reverse for endianness conversion. Single-instruction primitives that compilers and intrinsics use heavily.
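For readers more used to shift-and-mask code, the three bitfield aliases can be written out in C. This is an illustrative model of the 64-bit forms; the function names are hypothetical, and the asserts encode the operand constraints (width at least 1, field inside the register).

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set; valid for width 1..64. */
static uint64_t low_mask(unsigned width)
{
    return width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/* ubfx xd, xn, #lsb, #width: extract a field and zero-extend it. */
static uint64_t model_ubfx(uint64_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 64);
    return (src >> lsb) & low_mask(width);
}

/* sbfx xd, xn, #lsb, #width: same field, sign-extended from its top bit. */
static int64_t model_sbfx(uint64_t src, unsigned lsb, unsigned width)
{
    uint64_t field = model_ubfx(src, lsb, width);
    uint64_t sign = UINT64_C(1) << (width - 1);
    return (int64_t)((field ^ sign) - sign);
}

/* bfi xd, xn, #lsb, #width: insert the low `width` bits of src into dst
   at bit position lsb, leaving the other bits of dst untouched. */
static uint64_t model_bfi(uint64_t dst, uint64_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 64);
    uint64_t mask = low_mask(width) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}
```

Each of these C functions is several operations long; on AArch64 each corresponds to one instruction, and compilers generally recognize the shift-and-mask shapes and select the bitfield instruction.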
Comparisons
| cmp x0, x1 ; flags = x0 - x1, discard result (alias for subs xzr, x0, x1) | |
| cmp x0, #100 ; flags = x0 - 100 | |
| cmn x0, x1 ; flags = x0 + x1 ("compare negative") | |
| tst x0, x1 ; flags = x0 & x1 |
Following a comparison, conditional branches and conditional selects use the flags.
Control Flow
Unconditional branches.
| b label ; branch (jump) to label | |
| bl label ; branch and link (function call): x30 = pc+4, then jump | |
| br x0 ; branch register (indirect jump) | |
| blr x0 ; branch with link to register (indirect call) | |
| ret ; return: jump to x30 (link register) by default; behaves like br x30 but is a separate encoding that tells the predictor this is a function return |
Note bl is the function-call instruction (saves return address in x30); ret returns by jumping to x30. No PUSH/POP in the prologue/epilogue is required; the link register provides a register-based return mechanism. The compiler explicitly stores x30 to the stack only when the function calls another function (since x30 would otherwise be clobbered).
Conditional branches.
| b.eq label ; branch if equal (Z=1) | |
| b.ne label ; branch if not equal (Z=0) | |
| b.lt label ; branch if less than, signed (N!=V) | |
| b.le label ; branch if less or equal, signed | |
| b.gt label ; branch if greater than, signed (Z=0 and N=V) | |
| b.ge label ; branch if greater or equal, signed (N=V) | |
| b.lo / b.cc ; lower (unsigned less than) (C=0) | |
| b.hi ; higher (unsigned greater than) | |
| b.ls ; lower or same (unsigned) | |
| b.hs / b.cs ; higher or same (unsigned) | |
| b.mi / b.pl ; minus / plus (sign) | |
| b.vs / b.vc ; overflow set / clear |
Conditional branches have ±1 MiB range. For longer ranges, the compiler emits a conditional branch around an unconditional branch.
Compare-and-branch combinations. AArch64 has fused compare-branch instructions for common patterns:
| cbz x0, label ; if x0 == 0, branch | |
| cbnz x0, label ; if x0 != 0, branch | |
| tbz x0, #5, label ; if bit 5 of x0 is 0, branch (test bit and branch) | |
| tbnz x0, #5, label ; if bit 5 of x0 is 1, branch |
These are single instructions, no separate compare needed. They save an instruction (no cmp) and don't pollute the flags register, simplifying OoO execution.
Conditional select.
| csel x0, x1, x2, eq ; x0 = x1 if eq else x2 | |
| csinc x0, x1, x2, eq ; x0 = x1 if eq else x2+1 | |
| csinv x0, x1, x2, eq ; x0 = x1 if eq else ~x2 | |
| csneg x0, x1, x2, eq ; x0 = x1 if eq else -x2 |
These are AArch64's branchless-select primitives, equivalent to x86's CMOVcc. Compilers use them aggressively for branchless code.
Conditional set/inc/inv/neg with one operand:
| cset x0, eq ; x0 = 1 if eq else 0 (alias of csinc x0, xzr, xzr, ne) | |
| csetm x0, eq ; x0 = -1 if eq else 0 |
CSET is the equivalent of x86's SETcc, materializing a condition into a 0/1 register value.
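The conditional-select family and its aliases can be summarized as a small C model. The function names are hypothetical, and the cond parameter stands for a condition code that has already been evaluated against the flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* `cond` stands for the already-evaluated condition code (eq, lt, ...). */
static uint64_t model_csel (bool cond, uint64_t a, uint64_t b) { return cond ? a : b; }
static uint64_t model_csinc(bool cond, uint64_t a, uint64_t b) { return cond ? a : b + 1; }
static uint64_t model_csinv(bool cond, uint64_t a, uint64_t b) { return cond ? a : ~b; }
static uint64_t model_csneg(bool cond, uint64_t a, uint64_t b) { return cond ? a : 0 - b; }

/* cset xd, cond   ==  csinc xd, xzr, xzr, <inverted cond> */
static uint64_t model_cset(bool cond)  { return model_csinc(!cond, 0, 0); }

/* csetm xd, cond  ==  csinv xd, xzr, xzr, <inverted cond> */
static uint64_t model_csetm(bool cond) { return model_csinv(!cond, 0, 0); }

/* cneg xd, xn, cond  ==  csneg xd, xn, xn, <inverted cond> */
static uint64_t model_cneg(bool cond, uint64_t a) { return model_csneg(!cond, a, a); }
```

The aliases show why the base instructions apply their extra operation on the "else" side: with the condition inverted and xzr as both sources, csinc yields 0 or 1 and csinv yields 0 or all ones, so no separate set-on-condition encoding is needed.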
Atomic Operations
ARMv8.0 has only the load-exclusive / store-exclusive mechanism for atomics:
| loop: | |
| ldxr x0, [x1] ; load-exclusive | |
| add x0, x0, #1 ; modify | |
| stxr w2, x0, [x1] ; store-exclusive; w2 = 0 on success, 1 on failure | |
| cbnz w2, loop ; retry if failed |
This is the LR/SC pattern (Chapter 30): load-exclusive marks the line as monitored; store-exclusive succeeds only if the line has not been modified since.
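At the source level, the same retry structure is what a C11 weak compare-exchange loop expresses. The sketch below is illustrative (increment_with_retry is a made-up name); a compiler targeting ARMv8.0 would typically lower it to an exclusive load/store loop like the one above, though the exact code depends on the compiler and flags.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomic increment written as an explicit retry loop. The weak
   compare-exchange is allowed to fail spuriously, just as stxr may.
   Returns the value the counter held before the increment. */
static uint64_t increment_with_retry(_Atomic uint64_t *counter)
{
    uint64_t old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* on failure `old` has been refreshed with the current value; retry */
    }
    return old;
}
```

In practice atomic_fetch_add does the same job in one call; the loop form is shown only because it mirrors the load, modify, conditional-store, retry shape of the assembly.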
ARMv8.1 added LSE (Large System Extension), single-instruction atomics:
| mov w0, #1 | |
| ldadd w0, w1, [x2] ; atomically: w1 = [x2]; [x2] = old + w0 | |
| ldset, ldclr, ldeor ; atomic OR, AND-NOT, XOR | |
| swp ; atomic exchange | |
| cas ; compare-and-swap | |
| casa, casl, casal ; CAS with acquire/release/both |
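LSE is now nearly universal in modern AArch64 chips. Compilers emit LSE instructions inline when the target is known to support them (for example with -march=armv8.1-a); with -moutline-atomics they instead call small helper routines that choose between LSE and the exclusive-pair loop at run time.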
Memory ordering. AArch64 is a weak memory model. To enforce ordering, use:
- DMB ISH (Data Memory Barrier, Inner Shareable): full memory barrier across cores.
- DMB ISHLD: order earlier loads before later memory ops.
- DMB ISHST: order earlier stores before later stores.
- DSB: synchronization barrier (waits for completion).
- ISB: instruction synchronization barrier.
Or use load-acquire / store-release instructions:
| ldar x0, [x1] ; load-acquire (subsequent ops don't move before) | |
| stlr x0, [x1] ; store-release (preceding ops don't move after) | |
| ldaxr, stlxr ; LR/SC variants with acquire/release |
ldar/stlr are the cheapest way to express acquire/release semantics. They map directly to C++ memory_order_acquire/memory_order_release. Compilers emit them when source code uses std::atomic operations with these orderings.
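A minimal message-passing sketch in C11 shows where the acquire and release orderings go. The mailbox_t type and both functions are hypothetical; the point is only the pairing of a release store on the flag with an acquire load of the same flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox shared between a producer and a consumer. */
typedef struct {
    int         payload;   /* plain data, written before the flag is raised */
    atomic_bool ready;     /* flag that publishes the payload */
} mailbox_t;

/* Producer: the release store keeps the payload write ahead of the flag. */
static void mailbox_publish(mailbox_t *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: the acquire load keeps the payload read behind the flag check.
   Returns true and fills *out once the message is visible. */
static bool mailbox_try_consume(mailbox_t *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

If either ordering were weakened to memory_order_relaxed, the consumer could observe the flag set and still read a stale payload on AArch64, even though the same code would usually appear to work on x86.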
05. Calling Convention (AAPCS64)
The standard calling convention on AArch64 is the AAPCS64 (Procedure Call Standard for AArch64), used by Linux, macOS, iOS, Android, and Windows on ARM (with minor variants).
Argument passing.
- Integer / pointer args 1-8: in x0-x7.
- FP / vector args 1-8: in v0-v7.
- Additional args: on the stack.
- Return value: in x0 (and x1 if 128-bit). FP return in v0.
- Indirect result location (for large struct returns): x8 holds the address.
Caller-saved (volatile). x0-x17 (plus x18 where the platform leaves it free), v0-v7, v16-v31. The caller must save these before a call if it wants them preserved.
Callee-saved. x19-x28, v8-v15 (only the lower 64 bits of v8-v15, technically). The callee must preserve or save/restore these.
Stack. Grows downward, must be 16-byte aligned at any public function call. SP must be 16-byte aligned at all times when SP-relative addressing is used.
Frame pointer. x29 is the frame pointer; x30 is the link register. The standard prologue:
| function: | |
| stp x29, x30, [sp, #-16]! ; push fp, lr | |
| mov x29, sp ; new fp | |
| sub sp, sp, #N ; allocate locals (N must be 16-byte multiple) | |
| ; ... body ... | |
| add sp, sp, #N ; deallocate locals | |
| ldp x29, x30, [sp], #16 ; pop fp, lr | |
| ret |
For leaf functions (functions that don't call others), x30 doesn't need to be saved, and the prologue can be omitted entirely if locals fit in registers.
The convention is straightforward and consistent. Apple's ABI and Microsoft's ABI for Windows on ARM diverge in small details (e.g., x18 is reserved on Apple/Windows but available on Linux, vector argument layout for vararg differs), but the core is shared.
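The 16-byte rule for the frame size N in the prologue above comes down to one rounding step. A minimal sketch, with hypothetical helper names and stack addresses treated as plain integers:

```c
#include <stddef.h>

/* Round a byte count of locals up to the next multiple of 16, the value a
   compiler would use for N in `sub sp, sp, #N`. */
static size_t frame_bytes(size_t locals)
{
    return (locals + 15) & ~(size_t)15;
}

/* True when a stack pointer value satisfies the 16-byte alignment rule. */
static int sp_is_aligned(size_t sp)
{
    return (sp & 15) == 0;
}
```

A function with 40 bytes of locals therefore reserves 48. Together with the 16 bytes pushed for x29 and x30, the stack pointer stays 16-byte aligned throughout the body, provided it was aligned on entry.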
06. Common Idioms
A few idioms appear frequently in compiled AArch64 code.
Zeroing. Use xzr/wzr:
| mov x0, xzr ; x0 = 0 | |
| mov w0, wzr ; w0 = 0 (zeros all 64 bits of x0) | |
| str xzr, [x1] ; store zero |
The zero register source is recognized by rename hardware; no false dependencies.
Comparing to zero. Use cbz/cbnz if the result is for a branch:
| cbz x0, .Lzero ; if x0 == 0, branch |
If the flags feed something other than a branch (a cset or csel, say), set them with cmp or tst:
| tst x0, x0 ; sets flags (alias for ands xzr, x0, x0) | |
| cset w1, eq ; w1 = 1 if x0 == 0, else 0 |
Branchless absolute value.
| cmp x0, #0 | |
| cneg x0, x0, lt ; if lt (negative), negate; else keep |
After the compare, a single cneg (conditional negate, an alias of csneg) finishes the job without a branch.
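In C, the source shape that usually leads to this idiom is a plain conditional negate. The sketch below (abs_u64 is a hypothetical name) returns the magnitude as an unsigned value so that the most negative input is well defined; whether a given compiler emits cmp plus cneg for it depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Branchless-friendly absolute value: compute the negation unconditionally,
   then select between the two candidates, which is what cneg does. */
static uint64_t abs_u64(int64_t x)
{
    uint64_t ux = (uint64_t)x;
    uint64_t negated = 0 - ux;          /* wrapping negate, no overflow UB */
    return x < 0 ? negated : ux;
}
```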
Sign extension.
| sxtw x0, w1 ; sign-extend 32-bit w1 to 64-bit x0 | |
| sxth x0, w1 ; sign-extend 16-bit | |
| sxtb x0, w1 ; sign-extend 8-bit |
Or use the S form of a load directly: ldrsw x0, [x1] loads 32 bits with sign extension to 64.
Loop counter.
| .Lloop: | |
| ldr w0, [x1], #4 ; load and post-increment | |
| add w2, w2, w0 ; accumulate | |
| subs x3, x3, #1 ; decrement counter | |
| b.ne .Lloop ; loop if not zero |
The subs ... b.ne pattern is the canonical decrementing loop.
07. Compiler Output Walk-Through
The same example as in Chapter 33: array sum.
| int sum_array(const int* a, size_t n) { | |
| int s = 0; | |
| for (size_t i = 0; i < n; i++) | |
| s += a[i]; | |
| return s; | |
| } |
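A simplified scalar rendering of what clang -O2 --target=aarch64-linux-gnu produces (an actual -O2 build typically also unrolls and vectorizes the loop):
| sum_array: | |
| cbz x1, .Lret_zero ; n == 0: nothing to sum | |
| mov x8, #0 ; i = 0 | |
| mov w9, #0 ; s = 0 (kept in w9 so that x0 still holds a) | |
| .Lloop: | |
| ldr w10, [x0, x8, lsl #2] ; load a[i] | |
| add w9, w9, w10 ; s += a[i] | |
| add x8, x8, #1 ; i++ | |
| cmp x8, x1 | |
| b.ne .Lloop | |
| mov w0, w9 ; return value goes in w0 | |
| ret | |
| .Lret_zero: | |
| mov w0, #0 | |
| ret |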
Notice:
- Argument a is in x0, n is in x1.
- Return value is in w0/x0.
- The scaled-register addressing [x0, x8, lsl #2] encodes a[i] as *(a + i*4) in one instruction.
- The loop is 5 instructions, similar density to the x86 version.
- cbz handles the early exit cleanly.
For very simple loops, AArch64 and x86 produce similar instruction counts. AArch64 instructions are uniform 4 bytes; x86 instructions average ~3-4 bytes; the code-size difference is small at this scale.
08. Position-Independent Code
PIC on AArch64 uses ADRP/ADD/LDR sequences:
| ; Reading a global int 'g' (with ARMv8 small code model) | |
| adrp x0, :got:g ; page address of g's GOT entry | |
| ldr x0, [x0, :got_lo12:g] ; load g's actual address | |
| ldr w0, [x0] ; load g's value |
For local symbols (defined in the same module), a simpler ADRP/ADD without GOT works:
| adrp x0, mylocal | |
| add x0, x0, :lo12:mylocal | |
| ldr w0, [x0] ; load mylocal's value |
The ADRP/ADD pattern is similar to x86-64's RIP-relative addressing, just with the address computation explicit (one instruction for the page, one for the offset). This is needed because a 32-bit AArch64 instruction can't fit a full 32- or 64-bit displacement.
09. Thread-Local Storage
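AArch64 uses TPIDR_EL0 (the EL0 software thread ID register) as the TLS base. With the local-exec TLS model and GNU assembler syntax, an access looks like this:
| mrs x0, tpidr_el0 ; x0 = thread pointer | |
| add x0, x0, #:tprel_hi12:var, lsl #12 ; add the high 12 bits of var's offset from the thread pointer | |
| add x0, x0, #:tprel_lo12_nc:var ; add the low 12 bits | |
| ldr w1, [x0] ; read thread-local 'var' |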
The TLS base is set up by the OS or runtime when creating each thread. Reading TPIDR_EL0 is unprivileged (cheaper than a syscall, and not needing fs/gs games as in x86).
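From C, none of this is visible: declaring a variable _Thread_local is enough, and the compiler generates the thread-pointer-relative access. A minimal sketch with hypothetical names:

```c
/* One instance of `request_count` exists per thread. */
static _Thread_local int request_count;

/* Each call touches only the calling thread's copy, so no locking is needed. */
static int count_request(void)
{
    return ++request_count;
}
```

Which access sequence the compiler picks (local-exec as sketched above, or a more general model involving a descriptor call) depends on whether the code is built as an executable or a shared library.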
10. Privileged vs. Unprivileged
User mode (EL0) can run nearly all integer/FP/SIMD instructions. Privileged operations are restricted:
- System register access (MRS/MSR): most system regs are EL1+ only; some are accessible from EL0 (TPIDR_EL0, CNTVCT_EL0, etc.).
- Cache and TLB maintenance (DC, IC, TLBI): mostly EL1+, with some operations such as DC CVAU (clean data cache by address to the point of unification) usable from EL0 when the OS enables it.
- HVC, SMC: hypervisor and secure-monitor calls, issued from EL1 or above to enter EL2 or EL3; both are undefined at EL0.
- ERET: exception return; EL1+ only.
Attempting a privileged operation from EL0 raises an exception to EL1; on Linux the kernel typically delivers SIGILL to the process.
11. The Weak Memory Model in Practice
The AArch64 memory model is fundamentally weaker than x86's TSO, and this difference is the single most common source of subtle bugs when porting concurrent code from x86 to ARM. A complete formal treatment belongs in Chapter 31 (Cache Coherence and Consistency); this section gives the practical view from the programmer's seat.
Under AArch64's weakly-ordered memory model, the hardware is free to reorder essentially any pair of memory accesses to different locations (load-load, load-store, store-load, and store-store) from the perspective of other observers, except where the program explicitly forbids reordering. x86's TSO permits only one of these, a younger load passing an older store to a different address, so code that silently relied on TSO keeping the other three orders intact can fail on ARM without warning.
Four mechanisms restrict reordering:
- Address dependencies. A load whose result feeds into the address of a subsequent load creates an address dependency that the hardware must respect. Code that traverses linked lists or reads pointer-flag-data tuples relies on this implicitly.
- Acquire/release accesses. LDAR/LDAPR (load-acquire) prevent any later access from being reordered before them; STLR (store-release) prevents any earlier access from being reordered after it. These are the same one-way fences C11/C++11 expose as memory_order_acquire and memory_order_release. They are cheap on AArch64 (essentially free in straight-line code on modern Apple and Cortex cores) and are the right tool for almost all lock and message-passing code.
- Atomic read-modify-write. ARMv8.1 LSE instructions (LDADD, CAS, SWP, STADD, ...) come with optional acquire (-A), release (-L), and acquire-release (-AL) suffixes; the suffix-free form is relaxed (no ordering). The acquire-release form (e.g. CASAL) is the AArch64 equivalent of x86's LOCK CMPXCHG.
- Explicit barriers. DMB ISH (data memory barrier, inner shareable) is the full bidirectional fence between memory accesses; DSB waits for completion of prior accesses; ISB is an instruction synchronization barrier used after self-modifying code or after changing system state. DMB ISHLD and DMB ISHST are weaker partial barriers (load- or store-only). Barriers are more expensive than acquire/release; correct concurrent code prefers acquire/release where possible.
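As a concrete instance of the acquire/release and read-modify-write mechanisms, here is a minimal try-lock in C11. The spinlock_t type and function names are hypothetical, and this is a sketch rather than a production lock (no backoff, no fairness). On an LSE-capable target the compare-exchange would plausibly become a CAS with acquire semantics; on ARMv8.0 it would become an exclusive-pair loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } spinlock_t;   /* 0 = free, 1 = held */

/* Try once to take the lock. Acquire ordering on success keeps the critical
   section from moving above the lock; a failed attempt needs no ordering. */
static bool spin_try_lock(spinlock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Release ordering keeps the critical section from moving below the unlock. */
static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

No explicit DMB appears anywhere: the one-way orderings on the lock and unlock operations are sufficient to contain the critical section.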
The pre-LSE atomics built from LDXR/STXR (load-exclusive / store-exclusive) implement compare-and-swap as a software loop: load the value with reservation, compare, conditionally store, retry on failure. This style remains valid and is what ARMv8.0-A targets use, but LSE is generally faster and simpler for contended atomics. Binaries built for an ARMv8.1-A or later baseline use LSE directly; binaries that must still run on ARMv8.0-A commonly rely on outline-atomics helpers, which check AT_HWCAP at startup and dispatch to whichever path the host supports.
The practical advice for AArch64 concurrent programming: use <stdatomic.h> or std::atomic with explicit memory orders, choose the weakest order the algorithm actually needs, and treat any code that worked on x86 without atomics as suspect when porting. On AArch64 a memory_order_seq_cst load or store compiles to the same LDAR/STLR used for acquire/release, not to a barrier-bracketed access as on ARMv7; the savings from weaker orders come from relaxed accesses, from LDAPR where the core implements it, and from avoiding seq_cst fences, which become DMB ISH. The cost of getting this wrong is rare, hard-to-reproduce data races that manifest only on specific cores and only at certain speculation depths.
12. Practical Tools
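- objdump -d -m aarch64 binary: disassemble (requires a binutils build with AArch64 support; aarch64-linux-gnu-objdump -d or llvm-objdump -d also work).
- clang -S -O2 --target=aarch64-linux-gnu, or aarch64-linux-gnu-gcc -S -O2: compile to assembly. GCC selects the target through the cross-compiler's name rather than a --target flag.
- Compiler Explorer (godbolt.org): supports AArch64 across many compilers.
- perf annotate on ARM Linux: instruction-level profiling.
- ARM Architecture Reference Manual: the canonical reference (massive PDF; thousands of pages).
- Felix Cloutier-style references: third-party AArch64 references exist; ARM's official docs are comprehensive but heavy.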
13. Summary
AArch64 is a clean RISC ISA: 31 general-purpose 64-bit registers plus xzr and sp, fixed 32-bit instruction width, large register file with structured encoding, weak memory model, explicit flag-setting (S-suffix), and a uniform set of addressing modes including pre/post-indexed forms and shifted/extended register operands. Common patterns — function prologues with STP, branchless selects with CSEL/CSET, compare-and-branch with CBZ/CBNZ/TBZ, atomic LSE operations, acquire/release with LDAR/STLR — are central to typical compiler output.
Compared with x86-64, AArch64's encoding is far simpler, the register file is twice as large, conditional execution is explicit and limited, the memory model is weaker (requiring more programmer awareness), and SIMD is its own first-class subsystem (NEON and SVE) rather than a layered set of extensions on a base. The next chapter steps up to the system level: exception levels, MMU, interrupts (GIC), system registers, boot.