Part V·ISA Case Studies·Chapter 38 of 62

AArch64 Programming Model

May 16, 2026·23 min read·advanced

This chapter is the AArch64 programmer's-eye view. Where Chapter 33 walked through x86-64 as an application programmer sees it, this chapter does the same for AArch64. The treatment is parallel and concrete, with assembly examples and references to compiler output. Comparisons to x86-64 are made along the way to highlight where the two ISAs differ in spirit.

AArch64 is the 64-bit execution state of ARMv8-A and ARMv9-A. We use "AArch64" and "ARM64" interchangeably in this chapter (both terms are common; Apple favors "ARM64", ARM Ltd. favors "AArch64"; Linux uses both).

The chapter covers: the register file, instruction encoding, addressing modes, instruction categories (data movement, arithmetic, logic, comparisons, control flow, atomics), calling conventions, and common compiler-emitted patterns. Privileged and system-level features are deferred to Chapter 39.

01.Register File

AArch64 has 31 general-purpose 64-bit registers, named x0 through x30, plus the dedicated stack pointer sp.

Each register has two sizes:

xN: 64-bit access. Operations on xN read and write all 64 bits.
wN: 32-bit access. Operations on wN read only the low 32 bits, and a write to wN clears the upper 32 bits of xN.

So wN is to xN what eax is to rax on x86-64: a 32-bit alias that zero-extends. There are no 16-bit or 8-bit register names; sub-32-bit operations use load/store with size suffix or use 32-bit operations and rely on data layout.
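The zero-extension rule can be modeled in a few lines. This is a minimal sketch, not a description of any hardware: `write_w` and `add_w` are hypothetical helper names, and registers are represented as plain Python integers.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def write_w(old_x, value):
    """Model a write to wN: the low 32 bits take the value and the upper
    32 bits become zero. old_x is accepted only to show that the previous
    contents of xN play no part in the result."""
    return value & MASK32

def add_w(a, b):
    """Model `add w0, w1, w2`: read the low 32 bits of each source,
    wrap the sum at 32 bits, and zero-extend into the 64-bit register."""
    return write_w(0, (a & MASK32) + (b & MASK32))

x0 = write_w(MASK64, 0x1234)      # like `mov w0, #0x1234` on a register full of ones
wrapped = add_w(0xFFFF_FFFF, 1)   # 32-bit wraparound, upper half stays clear
```

Here `x0` ends up as 0x1234 with nothing left over from the old upper half, and `wrapped` is 0 rather than 0x1_0000_0000.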

By convention:

Register	Role
x0-x7	First 8 argument registers; x0-x1 hold return values
x8	Indirect result location (large struct return); also syscall number on Linux
x9-x15	Caller-saved (volatile / temporary)
x16, x17	IP0, IP1 — intra-procedure call scratch (used by linker stubs)
x18	Platform register (reserved on Apple platforms and on Windows, where it holds the TEB pointer in user mode; available as a temporary on Linux)
x19-x28	Callee-saved (preserved across calls)
x29	Frame pointer (FP)
x30	Link register (LR) — return address
sp	Stack pointer

xzr / wzr is the zero register: a virtual register that always reads as 0 and discards writes. Most instructions that take a register operand accept xzr/wzr in its place; the exceptions are the operand slots where encoding 31 means sp, described next. This removes the need for a special "zero this register" idiom such as x86's xor reg, reg: mov x0, xzr (or mov x0, #0) zeroes a register, and reading the zero register creates no dependency on earlier instructions.

The stack pointer is not x31. The encoding bits that would select x31 mean either xzr or sp depending on the instruction context. Specifically:

In most arithmetic/logical instructions, x31 in the source means xzr.
In stack-based instructions (load/store with sp-relative addressing, etc.), x31 means sp.
A few instructions can use sp as a general operand.

This dual interpretation saves an encoding bit at the cost of slight complexity. In practice, programmers (and compilers) use the assembler mnemonics (xzr, sp) and don't worry about the encoding.

Program Counter (PC). AArch64 has an architectural PC, but it is not a general-purpose register (unlike AArch32, where the PC was r15). Instructions cannot directly read or write PC. PC-relative loads and the ADR/ADRP instructions are the way to get PC's value or compute PC-relative addresses.

FP/SIMD registers (V0-V31). 32 128-bit registers used by NEON SIMD and scalar floating-point. Each has multiple aliased forms:

vN.16b, vN.8h, vN.4s, vN.2d: as a vector of 16 bytes, 8 halfwords, 4 words, or 2 doublewords (NEON).
qN: full 128-bit access (e.g., for moves).
dN: low 64 bits as a double-precision FP scalar.
sN: low 32 bits as a single-precision FP scalar.
hN: low 16 bits as a half-precision FP scalar.
bN: low 8 bits.

Writes to a sub-width view (sN, dN, etc.) zero the upper bits of the 128-bit register. This is consistent zero-extension semantics: there is no per-element preservation surprise.

SVE registers (Z0-Z31, P0-P15). When SVE/SVE2 is implemented, the V registers extend to scalable Z registers (typically 128, 256, 512, 1024, or 2048 bits depending on the implementation), and there are 16 predicate registers P0-P15. We treat these in Chapter 40.

System registers. Hundreds of system control registers, accessed via MRS (move from system reg) and MSR (move to system reg). Examples: TPIDR_EL0 (thread pointer), MIDR_EL1 (CPU ID), PMCCNTR_EL0 (cycle counter), CNTVCT_EL0 (virtual counter). Most are privileged; some are accessible from EL0.

02.Condition Flags

AArch64 has a PSTATE register holding processor state, including the four condition flags:

N (negative): result was negative.
Z (zero): result was zero.
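C (carry): carry out of the top bit, i.e. unsigned overflow on addition. For subtraction the sense is inverted: C=1 means no borrow occurred.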
V (overflow): signed overflow.

Unlike x86 where almost every arithmetic instruction sets flags, AArch64 instructions set flags only when an explicit flag-setting variant is used:

Assembly

add  x0, x1, x2     ; x0 = x1 + x2; flags unchanged
adds x0, x1, x2     ; x0 = x1 + x2; flags updated

An S suffix selects the flag-setting form of the data-processing instructions that have one (adds, subs, ands, bics, adcs, sbcs, negs). The compare instructions (cmp, cmn, tst) always set flags: they are aliases of subs, adds, and ands with the zero register as destination, so they exist only for their flags.

This explicit flag-setting model has two advantages. First, it reduces serial dependencies between adjacent instructions: if a sequence of arithmetic doesn't need flags, the flag bits are never written and never become a renamed bottleneck. Second, it simplifies OoO execution; the rename logic for flags only fires when an instruction explicitly says it should.
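To make the four flags concrete, the sketch below models what a flag-setting 64-bit add or subtract writes to NZCV. The helper names `adds` and `subs` mirror the mnemonics but are otherwise hypothetical; the model treats subtraction as `a + ~b + 1`, which is why C=1 after a subtract means "no borrow".

```python
MASK64 = (1 << 64) - 1

def adds(a, b, carry_in=0):
    """Return (result, (N, Z, C, V)) for a 64-bit flag-setting add."""
    a &= MASK64
    b &= MASK64
    full = a + b + carry_in
    result = full & MASK64
    n = result >> 63                      # sign bit of the result
    z = int(result == 0)
    c = int(full > MASK64)                # carry out of bit 63
    # Signed overflow: both operands differ in sign from the result.
    v = ((a ^ result) & (b ^ result)) >> 63
    return result, (n, z, c, v)

def subs(a, b):
    """Return (result, (N, Z, C, V)) for a 64-bit flag-setting subtract."""
    return adds(a, ~b & MASK64, 1)

equal = subs(5, 5)[1]                     # (0, 1, 1, 0): Z set, C set (no borrow)
borrow = subs(3, 5)[1]                    # (1, 0, 0, 0): negative, borrow occurred
overflow = adds((1 << 63) - 1, 1)[1]      # (1, 0, 0, 1): INT64_MAX + 1 overflows
```

The plain `add` and `sub` forms compute the same result and simply skip the flag tuple.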

03.Instruction Encoding

Every AArch64 instruction is exactly 32 bits long. Instructions are grouped into encoding categories (data-processing immediate, data-processing register, branches, loads/stores, etc.) by the high bits of the encoding. Within each category, sub-fields select the specific operation, register operands, and immediate values.

The fixed width has several advantages:

Trivial instruction boundary detection. No prefix walking, no length-decoding. The decoder knows where each instruction starts simply by indexing.
Parallel decode. All decoders see complete instructions as they fetch; there is no inter-instruction dependence in length determination. Wider decoders are easier.
Branch prediction. Branch targets are 4-byte aligned, simplifying the BTB.
Disassembly. Given a 4-byte-aligned starting point, a byte sequence has exactly one decoding, so tools cannot fall out of sync with the instruction stream.
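Both properties (boundaries found by indexing, fields found at fixed bit positions) can be seen in a short sketch. It covers only the ADD/SUB shifted-register encoding class, and `split_words` and `decode_addsub_shifted` are hypothetical names; the two words decoded are the encodings of `add x0, x1, x2` and `adds x0, x1, x2`.

```python
import struct

def split_words(code):
    """Instruction i starts at byte 4*i; each word is stored little-endian."""
    return [w for (w,) in struct.iter_unpack("<I", code)]

def decode_addsub_shifted(word):
    """Extract the fields of an ADD/SUB (shifted register) instruction."""
    # Bits 28..24 == 01011 and bit 21 == 0 identify this encoding class.
    if (word >> 24) & 0x1F != 0b01011 or (word >> 21) & 1:
        raise ValueError("not an ADD/SUB shifted-register encoding")
    return {
        "sf": word >> 31,              # 1 = 64-bit (x registers)
        "op": (word >> 30) & 1,        # 0 = add, 1 = sub
        "S": (word >> 29) & 1,         # 1 = flag-setting form
        "shift": (word >> 22) & 3,     # 0 = lsl, 1 = lsr, 2 = asr
        "rm": (word >> 16) & 0x1F,
        "imm6": (word >> 10) & 0x3F,   # shift amount
        "rn": (word >> 5) & 0x1F,
        "rd": word & 0x1F,
    }

code = bytes.fromhex("2000028b200002ab")   # add x0, x1, x2 ; adds x0, x1, x2
words = split_words(code)
fields = [decode_addsub_shifted(w) for w in words]
```

The two decoded dictionaries differ only in the `S` bit, which is the whole difference between `add` and `adds`.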

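The trade-off is code density. Fixed 4-byte instructions cannot shrink common operations the way a variable-length encoding can, so AArch64 binaries tend to be larger than x86-64 binaries for equivalent functionality, on the order of 10-20%, though the gap varies with compiler, options, and workload. Thumb mode in AArch32 was specifically designed to recover code density (16-bit instructions for common cases), but AArch64 dropped Thumb in favor of a single uniform encoding.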

For mobile devices where storage and memory bandwidth matter, the larger code size is a real cost — but cache hierarchies and storage have grown to where it's manageable.

04.Instruction Categories

Data Movement

MOV — move register to register or immediate to register:

Assembly

mov  x0, x1            ; x0 = x1
mov  x0, #42           ; x0 = 42 (small immediate)
mov  x0, #0x1234       ; x0 = 0x1234 (16-bit immediate)

For larger immediates, AArch64 uses MOVZ (move wide with zero), MOVN (move wide with NOT, which writes the bitwise inverse of the shifted immediate), and MOVK (move wide with keep), composing multi-instruction sequences:

Assembly

movz x0, #0x5678, lsl #0     ; x0 = 0x5678
movk x0, #0x1234, lsl #16    ; x0 = 0x12345678 (keep low 16, set bits 16-31)
movk x0, #0xabcd, lsl #32    ; x0 = 0xabcd_12345678 (set bits 32-47)

To load an arbitrary 64-bit constant, the compiler may emit up to 4 instructions: one MOVZ followed by up to three MOVKs, each supplying 16 bits. A PC-relative load from a literal pool is an alternative:

Assembly

ldr  x0, =0x123456789abcdef0  ; pseudo-instruction; assembler emits literal pool load
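The MOVZ/MOVK composition described above can be sketched as a small planner plus an evaluator that checks the plan. This is an illustration under simplifying assumptions: it never uses MOVN or logical (bitmask) immediates, which real compilers also consider, and `plan`, `execute`, and `render` are hypothetical names.

```python
MASK64 = (1 << 64) - 1

def plan(value):
    """Split a 64-bit constant into (op, imm16, shift) steps, skipping zero slices."""
    value &= MASK64
    steps = []
    for shift in (0, 16, 32, 48):
        chunk = (value >> shift) & 0xFFFF
        if chunk:
            # The first instruction zeroes the rest of the register; later ones keep it.
            steps.append(("movk" if steps else "movz", chunk, shift))
    return steps or [("movz", 0, 0)]

def execute(steps):
    """Apply the steps to a register to confirm they rebuild the constant."""
    reg = 0
    for op, imm, shift in steps:
        if op == "movz":
            reg = imm << shift
        else:  # movk: replace one 16-bit slice, keep the others
            reg = (reg & ~(0xFFFF << shift) & MASK64) | (imm << shift)
    return reg

def render(steps, reg="x0"):
    return [f"{op} {reg}, #{imm:#x}, lsl #{shift}" for op, imm, shift in steps]

listing = render(plan(0xABCD_1234_5678))
```

For 0xABCD12345678 the listing reproduces the three-instruction sequence shown earlier; a constant with all four slices non-zero needs four instructions, and slices that are zero cost nothing.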

LDR / STR — load and store. The fundamental memory operations:

Assembly

ldr  x0, [x1]              ; x0 = [x1] (64-bit load)
str  x0, [x1]              ; [x1] = x0 (64-bit store)
ldr  w0, [x1]              ; 32-bit load with zero-extension to 64 bits
ldrb w0, [x1]              ; 8-bit load with zero-extension
ldrsh x0, [x1]             ; 16-bit load with sign-extension to 64 bits

Load size and signedness are encoded in the mnemonic: ldr (full size), ldrb/ldrh (byte/halfword unsigned), ldrsb/ldrsh/ldrsw (signed byte/halfword/word).

Stores have only str/strb/strh (no signedness for stores, since they just write the bits).

LDP / STP — load pair / store pair. Two registers in one instruction, useful for prologue/epilogue:

Assembly

stp  x29, x30, [sp, #-16]!  ; push fp, lr; pre-decrement sp
ldp  x29, x30, [sp], #16    ; pop fp, lr; post-increment sp

Function prologues nearly always use STP to save FP and LR together; this is more efficient than two separate stores.

Addressing Modes

AArch64's load/store instructions support several addressing modes:

Assembly

ldr  x0, [x1]              ; register indirect
ldr  x0, [x1, #8]          ; base + immediate offset
ldr  x0, [x1, x2]          ; base + register
ldr  x0, [x1, x2, lsl #3]  ; base + scaled register (here scale = 8 = 2^3)
ldr  x0, [x1, w2, sxtw #3] ; base + sign-extended w-reg, scaled by the access size (8)
ldr  x0, [x1, #8]!         ; pre-indexed: x1 += 8, then load from new x1
ldr  x0, [x1], #8          ; post-indexed: load from x1, then x1 += 8
ldr  x0, label             ; PC-relative literal load (label within ±1 MiB)

Pre- and post-indexed forms are particularly useful for traversing arrays:

Assembly

loop:
    ldr  x0, [x1], #8       ; load from x1, advance x1 by 8
    cbnz x0, loop           ; loop if non-zero

The compiler emits these addressing modes idiomatically.

PC-relative addressing uses ADR (form an address relative to PC, ±1 MiB range) and ADRP (form a 4 KiB-aligned page address relative to PC, ±4 GiB range). Standard pattern for a global variable:

Assembly

adrp x0, mygvar                ; x0 = page address of mygvar
ldr  x1, [x0, :lo12:mygvar]    ; x1 = mygvar (low 12 bits within page)

This gives PIC (position-independent code) without needing a separate GOT for nearby symbols. For symbols outside the ±4 GiB range, the linker arranges a different sequence.
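The arithmetic behind the ADRP plus :lo12: pair is worth spelling out. The sketch below is a model of the address computation only; `adrp` and `lo12` are hypothetical function names, and the addresses are made-up examples.

```python
PAGE = 1 << 12   # ADRP works in 4 KiB pages

def adrp(pc, target):
    """Value produced by `adrp xN, target` when executed at address pc."""
    # The instruction encodes a signed 21-bit page delta fixed at link time.
    delta_pages = (target >> 12) - (pc >> 12)
    if not -(1 << 20) <= delta_pages < (1 << 20):
        raise ValueError("target is outside the ±4 GiB ADRP range")
    return (pc & ~(PAGE - 1)) + (delta_pages << 12)

def lo12(target):
    """The :lo12: part: the offset of target within its page."""
    return target & (PAGE - 1)

pc, var = 0x400ABC, 0x412345
address = adrp(pc, var) + lo12(var)

# Relocating the whole image by a multiple of the page size leaves the
# encoded page delta and the low 12 bits unchanged, which is what makes
# the sequence position-independent.
slide = 0x7F00_0000_0000
relocated = adrp(pc + slide, var + slide) + lo12(var + slide)
```

`address` equals `var`, and `relocated` equals `var + slide` even though neither instruction's immediate changed.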

Arithmetic and Logic

Add and subtract.

Assembly

add  x0, x1, x2            ; x0 = x1 + x2
add  x0, x1, #100          ; x0 = x1 + 100
sub  x0, x1, x2            ; x0 = x1 - x2
neg  x0, x1                ; x0 = -x1 (alias for sub x0, xzr, x1)
adds x0, x1, x2            ; flag-setting add
subs x0, x1, x2            ; flag-setting sub (used by cmp)

ADD and SUB accept shifted register operands, allowing one instruction to compute x1 + (x2 << 4):

Assembly

add  x0, x1, x2, lsl #4    ; x0 = x1 + (x2 << 4)
add  x0, x1, x2, lsr #4    ; x0 = x1 + (x2 >> 4) [logical]
add  x0, x1, x2, asr #4    ; x0 = x1 + (x2 >> 4) [arithmetic]

This is the AArch64 equivalent of x86's LEA: a fast scaled addition. Compilers use it for index computations.

Multiply.

Assembly

mul   x0, x1, x2           ; x0 = x1 * x2 (low 64 bits)
umulh x0, x1, x2           ; x0 = high 64 bits of unsigned x1*x2
smulh x0, x1, x2           ; x0 = high 64 bits of signed x1*x2
madd  x0, x1, x2, x3       ; x0 = x3 + x1*x2 (multiply-add)
msub  x0, x1, x2, x3       ; x0 = x3 - x1*x2

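Multiply-accumulate is a single instruction (MADD/MSUB), useful in numerical code. There is no flag-setting variant: multiplication does not produce overflow flags. The low 64 bits of a product are the same for signed and unsigned operands, so the signed/unsigned distinction appears only in the high-half and widening forms (smulh/umulh, smull/umull).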

Divide.

Assembly

udiv x0, x1, x2            ; x0 = x1 / x2 (unsigned)
sdiv x0, x1, x2            ; x0 = x1 / x2 (signed)

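Division by zero produces 0 rather than raising an exception, and the one overflowing signed case, INT64_MIN / -1, returns INT64_MIN. There is no remainder instruction; to get the remainder, multiply the quotient by the divisor and subtract from the dividend: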

Assembly

sdiv x3, x1, x2            ; x3 = x1 / x2
msub x4, x3, x2, x1        ; x4 = x1 - x3*x2 = x1 mod x2

Compilers know this pattern and emit it for %.
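A small model of the sdiv/msub pair shows the rounding and the edge cases. It is a sketch with hypothetical helper names (`to_signed`, `sdiv`, `msub`, `srem`); note that it has to avoid Python's own `//` and `%` on negative numbers, because those round toward negative infinity while `sdiv` truncates toward zero.

```python
MASK64 = (1 << 64) - 1

def to_signed(x):
    """Interpret a 64-bit register value as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def sdiv(n, d):
    """Model `sdiv`: truncate toward zero; dividing by zero yields 0."""
    n, d = to_signed(n), to_signed(d)
    if d == 0:
        return 0
    q = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        q = -q
    return q & MASK64            # INT64_MIN / -1 wraps back to INT64_MIN

def msub(a, b, c):
    """Model `msub xd, a, b, c`: c - a*b, wrapped to 64 bits."""
    return (c - a * b) & MASK64

def srem(n, d):
    """Remainder via the sdiv + msub sequence."""
    return msub(sdiv(n, d), d, n)

quotient = to_signed(sdiv(-7, 2))     # -3, as in C
remainder = to_signed(srem(-7, 2))    # -1, as in C
```

One consequence of the divide-by-zero rule: the sequence yields the dividend itself as the "remainder" when the divisor is 0, since the quotient is 0.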

Bitwise.

Assembly

and  x0, x1, x2            ; bitwise AND
orr  x0, x1, x2            ; bitwise OR (note: ORR not OR)
eor  x0, x1, x2            ; bitwise XOR (note: EOR not XOR)
mvn  x0, x1                ; bitwise NOT (alias for orn x0, xzr, x1)
ands x0, x1, x2            ; flag-setting AND
tst  x1, x2                ; AND but discard result, set flags (alias for ands xzr, x1, x2)

Bitwise instructions also accept a shifted register operand (lsl, lsr, asr, and additionally ror), but unlike add/sub they have no extended-register form.

Shifts and bitfield.

Assembly

lsl  x0, x1, #4            ; logical shift left
lsr  x0, x1, #4            ; logical shift right
asr  x0, x1, #4            ; arithmetic shift right
ror  x0, x1, #4            ; rotate right
ubfx x0, x1, #4, #8        ; unsigned bitfield extract (extract 8 bits starting at bit 4)
sbfx x0, x1, #4, #8        ; signed bitfield extract
bfi  x0, x1, #4, #8        ; bitfield insert
ubfm, sbfm, bfm            ; underlying generic forms

The bitfield instructions are particularly powerful: UBFX/SBFX/BFI make bitfield manipulation a single instruction, where baseline x86-64 needs a shift-and-mask sequence for extraction and a shift-mask-or sequence for insertion.

CLZ counts leading zeros; RBIT reverses bit order; REV/REV16/REV32 byte-reverse for endianness conversion. Single-instruction primitives that compilers and intrinsics use heavily.
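For readers who think in shifts and masks, the three bitfield aliases can be written out as follows. This is a minimal model on 64-bit values; the function names mirror the mnemonics, and the operand order follows the assembly (`lsb` first, then `width`).

```python
MASK64 = (1 << 64) - 1

def ubfx(src, lsb, width):
    """Unsigned extract: `width` bits of src starting at bit `lsb`, zero-extended."""
    return (src >> lsb) & ((1 << width) - 1)

def sbfx(src, lsb, width):
    """Signed extract: same field, sign-extended to 64 bits."""
    field = ubfx(src, lsb, width)
    if field >> (width - 1):          # top bit of the field is set
        field -= 1 << width
    return field & MASK64

def bfi(dst, src, lsb, width):
    """Insert the low `width` bits of src into dst at bit `lsb`; other bits of dst survive."""
    mask = ((1 << width) - 1) << lsb
    return (dst & ~mask & MASK64) | ((src << lsb) & mask)

unsigned_field = ubfx(0xABCD, 4, 8)    # 0xBC
signed_field = sbfx(0xABCD, 4, 8)      # 0xFFFFFFFFFFFFFFBC
inserted = bfi(0xFFFF, 0x5, 4, 8)      # 0xF05F
```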

Comparisons

Assembly

cmp  x0, x1                ; flags = x0 - x1, discard result (alias for subs xzr, x0, x1)
cmp  x0, #100              ; flags = x0 - 100
cmn  x0, x1                ; flags = x0 + x1 ("compare negative")
tst  x0, x1                ; flags = x0 & x1

Following a comparison, conditional branches and conditional selects use the flags.

Control Flow

Unconditional branches.

Assembly

b    label                 ; branch (jump) to label
bl   label                 ; branch and link (function call): x30 = pc+4, then jump
br   x0                    ; branch register (indirect jump)
blr  x0                    ; branch with link to register (indirect call)
ret                        ; return: jump to x30 (link register); like br x30, but marked as a function return for the predictor

Note bl is the function-call instruction (saves return address in x30); ret returns by jumping to x30. No PUSH/POP in the prologue/epilogue is required; the link register provides a register-based return mechanism. The compiler explicitly stores x30 to the stack only when the function calls another function (since x30 would otherwise be clobbered).

Conditional branches.

Assembly

b.eq label              ; branch if equal (Z=1)
b.ne label              ; branch if not equal (Z=0)
b.lt label              ; branch if less than, signed (N!=V)
b.le label              ; branch if less or equal, signed
b.gt label              ; branch if greater than, signed (Z=0 and N=V)
b.ge label              ; branch if greater or equal, signed (N=V)
b.lo / b.cc             ; lower (unsigned less than) (C=0)
b.hi                    ; higher (unsigned greater than)
b.ls                    ; lower or same (unsigned)
b.hs / b.cs             ; higher or same (unsigned)
b.mi / b.pl             ; minus / plus (sign)
b.vs / b.vc             ; overflow set / clear

Conditional branches have ±1 MiB range. For longer ranges, the compiler emits a conditional branch around an unconditional branch.
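The condition suffixes listed above are just predicates over NZCV. The sketch below writes them out and evaluates them on the flags a `cmp` would produce; `cmp_flags`, `CONDITIONS`, and `taken` are hypothetical names, and the flag computation models `cmp a, b` as `a + ~b + 1` on 64-bit values.

```python
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """NZCV after `cmp a, b` on 64-bit registers."""
    a &= MASK64
    b &= MASK64
    full = a + (~b & MASK64) + 1
    result = full & MASK64
    n = result >> 63
    z = int(result == 0)
    c = full >> 64                        # 1 means no borrow
    v = ((a ^ b) & (a ^ result)) >> 63    # signed overflow of a - b
    return n, z, c, v

CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "hs": lambda n, z, c, v: c == 1,              # also spelled cs
    "lo": lambda n, z, c, v: c == 0,              # also spelled cc
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,
    "ls": lambda n, z, c, v: c == 0 or z == 1,
    "ge": lambda n, z, c, v: n == v,
    "lt": lambda n, z, c, v: n != v,
    "gt": lambda n, z, c, v: z == 0 and n == v,
    "le": lambda n, z, c, v: z == 1 or n != v,
}

def taken(cond, flags):
    return CONDITIONS[cond](*flags)

flags = cmp_flags(-1, 1)
signed_less = taken("lt", flags)      # True: -1 < 1 as signed values
unsigned_higher = taken("hi", flags)  # True: 0xFFFF...FFFF > 1 as unsigned values
```

One `cmp` serves both interpretations; the choice between signed and unsigned comparison is made entirely by the condition suffix on the consuming instruction.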

Compare-and-branch combinations. AArch64 has fused compare-branch instructions for common patterns:

Assembly

cbz  x0, label          ; if x0 == 0, branch
cbnz x0, label          ; if x0 != 0, branch
tbz  x0, #5, label      ; if bit 5 of x0 is 0, branch (test bit and branch)
tbnz x0, #5, label      ; if bit 5 of x0 is 1, branch

These are single instructions, no separate compare needed. They save an instruction (no cmp) and don't pollute the flags register, simplifying OoO execution.

Conditional select.

Assembly

csel  x0, x1, x2, eq    ; x0 = x1 if eq else x2
csinc x0, x1, x2, eq    ; x0 = x1 if eq else x2+1
csinv x0, x1, x2, eq    ; x0 = x1 if eq else ~x2
csneg x0, x1, x2, eq    ; x0 = x1 if eq else -x2

These are AArch64's branchless-select primitives, equivalent to x86's CMOVcc. Compilers use them aggressively for branchless code.

Conditional set/inc/inv/neg with one operand:

Assembly

cset  x0, eq            ; x0 = 1 if eq else 0 (alias of csinc x0, xzr, xzr, ne)
csetm x0, eq            ; x0 = -1 if eq else 0

CSET is the equivalent of x86's SETcc, materializing a condition into a 0/1 register value.
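The whole conditional-select family reduces to four base operations plus aliases that invert the condition. The sketch below models that relationship on 64-bit values; the condition is passed as an already-evaluated boolean, and the function names mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1

def csel(cond, a, b):
    return a if cond else b

def csinc(cond, a, b):
    return a if cond else ((b + 1) & MASK64)

def csinv(cond, a, b):
    return a if cond else (~b & MASK64)

def csneg(cond, a, b):
    return a if cond else (-b & MASK64)

# Aliases: each inverts the condition and feeds xzr or a repeated source.
def cset(cond):
    return csinc(not cond, 0, 0)       # csinc xd, xzr, xzr, inverted cond

def csetm(cond):
    return csinv(not cond, 0, 0)       # csinv xd, xzr, xzr, inverted cond

def cneg(cond, a):
    return csneg(not cond, a, a)       # csneg xd, xn, xn, inverted cond

def abs64(x):
    """Branchless absolute value: compare with zero, then negate if negative."""
    x &= MASK64
    negative = (x >> 63) == 1
    return cneg(negative, x)
```

`cset(True)` gives 1 and `csetm(True)` gives all ones, which is why the latter is handy for building masks. In this model `abs64` leaves INT64_MIN unchanged, as two's-complement negation does.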

Atomic Operations

ARMv8.0 has only the load-exclusive / store-exclusive mechanism for atomics:

Assembly

loop:
    ldxr x0, [x1]           ; load-exclusive
    add  x0, x0, #1         ; modify
    stxr w2, x0, [x1]       ; store-exclusive; w2 = 0 on success, 1 on failure
    cbnz w2, loop           ; retry if failed

This is the LR/SC pattern (Chapter 30): load-exclusive marks the line as monitored; store-exclusive succeeds only if the line has not been modified since.
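Why the retry loop is needed becomes clearer with a toy model of the exclusive monitor. The sketch below is a deliberate simplification: it keeps one monitored address per core and clears a monitor only when that exact address is written, whereas real hardware tracks a larger granule and may fail a store-exclusive for other reasons. `Memory` and `atomic_increment` are hypothetical names.

```python
class Memory:
    """Toy shared memory with one exclusive monitor per core."""

    def __init__(self):
        self.cells = {}
        self.monitors = {}                 # core id -> monitored address

    def ldxr(self, core, addr):
        self.monitors[core] = addr         # arm this core's monitor
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value
        # Any write to the address clears every monitor watching it.
        for core in [c for c, a in self.monitors.items() if a == addr]:
            del self.monitors[core]

    def stxr(self, core, addr, value):
        if self.monitors.get(core) != addr:
            return 1                       # status 1: failed, nothing written
        self.store(addr, value)
        return 0                           # status 0: success

def atomic_increment(mem, core, addr, interfere=None):
    """The ldxr / add / stxr / cbnz loop; returns how many attempts it took."""
    attempts = 0
    while True:
        attempts += 1
        value = mem.ldxr(core, addr)
        if interfere and attempts == 1:
            interfere()                    # another core writes in the window
        if mem.stxr(core, addr, value + 1) == 0:
            return attempts

mem = Memory()
mem.store(0x100, 10)
tries = atomic_increment(mem, 0, 0x100,
                         interfere=lambda: atomic_increment(mem, 1, 0x100))
```

Core 0's first store-exclusive fails because core 1 wrote the location in between, so it reloads and succeeds on the second attempt; both increments land and the cell ends at 12.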

ARMv8.1 added LSE (Large System Extensions), which provides single-instruction atomics:

Assembly

mov   w0, #1
ldadd w0, w1, [x2]      ; atomically: w1 = [x2]; [x2] = old + w0
ldset, ldclr, ldeor     ; atomic OR, AND-NOT, XOR
swp                     ; atomic exchange
cas                     ; compare-and-swap
casa, casl, casal       ; CAS with acquire/release/both

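LSE is now nearly universal in modern AArch64 chips. With -march=armv8.1-a (or later), compilers emit LSE instructions inline; with -moutline-atomics they instead call small helper routines that choose between LSE and the exclusive-pair loop at run time, so one binary runs well on both kinds of hardware.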

Memory ordering. AArch64 has a weakly ordered memory model. To enforce ordering, use:

DMB ISH (Data Memory Barrier, Inner Shareable): full memory barrier across cores.
DMB ISHLD: order earlier loads before later memory ops.
DMB ISHST: order earlier stores before later stores.
DSB: synchronization barrier (waits for completion).
ISB: instruction synchronization barrier.

Or use load-acquire / store-release instructions:

Assembly

ldar  x0, [x1]          ; load-acquire (subsequent ops don't move before)
stlr  x0, [x1]          ; store-release (preceding ops don't move after)
ldaxr, stlxr            ; LR/SC variants with acquire/release

ldar/stlr are the cheapest way to express acquire/release semantics. They map directly to C++ memory_order_acquire/memory_order_release. Compilers emit them when source code uses std::atomic operations with these orderings.

05.Calling Convention (AAPCS64)

The standard calling convention on AArch64 is the AAPCS64 (Procedure Call Standard for AArch64), used by Linux, macOS, iOS, Android, and Windows on ARM (with minor variants).

Argument passing.

Integer / pointer args 1-8: in x0-x7.
FP / vector args 1-8: in v0-v7.
Additional args: on the stack.
Return value: in x0 (and x1 if 128-bit). FP return in v0.
Indirect result location (for large struct returns): x8 holds the address.
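The register assignment rules above can be sketched as a tiny allocator. This is a heavily simplified illustration: it handles only scalar integer/pointer and scalar floating-point arguments, assumes every stack argument takes an 8-byte slot, and ignores aggregates, 128-bit values, and variadic calls. `assign_args` is a hypothetical name.

```python
def assign_args(kinds):
    """Map a sequence of 'int' / 'fp' argument kinds to their locations."""
    next_int = 0          # next free register in x0-x7
    next_fp = 0           # next free register in v0-v7
    stack_offset = 0      # byte offset of the next stack argument
    locations = []
    for kind in kinds:
        if kind == "int" and next_int < 8:
            locations.append(f"x{next_int}")
            next_int += 1
        elif kind == "fp" and next_fp < 8:
            locations.append(f"v{next_fp}")
            next_fp += 1
        else:
            locations.append(f"[sp, #{stack_offset}]")
            stack_offset += 8
    return locations

mixed = assign_args(["int", "fp", "int", "fp"])
many = assign_args(["int"] * 10)
```

`mixed` comes out as x0, v0, x1, v1: the integer and floating-point counters advance independently, so an FP argument does not consume an integer register slot. In `many`, the ninth and tenth integers spill to `[sp, #0]` and `[sp, #8]`.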

Caller-saved (volatile). x0-x17, v0-v7, v16-v31, plus x18 on platforms that do not reserve it. The caller must save these before a call if it wants them preserved. A call also overwrites x30 with the return address.

Callee-saved. x19-x28, v8-v15 (only the lower 64 bits of v8-v15, technically). The callee must preserve or save/restore these.

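Stack. Grows downward. SP must be 16-byte aligned at every public function call, and in addition whenever it is used as the base register of a memory access: with stack alignment checking enabled, as operating systems commonly configure it, a misaligned SP-based access faults.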

Frame pointer. x29 is the frame pointer; x30 is the link register. The standard prologue:

Assembly

function:
    stp  x29, x30, [sp, #-16]!   ; push fp, lr
    mov  x29, sp                  ; new fp
    sub  sp, sp, #N               ; allocate locals (N must be 16-byte multiple)
    ; ... body ...
    add  sp, sp, #N               ; deallocate locals
    ldp  x29, x30, [sp], #16      ; pop fp, lr
    ret

For leaf functions (functions that don't call others), x30 doesn't need to be saved, and the prologue can be omitted entirely if locals fit in registers.

The convention is straightforward and consistent. Apple's ABI and Microsoft's ABI for Windows on ARM diverge in small details (e.g., x18 is reserved on Apple/Windows but available on Linux, vector argument layout for vararg differs), but the core is shared.

06.Common Idioms

A few idioms appear frequently in compiled AArch64 code.

Zeroing. Use xzr/wzr:

Assembly

mov   x0, xzr           ; x0 = 0
mov   w0, wzr           ; w0 = 0 (zeros all 64 bits of x0)
str   xzr, [x1]         ; store zero

Reads of the zero register depend on no earlier instruction, so these forms introduce no false dependencies.

Comparing to zero. Use cbz/cbnz if the result is for a branch:

Assembly

cbz x0, .Lzero ; if x0 == 0, branch

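If the result feeds something other than a branch, set the flags with tst (or cmp) and consume them with a conditional instruction:

Assembly

tst   x0, x0            ; sets flags (alias for ands xzr, x0, x0)
cset  w1, eq            ; w1 = 1 if x0 == 0, else 0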

Branchless absolute value.

Assembly

cmp   x0, #0
cneg  x0, x0, lt        ; if lt (negative), negate; else keep

After the compare, a single cneg (conditional negate) expresses the selection with no branch. As with any two's-complement negation, INT64_MIN maps to itself.

Sign extension.

Assembly

sxtw  x0, w1            ; sign-extend 32-bit w1 to 64-bit x0
sxth  x0, w1            ; sign-extend 16-bit
sxtb  x0, w1            ; sign-extend 8-bit

Or use the S form of a load directly: ldrsw x0, [x1] loads 32 bits with sign extension to 64.

Loop counter.

Assembly

.Lloop:
    ldr   w0, [x1], #4   ; load and post-increment
    add   w2, w2, w0     ; accumulate
    subs  x3, x3, #1     ; decrement counter
    b.ne  .Lloop          ; loop if not zero

The subs ... b.ne pattern is the canonical decrementing loop.

07.Compiler Output Walk-Through

The same example as in Chapter 33: array sum.

#include <stddef.h>   /* size_t */

int sum_array(const int* a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

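Compiled with clang -O2 --target=aarch64-linux-gnu, with vectorization and unrolling disabled (for example -fno-vectorize -fno-unroll-loops) to keep the loop scalar; by default clang typically emits a NEON-vectorized loop for this function. The output has roughly this shape (register choices are illustrative):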

Assembly

sum_array:
    cbz     x1, .Lret_zero        ; n == 0: return 0
    mov     x8, #0                ; i = 0
    mov     w9, #0                ; s = 0 (kept in w9; x0 still holds a)
.Lloop:
    ldr     w10, [x0, x8, lsl #2] ; load a[i]
    add     w9, w9, w10           ; s += a[i]
    add     x8, x8, #1            ; i++
    cmp     x8, x1
    b.ne    .Lloop
    mov     w0, w9                ; return value = s
    ret
.Lret_zero:
    mov     w0, #0
    ret

Notice:

Argument a is in x0, n is in x1.
Return value is in w0/x0.
The scaled-register addressing [x0, x8, lsl #2] encodes a[i] as *(a + i*4) in one instruction.
The loop is 5 instructions, similar density to the x86 version.
cbz handles the early exit cleanly.

For very simple loops, AArch64 and x86 produce similar instruction counts. AArch64 instructions are uniform 4 bytes; x86 instructions average ~3-4 bytes; the code-size difference is small at this scale.

08.Position-Independent Code

PIC on AArch64 uses ADRP/ADD/LDR sequences:

Assembly

; Reading a global int 'g' (with ARMv8 small code model)
adrp x0, :got:g            ; page address of g's GOT entry
ldr  x0, [x0, :got_lo12:g] ; load g's actual address
ldr  w0, [x0]              ; load g's value

For local symbols (defined in the same module), a simpler ADRP/ADD without GOT works:

Assembly

adrp x0, mylocal
add  x0, x0, :lo12:mylocal
ldr  w0, [x0]              ; load mylocal's value

The ADRP/ADD pattern is similar to x86-64's RIP-relative addressing, just with the address computation explicit (one instruction for the page, one for the offset). This is needed because a 32-bit AArch64 instruction can't fit a full 32- or 64-bit displacement.

09.Thread-Local Storage

AArch64 uses TPIDR_EL0 (the EL0 software thread ID register) as the TLS base:

Assembly

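mrs  x0, tpidr_el0                      ; x0 = thread pointer (ELF local-exec model shown)
add  x0, x0, #:tprel_hi12:var, lsl #12  ; add the high 12 bits of var's offset from the thread pointer
add  x0, x0, #:tprel_lo12_nc:var        ; add the low 12 bits
ldr  w1, [x0]                           ; read thread-local 'var'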

The TLS base is set up by the OS or runtime when creating each thread. Reading TPIDR_EL0 is unprivileged (cheaper than a syscall, and not needing fs/gs games as in x86).

10.Privileged vs. Unprivileged

User mode (EL0) can run nearly all integer/FP/SIMD instructions. Privileged operations are restricted:

System register access (MRS/MSR): most system regs are EL1+ only; some are accessible from EL0 (TPIDR_EL0, CNTVCT_EL0, etc.).
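Cache maintenance (DC, IC, TLBI): mostly EL1+. The OS can enable a few operations for EL0, such as DC CVAU (clean data cache to the point of unification); TLBI is never available to user code.
HVC, SMC: hypervisor and secure-monitor calls, issued from EL1 or above to enter EL2 or EL3. The EL0 counterpart is SVC, the system-call instruction.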
ERET: exception return; EL1+ only.

Attempting a privileged operation from EL0 traps to EL1; on Linux the kernel typically responds by delivering SIGILL to the process.

11.The Weak Memory Model in Practice

The AArch64 memory model is fundamentally weaker than x86's TSO, and this difference is the single most common source of subtle bugs when porting concurrent code from x86 to ARM. A complete formal treatment belongs in Chapter 31 (Cache Coherence and Consistency); this section gives the practical view from the programmer's seat.

Under AArch64's weakly-ordered memory model, the hardware is free to reorder essentially any pair of memory accesses to different locations (load-load, load-store, store-load, and store-store) from the perspective of other observers, except where the program explicitly forbids reordering. x86's TSO permits only one of these: a store followed by a load from a different address may appear reordered, while load-load, load-store, and store-store order is preserved. Code that silently relied on those three guarantees can fail on ARM without warning.

Four mechanisms restrict reordering:

Address dependencies. A load whose result feeds into the address of a subsequent load creates an address dependency, and the hardware must perform the two loads in order. Code that traverses linked lists or reads pointer-flag-data tuples relies on this implicitly.
Acquire/release accesses. LDAR/LDAPR (load-acquire) prevent any later access from being reordered before them; STLR (store-release) prevents any earlier access from being reordered after it. These are the same one-way fences C11/C++11 expose as memory_order_acquire and memory_order_release. They are cheap on AArch64 — essentially free in straight-line code on modern Apple and Cortex cores — and are the right tool for almost all lock and message-passing code.
Atomic read-modify-write. ARMv8.1 LSE instructions (LDADD, CAS, SWP, STADD, ...) come with optional acquire (-A), release (-L), and acquire-release (-AL) suffixes; the suffix-free form is relaxed (no ordering). The acquire-release form (e.g. CASAL) is the AArch64 equivalent of x86's LOCK CMPXCHG.
Explicit barriers. DMB ISH (data memory barrier, inner shareable) is the full bidirectional fence between memory accesses; DSB waits for completion of prior accesses; ISB is an instruction synchronization barrier used after self-modifying code or after changing system state. DMB ISHLD and DMB ISHST are weaker partial barriers (load- or store-only). Barriers are more expensive than acquire/release; correct concurrent code prefers acquire/release where possible.
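The classic message-passing test shows what these mechanisms buy. The sketch below is a toy enumerator, not a faithful model of the architecture: it treats "weak ordering" as permission to permute each thread's two accesses, and treats release/acquire (an stlr on the flag store paired with an ldar on the flag load) as keeping program order in both threads. All names are hypothetical.

```python
from itertools import permutations

def interleavings(a, b):
    """Yield every merge of two operation sequences that keeps each one's order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

def outcomes(ordered):
    """Set of (r1, r2) results the reader can observe."""
    writer = (("store", "data", 1), ("store", "flag", 1))
    reader = (("load", "flag", "r1"), ("load", "data", "r2"))
    # Ordered: only program order. Relaxed: either access may go first.
    writer_orders = [writer] if ordered else list(permutations(writer))
    reader_orders = [reader] if ordered else list(permutations(reader))
    seen = set()
    for w in writer_orders:
        for r in reader_orders:
            for trace in interleavings(w, r):
                mem = {"data": 0, "flag": 0}
                regs = {}
                for op, loc, arg in trace:
                    if op == "store":
                        mem[loc] = arg
                    else:
                        regs[arg] = mem[loc]
                seen.add((regs["r1"], regs["r2"]))
    return seen

relaxed = outcomes(ordered=False)
with_acq_rel = outcomes(ordered=True)
```

The outcome (1, 0), where the reader sees the flag set but stale data, appears in `relaxed` and is absent from `with_acq_rel`. That outcome is the bug that ports from x86 tend to hit.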

The pre-LSE atomics built from LDXR/STXR (load-exclusive / store-exclusive) implement compare-and-swap as a software loop: load the value with reservation, compare, conditionally store, retry on failure. This style remains valid and is what ARMv8.0-A targets use, but LSE is generally faster and simpler for contended atomics. Code built for an ARMv8.1-A or later baseline (-march=armv8.1-a) uses LSE directly; code built for the ARMv8.0-A baseline with outline atomics calls helper routines that consult AT_HWCAP and use whichever path the host supports.

The practical advice for AArch64 concurrent programming: use <stdatomic.h> or std::atomic with explicit memory orders, use memory_order_acquire/memory_order_release where that is all the algorithm needs (on AArch64 a memory_order_seq_cst load or store also compiles to LDAR or STLR, while a standalone seq_cst fence costs a DMB ISH), and treat any code that worked on x86 without atomics as suspect when porting. The cost of getting this wrong is rare, hard-to-reproduce data races that manifest only on specific cores and under specific timing.

12.Practical Tools

aarch64-linux-gnu-objdump -d binary, or llvm-objdump -d binary: disassemble.
aarch64-linux-gnu-gcc -S -O2 file.c, or clang -S -O2 --target=aarch64-linux-gnu file.c: compile to assembly (--target is a clang option; GCC selects the target through the cross-compiler's name).
Compiler Explorer (godbolt.org): supports AArch64 across many compilers.
perf annotate on ARM Linux: instruction-level profiling.
ARM Architecture Reference Manual: the canonical reference (massive PDF; thousands of pages).
Third-party instruction references in the style of Felix Cloutier's x86 pages: these exist for AArch64; ARM's official docs are comprehensive but heavy.

13.Summary

AArch64 is a clean RISC ISA: 31 general-purpose 64-bit registers plus xzr and sp, fixed 32-bit instruction width, large register file with structured encoding, weak memory model, explicit flag-setting (S-suffix), and a uniform set of addressing modes including pre/post-indexed forms and shifted/extended register operands. Common patterns — function prologues with STP, branchless selects with CSEL/CSET, compare-and-branch with CBZ/CBNZ/TBZ, atomic LSE operations, acquire/release with LDAR/STLR — are central to typical compiler output.

Compared with x86-64, AArch64's encoding is far simpler, the register file is twice as large, conditional execution is explicit and limited, the memory model is weaker (requiring more programmer awareness), and SIMD is its own first-class subsystem (NEON and SVE) rather than a layered set of extensions on a base. The next chapter steps up to the system level: exception levels, MMU, interrupts (GIC), system registers, boot.
