Part V·ISA Case Studies·Chapter 40 of 62

AArch64 SIMD and Vector

May 16, 2026·20 min read·advanced

This chapter covers ARM's vector and SIMD facilities: NEON (the fixed-128-bit SIMD that has been standard in AArch64 since the beginning), SVE and SVE2 (Scalable Vector Extension, ARM's variable-length vector ISA introduced in ARMv8.2 and extended in ARMv9), and SME (Scalable Matrix Extension, the most recent addition for matrix-style workloads). The treatment parallels Chapter 35's coverage of x86-64's FP and SIMD, with comparisons throughout.

ARM's approach to vector computing has evolved differently from x86's. Where Intel and AMD have layered fixed-width extensions (SSE 128, AVX 256, AVX-512 512), ARM moved to a variable-length design with SVE: the same instruction works on whatever vector width the implementation provides. Code compiled once for SVE runs unchanged on cores with 128-bit, 256-bit, 512-bit, or longer vectors, automatically benefiting from the wider hardware. This is a structurally different solution to the SIMD problem, with its own trade-offs.

01.NEON: ARM's Original SIMD

NEON is the SIMD extension introduced with ARMv7-A (around 2005) and made mandatory in AArch64. Every AArch64 processor implements NEON; it is the baseline SIMD for AArch64 software.

Registers

NEON shares its register file with scalar floating-point: 32 128-bit registers, V0 through V31. The same register can be viewed as:

vN.16b — 16 packed bytes.
vN.8h — 8 packed halfwords (16-bit).
vN.4s — 4 packed singles (32-bit floats or ints).
vN.2d — 2 packed doubles (64-bit floats or ints).
vN.16b, vN.8b — full 128-bit or low 64-bit views.
bN, hN, sN, dN, qN — scalar views of various widths.

The scalar variants (sN, dN) are how AArch64 expresses scalar floating-point: there is no separate FP register file. fadd s0, s1, s2 adds two single-precision FP values; fadd v0.4s, v1.4s, v2.4s adds four pairs of singles in parallel. The same V0 register holds both interpretations.

Doubling the register count compared to x86's xmm0-xmm15 (SSE through AVX2; only AVX-512 widens the file to 32) is a significant ISA-level advantage: more registers reduce spill traffic, and more vectorized loops fit entirely in registers.

NEON Instructions

NEON has hundreds of instructions. Most take the form op Vd.T, Vn.T, Vm.T, where the arrangement specifier T (16b, 8h, 4s, 2d, and so on) selects the lane count and width:

Assembly

; integer arithmetic
add  v0.4s, v1.4s, v2.4s    ; add 4 packed 32-bit ints
sub  v0.16b, v1.16b, v2.16b ; subtract 16 packed bytes
mul  v0.4s, v1.4s, v2.4s    ; multiply 4 packed 32-bit ints

; floating-point arithmetic
fadd v0.4s, v1.4s, v2.4s    ; 4 single-precision adds
fmul v0.2d, v1.2d, v2.2d    ; 2 double-precision multiplies
fmla v0.4s, v1.4s, v2.4s    ; 4 fused multiply-accumulate (v0 += v1*v2)

; logical
and  v0.16b, v1.16b, v2.16b
orr  v0.16b, v1.16b, v2.16b
eor  v0.16b, v1.16b, v2.16b

; loads / stores
ldr  q0, [x0]                  ; load 128 bits as q0 (full vector)
ldp  q0, q1, [x0]              ; load pair of vectors
ld1  {v0.16b}, [x0], #16       ; NEON-specific load with post-increment
ld2  {v0.4s, v1.4s}, [x0]      ; deinterleaving load: separates AoS into SoA
st4  {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]  ; interleaving store

The structure loads and stores (LD1-LD4, ST1-ST4) are a NEON-specific feature. LD2, LD3, and LD4 read N-element structures from consecutive memory and distribute element k of each structure to register k (ST2-ST4 do the reverse); LD1 and ST1 transfer one to four registers without any interleaving. Useful for processing arrays of structs:

Assembly

; Suppose memory contains [R0,G0,B0,R1,G1,B1,...] (interleaved RGB)
ld3 {v0.16b, v1.16b, v2.16b}, [x0]
; v0 = R0..R15, v1 = G0..G15, v2 = B0..B15

This single instruction does what would take many shuffles in SSE/AVX. It is heavily used in audio, image, and graphics code.
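The same deinterleave is reachable from C through the vld3q_u8 intrinsic. The following is a minimal sketch, not a tuned routine: the function name rgb_deinterleave and its planar output layout are assumptions made for illustration, and a scalar loop handles the tail (and the whole job on non-NEON targets).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split interleaved RGB bytes into three planes. */
void rgb_deinterleave(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b,
                      size_t pixels)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* One LD3 reads 48 bytes and hands back 16 R, 16 G and 16 B values. */
    for (; i + 16 <= pixels; i += 16) {
        uint8x16x3_t px = vld3q_u8(rgb + 3 * i);
        vst1q_u8(r + i, px.val[0]);
        vst1q_u8(g + i, px.val[1]);
        vst1q_u8(b + i, px.val[2]);
    }
#endif
    /* Scalar tail; also the complete fallback when NEON is unavailable. */
    for (; i < pixels; i++) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```

Keeping the scalar tail separate avoids reading past the end of the input when the pixel count is not a multiple of 16.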

Permute and Shuffle

NEON has flexible element permutations:

Assembly

zip1 v0.4s, v1.4s, v2.4s    ; interleave low halves
zip2 v0.4s, v1.4s, v2.4s    ; interleave high halves
uzp1 v0.4s, v1.4s, v2.4s    ; deinterleave: keep even-indexed elements of v1:v2
uzp2 v0.4s, v1.4s, v2.4s    ; deinterleave: keep odd-indexed elements of v1:v2
trn1 v0.4s, v1.4s, v2.4s    ; transpose
trn2 v0.4s, v1.4s, v2.4s
ext  v0.16b, v1.16b, v2.16b, #4   ; extract bytes from concatenated source
tbl  v0.16b, {v1.16b}, v2.16b      ; arbitrary byte shuffle (table lookup)

TBL (table lookup) is particularly powerful: each byte of the destination is set to one of the bytes from a table of 1 to 4 source vectors, selected by the corresponding byte of a separate index vector; an out-of-range index produces 0. This makes NEON capable of arbitrary byte-level permutations in one instruction (or at most a few). x86's PSHUFB is similar but limited to single-vector lookups.
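As an illustration of the lookup rule, here is a small sketch of a single-register TBL. The wrapper name tbl16 is hypothetical; on AArch64 it uses the vqtbl1q_u8 intrinsic, and elsewhere a scalar loop that models the same behaviour, including the zero result for out-of-range indices.

```c
#include <stdint.h>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: out[i] = idx[i] < 16 ? table[idx[i]] : 0 */
void tbl16(const uint8_t table[16], const uint8_t idx[16], uint8_t out[16])
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* Single-register TBL: indices 16..255 produce zero bytes. */
    vst1q_u8(out, vqtbl1q_u8(vld1q_u8(table), vld1q_u8(idx)));
#else
    /* Scalar model of the same lookup rule. */
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
#endif
}
```

Passing indices 15 down to 0 reverses the bytes of a vector; any other permutation is just a different index vector.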

Crypto and Special Operations

NEON has crypto extensions when the implementation supports them:

AESE / AESD / AESMC / AESIMC: AES round operations.
SHA1H / SHA1C / SHA1P / SHA1M: SHA-1 round operations.
SHA256H / SHA256H2 / SHA256SU0 / SHA256SU1: SHA-256 operations.
PMULL / PMULL2: polynomial (carryless) multiply.
SHA512 and others added in later extensions.

With these instructions a modern high-end core such as Apple's M-series can run AES at multiple GB/s per core, whereas a table-based software AES typically manages only a few hundred MB/s per core, so the speedup is dramatic.

FP Special Cases

NEON's scalar FP and vector FP are unified. Instructions like FSQRT, FRSQRTE (reciprocal square root estimate), FRECPE (reciprocal estimate), FCMP, FMIN/FMAX, etc., work on both scalar and vector. The scalar form is what clang -O2 emits for double arithmetic; the vector form is what auto-vectorization or intrinsics produce.

NEON's IEEE 754 compliance is configurable: control bits in FPCR select between strict IEEE behavior (with denormals, NaN propagation, etc.) and faster relaxed modes, namely "default NaN" (the DN bit) and flush-to-zero (the FZ bit).

02.SVE: Scalable Vector Extension

SVE (announced in 2016, first shipped in Fujitsu's A64FX and later in Neoverse V1) is a new vector ISA. Its defining characteristic: the vector width is implementation-defined, anywhere from 128 bits to 2048 bits in 128-bit increments. Code compiled for SVE runs on any SVE implementation, dynamically scaling to whatever width is present.

The Idea

A traditional SIMD ISA bakes vector width into instructions: SSE has 128-bit instructions, AVX and AVX2 have 256-bit, AVX-512 has 512-bit. Every time you want a wider machine, you need new instructions, new registers, new encodings. Software has to be ported.

SVE takes a different approach: every SVE instruction operates on a vector of whatever-the-hardware-has bits. There is no "vaddps_256" and "vaddps_512"; there is just "ADD" (vector add), and the hardware processes as many lanes as fit in the implementation's vector length.

The benefit: write the loop once, get speedups automatically when running on wider hardware. The same SVE binary runs unchanged on a 128-bit, a 256-bit (Neoverse V1), or a 512-bit (A64FX) implementation and processes proportionally more lanes per instruction on the wider ones.

The cost: code is somewhat harder to write (must be loop-control aware and use predication for boundary conditions) and somewhat harder to optimize (compiler can't make assumptions about specific lane counts).

SVE Registers

SVE adds:

Z0-Z31: 32 scalable vector registers. Each Z register's lower 128 bits aliases the corresponding NEON V register (so SVE and NEON share register state). Above 128 bits, the Z register has additional lanes that NEON cannot see.
P0-P15: 16 predicate registers. Each predicate register has 1 bit per byte of vector (so a 256-bit vector has 32 predicate bits per register; a 512-bit vector has 64). Predicates are how SVE handles per-lane masking.
FFR: First-Fault Register, for fault-tolerant loads.

The vector length (VL) is queryable: RDVL X0, #1 puts VL in bytes into X0. This lets the program adjust loop strides at runtime.
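From C, the same query is available through the SVE intrinsic svcntb(). The sketch below is illustrative only: the helper names are made up for this example, and on a compiler target without SVE the query simply returns 0 so the code still builds.

```c
#include <stddef.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: vector length in bytes, or 0 when built without SVE. */
size_t sve_vector_bytes(void)
{
#if defined(__ARM_FEATURE_SVE)
    return (size_t)svcntb();      /* same value RDVL X0, #1 would produce */
#else
    return 0;
#endif
}

/* Lanes of a given element size that fit in one Z register. */
size_t lanes_per_vector(size_t vector_bytes, size_t elem_bytes)
{
    return vector_bytes / elem_bytes;
}

/* One predicate bit exists per vector byte. */
size_t predicate_bits(size_t vector_bytes)
{
    return vector_bytes;
}
```

For a 256-bit implementation (32 bytes) this gives 8 single-precision lanes and 32 predicate bits, matching the register description above.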

SVE Programming Model

A typical SVE loop:

Assembly

mov  x0, #0                    ; i = 0
mov  x1, #N                    ; count
whilelt p0.s, x0, x1           ; p0 = predicate: true for lanes where i+lane < N
.Lloop:
    ld1w  z0.s, p0/z, [x2, x0, lsl #2]    ; load array[i:i+VL]
    fadd  z0.s, z0.s, z0.s                ; double each element
    st1w  z0.s, p0, [x2, x0, lsl #2]      ; store back
    incw  x0                              ; i += VL/4 (number of single-precision lanes)
    whilelt p0.s, x0, x1                  ; update predicate
    b.first .Lloop                        ; loop while any lane active

Key elements:

WHILELT generates a predicate that is true for lanes where the index is still less than the bound. Handles loop-bound issues automatically: when only 3 elements remain in a 16-lane vector, only 3 lanes are active.
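LD1W/ST1W are predicated loads/stores: only active lanes access memory. Loads always use the zeroing form (/z), so inactive destination lanes become 0; stores leave the memory behind inactive lanes untouched.
INCW increments by the number of 32-bit words per vector, so the stride scales automatically with VL.
B.FIRST branches if the first lane of the predicate is active. Because WHILELT always produces a run of active lanes starting at lane 0, this is equivalent to "any lane still active", and the loop terminates once all elements are done.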

This loop is length-agnostic: it works for any VL. If VL is 128 bits (4 floats), each iteration does 4. If VL is 512 bits, each iteration does 16. No special-cased remainder loop needed; the predicate handles the tail.
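To make the tail handling concrete, the loop can be modelled in portable scalar C. This is a model of the semantics, not SVE code: the function name and the explicit lanes parameter (standing in for the hardware's VL/32) are assumptions for illustration.

```c
#include <stddef.h>

/* Scalar model of the WHILELT-governed loop: doubles x[0..n) using an
 * imaginary vector of `lanes` 32-bit lanes. Returns the iteration count. */
size_t double_all_model(float *x, size_t n, size_t lanes)
{
    size_t iterations = 0;
    /* i < n is exactly "first lane of the predicate is active" (B.FIRST). */
    for (size_t i = 0; i < n; i += lanes) {          /* INCW */
        for (size_t lane = 0; lane < lanes; lane++) {
            int active = (i + lane) < n;             /* WHILELT predicate bit */
            if (active)
                x[i + lane] = x[i + lane] + x[i + lane];
        }
        iterations++;
    }
    return iterations;
}
```

With 10 elements, a 4-lane model takes three iterations (the last with two active lanes) and a 16-lane model takes one; the results are identical either way, which is the point of length-agnostic code.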

Predication Throughout

Most SVE arithmetic instructions take a governing predicate:

Assembly

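fadd z0.s, p0/m, z0.s, z1.s    ; z0 += z1, only where p0 is true (merging)

movprfx z0.s, p0/z, z0.s       ; prefix: zero the inactive lanes of z0
fadd z0.s, p0/m, z0.s, z1.s    ; z0 = (z0+z1) where p0; 0 elsewhere

/m = merging (preserve inactive lanes), /z = zeroing (set inactive lanes to 0). Loads and compares take /z directly; destructive arithmetic such as FADD is encoded only in merging form, and the zeroing effect is obtained by prefixing it with MOVPRFX, as above. This is a generalization of AVX-512's mask registers, where every operation is mask-aware from the start.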
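The difference between the two forms is easiest to see in a scalar model. The following sketch is not SVE code; the function names and the bool-array representation of a predicate are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Merging form: inactive lanes of zd keep their previous contents. */
void fadd_merging(float *zd, const float *zm, const bool *pg, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        if (pg[i])
            zd[i] = zd[i] + zm[i];
}

/* Zeroing form: inactive lanes of zd are cleared. */
void fadd_zeroing(float *zd, const float *zm, const bool *pg, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        zd[i] = pg[i] ? zd[i] + zm[i] : 0.0f;
}
```

Merging is the natural choice when a result accumulates across iterations; zeroing avoids a false dependency on the old destination value when the inactive lanes will never be read.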

Predicated arithmetic enables vectorizing loops with conditionals:

for (int i = 0; i < N; i++) {
    if (a[i] > 0)
        b[i] = c[i] / a[i];
}

Vectorized in SVE:

Assembly

    whilelt p0.s, x_i, x_n                      ; initial loop predicate
.Lloop:
    ld1w   z0.s, p0/z, [x_a, x_i, lsl #2]      ; load a (predicated by loop control)
    fcmgt  p1.s, p0/z, z0.s, #0.0               ; p1 = a > 0 (only where p0)
    ld1w   z1.s, p1/z, [x_c, x_i, lsl #2]      ; load c only where needed
    fdiv   z1.s, p1/m, z1.s, z0.s               ; z1 = c / a where p1
    st1w   z1.s, p1, [x_b, x_i, lsl #2]         ; store only where p1
    incw   x_i
    whilelt p0.s, x_i, x_n
    b.first .Lloop

(Here x_a, x_b, x_c, x_i, and x_n are placeholders for whichever X registers hold the array bases, the index, and the bound.) The predicate p1 represents "lanes where a > 0", and subsequent loads, divisions, and stores all respect it. A scalar fallback is unnecessary; the predicate naturally handles the divergent control flow.
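The same conditional loop can be written with ACLE SVE intrinsics. This is a sketch under assumptions: the function name cond_div is invented here, and the scalar branch exists so the file also builds on targets without SVE.

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: b[i] = c[i] / a[i] wherever a[i] > 0. */
void cond_div(const float *a, const float *c, float *b, int n)
{
#if defined(__ARM_FEATURE_SVE)
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);             /* loop-control predicate */
        svfloat32_t va = svld1_f32(pg, a + i);
        svbool_t pos = svcmpgt_n_f32(pg, va, 0.0f);    /* lanes where a > 0 */
        svfloat32_t vc = svld1_f32(pos, c + i);        /* touch c only where needed */
        svfloat32_t q = svdiv_f32_x(pos, vc, va);      /* c / a on active lanes */
        svst1_f32(pos, b + i, q);                      /* b untouched elsewhere */
    }
#else
    /* Scalar fallback with identical results. */
    for (int i = 0; i < n; i++)
        if (a[i] > 0.0f)
            b[i] = c[i] / a[i];
#endif
}
```

The _x suffix on the divide says the inactive lanes are "don't care", which is safe here because the predicated store never writes them.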

Gather and Scatter

SVE has full gather/scatter support:

Assembly

ld1w  z0.s, p0/z, [x0, z1.s, sxtw]   ; gather: load from x0 + sign-extended z1[lane]
st1w  z0.s, p0,    [x0, z1.s, sxtw]   ; scatter

These let SVE handle indirect access patterns (sparse linear algebra, hash table probes, lookups) more naturally than fixed-width SIMD. AVX-512 has gather/scatter too, but SVE's predicated forms are particularly clean.
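In C, a gather looks like an ordinary predicated load whose address comes from a vector of indices. A minimal sketch, assuming a hypothetical helper named gather_f32; the intrinsic used scales each 32-bit index by the element size, so it corresponds to the sxtw #2 addressing form rather than the unscaled one.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: out[i] = table[idx[i]] for i in [0, n). */
void gather_f32(const float *table, const int32_t *idx, float *out, int n)
{
#if defined(__ARM_FEATURE_SVE)
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t vi = svld1_s32(pg, idx + i);                     /* indices */
        svfloat32_t v = svld1_gather_s32index_f32(pg, table, vi);  /* gather */
        svst1_f32(pg, out + i, v);
    }
#else
    /* Scalar fallback: the access pattern a gather replaces. */
    for (int i = 0; i < n; i++)
        out[i] = table[idx[i]];
#endif
}
```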

First-Fault Loads

Vectorizing loops with potentially out-of-bounds accesses is awkward. SVE introduces first-faulting loads:

Assembly

ldff1w z0.s, p0/z, [x0, x1, lsl #2]

If the first active lane's load faults, the instruction faults normally. If a later lane's load would fault, no fault is taken: the FFR (First-Fault Register) records which lanes loaded successfully, and the instruction completes with valid data for those lanes only.

This lets a loop read speculatively past the data it is sure about, for example scanning for an end-of-string marker a full vector at a time, and then continue from the last good lane without ever taking a spurious fault. Useful for strlen-like operations, parsing, and similar code where the end isn't known a priori.
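A strlen-style scan is the usual illustration. The sketch below uses ACLE first-faulting intrinsics and is meant as an outline of the pattern rather than a tuned routine; the function name strlen_ff is an assumption, and a plain byte loop stands in when SVE is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: length of a NUL-terminated string. */
size_t strlen_ff(const char *s)
{
#if defined(__ARM_FEATURE_SVE)
    const uint8_t *p = (const uint8_t *)s;
    const svbool_t all = svptrue_b8();
    size_t i = 0;
    for (;;) {
        svsetffr();                                  /* reset FFR to all-true */
        svuint8_t chunk = svldff1_u8(all, p + i);    /* first-faulting load */
        svbool_t ok = svrdffr();                     /* lanes that really loaded */
        svbool_t nul = svcmpeq_n_u8(ok, chunk, 0);   /* NUL bytes among them */
        if (svptest_any(ok, nul)) {
            svbool_t before = svbrkb_b_z(ok, nul);   /* lanes before first NUL */
            return i + (size_t)svcntp_b8(ok, before);
        }
        i += (size_t)svcntp_b8(all, ok);             /* advance past valid lanes */
    }
#else
    /* Scalar fallback. */
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
#endif
}
```

Because the first active lane always belongs to the string, each iteration makes progress by at least one byte, and a page boundary just shortens the chunk instead of crashing.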

Other SVE Features

Horizontal reductions: FADDV (FP add across all lanes), FADDA (the same, but strictly in lane order for reproducible rounding), UADDV (unsigned add reduction), MIN/MAX reductions.
Inter-lane operations: EXT (extract from concatenation), TBL (table lookup), SPLICE (combine using predicate).
Predicate manipulation: BRKB/BRKA (break propagation), plus MATCH (find equal elements), which arrived with SVE2.
Histograms: HISTCNT (count occurrences across lanes), also an SVE2 addition.

The instruction set is rich. Compared with NEON (which has many distinct instructions for many element sizes), SVE has fewer but more polymorphic instructions, with the predicate and the lane size carrying the variation.

03.SVE2: Extending SVE

SVE2 (ARMv9, 2021) extended SVE with additional instructions targeting common use cases that SVE1 didn't cover well:

Integer arithmetic that NEON supported but SVE1 didn't: saturating arithmetic (with rounding modes), absolute differences, etc.
Polynomial arithmetic (PMULLB/PMULLT) for crypto.
Crypto (AES, SHA): vectorized across SVE registers.
Bit manipulation: bit insert, bit deposit/extract, bit reverse.
Histograms and table lookups generalized.

The goal of SVE2: be a complete superset of NEON's functionality, so programs compiled for SVE2 don't need to fall back to NEON for any operation. Some workloads (DSP, multimedia, crypto) had been NEON-only because SVE1 lacked the right primitives; SVE2 brings them in.

Adoption

SVE/SVE2 adoption has been gradual. Implementations with SVE/SVE2:

Cortex-A510, A715, A720, X3, X4, X925: SVE2 (mobile and laptop Cortex cores).
Neoverse V1: SVE (256-bit, used in AWS Graviton 3).
Neoverse N2, V2: SVE2 (N2 in Microsoft Cobalt 100; V2 in AWS Graviton 4 and NVIDIA Grace).
Neoverse V3: SVE2, still with a 128-bit VL (used in Microsoft Cobalt 200).

Apple's M-series chips, as of M4, do not implement SVE/SVE2 as a general-purpose vector ISA. Older chips relied on Apple's own undocumented matrix coprocessor (AMX) alongside NEON, and the M4 exposes that functionality through SME, whose streaming mode offers only a subset of SVE2. This is a notable holdout: the largest single vendor of high-performance AArch64 silicon (in personal computers) does not yet ship SVE.

The implication for portable code: SVE/SVE2 is good for servers and (recent) Cortex-based mobile. NEON remains the lowest-common-denominator AArch64 SIMD. Multi-versioned binaries dispatch on getauxval(AT_HWCAP) bits.
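A runtime check of this kind might look like the sketch below. It assumes Linux on AArch64 (where getauxval and the HWCAP_* constants from <asm/hwcap.h> are available); the function name best_simd_level and the returned strings are illustrative, and any other platform reports "scalar".

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Hypothetical helper: name of the best vector ISA the kernel reports. */
const char *best_simd_level(void)
{
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    (void)hwcap2;                       /* unused if headers predate SVE2 */
#ifdef HWCAP2_SVE2
    if (hwcap2 & HWCAP2_SVE2)
        return "sve2";
#endif
#ifdef HWCAP_SVE
    if (hwcap & HWCAP_SVE)
        return "sve";
#endif
    if (hwcap & HWCAP_ASIMD)            /* ASIMD is the kernel's name for NEON */
        return "neon";
    return "scalar";
#else
    return "scalar";
#endif
}
```

A multi-versioned library would typically call such a function once at start-up and store function pointers to the matching kernels.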

04.VL Choices in Practice

Implementation VL choices to date:

Fugaku supercomputer (Fujitsu A64FX): 512-bit SVE, the original SVE deployment.
AWS Graviton 3 (Neoverse V1): 256-bit SVE (each core has two 256-bit SVE pipes, which also serve as four 128-bit NEON pipes).
Graviton 4 (Neoverse V2): 128-bit SVE2 per core.
Microsoft Cobalt 100 (Neoverse N2): 128-bit SVE2.
NVIDIA Grace (V2): 128-bit SVE2.
Cortex-X4: 128-bit SVE2.

Notice the convergence to 128-bit SVE2 in modern designs. The wider VLs (256, 512) have had limited deployment; 128-bit SVE2 with more cores often wins on perf/area/power.

This is a practical issue: SVE's promise of automatic scaling to wider VLs hasn't fully materialized in mainstream silicon. Software written for SVE2 with a 128-bit VL works perfectly on hardware with a wider VL, but the wider VL hasn't been the priority for chip designers (apart from Fugaku and a few research designs).

05.Comparing SVE to AVX-512

It's instructive to compare SVE/SVE2 with x86-64 AVX-512.

Aspect	AVX-512	SVE / SVE2
Vector width	Fixed 512 bits	Implementation-defined 128-2048
Mask registers	8 (k0-k7)	16 (P0-P15)
Mask granularity	Per-lane	Per-byte
FP precision	Single, double; AVX-512 BF16, FP16	Single, double, half, BF16
Gather/scatter	Yes	Yes
First-fault loads	No	Yes
Approach to portability	Add new wider extension	Same code scales

SVE's portability advantage is real but theoretical until wider VLs ship in volume. AVX-512's advantage of being available now in many Intel server chips (and Zen 4/5) is concrete. Software realities tend to favor concrete advantages over theoretical ones, which is why SVE adoption has been slower than originally anticipated.

06.SME: Scalable Matrix Extension

SME (ARMv9-A, 2022) adds matrix-style operations to AArch64. It extends SVE with:

Streaming SVE mode: SVE operations execute in a special mode with potentially different VL (often wider than normal mode) optimized for matrix kernels.
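ZA storage: a square 2D array of state, SVL × SVL bytes for a streaming vector length of SVL bytes, addressed as one or more tiles depending on element size (for example, four tiles of 32-bit elements or eight tiles of 64-bit elements).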
Outer-product instructions: e.g., FMOPA computes an outer product of two vectors, accumulating into a tile.

The model: for matrix multiplication, you load a column from A and a row from B, compute their outer product, and accumulate into the output tile; summing these rank-1 updates over the shared dimension yields the product. SME provides fast outer-product instructions and large accumulator storage to make this efficient.
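The outer-product formulation can be modelled in plain C. This is a scalar sketch of the arithmetic only, not SME code: the 4×4 tile size, the function names, and the row-major layouts are assumptions chosen for illustration.

```c
#include <stddef.h>

enum { TILE = 4 };

/* Rank-1 update za += u (x) v, modelling what one outer-product
 * instruction such as FMOPA does to an accumulator tile. */
static void outer_product_accumulate(float za[TILE][TILE],
                                     const float u[TILE], const float v[TILE])
{
    for (int r = 0; r < TILE; r++)
        for (int c = 0; c < TILE; c++)
            za[r][c] += u[r] * v[c];
}

/* C (4x4) = A (4xK) * B (Kx4), all row-major, as a sum of K outer products. */
void matmul_tile(const float *A, const float *B, float *C, size_t K)
{
    float za[TILE][TILE] = {{0}};
    for (size_t k = 0; k < K; k++) {
        float a_col[TILE];
        for (int r = 0; r < TILE; r++)
            a_col[r] = A[(size_t)r * K + k];         /* column k of A */
        outer_product_accumulate(za, a_col, B + k * TILE);   /* row k of B */
    }
    for (int r = 0; r < TILE; r++)
        for (int c = 0; c < TILE; c++)
            C[r * TILE + c] = za[r][c];
}
```

Each step reads 2 × TILE inputs and performs TILE² multiply-accumulates, which is why a large accumulator tile gives such high arithmetic intensity.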

SME is the AArch64 counterpart to Intel AMX. Like AMX, it targets ML workloads where matrix multiplication dominates compute time. As of 2026, SME is in Apple's M4 chips (where it supersedes the private coprocessor that Apple also called AMX) and in several recent ARM cores.

The programming model is more complex than NEON or SVE: you switch to streaming mode, configure tiles, run a sequence of outer-product accumulations, and switch back. Compilers for matrix-heavy code (like neural networks) emit SME via libraries (Apple's Accelerate, ARM's KleidiAI, etc.). Direct SME programming is still relatively rare.

07.Apple's AMX (and SME)

Apple's M-series chips have, since the M1 (2020), included an undocumented coprocessor called AMX (Apple Matrix Extension, distinct from Intel's AMX of the same name). It was originally accessed only through Apple's Accelerate framework — direct AMX assembly was not supported.

In M4 (2024), Apple exposed similar functionality through standard ARMv9 SME instructions. The capability is roughly: very high throughput matrix multiplication, with FP32, FP16, BF16, and int8 modes. Used heavily by Apple's CoreML, Apple's video pipelines, and image processing.

Apple's specific SME implementation in M4 has very wide effective vectors (512-bit streaming SVE) and is competitive with dedicated AI accelerators on smaller models.

08.Programming SIMD on AArch64

Three approaches, paralleling x86-64:

Compiler auto-vectorization. GCC and Clang auto-vectorize NEON for many simple loops. SVE auto-vectorization is supported but more subtle (vector-length agnosticism). Auto-vectorization is the easy win for typical numerical loops.

void vadd(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

With -O3 -march=armv8-a+simd, this gets vectorized to NEON. With -march=armv9-a+sve2, it gets vectorized to SVE2.

Intrinsics. ARM provides intrinsics in <arm_neon.h> (NEON) and <arm_sve.h> (SVE/SVE2). NEON example:

#include <arm_neon.h>
void vadd(float* a, float* b, float* c, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vaddq_f32(va, vb);
        vst1q_f32(c + i, vc);
    }
    for (; i < n; i++) c[i] = a[i] + b[i];
}

SVE example:

#include <arm_sve.h>
void vadd(float* a, float* b, float* c, int n) {
    svbool_t pg;
    for (int i = 0; (pg = svwhilelt_b32(i, n)), svptest_first(svptrue_b32(), pg); i += svcntw()) {
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svfloat32_t vc = svadd_f32_z(pg, va, vb);
        svst1_f32(pg, c + i, vc);
    }
}

The SVE version is length-agnostic — same code works for any VL.

Inline assembly. Used for the most performance-critical kernels, but rarely beyond what intrinsics can express.

09.Worked Example: NEON Matrix Multiply Kernel

Like the AVX-512 example in Chapter 35, here's a NEON micro-kernel for matrix multiply (single precision, accumulating into 4×4 tile):

#include <arm_neon.h>

// C[4x4] += A[4xK] * B[Kx4]; row-major, with leading dimensions lda, ldb, ldc.
void matmul_4x4_neon(const float* A, const float* B, float* C, int K, int lda, int ldb, int ldc) {
    // The 4x4 tile of C stays in four registers for the whole k loop.
    float32x4_t c0 = vld1q_f32(C + 0*ldc);
    float32x4_t c1 = vld1q_f32(C + 1*ldc);
    float32x4_t c2 = vld1q_f32(C + 2*ldc);
    float32x4_t c3 = vld1q_f32(C + 3*ldc);

    for (int k = 0; k < K; k++) {
        float32x4_t b = vld1q_f32(B + k*ldb);      // row k of B
        c0 = vfmaq_n_f32(c0, b, A[0*lda + k]);     // c0 += A[0][k] * b
        c1 = vfmaq_n_f32(c1, b, A[1*lda + k]);     // c1 += A[1][k] * b
        c2 = vfmaq_n_f32(c2, b, A[2*lda + k]);     // c2 += A[2][k] * b
        c3 = vfmaq_n_f32(c3, b, A[3*lda + k]);     // c3 += A[3][k] * b
    }

    vst1q_f32(C + 0*ldc, c0);
    vst1q_f32(C + 1*ldc, c1);
    vst1q_f32(C + 2*ldc, c2);
    vst1q_f32(C + 3*ldc, c3);
}

(A simplified sketch; production BLAS kernels use larger tiles, unroll k, and prefetch.) Key features: the C tile stays in registers and is updated by multiply-accumulating a vector load from B against a broadcast element of A. Here vfmaq_n_f32 broadcasts a scalar; kernels that unroll k by four usually load four A values at once and select each with vfmaq_laneq_f32 (FMA with broadcast from a specific lane). A wide AArch64 core can issue up to 4 such FMAs per cycle. This is the basis of OpenBLAS, Eigen, and similar libraries' AArch64 paths.

10.Throughput

For dense FP throughput on a high-end AArch64 core (Apple M4 P-core, Cortex-X4, Neoverse V2), typical numbers:

2 to 4 NEON FMA pipes, each 128-bit (4 SP lanes or 2 DP lanes).
1 to 2 SVE FMA pipes (where applicable), each 128 to 256 bits.
Typical peak with 2 to 4 128-bit pipes: 8-16 SP FMA lanes per cycle = 16-32 SP FLOPs/cycle; 8-16 DP FLOPs/cycle.

Apple's M4 P-core runs at over 4 GHz and has very wide vector resources, putting it in the same class as a Zen 4 or Lion Cove core for FP throughput. The big difference vs. AVX-512 is per-instruction peak: a single 512-bit AVX-512 FMA produces 32 SP FLOPs (16 lanes × 2 ops); a single 128-bit NEON FMA produces 8 SP FLOPs (4 lanes × 2 ops). ARM cores compensate by having more FMA pipes and (usually) higher clock efficiency.
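The peak figures follow from a one-line formula: pipes × lanes per vector × 2 (an FMA counts as a multiply plus an add). A small sketch, with a function name chosen purely for illustration:

```c
/* Peak floating-point operations per cycle for a core with `fma_pipes`
 * FMA units, each `vector_bits` wide, on `elem_bits`-wide elements. */
unsigned peak_flops_per_cycle(unsigned fma_pipes, unsigned vector_bits,
                              unsigned elem_bits)
{
    unsigned lanes = vector_bits / elem_bits;
    return fma_pipes * lanes * 2u;     /* 2 = one multiply + one add per lane */
}
```

Four 128-bit pipes on single precision give 4 × 4 × 2 = 32 FLOPs per cycle, and two such pipes on double precision give 2 × 2 × 2 = 8. These are upper bounds; sustained throughput also depends on load bandwidth and dependency chains.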

11.Crypto, Dot-Product, and Domain-Specific NEON Extensions

Beyond plain integer and FP arithmetic, NEON has accumulated a series of small extensions targeting specific application domains. They are individually narrow but collectively important for explaining why AArch64 is competitive on workloads that are sometimes assumed to favour x86's wider AVX vectors.

The Cryptography Extension (architecturally optional, though present on nearly all application-class cores) provides hardware AES, SHA-1, and SHA-256 round operations, with SHA-512 and SHA-3 added as further optional features in ARMv8.2-A, plus polynomial multiplication for GHASH and CRC. The pattern mirrors x86's AES-NI and SHA-NI:

Assembly

    aese   v0.16b, v1.16b      ; one AES encryption round on v0 with key v1
    aesmc  v0.16b, v0.16b      ; mix columns (combined with previous round on most cores)

Many modern AArch64 cores fuse AESE + AESMC (and the corresponding decrypt pair) into a single micro-op, so AES throughput often matches or exceeds AES-NI on a per-clock basis. AES-GCM on Apple M-series and Neoverse V cores reaches 5–10 GB/s per core, comparable to or better than contemporaneous x86 implementations. The SHA-3 extension is one of the few mainstream ISA-level implementations of Keccak primitives, used by Linux's crypto API and several blockchain workloads.

The Dot Product Extension (UDOT/SDOT, ARMv8.2 optional, ARMv8.4 mandatory) computes signed and unsigned 8-bit-to-32-bit dot products with one instruction:

Assembly

    udot v0.4s, v1.16b, v2.16b   ; v0[i] += sum_{j=0..3} v1[4i+j] * v2[4i+j], unsigned 8-bit

Four multiply-accumulate operations per output lane, sixteen per 128-bit vector, with the accumulator at higher precision to avoid overflow. The instruction was introduced specifically for INT8 quantized neural-network inference, the dominant arithmetic shape for on-device ML. Apple, Qualcomm, and ARM cores all implement it at full vector throughput. Combined with SDOT and the matrix-multiply accumulate (MMLA, see below), AArch64 NEON delivers competitive INT8 inference performance without needing AVX-512 VNNI.
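The lane arithmetic of UDOT can be written out as follows. The wrapper name udot_4x is hypothetical; when the compiler advertises the dot-product feature it uses the vdotq_u32 intrinsic, and otherwise a scalar loop that models the same result.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: acc[i] += sum over j = 0..3 of a[4i+j] * b[4i+j]. */
void udot_4x(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
#if defined(__ARM_FEATURE_DOTPROD)
    /* One UDOT: sixteen 8-bit products folded into four 32-bit sums. */
    vst1q_u32(acc, vdotq_u32(vld1q_u32(acc), vld1q_u8(a), vld1q_u8(b)));
#else
    /* Scalar model of the same computation. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i] += (uint32_t)a[4 * i + j] * (uint32_t)b[4 * i + j];
#endif
}
```

The 32-bit accumulator is what makes long INT8 reductions safe: each product is at most 255 × 255, so many thousands of them can be summed before overflow becomes a concern.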

The Matrix Multiply Accumulate instructions (MMLA family, optional from ARMv8.2, mandatory from ARMv8.6) extend the dot-product idea to small matrix tiles: each instruction multiplies a 2×8 block of 8-bit values by an 8×2 block (2×4 by 4×2 for BF16) and accumulates the 2×2 result into the four 32-bit lanes of a 128-bit register, doubling throughput on dense GEMM relative to UDOT alone. Recent Apple Silicon implements these; ARM's Neoverse V2 and the Cortex-X4 do as well.

The BFloat16 Extension (BF16, optional from ARMv8.2, mandatory from ARMv8.6) adds BFloat16 multiply-accumulate, with 16-bit operands and 32-bit accumulation. BF16's range matches FP32 (only the mantissa is reduced), making it a near drop-in replacement for FP32 in neural-network training. The instructions are heavily used in PyTorch, TensorFlow, and ONNX runtimes on AArch64.

The JavaScript FP convert instruction (FJCVTZS, ARMv8.3) deserves a mention as a curiosity and an instructive example. JavaScript's numeric type is double-precision FP, but bitwise operators silently convert to a 32-bit integer using a specific rule (truncate toward zero, modulo 2^32). The conversion is hot in JavaScript runtimes, and AArch64 added a single instruction implementing it exactly. The lesson is that very narrow workload-specific instructions can make architectural sense if the dynamic count is high enough — a recurring theme in modern ISA design.

Beyond these, the Random Number instructions (RNDR, RNDRRS, ARMv8.5) parallel x86's RDRAND/RDSEED, drawing from on-die entropy. The Pointer Authentication instructions (PAC family, Chapter 39) live conceptually between the integer and security domains. Each extension is small; together they round out the AArch64 ISA into something that handles modern workloads competitively across cryptographic, ML, and security-sensitive code paths.

12.Summary

AArch64's SIMD story has three layers. NEON is the universal 128-bit fixed-width SIMD, mandatory in every AArch64 chip; it is the workhorse for multimedia, crypto, and many numerical kernels. SVE/SVE2 is the variable-length vector ISA, designed to scale across implementations; it is deployed in modern server-class and Cortex-X cores but not in Apple's M-series. SME is the matrix extension for ML and matrix-heavy workloads.

Compared to x86-64, AArch64's SIMD is more elegantly organized at the architectural level (predication everywhere in SVE; uniform encoding) but has had a slower rollout into mainstream silicon (especially the wider SVE VLs and SME). Programs that target AArch64 portably typically use NEON; high-performance code uses SVE2 where available, with NEON fallback.

The next chapter brings the AArch64 picture together by looking at how modern AArch64 cores are actually built: Cortex-X4, Neoverse V2, Apple's Avalanche/Everest, and Qualcomm's Oryon. It covers the micro-architectural choices, the cache hierarchies, and the comparison with their x86-64 contemporaries.
