Part V: ISA Case Studies

AArch64 System Architecture

May 16, 2026·23 min read·advanced

This chapter covers AArch64 from the operating system's and firmware's point of view: the four exception levels, the MMU and translation regime, exceptions and interrupts (the GIC), system registers, virtualization, security extensions, and the boot process. The treatment parallels Chapter 34's coverage of x86-64 system architecture; the differences highlight the design choices that distinguish ARM.

AArch64's system architecture is, in many ways, more elegantly organized than x86-64's. Where x86-64 carries decades of accreted features (segmentation, SMM, a four-ring system of which mostly two rings are used, multiple operating modes), AArch64 was announced in 2011 as a clean-slate design. The result is a more orthogonal model, though no less complex in absolute terms.

01. The Four Exception Levels

AArch64 organizes privilege into four Exception Levels (EL):

EL     Typical Use
EL0    User-mode applications
EL1    OS kernel
EL2    Hypervisor
EL3    Secure monitor / firmware

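The numbering goes from least privileged (EL0) to most privileged (EL3). A higher EL can generally configure and access the system registers of the levels below it, although the split between Secure and Non-secure state means this is not a strict superset in every configuration.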

Unlike x86's four rings (where rings 1 and 2 are mostly unused), all four AArch64 levels are widely used:

  • EL0: applications. iOS apps, Android apps, Linux user processes.
  • EL1: kernel. Linux, XNU, Windows on ARM kernel.
  • EL2: hypervisor. KVM, Hyper-V, Xen, Apple's Hypervisor Framework.
  • EL3: secure monitor. Boot firmware, ARM Trusted Firmware, the code that mediates between the secure and non-secure worlds for TrustZone.

The model is genuinely four-level. Phones with a secure boot chain often use all four: EL3 firmware runs at boot, sets up the secure world (a TrustZone OS such as Trusty or QSEE), then drops to EL2 for a hypervisor (used on some devices for isolation features), which runs the EL1 kernel hosting EL0 applications.

Privilege Transitions

Transitions between exception levels happen only on exception entry (to the same or a higher EL) and on exception return (to the same or a lower EL):

  • Lower-to-higher: synchronous exceptions, caused either by explicit instructions such as SVC (Supervisor Call, EL0→EL1, the system-call instruction), HVC (Hypervisor Call, EL1→EL2), and SMC (Secure Monitor Call, EL1/EL2→EL3), or by faults and trapped instructions; and asynchronous exceptions (IRQ, FIQ, SError).
  • Higher-to-lower: the ERET (Exception Return) instruction, which loads PC and PSTATE from saved exception state.

When an exception is taken, the processor:

  1. Saves the current PSTATE into SPSR_ELx (Saved Program Status Register) of the target EL.
  2. Saves the preferred return address into ELR_ELx (Exception Link Register at the target EL).
  3. Switches to the target exception level, masks interrupts in PSTATE, and selects that level's stack pointer.
  4. Branches to the corresponding entry in the vector table for the target EL.

ERET reverses this: restores PC from ELR_ELx, PSTATE from SPSR_ELx, drops back to the originating EL.

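Each EL has its own banked stack pointer (SP_EL0, SP_EL1, etc.), and each level that can take exceptions (EL1 to EL3) also has its own exception link register, saved status register, and vector base register. Taking an exception to a higher EL therefore doesn't overwrite the lower EL's saved state.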
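As a rough illustration of the banked-state idea, the following Python sketch models exception entry and ERET. It is a toy model under stated assumptions, not a description of real hardware: the Core class and its method names are invented here, PSTATE is reduced to just the current EL, and the vector address is passed in directly instead of being derived from VBAR_ELx.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model: PSTATE is reduced to the current exception level."""
    pc: int = 0
    el: int = 0
    elr: dict = field(default_factory=dict)   # ELR_ELx, keyed by target EL
    spsr: dict = field(default_factory=dict)  # SPSR_ELx, keyed by target EL

    def take_exception(self, target_el, vector_pc, return_pc):
        # Exceptions go to the same or a higher EL, and never to EL0.
        if target_el < max(self.el, 1):
            raise ValueError("exceptions cannot target a lower EL or EL0")
        self.spsr[target_el] = self.el   # save (simplified) PSTATE
        self.elr[target_el] = return_pc  # save preferred return address
        self.el = target_el
        self.pc = vector_pc

    def eret(self):
        # Restore PC and PSTATE from the current EL's banked registers.
        current = self.el
        self.pc = self.elr[current]
        self.el = self.spsr[current]


core = Core(pc=0x1000, el=0)
core.take_exception(1, vector_pc=0xFFFF_0400, return_pc=0x1004)  # SVC from EL0
core.take_exception(2, vector_pc=0x8_0400, return_pc=core.pc)    # trap to EL2
core.eret()  # back to EL1; EL1's ELR/SPSR were not disturbed
core.eret()  # back to EL0 at the instruction after the SVC
```

Because each level keeps its own ELR and SPSR, the nested trap to EL2 in the example does not clobber the return state that EL1 needs for its own ERET.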

Vector Tables

Each EL (except EL0, which never takes exceptions itself) has a Vector Base Address Register (VBAR_ELx) pointing to a vector table. The table has 16 entries, organized as 4 entries each for 4 cases:

  • Current EL with SP_EL0 (sometimes used for kernel testing).
  • Current EL with SP_ELx (normal kernel exception).
  • Lower EL using AArch64 (e.g., user from kernel's view).
  • Lower EL using AArch32 (legacy 32-bit user from 64-bit kernel).

Within each group of 4, the entries are:

  • Synchronous exception (faults, syscalls).
  • IRQ.
  • FIQ (fast interrupt).
  • SError (asynchronous abort, e.g., delayed memory error).

Each entry is 128 bytes (32 instructions), enough for short prologue code that branches to a longer handler. The full table is 16 × 128 bytes = 2 KiB, and VBAR_ELx must be 2 KiB-aligned.

This single vectoring mechanism handles both synchronous exceptions (page fault, illegal instruction, syscall) and asynchronous interrupts. Compare to x86's IDT (256 entries indexed by vector number): AArch64's is more structured, with the source EL and source state determining the entry, and the cause encoded in the Exception Syndrome Register (ESR_ELx).
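The table layout described above reduces to simple arithmetic. The sketch below computes the byte offset of an entry from VBAR_ELx; the string labels for the four source groups and four exception kinds are names chosen here for illustration, listed in the same order as in the text.

```python
ENTRY_SIZE = 0x80  # 128 bytes per vector entry

# Order matches the table layout: 4 source groups, 4 exception kinds each.
SOURCES = ("current_el_sp0", "current_el_spx",
           "lower_el_aarch64", "lower_el_aarch32")
KINDS = ("sync", "irq", "fiq", "serror")


def vector_offset(source, kind):
    """Byte offset from VBAR_ELx of the entry for (source, kind)."""
    index = SOURCES.index(source) * len(KINDS) + KINDS.index(kind)
    return index * ENTRY_SIZE


# A system call from 64-bit user space lands in the third group's first slot.
svc_entry = vector_offset("lower_el_aarch64", "sync")   # 0x400
```

An SVC from AArch64 user space therefore enters the kernel at VBAR_EL1 + 0x400, while an IRQ taken while already in the kernel on SP_EL1 enters at VBAR_EL1 + 0x280.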

02. MMU and Translation Tables

AArch64 has a comprehensive MMU with multi-level translation tables. The translation regime depends on the EL.

Multiple Translation Regimes

  • EL1 & EL0 share a translation regime, with two translation table base registers: TTBR0_EL1 (typically user-space) and TTBR1_EL1 (typically kernel-space).
  • EL2 has its own regime with TTBR0_EL2 (plus TTBR1_EL2 when the Virtualization Host Extensions of ARMv8.1 are enabled via HCR_EL2.E2H).
  • EL3 has its own regime.

The split between TTBR0 and TTBR1 is by the high bits of the virtual address: typically TTBR0 covers low addresses (user space), TTBR1 covers high addresses (kernel space). TCR_EL1 configures the size of each range (the T0SZ and T1SZ fields). On a context switch the kernel rewrites only TTBR0_EL1; the kernel half stays mapped via TTBR1_EL1. x86-64 has a single CR3, so it achieves the same effect by sharing the kernel's top-level entries across every process's page tables and relying on global pages or PCIDs to avoid flushing the TLB.
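To make the address split concrete, here is a small sketch of how the two ranges fall out of T0SZ and T1SZ. The function name and return labels are made up for this example, and it ignores top-byte-ignore and other refinements; with both fields set to 16, each half spans 48 bits.

```python
MASK64 = (1 << 64) - 1


def select_ttbr(va, t0sz=16, t1sz=16):
    """Return which base register translates va, or 'fault' for the gap.

    Simplified model: ignores top-byte-ignore (TBI) and 52-bit extensions.
    """
    va &= MASK64
    low_limit = 1 << (64 - t0sz)                # TTBR0 covers [0, low_limit)
    high_base = (1 << 64) - (1 << (64 - t1sz))  # TTBR1 covers [high_base, 2^64)
    if va < low_limit:
        return "TTBR0"
    if va >= high_base:
        return "TTBR1"
    return "fault"


user = select_ttbr(0x0000_7FFF_FFFF_F000)    # low half
kernel = select_ttbr(0xFFFF_8000_0000_0000)  # high half
hole = select_ttbr(0x0001_0000_0000_0000)    # neither range
```

Addresses in the gap between the two ranges have no translation regime behind them, which is why a stray non-canonical pointer faults immediately.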

Translation Table Structure

The AArch64 translation table is a multi-level table similar to x86's. The structure is configurable:

  • Granule size: 4 KiB, 16 KiB, or 64 KiB pages. Linux on AArch64 typically uses 4 KiB; iOS uses 16 KiB; some systems use 64 KiB.
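  • Number of levels: depends on virtual address size and granule. Typical: 4 levels for 48-bit addresses with 4 KiB granule (analogous to x86's PML4-PT). 52-bit virtual addresses are available through extensions: LVA with the 64 KiB granule, and LPA2 with the 4 KiB and 16 KiB granules, where the 4 KiB granule gains a fifth level (level -1).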
  • Block sizes: at intermediate levels, an entry can map a large block directly instead of pointing to the next level. With 4 KiB granule: 2 MiB blocks at level 2, 1 GiB blocks at level 1.
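For the common case of a 4 KiB granule and 48-bit addresses, each level indexes 512 entries (9 bits), and the low 12 bits are the page offset. The following sketch, with a function name chosen here for illustration, splits a virtual address into those fields and shows where the block sizes come from.

```python
def split_va_4k(va):
    """Split a 48-bit VA into table indices for a 4 KiB granule, 4 levels."""
    return {
        "l0": (va >> 39) & 0x1FF,   # each L0 entry spans 512 GiB
        "l1": (va >> 30) & 0x1FF,   # each L1 entry spans 1 GiB (block size)
        "l2": (va >> 21) & 0x1FF,   # each L2 entry spans 2 MiB (block size)
        "l3": (va >> 12) & 0x1FF,   # each L3 entry maps one 4 KiB page
        "offset": va & 0xFFF,
    }


L2_BLOCK = 1 << 21   # 2 MiB
L1_BLOCK = 1 << 30   # 1 GiB

example = split_va_4k((1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5)
```

A block mapping at level 2 simply stops the walk early: the 21 low bits of the address become the offset into a 2 MiB region instead of being split into an L3 index and a page offset.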

Each table entry is 8 bytes (64 bits) and encodes:

  • Validity bit.
  • Type (block, table, or page).
  • Output address (the physical address of the next-level table or the page).
  • Memory attributes (memory type via the MAIR register, shareability).
  • Access permissions (AP bits): read/write, EL0 access.
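  • Access flag (AF) and, where hardware dirty-state management is implemented, the Dirty Bit Modifier (DBM), which lets hardware mark an entry as written.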
  • Execute-never bits (UXN for unprivileged, PXN for privileged).
  • Contiguous hint (for groups of consecutive pages, helps TLB).

The encoding is more sophisticated than x86's PTE in some ways: the contiguous hint, the explicit shareability domain, the memory attribute index (MAIR-based) all exist for ARM-specific reasons.

MAIR and Memory Attributes

AArch64 uses an indirect memory-attribute system. The MAIR_ELx (Memory Attribute Indirection Register) is a 64-bit register holding 8 attribute encodings, each 8 bits wide. A page table entry contains a 3-bit attribute index selecting which of the 8 MAIR slots applies.

Typical configurations include:

  • Normal memory, write-back cacheable (RAM).
  • Normal memory, non-cacheable.
  • Device memory, nGnRnE (no Gathering, no Reordering, no Early write acknowledgment) — strict-ordered MMIO.
  • Device memory, nGnRE — somewhat looser.
  • Device memory, GRE — very loose, for write-combining.

Setting up MAIR is one of the OS's first MMU tasks. Once set, page tables just use indices; the indirection lets the OS reconfigure memory types globally without changing every PTE.
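The indirection can be sketched in a few lines. The attribute byte encodings below are the architectural values for these memory types as far as I can tell (Linux uses the same constants); the slot ordering and the helper names build_mair and attr_for_index are arbitrary choices for this example.

```python
# 8-bit attribute encodings (one byte per MAIR slot).
ATTR_ENCODINGS = {
    "device_nGnRnE": 0x00,
    "device_nGnRE": 0x04,
    "device_GRE": 0x0C,
    "normal_nc": 0x44,   # Normal, inner and outer non-cacheable
    "normal_wb": 0xFF,   # Normal, write-back, read/write allocate
}


def build_mair(slots):
    """Pack up to 8 named attributes into a 64-bit MAIR value, slot 0 first."""
    if len(slots) > 8:
        raise ValueError("MAIR has only 8 slots")
    value = 0
    for index, name in enumerate(slots):
        value |= ATTR_ENCODINGS[name] << (8 * index)
    return value


def attr_for_index(mair, attr_indx):
    """Look up the attribute byte selected by a PTE's 3-bit AttrIndx field."""
    return (mair >> (8 * attr_indx)) & 0xFF


mair = build_mair(["device_nGnRnE", "device_nGnRE", "normal_nc", "normal_wb"])
```

A page table entry with attribute index 3 would then resolve to write-back cacheable Normal memory; changing slot 3 in MAIR changes the behaviour of every such entry at once.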

TLB and Maintenance

The TLB structures are similar to x86: L1 TLBs (data and instruction, separate), L2 unified TLB, page-walk caches. Numbers vary by core — Apple's cores have particularly large TLBs (~3000+ entries unified L2).

TLB invalidation uses the TLBI instruction with various scopes:

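  • TLBI VAE1: invalidate the entry for one VA (and ASID) in the EL1&0 regime.
  • TLBI VMALLE1: invalidate all stage-1 EL1&0 entries for the current VMID, across all ASIDs (TLBI ASIDE1 is the per-ASID form).
  • TLBI ALLE1: invalidate all EL1&0 entries for every VMID (issued from EL2 or EL3).
  • TLBI VMALLS12E1: invalidate all stage-1 and stage-2 entries for the current VMID (for hypervisors).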

A DSB is required to ensure the invalidation has completed before subsequent accesses. The pattern TLBI ...; DSB ISH; ISB is common.

ARM's TLB invalidations can be broadcast in hardware: the Inner Shareable forms of TLBI (the IS-suffixed operations such as TLBI VAE1IS) invalidate matching entries on every core in the Inner Shareable domain, with no IPI needed. This is often cheaper than x86's IPI-based TLB shootdowns and can be a real performance advantage for kernel-intensive workloads.

ASIDs

To avoid full TLB flushes on context switch, AArch64 has Address Space Identifiers (ASIDs): each process gets an ASID (8 or 16 bits), and TLB entries are tagged with it. Switching to a different process changes the active ASID; TLB entries from other ASIDs simply don't match. This is the counterpart of x86's PCID, but it is a mandatory part of the architecture rather than an optional feature. The kernel manages a pool of ASIDs and recycles them when needed.

Two-Stage Translation (Virtualization)

For virtualization, AArch64 supports two-stage translation:

  • Stage 1: VA → IPA (Intermediate Physical Address). Done by guest's page tables.
  • Stage 2: IPA → PA. Done by hypervisor's page tables.

The hardware walks both stages on a TLB miss. This is the AArch64 equivalent of EPT/NPT on x86. The hypervisor loads VTTBR_EL2 (virtualization translation table base) with the current guest's stage-2 table base and VMID whenever it switches guests. The TLB caches the combined translation, tagged with the VMID (virtual machine ID).

Multi-stage translation is what lets a hypervisor run multiple VMs concurrently with each VM seeing its own contiguous physical address space, without the hypervisor intervening on every access.
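The composition of the two stages can be shown with a deliberately simplified model. Here each stage is a flat dictionary from page number to page number, which stands in for a real multi-level table walk; the function names and the 4 KiB page size are assumptions made for the example.

```python
PAGE = 4096


def translate(table, addr):
    """Single-stage lookup: a flat dict stands in for a real table walk."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise LookupError(f"translation fault at {addr:#x}")
    return table[page] * PAGE + offset


def two_stage(stage1, stage2, va):
    ipa = translate(stage1, va)    # guest-controlled: VA -> IPA
    return translate(stage2, ipa)  # hypervisor-controlled: IPA -> PA


guest_tables = {0x10: 0x2}        # guest maps VA page 0x10 to IPA page 0x2
hypervisor_tables = {0x2: 0x80}   # hypervisor backs IPA page 0x2 with PA page 0x80
pa = two_stage(guest_tables, hypervisor_tables, 0x10123)
```

The guest only ever sees and edits guest_tables; a missing entry in hypervisor_tables corresponds to a stage-2 fault, which is delivered to EL2 and not to the guest kernel.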

03. Exceptions

AArch64 exceptions fall into four groups:

  • Synchronous: caused by an instruction that just executed. Page faults, illegal instructions, alignment faults, system call (SVC), HVC, SMC, breakpoints.
  • IRQ (Interrupt Request): asynchronous, normal-priority interrupts.
  • FIQ (Fast Interrupt Request): asynchronous, higher-priority interrupts. Historically faster (with banked registers in AArch32). In AArch64, less distinct from IRQ, though the GIC distinguishes them.
  • SError: asynchronous error reporting (memory parity, ECC failures, etc.).

When a synchronous exception is taken at EL1, the ESR_EL1 (Exception Syndrome Register) holds an exception class (EC) field describing the cause and ISS (Instruction Specific Syndrome) bits with details. The kernel reads ESR_EL1 and dispatches.

Common exception classes:

EC value   Meaning
0x00       Unknown reason
0x07       FP / Advanced SIMD / SVE access trap
0x15       SVC (system call)
0x16       HVC
0x17       SMC
0x18       MSR/MRS trapped
0x20       Instruction abort (lower EL)
0x21       Instruction abort (same EL)
0x24       Data abort (lower EL)
0x25       Data abort (same EL)
0x26       SP alignment fault
0x2C       FP exception
0x2F       SError
0x30       Breakpoint (lower EL)
0x31       Breakpoint (same EL)
0x32       Software step (lower EL)
0x35       Watchpoint (same EL)
0x3C       BRK instruction

For a data or instruction abort, additional registers tell where:

  • FAR_EL1 (Fault Address Register): the virtual address that caused the fault.
  • ESR_EL1.ISS: more details (write vs. read, fault status code, level of translation).

The kernel reads these to handle page faults: was it a non-present page (demand-page in), a permission violation (SIGSEGV), a stack overflow (SIGSEGV with a stack-specific report), and so on.
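A sketch of the first step of that dispatch, decoding an ESR value into its fields, might look like the following. The field positions (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in bit 6 and the fault status code in bits 5:0) follow the architecture; the function name, the dictionary layout, and the short EC name table are illustrative choices.

```python
EC_NAMES = {
    0x15: "SVC",
    0x20: "Instruction abort (lower EL)",
    0x21: "Instruction abort (same EL)",
    0x24: "Data abort (lower EL)",
    0x25: "Data abort (same EL)",
}


def decode_esr(esr):
    """Split an ESR_ELx value into EC, IL, and ISS, plus data-abort details."""
    ec = (esr >> 26) & 0x3F
    info = {
        "ec": ec,
        "name": EC_NAMES.get(ec, "other"),
        "il": (esr >> 25) & 1,        # 1 = 32-bit instruction
        "iss": esr & 0x1FFFFFF,
    }
    if ec in (0x24, 0x25):
        info["write"] = bool(esr & (1 << 6))   # WnR: write, not read
        info["dfsc"] = esr & 0x3F              # data fault status code
    return info


# A user-space write that hit an unmapped page (translation fault, level 3).
fault = decode_esr(0x92000047)
```

With the decoded fields and the address from FAR_EL1, a page-fault handler has what it needs to choose between demand paging, copy-on-write handling, and delivering a signal.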

04. System Calls

User mode invokes the kernel via SVC #imm:

Assembly
mov x8, #syscall_number   // syscall number goes in x8
mov x0, arg0              // arguments go in x0-x5
mov x1, arg1
mov x2, arg2
...
svc #0                    // trap to EL1
// result in x0

The SVC instruction takes an immediate encoded in the instruction; Linux ignores it and uses x8 as the syscall number, with arguments in x0-x5. The transition is:

  1. SVC at EL0.
  2. Hardware switches to EL1, saves return PC in ELR_EL1, status in SPSR_EL1, jumps to the synchronous-exception entry in VBAR_EL1.
  3. Kernel saves user registers, dispatches by syscall number, services the call.
  4. Kernel sets x0 to the return value, restores other registers, executes ERET.
  5. Hardware drops back to EL0, restoring PC and PSTATE.

The SVC mechanism is uniform: same code path for any syscall number, same register convention. There's no equivalent of x86's MSR-driven SYSCALL/SYSRET configuration; the vector table is the only thing the kernel needs to set up.

A typical kernel exception entry:

Assembly
sync_exception_handler:
    sub sp, sp, #FRAME_SIZE   // allocate save area on the EL1 stack
    stp x0, x1, [sp, #0]
    stp x2, x3, [sp, #16]
    // ... save remaining registers, plus ELR_EL1 and SPSR_EL1 ...
    mov x0, sp                // pass save area as arg
    bl do_syscall             // C handler
    // ... restore registers ...
    add sp, sp, #FRAME_SIZE
    eret                      // return to EL0

Linux on AArch64 has heavily optimized this path; it is typically ~50 ns per syscall on a fast core, comparable to x86.

05. Interrupts and the GIC

AArch64 systems use the Generic Interrupt Controller (GIC) for interrupt management. There are several GIC versions:

  • GICv2: older, used in many embedded systems and some mobile chips. Limited to 8 cores. Memory-mapped interface.
  • GICv3: scalable, used in modern mobile and server chips. Supports thousands of cores. System-register interface (the CPU interface is accessed with MRS/MSR rather than MMIO).
  • GICv4: GICv3 plus virtualization improvements (direct injection of virtual interrupts).
  • GICv5: the newest generation, announced by ARM in 2025; a redesigned, more streamlined architecture.

The GIC has two main components:

  • Distributor (GICD): routes interrupts to specific cores, manages priorities, enables/disables.
  • Redistributor / CPU interface (GICR / GICC): per-core component that delivers interrupts to the CPU.

Interrupt types:

  • SGI (Software-Generated Interrupts): IPIs between cores.
  • PPI (Private Peripheral Interrupts): per-core (e.g., per-core timer).
  • SPI (Shared Peripheral Interrupts): standard hardware interrupts shared across cores.
  • LPI (Locality-specific Peripheral Interrupts, GICv3+): MSI-style, used for PCIe MSI.
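Each interrupt type occupies a fixed range of interrupt IDs (INTIDs), so the type can be recovered from the number alone. The sketch below uses the base GICv3 ranges; it deliberately lumps the extended PPI and SPI ranges added in later revisions together with the reserved space, and the function name is invented for the example.

```python
def classify_intid(intid):
    """Classify a GICv3 INTID using the base ranges only."""
    if intid < 0:
        raise ValueError("INTID must be non-negative")
    if intid <= 15:
        return "SGI"
    if intid <= 31:
        return "PPI"
    if intid <= 1019:
        return "SPI"
    if intid <= 1023:
        return "special"    # e.g. 1023 = no pending interrupt (spurious)
    if intid >= 8192:
        return "LPI"
    return "reserved_or_extended"
```

This is why a per-core timer shows up with an ID in the 16 to 31 range on every core, while a PCIe MSI routed through the ITS appears as an ID of 8192 or above.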

A core acknowledges an interrupt by reading the ICC_IAR1_EL1 register, gets the interrupt ID, services it, and writes EOI via ICC_EOIR1_EL1. The pattern is similar to x86's local APIC.

GICv3+ uses a system register interface: the CPU interface is exposed via system registers (MRS/MSR instructions on ICC_*_EL1 registers). This is faster than memory-mapped access (which GICv2 used) and better for virtualization.

The GIC also handles virtual interrupts: a hypervisor can inject interrupts directly into a guest VM by writing the list registers (ICH_LR<n>_EL2). The guest receives a virtual interrupt without the hypervisor needing to intervene on each interrupt return.

06. Generic Timer

AArch64 specifies a Generic Timer at the architecture level: every core has timer registers accessible via MRS/MSR. There are at least four timers per core:

  • Physical timer at EL1: kernel-controlled.
  • Virtual timer at EL1: hypervisor-controlled (offset from physical for VM time).
  • Physical timer at EL2: hypervisor's own timer.
  • Secure physical timer at EL3: firmware's timer.

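Each generates an interrupt when it expires (a PPI in GIC terms). Which timer Linux uses for its scheduling tick depends on how it was booted: typically a physical timer when it starts at EL2 on bare metal, and the virtual timer when it runs as a guest.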

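CNTVCT_EL0 (virtual counter) and CNTPCT_EL0 (physical counter) can be made readable from EL0 (the kernel controls this through CNTKCTL_EL1, and Linux normally exposes the virtual counter): mrs x0, cntvct_el0 reads the current 64-bit counter. These are the AArch64 equivalent of x86's TSC, used for timing in user space without a syscall. The frequency is reported by CNTFRQ_EL0; it is the system counter frequency, not the CPU clock, typically a few tens of MHz on older designs and a nominal 1 GHz from ARMv8.6 on.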
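Turning counter readings into wall-clock intervals is plain arithmetic on the two values. The sketch below assumes the counter values and the CNTFRQ_EL0 frequency have already been read by some other means; the 24 MHz figure in the example is just one plausible frequency, not a property of any particular chip.

```python
MASK64 = (1 << 64) - 1
NS_PER_SEC = 1_000_000_000


def ticks_to_ns(ticks, cntfrq_hz):
    """Convert counter ticks to nanoseconds (integer math, rounds down)."""
    return ticks * NS_PER_SEC // cntfrq_hz


def elapsed_ns(start, end, cntfrq_hz):
    """Elapsed time between two 64-bit counter reads, tolerating wraparound."""
    delta = (end - start) & MASK64
    return ticks_to_ns(delta, cntfrq_hz)


one_microsecond = elapsed_ns(100, 124, 24_000_000)   # 24 ticks at 24 MHz
```

Multiplying before dividing keeps precision for short intervals; Python's integers do not overflow, but code in C would need a 128-bit intermediate or a precomputed multiplier and shift.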

07. Performance Monitoring

AArch64 specifies a Performance Monitor Unit (PMU) at the architecture level:

  • A 64-bit cycle counter (PMCCNTR_EL0).
  • Several programmable counters (PMEVCNTR<n>_EL0), typically 4-8 per core.
  • Configuration registers for selecting events and filtering.

User-mode access is governed by PMUSERENR_EL0, which the kernel can set to allow unprivileged counter reads; many kernels leave direct access off by default and expose the counters through a profiling interface instead. Standard events include cycles, instructions retired, branch mispredicts, cache references, cache misses, etc.

Linux's perf tool, ARM Streamline, and other profilers use the PMU directly.

08. System Register Access

Most system state is exposed through system registers accessed via:

Assembly
mrs x0, <register>   // move from system reg to general reg
msr <register>, x0   // move from general reg to system reg

System registers have names like SCTLR_EL1, TTBR0_EL1, MAIR_EL1, VBAR_EL1, TCR_EL1, etc. Hundreds of them, organized by EL and function.

Examples of EL1-accessible registers:

  • SCTLR_EL1: system control (MMU enable, cache enables, alignment checking, WXN write-implies-execute-never enforcement, etc.).
  • TCR_EL1: translation control (granule size, the size of each virtual address range via T0SZ/T1SZ, ASID size, etc.).
  • TTBR0_EL1, TTBR1_EL1: translation table base addresses.
  • MAIR_EL1: memory attribute indirection.
  • VBAR_EL1: vector base address.
  • CPACR_EL1: architectural feature access control (mostly FP/SIMD and SVE access); the "coprocessor" in the name is a holdover from AArch32.
  • CONTEXTIDR_EL1: context ID for debug/trace.
  • TPIDR_EL1: thread pointer (kernel-managed per-CPU pointer).
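As an example of how such a register is laid out, the sketch below decodes a handful of SCTLR_EL1 control bits. The bit positions (M = 0, A = 1, C = 2, I = 12, WXN = 19) are the architectural ones for these fields; the register has many more fields than shown, and the helper names are made up here.

```python
SCTLR_BITS = {
    "M": 0,     # stage-1 MMU enable
    "A": 1,     # alignment check enable
    "C": 2,     # data cache enable
    "I": 12,    # instruction cache enable
    "WXN": 19,  # writable implies execute-never
}


def decode_sctlr(value):
    """Return the state of a few well-known SCTLR_EL1 control bits."""
    return {name: bool((value >> bit) & 1) for name, bit in SCTLR_BITS.items()}


def with_bits(value, *names):
    """Return value with the named control bits set."""
    for name in names:
        value |= 1 << SCTLR_BITS[name]
    return value


enabled = with_bits(0, "M", "C", "I")   # MMU and both caches on
state = decode_sctlr(enabled)
```

On real hardware the write would be a read-modify-write with mrs and msr followed by an ISB, since changing the MMU enable bit alters how subsequent instructions are fetched.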

Some registers are EL0-accessible (in addition to or instead of EL1 access):

  • TPIDR_EL0: user thread pointer (TLS base).
  • TPIDRRO_EL0: read-only thread pointer.
  • CNTVCT_EL0, CNTFRQ_EL0: timer.
  • PMCCNTR_EL0 and other PMU regs (if user access enabled).

The structured naming and access mechanism makes the system register space navigable in a way x86's MSR space (with its undocumented holes and vendor-specific extensions) is not.

09. CPU Identification

AArch64 cores expose ID registers describing what they support:

  • MIDR_EL1: main ID register (manufacturer, part number, revision).
  • REVIDR_EL1: revision details.
  • ID_AA64ISAR0_EL1, ID_AA64ISAR1_EL1, etc.: instruction-set feature support, such as atomic operations, crypto, CRC32, and dot product.
  • ID_AA64MMFR0_EL1, ID_AA64MMFR1_EL1, ID_AA64MMFR2_EL1: memory-management features (granule sizes supported, virt addr size, etc.).
  • ID_AA64PFR0_EL1, ID_AA64PFR1_EL1: processor features (FP, SIMD, SVE, virtualization, etc.).
  • ID_AA64DFR0_EL1: debug features.

Each register has multiple 4-bit fields describing the level of support for specific features. The OS (Linux) reads them at boot to determine what's available, and provides this information to user space via /proc/cpuinfo and HWCAP/HWCAP2 bits in the ELF auxiliary vector.

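User-mode programs read HWCAP via getauxval(AT_HWCAP) to dispatch (e.g., choosing a version of a function that uses the AES instructions when HWCAP_AES is set, or a fallback otherwise). The implementer of a core is identified by the IMPLEMENTER field of MIDR_EL1 (for example, 0x41, ASCII 'A', for Arm and 0x61, ASCII 'a', for Apple); architecture licensees may additionally expose custom features through implementation-defined registers.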
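Both kinds of register are straightforward to decode once the raw value is in hand. The sketch below assumes the values were obtained elsewhere (for example, from a kernel log); the MIDR layout and the positions of the AES and Atomic fields in ID_AA64ISAR0_EL1 follow the architecture, while the helper names and the synthetic ISAR0 value are illustrative.

```python
def decode_midr(midr):
    """Split MIDR_EL1 into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,
        "variant": (midr >> 20) & 0xF,
        "architecture": (midr >> 16) & 0xF,
        "partnum": (midr >> 4) & 0xFFF,
        "revision": midr & 0xF,
    }


def id_field(reg, lsb):
    """Extract one 4-bit feature field from an ID register."""
    return (reg >> lsb) & 0xF


ISAR0_AES_LSB = 4       # ID_AA64ISAR0_EL1.AES, bits 7:4
ISAR0_ATOMIC_LSB = 20   # ID_AA64ISAR0_EL1.Atomic, bits 23:20

midr = decode_midr(0x410FD083)          # a Cortex-A72 r0p3 value
isar0 = (2 << ISAR0_ATOMIC_LSB) | (2 << ISAR0_AES_LSB)  # synthetic example
has_lse_atomics = id_field(isar0, ISAR0_ATOMIC_LSB) >= 2
```

Feature fields are designed so that a larger value generally means a superset of support, which is why the check above is a comparison and not an equality test.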

10. Virtualization (EL2)

AArch64's virtualization is integrated into the architecture: EL2 is a dedicated hypervisor level, with its own translation regime (stage 2), its own system registers (HCR_EL2 controls trapping, VTTBR_EL2 holds stage-2 tables), and its own exception handling.

The HCR_EL2 (Hypervisor Configuration Register) configures which guest operations trap to the hypervisor:

  • TGE bit: trap general exceptions (route guest exceptions to EL2).
  • VM bit: enable stage-2 translation.
  • IMO/FMO bits: route IRQs/FIQs to EL2.
  • Many trap-on-X bits for specific instructions (timer access, performance counters, etc.).

The hypervisor configures HCR_EL2 to control how much it intervenes. KVM on ARM is comparatively thin because the hardware does most of the heavy lifting (stage-2 page tables, vGIC, virtual timer).

A VM exit happens when a configured trap fires: the guest's PC and PSTATE are saved into ELR_EL2 / SPSR_EL2, control transfers to the hypervisor's vector entry. The hypervisor inspects ESR_EL2 to determine the cause, services it, and ERETs back to the guest.

Nested Virtualization

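ARMv8.3 added support for nested virtualization (a host hypervisor at EL2 emulating EL2 for a guest hypervisor), and ARMv8.4 refined it by redirecting many of the guest hypervisor's EL2 register accesses to memory instead of trapping each one. It remains complex, because the host must virtualize essentially all EL2 state, but it is usable for cloud-on-cloud scenarios.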

11. TrustZone (EL3 and Secure World)

TrustZone partitions the system into secure and non-secure worlds:

  • Each world has its own EL0/EL1 (and optionally EL2; Secure EL2 was added in ARMv8.4).
  • EL3 is the secure monitor, switching between worlds.
  • Memory and peripherals are tagged secure or non-secure; secure-only resources are inaccessible from the non-secure world.

Use cases:

  • DRM: secure video paths for protected content (Netflix HD, FairPlay).
  • Payment: mobile wallets on many Android devices use TrustZone for payment-token operations. Apple Pay relies on a separate Secure Element and the Secure Enclave instead.
  • Biometrics: fingerprint and face matching run in TrustZone on many Android phones. Apple's Touch ID and Face ID are handled by the Secure Enclave, which is a separate processor.
  • Key storage: hardware-backed keystores (for example, TrustZone-backed implementations of Android Keystore).
  • Remote attestation: TPM-equivalent functions.

The non-secure world (for example, Android and its applications) cannot read secure-world memory or registers. Communication is through SMC calls to a defined ABI (for example, the Trusted Foundations or OP-TEE interface, layered on ARM's SMC Calling Convention), which the secure-world OS services.

A typical SoC has:

  • An EL3 monitor (often ARM Trusted Firmware, ATF).
  • A secure-world EL1 OS (OP-TEE, Trusty, QSEE depending on vendor).
  • Various trusted applications running at S-EL0.

This is heavily used: on many Android phones, every fingerprint touch crosses into the secure world. TrustZone is one of ARM's defining architectural features.

12. Confidential Compute (CCA, RME)

ARMv9 introduced Confidential Compute Architecture (CCA) and the Realm Management Extension (RME) that implements it. The model:

  • Two new states: Realm and Root, in addition to the existing Secure and Non-Secure.
  • Realms are like VMs but isolated from the hypervisor itself.
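  • Realm state runs the RMM (Realm Management Monitor) at its EL2; the RMM manages Realms on behalf of the hypervisor. Root state is where the EL3 monitor firmware runs, switching between worlds and trusted by all of them.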

This is ARM's answer to Intel TDX and AMD SEV-SNP: confidential VMs whose memory the hypervisor cannot read.

CCA is just starting to deploy in mainstream silicon (Cortex-X4 and later, Neoverse V3 and later). Software support is in development.

13. Memory Tagging Extension (MTE)

MTE (ARMv8.5) is a hardware feature for memory-safety bug detection:

  • Each 16-byte region of memory has an associated 4-bit tag.
  • Each pointer also has a 4-bit tag in its top byte (Top-Byte-Ignore feature lets pointers carry tags without breaking VA semantics).
  • Each load/store checks that the pointer's tag matches the memory's tag; mismatch triggers a fault (or sets a status bit, in async mode).

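MTE catches use-after-free, buffer overflows, and various other memory-safety bugs in C/C++ code. Detection is probabilistic: with 4-bit tags, a stale or out-of-bounds pointer has a 1-in-16 chance of carrying a tag that happens to match. The cost is moderate (5-15% in typical use). Apple uses related but distinct mechanisms (e.g., kalloc_type guarded zones); Android and Linux are deploying MTE for large-scale bug detection and exploit mitigation.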
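The tag check itself is simple enough to model in software. The sketch below is a simulation of the idea, not of any allocator or of the real instructions: TaggedMemory, with_tag, and the retagging-on-free policy are all assumptions made for illustration, and untagged memory is treated as carrying tag 0.

```python
GRANULE = 16             # bytes covered by one memory tag
TAG_SHIFT = 56           # the pointer's tag lives in bits 59:56
ADDR_MASK = (1 << 56) - 1


def with_tag(addr, tag):
    """Return addr with a 4-bit tag placed in the top byte."""
    return (addr & ADDR_MASK) | ((tag & 0xF) << TAG_SHIFT)


class TaggedMemory:
    def __init__(self):
        self.tags = {}   # granule index -> 4-bit tag (default 0)

    def set_tag(self, addr, size, tag):
        first = addr // GRANULE
        end = (addr + size + GRANULE - 1) // GRANULE
        for granule in range(first, end):
            self.tags[granule] = tag & 0xF

    def check(self, ptr, size=1):
        """True if every granule touched by the access matches the pointer tag."""
        tag = (ptr >> TAG_SHIFT) & 0xF
        addr = ptr & ADDR_MASK
        first, last = addr // GRANULE, (addr + size - 1) // GRANULE
        return all(self.tags.get(g, 0) == tag for g in range(first, last + 1))


mem = TaggedMemory()
mem.set_tag(0x1000, 32, 0x3)        # "allocate" 32 bytes with tag 3
p = with_tag(0x1000, 0x3)
in_bounds = mem.check(p)            # tags match
overflow = mem.check(p + 32)        # next granule carries a different tag
mem.set_tag(0x1000, 32, 0x7)        # "free": retag the region
use_after_free = mem.check(p)       # stale pointer still carries tag 3
```

The overflow and the use-after-free are both caught for the same reason: the pointer's tag no longer matches the tag of the memory it reaches.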

14. Pointer Authentication (PAC)

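PAC (ARMv8.3) cryptographically signs pointers with a short MAC stored in their otherwise unused high bits. The MAC's width depends on the configured virtual-address size and on whether top-byte-ignore is enabled (up to 15 bits with 48-bit addresses):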

  • PACIA/PACIB/PACDA/PACDB instructions sign pointers with one of four keys (instruction A, instruction B, data A, data B).
  • AUTIA/AUTIB/AUTDA/AUTDB verify and strip the signature.
  • Mismatch causes the pointer to become invalid (fault on use) — defeats ROP/JOP attacks.

Apple uses PAC pervasively in iOS and macOS: return addresses are signed with the B instruction key before being stored and verified on return. ROP attacks become much harder, since an attacker must forge or reuse a valid signature and can no longer simply overwrite a return address. Linux and Android have begun adopting PAC.
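The sign-then-authenticate flow can be imitated in software to show the data flow. This is only an analogy: real hardware uses a dedicated block cipher (QARMA on many cores) with keys held in system registers, whereas the sketch substitutes truncated HMAC-SHA256 from the Python standard library, assumes 48-bit addresses with top-byte-ignore off, and returns None on failure where hardware would produce a poisoned pointer. All function names are invented.

```python
import hashlib
import hmac

MASK64 = (1 << 64) - 1
VA_BITS = 48
# PAC occupies the unused upper bits, except bit 55 (which selects the range).
PAC_MASK = MASK64 & ~((1 << VA_BITS) - 1) & ~(1 << 55)
ADDR_MASK = MASK64 & ~PAC_MASK


def _mac(key, addr, modifier):
    # Stand-in for the hardware MAC: HMAC over address and modifier.
    msg = addr.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def sign(key, ptr, modifier):
    """Analogue of PACIx: place a truncated MAC in the pointer's high bits."""
    addr = ptr & ADDR_MASK
    return addr | (_mac(key, addr, modifier) & PAC_MASK)


def authenticate(key, signed, modifier):
    """Analogue of AUTIx: return the stripped pointer, or None on mismatch."""
    addr = signed & ADDR_MASK
    if sign(key, addr, modifier) != signed:
        return None
    return addr
```

The modifier plays the role of the context value (for return addresses, typically the stack pointer), so a signed pointer copied to a different context fails authentication except when the truncated MACs happen to collide.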

15. Branch Target Identification (BTI)

BTI (ARMv8.5) marks valid indirect-branch targets with BTI instructions. In memory pages marked as guarded, an indirect branch (BR, BLR) that lands on anything other than a compatible BTI instruction faults. This hampers jump-oriented programming by restricting usable gadgets to marked landing pads such as function entries. Combined with PAC, it makes ROP/JOP attacks much harder.

16. Boot Process

The AArch64 boot sequence varies by platform but follows a general pattern.

Hardware Reset

On reset, the core starts at EL3 (or whichever is the highest implemented level), executing from a reset vector defined by RVBAR_ELx. Caches and MMU are off.

Boot ROM

The first code is in Boot ROM, on-die mask ROM that cannot be modified after manufacture. Boot ROM:

  1. Performs minimal initialization.
  2. Reads the next-stage boot image from the configured boot device (eMMC, SPI flash, USB, etc.).
  3. Optionally verifies the next-stage image's signature against a public key fused into the chip.
  4. Jumps to the next stage.

This secure boot root of trust is what makes consumer devices like iPhones difficult to jailbreak: the public key for the boot image is fixed in silicon, and only Apple-signed images run.

Secondary Boot Loader (SBL)

The next stage initializes more hardware: DRAM, primary peripherals, the storage controller. On servers, the SBL might be a UEFI implementation; on mobile, a vendor-specific bootloader (Android's aboot, Apple's iBoot).

ARM Trusted Firmware (BL31)

On servers and many embedded systems, ARM Trusted Firmware (ATF) provides the EL3 monitor (BL31) and various boot stages. ATF:

  1. Sets up EL3 state.
  2. Initializes the secure world (TrustZone OS, BL32).
  3. Loads the non-secure payload (BL33), typically UEFI on servers or U-Boot on embedded systems, which goes on to load the EL2 hypervisor or EL1 kernel.
  4. Drops to BL33 to start the OS boot.

ATF stays resident in EL3, servicing SMC calls from the OS for power management (ARM's PSCI — Power State Coordination Interface), secure-world communication, and similar.

OS Kernel

The kernel boots at EL1 (or EL2 if the kernel itself is a hypervisor). The Linux kernel on ARM:

  1. Sets up MMU with kernel mappings.
  2. Sets up exception vectors (VBAR_EL1).
  3. Initializes the GIC and timers.
  4. Starts secondary cores via PSCI calls (or "spin tables" on older systems).
  5. Mounts the root filesystem and starts user space.

The PSCI interface (defined by ARM, implemented by ATF) abstracts core power on/off, system shutdown/reboot, and similar operations across vendor-specific firmware.

Multi-Core Bring-Up

In contrast to x86's INIT/STARTUP IPIs, AArch64 uses PSCI (Power State Coordination Interface):

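  • The OS invokes the PSCI CPU_ON call with the target core's MPIDR-based ID, a physical entry-point address, and a context value, using SMC (or HVC when running under a hypervisor).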
  • ATF (in EL3) configures the target core's power and reset state.
  • The target core comes up at the specified entry point.

This is much cleaner than x86's trampoline page mechanics. The OS just specifies a physical entry-point address, and the target core starts executing there, with the MMU off and the context value in x0, once it is alive.

17. Power Management: PSCI, WFI/WFE, and Idle States

ARM systems make power management an explicit, architected concern, in contrast to x86's largely-vendor-specific MSR-based interfaces. Three layers cooperate.

At the core level, two unprivileged instructions let software signal that a core has nothing to do. WFI (Wait For Interrupt) stops the core until an interrupt arrives; the core may enter a low-power idle state while waiting. WFE (Wait For Event) is similar but wakes on an event, signalled by SEV (Send Event) from another core or by certain memory-system events; spinlock implementations use WFE/SEV to put waiters into a low-power state and wake them when the lock is released, rather than hot-looping. Both are nearly free to execute and let the hardware enter shallow idle states without OS involvement.

At the cluster and system level, software requests deeper idle and power-off states through PSCI (Power State Coordination Interface, mentioned above for boot). PSCI defines a set of SMC-invoked services with stable function IDs:

  • CPU_SUSPEND — enter a specified idle state with optional state-loss; firmware handles the transition.
  • CPU_OFF / CPU_ON — take a core out of service or bring it back.
  • SYSTEM_OFF / SYSTEM_RESET / SYSTEM_RESET2 — platform-wide shutdown or reboot.
  • SYSTEM_SUSPEND — whole-system suspend-to-RAM.
  • MIGRATE and friends — used in some Trusted-OS configurations.

PSCI states are described in the firmware's device tree or ACPI tables as a hierarchy: per-core retention, per-core power-off, cluster retention, cluster power-off, system suspend. The OS idle governor picks the deepest state whose entry/exit latency fits the predicted idle duration. Because the actual hardware sequencing lives in EL3 firmware, the OS does not need to know the platform-specific incantation to gate clocks, drop voltage, or save state; it just calls PSCI with the desired state ID.
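The "stable function IDs" mentioned above follow ARM's SMC Calling Convention, which packs the call type, register width, owning service, and function number into one 32-bit value. The sketch below reconstructs a few PSCI IDs from those parts. The function numbers are the ones commonly published for PSCI 0.2 and later (Linux's PSCI headers use the same values), but treat the table as illustrative and consult the specification for the authoritative list; the helper names are invented.

```python
STANDARD_SECURE_SERVICE = 4   # SMCCC owner number used by PSCI


def smccc_function_id(fast, smc64, owner, number):
    """Compose a 32-bit SMCCC function identifier."""
    return ((int(fast) << 31) | (int(smc64) << 30)
            | ((owner & 0x3F) << 24) | (number & 0xFFFF))


# name -> (function number, uses the 64-bit calling convention)
PSCI_CALLS = {
    "PSCI_VERSION": (0, False),
    "CPU_SUSPEND": (1, True),
    "CPU_OFF": (2, False),
    "CPU_ON": (3, True),
    "SYSTEM_OFF": (8, False),
    "SYSTEM_RESET": (9, False),
}


def psci_id(name):
    number, smc64 = PSCI_CALLS[name]
    return smccc_function_id(True, smc64, STANDARD_SECURE_SERVICE, number)


cpu_on = psci_id("CPU_ON")   # value placed in w0 before the SMC
```

The OS loads the ID into x0, the arguments into x1 to x3, and executes SMC (or HVC under a hypervisor); the firmware's return code comes back in x0.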

Frequency and voltage scaling (DVFS) on AArch64 is handled through a separate interface, traditionally vendor-specific MMIO registers but increasingly through ARM's SCMI (System Control and Management Interface), a message-based protocol implemented by an SCP (System Control Processor) on the SoC. The OS sends performance requests; the SCP, with its own firmware, chooses the operating point. This separation — OS expresses intent, SCP implements policy — mirrors the way x86 has moved to hardware P-state management (Intel HWP) and reflects the same insight: power management is fast enough and detailed enough that letting the OS drive every transition wastes both energy and time.

For reading more deeply, the ARM Power State Coordination Interface specification and the ARM System Control and Management Interface specification are the canonical references; the Linux kernel's drivers/firmware/psci/ and drivers/firmware/arm_scmi/ directories show the OS side. Power, thermal, and physical-design topics are treated more broadly in Chapter 52.

18. Summary

AArch64's system architecture is built around four exception levels (EL0-3), each with distinct purpose: applications, kernel, hypervisor, secure monitor. The MMU supports configurable granule sizes, large block mappings, ASIDs for cheap context switching, and two-stage translation for virtualization. Exceptions are vectored through per-EL vector tables; the GIC handles interrupts with software, peripheral, and virtualization-aware delivery. System registers expose the architectural state; CPU ID registers expose feature support to the OS and user space.

TrustZone partitions the system into secure and non-secure worlds, used for security-sensitive operations across Android and embedded systems. ARMv9 added Confidential Compute Architecture for hypervisor-isolated VMs. Memory Tagging, Pointer Authentication, and Branch Target Identification provide hardware-supported memory-safety and exploit mitigation.

The boot process flows from a hardware-fixed root of trust through secure-boot stages to ATF (EL3) and finally the OS kernel, with PSCI as the standard interface for power and lifecycle operations. Compared with x86-64's accreted history of modes, segmentation, and APIC variants, AArch64's system architecture is more uniform and explicit.

The next chapter looks at AArch64 SIMD and vector capabilities: NEON, SVE/SVE2, and SME — the throughput backbone of modern AArch64 chips.
