Part V: ISA Case Studies

ARM Overview

May 16, 2026·20 min read·advanced


ARM is the most ubiquitous instruction-set architecture in computing. Tens of billions of ARM-based chips ship every year, embedded in smartphones, tablets, microcontrollers, smartwatches, network equipment, automotive systems, and increasingly servers and personal computers. By unit volume, ARM dwarfs x86-64 by an order of magnitude. By compute footprint in the world today, the two ISAs are roughly comparable, with ARM dominating mobile and embedded, x86-64 dominating PCs and servers, and the two competing in laptops and increasingly in cloud servers.

ARM's history is also distinctive. Unlike x86, which was born from a single company designing a single processor and accreting features over decades, ARM was conceived as a clean RISC architecture in the mid-1980s and has been licensed widely. Hundreds of companies design ARM-based chips, each tailoring the implementation to their use case. Apple, Qualcomm, Samsung, MediaTek, NVIDIA, Amazon, Ampere, Microsoft, Google, NXP, Texas Instruments, and many others all ship ARM cores; some design their own implementations of the ARM ISA, others license ARM's reference designs.

This chapter introduces ARM. We trace the history from the Acorn RISC Machine to today's AArch64. We survey the major architectural variants (A-profile, R-profile, M-profile), the relationship between ARMv7 (32-bit) and ARMv8/9 (64-bit), and the design philosophy that distinguishes ARM from x86. The next four chapters develop AArch64 in depth: programming model (38), system architecture (39), SIMD and vector (40), and micro-architecture (41).

01. A Short History

The story begins in Cambridge, England, in 1983. Acorn Computers, the maker of the BBC Micro, decided to develop its own processor to compete with the Motorola 68000 and Intel 8086 lines. The team — led by Sophie Wilson and Steve Furber — designed a 32-bit processor based on RISC principles, with a small, regular instruction set, a load-store architecture, and 16 general-purpose registers. The first chip, the ARM1 (Acorn RISC Machine, 1985), was a working prototype. The ARM2 (1987) shipped commercially in the Acorn Archimedes computer.

The ARM2 had no cache, no MMU, no pipelining beyond fetch-execute overlap, and ran at 8 MHz. Yet it outperformed contemporary 16-bit x86 processors on integer benchmarks. The architecture was tight: 32 instructions, mostly conditional, fitting in a then-tiny design of roughly 30,000 transistors (the ARM1 prototype used about 25,000; compare ~275,000 for an Intel 80386).

In 1990, Acorn spun off the chip team as Advanced RISC Machines (later renamed simply ARM Ltd.) in a joint venture with Apple, who needed a low-power CPU for the Newton handheld. ARM's business model from then on was licensing: ARM Ltd. designed the ISA and reference cores, and licensees built and sold chips. ARM Ltd. did not (and still does not) manufacture chips itself.

Major architectural milestones:

  • ARMv1 (1985, ARM1): the original; 26-bit addressing.
  • ARMv2 (ARM2, ARM3): added multiply and atomic swap; ARM3 added cache.
  • ARMv3 (ARM6, ARM7): 32-bit addressing; entered widespread use in embedded systems.
  • ARMv4T (1994, ARM7TDMI): introduced Thumb, a 16-bit alternate instruction set for code density. ARM7TDMI became one of the most-shipped 32-bit CPUs in history; it powered the Nokia phones of the 1990s, the Game Boy Advance, and countless embedded devices.
  • ARMv5 (1999, ARM9, ARM10): improved DSP, Java acceleration (Jazelle).
  • ARMv6 (2002, ARM11): SIMD instructions for media; introduced Thumb-2 (16-bit and 32-bit instructions intermixed).
  • ARMv7 (2005, Cortex-A8, Cortex-A9, etc.): big architectural revision. Three profiles introduced: A (Application), R (Real-time), M (Microcontroller). NEON SIMD. Used in essentially every smartphone of the early 2010s.
  • ARMv8 (2011, announced; 2013 first chips): introduced AArch64, a fundamentally new 64-bit ISA, alongside the legacy 32-bit AArch32 mode. The transition from ARMv7 to ARMv8/AArch64 is roughly comparable in scope to the transition from x86 to x86-64.
  • ARMv9 (2021): extension of ARMv8 with Scalable Vector Extension v2 (SVE2), Confidential Compute Architecture, and various other features. Backward-compatible with ARMv8.

The 64-bit transition started in 2013 with the Apple A7 in the iPhone 5s — the first ARM-based 64-bit processor in a consumer device. By 2017, virtually all ARM-based smartphones were 64-bit. Apple's M1 (2020) brought AArch64 to laptops; AWS Graviton (2018), Ampere Altra (2020), and others brought it to cloud servers. The M-profile microcontroller world has stayed mostly 32-bit, since 64-bit addressing is rarely needed there.

02. ARM as a Licensable Architecture

ARM Ltd.'s business model is unique among major ISA owners. ARM does not manufacture chips. It designs the architecture (the ISA specification) and a portfolio of reference cores, and it licenses these to other companies. The licensees fall into several categories:

Architecture licensees. Pay for the right to design their own implementations of the ARM ISA, from scratch. They get the ISA specification but not ARM's reference cores. Examples: Apple, Qualcomm (Oryon), Samsung (Mongoose; discontinued), Marvell, NVIDIA (Project Denver), Cavium (now Marvell), Ampere (Siryn).

Implementation licensees. Pay for the right to use ARM's reference core designs (Cortex-A, Cortex-R, Cortex-M) directly. They typically integrate the cores into a system-on-chip with their own peripherals, GPUs, modems, etc. Examples: most Qualcomm Snapdragon chips have used ARM Cortex cores (with some custom variants); MediaTek, Samsung Exynos, NVIDIA Tegra, almost all microcontroller vendors.

Subset licensees. Companies who pay for limited rights to a specific core or feature set. Common in embedded.

This model means there is no single "ARM processor". Every ARM-based chip is its own design, even if it uses a stock ARM core. Performance, power, area, and feature mix vary enormously across implementations — far more than the variation between Intel and AMD x86 chips.

ARM Ltd. itself has changed hands. It was acquired by SoftBank in 2016, NVIDIA tried unsuccessfully to acquire it in 2020-2022, and it IPO'd on the NASDAQ in 2023.

03. Three Profiles

ARMv7 introduced and ARMv8 maintained three architecture profiles, each targeted at a different market:

A-profile (Application). Cores designed for application-class workloads: smartphones, tablets, laptops, servers. Has a full MMU, multiple privilege levels, virtualization extensions, generic interrupt controller, advanced caches. Examples: Cortex-A53, A72, A78, X1, X2, X3, X4; Apple Firestorm/Avalanche/Everest; Neoverse N1/N2/V1/V2.

R-profile (Real-time). Cores designed for real-time embedded systems: automotive, industrial control, modem basebands. Has a memory protection unit (MPU) instead of full MMU (deterministic latency), simpler privilege model, predictable execution timing. Examples: Cortex-R4, R5, R7, R52.

M-profile (Microcontroller). Tiny, low-power, low-cost cores for microcontrollers and IoT. No MMU, simplified privilege model, optimized for interrupt response and code density. Always 32-bit (no 64-bit M-profile yet, though M85 and others have grown more capable). Examples: Cortex-M0, M0+, M3, M4, M7, M23, M33, M55, M85.

Cortex-M3 alone has shipped tens of billions of units. Cortex-M is the dominant microcontroller architecture worldwide.

The three profiles share the basic ARM concept (RISC, load-store, conditional execution, etc.) but differ substantially in details. A-profile and M-profile programs are generally not source- or binary-compatible; the system-level features and the privileged model differ enough that they are practically separate ISAs.

This book focuses on A-profile AArch64 — the modern 64-bit ARM as used in smartphones, laptops, and servers. R-profile and M-profile are mentioned where relevant but not developed in detail.

04. RISC Design Philosophy

ARM is described as a RISC architecture, with the standard RISC properties: fixed-length instructions, load-store architecture (only loads and stores access memory), large register file, simple addressing modes, simple instruction encoding. The contrast with x86's variable-length, memory-operand-rich, prefix-laden encoding is sharp.

A few specific design choices distinguish ARM:

Fixed instruction width. AArch64 instructions are all exactly 32 bits. AArch32 (the 32-bit ARM ISA) had two encodings: ARM (32-bit instructions) and Thumb (16-bit instructions, with Thumb-2 mixing 16- and 32-bit). AArch64 abandoned the Thumb mode entirely.
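
To make the consequence of a fixed width concrete, here is a minimal sketch in Python. It assumes a little-endian code buffer, and the helper names (split_instructions, decode_add_sub_imm) are illustrative, not part of any real tool. The point is that instruction boundaries are known without decoding anything, and that fields sit at fixed bit positions; only the ADD/SUB (immediate) encoding class is handled.

```python
import struct

def split_instructions(code):
    """Split a little-endian AArch64 code buffer into 32-bit instruction words.

    No decoding is needed to find boundaries: instruction i always starts
    at byte offset 4 * i.
    """
    if len(code) % 4 != 0:
        raise ValueError("AArch64 code length must be a multiple of 4 bytes")
    return [word for (word,) in struct.iter_unpack("<I", code)]

def decode_add_sub_imm(word):
    """Decode an ADD/SUB (immediate) word into its fields, else return None."""
    if (word >> 23) & 0x3F != 0b100010:   # bits 28:23 select this encoding class
        return None
    imm12 = (word >> 10) & 0xFFF          # bits 21:10
    shifted = (word >> 22) & 1            # bit 22: immediate is shifted left by 12
    return {
        "is_64bit": bool((word >> 31) & 1),          # bit 31 (sf)
        "op": "sub" if (word >> 30) & 1 else "add",  # bit 30
        "sets_flags": bool((word >> 29) & 1),        # bit 29 (S)
        "imm": imm12 << 12 if shifted else imm12,
        "rn": (word >> 5) & 0x1F,                    # bits 9:5
        "rd": word & 0x1F,                           # bits 4:0
    }

# Two instructions: add x0, x1, #4 followed by nop.
CODE = bytes.fromhex("20100091" "1f2003d5")
WORDS = split_instructions(CODE)
```

Here WORDS[0] is 0x91001020, which decodes as a 64-bit add of immediate 4 with rn = 1 and rd = 0, while the nop falls outside this encoding class and yields None. An x86 decoder cannot split a buffer this way, because each instruction's length depends on its own prefixes and operands.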

Large register file. AArch64 has 31 general-purpose 64-bit registers (x0-x30, with x30 being the link register; the stack pointer is a separate register, sp). This is roughly twice as many as x86-64's 16. The larger file reduces register pressure and reduces the need for spills.

No condition flags by default. Most AArch64 arithmetic and logical instructions do not set flags. Flag-setting variants exist (with an S suffix in AArch32, distinct mnemonics in AArch64 like ADDS/SUBS). The contrast with x86 (where almost every arithmetic instruction sets flags) reduces implicit dependencies between adjacent instructions and simplifies OoO renaming of flags.

Conditional execution (limited). AArch32 had conditional execution on every instruction (each ARM instruction had a 4-bit predicate field). AArch64 dropped this: only a handful of instructions are conditional (CSEL, CSINC, CCMP, conditional branches). The ARM designers found that pervasive predication complicated OoO execution and the benefits did not justify the cost.
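
The semantics of the remaining conditional instructions can be sketched with a small Python emulation. This is an illustrative model, not an emulator: the function names are hypothetical, registers are modelled as unsigned 64-bit integers, and only the GT condition is implemented. It shows how a compare followed by CSEL computes a signed maximum without a branch.

```python
MASK64 = (1 << 64) - 1

def sign(value):
    """Return bit 63 of a 64-bit value."""
    return (value >> 63) & 1

def cmp_flags(a, b):
    """Compute the NZCV flags as 'cmp a, b' (a flag-setting subtract) would."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    return {
        "N": sign(result),                       # result is negative
        "Z": int(result == 0),                   # result is zero
        "C": int(a >= b),                        # no unsigned borrow
        "V": int(sign(a) != sign(b) and sign(result) != sign(a)),  # signed overflow
    }

def cond_gt(flags):
    """GT (signed greater than): Z clear and N equal to V."""
    return flags["Z"] == 0 and flags["N"] == flags["V"]

def csel(cond, xn, xm):
    """csel xd, xn, xm, cond: xd = xn if cond holds, else xm."""
    return xn if cond else xm

def csinc(cond, xn, xm):
    """csinc xd, xn, xm, cond: xd = xn if cond holds, else xm + 1."""
    return xn if cond else (xm + 1) & MASK64

def signed_max(a, b):
    flags = cmp_flags(a, b)               # cmp  x0, x1
    return csel(cond_gt(flags), a, b)     # csel x0, x0, x1, gt
```

For example, signed_max(5, 3) returns 5, and signed_max(MASK64, 1) returns 1 because the all-ones pattern is -1 when read as a signed value. Only the select depends on the flags, which is why this form is cheaper for an out-of-order core than predicating every instruction.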

Simple addressing. AArch64 supports several addressing modes (register, register + immediate, register + register, register + scaled register, pre/post-increment), but each is straightforward. No complex multi-component addressing like x86's [base + index*scale + disp].
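
As a rough illustration of how little work each mode involves, the following Python sketch computes the effective address and the base register's value after the access for the modes listed above. The function name and the mode strings are hypothetical labels chosen for this example; the assembly in the comments shows the corresponding AArch64 syntax.

```python
MASK64 = (1 << 64) - 1

def effective_address(mode, base, offset=0, index=0, shift=0):
    """Return (address, base_after) for one load/store addressing mode."""
    if mode == "base":                 # ldr x0, [x1]
        return base, base
    if mode == "base+imm":             # ldr x0, [x1, #16]
        return (base + offset) & MASK64, base
    if mode == "base+reg":             # ldr x0, [x1, x2]
        return (base + index) & MASK64, base
    if mode == "base+scaled-reg":      # ldr x0, [x1, x2, lsl #3]
        return (base + (index << shift)) & MASK64, base
    if mode == "pre-index":            # ldr x0, [x1, #16]!
        address = (base + offset) & MASK64
        return address, address        # base is updated before the access
    if mode == "post-index":           # ldr x0, [x1], #16
        return base, (base + offset) & MASK64   # base is updated after the access
    raise ValueError("unknown addressing mode: " + mode)
```

Every mode is at most one addition and one shift. Indexing element 5 of an array of 8-byte values at 0x1000, for instance, is effective_address("base+scaled-reg", 0x1000, index=5, shift=3), which gives address 0x1028 and leaves the base unchanged.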

Weak memory ordering. AArch64 uses a weak memory model. Loads and stores can be reordered freely subject to data dependencies. Explicit barriers and acquire/release semantics provide synchronization. This is one of the largest differences from x86's TSO (Chapter 31).
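
The practical effect can be seen in the classic message-passing litmus test. The Python sketch below is a deliberately simplified toy model, not the actual Arm memory model: it treats "reordering" as permuting each thread's independent accesses and then enumerates every interleaving. All names are illustrative. Thread 0 writes data and then sets a flag; thread 1 reads the flag and then the data.

```python
from itertools import permutations

# Each operation is (kind, location, value-or-destination-register).
T0 = [("store", "data", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r1"), ("load", "data", "r2")]

def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one global schedule against a shared memory; return (r1, r2)."""
    memory = {"data": 0, "flag": 0}
    registers = {}
    for kind, location, arg in schedule:
        if kind == "store":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["r1"], registers["r2"]

def outcomes(allow_reordering):
    """Collect every (r1, r2) result reachable under the toy model."""
    def orders(thread):
        if allow_reordering:
            return [list(p) for p in permutations(thread)]
        return [list(thread)]
    results = set()
    for order0 in orders(T0):
        for order1 in orders(T1):
            for schedule in interleavings(order0, order1):
                results.add(run(schedule))
    return results
```

With program order preserved, outcomes(False) never contains (1, 0): seeing the flag implies seeing the data. Once the independent accesses may be reordered, outcomes(True) does contain (1, 0), the stale-data result that AArch64 code must rule out with a store-release and load-acquire pair (STLR and LDAR) or with DMB barriers.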

These choices make ARM cores generally simpler and more area-efficient than x86 cores at similar performance levels. The simpler decoder is one of the largest power and area savings: because every instruction boundary is known in advance, AArch64 instructions can be decoded in parallel, without the serial dependency that x86's variable-length encoding imposes, where the start of one instruction is known only once the length of the previous one has been determined.

05. Architecture vs. Implementation

A persistent confusion in ARM discussions: the architecture vs. the implementation. The architecture is the ISA — the set of instructions, registers, memory model, and other features that programs see. The implementation is the actual silicon — the pipelines, caches, predictors, and so on. Multiple very different implementations can implement the same architecture.

A few examples of how the same ARM architecture is implemented differently:

  • Cortex-A55 (ARM Ltd.): small, in-order, dual-issue. Used as the little core in big.LITTLE designs, where its small area and low power matter more than peak performance.
  • Cortex-A78 (ARM Ltd.) — wide out-of-order, 6-wide decode, 160-entry ROB. The big in big.LITTLE.
  • Cortex-X4 (ARM Ltd.) — even wider, 10-wide decode, 384-entry ROB. The "ultimate performance" core.
  • Apple Firestorm / Avalanche / Everest — Apple's custom AArch64 cores. ~8-wide decode, ~600-entry ROB. Famously the highest-IPC cores in the industry.
  • Neoverse N1/N2/V1/V2 (ARM Ltd.) — server-targeted Cortex-A variants with high core counts and server-relevant features.
  • AWS Graviton 3 (built on Neoverse V1) — 64-core, 256-bit SVE.
  • Ampere AmpereOne — Ampere's custom core, 192 cores per chip, single-thread focused.
  • Qualcomm Oryon — Qualcomm's custom AArch64 core (acquired via Nuvia), now in Snapdragon X Elite for Windows on ARM laptops.

All of these implement AArch64 (specifically ARMv8.x or ARMv9.x). Software compiled for AArch64 runs on any of them, modulo specific optional extensions. The performance and power characteristics, though, vary wildly.

The point: when discussing "ARM performance" or "ARM cores", it matters greatly which specific implementation is meant. Apple's M-series Firestorm cores are not representative of all ARM cores any more than Intel's Lion Cove is representative of all x86-64 cores.

06. AArch64 vs. AArch32

ARMv8 introduced two execution states:

AArch64. The new 64-bit state. 31 general-purpose 64-bit registers, fresh instruction encoding, no Thumb mode, weak memory model, new exception model, new MMU page table format. All applications and operating systems on modern ARM platforms (smartphones, laptops, servers) run in AArch64.

AArch32. The legacy 32-bit state. Compatible with ARMv7. Includes both ARM (32-bit fixed) and Thumb (16/32-bit mixed) instruction sets. Used to run legacy 32-bit applications and operating systems.

A processor can switch between AArch64 and AArch32 only at exception boundaries, and only if both states are supported. Many recent ARM cores have dropped AArch32 support entirely:

  • Apple A11 (2017) and later: AArch64 only (in user mode); A12 dropped AArch32 entirely.
  • Cortex-A510, A715, A720, X4 and later: AArch64 only.
  • Neoverse N2, V2 and later: AArch64 only.

The transition mirrors x86-64's handling of legacy modes: 32-bit support was crucial during the transition, but as the 64-bit ecosystem matured, the silicon cost of supporting old modes became hard to justify.

This book treats AArch64 as the ARM ISA. AArch32 is mentioned only where relevant to legacy compatibility.

07. Modes and Privilege Levels

AArch64 has four Exception Levels (EL):

  • EL0 — user mode. Applications.
  • EL1 — kernel/OS. Linux kernel, iOS XNU kernel, macOS XNU kernel, Windows on ARM kernel.
  • EL2 — hypervisor.
  • EL3 — secure monitor / firmware.

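The four levels form a strict privilege hierarchy. Software moves to a more privileged level only by taking an exception (for example an interrupt, a fault, or an explicit SVC, HVC, or SMC instruction) and returns to a less privileged level with ERET. How exceptions are routed is configured by the more privileged levels, not by the code that triggers them.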

EL3 is reserved for firmware, particularly the secure monitor that switches between secure world and non-secure world (TrustZone). EL2 is the hypervisor level, used by KVM, Hyper-V, Apple's Hypervisor Framework, etc. EL1 is the OS kernel. EL0 is the application.

This four-level model is more elegant than x86's ring system: each level has clear purpose, and the transitions are well-defined. We will see the details in Chapter 39.

08. TrustZone and Confidential Computing

Two ARM-specific security features deserve mention.

TrustZone (since ARMv6). Splits the system into a secure world and a normal world. Each world has its own EL0/EL1 (and optionally EL2). EL3 is the secure monitor that mediates between them. Used for digital rights management, payment processing, biometric authentication, and other security-sensitive operations. Android's TEE (Trusted Execution Environment) implementations typically rely on TrustZone. Apple's Secure Enclave pursues similar goals but is a separate dedicated coprocessor rather than a TrustZone world.

Confidential Compute Architecture (CCA, ARMv9-A). A more recent extension targeting the cloud-VM use case: each VM can be isolated from the hypervisor, so that even a compromised hypervisor cannot read VM memory or registers. Analogous to Intel TDX and AMD SEV-SNP.

These features sit at the intersection of architecture and cryptography. They are deeply embedded in modern mobile-device security (on many Android phones, fingerprint matching and hardware-backed key storage run in a TrustZone TEE), but the day-to-day programmer rarely interacts with them.

09. ARMv8 Extensions

ARMv8 has been extended in many sub-versions, each adding new features:

  • ARMv8.0 (2013): the baseline.
  • ARMv8.1: Atomic memory operations (LSE), virtualization improvements (VHE), Privileged Access Never (PAN).
  • ARMv8.2: Persistent memory support, half-precision FP, dot product instructions for ML, RAS extensions.
  • ARMv8.3: Pointer authentication (PAC), nested virtualization improvements, complex-number support.
  • ARMv8.4: Secure EL2, FlagM, TLB range invalidation, additional crypto.
  • ARMv8.5: Branch Target Identification (BTI), MTE proper, RNG.
  • ARMv8.6: BFloat16, matrix multiply (BF16, INT8).
  • ARMv8.7, 8.8, 8.9: smaller refinements.

ARMv9 (2021) is essentially ARMv8.5 + SVE2 + CCA. Notable features:

  • SVE2: Scalable Vector Extension v2, the second generation of ARM's variable-length vector ISA (more in Chapter 40).
  • CCA (Confidential Compute Architecture).
  • Various incremental improvements.

ARMv9.x continues to add: SME (Scalable Matrix Extension) for AI, finer-grained MTE, more crypto, etc.

The pace of feature addition in modern ARM mirrors x86's: many small extensions, each with its own CPUID-equivalent (ID register) feature flag, and operating systems and compilers must dispatch based on what the silicon supports.

Common in current high-end chips:

  • AArch64 mandatory.
  • NEON SIMD mandatory.
  • LSE atomics (ARMv8.1) — universal in modern chips.
  • Crypto extensions (AES, SHA) — common but not always present (export-control reasons in some markets).
  • SVE / SVE2 — variable: present in Neoverse V1/V2, AWS Graviton 3+, Apple has its own variant; absent in many older mobile chips.
  • PAC, BTI, MTE — strong in iOS (Apple uses PAC heavily), growing in Android.

10. big.LITTLE and DynamIQ

ARM pioneered the heterogeneous multi-core idea years before Intel's hybrid topology. big.LITTLE (2011) pairs high-performance "big" cores with low-power "LITTLE" cores in the same chip. Examples: Cortex-A57 + Cortex-A53; Cortex-A78 + Cortex-A55. The OS scheduler migrates threads between cores based on load: heavy workloads on big cores, light workloads on LITTLE cores. Power is saved when the big cores can sleep.

DynamIQ (2017) is the evolution: rather than separate clusters, big and LITTLE cores share a unified L3 cache and interconnect, so cross-cluster migration is fast. A common configuration in recent Snapdragon and MediaTek Dimensity chips has been 1 ultra-large core + 3 large cores + 4 small cores in a DynamIQ cluster, though the exact mix varies by generation.

Apple has used a similar approach since the A10 Fusion (2016): performance cores + efficiency cores. The M-series chips combine P-cores and E-cores in varying ratios; the base M1, for example, has 4 P-cores and 4 E-cores.

The hybrid approach has proven essential for mobile: most of the time the device runs lightly-loaded background tasks that fit on small cores, with the big cores dormant. Only during demanding workloads (gaming, video processing) do the big cores wake up. This is fundamental to how a modern phone gets a full day of battery life despite having processing capabilities approaching desktop levels.

11. ARM Servers

For decades, "ARM in servers" was a perpetual prediction. Several attempts (Calxeda 2012, AMD Opteron A1100 2016, Cavium ThunderX 2014) had mixed results. Server software was deeply x86-tied; server hardware was already commoditized; switching costs were high.

The breakthrough came around 2018-2020:

  • AWS Graviton (2018) — Amazon's custom ARM server chip for its EC2 cloud. Now in its 4th generation (Graviton4, 2024); AWS's most-used CPU.
  • Apple M1 (2020) — first ARM in a mainstream personal computer, demonstrating ARM could match high-end x86 in single-thread performance.
  • Ampere Altra (2020) — first widely-deployed dedicated ARM server processor (80 cores, then 128 cores, then 192 cores).
  • NVIDIA Grace (2023) — 72-core ARM CPU paired with NVIDIA's GPUs for AI and HPC.
  • Microsoft Cobalt (2024) — Microsoft's custom ARM CPU for Azure.
  • Google Axion (2024) — Google's ARM CPU for GCE.

These chips are competitive with x86 servers on price-performance and often better on perf/watt, especially in cloud workloads with many small instances. They have changed the server landscape: as of 2026, ARM is a substantial minority (~10-20%) of cloud server cores and growing.

Software compatibility is the remaining challenge. Most cloud-native software (containerized web services, databases, Kubernetes, etc.) is portable. Less-portable workloads (legacy enterprise software, databases with x86-specific tuning, certain numerical libraries) are slower to migrate.

12. ARM in the PC

ARM's PC presence is more recent and more mixed.

Apple Silicon (M1 2020, M2 2022, M3 2023, M4 2024). Apple's transition of Macs from Intel to ARM was completed in 2023, when the Mac Pro moved to Apple Silicon. The M-series chips combine ARM cores with Apple's GPU, Neural Engine, video encoders, and unified memory. They are widely regarded as among the best PC CPUs in their class for power efficiency and per-thread performance.

Windows on ARM. Microsoft has been pushing Windows on ARM for nearly a decade with limited success. The Surface Pro X (2019) was an early effort. The breakthrough came with Snapdragon X Elite (2024) and X2 (2025) using Qualcomm's Oryon cores; performance now matches mainstream x86 laptops, and Windows 11 has improved x86 emulation. Adoption is growing but x86 remains dominant on Windows PCs.

Linux on ARM. Always supported in distributions, but the desktop usage is small. Most Linux on ARM is in servers (cloud) or embedded systems.

The PC story for ARM continues to evolve. Apple has shown it can be done; Qualcomm and partners are extending it to Windows; AMD and Intel are responding with more competitive x86 chips. The next few years will tell.

13. ARM in Mobile and Embedded

ARM utterly dominates these markets:

  • Smartphones: effectively 100% ARM-based. iPhones use Apple Silicon; Androids use Qualcomm Snapdragon, MediaTek Dimensity, Samsung Exynos, Google Tensor, all ARM.
  • Tablets: nearly 100% ARM.
  • Smartwatches and wearables: nearly 100% ARM.
  • Microcontrollers: ARM Cortex-M dominates the 32-bit MCU market. The 8-bit and 16-bit MCU markets remain partially in older ISAs (PIC, AVR, 8051), but 32-bit is ARM-dominant.
  • Networking equipment: routers, switches, base stations — heavily ARM (and increasingly RISC-V).
  • Automotive infotainment and ADAS: ARM is the major architecture, often Cortex-R for safety-critical and Cortex-A for application-level.
  • IoT and consumer electronics: ARM nearly everywhere.

The diversity here is staggering. Billions of devices, dozens of vendors, thousands of distinct chips, all running ARM. The ecosystem of toolchains, compilers, RTOSes (FreeRTOS, Zephyr, ThreadX), and libraries is correspondingly vast.

14. ARM and RISC-V

The other major RISC ISA that can be widely licensed or freely implemented is RISC-V (Chapters 42-45). RISC-V is an open standard: anyone can implement it without paying royalties. It started in 2010 at UC Berkeley and has gained traction quickly in:

  • Embedded systems where royalty avoidance matters.
  • Research and academic projects.
  • AI accelerators and SoC components as the control-plane CPU.
  • China, where RISC-V offers an architecture not subject to Western export controls.

Whether RISC-V will displace ARM in mainstream applications is debated. ARM has decades of ecosystem, mature high-performance implementations, and a vast software base. RISC-V has open-source momentum, no licensing cost, and is maturing fast. Both are likely to coexist, with ARM stronger in performance applications and RISC-V stronger in cost-sensitive or IP-sensitive ones, for the foreseeable future.

15. Architecture Versions and Feature Levels

ARM publishes its architecture in numbered versions (ARMv7, ARMv8, ARMv9), with successive minor updates that introduce new mandatory and optional features. Because real silicon ships years after the version is announced, and because optional features are picked up unevenly, the version number alone is not enough to know what a particular chip implements.

ARMv8 (announced 2011, first silicon 2013) introduced AArch64 and the four-EL exception model. The minor revisions added many features:

  • ARMv8.1: atomic memory operations (LSE, Large System Extensions: LDADD, CAS, SWP), limited-ordering regions (LORegions), and the Virtualization Host Extensions (VHE), which let a host OS kernel such as Linux with KVM run at EL2 itself, with its EL1 system-register accesses transparently redirected so that the kernel code does not need to change.
  • ARMv8.2: half-precision FP, statistical profiling extension (SPE), persistent memory hints, RAS (Reliability, Availability, Serviceability) extension.
  • ARMv8.3: Pointer Authentication (PAC), nested virtualization, FCMA (complex-number FP), JavaScript-flavoured FP convert.
  • ARMv8.4: secure-EL2, generic timer enhancements, TLB range invalidate.
  • ARMv8.5: Memory Tagging Extension (MTE), Branch Target Identification (BTI), random number generator instructions, speculation barriers.

ARMv9 (announced 2021) bundled SVE2 and Confidential Compute Architecture (CCA, including the Realm Management Extension RME) as headline features, along with continued security and memory-tagging refinements. ARMv9 minor revisions through 2026 add SME and SME2 (Scalable Matrix Extension), further confidential-compute features, and incremental performance hints.

The practical consequence is that feature detection on AArch64 is feature-by-feature, not version-by-version. The ID_AA64* system registers expose feature bits readable by privileged code; user-mode software on Linux reads /proc/cpuinfo or uses the getauxval(AT_HWCAP) and AT_HWCAP2 interfaces, which the kernel populates from those system registers. macOS exposes a similar interface through sysctl. Compilers use these to dispatch optimized code variants in the same multi-versioning style as x86's CPUID-based dispatch.
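
A minimal sketch of user-mode detection on Linux, in Python, assuming the arm64 /proc/cpuinfo layout in which each processor entry carries a "Features" line of space-separated flags. The function names and the variant labels are hypothetical; the flag strings (asimd, atomics, sve, sve2) are the ones the arm64 kernel reports. Taking the intersection across cores is a conservative choice made for this example.

```python
def parse_features(cpuinfo_text):
    """Return the feature flags common to every CPU in an arm64 cpuinfo dump."""
    per_cpu = []
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "Features":
            per_cpu.append(set(value.split()))
    if not per_cpu:
        return set()          # not an arm64 Linux system, or an unexpected format
    return set.intersection(*per_cpu)

def pick_variant(features):
    """Pick a code path, most specialised first (variant names are hypothetical)."""
    for flag, variant in (("sve2", "sve2"), ("sve", "sve"), ("asimd", "neon")):
        if flag in features:
            return variant
    return "scalar"

# A shortened sample in the arm64 format, used when /proc/cpuinfo is unavailable.
SAMPLE = (
    "processor\t: 0\n"
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics\n"
    "\n"
    "processor\t: 1\n"
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics sve\n"
)
```

On a real system the text would come from reading /proc/cpuinfo. For SAMPLE, pick_variant(parse_features(SAMPLE)) returns "neon", because only one of the two cores reports sve. Each feature is tested by name; nothing is inferred from an architecture version number.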

The lesson for readers of vendor whitepapers: "ARMv8.2-A" is a useful shorthand but tells you only what the chip is required to implement; the optional features it actually includes are listed individually in its TRM (Technical Reference Manual). Apple's M-series, ARM's own Cortex-X cores, AWS Graviton, and Qualcomm's Oryon all implement broadly the same architecture version but differ in optional-feature uptake.

16. Looking Ahead

The remaining chapters of Part VIII develop AArch64.

Chapter 38 covers the programmer-visible model: registers, instruction categories, addressing modes, calling conventions, common idioms.

Chapter 39 covers the system architecture: the four exception levels, MMU and translation tables, exceptions and interrupts (GIC), system registers, the boot process.

Chapter 40 covers SIMD and vector: NEON (the historical SIMD), SVE/SVE2 (variable-length vectors), SME (matrix), and how they compare to x86's AVX/AVX-512.

Chapter 41 covers micro-architecture: ARM Cortex cores (A78, X4, Neoverse), Apple's Firestorm/Avalanche/Everest, and how modern AArch64 implementations are built.

By the end of Part VIII, AArch64 should be as familiar as x86-64.

17. Summary

ARM is the most ubiquitous ISA in computing by unit volume, dominating mobile and embedded and increasingly present in PCs and servers. Its history runs from the 1985 ARM1 to the modern AArch64 (introduced with ARMv8, announced in 2011 with first silicon in 2013), with the 64-bit transition now essentially complete in performance markets. ARM is licensed widely, with both architecture licensees designing custom cores (Apple, Qualcomm, Ampere) and implementation licensees using ARM's reference Cortex cores.

AArch64 is a clean RISC architecture: 31 general-purpose 64-bit registers, fixed 32-bit instruction width, weak memory model, four exception levels, no Thumb mode. The ecosystem is vast and varied, with tens of billions of devices shipping per year. Apple's M-series demonstrated that ARM can compete at the high end of single-thread performance; AWS Graviton and Ampere have brought ARM into mainstream cloud servers. The next chapter develops the programmer-visible model in detail.
