Instruction Categories
May 16, 2026·37 min read·intermediate
The previous chapter introduced an ISA as a contract between hardware and software, listing the kinds of things the contract specifies. We now look at the most concrete part of that contract: the instructions themselves. Every ISA defines a set, sometimes small and sometimes very large, of operations that the processor knows how to perform. The set varies from one ISA to another, but the categories — the broad families of operations that any general-purpose processor must support — are remarkably consistent across architectures. A modern programmer reading the manual for a new ISA will find familiar shapes: data movement, arithmetic, logic, shifts, comparisons, branches, calls, atomic operations, and system instructions. Different ISAs draw the lines between these categories slightly differently, and each adds its own specialized members, but the core taxonomy has been stable for fifty years.
This chapter walks through that taxonomy. For each family we will look at what the instructions do, what they typically encode, and how real ISAs (RISC-V, AArch64, x86-64) realize them. The aim is not exhaustive coverage of any single instruction set — manuals do that better — but a working sense of the kinds of operations a CPU offers and how they fit together to express programs.
01. Data Movement Instructions
The simplest and the most common family. A data-movement instruction copies a value from one place to another without changing it.
The "places" are typically a register, a memory location, an immediate value embedded in the instruction, or a special architectural register such as the program counter or a status register. The combinations differ by ISA.
In RISC ISAs, where memory access is reserved to load and store instructions, the data-movement family splits naturally:
- Loads copy a value from memory into a register.
- Stores copy a value from a register into memory.
- Register-to-register moves copy between registers (often implemented as an add, rd = rs + 0, or an or, rd = rs | 0, under a synonymous mnemonic).
- Move-immediate loads a constant into a register.
# RISC-V
ld a0, 0(a1) # load 8 bytes from memory[a1] into a0
sd a0, 0(a1) # store 8 bytes from a0 into memory[a1]
mv a0, a1 # pseudo-instruction; assembled as addi a0, a1, 0
li a0, 42 # pseudo-instruction; assembled as addi a0, x0, 42

; AArch64
ldr x0, [x1] ; load 8 bytes from [x1] into x0
str x0, [x1] ; store 8 bytes from x0 into [x1]
mov x0, x1 ; copy x1 to x0
mov x0, #42 ; load constant 42
In CISC ISAs like x86-64, a single mnemonic — MOV — covers most of the family, distinguished only by the operand types:
mov rax, [rbx] ; load: register from memory
mov [rbx], rax ; store: memory from register
mov rax, rbx ; register-to-register
mov rax, 42 ; immediate to register
The x86 unification is convenient for assembly programmers but obscures the underlying distinction: the load/store cases involve memory and have different latency, ordering, and exception behavior from the register-to-register case. Internally, modern x86 chips break the load/store form of MOV into separate µops anyway.
A few specialized members appear in nearly every ISA.
Load with sign or zero extension. When a 32-bit value is loaded into a 64-bit register, the upper 32 bits have to come from somewhere. Sign-extending loads (LDRSW on AArch64, LW on RISC-V) replicate the value's high bit into them; zero-extending loads (the 32-bit form of LDR on AArch64, LWU on RISC-V) fill them with zeros.
Load and store of partial widths. Bytes, half-words (16 bits), and words (32 bits) can all be moved individually, with the appropriate extension behavior.
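The two extension behaviors can be modeled in a few lines. The sketch below is an illustration in Python rather than any ISA's definition; the helper names zero_extend and sign_extend are ours, and registers are modeled as integers in the range 0 to 2^64 - 1.

```python
MASK64 = (1 << 64) - 1


def zero_extend(value: int, bits: int) -> int:
    """Keep the low `bits` bits; every higher bit of the result is 0."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Copy bit `bits - 1` of the value into every higher bit of a 64-bit register."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # XOR-then-subtract turns the unsigned field into its signed value;
    # masking gives the 64-bit two's-complement register image.
    return ((value ^ sign_bit) - sign_bit) & MASK64


word = 0x8000_0001  # a 32-bit value in memory with its high bit set
print(hex(zero_extend(word, 32)))  # 0x80000001
print(hex(sign_extend(word, 32)))  # 0xffffffff80000001
```

The same memory word ends up as two different 64-bit register values, which is why the ISA needs both kinds of load.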
Load-pair and store-pair. AArch64's LDP/STP move two registers to or from adjacent memory locations in one instruction. Useful for prologue/epilogue code that saves and restores registers in pairs.
Multi-register load/store. Older ARM (AArch32) and many CISC ISAs have instructions that load or store many registers in one go. Modern RISC ISAs typically do not, on the grounds that breaking the work into individual operations gives the scheduler more flexibility.
Pre- and post-indexed forms. AArch64 lets a load or store update its base register as a side effect, supporting common patterns like stack push/pop in a single instruction. RISC-V omits these to keep the encoding regular; x86 has dedicated PUSH/POP instructions instead.
A useful conceptual reminder: data-movement instructions move bits, not meanings. The hardware does not care whether the value being moved is an integer, a pointer, a floating-point number, or part of a struct. It moves the bytes; the program's interpretation is up to the program.
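A small illustration of "bits, not meanings", using Python's standard struct module: the same eight bytes are read once as a double and once as a 64-bit integer. Nothing about the bytes changes; only the interpretation does.

```python
import struct

raw = struct.pack("<d", 1.0)           # eight bytes holding the double 1.0
as_int = struct.unpack("<Q", raw)[0]   # the same bytes read as an unsigned 64-bit integer
print(hex(as_int))                     # 0x3ff0000000000000
```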
02. Arithmetic Instructions
The second-largest family. Arithmetic instructions perform numerical operations on values, almost always in registers.
The fundamental scalar integer operations are addition, subtraction, multiplication, division, and the related "modulo" or remainder. Variants may take signed or unsigned operands and may produce different results for the same inputs accordingly.
# RISC-V (RV64I + M extension)
add a0, a1, a2 # a0 = a1 + a2
sub a0, a1, a2 # a0 = a1 - a2
mul a0, a1, a2 # a0 = (a1 * a2) low 64 bits
mulh a0, a1, a2 # a0 = (a1 * a2) high 64 bits, signed
div a0, a1, a2 # a0 = a1 / a2 (signed)
divu a0, a1, a2 # a0 = a1 / a2 (unsigned)
rem a0, a1, a2 # a0 = a1 % a2 (signed)

; AArch64
add x0, x1, x2 ; x0 = x1 + x2
sub x0, x1, x2 ; x0 = x1 - x2
mul x0, x1, x2 ; x0 = x1 * x2 (low 64)
sdiv x0, x1, x2 ; signed division
udiv x0, x1, x2 ; unsigned division
madd x0, x1, x2, x3 ; x0 = x3 + x1 * x2 (multiply-add)

; x86-64
add rax, rbx ; rax += rbx
sub rax, rbx ; rax -= rbx
imul rax, rbx ; signed multiply (low 64 in rax)
idiv rbx ; rdx:rax / rbx → quotient in rax, remainder in rdx
A few details worth noting.
Multiply produces a wider result than its inputs. Two 64-bit operands produce a 128-bit result. ISAs handle this in different ways: RISC-V splits high and low halves into separate instructions (mul/mulh); x86 uses an implicit register pair (rdx:rax); AArch64 has separate MUL, UMULH, and SMULH instructions for the low and high halves, plus a widening UMULL that takes 32-bit operands and produces a 64-bit result.
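The split into low and high halves can be sketched in Python, whose integers are unbounded and so hold the full 128-bit product. The function names mirror the RISC-V mnemonics, but this is a model written for illustration, not the specification; registers are integers in the range 0 to 2^64 - 1.

```python
MASK64 = (1 << 64) - 1


def to_signed(x: int) -> int:
    """Interpret a 64-bit register image as a two's-complement signed value."""
    return x - (1 << 64) if x >> 63 else x


def mul(a: int, b: int) -> int:
    """Low 64 bits of the product (identical for signed and unsigned operands)."""
    return (a * b) & MASK64


def mulhu(a: int, b: int) -> int:
    """High 64 bits of the unsigned 128-bit product."""
    return ((a * b) >> 64) & MASK64


def mulh(a: int, b: int) -> int:
    """High 64 bits of the signed 128-bit product."""
    # Python's >> on a negative int is an arithmetic shift, so this keeps the sign.
    return ((to_signed(a) * to_signed(b)) >> 64) & MASK64


a, b = MASK64, 2                    # a is 2**64 - 1 unsigned, or -1 signed
print(hex(mul(a, b)))               # 0xfffffffffffffffe
print(hex(mulhu(a, b)))             # 0x1
print(hex(mulh(a, b)))              # 0xffffffffffffffff
```

The low half is the same either way; only the high half depends on whether the operands are treated as signed, which is why ISAs need two high-half instructions but only one low-half multiply.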
Division is slow. Even in modern processors, integer division takes 10 to 30 cycles, far more than addition (1 cycle) or multiplication (3 to 5 cycles). Compilers go to considerable lengths to replace divisions by constants with cheaper sequences (multiplications by reciprocals, shifts).
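As one instance of the reciprocal trick, an unsigned 32-bit division by 10 can be replaced by a multiplication and a shift. The constant 0xCCCCCCCD is the ceiling of 2^35 / 10; the function name div10 is ours, and the sketch covers only this one divisor and width.

```python
def div10(x: int) -> int:
    """Unsigned 32-bit x // 10 computed with one multiply and one shift."""
    if not 0 <= x < (1 << 32):
        raise ValueError("x must fit in 32 unsigned bits")
    # The product fits in 64 bits, so hardware needs only a 64-bit multiply.
    return (x * 0xCCCCCCCD) >> 35


print(div10(12345))  # 1234
```

A compiler picks the multiplier and shift per divisor; the pair above is specific to dividing 32-bit unsigned values by 10.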
Divide-by-zero is handled differently across ISAs. x86 raises a fault. AArch64's SDIV and UDIV return zero without faulting. RISC-V also does not fault and defines the result: the quotient is all-ones and the remainder is the dividend, for both signed and unsigned forms.
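A minimal model of the RISC-V signed divide and remainder, including the two non-faulting special cases (division by zero, and the overflow case of the most negative value divided by -1). The function names rv_div and rv_rem are ours; registers are modeled as integers in the range 0 to 2^64 - 1.

```python
MASK64 = (1 << 64) - 1
INT64_MIN = -(1 << 63)


def to_signed(x: int) -> int:
    """Interpret a 64-bit register image as a two's-complement signed value."""
    return x - (1 << 64) if x >> 63 else x


def rv_div(a: int, b: int) -> int:
    """Signed division, truncating toward zero, with RISC-V's special cases."""
    sa, sb = to_signed(a), to_signed(b)
    if sb == 0:
        return MASK64                      # all ones, i.e. -1
    if sa == INT64_MIN and sb == -1:
        return a & MASK64                  # overflow: quotient is the dividend
    q = abs(sa) // abs(sb)
    if (sa < 0) != (sb < 0):
        q = -q
    return q & MASK64


def rv_rem(a: int, b: int) -> int:
    """Signed remainder; the sign follows the dividend."""
    sa, sb = to_signed(a), to_signed(b)
    if sb == 0:
        return a & MASK64                  # remainder is the dividend
    if sa == INT64_MIN and sb == -1:
        return 0
    r = abs(sa) % abs(sb)
    if sa < 0:
        r = -r
    return r & MASK64
```

Code that must distinguish a real quotient of -1 from a division by zero therefore has to test the divisor explicitly.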
Multiply-add is common. A fused a + b * c operation is so useful that nearly every ISA has it. AArch64 has MADD/MSUB; RISC-V has it through a small extension; floating-point variants (FMA) appear universally.
Saturating arithmetic is offered by some ISAs for digital-signal-processing workloads. Where an ordinary $n$-bit add wraps modulo $2^n$ on overflow, a saturating add clamps to the maximum (or minimum) representable value. ARM's NEON and various SIMD extensions provide saturating variants.
Overflow detection is handled differently across ISAs. x86 sets the OF (overflow) flag on every arithmetic instruction; AArch64 sets condition flags (NZCV) only on instructions that explicitly request it (e.g., ADDS rather than ADD); RISC-V provides no condition flags at all and requires explicit branches on the result. We will return to flags shortly.
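On an ISA without flags, overflow has to be derived from the operands and the wrapped result. The sketch below shows one common formulation; the function names are ours and registers are modeled as integers in the range 0 to 2^64 - 1.

```python
MASK64 = (1 << 64) - 1


def add_overflows_signed(a: int, b: int) -> bool:
    """True when a + b does not fit in a signed 64-bit register."""
    s = (a + b) & MASK64
    # Overflow happens exactly when both operands share a sign bit
    # and the wrapped sum's sign bit differs from it.
    return (((a ^ s) & (b ^ s)) >> 63) == 1


def add_overflows_unsigned(a: int, b: int) -> bool:
    """True when a + b carries out of bit 63."""
    # A wrapped sum that is smaller than an operand means a carry was lost.
    return ((a + b) & MASK64) < a


print(add_overflows_signed((1 << 63) - 1, 1))  # True
print(add_overflows_unsigned(MASK64, 1))       # True
print(add_overflows_signed(MASK64, 1))         # False (-1 + 1)
```

Each check costs a few extra instructions, which is the price a flagless ISA pays on the relatively rare paths that care about overflow.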
Floating-point arithmetic is structurally similar but operates on a separate register file, with its own instructions: FADD, FSUB, FMUL, FDIV, FMA, plus type-conversion instructions to and from integer formats. We will not dwell on FP here; the operations parallel their integer counterparts but follow the IEEE 754 standard for representation, rounding, and exception behavior.
03. Logical Instructions
Bitwise operations: AND, OR, XOR, and NOT. They treat their operands as unstructured bit strings and produce a result bit by bit.
and a0, a1, a2 # bitwise AND
or a0, a1, a2 # bitwise OR
xor a0, a1, a2 # bitwise XOR
not a0, a1 # bitwise NOT (often a pseudo-instruction)
In RISC-V, not is a pseudo-instruction implemented as xori a0, a1, -1 (XOR with all-ones). AArch64 has a dedicated MVN (move NOT, that is, bitwise NOT). x86 has NOT.
Logical operations are the most common building blocks for bit manipulation: setting, clearing, testing, and toggling individual bits.
- Set bit n in r: or r, r, (1 << n)
- Clear bit n in r: and r, r, ~(1 << n)
- Toggle bit n in r: xor r, r, (1 << n)
- Test bit n in r: and tmp, r, (1 << n) followed by a branch on tmp
The mask 1 << n is itself produced with a shift; we look at those next.
A subtler use of logical operations is immediate forms. Most ISAs allow AND, OR, and XOR with a small immediate operand. The immediate has to fit in some number of bits in the encoding, and the constant-encoding tricks are different across ISAs:
- RISC-V allows a 12-bit signed immediate, sign-extended to 64 bits.
- AArch64 uses a curious encoding called the bitmask immediate that can represent any rotated repeating pattern of 1s and 0s — it covers 5,334 distinct 64-bit values, far more useful than a raw small constant for the masks programs typically use.
- x86 supports 8-, 16-, and 32-bit immediates for these operations (sign-extended where the operand is wider); a full 64-bit immediate is available only for MOV.
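The AArch64 figure can be checked by brute force. The sketch below assumes the usual description of a bitmask immediate: pick an element size of 2, 4, 8, 16, 32, or 64 bits, fill it with a run of ones that is neither empty nor full, rotate the run within the element, and replicate the element across 64 bits. The function name is ours, and the code enumerates values rather than decoding the actual instruction fields.

```python
def bitmask_immediates() -> set:
    """Enumerate every 64-bit value of the form described above."""
    values = set()
    size = 2
    while size <= 64:
        elem_mask = (1 << size) - 1
        for ones in range(1, size):          # all-zeros and all-ones are excluded
            run = (1 << ones) - 1
            for rot in range(size):
                # Rotate the run right by `rot` within a `size`-bit element.
                elem = ((run >> rot) | (run << (size - rot))) & elem_mask
                value = 0
                for i in range(64 // size):  # replicate across the register
                    value |= elem << (i * size)
                values.add(value)
        size *= 2
    return values


imms = bitmask_immediates()
print(len(imms))  # 5334
```

Typical masks such as 0xFF, 0x00FF00FF00FF00FF, and 0x5555555555555555 are all in the set, while an irregular constant such as 0b101 is not and has to be built some other way.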
A common idiom worth knowing: XORing a register with itself clears it. On x86, the 32-bit form xor eax, eax is the standard idiom for zeroing a register; modern x86 cores recognize it as a zeroing pattern that breaks the dependence on the register's previous value, and typically handle it in the rename stage without using an execution unit.
04. Shift and Rotate Instructions
Shifts move bits within a register. They come in three flavors: logical left, logical right, and arithmetic right.
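A logical left shift by $n$ multiplies an unsigned value by $2^n$ (modulo $2^w$, where $w$ is the register width):
$$x \ll n = (x \cdot 2^n) \bmod 2^w$$
Bits shifted out the top are discarded; zeros are shifted in at the bottom.
A logical right shift by $n$ divides an unsigned value by $2^n$:
$$x \gg_{\mathrm{logical}} n = \left\lfloor x / 2^n \right\rfloor$$
Bits shifted out the bottom are discarded; zeros are shifted in at the top.
An arithmetic right shift by $n$ divides a signed value by $2^n$, rounding toward negative infinity:
$$x \gg_{\mathrm{arith}} n = \left\lfloor x / 2^n \right\rfloor \quad (x \text{ signed})$$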
The high bit of the original value is replicated into the new high bits, preserving the sign.
sll a0, a1, a2 # logical left shift
srl a0, a1, a2 # logical right shift
sra a0, a1, a2 # arithmetic right shift
The distinction between SRL and SRA matters only for negative values. SRL of 0xFFFFFFFFFFFFFFFE (which is $-2$ as a signed value) by 1 is 0x7FFFFFFFFFFFFFFF ($2^{63} - 1$); SRA of the same value by 1 is 0xFFFFFFFFFFFFFFFF ($-1$, i.e., $-2 / 2$).
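The worked values can be reproduced with a small model of the two right shifts. The names srl and sra follow the RISC-V mnemonics but the code is only an illustration; it assumes a shift amount between 0 and 63 and models registers as integers in the range 0 to 2^64 - 1.

```python
MASK64 = (1 << 64) - 1


def srl(x: int, n: int) -> int:
    """Logical right shift: zeros enter at the top."""
    return (x & MASK64) >> n


def sra(x: int, n: int) -> int:
    """Arithmetic right shift: copies of the sign bit enter at the top."""
    x &= MASK64
    signed = x - (1 << 64) if x >> 63 else x
    # Python's >> on a negative int already shifts arithmetically.
    return (signed >> n) & MASK64


x = 0xFFFFFFFFFFFFFFFE      # -2 when read as a signed value
print(hex(srl(x, 1)))       # 0x7fffffffffffffff
print(hex(sra(x, 1)))       # 0xffffffffffffffff
```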
A rotate (or circular shift) is similar to a shift but the bits shifted out one end re-enter the other.
Rotates appear in cryptographic and hashing algorithms, and most ISAs have dedicated rotate instructions because building them from shifts and ORs takes more cycles. AArch64 has ROR (rotate right); x86 has ROL/ROR; RISC-V's base ISA does not, but the optional Zbb extension adds them.
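Without a rotate instruction, a rotate is built from two shifts and an OR, which is the sequence a dedicated instruction replaces. A sketch for 64-bit values, with function names of our choosing:

```python
MASK64 = (1 << 64) - 1


def rotr64(x: int, n: int) -> int:
    """Rotate right: bits leaving the bottom re-enter at the top."""
    n &= 63
    return ((x >> n) | (x << (64 - n))) & MASK64


def rotl64(x: int, n: int) -> int:
    """Rotate left: bits leaving the top re-enter at the bottom."""
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64


print(hex(rotr64(1, 1)))        # 0x8000000000000000
print(hex(rotl64(1 << 63, 1)))  # 0x1
```

Unlike a shift, a rotate loses no information: rotating left and then right by the same amount returns the original value.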
A few important details.
Shift amount is masked. In most ISAs, only the low bits of the shift-amount operand are used: 5 bits on a 32-bit shift, 6 on a 64-bit shift. A shift by 64 on a 64-bit register might therefore behave as a shift by 0 (RISC-V, x86) rather than producing zero. The exact behavior is ISA-specific; programs that need defined behavior at shift amounts greater than or equal to the register width must check explicitly.
Shifts as multiplication. A constant left shift is faster than a multiplication by a power of two; compilers always emit the shift. Variable shifts are also typically faster than variable multiplications, but the gap is smaller on modern processors that implement multiply efficiently.
Funnel shifts. Some ISAs (x86's SHLD/SHRD, AArch64's EXTR) include funnel-shift instructions that shift a concatenation of two registers as if they were a single double-width value. Useful for arbitrary-precision arithmetic and string manipulation.
05. Compare Instructions
A comparison takes two operands, computes their relationship (less than, equal, greater than, etc.), and signals the result. There are two main styles for representing the result.
Flags-Based Comparisons
CISC ISAs and ARM's flags-style instructions write the result into a small set of architectural status bits — the flags register or condition codes. The standard flags are:
- Zero (Z): result was zero.
- Negative (N): result was negative (high bit was 1).
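- Carry (C): unsigned overflow (a carry out of the high bit) occurred. For subtraction and comparison the conventions differ: x86 sets C on a borrow, while ARM sets C when there is no borrow.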
- Overflow (V): signed overflow occurred.
A comparison conceptually performs a subtraction and sets the flags according to the result, without storing the result anywhere.
; x86-64
cmp rax, rbx ; flags = (rax - rbx)
je equal ; jump if Z is set (i.e., rax == rbx)
jl less ; jump if N != V (signed less)
jb below ; jump if C is set (unsigned less)

; AArch64
cmp x0, x1 ; flags = (x0 - x1)
b.eq equal ; branch if Z is set
b.lt less ; branch if N != V (signed less)
b.lo below ; branch if C is clear (unsigned less)
Subsequent conditional branches read the flags and branch accordingly. The flags are an implicit communication channel between the comparison and the branch.
The convenience is that many arithmetic instructions also set flags as a side effect, so add followed by branch-if-zero does not need a separate compare. The drawback is that the flags are a hidden global state: each compare clobbers them, and the program must be careful not to insert a flag-setting instruction between a compare and the branch that uses its result.
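To make the flag definitions concrete, here is a small model of what a compare computes and how the three conditions used above read the result. The Flags type and function names are ours. The unsigned condition is stored as a borrow bit, which corresponds to C set on x86 and C clear on AArch64; registers are integers in the range 0 to 2^64 - 1.

```python
from typing import NamedTuple

MASK64 = (1 << 64) - 1


class Flags(NamedTuple):
    n: bool       # result's high bit
    z: bool       # result is zero
    borrow: bool  # unsigned a < b (x86: C = 1, AArch64: C = 0)
    v: bool       # signed overflow in a - b


def compare(a: int, b: int) -> Flags:
    """Compute the flags for a - b without keeping the difference."""
    diff = (a - b) & MASK64
    # Signed overflow: operands differ in sign and the result's sign differs from a's.
    v = (((a ^ b) & (a ^ diff)) >> 63) == 1
    return Flags(n=bool(diff >> 63), z=diff == 0, borrow=a < b, v=v)


def is_equal(f: Flags) -> bool:
    return f.z


def is_signed_less(f: Flags) -> bool:
    return f.n != f.v


def is_unsigned_less(f: Flags) -> bool:
    return f.borrow


f = compare(MASK64, 1)       # -1 versus 1 signed; 2**64 - 1 versus 1 unsigned
print(is_signed_less(f))     # True
print(is_unsigned_less(f))   # False
```

One compare serves both the signed and the unsigned question; the branch that follows chooses which combination of flags to read.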
Comparison-Producing-Result Style
RISC-V (and MIPS) avoid flags entirely. A comparison produces a 0 or 1 in a register, which the program then uses in a branch.
slt a0, a1, a2 # a0 = (a1 < a2 signed) ? 1 : 0
sltu a0, a1, a2 # a0 = (a1 < a2 unsigned) ? 1 : 0
For branches, RISC-V provides instructions that combine the comparison with the branch in one step:
beq a1, a2, label # branch if a1 == a2
bne a1, a2, label # branch if a1 != a2
blt a1, a2, label # branch if a1 < a2 (signed)
bge a1, a2, label # branch if a1 >= a2 (signed)
bltu a1, a2, label # branch if a1 < a2 (unsigned)
bgeu a1, a2, label # branch if a1 >= a2 (unsigned)
This is regular and avoids the hidden state of flags. The cost is that comparisons that need to be reused (e.g., for both a branch and a later result) must produce a register value with slt, occupying a register slot. In practice, branches dominate, and the combined branch-on-compare instructions cover most of what flags would be used for.
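The signed and unsigned variants exist because the same bit pattern orders differently under the two interpretations. A short model, with function names borrowed from the mnemonics and registers as integers in the range 0 to 2^64 - 1:

```python
MASK64 = (1 << 64) - 1


def to_signed(x: int) -> int:
    """Interpret a 64-bit register image as a two's-complement signed value."""
    return x - (1 << 64) if x >> 63 else x


def slt(a: int, b: int) -> int:
    """1 if a < b when both are read as signed values, else 0."""
    return int(to_signed(a) < to_signed(b))


def sltu(a: int, b: int) -> int:
    """1 if a < b when both are read as unsigned values, else 0."""
    return int(a < b)


a, b = MASK64, 1     # a is -1 signed, but the largest value unsigned
print(slt(a, b))     # 1
print(sltu(a, b))    # 0
```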
Predicated Execution
Some ISAs let the program compute a small value (1 or 0) from a comparison, then use it as a predicate on a later instruction: the instruction executes only if the predicate is true.
ARM's AArch64 has predicated forms for some instructions; AArch32 (the older 32-bit ARM) was famous for predicating every instruction. x86 has CMOVcc (conditional move), which writes its result only if a flag condition is true:
mov rax, 0
cmp rdx, rcx
cmovl rax, rbx ; if rdx < rcx, rax = rbx; else rax stays 0
The benefit of predication is that it can replace small if-else branches with branchless code, eliminating branch mispredictions on small unpredictable conditionals. The cost is that the predicated instruction's work is done unconditionally and only its result is suppressed; if the work is expensive or has side effects, predication is no faster than a branch.
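The same effect can be written without any conditional move, using a mask that is all-ones or all-zeros. This is the kind of branchless sequence a compiler might emit on an ISA that lacks CMOV; the function name select is ours and registers are modeled as 64-bit integers.

```python
MASK64 = (1 << 64) - 1


def select(cond: bool, if_true: int, if_false: int) -> int:
    """Return if_true when cond holds, else if_false, without branching on cond."""
    mask = -int(bool(cond)) & MASK64   # all ones when cond is true, zero otherwise
    return (if_true & mask) | (if_false & ~mask & MASK64)


# Mirrors the CMOV example: rax = rbx if rdx < rcx, else 0.
rbx, rcx, rdx = 42, 10, 3
rax = select(rdx < rcx, rbx, 0)
print(rax)  # 42
```

Both inputs are computed before the selection, which is exactly the "work is done unconditionally" cost described above.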
06. Branches and Jumps
Once the program has a comparison's result, it needs a way to alter control flow. Branches and jumps do this.
A branch is conditional: the next instruction depends on a runtime condition. An unconditional jump changes the program counter regardless of any condition.
Several variations exist.
Direct branches specify their target as a constant offset from the current PC. The instruction encodes the offset; the hardware adds it to PC and fetches the next instruction from there.
Indirect branches read their target from a register or memory location. They are needed for return-from-function, virtual method dispatch, switch statements, and any other case where the target is not known at compile time.
; AArch64
b label ; unconditional direct branch
b.eq label ; conditional direct branch
br x0 ; unconditional indirect branch (to address in x0)
ret ; return: branch to the address in x30 (the link register)

# RISC-V
j label # unconditional direct jump (pseudo: jal x0, label)
jal ra, label # jump-and-link: ra = PC+4, jump to label
jalr ra, 0(rs1) # indirect jump (with link)
ret # pseudo: jalr x0, 0(ra)
The combination of direct/indirect and conditional/unconditional gives four basic branch types, of which three are common (no widely-used ISA has indirect conditional branches).
A few important details.
Branch-target reach. Direct branches can only reach as far as the offset field in their encoding allows. Conditional branches typically have shorter reach (because they share encoding bits with the condition); unconditional jumps have longer reach. We saw the specifics in the original Chapter 13 discussion of immediate encoding.
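As a concrete case, RISC-V's base ISA gives conditional branches a 13-bit signed byte offset and JAL a 21-bit signed byte offset, both restricted to even values. The check below is a sketch of the test an assembler or linker has to make; the function names are ours.

```python
def fits_signed(offset: int, bits: int) -> bool:
    """True if offset is representable as a `bits`-bit two's-complement value."""
    return -(1 << (bits - 1)) <= offset < (1 << (bits - 1))


def rv_branch_reachable(offset: int) -> bool:
    """Conditional branch: even byte offset in a 13-bit signed field (about ±4 KiB)."""
    return offset % 2 == 0 and fits_signed(offset, 13)


def rv_jal_reachable(offset: int) -> bool:
    """JAL: even byte offset in a 21-bit signed field (about ±1 MiB)."""
    return offset % 2 == 0 and fits_signed(offset, 21)


print(rv_branch_reachable(4094))   # True
print(rv_branch_reachable(4096))   # False
print(rv_jal_reachable(4096))      # True
```

When a target is out of reach, the toolchain typically falls back to a longer sequence, such as an inverted branch around an unconditional jump.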
Branch delay slots. Older RISC ISAs (MIPS, original SPARC) defined the instruction immediately after a branch to execute regardless of whether the branch was taken — the delay slot. This was an artifact of simple pipelines that could not stop the instruction in flight when a branch took effect. Modern ISAs do not have delay slots; the hardware is fast enough to handle the branch cleanly, and the visible architectural complication of delay slots is no longer worth it.
Branch hints. Some ISAs allow the compiler to provide a hint about whether a branch is likely to be taken. PowerPC has explicit hint bits in the encoding. x86 used to use 2E and 3E prefixes (now mostly ignored). The hint helps the hardware choose which path to fetch before the branch resolves. Modern dynamic branch prediction (Chapter 23) has made hints much less important.
07. Calls and Returns
A function call is a special branch that records where to return. A return is a special branch to the recorded return point. Together they implement the procedure-call abstraction that all higher-level languages rely on.
There are several common conventions for storing the return address.
Link register. RISC-V, ARM, PowerPC, MIPS — most RISC ISAs — store the return address in a designated register. The call instruction (jal on RISC-V, bl on ARM) writes PC+4 (or PC+next-instruction) into the link register and branches. The return instruction reads the link register and branches there.
Stack-based return. x86 stores the return address on the stack. The CALL instruction pushes the return address onto the stack and jumps; RET pops the address and jumps to it.
; AArch64
bl foo ; x30 = PC+4; branch to foo
; (later, in foo)
ret ; branch to address in x30

; x86-64
call foo ; push PC+5; jump to foo
; (later, in foo)
ret ; pop into PC

# RISC-V
jal ra, foo # ra = PC+4; jump to foo
# (later, in foo)
ret # pseudo: jalr x0, 0(ra)
The link-register style is faster for leaf functions (functions that do not call other functions), because the return address never has to touch memory: it stays in the register throughout the call. But once a function calls another, the link register has to be saved somewhere — typically on the stack — to free it for the inner call.
The stack-based style requires a memory write on every call, which is slower for shallow call trees but does not require the function prologue to save the link register before making sub-calls.
We will return to the full mechanics of function calls — argument passing, stack frames, register saving — in Chapter 14, where we look at calling conventions and ABIs in detail.
08. Atomic Instructions
A multi-threaded program needs to update shared variables safely. An ordinary load-modify-store sequence is unsafe: another thread can interleave between the load and the store, causing lost updates or other inconsistencies. The hardware solves this with atomic instructions: operations that complete as a single, indivisible unit, with respect to other CPUs and to interrupts.
Two main families of atomic primitives exist.
Atomic Read-Modify-Write Instructions
A single instruction performs the load, the modification, and the store as one unit. On x86, the LOCK prefix turns most read-modify-write instructions atomic:
lock add [rcx], rax ; atomically: *rcx = *rcx + rax
lock cmpxchg [rcx], rdx ; atomically: if *rcx == rax then *rcx = rdx, else rax = *rcx
lock xadd [rcx], rax ; atomically: tmp = *rcx; *rcx += rax; rax = tmp (exchange-and-add)
Specific operations include atomic add, atomic increment/decrement, atomic exchange, and the fundamental atomic primitive compare-and-swap (CAS), which atomically reads a memory location, compares it to an expected value, and if they match, writes a new value. CAS is sufficient to implement essentially every concurrent algorithm; it is the universal atomic primitive.
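The standard way to build other operations on top of CAS is a retry loop: read the old value, compute the new one, and attempt the swap until no other thread has intervened. The sketch below models this in Python. The AtomicCell class is hypothetical, and its lock merely stands in for the atomicity that the hardware instruction provides.

```python
import threading


class AtomicCell:
    """Models one shared memory word; the lock stands in for hardware atomicity."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Write `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def fetch_add(cell: AtomicCell, delta: int) -> int:
    """Atomic add built from CAS: retry until no other thread interfered."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old


def run_demo(threads: int = 4, increments: int = 1000) -> int:
    """Increment one shared cell from several threads and return the final value."""
    cell = AtomicCell()

    def worker() -> None:
        for _ in range(increments):
            fetch_add(cell, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()
```

Calling run_demo(4, 1000) returns 4000: a failed CAS costs a retry, but no increment is ever lost.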
Load-Linked / Store-Conditional
RISC ISAs use a different primitive. A pair of instructions — load-linked (sometimes called load-reserved) and store-conditional — together implement atomic operations.
# RISC-V
loop:
lr.d t0, (a0) # load-reserved: read *a0, mark it linked
addi t0, t0, 1 # modify
sc.d t1, t0, (a0) # store-conditional: write *a0 = t0 if still linked
# (t1 = 0 on success, nonzero on failure)
bnez t1, loop # retry if failed
The LR instruction reads a memory location and creates a reservation on that location's cache line. The SC instruction attempts to store, but only succeeds if the reservation is still intact — that is, no other CPU has written to the line since the LR. If the reservation has been broken, the SC fails and returns a non-zero value, signaling that the program must retry.
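The reservation mechanism can be simulated in a single thread. In the sketch below, the Memory class is a hypothetical model with one reservation, and its ordinary store method plays the role of a write from another CPU; real hardware tracks reservations per CPU and usually at cache-line granularity.

```python
class Memory:
    """Toy memory with a single LR/SC reservation."""

    def __init__(self) -> None:
        self.words = {}
        self.reservation = None

    def load_reserved(self, addr: int) -> int:
        self.reservation = addr
        return self.words.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        """An ordinary store, standing in for a write by another CPU."""
        self.words[addr] = value
        if self.reservation == addr:
            self.reservation = None        # the write breaks the reservation

    def store_conditional(self, addr: int, value: int) -> int:
        """Return 0 and store on success; return 1 and store nothing on failure."""
        ok = self.reservation == addr
        self.reservation = None
        if ok:
            self.words[addr] = value
        return 0 if ok else 1


def atomic_increment(mem: Memory, addr: int, interfere_once: bool = False) -> int:
    """The LR / modify / SC / retry loop; returns the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        t0 = mem.load_reserved(addr)
        if interfere_once and attempts == 1:
            mem.store(addr, 100)           # simulated interference between LR and SC
        if mem.store_conditional(addr, t0 + 1) == 0:
            return attempts
```

With interference on the first attempt, the SC fails, the loop reloads the new value 100, and the second attempt stores 101; the stale increment is never written.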
Compared with CAS, LR/SC is more flexible: a short sequence of instructions can sit between the LR and the SC, computing whatever modification the program wants, although ISAs restrict what that sequence may contain if forward progress is to be guaranteed. CAS, by contrast, can only store a single precomputed value. ARM and RISC-V both provide LR/SC; on ARM the pair is called LDXR/STXR. ARM has also added CAS instructions (CAS, CASB, etc.) in the Armv8.1 Large System Extensions.
Atomic Operations and Memory Ordering
Atomic instructions also serve as memory barriers — points at which the program forces ordering between memory operations. We will treat memory ordering in detail in Chapter 31; the relevant fact for now is that atomic instructions on most ISAs come in several variants with different ordering strengths: relaxed (no ordering), acquire (subsequent loads and stores cannot move before), release (preceding loads and stores cannot move after), and sequentially consistent (full barrier). The ISA syntax encodes the chosen strength.
09. Floating-Point Arithmetic
Integers cover most of what computers do, but a great deal of important work (graphics, simulation, signal processing, machine learning, much of scientific computing) needs fractional numbers. Every modern general-purpose ISA therefore includes a floating-point instruction family that mirrors the integer family closely while adding the considerations specific to a finite-precision number system.
The representation is the IEEE 754 standard. Single precision occupies 32 bits (1 sign, 8 exponent, 23 fraction); double precision occupies 64 bits (1 sign, 11 exponent, 52 fraction); half precision occupies 16 bits and is widely used in machine-learning workloads; quad precision occupies 128 bits and is used in a few high-precision scientific contexts. The standard fixes the value of every bit pattern, the rounding rules, the handling of infinities and NaNs, and the exception conditions, so that the same FP program produces (modulo platform-defined details) the same answer on every conformant machine.
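The single-precision layout can be inspected directly. The sketch below uses Python's standard struct module to obtain the 32-bit pattern of a value and splits it into the three fields; the function name decode_f32 is ours.

```python
import struct


def decode_f32(x: float):
    """Return (sign, biased exponent, fraction) of x as an IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction


print(decode_f32(1.0))    # (0, 127, 0)
print(decode_f32(-2.5))   # (1, 128, 2097152)
```

For -2.5 the value is $-1.25 \times 2^1$: the biased exponent is 127 + 1 = 128 and the fraction field holds 0.25 in units of $2^{-23}$.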
The instruction set parallels the integer one closely: fadd, fsub, fmul, fdiv perform the four basic operations; fsqrt computes the square root; fma (fused multiply-add) computes $a \times b + c$ in one operation with a single rounding. Comparisons (fcmp, fle, fmin, fmax) and conversions (fcvt.s.d, fcvt.d.w, etc.) round out the family. RISC-V provides them in the F and D extensions; AArch64 includes them as part of the base ISA; x86 has both legacy x87 stack-based instructions and modern scalar SSE/AVX instructions (addss, addsd, vaddss).
Four issues distinguish floating point from integer arithmetic.
Rounding modes. Most results cannot be represented exactly and must be rounded. IEEE 754 defines five modes (round to nearest with ties-to-even, round to nearest with ties-away-from-zero, round toward zero, toward $+\infty$, toward $-\infty$), selectable through a control register or, on RISC-V, through bits in the instruction itself. The default is round-to-nearest-even; deliberately changing the mode is rare and usually for interval arithmetic.
Exception flags. Five sticky flags — invalid, divide-by-zero, overflow, underflow, inexact — record any unusual conditions encountered. Software can read or clear them, and on architectures that support trapping, individual flags can be configured to raise an exception instead of being recorded silently.
Special values. $+\infty$, $-\infty$, signaling and quiet NaNs, and signed zero are all distinct floating-point values whose behaviour is fixed by IEEE 754. A NaN compares unequal to everything including itself; dividing a nonzero finite value by zero produces an infinity; dividing zero by zero produces NaN.
Determinism and reproducibility. Floating-point addition is not associative; rearranging $(a + b) + c$ to $a + (b + c)$ may change the result by the size of the rounding error. Compiler flags such as -ffast-math permit reorderings that the standard forbids; their use silently changes numerical results, sometimes by huge amounts in ill-conditioned code. Programs that need bit-identical results across platforms must restrict themselves to standard-conformant operations and avoid such flags.
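A three-term sum is enough to show the effect. Python floats are IEEE 754 doubles, so the example below behaves the same way as the equivalent C code compiled without reordering flags; the values are chosen so that one grouping loses the small term entirely.

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # the large terms cancel first, then 1.0 is added
right = a + (b + c)   # 1.0 is absorbed into -1e16 by rounding, then cancelled

print(left)   # 1.0
print(right)  # 0.0
```

Near $10^{16}$ adjacent doubles are 2 apart, so adding 1.0 to -1e16 rounds back to -1e16 and the contribution of c disappears.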
10. SIMD and Vector Instructions
The instruction families above operate on a single value at a time. A great deal of computation — graphics, signal processing, dense linear algebra, simulation, machine learning — applies the same operation to many values, and ISAs reflect this with single-instruction–multiple-data (SIMD) instructions that pack several values into a single wide register and operate on them in parallel.
Three generations of SIMD design exist in the wild.
Fixed-width packed SIMD. A wide register — 128, 256, or 512 bits — holds several smaller elements (sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit values for a 128-bit register). One instruction operates element-wise on all of them. Intel's SSE (128-bit), AVX/AVX2 (256-bit), and AVX-512 (512-bit), AMD's parallel implementations, and ARM's NEON Advanced SIMD (128-bit) all fall into this category. Each new width has historically required a new instruction encoding and a new round of compiler and library work; binaries built for one width do not automatically benefit from a wider one.
Length-agnostic vectors. ARM's Scalable Vector Extension (SVE) and RISC-V's V extension both abandon the fixed width. The vector register is some power-of-two number of bits (128 to 2048 on SVE, anywhere from 128 upward on RV-V), determined by the implementation; software queries the length at runtime and writes loops that work for any length. The same binary runs efficiently on chips with different vector widths. The encoding is more elaborate than packed SIMD's but the portability benefit is substantial.
Predicated SIMD. SVE and RV-V also introduce predicate (or mask) registers: per-element booleans that gate which lanes participate in an operation. Predication eliminates the need for tail-handling code at the end of a loop and lets vector instructions express data-dependent control flow without scalar fallback.
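The combination of length-agnostic loops and predication can be sketched with a scalar simulation. In the Python model below, vl is the number of lanes the hypothetical hardware provides, and the predicate list plays the role of a "while less than" mask; the function name and structure are ours and do not correspond to SVE or RVV instructions.

```python
def vector_add(xs, ys, vl):
    """Add two equal-length lists as if using vl-lane predicated vector operations."""
    n = len(xs)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: a lane is active only if its element index is in bounds.
        pred = [i + lane < n for lane in range(vl)]
        # One "vector instruction": all lanes are independent of each other.
        for lane in range(vl):
            if pred[lane]:
                out[i + lane] = xs[i + lane] + ys[i + lane]
        i += vl
    return out


xs = list(range(10))
ys = [10 * x for x in xs]
print(vector_add(xs, ys, 4) == vector_add(xs, ys, 16))  # True
```

The loop body is identical for every value of vl, and the final partial iteration needs no separate tail code because the predicate switches off the out-of-range lanes.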
At the instruction level, SIMD instructions look much like their scalar counterparts with a width or shape suffix:
; AArch64 NEON
add v0.4s, v1.4s, v2.4s ; four 32-bit lanes added in parallel
fmla v0.2d, v1.2d, v2.2d ; two 64-bit fused multiply-add
; AArch64 SVE (length-agnostic)
add z0.s, p0/m, z0.s, z1.s ; predicated 32-bit add, length determined at runtime
; x86-64 AVX2
vpaddd ymm0, ymm1, ymm2 ; eight 32-bit integer adds in 256-bit registers
# RISC-V V
vadd.vv v8, v16, v24 # element-wise vector add
A few cross-cutting issues recur. Alignment is often required (or strongly preferred) for SIMD memory accesses; misaligned vector loads are slower or, on some older designs, fault. Lane crossing (shuffling, permutation, reduction across lanes) is the slowest and most encoding-hungry part of any SIMD ISA. Power consumption is significant: a 512-bit AVX-512 instruction can draw enough current that some CPUs lower their clock frequency for a period afterward, a behaviour that has caused real performance regressions when a single AVX-512 use slows the rest of a process. Compilers and hand-tuned libraries spend most of their effort on these concerns rather than on the basic arithmetic.
We will return to SIMD and vector machines in detail in Chapter 29. For the cycle and category picture, the relevant fact is that SIMD instructions are an entire parallel family of every category we have already met — movement, arithmetic, logical, compare — and they form the bulk of the encoding space of every modern high-performance ISA.
11. Bit-Manipulation Instructions
A family that sits between the basic logical operations and the exotic specialized ones is bit manipulation. These instructions perform commonly needed bit-level operations that, while expressible from AND/OR/XOR/shift, would take many instructions and many cycles in the obvious form. Almost every modern ISA has acquired a bit-manipulation extension over the past two decades, motivated by the prevalence of these idioms in cryptography, hashing, compression, networking, and bit-vector data structures.
The core members of the family include:
- Population count (popcnt, cnt): the number of set bits in a word. Used in hash functions, error-correcting codes, and combinatorial algorithms.
- Count leading zeros (clz, lzcnt): the number of zero bits above the highest set bit. Equal to $w - 1 - \lfloor \log_2 x \rfloor$ for a positive $w$-bit integer $x$, and useful for normalizing floating-point values, hashing, and priority queues.
- Count trailing zeros (ctz, tzcnt): the number of zero bits below the lowest set bit. Used in bit-set iteration: "find the next set bit" is ctz(x); x &= x - 1;.
- Bit reverse (rbit on AArch64, brev on others): reverse the order of bits in a word. Used in fast Fourier transforms.
- Byte reverse (bswap, rev): reverse the bytes in a word. The standard endianness-conversion instruction.
- Bit-field extract and insert (bfx, bfi, extr, bextr): extract or insert a contiguous run of bits at a specified position. Useful for parsing packed structures and bit-packed encodings.
- Pdep / pext (x86 BMI2): deposit or extract bits at positions specified by a mask. Powerful primitives for chess engines, compression, and bitboard data structures.
- Carry-less multiplication (pclmulqdq, pmull): multiplication in $\mathrm{GF}(2)[x]$ (polynomials over $\mathrm{GF}(2)$) rather than ordinary integer multiplication. The fundamental primitive for CRC and AES-GCM.
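The counting operations and the bit-set iteration idiom can be modeled as follows. These are reference definitions for 64-bit values written for clarity, with names of our choosing; they say nothing about how hardware computes the results.

```python
MASK64 = (1 << 64) - 1


def popcount64(x: int) -> int:
    """Number of set bits in a 64-bit word."""
    return bin(x & MASK64).count("1")


def clz64(x: int) -> int:
    """Zero bits above the highest set bit (64 for x == 0)."""
    return 64 - (x & MASK64).bit_length()


def ctz64(x: int) -> int:
    """Zero bits below the lowest set bit (64 for x == 0)."""
    x &= MASK64
    # x & -x isolates the lowest set bit.
    return 64 if x == 0 else (x & -x).bit_length() - 1


def set_bits(x: int):
    """Yield the index of each set bit, lowest first."""
    x &= MASK64
    while x:
        yield ctz64(x)
        x &= x - 1   # clear the lowest set bit


print(popcount64(0b1011))          # 3
print(clz64(1))                    # 63
print(ctz64(8))                    # 3
print(list(set_bits(0b101001)))    # [0, 3, 5]
```

The iteration loop runs once per set bit rather than once per bit position, which is what makes the idiom attractive for sparse bit sets.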
; AArch64
clz x0, x1 ; x0 = count_leading_zeros(x1)
rbit x0, x1 ; bit reverse
bfi x0, x1, #4, #8 ; insert 8 bits of x1 into x0 starting at bit 4
; x86-64 (BMI1/BMI2)
lzcnt rax, rbx ; count leading zeros
tzcnt rax, rbx ; count trailing zeros
bextr rax, rbx, rcx ; extract a bit field
pdep rax, rbx, rcx ; deposit bits per mask
pext rax, rbx, rcx ; extract bits per mask
RISC-V provides this family through the optional Zbb, Zbs, Zbc, and Zbkb extensions; AArch64 has most of it in the base ISA; x86 has accumulated it through SSE4.2, ABM, and BMI1/BMI2 over many years. A program that needs these operations can either use compiler intrinsics (__builtin_popcount, _lzcnt_u64, etc.) or fall back to a software implementation; the difference in performance can be an order of magnitude.
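The choice between intrinsics and a software fallback might look like the following C sketch. It assumes GCC or Clang for the __builtin_* forms; the macro names (POPCOUNT64, CTZ64, CLZ64) and the helper functions are hypothetical. It also shows the bit-set iteration idiom and the relationship between a leading-zero count and the integer base-2 logarithm.

```c
#include <stdint.h>

/* Portable fallbacks. A compiler may or may not recognise these loops and
   turn them into single instructions. */
int popcount_soft(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }       /* clear one set bit per iteration */
    return n;
}

int ctz_soft(uint64_t x) {               /* caller guarantees x != 0 */
    int n = 0;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

int clz_soft(uint64_t x) {               /* caller guarantees x != 0 */
    int n = 0;
    while ((x >> 63) == 0) { x <<= 1; n++; }
    return n;
}

#if defined(__GNUC__) || defined(__clang__)
#define POPCOUNT64(x) __builtin_popcountll(x)
#define CTZ64(x)      __builtin_ctzll(x)   /* undefined for x == 0 */
#define CLZ64(x)      __builtin_clzll(x)   /* undefined for x == 0 */
#else
#define POPCOUNT64(x) popcount_soft(x)
#define CTZ64(x)      ctz_soft(x)
#define CLZ64(x)      clz_soft(x)
#endif

/* floor(log2(x)) for nonzero 64-bit x, derived from the leading-zero count. */
int floor_log2_u64(uint64_t x) {
    return 63 - CLZ64(x);
}

/* Bit-set iteration: store the index of every set bit, lowest first.
   Returns the number of indices written. */
int set_bit_indices(uint64_t x, int out[64]) {
    int n = 0;
    while (x != 0) {
        out[n++] = CTZ64(x);   /* index of the lowest set bit */
        x &= x - 1;            /* remove it */
    }
    return n;
}
```

Whether the builtins compile to popcnt, tzcnt, or clz depends on the target flags in effect; without them the compiler typically emits its own software sequence.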
12. Cryptographic and Domain-Specific Instructions
A more recent and faster-growing family is cryptographic and other domain-specific accelerators built directly into the ISA. The motivation is performance and side-channel resistance: a single hardware instruction that performs an AES round is both faster and (when designed carefully) less leaky than a software implementation.
The most universally implemented cryptographic family is AES. x86's AES-NI provides AESENC, AESENCLAST, AESDEC, AESDECLAST, and AESKEYGENASSIST; AArch64 provides AESE, AESD, AESMC, AESIMC. A modern implementation of AES-128 in CTR mode can encrypt at over 5 GB/s per core using these instructions. The same design pattern — ISA support for a single round of a popular primitive, with the surrounding scaffolding done in software — has been applied to SHA-1, SHA-2, SHA-3 (SHA1RNDS4, SHA256RNDS2, SHA1H/SHA256H/SHA256H2, etc.), SM3 and SM4 (Chinese standards, present in some AArch64 and x86 chips), CRC32 (CRC32/CRC32C instructions on x86 and AArch64), and GCM multiplication (the PCLMULQDQ and PMULL instructions noted above).
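As a point of comparison for the CRC instructions, the following C sketch computes CRC-32C (the Castagnoli polynomial, which the x86 CRC32 instruction and the AArch64 CRC32C* instructions use) one bit at a time. The function name crc32c_soft is hypothetical. The initial value and final inversion follow the common CRC-32C convention, which software applies around the hardware instruction as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32C using the reflected polynomial 0x82F63B78. */
uint32_t crc32c_soft(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;                  /* conventional initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u);     /* all ones if the low bit is set */
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return ~crc;                                 /* conventional final inversion */
}
```

The inner loop performs eight shift-and-conditional-XOR steps per input byte; the hardware instruction folds up to eight bytes into the running value in one step, which is where the speedup comes from.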
A related but distinct family is random-number generation. x86's RDRAND and RDSEED return cryptographically-strong random numbers from an on-chip entropy source; AArch64's RNDR and RNDRRS do the same. These are not strictly cryptographic primitives but are usually grouped with them because they fill the same role in security-sensitive code.
Matrix and tensor extensions are the most recent additions. Intel's AMX (Advanced Matrix Extensions) introduces eight tmm0–tmm7 two-dimensional registers and a tdpbf16ps/tdpbssd family of tile multiply-accumulate instructions targeted at machine-learning workloads. ARM's SME (Scalable Matrix Extension) and the corresponding RISC-V extensions play similar roles. These instructions blur the line between an ISA and a co-processor; they often have their own startup/shutdown semantics, their own state to save and restore, and significant impact on operating-system context-switch logic.
A last category worth mentioning is virtualization-acceleration instructions: Intel's VT-x (VMXON, VMLAUNCH, VMRESUME, VMREAD, VMWRITE), AMD's SVM (VMRUN), ARM's hypervisor-mode instructions, and RISC-V's H extension. (VM entry and VM exit are the names of the transitions these instructions trigger, not instructions themselves.) These are technically system instructions, covered in Section 15, but they share the domain-specific flavour: an entire problem domain (running a guest OS at near-native speed) absorbed into the ISA.
The practical lesson is that a modern ISA is no longer a small, regular set of orthogonal operations. It is that core, plus a long and growing tail of domain-specific extensions, each of which is essential to some workload and irrelevant to most others. The runtime CPU-dispatch machinery introduced in Chapter 11 is what lets a single binary use these accelerators when present and fall back to portable code when not.
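One common shape for such dispatch is a function pointer chosen on first use. The sketch below assumes GCC or Clang on x86-64, where __builtin_cpu_supports and the target attribute are available, and falls back to portable code elsewhere. The names count_bits, select_popcount, popcount_hw, and popcount_portable are hypothetical, and this is not necessarily the mechanism a given library uses.

```c
#include <stdint.h>

typedef int (*popcount_fn)(uint64_t);

/* Portable version: works on any CPU. */
static int popcount_portable(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* Compiled with POPCNT enabled for this one function only, so the rest of
   the binary still runs on CPUs that lack the instruction. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x) {
    return __builtin_popcountll(x);
}

static popcount_fn select_popcount(void) {
    __builtin_cpu_init();                      /* populate the CPU feature data */
    return __builtin_cpu_supports("popcnt") ? popcount_hw : popcount_portable;
}
#else
static popcount_fn select_popcount(void) {
    return popcount_portable;                  /* no feature probing on other targets */
}
#endif

/* Public entry point: resolves the implementation on the first call.
   Not thread-safe as written; a real dispatcher would resolve at load time
   or guard the assignment. */
int count_bits(uint64_t x) {
    static popcount_fn impl;                   /* NULL until first call */
    if (!impl) impl = select_popcount();
    return impl(x);
}
```

Callers use count_bits without knowing which implementation was selected, which is the property that lets one binary serve both old and new processors.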
13. String and Block-Memory Instructions
A small but historically significant family of instructions operates on entire ranges of memory in a single architectural step. They are most prominent on x86, where they have been part of the ISA since the 8086, and have made occasional appearances on other architectures.
The x86 family consists of string instructions combined with a repeat prefix (REP, REPE, or REPNE):
```asm
rep movsb      ; copy RCX bytes from [RSI] to [RDI]
rep stosb      ; fill RCX bytes at [RDI] with the byte in AL
repe cmpsb     ; compare up to RCX bytes at [RSI] and [RDI]; stops at the first mismatch
repne scasb    ; scan up to RCX bytes at [RDI] for the byte in AL; stops at the first match
```
A single rep movsb is logically equivalent to a loop that copies one byte per iteration. Modern x86 implementations recognize the common rep movsb/rep stosb patterns and execute them with optimized microcode that copies a cache line at a time — the fast strings and enhanced REP MOVSB (ERMSB) features. On a chip that supports them, rep movsb is often the fastest way to copy memory, faster even than a hand-tuned AVX-512 loop. On chips without them, it is much slower than a software loop. The presence of the optimization is reported through CPUID feature bits, and glibc's memcpy chooses between several implementations at startup based on what it finds.
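The logical equivalence can be sketched in C. The first function is the byte-at-a-time loop that defines what rep movsb means when the direction flag is clear. The second issues the instruction itself through GCC/Clang extended inline assembly on x86-64 and falls back to the loop elsewhere. Both function names are hypothetical, and real memcpy implementations are considerably more elaborate.

```c
#include <stddef.h>

/* Architectural meaning of rep movsb (direction flag clear):
   copy n bytes, one at a time, from ascending addresses. */
void copy_bytes_loop(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
}

/* Same effect using the instruction on x86-64 with GCC or Clang.
   RDI, RSI, and RCX are both inputs and outputs, hence the "+" constraints. */
void copy_bytes_rep_movsb(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    copy_bytes_loop(dst, src, n);   /* portable fallback on other targets */
#endif
}
```

The inline-assembly version relies on the calling convention's guarantee that the direction flag is clear on function entry.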
AArch64 takes a different path. It has no general rep mechanism, but recent ARMv8.8 and v9 chips add dedicated memory-copy and memory-set instructions (CPYP, CPYM, CPYE, SETP, SETM, SETE), issued as a prologue, main, and epilogue triple, that the architecture defines explicitly to be the equivalent of a loop. How the bytes are divided among the three instructions is left to the implementation, and the register state after each one provides a clearly specified resumption protocol on interruption. The intent is the same: let software invoke a hardware-optimized bulk copy without having to know cache-line widths or the right unroll factor.
RISC-V has historically left this to software; the Zilsd extension and its relatives add load/store-pair instructions but no explicit block-copy instruction.
A related family is prefetch instructions: hints that a future load or store is likely, allowing the cache to fetch the line in advance. Every modern ISA provides them: x86's PREFETCHT0/PREFETCHT1/PREFETCHT2/PREFETCHNTA, AArch64's PRFM family, and RISC-V's prefetch.r, prefetch.w, and prefetch.i in the Zicbop extension. We will return to prefetching in detail in Chapters 17 and 50.
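From C, a prefetch hint is usually reached through a compiler builtin rather than written by hand. The sketch below assumes GCC or Clang, whose __builtin_prefetch lowers to the target's prefetch instruction where one exists. The function name sum_indirect and the prefetch distance of 16 elements are hypothetical choices for illustration, not tuned values.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)   /* hint only: omitting it is always correct */
#endif

enum { PREFETCH_DISTANCE = 16 };     /* hypothetical look-ahead, in elements */

/* Sum values[index[i]] for all i. The indirect access pattern is hard for a
   hardware prefetcher to predict, so the loop hints the line it will need
   PREFETCH_DISTANCE iterations from now. */
uint64_t sum_indirect(const uint32_t *values, const uint32_t *index, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_READ(&values[index[i + PREFETCH_DISTANCE]]);
        total += values[index[i]];
    }
    return total;
}
```

Because a prefetch has no architectural effect, the result is identical with or without it; only the timing can change, and whether it improves depends on the access pattern and the machine.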
14. Hint and No-Op Instructions
A last small but practically important family is hint instructions: instructions whose architectural effect is nothing, but whose execution conveys information to the implementation that may change its behaviour.
The simplest is the no-op, an instruction that does nothing. Every ISA has one (or several). Its uses include padding code to a desired alignment (compilers emit NOPs to align hot loop entry points), reserving space for runtime patching (the kernel's text-poke mechanism overwrites a NOP region with a real call), and acting as a deliberate marker in instrumented code. Most ISAs encode the NOP as a degenerate form of an existing instruction (RISC-V's nop is addi x0, x0, 0); x86 has a family of multi-byte NOPs explicitly designed for alignment padding (0F 1F 00, 0F 1F 40 00, 0F 1F 84 00 00 00 00 00, ..., up to fifteen bytes) so that the compiler can pad to any alignment without using multiple instructions.
More interesting are the real hints. PAUSE (x86), YIELD (AArch64), and pause (RISC-V's Zihintpause extension) tell the CPU that the current thread is in a spin-wait loop and that it is safe to lower power, switch to another SMT thread, or otherwise back off. They can dramatically reduce the cost of contended spin locks. WFI (wait for interrupt) and WFE (wait for event) on AArch64 put the core into a low-power state until something specific happens. SEV (send event) wakes other cores that are in WFE. Branch hints (PowerPC's predicted-likely/unlikely bits, x86's deprecated 2E/3E prefixes) tell the front-end which way a branch is expected to go.
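A spin-wait hint typically sits in the body of a lock's retry loop. The following C11 sketch assumes GCC or Clang for the x86 builtin and the AArch64 inline assembly; the names cpu_relax, demo_spinlock, spin_lock, and spin_unlock are hypothetical, and a production lock would add back-off and fairness that this one lacks.

```c
#include <stdatomic.h>

/* Emit the ISA's spin-wait hint where this sketch knows one; otherwise do
   nothing, which is still correct because the hint has no architectural effect. */
static inline void cpu_relax(void) {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_ia32_pause();                    /* PAUSE */
#elif defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("yield" ::: "memory");    /* YIELD */
#endif
}

typedef struct {
    atomic_flag held;
} demo_spinlock;

/* Spin until the flag is acquired, hinting the CPU on every failed attempt. */
void spin_lock(demo_spinlock *l) {
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        cpu_relax();
}

void spin_unlock(demo_spinlock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is initialised with `demo_spinlock l = { ATOMIC_FLAG_INIT };` before first use.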
A more recent and security-flavoured family is the speculation-control hints introduced in response to Spectre and related attacks. LFENCE on x86 acts as a serializing barrier that prevents speculative execution past it; the compiler inserts it after a bounds check that gates a sensitive memory access. AArch64's CSDB (consume speculation data barrier) plays a similar role. SSBB/PSSBB restrict speculative store-bypass behaviour. These instructions are architecturally NOPs in the sense that they have no effect on functional state, but their hint to the implementation prevents a class of side-channel attacks. We will examine the underlying issues in Chapter 51.
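The bounds-check pattern can be sketched in C as follows. The sketch assumes GCC or Clang on x86-64, where it emits LFENCE through inline assembly; on other targets the barrier is left empty, which keeps the function correct but removes the protection. The names speculation_barrier and read_checked are hypothetical, and this is an illustration of where the barrier goes, not a vetted mitigation.

```c
#include <stddef.h>
#include <stdint.h>

static inline void speculation_barrier(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("lfence" ::: "memory");
#else
    /* Assumption: nothing is emitted here. Other ISAs need their own
       sequence (for example one built around CSDB on AArch64). */
#endif
}

/* Return table[idx] if idx is in range, otherwise fallback. The barrier sits
   between the bounds check and the dependent load, so the load cannot be
   issued speculatively with an out-of-range index. */
uint8_t read_checked(const uint8_t *table, size_t len, size_t idx,
                     uint8_t fallback) {
    if (idx >= len)
        return fallback;
    speculation_barrier();
    return table[idx];
}
```

The functional result is the same with or without the barrier, which is the sense in which these instructions behave as architectural no-ops.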
The broader point is that a real ISA's instruction set is wider than the user-mode arithmetic and memory operations would suggest. A comfortable reading of any architecture manual requires at least passing acquaintance with the hint family, because it is where many of the implementation-visible behaviours surface as architectural commitments.
15. System Instructions
The last family is the set of instructions that interact with the privileged state of the processor. Most are usable only in kernel mode.
System call instructions transfer control to the operating system. We saw these in Chapter 15: syscall on x86-64, svc on AArch64, ecall on RISC-V. From the user-mode program's point of view, this is the only system instruction.
Privileged instructions perform operations that user mode is not allowed to invoke. Categories include:
- MMU configuration: writing the page-table base register, flushing the TLB, switching address spaces.
- Interrupt management: reading and writing interrupt-controller registers, masking and unmasking interrupts, returning from interrupt handlers.
- Cache management: flushing or invalidating cache lines, setting cache coherence behavior.
- Mode and privilege control: switching privilege levels, modifying the processor's configuration registers, halting the CPU.
- Performance monitoring: configuring and reading the performance counters we met in Chapter 10.
- Debug: setting hardware breakpoints and watchpoints.
```asm
; AArch64 examples
msr   ttbr0_el1, x0     ; write the user page-table base register
tlbi  vmalle1           ; invalidate the TLB
ic    iallu             ; invalidate all instruction cache lines
isb                     ; instruction synchronization barrier
eret                    ; return from exception
```
```asm
# RISC-V examples
csrrw a0, satp, a1      # atomically read/write supervisor-mode page-table register
sfence.vma              # flush the TLB
mret                    # return from machine-mode trap
wfi                     # wait for interrupt (low-power)
```
x86's privileged instructions are scattered through the ISA: MOV to and from control and debug registers; RDMSR/WRMSR to access model-specific registers (MSRs); INVLPG to flush a TLB entry; WBINVD to write back and invalidate caches; IRET to return from interrupts; HLT to halt the CPU; MWAIT to wait for a write to a monitored address. These are the building blocks of operating-system kernels.
A category that has emerged in recent decades is fence and barrier instructions — explicit synchronization points used by the program to enforce orderings of memory accesses. We will examine these in detail in Chapter 31.
16. A Cross-ISA Summary
A useful mental table of the major instruction categories across the three main ISAs:
| Category | Examples (RISC-V) | Examples (AArch64) | Examples (x86-64) |
|---|---|---|---|
| Data movement | ld, sd, mv, li | ldr, str, mov | mov, lea |
| Integer arithmetic | add, sub, mul, div | add, sub, mul, sdiv | add, sub, imul, idiv |
| Logical | and, or, xor | and, orr, eor | and, or, xor |
| Shifts and rotates | sll, srl, sra (Zbb: rol, ror) | lsl, lsr, asr, ror | shl, shr, sar, rol, ror |
| Compare | slt/sltu, beq/bne/... | cmp, b.cond | cmp, test, jcc |
| Branches and jumps | jal, jalr, beq, ... | b, b.cond, br, bl | jmp, jcc, call |
| Calls and returns | jal ra,..., ret | bl, ret | call, ret |
| Atomic | lr.d/sc.d, amoadd.d, ... | ldxr/stxr, cas | lock prefix, cmpxchg |
| System | ecall, mret, sfence.vma, ... | svc, eret, tlbi, ... | syscall, iret, invlpg, rdmsr, ... |
Specialized families layered on top of these include floating-point arithmetic, SIMD/vector operations (Part VI), cryptography accelerators, bit manipulation, and various domain-specific extensions. The details vary, but every modern ISA has, at minimum, the categories above.
17. Summary
The instruction set of any general-purpose ISA falls into a small number of broad categories. Data movement instructions (loads, stores, register moves, immediate loads) shuffle bits between registers, memory, and the instruction stream. Arithmetic instructions (add, subtract, multiply, divide, multiply-add) operate on integer or floating-point values; the floating-point family adds rounding modes, exception flags, special values, and IEEE 754's reproducibility guarantees. Logical instructions (AND, OR, XOR, NOT) and shift/rotate instructions provide the bit-level primitives for masks, packing, and bit-twiddling code, and bit-manipulation extensions — popcount, leading- and trailing-zero counts, bit-field extract/insert, deposit/extract, carry-less multiply — cover idioms that would otherwise take many instructions. Comparison instructions, paired with branches and jumps, control the program counter; calls and returns implement the procedure-call abstraction. Atomic instructions provide the synchronization primitives that multi-threaded programs need. SIMD and vector families lift every category from scalar to wide, in either fixed-width packed form (SSE/AVX, NEON) or modern length-agnostic predicated form (SVE, RV-V). Cryptographic and other domain-specific instructions — AES, SHA, CRC, GCM multiplication, on-chip random numbers, matrix-multiply tiles — fold whole problem domains into the ISA. Block-memory instructions and prefetch hints accelerate bulk data movement. Hint instructions — NOPs of varying widths, PAUSE, YIELD, WFI, branch hints, speculation barriers — carry information to the implementation without changing architectural state. System instructions — most of them privileged — let the operating system configure the machine.
Different ISAs draw the lines between these categories differently and ornament them with their own specialized members, but the categories themselves have been stable for half a century. A programmer who understands them can read any modern ISA's manual without surprise, and a compiler writer can target a new ISA by mapping the same intermediate-language operations onto the same families of machine instructions.
Chapter 13 turns from the instructions themselves to the form they take on disk and on the way to the processor: machine code, assembly language, and the toolchain that connects them.