Part IV·Microarchitecture·Chapter 21 of 62


From ISA to Micro-Architecture

May 16, 2026·20 min read·intermediate


The previous parts described what a processor does — its instruction set, its memory model, the contract it presents to software. None of those parts said anything about how the processor performs its work internally. The same ISA can be implemented in radically different ways: a tiny embedded core that handles one instruction every several cycles, a smartphone CPU that executes four or five instructions per cycle, a server processor that issues a dozen at once with hundreds in flight. They all run the same programs and produce the same results, because the ISA — the architecture — fixes only the externally visible behavior. The internal implementation is the micro-architecture.

This chapter introduces the distinction and the vocabulary. The remaining chapters of Part V — pipelining, branch handling, superscalar execution, out-of-order execution, the load/store path, decode and microcode — flesh out the techniques that turn a clean ISA into a fast machine. Before getting into the techniques, it is useful to spend a chapter on what the boundary between architecture and micro-architecture actually is, why it matters, and how a designer thinks about implementation when freed from the ISA's surface constraints.

01.Architecture versus Micro-Architecture

The architecture of a processor is its instruction set: the registers visible to software, the instruction encodings, the memory model, the exception model, the privilege levels. Two implementations of the same architecture must run the same programs and produce the same observable results. They do not have to do so in the same way.

The micro-architecture is the implementation: the internal pipeline structure, the cache hierarchy, the branch predictor, the issue queue, the execution units, the register file. Two micro-architectures of the same architecture can differ in every internal detail and still be compatible at the software level.

A small example. The instruction add x0, x1, x2 on AArch64 specifies that the value in x1 is added to the value in x2 and the result placed in x0. That is all the architecture says. A simple in-order implementation might:

Read x1 and x2 from a register file in cycle 1.
Compute the sum in an ALU in cycle 2.
Write the result back to x0 in cycle 3.

A more aggressive implementation might:

Decode the instruction and rename x0 to a physical register in cycle 1.
Wait in an issue queue until x1 and x2 are ready, then issue to one of several ALUs in some later cycle.
Compute the sum in one cycle.
Hold the result in the physical register until the instruction retires and its result becomes architecturally visible.

The same ISA-level operation, executed in profoundly different ways. The architecture says nothing about which one a chip uses. Software written for the architecture works on either.
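The two implementations above can be mimicked in a few lines of Python. This is only a sketch: the function names, the dictionary-based register file, and the renaming scheme are hypothetical simplifications, not a model of any real core. What it shows is that both routes leave the same architectural state.

```python
# Two toy models of `add x0, x1, x2`. All names here are hypothetical.
MASK64 = 2**64 - 1

def run_in_order(regs):
    """Read operands, add, write back: the three-step in-order version."""
    regs = dict(regs)
    a, b = regs["x1"], regs["x2"]          # cycle 1: register read
    total = (a + b) & MASK64               # cycle 2: 64-bit ALU add
    regs["x0"] = total                     # cycle 3: writeback
    return regs

def run_renamed(regs):
    """Rename x0 to a fresh physical register, execute, then retire."""
    phys = {f"p{i}": v for i, v in enumerate(regs.values())}   # physical register file
    rename = {name: f"p{i}" for i, name in enumerate(regs)}    # architectural -> physical map
    dest = f"p{len(phys)}"                                     # fresh physical register for x0
    phys[dest] = (phys[rename["x1"]] + phys[rename["x2"]]) & MASK64
    rename["x0"] = dest                                        # retire: new mapping becomes visible
    return {name: phys[p] for name, p in rename.items()}

initial = {"x0": 0, "x1": 7, "x2": 35}
assert run_in_order(initial) == run_renamed(initial) == {"x0": 42, "x1": 7, "x2": 35}
```

Software can only inspect the returned architectural registers, so it cannot tell which of the two functions ran.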

The decoupling is what makes computer architecture as an engineering discipline possible. ISA designers can fix a clean, durable contract and let implementations evolve underneath it for decades. Programs written for the original Intel 8086 in 1978 still run on a modern x86-64 chip — not on the original implementation, but on a totally different micro-architecture that happens to honor the same architectural contract.

The other half of the bargain is that micro-architecture matters for performance, even though it does not matter for correctness. A program that runs ten times faster on one chip than another may be using the same ISA, with all the same instructions doing all the same work, but the chip's internal organization makes a profound difference in throughput.

02.Why a New Layer

A naive question: why not just build a processor that executes one instruction at a time, fast? A single-cycle implementation would be conceptually simple, easy to reason about, and free of the mysteries of pipelining and out-of-order execution.

The answer is that single-cycle implementations are slow. The clock frequency of a single-cycle processor is bounded by the slowest instruction — the one that takes the longest combinational path through the datapath. If a memory access requires 30 nanoseconds and an addition requires 1 nanosecond, every cycle has to be at least 30 ns long, so the addition runs at the speed of the memory access.
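The arithmetic behind that bound is small enough to write down. The sketch below reuses the 30 ns and 1 ns figures from the paragraph above; the dictionary and function names are made up for illustration.

```python
# Hypothetical per-instruction combinational delays, in nanoseconds.
delay_ns = {"add": 1.0, "load": 30.0}

def single_cycle_period(delays):
    """A single-cycle design must fit its slowest instruction in one cycle."""
    return max(delays.values())

period = single_cycle_period(delay_ns)
slowdown_for_add = period / delay_ns["add"]
print(f"clock period: {period} ns, max frequency: {1e3 / period:.1f} MHz")
print(f"an add occupies {slowdown_for_add:.0f}x more time than its own logic needs")
```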

Modern processors have to be fast in two senses simultaneously: each individual instruction completes quickly, and many instructions complete per second. These two goals — latency and throughput — pull in different directions. Reducing latency might mean making each stage simpler; increasing throughput might mean overlapping many instructions, which adds complexity to each stage.

The history of micro-architecture is a sequence of techniques that resolve the tension by overlapping work in time and space. Pipelining lets several instructions occupy different stages of the processor at once, so each stage works on a different instruction every cycle. Superscalar execution puts multiple parallel pipelines side by side, so several instructions can occupy the same stage at the same time. Out-of-order execution lets the hardware re-order instructions when their data dependencies allow, so that one instruction's stall does not stall the rest. Speculation runs instructions before their preconditions are confirmed, then either uses the results or throws them away. Caches put the most-used data in fast on-chip memory, so the slow main memory does not bound the cycle time.
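A rough cycle count shows how much the first two techniques buy. The sketch assumes an ideal machine with no hazards or stalls, which real pipelines never achieve; the function names and the instruction count are illustrative.

```python
def unpipelined_cycles(n_instr, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instr * n_stages

def pipelined_cycles(n_instr, n_stages, width=1):
    """Ideal pipeline: fill once, then complete `width` instructions per cycle."""
    groups = -(-n_instr // width)          # ceiling division
    return n_stages + groups - 1

n, k = 1000, 5
print(unpipelined_cycles(n, k))            # 5000
print(pipelined_cycles(n, k))              # 1004
print(pipelined_cycles(n, k, width=4))     # 254
```

Pipelining approaches one instruction per cycle once the pipeline is full; a four-wide superscalar version approaches four per cycle under the same idealized assumptions.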

Every one of these techniques is invisible to the architecture. Each one would, in principle, be acceptable to a software contract that says nothing about timing. In practice, some of them — speculative execution especially — have leaked through the contract in unexpected ways (Spectre and Meltdown, Chapter 51), but the original intent is clear: implementation freedom for the chip designer, fixed semantics for the software.

03.Hardware Description: From RTL to Silicon

Modern processor design happens at a level called register-transfer level (RTL). RTL describes the chip as a network of registers (small banks of state storage) connected by combinational logic that computes the next state from the current state and any external inputs. RTL is captured in hardware description languages — Verilog and VHDL primarily, with newer languages like SystemVerilog, Chisel, and SpinalHDL gaining ground.

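A small RTL fragment, in Verilog, for a single ALU stage. The port widths and opcode encodings are illustrative assumptions:

Verilog

module alu_stage (input wire clk, input wire [1:0] op,
                  input wire [63:0] a, b, output reg [63:0] result);
    localparam ADD = 2'd0, SUB = 2'd1, AND = 2'd2;   // any other op selects OR
    // One operation is captured into the result register on each rising edge.
    always @(posedge clk) begin
        if (op == ADD)        result <= a + b;
        else if (op == SUB)   result <= a - b;
        else if (op == AND)   result <= a & b;
        else                  result <= a | b;
    end
endmodule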

The <= is a non-blocking assignment: at every clock edge, result updates simultaneously with all other registers. The combinational logic — the muxes that pick which expression to evaluate, the adders and AND/OR gates inside the expressions — sits between the registers and resolves in the time between edges.
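The "all registers update together" behavior can be imitated in software with a two-phase update: compute every next value from the old state, then commit them all at once. The sketch below is a loose analogy with hypothetical names, not a Verilog simulator.

```python
def clock_edge(state, next_state_fn):
    """Evaluate every register's next value from the OLD state, then commit all at once."""
    return next_state_fn(dict(state))

def swap_logic(s):
    # Models: always @(posedge clk) begin a <= b; b <= a; end
    return {"a": s["b"], "b": s["a"]}

def sequential_update(s):
    # What one-at-a-time (blocking-style) updates would do instead.
    s = dict(s)
    s["a"] = s["b"]
    s["b"] = s["a"]     # sees the already-overwritten a
    return s

regs = {"a": 1, "b": 2}
assert clock_edge(regs, swap_logic) == {"a": 2, "b": 1}     # values swap
assert sequential_update(regs) == {"a": 2, "b": 2}          # original a is lost
```

With non-blocking semantics the two registers exchange values in a single edge; with sequential updates the old value of `a` is overwritten before `b` can read it.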

The RTL description goes through several transformations on the way to silicon.

Synthesis turns the high-level RTL into a network of standard cells — basic logic elements (NAND gates, flip-flops, multiplexers) drawn from a library specific to the manufacturing process. A synthesis tool reads the Verilog and produces a netlist: a graph of cells and the wires connecting them.

Placement decides where on the silicon each cell physically sits. A modern chip may have hundreds of millions or billions of cells; placement is a large optimization problem balancing wire length, congestion, and timing.

Routing decides what shape each wire takes through the metal layers above the silicon. Modern processes have ten or more metal layers, each with its own pitch and width rules.

Verification at every stage checks that the design still behaves correctly. RTL simulation runs the Verilog against test vectors. Formal verification proves properties using mathematical methods. Post-synthesis simulation checks the netlist. Static timing analysis checks that every signal arrives at every register before its clock edge.

Tape-out sends the final layout to the foundry. The foundry uses the layout to make photolithography masks; the masks are used to expose silicon wafers; the wafers are processed into chips.

This pipeline — from a Verilog file describing pipeline stages to a fabricated chip running real software — takes years and costs hundreds of millions of dollars at advanced process nodes. The economics are why ISAs change slowly and why micro-architectures evolve in steady increments rather than radical leaps.

04.The Datapath and the Control

A processor has, at the highest level, two interlocking parts.

The datapath is the network of registers, wires, and functional units that holds and computes on data. It contains the register file, the ALU, the load/store unit, the cache, the multiplexers and buses that route data between them. The datapath does the actual work of computation — adding, multiplying, loading, storing.

The control is the logic that tells the datapath what to do each cycle. It reads the current instruction (or its decoded form) and the current state of the machine, and it asserts the right control signals: which register to read, which ALU operation to perform, whether to write back, where to fetch the next instruction. The control is the brain that orchestrates the datapath's activity.

A simple in-order datapath for a RISC-like ISA looks roughly like:

Figure: Simple in-order RISC datapath: PC, I-cache, decoder, register file, ALU, and D-cache stacked vertically with a writeback path back to the register file

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.8cm}]
  \node[blk] (pc)  at (3, -0.5) {PC};
  \node[blk] (ic)  at (3, -2)   {I-Cache};
  \node[blk] (dec) at (3, -3.5) {Decoder};
  \node[blk] (rf)  at (3, -5)   {Reg File};
  \node[blk] (alu) at (3, -6.5) {ALU};
  \node[blk] (dc)  at (3, -8)   {D-Cache};
  \draw[->] (pc) -- (ic);
  \draw[->] (ic) -- (dec);
  \draw[->] (dec) -- (rf);
  \draw[->] (rf) -- (alu);
  \draw[->] (alu) -- (dc);
  % Writeback: results return from the D-cache stage to the register file.
  \draw[->] (dc.east) -- ++(1.2, 0) |- node[pos=0.25, right] {writeback} (rf.east);
\end{tikzpicture}

The control reads each instruction and configures the datapath to do exactly what that instruction needs: read the right source registers, choose the right ALU operation, possibly access the data cache, and write back to the right destination.
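One way to picture the split is as a lookup table (the control) driving a fixed routine (the datapath). The sketch below assumes a toy four-instruction ISA; the mnemonics, signal names, and operand convention are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Control:
    alu_op: str        # operation the ALU performs
    mem_read: bool     # access the data cache for a load
    mem_write: bool    # access the data cache for a store
    reg_write: bool    # write a result back to the register file

# Control: one row of signals per instruction (hypothetical toy ISA).
CONTROL_TABLE = {
    "add":   Control("add", False, False, True),
    "sub":   Control("sub", False, False, True),
    "load":  Control("add", True,  False, True),    # ALU computes the address
    "store": Control("add", False, True,  False),
}

def execute(regs, mem, op, rd, rs1, operand):
    """Datapath: does whatever the control signals select. `operand` is a
    register name for add/sub and an immediate offset for load/store."""
    ctl = CONTROL_TABLE[op]
    b = operand if (ctl.mem_read or ctl.mem_write) else regs[operand]
    alu_out = regs[rs1] + b if ctl.alu_op == "add" else regs[rs1] - b
    if ctl.mem_write:
        mem[alu_out] = regs[rd]            # for a store, rd names the data source
    result = mem[alu_out] if ctl.mem_read else alu_out
    if ctl.reg_write:
        regs[rd] = result

regs, mem = {"x1": 10, "x2": 3, "x3": 0}, {}
execute(regs, mem, "store", "x2", "x1", 4)   # mem[14] = x2
execute(regs, mem, "load",  "x3", "x1", 4)   # x3 = mem[14]
execute(regs, mem, "add",   "x3", "x3", "x1")
assert regs["x3"] == 13 and mem == {14: 3}
```

The `execute` routine never inspects the mnemonic directly; adding an instruction that reuses existing hardware means adding a row to the table, not changing the datapath.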

Even this small drawing hides enormous complexity. The "ALU" is a multi-function unit with adders, shifters, comparators. The "Reg File" has multiple read ports and at least one write port. The "Decoder" has to handle hundreds of distinct instruction encodings. The control logic, which we have not drawn, decides what to do every cycle based on the decoded instruction and the current state.

A more sophisticated datapath — pipelined, with multiple functional units, caches, branch predictors, and out-of-order issue — multiplies this complexity many times over. We will assemble it piece by piece in the rest of Part V.

05.Cycle, Frequency, and Critical Path

A processor runs on a clock: a periodic signal whose edges trigger the registers to capture new values. Between edges, the combinational logic computes the next value for each register. The cycle has to be long enough for every register's input to be valid and stable by the time the next edge arrives.

The longest combinational path in the design, the critical path, sets the maximum clock frequency. If the critical path takes 250 picoseconds, the clock period must be at least 250 ps, and the frequency at most $1 / 250 \text{ ps} = 4 \text{ GHz}$.
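Finding the critical path amounts to a longest-path search through the logic between two registers. The sketch below uses an invented netlist whose delays were chosen to reproduce the 250 ps example; the block names and numbers are assumptions, and a real timing tool also accounts for wire delay, clock skew, and process variation.

```python
from functools import lru_cache

# Hypothetical combinational delays (ps) between two registers.
# Each node maps to (its own delay, list of nodes that feed it).
LOGIC = {
    "regfile_read": (60, []),
    "bypass_mux":   (20, ["regfile_read"]),
    "adder":        (120, ["bypass_mux"]),
    "shifter":      (90, ["bypass_mux"]),
    "result_mux":   (30, ["adder", "shifter"]),
    "setup":        (20, ["result_mux"]),      # flip-flop setup time
}

@lru_cache(maxsize=None)
def arrival_ps(node):
    """Latest time at which this node's output settles after the clock edge."""
    delay, inputs = LOGIC[node]
    return delay + max((arrival_ps(i) for i in inputs), default=0)

critical_ps = max(arrival_ps(n) for n in LOGIC)
f_max_ghz = 1e3 / critical_ps
print(critical_ps, "ps ->", round(f_max_ghz, 2), "GHz")   # 250 ps -> 4.0 GHz
```

The shifter path settles earlier than the adder path, so speeding up the shifter would not raise the clock; only the adder path matters here.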

A faster processor can be obtained by:

Shortening the critical path. Splitting a long stage into two shorter ones (more pipelining), simplifying the logic, using better circuit-level tricks. Each technique reduces the time between two adjacent registers.
Doing more per cycle. Even at the same frequency, executing two instructions per cycle instead of one doubles throughput. This is what superscalar and out-of-order do.
Overlapping latency. Even when individual operations are slow (a memory access, a divide), keeping the pipeline busy with other work hides the latency in aggregate throughput.

These three are the levers a micro-architect pulls. The chapters that follow explore each in detail.

A useful equation, sometimes called the iron law of performance:

$\text{time per program} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}.$

The first term is set by the program and the ISA (more powerful instructions reduce the count; simpler ones increase it). The third term is the clock period, set by the critical path. The middle term, cycles per instruction (CPI) — or its reciprocal, instructions per cycle (IPC) — is where most of micro-architecture's leverage lies. Pipelining drives ideal CPI toward 1; superscalar drives it below 1; out-of-order keeps it close to its ideal even when long-latency operations would otherwise stall.
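Plugging numbers into the iron law makes the trade-offs concrete. The instruction count, CPI values, and clock periods below are invented for illustration only.

```python
def time_per_program(instructions, cpi, clock_period_s):
    """Iron law: instructions x cycles/instruction x seconds/cycle."""
    return instructions * cpi * clock_period_s

n = 2_000_000_000   # hypothetical dynamic instruction count

base = time_per_program(n, cpi=1.0, clock_period_s=250e-12)   # scalar pipeline at 4 GHz
wide = time_per_program(n, cpi=0.5, clock_period_s=250e-12)   # IPC of 2 at the same clock
deep = time_per_program(n, cpi=1.2, clock_period_s=200e-12)   # deeper pipeline at 5 GHz, more bubbles

print(f"base {base:.2f} s, wide {wide:.2f} s, deep {deep:.2f} s")
```

In this made-up case, doubling IPC halves the run time, while the 25% faster clock of the deeper pipeline is mostly eaten by its higher CPI.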

06.Performance, Power, and Area

A processor design is not optimized for speed alone. Three quantities — performance, power, and area — interact in every design decision.

Performance is what software cares about: instructions per second, frames per second, transactions per second. It is set by the equation above.

Power is what the device, the cooling system, and the battery care about. Power has two main components: dynamic power, dissipated as transistors switch (proportional to $C V^2 f$, where $C$ is switched capacitance, $V$ is supply voltage, $f$ is frequency); and static power, dissipated as leakage even when nothing switches (a substantial fraction of total power on modern processes).
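The quadratic dependence on voltage is what makes voltage the most valuable knob. The sketch below evaluates the proportionality with made-up values for $C$, $V$, and $f$, folding the activity factor into $C$ and ignoring static power.

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic power is proportional to C * V^2 * f (activity factor folded into C)."""
    return c_farads * v_volts**2 * f_hz

nominal = dynamic_power(1e-9, 1.0, 4e9)        # about 4 W with these made-up values
scaled  = dynamic_power(1e-9, 0.8, 3.2e9)      # voltage and frequency both cut by 20%
print(scaled / nominal)                        # about 0.512
```

A 20% cut in both voltage and frequency costs 20% of the clock rate but removes roughly half of the dynamic power.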

Area is what the manufacturing cost cares about: bigger chips are more expensive, both because they use more silicon and because larger areas have lower yields (one defect ruins the whole chip).

Every design choice trades these off. Adding a larger cache improves performance but costs area and (slightly) power. Increasing the pipeline depth allows higher frequency but costs power (more registers switching every cycle) and may cost performance (more bubbles on branch misprediction). Adding more execution units improves IPC but costs area, power, and verification effort.

The balance is workload-dependent. A server processor running database queries cares about throughput more than absolute frequency, and the answer is more cores in a given area budget. A gaming console cares about per-thread latency on a fixed power budget, and the answer is fewer, faster cores. A mobile phone trades absolute peak performance for energy efficiency at typical loads. A microcontroller in a sensor cares about dollars per chip and milliwatts of standby power, and the answer is a tiny in-order core with no caches.

The same architecture can be implemented at any of these points. A modern AArch64 server chip and a tiny AArch64 microcontroller both implement the same ISA; their micro-architectures differ in essentially every internal detail. Part V develops the techniques in the abstract; Parts VII through IX show specific real-world implementations on x86, ARM, and RISC-V.

07.Clocking, Power Domains, and DVFS

The single global clock implied by the diagrams above is a useful fiction. A modern processor distributes its clock through a tree of buffers — a clock distribution network — that has to deliver edges to hundreds of millions of registers within a small fraction of a cycle of one another (the skew budget, often a few tens of picoseconds). The clock network alone consumes a substantial fraction of the chip's dynamic power, because every buffer in the tree switches every cycle whether or not the logic it drives has any work to do.

The response is clock gating. A small piece of logic disables the clock to any block that has nothing to do this cycle: the floating-point unit when no FP instructions are in flight, the SIMD pipes when only scalar code is running, an idle core when the OS has parked it. Clock gating is so universal that almost every block in a modern design has gating logic at its root, and post-silicon validation tools track gating efficiency as a primary metric.

A more aggressive technique is power gating: cutting power to an entire block, not just its clock. Power gating eliminates leakage as well as dynamic power, but coming back from a power-gated state takes much longer than re-enabling a clock-gated one (state has to be reloaded; transistors have to settle), so it is reserved for blocks that will be idle for many microseconds or more.

Finally, modern processors run at variable clock frequency and supply voltage, dynamically adjusted in response to workload and thermal conditions. Dynamic Voltage and Frequency Scaling (DVFS) lets a core boost frequency briefly when only one or two cores are active and the thermal headroom allows, or drop frequency aggressively when idle. Each combination of voltage and frequency is called a P-state; transitions are coordinated by a small power-management controller on the chip. We will return to DVFS, turbo boost, and thermal throttling in Chapter 52; for now, the point to remember is that "the" clock frequency on a modern processor is rarely a single number.
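A power-management controller's core decision can be sketched as a table lookup under a budget. The P-state table, the effective capacitance, and the selection policy below are all hypothetical; real controllers also weigh temperature, current limits, and workload hints.

```python
# Hypothetical P-state table: (frequency in GHz, supply voltage in V).
P_STATES = [(1.0, 0.70), (2.0, 0.85), (3.0, 1.00), (4.0, 1.15)]
C_EFF = 1.0   # effective switched capacitance in nF; nF * V^2 * GHz gives watts

def dynamic_watts(freq_ghz, volts):
    """Dynamic power for one P-state, using the C * V^2 * f proportionality."""
    return C_EFF * volts**2 * freq_ghz

def pick_p_state(power_budget_w):
    """Return the fastest P-state whose dynamic power fits the budget, else the slowest."""
    fitting = [p for p in P_STATES if dynamic_watts(*p) <= power_budget_w]
    return max(fitting, default=P_STATES[0])

print(pick_p_state(6.0))   # (4.0, 1.15): plenty of headroom, run at the top state
print(pick_p_state(3.5))   # (3.0, 1.0): the 4 GHz state would need about 5.3 W
print(pick_p_state(0.1))   # (1.0, 0.7): nothing fits, fall back to the slowest state
```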

08.Verification, Validation, and Performance Counters

A processor design is correct only if it implements its ISA exactly, in every architectural state. Demonstrating that for a chip with billions of transistors and thousands of corner cases is a discipline of its own.

Functional verification runs the RTL against millions of test programs — random instruction streams, hand-written corner cases, the test suites that come with the ISA, and stress tests targeted at known-difficult interactions. Coverage tools track which lines of RTL, which state-machine transitions, and which logical conditions have been exercised, and the team continues until coverage is acceptably complete. A typical commercial CPU project spends as many engineer-years on verification as on design.
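The random-stream part of that flow is essentially differential testing: run the same program on a trusted reference model and on the design, then compare architectural state. The sketch below uses a four-operation toy ISA and a Python stand-in for the RTL with an optional injected bug; every name in it is hypothetical.

```python
import random

MASK = 2**64 - 1
OPS = {"add": lambda a, b: (a + b) & MASK,
       "sub": lambda a, b: (a - b) & MASK,
       "and": lambda a, b: a & b,
       "or":  lambda a, b: a | b}

def reference_model(program, regs):
    """Golden model: executes instructions one at a time, in order."""
    regs = list(regs)
    for op, rd, rs1, rs2 in program:
        regs[rd] = OPS[op](regs[rs1], regs[rs2])
    return regs

def design_under_test(program, regs, buggy=False):
    """Stand-in for the RTL; `buggy=True` injects a swapped-operand bug in sub."""
    regs = list(regs)
    for op, rd, rs1, rs2 in program:
        if buggy and op == "sub":
            rs1, rs2 = rs2, rs1
        regs[rd] = OPS[op](regs[rs1], regs[rs2])
    return regs

def random_program(rng, length, n_regs=4):
    return [(rng.choice(sorted(OPS)), rng.randrange(n_regs),
             rng.randrange(n_regs), rng.randrange(n_regs)) for _ in range(length)]

def first_mismatch(seed, n_tests=200, buggy=False):
    """Index of the first random test whose final state differs, or None."""
    rng = random.Random(seed)
    for test in range(n_tests):
        prog = random_program(rng, length=20)
        init = [rng.randrange(MASK + 1) for _ in range(4)]
        if reference_model(prog, init) != design_under_test(prog, init, buggy):
            return test
    return None

assert first_mismatch(seed=0) is None   # the bug-free stand-in always agrees
```

Calling `first_mismatch(seed=0, buggy=True)` exercises the injected bug; with random 64-bit operands a swapped subtraction is very likely to surface within a few tests, which is the appeal of random streams for shallow bugs.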

Formal verification complements simulation by using mathematical methods to prove properties of the design. Equivalence checking shows that the synthesized netlist matches the RTL. Property checking shows that, for every possible input sequence, certain invariants hold (for instance, that the architectural register file always reflects the most recently retired write). Formal methods catch bugs that random testing might miss, but they scale poorly with design size: individual blocks such as parts of the front end, the rename logic, or the cache controller can be verified formally, but the whole core generally cannot.

Post-silicon validation runs real chips against real workloads, looking for bugs that escaped pre-silicon verification. Errata are inevitable; the documentation of every commercial CPU includes an errata sheet listing the conditions under which the chip departs from its specification, and how (or whether) software should work around them. On processors that have one, the microcode mechanism we will meet in Chapter 27 is a primary vehicle for delivering post-silicon fixes.

Finally, every modern processor includes performance counters — a small set of registers that count micro-architectural events: cache misses, branch mispredictions, retired instructions, cycles. The counters are exposed to software through architectural mechanisms (rdpmc on x86, the PMU registers on AArch64, the hpmcounter CSRs on RISC-V), allowing profilers to attribute performance problems to specific events. Performance counters are micro-architectural in spirit — they expose the implementation — but they are part of the architecture as far as software can see them. We will discuss what to do with the counters in Chapter 54.
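Raw counter values mean little on their own; profilers usually take two snapshots and turn the differences into rates. The sketch below assumes snapshots are plain dictionaries with hypothetical event names; it does not read any real counter interface.

```python
def derived_metrics(before, after):
    """Turn two counter snapshots into rates. Event names are hypothetical."""
    d = {k: after[k] - before[k] for k in before}
    return {
        "ipc": d["instructions"] / d["cycles"],
        "cpi": d["cycles"] / d["instructions"],
        "branch_mpki": 1000 * d["branch_misses"] / d["instructions"],   # misses per 1000 instructions
        "cache_mpki": 1000 * d["cache_misses"] / d["instructions"],
    }

start = {"cycles": 1_000_000, "instructions": 500_000, "branch_misses": 1_000, "cache_misses": 2_000}
end   = {"cycles": 3_000_000, "instructions": 3_500_000, "branch_misses": 7_000, "cache_misses": 17_000}
print(derived_metrics(start, end))
```

Over this invented interval the core retired 3,000,000 instructions in 2,000,000 cycles, an IPC of 1.5, with 2 branch mispredictions and 5 cache misses per thousand instructions.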

09.What Is Not Architectural

Knowing what is not part of the architecture sharpens the picture. Here is a partial list of things that vary between implementations of the same architecture:

The number of pipeline stages.
The number of execution units.
Whether instructions issue in order or out of order.
The presence or absence of speculation.
The size, organization, and number of cache levels.
The size of the TLB.
The branch predictor's design and capacity.
The internal physical register file size.
The number of cores on a chip.
The interconnect between cores.
The clock frequency.

None of these affects program correctness. All of them affect performance, often dramatically. A program tuned for one implementation may run badly on another with the same ISA but a different micro-architecture. This is why performance-critical software is often tuned per implementation generation: the same Linux kernel runs on every x86-64 chip, but its scheduler and memory-allocation paths are tuned with knowledge of the cache sizes, prefetcher behavior, and branch-predictor characteristics of the chips it is deployed on.

A few things are architectural even though they sound like implementation details:

The set of architectural registers. The number of physical registers can change; the architectural register names and count are part of the contract.
The visible memory model. What ordering of memory operations is guaranteed by the hardware is an architectural property — programs depend on it.
The exception model. Which conditions raise faults, how those faults are reported, and what state the handler sees are architectural.
The instruction encodings. A program is bytes in memory; the meaning of those bytes is architectural.

The general principle is: anything software can observe through the ISA is architectural; anything that affects only timing or hidden state is micro-architectural. Side-channel attacks like Spectre have complicated this clean line by showing that the timing effects of nominally hidden state can leak information into observable channels, but the design intent of the distinction is sound.

10.A Look Ahead

The remaining chapters of Part V build up modern micro-architecture in layers.

Chapter 22 (Pipelining) introduces the technique of overlapping consecutive instructions. We will look at the classic five-stage RISC pipeline (fetch, decode, execute, memory, writeback), the hazards (data, control, structural) that pipelining introduces, and the solutions (forwarding, stalling, speculation) that resolve or mitigate them.

Chapter 23 (Branch Handling) focuses on the single biggest source of pipeline bubbles: control-flow instructions. We will look at branch prediction (static and dynamic), branch-target prediction, return-address stacks, and the pipeline machinery for recovering from mispredictions.

Chapter 24 (Superscalar Execution) widens the pipeline so that multiple instructions can occupy each stage. We will look at the challenges (duplicating functional units, choosing which instructions issue together, handling data hazards across parallel paths) and how modern wide processors solve them.

Chapter 25 (Out-of-Order Execution) decouples the order in which instructions execute from the order in which they appear in the program. We will look at register renaming, the reorder buffer, the issue queue, and how hardware extracts the parallelism that compilers cannot expose.

Chapter 26 (Load/Store Micro-Architecture) zooms in on memory operations, which are the hardest part of out-of-order execution because they have to respect the architecture's memory model while still benefiting from speculation and re-ordering.

Chapter 27 (Decode and Microcode) revisits the front of the pipeline, particularly for variable-length and CISC ISAs. We will look at how x86 instructions get broken into internal µops, how microcode handles complex operations, and how the front end has become one of the most intricate parts of modern processors.

By the end of Part V, the path from the ISA-level abstractions of Part III to a working modern processor will be complete. The same techniques, applied at varying scale and with varying tradeoffs, are what every modern CPU uses, from the smallest mobile cores to the largest server chips.

11.Summary

Architecture is the contract presented to software; micro-architecture is the implementation that honors it. Two implementations of the same architecture can have wildly different internal designs and still run the same programs identically. The decoupling is what makes computer architecture a discipline distinct from chip design and from software engineering.

Modern processors are built at the register-transfer level, described in hardware description languages, and turned into silicon by an elaborate synthesis-placement-routing-verification pipeline. Their performance is governed by the iron law: program time equals instructions times cycles-per-instruction times time-per-cycle. Each of those factors is an opportunity for micro-architectural innovation, traded against power and area budgets, and is constrained by the realities of clock distribution, clock and power gating, DVFS, and the vast verification effort needed to ship a correct chip. Performance counters give the software side a window into the micro-architectural events that determine performance.

The chapters of Part V develop the techniques that turn a clean ISA into a fast machine. Chapter 22 begins with the most fundamental of those techniques: pipelining.
