Part IV·Microarchitecture·Chapter 22 of 62

Part IVMicroarchitecture

Pipelining

May 16, 2026·36 min read·intermediate

A processor that handles each instruction from start to finish before starting the next leaves most of its hardware idle most of the time. While the ALU is computing, the fetch unit sits unused;…

A processor that handles each instruction from start to finish before starting the next leaves most of its hardware idle most of the time. While the ALU is computing, the fetch unit sits unused; while the fetch unit reads the next instruction, the ALU sits unused; while the writeback completes, both sit unused. Pipelining is the technique that turns this serial execution into an assembly line: each stage of the processor handles a different instruction every cycle, so that several instructions are in flight at once.

The technique is older than computers. Henry Ford's automobile assembly line is the canonical example: each station does one task on one car, the car moves to the next station, and a new car arrives at the first station. The throughput of the line — cars per hour — is determined by the slowest station, not by the total time to build a car. Pipelining a processor applies the same idea: divide instruction processing into a small number of stages, run all stages in parallel, and let throughput approach one instruction per cycle even though each individual instruction takes several cycles from start to finish.

This chapter develops pipelining from first principles. We start with the classic five-stage RISC pipeline that has been the standard pedagogical example for forty years, examine the hazards that arise when consecutive instructions interact, look at the techniques (forwarding, stalls, speculative resolution) that eliminate or hide those hazards, and discuss how real modern pipelines extend the classic shape.

01.The Classic Five-Stage Pipeline

The simplest useful pipeline divides instruction processing into five stages, traditionally called IF, ID, EX, MEM, and WB:

IF (Instruction Fetch). The processor reads the instruction at the address in the program counter from the instruction cache, and updates the PC to point to the next instruction.

ID (Instruction Decode). The fetched instruction is decoded — its mnemonic identified, its operands resolved — and the source registers are read from the register file.

EX (Execute). The ALU performs the arithmetic or logical operation specified by the instruction. For loads and stores, the address is computed here. For branches, the target address is computed and the condition is evaluated.

MEM (Memory Access). Loads read from the data cache; stores write to it. Other instructions pass through this stage doing nothing.

WB (Writeback). The result of the instruction (an ALU result, a loaded value) is written into the destination register.

Drawn as a stage-versus-time diagram, with each row a stage and each column a cycle, five consecutive instructions in the pipeline look like:

Figure: Five instructions advancing through the classic IF, ID, EX, MEM, WB stages, one stage per cycle, with each new instruction entering one cycle after the previous

LaTeX

Five stages, and after the initial fill, one instruction completes every cycle. The latency of any individual instruction is still 5 cycles (the time from its IF to its WB), but the throughput is one instruction per cycle.

The hardware implementing this pipeline has registers — pipeline registers — between every pair of stages. These registers carry the results of one stage forward as inputs to the next. Each cycle, every pipeline register captures the values produced by its stage's combinational logic during that cycle, and feeds them into the next stage's combinational logic in the next cycle.

A simplified picture of the pipeline registers:

Figure: Five-stage pipeline with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers separating fetch, decode, execute, memory, and writeback

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum height=0.7cm, align=center}]
  \node[blk, minimum width=1.6cm] (f)     at (1, 0)    {fetch};
  \node[blk, minimum width=1.4cm] (ifid)  at (2.7, 0)  {IF/ID};
  \node[blk, minimum width=1.6cm] (d)     at (4.3, 0)  {decode};
  \node[blk, minimum width=1.4cm] (idex)  at (6.0, 0)  {ID/EX};
  \node[blk, minimum width=1.4cm] (e)     at (7.5, 0)  {exec};
  \node[blk, minimum width=1.6cm] (exmem) at (9.2, 0)  {EX/MEM};
  \node[blk, minimum width=1.4cm] (m)     at (10.9, 0) {mem};
  \node[blk, minimum width=1.8cm] (memwb) at (12.7, 0) {MEM/WB};
  \node[blk, minimum width=2.0cm] (wb)    at (14.7, 0) {writeback};
  \draw[->] (f) -- (ifid);
  \draw[->] (ifid) -- (d);
  \draw[->] (d) -- (idex);
  \draw[->] (idex) -- (e);
  \draw[->] (e) -- (exmem);
  \draw[->] (exmem) -- (m);
  \draw[->] (m) -- (memwb);
  \draw[->] (memwb) -- (wb);
\end{tikzpicture}

The names IF/ID, ID/EX, EX/MEM, MEM/WB are the names of the pipeline registers between the corresponding stages.

The cycle time is set by the slowest stage. If decode takes 200 ps and the others take less, the cycle time is 200 ps. The total latency for one instruction is $5 \times 200 = 1000$ ps; the throughput is one instruction per 200 ps, or $5 \times 10^9$ instructions per second.

A non-pipelined version of the same logic, with no pipeline registers and no overlap, would have a cycle time equal to the sum of all stage delays — perhaps 800 ps — and a throughput of $1.25 \times 10^9$ instructions per second. Pipelining gives us nearly four times the throughput at the cost of slightly increased latency per instruction (the extra delay through the pipeline registers).

02.Pipeline Hazards

Pipelining works perfectly only when the instructions in flight are independent. In real programs, consecutive instructions interact: one writes a register that the next reads, a branch changes which instruction comes next, two instructions need the same hardware unit. Each interaction is a hazard that, naively, would force the pipeline to stall.

Three classes of hazards arise.

Data hazards occur when one instruction depends on the result of an earlier one that is still in the pipeline. The classic example: an add produces a result that an immediately following sub needs to read.

Control hazards occur when the next instruction's address depends on a branch that is still being resolved. Until the branch's outcome is known, the pipeline does not know which instruction to fetch.

Structural hazards occur when two instructions in flight need the same hardware resource at the same time. For instance, if there is only one cache port and a load wants to read it while an instruction fetch wants to read it, one of them has to wait.

The rest of this chapter looks at each in detail.

03.Data Hazards

Consider two consecutive instructions:

Assembly

add  x1, x2, x3      # x1 = x2 + x3
sub  x4, x1, x5      # x4 = x1 - x5

The sub reads x1, which the add writes. In the pipeline:

Figure: A RAW hazard in the pipeline: sub enters ID while add is still in EX, so sub reads x1 before add's WB has updated the register file

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5}; \node at (6, 0) {6};
  % add (start=1)
  \node[anchor=east] at (0.5, -1) {add:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % sub (start=2)
  \node[anchor=east] at (0.5, -2) {sub:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \draw[thick] (3.6, -2.3) rectangle (4.4, -1.7); \node at (4, -2) {EX};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {MEM};
  \draw[thick] (5.6, -2.3) rectangle (6.4, -1.7); \node at (6, -2) {WB};
\end{tikzpicture}

The add's result is available at the end of cycle 3 (after EX) and is written to the register file at the end of cycle 5 (after WB). The sub reads its source registers in cycle 3 (during ID). It reads the old value of x1, two cycles before the add updates it.

This is a read-after-write (RAW) hazard, also called a true dependency. There are two other dependency types worth naming:

Write-after-read (WAR), or anti-dependency: an instruction reads a register that a later instruction writes. In an in-order pipeline this is not a hazard (the read happens before the write), but it becomes one when instructions can complete out of order.
Write-after-write (WAW), or output dependency: two instructions write the same register. In an in-order pipeline the second write naturally overwrites the first; out of order, the order has to be preserved.

We focus on RAW hazards here; WAR and WAW reappear in Chapter 25.

The naive solution to a RAW hazard is to stall the dependent instruction. The decode stage notices that one of sub's source registers is the destination of an instruction still in the pipeline, and freezes the sub until the older instruction's WB is complete:

Figure: Stall-based RAW hazard resolution: sub holds in ID for three cycles until add's WB completes, then continues through EX, MEM, and WB

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  \node at (6, 0) {6}; \node at (7, 0) {7}; \node at (8, 0) {8}; \node at (9, 0) {9};
  % add (start=1, full)
  \node[anchor=east] at (0.5, -1) {add:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % sub: IF, ID at start=2,3; stalls at 4,5,6; EX,MEM,WB at 7,8,9
  \node[anchor=east] at (0.5, -2) {sub:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \node at (4, -2) {--};
  \node at (5, -2) {--};
  \node at (6, -2) {--};
  \draw[thick] (6.6, -2.3) rectangle (7.4, -1.7); \node at (7, -2) {EX};
  \draw[thick] (7.6, -2.3) rectangle (8.4, -1.7); \node at (8, -2) {MEM};
  \draw[thick] (8.6, -2.3) rectangle (9.4, -1.7); \node at (9, -2) {WB};
\end{tikzpicture}

Three cycles wasted. The pipeline emits "bubbles" (NOP-like nothings) into the stages following the stalled instruction. The processor's effective throughput drops well below one instruction per cycle.

Stalling is correct but expensive. The key insight that saves us: the value add produces is known at the end of EX (cycle 3), even though it is not written to the register file until the end of WB (cycle 5). If we can route the value directly from the EX/MEM pipeline register into the EX stage's input multiplexers in the next cycle, the sub can use it without stalling.

This technique is called forwarding (or bypassing).

04.Forwarding

A forwarding network adds extra wires from later pipeline registers back to the EX stage's inputs. When the decode stage notices that an instruction in EX or MEM will produce a value the next instruction needs, it sets up the EX stage's input multiplexers to take the value from the bypass wire instead of from the register file's read port.

With forwarding, the previous example runs without stalls:

Figure: Forwarding eliminates the RAW stall: the ALU result from add's EX is bypassed directly to sub's EX in the next cycle, so neither instruction stalls

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5}; \node at (6, 0) {6};
  % add (start=1)
  \node[anchor=east] at (0.5, -1) {add:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % sub (start=2)
  \node[anchor=east] at (0.5, -2) {sub:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \draw[thick] (3.6, -2.3) rectangle (4.4, -1.7); \node at (4, -2) {EX};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {MEM};
  \draw[thick] (5.6, -2.3) rectangle (6.4, -1.7); \node at (6, -2) {WB};
  \draw[->, thick, red] (3, -1.3) to[bend right=20] node[right] {forward} (4, -1.7);
\end{tikzpicture}

The add's ALU output in cycle 3 is forwarded directly to the sub's ALU input in cycle 4, bypassing the register file. Both instructions complete on schedule.

Forwarding paths are needed from several places:

EX/MEM → EX: the ALU result of the previous instruction.
MEM/WB → EX: the loaded value (or the still-traveling ALU result) of the instruction two ahead.
MEM/WB → MEM: the value being stored in a store instruction may need a value just produced.

Each forwarding path adds wires and multiplexers and complicates the timing analysis. A processor designer chooses the set of forwarding paths to handle the common cases; rarer dependencies fall back to stalling.

Load-Use Hazards

One important case forwarding cannot fully hide: a load followed immediately by an instruction that uses the loaded value.

Assembly

ld   x1, 0(x10)      # x1 = *x10
sub  x4, x1, x5      # uses x1

The load's value is not available until the end of MEM (cycle 4). The sub needs x1 at the start of EX (cycle 4 if it follows immediately). The value cannot be forwarded backward in time, only sideways at most.

Figure: Load-use hazard: sub reaches EX in cycle 4 needing the value the load produces in MEM in the same cycle, an impossible backward-in-time forwarding

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5}; \node at (6, 0) {6};
  \node[anchor=east] at (0.5, -1) {ld:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  \node[anchor=east] at (0.5, -2) {sub:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \draw[thick] (3.6, -2.3) rectangle (4.4, -1.7); \node at (4, -2) {EX?};
  \node[font=\scriptsize] at (4, -3) {needs value, but still in MEM};
\end{tikzpicture}

A one-cycle stall is unavoidable: the sub waits one cycle, then forwards from MEM/WB to EX:

Figure: Load-use hazard resolved with one stall plus forwarding: sub waits one cycle in ID, then receives the load's value forwarded from MEM/WB into EX

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5}; \node at (6, 0) {6}; \node at (7, 0) {7};
  \node[anchor=east] at (0.5, -1) {ld:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  \node[anchor=east] at (0.5, -2) {sub:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \node at (4, -2) {--};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {EX};
  \draw[thick] (5.6, -2.3) rectangle (6.4, -1.7); \node at (6, -2) {MEM};
  \draw[thick] (6.6, -2.3) rectangle (7.4, -1.7); \node at (7, -2) {WB};
  \draw[->, red] (4.4, -1.3) to[bend right=15] node[right, font=\scriptsize] {fwd MEM/WB} (5, -1.7);
\end{tikzpicture}

This is a load-use hazard and the bubble it inserts is the load-use delay. Compilers try to fill the slot with an unrelated instruction, scheduling the code so that the load's result is not needed immediately. If the program does not provide enough instruction-level parallelism for the compiler to hide the bubble, the slot stays empty.

In modern processors, where load latency is more than one cycle (because data caches take 3–5 cycles to deliver values), the load-use delay is correspondingly longer, and compilers and hardware work harder to hide it.

05.Control Hazards

Consider a branch:

Assembly

beq  x1, x2, label
add  x3, x4, x5    # next sequential instruction
...
label:
sub  x6, x7, x8

In the IF stage of cycle 1, the processor fetches the beq. The branch's outcome — taken or not taken — is computed in EX (cycle 3). Until that outcome is known, the processor does not know whether to fetch add or sub next.

In a naive pipeline that does nothing about this, the processor would have to stall:

Figure: Naive control-hazard stall: the next fetch cannot start until cycle 4, when the beq's EX has resolved the branch direction

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  \node[anchor=east] at (0.5, -1) {beq:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  \node[font=\scriptsize] at (3, -1.7) {branch resolved};
  \node[anchor=east] at (0.5, -3) {?:};
  \draw[thick] (3.6, -3.3) rectangle (4.4, -2.7); \node at (4, -3) {IF};
\end{tikzpicture}

Two cycles of stall on every branch. Since branches are roughly 15-20% of dynamic instructions, this would devastate throughput.

The solution is prediction: guess what the branch will do, fetch as if the guess is correct, and fix things up if the guess turns out wrong.

The simplest form of prediction is predict-not-taken: assume every branch falls through. The pipeline keeps fetching sequential instructions after every branch:

Figure: Predict-not-taken: the fetcher keeps issuing sequential instructions after the branch, speculating that the branch falls through

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  % beq (start=1, full)
  \node[anchor=east] at (0.5, -1) {beq:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % add (next) start=2, only stages 2..5 (cycles 2..5)
  \node[anchor=east] at (0.5, -2) {add (next):};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \draw[thick] (3.6, -2.3) rectangle (4.4, -1.7); \node at (4, -2) {EX};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {MEM};
  % add+1 start=3, stages cycles 3..5
  \node[anchor=east] at (0.5, -3) {add+1:};
  \draw[thick] (2.6, -3.3) rectangle (3.4, -2.7); \node at (3, -3) {IF};
  \draw[thick] (3.6, -3.3) rectangle (4.4, -2.7); \node at (4, -3) {ID};
  \draw[thick] (4.6, -3.3) rectangle (5.4, -2.7); \node at (5, -3) {EX};
\end{tikzpicture}

If the branch is not taken, the predictions were correct and the pipeline runs at full speed. If the branch is taken, the two instructions in IF and ID after the branch are wrong: they have to be squashed (turned into bubbles), and the correct target has to be fetched starting in cycle 4.

Figure: Mispredicted branch: the two speculatively fetched instructions are squashed in IF and ID, and the correct target sub begins fetching in cycle 4

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  \node at (6, 0) {6}; \node at (7, 0) {7}; \node at (8, 0) {8};
  \node[anchor=east] at (0.5, -1) {beq:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \node[font=\scriptsize] at (5.5, -1) {taken};
  \node[anchor=east] at (0.5, -2) {add (wrong):};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \node[red] at (3, -2) {X};
  \node[anchor=east] at (0.5, -3) {add+1 (wrong):};
  \draw[thick] (1.6, -3.3) rectangle (2.4, -2.7); \node at (2, -3) {IF};
  \node[red] at (3, -3) {X};
  \node[anchor=east] at (0.5, -4) {sub (right):};
  \draw[thick] (3.6, -4.3) rectangle (4.4, -3.7); \node at (4, -4) {IF};
  \draw[thick] (4.6, -4.3) rectangle (5.4, -3.7); \node at (5, -4) {ID};
  \draw[thick] (5.6, -4.3) rectangle (6.4, -3.7); \node at (6, -4) {EX};
\end{tikzpicture}

Two cycles wasted on a misprediction, but zero cycles wasted on a correct prediction. If most branches are not taken, this is a big improvement over a blanket stall.

Real branch prediction is far more sophisticated than predict-not-taken. We will spend the whole of Chapter 23 on dynamic branch predictors that achieve over 95% accuracy on typical code. But the basic shape — predict, fetch speculatively, recover on misprediction — is what every modern pipeline does.

Branch Resolution Latency

The number of bubbles on a misprediction is the branch misprediction penalty. In the classic 5-stage pipeline, with branches resolved in EX, the penalty is two cycles. Deeper pipelines (modern processors have 14 to 20 stages) have correspondingly larger penalties: 10–15 cycles is typical.

This is a major part of why deeper pipelines are not always better. A deeper pipeline allows higher clock frequency but pays more on every misprediction. The tradeoff is tightly coupled to the branch predictor's accuracy: a great predictor makes deep pipelines worthwhile; a poor predictor punishes them.

06.Structural Hazards

A structural hazard arises when two instructions need the same hardware unit at the same time. The classic example: a unified cache used for both instruction fetches and data accesses. In the 5-stage pipeline, IF needs the cache every cycle to fetch instructions, and MEM needs the cache when there is a load or store. If the cache has only one port, IF and MEM can collide:

Figure: Structural hazard on a single-port unified cache: load's MEM collides with a third instruction's IF in the same cycle

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round, x=0.85cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  \node[anchor=east] at (0.5, -1) {ld:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % add (next) start=2, full
  \node[anchor=east] at (0.5, -2) {add:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \draw[thick] (3.6, -2.3) rectangle (4.4, -1.7); \node at (4, -2) {EX};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {MEM};
  \node[anchor=east] at (0.5, -3) {fetch:};
  \draw[thick] (2.6, -3.3) rectangle (3.4, -2.7); \node at (3, -3) {IF};
  \node[red, font=\scriptsize] at (4.3, -3) {contention};
\end{tikzpicture}

The structural hazard forces one of them to stall.

The standard solution is to provide separate instruction and data caches — the Harvard architecture — so that IF and MEM never contend. Modern processors universally do this at the level-1 cache: a separate L1 instruction cache and L1 data cache, each with its own port. They share the unified L2 cache further down the hierarchy, but contention there is rarer and more easily handled.

Other structural hazards are resolved similarly: by duplicating the resource. A processor with one ALU has structural hazards on every back-to-back ALU instruction; a processor with two does not. A processor with one register-file write port stalls when two instructions want to write back simultaneously; a processor with two ports does not.

07.Putting It Together: A Hazard Example

A small code fragment that exercises all three hazard types. Assume RISC-V, 5-stage pipeline, predict-not-taken, full forwarding except across loads.

Assembly

ld    x1, 0(x10)        # load
add   x2, x1, x3        # depends on x1 (load-use hazard)
beq   x2, x4, target    # depends on x2 (forwarded), and is a branch
add   x5, x6, x7        # speculative under predict-not-taken
sub   x8, x9, x11
target:
add   x12, x13, x14

The pipeline behavior, assuming the branch is taken (so the predict-not-taken speculation is wrong):

Figure: Combined load-use stall and branch mispredict: add x2 stalls one cycle on the load, beq stalls one cycle, then two speculatively fetched instructions are squashed

LaTeX

\begin{tikzpicture}[font=\scriptsize, line cap=round, x=0.75cm, y=0.55cm]
  \node[anchor=east] at (0.5, 0) {Cycle:};
  \node at (1, 0) {1}; \node at (2, 0) {2}; \node at (3, 0) {3}; \node at (4, 0) {4}; \node at (5, 0) {5};
  \node at (6, 0) {6}; \node at (7, 0) {7}; \node at (8, 0) {8}; \node at (9, 0) {9}; \node at (10, 0) {10}; \node at (11, 0) {11};
  % ld x1: start=1, full
  \node[anchor=east] at (0.5, -1) {ld x1:};
  \draw[thick] (0.6, -1.3) rectangle (1.4, -0.7); \node at (1, -1) {IF};
  \draw[thick] (1.6, -1.3) rectangle (2.4, -0.7); \node at (2, -1) {ID};
  \draw[thick] (2.6, -1.3) rectangle (3.4, -0.7); \node at (3, -1) {EX};
  \draw[thick] (3.6, -1.3) rectangle (4.4, -0.7); \node at (4, -1) {MEM};
  \draw[thick] (4.6, -1.3) rectangle (5.4, -0.7); \node at (5, -1) {WB};
  % add x2: IF,ID at 2,3; stall at 4; EX,MEM,WB at 5,6,7
  \node[anchor=east] at (0.5, -2) {add x2:};
  \draw[thick] (1.6, -2.3) rectangle (2.4, -1.7); \node at (2, -2) {IF};
  \draw[thick] (2.6, -2.3) rectangle (3.4, -1.7); \node at (3, -2) {ID};
  \node at (4, -2) {--};
  \draw[thick] (4.6, -2.3) rectangle (5.4, -1.7); \node at (5, -2) {EX};
  \draw[thick] (5.6, -2.3) rectangle (6.4, -1.7); \node at (6, -2) {MEM};
  \draw[thick] (6.6, -2.3) rectangle (7.4, -1.7); \node at (7, -2) {WB};
  % beq: IF at 3, stall at 4, ID,EX,MEM,WB at 5,6,7,8
  \node[anchor=east] at (0.5, -3) {beq:};
  \draw[thick] (2.6, -3.3) rectangle (3.4, -2.7); \node at (3, -3) {IF};
  \node at (4, -3) {--};
  \draw[thick] (4.6, -3.3) rectangle (5.4, -2.7); \node at (5, -3) {ID};
  \draw[thick] (5.6, -3.3) rectangle (6.4, -2.7); \node at (6, -3) {EX};
  \draw[thick] (6.6, -3.3) rectangle (7.4, -2.7); \node at (7, -3) {MEM};
  \draw[thick] (7.6, -3.3) rectangle (8.4, -2.7); \node at (8, -3) {WB};
  % add x5 (spec): IF,ID at 5,6, then squashed
  \node[anchor=east] at (0.5, -4) {add x5 (spec):};
  \draw[thick] (4.6, -4.3) rectangle (5.4, -3.7); \node at (5, -4) {IF};
  \draw[thick] (5.6, -4.3) rectangle (6.4, -3.7); \node at (6, -4) {ID};
  \node[red] at (7, -4) {X};
  % sub x8 (spec): IF at 6, squashed
  \node[anchor=east] at (0.5, -5) {sub x8 (spec):};
  \draw[thick] (5.6, -5.3) rectangle (6.4, -4.7); \node at (6, -5) {IF};
  \node[red] at (7, -5) {X};
  % add x12: start=7, full
  \node[anchor=east] at (0.5, -6) {add x12:};
  \draw[thick] (6.6, -6.3) rectangle (7.4, -5.7); \node at (7, -6) {IF};
  \draw[thick] (7.6, -6.3) rectangle (8.4, -5.7); \node at (8, -6) {ID};
  \draw[thick] (8.6, -6.3) rectangle (9.4, -5.7); \node at (9, -6) {EX};
  \draw[thick] (9.6, -6.3) rectangle (10.4, -5.7); \node at (10, -6) {MEM};
  \draw[thick] (10.6, -6.3) rectangle (11.4, -5.7); \node at (11, -6) {WB};
\end{tikzpicture}

The combination of a load-use stall and a branch misprediction wastes three cycles total. Real code is full of such interactions; the job of the micro-architecture is to minimize the total cost.

08.Pipelining the Datapath in More Detail

The datapath that supports pipelining has to be modified from the unpipelined version in several ways.

Pipeline registers between stages. Every signal that leaves a stage and is needed by a later stage must be latched in a pipeline register. The decoded instruction, the source register values, the immediates, the control signals — all of them flow through the pipeline registers. The total state stored in pipeline registers in a modern wide pipeline is substantial.

A single PC update path. Each instruction's IF reads from the current PC and updates it; the next instruction's IF reads the updated PC. Branches that change the PC introduce control hazards because the update is delayed.

Multiple ports on the register file. ID reads two source registers per cycle; WB writes one destination per cycle. A 2-read-1-write register file is the standard for a 1-instruction-per-cycle pipeline. Wider pipelines (next chapter) need more ports, which are expensive in area and power.

Forwarding multiplexers. Each ALU input has additional sources beyond the register file: the EX/MEM forward, the MEM/WB forward. The forward sources are selected by control logic that compares register numbers across pipeline stages.

Hazard detection logic. A small piece of combinational logic looks at the instructions in the various pipeline stages and detects when stalls are needed. When it detects a hazard, it freezes earlier stages (holds their pipeline registers) and inserts NOPs into later ones.

The combinational logic for hazard detection and forwarding is small relative to the rest of the processor, but it is on the critical path for control. Designers spend considerable effort minimizing its complexity.

09.Beyond Five Stages

Real modern pipelines have many more than five stages. Some examples:

Intel Pentium 4 (Netburst, ~2000): up to 31 stages, an extreme case driven by the desire for high clock frequency.
Intel Core (current generation): approximately 14–19 stages, depending on how one counts.
ARM Cortex-A78: roughly 13 stages.
Apple M-series: approximately 8–10 stages, optimized for low power and high IPC at moderate frequency.

Why so many?

Higher clock frequency. Each stage does less work, so each stage's combinational delay is shorter, so the cycle can be shorter. Pentium 4 traded depth for frequency aggressively.

Specialized stages. Modern processors split fetch into multiple stages (one to generate the address, one to access the cache, one or two to handle line crossings); decode into multiple stages (one for variable-length boundary detection on x86, one for opcode decode, one for µop fusion); execute into multiple stages for complex operations (multipliers take 3–5 cycles; FP add takes 3–4); memory access into separate address-generation and cache-access stages.

Multiple cache ways and TLB lookups. Cache access takes more than one cycle in a deep pipeline; the TLB lookup is interleaved with the cache lookup; tag comparison happens in a later cycle than the data read.

The cost of depth is the misprediction penalty. A 20-stage pipeline with branches resolving in stage 12 has 11 cycles of waste per misprediction. To compensate, modern processors invest heavily in branch prediction (>95% accuracy), branch-target prediction, return-address prediction, and recovery techniques that minimize the cost of the inevitable mispredictions.

10.Variable-Latency Operations

The five-stage pipeline assumes every operation finishes its EX stage in one cycle. Most simple ALU operations — add, sub, and, or, shift — do, but several common operations do not.

Integer multiplication takes 3 to 5 cycles in a typical pipelined multiplier. Integer division is much worse — typically 10 to 40 cycles, often not pipelined, so successive divides cannot be issued back-to-back. Floating-point add and multiply take 3 to 4 cycles each; FP division takes longer still; FP square root is similar to division. SIMD operations on 256- or 512-bit vectors may take longer than the equivalent scalar operation simply because more lanes are involved.

The pipeline copes with variable latency by giving each functional unit its own length. The integer ALU has a one-cycle pipeline; the multiplier has a three-cycle pipeline; the divider has a multi-cycle non-pipelined pipeline. An instruction stays in its functional unit for the appropriate number of cycles before its result is available for forwarding.

The scheduling logic has to track when each result becomes ready. In a simple in-order pipeline, the issue logic checks each instruction's source operands and stalls if any of them are still in flight; the latency table used to do this is part of every pipelined design. In an out-of-order pipeline (Chapter 25), the same information is encoded in the issue queue, where each instruction waits for its operands to wake up.

A related issue is write-port contention on the register file: a one-cycle integer add issued in cycle N produces its result in cycle N+1, while a three-cycle multiply issued in cycle N-2 also produces its result in cycle N+1. Two results are ready to write back to the register file in the same cycle, but the file may have only one write port for that data type. Designs handle this either by adding write ports (expensive) or by stalling the longer-latency operation by a cycle until the conflict clears.

Variable-latency operations are unusual in two specific ways. Loads have variable latency depending on whether they hit in the cache, and that latency is not knowable when the load is issued; the schedule has to assume a hit and recover when a miss extends the latency. Divides and square roots are sometimes implemented as iterative state machines that take a different number of cycles depending on the values of the operands, which makes scheduling around them awkward. We will see how out-of-order pipelines handle both in later chapters.

11.Pipeline Interlocks and Software-Visible Latency

The forwarding logic and stall machinery we have described enforce data dependencies in hardware: regardless of how the compiler schedules instructions, the pipeline keeps them in order at the register-file level. A dependent instruction is stalled until its source is ready, even if the compiler did not anticipate the stall.

A different choice, made by some early RISC architectures, is to expose certain pipeline timing details to software and let the compiler schedule around them. The most famous example is the delay slot: in classic MIPS, an instruction immediately after a branch is always executed, regardless of whether the branch is taken. The architectural justification was that the branch's target is not known until the EX stage, by which point the next instruction has already been fetched; rather than discard it, MIPS made the discarded instruction architecturally visible. The compiler's job was to fill the slot with a useful instruction (often by hoisting an instruction from before the branch).

MIPS's load delay slot worked similarly: the instruction immediately after a load could not use the loaded value, because the load's MEM stage had not yet produced it. The compiler had to insert an unrelated instruction or a nop.

These decisions made the early MIPS pipelines simpler — no stall logic required — but they hard-wired five-stage-pipeline timing into the architecture. When MIPS implementations grew deeper pipelines, the delay slots became awkward; when they grew superscalar, the slots became actively harmful. Modern MIPS, modern SPARC, and every other architecture introduced since around 1990 (AArch64, RISC-V) have abandoned the technique. The hardware now contains pipeline interlocks that stall on data hazards transparently to software, and the architecture says nothing about pipeline depth.

A closely related architectural feature is the load-use latency that some manuals still document, even though hardware enforces it: the compiler is encouraged, but no longer required, to separate a load from its use by one or more instructions. On modern out-of-order machines, the recommendation is largely moot; the hardware finds the parallelism on its own. On simple in-order embedded cores (Cortex-M, small RISC-V), it still pays off.

The larger lesson is that hardware-enforced interlocks have won, and modern ISAs are pipelined-implementation-agnostic by design. Pipeline depth, forwarding paths, and stall behaviour are micro-architectural choices that software does not need to know about for correctness; performance-tuned software may know about them, but only as advisory information.

12.VLIW and Statically-Scheduled Pipelines

A different architectural choice is to push all of the scheduling work into the compiler. VLIW (Very Long Instruction Word) architectures bundle several operations into a single, long instruction; the bundle says, for example, "in this cycle, do an ALU add on these operands, a memory load on these, and a multiply on these, all in parallel." The hardware does not analyze dependencies between operations within a bundle or between bundles; the compiler is expected to have already arranged the program so that no hazards occur.

The motivation is to remove the issue logic from the hardware. A VLIW processor has no rename, no issue queue, no dependency-checking logic to consume area, power, and verification effort. What remains is a wide collection of functional units that read and write a register file, with an instruction format that names which units do what each cycle.

The most prominent commercial VLIW was Intel's Itanium (IA-64), which used a 128-bit bundle containing three 41-bit operations and template bits that described their parallelism. Itanium also exposed the pipeline through predication (every operation can be turned off conditionally without a branch) and speculative loads (loads that may be issued speculatively, with deferred faulting). The architecture was elegant but commercially unsuccessful: compilers struggled to extract enough static parallelism from real workloads, and the surrounding ecosystem of binary compatibility and dynamic optimization that x86 enjoyed proved decisive.

VLIW lives on in DSPs and embedded processors. The Texas Instruments C6x, Qualcomm Hexagon, and many GPU-shader cores use VLIW or VLIW-like designs because their workloads are regular enough that compilers can schedule them well. GPUs also use a related technique called SIMT that we will discuss in Chapter 56. The point for our purposes is that VLIW is a real and viable point in the design space, and the choice between dynamic out-of-order scheduling and static VLIW scheduling is a major architectural decision.

The modern verdict is that for general-purpose code with unpredictable control flow, dynamic scheduling wins; for regular computational kernels with statically analyzable dependencies, VLIW or hybrid designs can be competitive at lower power. The deepest reason is latency uncertainty: a load that may or may not hit in the cache cannot be statically scheduled well; an out-of-order pipeline can wait for it on a per-instruction basis, while a VLIW pipeline either has to stall the entire bundle or speculate.

13.Pipelining and the Iron Law

The iron law from Chapter 21:

$\text{time per program} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}.$

Pipelining mainly attacks the third factor: cycle time gets shorter, so the time per cycle drops. It does not improve the ideal CPI (which stays at 1 for a single-instruction-per-cycle pipeline) but it does not make CPI worse either, except when hazards force stalls.

A simple model of pipelined CPI:

$\text{CPI} = 1 + \text{stall cycles per instruction}.$

Stall cycles come from load-use hazards, branch mispredictions, and structural hazards. A well-designed pipeline keeps the stall fraction low — perhaps 0.2 to 0.5 — for a typical program. The product of cycle time and CPI for a pipelined processor is far smaller than for an unpipelined one running at the same effective work, even after accounting for stalls.

14.Summary

Pipelining overlaps the execution of consecutive instructions by dividing instruction processing into stages, each handling a different instruction every cycle. The classic five-stage RISC pipeline (IF, ID, EX, MEM, WB) is the standard pedagogical example; real modern pipelines have many more stages but follow the same principle.

Pipelining introduces hazards: data hazards (RAW dependencies between consecutive instructions), control hazards (branches that change which instruction comes next), and structural hazards (resource conflicts). Forwarding eliminates most data hazards by routing results directly from later stages back to the ALU inputs. Branch prediction handles control hazards by guessing the next PC and recovering on mispredicts. Structural hazards are eliminated by duplicating resources (separate I-cache and D-cache, multiple ALUs). Variable-latency operations — multiplies, divides, floating-point, loads — require per-unit pipelines and write-port arbitration. Modern hardware enforces dependencies through pipeline interlocks; older architectures exposed delay slots and load-use latency to software, a choice that has not aged well. VLIW pushes scheduling into the compiler entirely, a viable choice for regular workloads but commercially unsuccessful for general-purpose code.

A pipeline cannot do better than one instruction per cycle on its own. To go further, the processor needs to issue multiple instructions per cycle and execute them in parallel. Branch prediction sits at the front of every fast pipeline; the next chapter develops it in detail.

Book mode

	beq x1, x2, label
	add x3, x4, x5 # next sequential instruction
	...
	label:
	sub x6, x7, x8

	ld x1, 0(x10) # load
	add x2, x1, x3 # depends on x1 (load-use hazard)
	beq x2, x4, target # depends on x2 (forwarded), and is a branch
	add x5, x6, x7 # speculative under predict-not-taken
	sub x8, x9, x11
	target:
	add x12, x13, x14