Part IV·Microarchitecture·Chapter 23 of 62

Part IVMicroarchitecture

Branch Handling

May 16, 2026·23 min read·intermediate

Branches are the most expensive instructions in a modern processor. Not because they do a lot of work — they do almost none, just compare two values and update a register — but because they determine which instruction comes next. Until a branch is resolved, the front of the pipeline cannot reliably fetch instructions. In a 20-stage pipeline, a branch resolved in stage 12 leaves 11 stages of work potentially wasted on every misprediction. Even at 95% prediction accuracy, the residual 5% of mispredictions can dominate runtime in branch-heavy code.

This chapter is about how modern processors handle branches well enough to keep deep, wide pipelines fed. It is one of the most studied areas of computer architecture, and the techniques have grown remarkably sophisticated. We will start with the simplest static schemes, work through the dynamic predictors that have evolved over forty years, examine the related problems of branch-target prediction and return-address handling, and end with the recovery machinery that activates when prediction nevertheless fails.

01.Why Branches Are Hard

The front end of a pipeline fetches instructions sequentially. Most of the time this is fine: the program counter increments by the instruction width each cycle, and the next instruction is wherever the PC ends up. Branches break this pattern. A taken branch sends the PC somewhere other than the next sequential address, and the next sequential bytes — which the front end has already started fetching — are useless.

The challenge is that the front end is many cycles ahead of the back end. By the time a branch's outcome is computed (in EX, somewhere in the middle of the pipeline), the front end has already fetched, decoded, and possibly even executed several younger instructions speculatively. If the prediction was wrong, all of that work has to be discarded.

Two questions are intertwined.

Direction prediction. Will this conditional branch be taken or not?

Target prediction. If the branch is taken (or if it is an unconditional indirect branch), where does it go?

For most direct conditional branches, the target is encoded in the instruction itself, and once the branch is identified, computing the target is trivial. Direction prediction is the hard part. For indirect branches and returns, the target is unknown until the branch executes; target prediction is the hard part. We treat them in turn.

02.Static Prediction

The simplest predictors guess based on properties of the branch alone, without any runtime information. They cost little and give a baseline that more sophisticated schemes must beat.

Predict not taken. Always assume the branch falls through. Cheap and easy: the front end just keeps fetching sequentially. Works well for branches that are usually not taken — if statements that handle exceptional cases, error-checking branches, the early-exit conditions of loops on the last iteration. About 50% accurate on average.

Predict taken. Always assume the branch is taken. Slightly trickier because the front end must compute the target, but for backward branches (which are usually loops returning to the top) the target is encoded directly in the instruction. About 50% accurate on average.

Backward taken, forward not taken (BTFNT). A small but useful refinement: backward branches are usually loops and are usually taken; forward branches are usually conditionals and are usually not. Predict accordingly. Around 65% accurate on typical code.

Profile-guided. Run the program once with instrumentation, record each branch's average direction, and recompile with hints baked into the binary. Some ISAs (PowerPC, Itanium) had explicit prediction-hint bits in branch instructions. With good profile data this can reach 85% accuracy.

Static prediction was a serious engineering choice in the 1980s, when transistor budgets were small. Modern processors have abandoned it as the primary scheme, because dynamic predictors do far better. But static prediction still appears as a fallback when the dynamic predictor has no information about a particular branch (a cold branch, on its first execution).

03.Dynamic Prediction: One-Bit Predictor

Dynamic prediction tracks the runtime behavior of each branch and uses the history to predict future outcomes. The simplest dynamic scheme is a one-bit predictor: for each branch, remember whether it was taken last time, and predict the same thing next time.

The hardware is a small table — the branch history table (BHT) — indexed by the low bits of the branch's PC. Each entry holds one bit: 1 for taken, 0 for not taken. On every branch, the front end indexes the BHT, predicts according to the bit, and after the branch resolves, updates the bit with the actual outcome.

This works astonishingly well for many branches. A loop that iterates many times and falls through once is correctly predicted on every iteration except the last and the one immediately after. A branch that goes the same way in every dynamic instance is correctly predicted always.

The weakness is anti-correlation. Consider a loop that runs N times each time it is entered:

for (i = 0; i < N; i++) { ... }

The loop's branch is taken N-1 times and not taken once per entry. With a one-bit predictor, the last iteration is mispredicted (the predictor says taken; the iteration falls through), and so is the first iteration of the next entry (the predictor says not taken because last time it fell through; the new entry takes the branch). Two mispredictions per entry, regardless of how long the loop is.

For short loops with many entries this is catastrophic. The fix is the two-bit predictor.

04.Two-Bit Saturating Counter

A two-bit predictor is the textbook elaboration. Each entry in the BHT holds a 2-bit saturating counter with four states: strongly not taken (00), weakly not taken (01), weakly taken (10), strongly taken (11).

The counter is incremented on a taken branch (capping at 11) and decremented on a not-taken branch (capping at 00). The prediction is the counter's high bit: predict taken if the counter is 10 or 11, not taken if 00 or 01.

The key property: a branch must be wrong twice in a row to flip the prediction. A loop that ran 99 taken iterations followed by one not-taken does not switch its prediction on that single not-taken; it only weakens the taken prediction by one step. The next iteration, when the loop is re-entered and the branch is taken again, restores the strong-taken state. The loop's last-iteration misprediction is thus the only misprediction per loop entry — half of the one-bit predictor's cost.

The two-bit predictor is the foundation of every dynamic branch predictor since. Two-bit saturating counters appear inside more elaborate predictors as the basic prediction primitive.

A typical two-bit BHT might have 1024 to 16384 entries, indexed by the low bits of the PC. Two branches with the same low PC bits collide in the table — aliasing — but in practice, with enough entries, aliasing is uncommon.

05.Two-Level Adaptive Predictors

Some branches are not predictable by their own PC alone: their behavior depends on what other branches did recently. The classic example:

if (x > 0) { ... }
if (x < 5) { ... }

The two ifs are correlated: if the first is taken (x > 0), the second is more likely to be taken (x < 5 is true if 0 < x < 5, which is most of the time). A predictor that only sees the second branch's PC misses this correlation; one that also sees the history of recent branches can exploit it.

Yeh and Patt's 1991 two-level adaptive predictors introduced this idea. The general structure has two tables.

Branch History Register (BHR). A shift register of recent branch outcomes. On every branch, the BHR shifts in the new outcome bit. After many branches, the BHR contains a fingerprint of the recent branch history — for example, the last 12 branch outcomes.

Pattern History Table (PHT). A table indexed by the BHR (sometimes combined with the PC). Each entry holds a 2-bit saturating counter that predicts what comes next given the recent history pattern.

The prediction process is: read the BHR, use it (perhaps XORed with the PC) to index the PHT, read the 2-bit counter, predict according to its high bit. After the branch resolves, update the indexed PHT entry's counter and shift the outcome into the BHR.

This scheme captures correlation: branches whose behavior depends on the history pattern get accurate predictions. The cost is more state (BHR + PHT) and more complex update logic. Accuracy on typical code is 90-95%, well above the simple two-bit predictor.

There are several variants, distinguished by how branch-specific the BHR is and how the indexing combines history and PC.

GAg (Global history, global PHT). One BHR shared by all branches; one PHT indexed by the BHR alone.

GAp (Global history, per-PC PHTs). One BHR; many PHTs, one per branch PC. The PC selects the PHT, the BHR indexes within it.

PAg (Per-branch history, global PHT). A BHR per branch (kept in a separate per-branch history table); one shared PHT.

Gshare. A particularly successful design: the BHR is XORed with the PC to index a single PHT. The XOR distributes branches across the PHT well, reducing aliasing without requiring per-branch tables. Gshare with a 12-16 bit BHR was the standard predictor for years.

06.Tournament Predictors

Different branches benefit from different predictors. Some are best predicted by their own PC's two-bit counter (regular branches that go the same way each time). Others need the full history (correlated branches). A tournament predictor maintains both, plus a chooser that learns which predictor is more accurate for each branch.

A typical tournament structure:

A local predictor that uses the branch's own PC and a per-PC history.
A global predictor like gshare that uses the global branch history.
A chooser table (also indexed by PC and BHR) of 2-bit counters that learns whether the local or global predictor is more accurate for the branch in question.

On every branch, both predictors make a prediction; the chooser decides which to believe. After resolution, the chooser updates: if the local predictor was right and the global was wrong, the chooser shifts toward "local"; the reverse the other way. If both were right or both were wrong, the chooser does not change.

Tournament predictors were used in the Alpha 21264 (1998) and many subsequent designs. They reach about 95% accuracy on typical code.

07.TAGE and Modern Predictors

The current state of the art is TAGE (TAgged GEometric history length predictor), developed by André Seznec around 2006. TAGE is the basis for the predictors in modern Intel, AMD, and Arm cores, with proprietary variations and refinements.

The intuition behind TAGE: different branches need different amounts of history. Some branches are predicted accurately with just 4 bits of history; others need 50 or 100 bits to capture long-range correlations. A single global predictor with a fixed history length is a compromise.

TAGE solves this by maintaining several predictor tables in parallel, each using a different history length. The history lengths grow geometrically: typically 4, 8, 16, 32, 64, 128, 256, 512, 1024 bits, or some subset.

Each table is indexed by hashing the PC together with the corresponding history length. Each entry is tagged with a few bits of additional hash, so the predictor can detect when the entry it reads belongs to the branch being predicted versus a colliding one.

On a prediction:

All tables are looked up in parallel.
The tables whose tags match the current branch's hash are hits; the others miss.
Among the hits, the longest-history table provides the prediction (it has the most context).
If no tagged table hits, a default base predictor (a simple bimodal, like the two-bit counter) provides the prediction.

After resolution, only the providing entry is updated, plus various allocation rules that decide whether to allocate new entries in higher-history tables when a misprediction occurs.

TAGE reaches about 97-98% accuracy on standard benchmarks, considerably better than gshare or tournament predictors. Refinements like TAGE-SC (TAGE with a Statistical Corrector that overrides predictions in specific learned cases) and TAGE-SC-L (with a Loop predictor for regular loops) push accuracy a fraction higher. Modern Intel and AMD predictors are based on TAGE-like designs with additional proprietary tuning.

A more recent direction is perceptron-based prediction: each PC has a small neural-network-like perceptron whose inputs are the history bits, and the prediction is computed by a weighted sum. Perceptrons can learn longer history patterns than TAGE in some cases, though they are more expensive in hardware. Some recent processors combine TAGE with a perceptron-based corrector.

08.Branch-Target Prediction

Predicting the direction of a branch is only half the problem. For taken branches, the front end also needs the target address quickly enough to start fetching there.

For direct branches and unconditional jumps, the target is encoded as a PC-relative offset in the instruction. Once the instruction is decoded, the target is just PC + offset. But decode happens after fetch — typically a few cycles after — so even for direct branches the front end has been fetching the wrong path for a few cycles before learning the right target.

The solution is the Branch Target Buffer (BTB): a cache, indexed by branch PC, that holds the target address of each recently-seen branch. The BTB is consulted in parallel with the fetch itself. If the BTB indicates that the instruction being fetched is a branch and gives a predicted target, the front end starts fetching from that target the next cycle, rather than waiting for decode.

A BTB entry typically holds:

The PC tag (to confirm the entry is for the right branch).
The predicted target address.
A type indicator (conditional, unconditional, call, return).
Often, prediction state for direction (the same 2-bit counter we have seen).

On a BTB hit, the front end uses the predicted target as the next fetch address. On a miss, the front end falls back to sequential fetch and adjusts when decode reveals a branch.

Modern BTBs are large (thousands of entries) and may be hierarchical (a small fast BTB and a larger slower one).

Indirect Branches

Indirect branches — jmp rax on x86, br x0 on AArch64, virtual function dispatch, switch statements — have targets that are not encoded in the instruction. The same indirect branch may go to many different targets depending on runtime state. A simple BTB that records one target per branch fails badly here.

Indirect target predictors address this. They keep multiple targets per branch and use the branch history (just like direction predictors) to decide which target to use this time. A typical indirect predictor has a structure similar to a tournament or TAGE-like predictor but with full target addresses instead of single-bit predictions.

The motivation is the prevalence of indirect calls in modern software. Object-oriented languages use them for every virtual method call. JIT-compiled code uses them for inline caches. Interpreters dispatch through them on every bytecode. Without good indirect prediction, all of this software would be slow.

The Return Address Stack

Return instructions deserve special handling. A return is an indirect branch — its target is the address recorded at the most recent call — but its target is highly predictable: the call instruction wrote the return address into the link register or onto the stack just a few instructions ago.

A Return Address Stack (RAS) is a small hardware stack inside the front end that mirrors the program's call stack. On every call, the front end pushes the predicted return address. On every return, it pops the top and uses that as the predicted target.

RAS prediction is nearly perfect for properly nested calls and returns. It only fails when:

The hardware stack overflows (deeper nesting than the RAS can hold).
The program uses non-standard control flow (longjmp, exception unwinding, threading) that does not match the stack discipline.

A typical RAS holds 16 to 32 entries. Programs with deeper nesting suffer extra mispredictions on returns from deep functions, but most code does not nest that deep dynamically.

09.Speculation and Recovery

A modern processor speculates aggressively. The front end fetches and decodes instructions far past the branches whose outcomes are not yet known. The back end may even execute speculative instructions before the branch resolves. The processor must therefore have machinery to undo the speculative work when prediction fails.

The basic mechanism is checkpointing: at every speculation point (a branch prediction), the processor records enough state to roll back if the speculation turns out wrong. The reorder buffer (Chapter 25) and register-rename map (also Chapter 25) provide most of this capability: register writes happen first to a hidden physical register, and only become architecturally visible when the instruction retires, which only happens after all older branches have resolved correctly.

When a branch is resolved as mispredicted:

The processor identifies all instructions younger than the mispredicted branch in the pipeline.
Those instructions are squashed: their pipeline-register entries are turned into bubbles, their tentative writes to physical registers are abandoned, their reorder-buffer entries are released.
The architectural state — register file, memory — is rolled back to the point just before the branch (this is automatic, since speculative writes never made it that far).
The front end is redirected to fetch from the correct target address. The branch predictor is updated to reflect what actually happened.

The recovery is fast — usually a single cycle of bookkeeping followed by a refill of the pipeline — but the cost is the wasted work in the squashed instructions. The deeper the pipeline and the more aggressive the speculation, the more work is in flight and potentially wasted.

A processor that mispredicts a branch may have:

10-15 cycles of wasted fetch.
Several cycles of wasted decode and rename.
Whatever execution and memory work the squashed instructions had begun.

The cumulative cost in real machines is significant. A hot loop with one mispredicted branch every 20 iterations and a 15-cycle penalty loses about $15/20 = 0.75$ cycles per iteration on average. In pathological code with worse predictability, the cost can dominate runtime.

10.The Front-End Pipeline

A modern fetch pipeline contains multiple stages devoted to branch handling. A simplified view of the front end of an aggressive modern processor:

Figure: Modern processor front end: PC drives BTB and direction prediction, then I-cache, then pre-decode and branch identification, with mispredict redirects looping back to the PC

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum height=0.9cm, align=center}]
  \node[blk, minimum width=1.6cm] (pc)  at (1, -0.5)  {PC};
  \node[blk, minimum width=2.6cm] (btb) at (4, -0.5)  {BTB lookup,\\direction pred.};
  \node[blk, minimum width=2.4cm] (ic)  at (7.2, -0.5)  {I-Cache};
  \node[blk, minimum width=2.6cm] (pd)  at (10.2, -0.5) {pre-decode,\\branch ident.};
  \node[blk, minimum width=2.4cm] (dec) at (4, -2.7)  {decode};
  \node[blk, minimum width=2.4cm] (uq)  at (7.2, -2.7)  {$\mu$op queue};
  \draw[->] (pc) -- (btb);
  \draw[->] (btb) -- (ic);
  \draw[->] (ic) -- (pd);
  \draw[->] (pd.south) -- (10.2, -2.7) -- (dec.east);
  \draw[->] (dec) -- (uq);
  \draw[->, dashed] (pd.north) to[bend left=30] node[above] {redirect on mispred / BTB hit} (pc.north);
\end{tikzpicture}

The PC is the starting point. The BTB and direction predictor are looked up in parallel with the I-cache, so that by the time the cache delivers the bytes, the predictor has already decided whether to redirect. If the BTB predicts a taken branch, the next cycle's PC is the predicted target; otherwise, sequential.

Several stages of decoding follow. For x86, the front end has to handle variable-length instructions, identifying instruction boundaries before it can decode. For RISC ISAs, decode is faster (every instruction is a fixed size) but multiple decoders run in parallel. Branches are identified in decode and feed back to the predictor for confirmation; sometimes a branch the predictor missed is found in decode and triggers a small redirect.

Beyond decode, the µop queue feeds the back end. The front end runs ahead of the back end whenever it can, so a long-latency back-end stall (a cache miss, a long division) does not starve the front end of its work, and a short front-end stall (a single I-cache miss, a small misprediction) does not starve the back end.

This decoupled front-end / back-end structure is what makes modern processors tolerant of imperfect branch prediction: a few-cycle bubble in fetch is partially absorbed by the µop queue, and a back-end stall lets the front end build up reserve work for when the back end resumes.

11.The Decoupled Front End and Fetch-Target Queue

The predictor and the I-cache, drawn together in the diagram above, are increasingly decoupled in the most aggressive modern designs. The branch predictor runs ahead of the I-cache fetcher, generating a stream of fetch addresses that the I-cache consumes in turn. A buffer between them — the Fetch Target Queue (FTQ) in AMD's Zen, the Block Address Queue in Intel's recent designs, the Branch Predictor Queue in some Arm cores — holds predicted basic-block boundaries waiting to be fetched.

The motivation is twofold. First, the predictor's structures (BTB, direction predictor, RAS) can be looked up faster than the I-cache, since they are smaller; running them ahead lets the predictor pre-stage misses and prefetches into the I-cache. Second, on an I-cache miss, the front end can keep predicting future basic blocks while the missed line is being fetched, so when it returns, the predictor's queue is already full of work and the cache fetcher resumes immediately.

This decoupling makes the predictor and the cache fetcher into a small producer-consumer system, with backpressure in both directions. When the back end falls behind, the µop queue fills and the fetch target queue stops being consumed; when the back end runs ahead, the predictor races to keep the queue stocked. A modern front end is, in a sense, a small dataflow machine in its own right, with the cycle-to-cycle throughput governed by which side has work and which is waiting.

The architectural consequence for misprediction handling is that an FTQ-based front end already has the wrong basic blocks queued up at the moment a misprediction is signalled. Recovery has to flush not just the back-end pipeline but also the FTQ, the I-cache requests in flight, and any µops in the decode pipeline that came from wrong-path fetches. Modern designs handle this with checkpointed predictor state: the predictor is restored to the state it had at the mispredicting branch, and re-fetch resumes from the corrected target. We will revisit checkpointing more generally in Chapter 25.

12.Multiple Branches per Cycle

A wide front end — 8 instructions per cycle on Apple M-series, 6 on AMD Zen 4, 6 on Intel Golden Cove — needs to predict more than one branch per cycle on average. The naive predictor, which predicts one branch per cycle, becomes a bottleneck once the branch rate exceeds one branch per fetched basic block.

The response is multi-prediction. The branch predictor is augmented to deliver, in a single cycle, predictions for several branches in succession: the next branch in the current basic block, the branch ending the basic block after that, and so on. The BTB is widened or duplicated to look up several targets per cycle. The history register is updated speculatively along the multi-block path.

A cleaner organization, used in most current designs, is to make the predictor block-oriented rather than instruction-oriented. The predictor takes a block address and returns the entire predicted layout of the next basic block: how many instructions it contains, where the terminating branch is, and where it goes. The fetch unit then fetches that many instructions and follows the predictor's recommendation. The predictor can update its history register once per block rather than once per branch, which simplifies the timing on what is one of the tightest critical paths in the design.

The cost of multi-prediction is increased predictor area and complexity, and a small accuracy penalty when the predictor commits to several predictions ahead and one of them is wrong (the dependent ones must be undone). In practice, the win in fetch bandwidth is decisive, and every wide modern front end uses some form of multi-prediction.

13.Branch Prediction and Security

A surprising consequence of speculative branch prediction emerged in 2017–2018 with the Spectre family of vulnerabilities. The core idea: a processor that speculatively executes the wrong path of a branch can leave traces of that wrong-path execution in micro-architectural state — typically, in cache contents — even though the architectural state is correctly rolled back. An attacker can train the branch predictor to mispredict in a specific direction, forcing a victim to execute speculative code that reads secret data and leaks it through cache timing.

Mitigations include:

Speculation barriers like x86's LFENCE that prevent speculation past them.
Indirect branch protection (Intel's IBPB and IBRS, AMD's equivalents) that flush or partition the indirect predictor across context switches.
Retpoline: a software mitigation that replaces indirect branches with constructions the predictor cannot guess.
Hardware changes in newer processors that limit speculation across privilege boundaries.

We will revisit Spectre and Meltdown in detail in Chapter 51, where we discuss advanced speculation and security. The point for this chapter is that branch prediction, once a purely-performance feature, has become entangled with system security in ways that affect both hardware design and software practice.

14.Putting It All Together

A typical modern branch-handling subsystem combines:

A TAGE-based direction predictor with several history lengths and 100,000+ total prediction states.
A multi-level BTB with thousands of entries and predicted target addresses.
An indirect target predictor with target sets per branch, history-conditioned.
A return address stack with 24-32 entries.
Loop predictors for highly regular loops.
Statistical correctors that override the main prediction in learned cases.
A misprediction recovery pipeline that flushes speculative state in a small number of cycles.

The aggregate result on typical workloads is 97-99% direction accuracy and >95% indirect-target accuracy. The remaining mispredictions still cost real cycles, but the rate is low enough that wide deep pipelines remain practical.

This subsystem is one of the most complex parts of any modern CPU. It rivals the cache hierarchy in transistor count and verification effort. It is also a competitive differentiator: ARM, Apple, AMD, and Intel each invest heavily in proprietary refinements to their predictors, and the differences show up directly in benchmark results.

15.Summary

Branches are unavoidable, frequent, and expensive in deep pipelines. Branch prediction lets the front end fetch speculatively past unresolved branches, hiding the resolution latency when the prediction is correct. The history of branch prediction is a progression from static schemes to one-bit and two-bit dynamic counters, then to two-level adaptive predictors that capture history-dependent patterns, then to TAGE and its descendants, which combine multiple history lengths and tagged tables to reach 97-98% accuracy.

Direction prediction is half the problem; target prediction handles the other half. The Branch Target Buffer caches predicted targets for direct branches; indirect target predictors and return address stacks handle the harder cases. A decoupled front end with a fetch-target queue lets the predictor run ahead of the I-cache; multi-block prediction sustains the high fetch rates of wide modern cores. Misprediction recovery flushes speculative state and redirects the front end, all within a handful of cycles, but the cumulative cost of the inevitable mispredictions is still significant.

With branches handled and the pipeline kept fed, the next opportunity for performance is to issue more than one instruction per cycle. Chapter 24 develops superscalar execution.

Book mode