Part VII·Advanced and Frontier·Chapter 51 of 62

Advanced Branch Prediction and Speculation

May 16, 2026·14 min read·advanced

This chapter has two halves. The first is the modern frontier of branch prediction — how cutting-edge predictors are built, where they fall short, and how they interact with deep pipelines. The second is the family of speculative-execution attacks that emerged starting in 2018 with Spectre and Meltdown, the mitigations, and the broader implications for how we think about microarchitectural security.

This chapter is referenced from Chapters 23, 26, 31, and 36.

01.State-of-the-Art Branch Prediction

Chapter 23 introduced two-bit counters, two-level predictors, and gshare. Modern predictors are far more elaborate. The two dominant production designs are TAGE and perceptron, sometimes hybridized.

TAGE

TAGE (TAgged GEometric history length predictor), introduced by André Seznec and Pierre Michaud in 2006, dominates contemporary branch prediction. The structure:

A base bimodal predictor (simple two-bit counters indexed by PC).
A series of tagged tables, each indexed by a different (geometric) length of the global branch history.
Each tagged table entry has: a partial tag (so we know which branch trained it), a prediction counter, and a useful counter.

Lookup: index each tagged table; check tag matches; pick the prediction from the table with the longest matching history (the "provider"). If no tagged table matches, fall back to the base.

Update: train the provider's counter; under specific conditions, allocate new entries in tables with longer history when the prediction was wrong.
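The lookup and update rules above can be sketched as a toy model. This is a minimal illustration, not a production design: the class names (TinyTage, TaggedEntry), the table sizes, the XOR-folding hash, and the "always allocate in the next longer table" policy are all simplifying assumptions. Real TAGE uses the useful counters to choose a victim entry and has further refinements.

```python
from dataclasses import dataclass


@dataclass
class TaggedEntry:
    tag: int
    counter: int      # signed: >= 0 predicts taken, range -4..3
    useful: int = 0   # bumped when this entry provides a correct prediction


class TinyTage:
    """Toy TAGE-style direction predictor; sizes and hashes are illustrative."""

    def __init__(self, history_lengths=(4, 8, 16), index_bits=10, tag_bits=8):
        self.history_lengths = history_lengths
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.index_mask = (1 << index_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1
        self.base = [1] * (1 << index_bits)           # 2-bit counters, >= 2 is taken
        self.tables = [{} for _ in history_lengths]   # per table: index -> TaggedEntry
        self.history = 0                              # newest outcome in bit 0

    def _fold(self, length, bits):
        """XOR-fold the newest `length` history bits down to `bits` bits."""
        h = self.history & ((1 << length) - 1)
        folded = 0
        while h:
            folded ^= h & ((1 << bits) - 1)
            h >>= bits
        return folded

    def _slot(self, pc, t):
        length = self.history_lengths[t]
        index = (pc ^ self._fold(length, self.index_bits)) & self.index_mask
        tag = (pc ^ (self._fold(length, self.tag_bits) << 1)) & self.tag_mask
        return index, tag

    def predict(self, pc):
        """Return (taken?, provider); provider is -1 for the base predictor."""
        for t in reversed(range(len(self.tables))):   # longest history first
            index, tag = self._slot(pc, t)
            entry = self.tables[t].get(index)
            if entry is not None and entry.tag == tag:
                return entry.counter >= 0, t
        return self.base[pc & self.index_mask] >= 2, -1

    def update(self, pc, taken):
        """Train on the resolved outcome; return the prediction that was made."""
        predicted, provider = self.predict(pc)
        if provider >= 0:
            index, _ = self._slot(pc, provider)
            entry = self.tables[provider][index]
            entry.counter = min(3, entry.counter + 1) if taken else max(-4, entry.counter - 1)
            if predicted == taken:
                entry.useful = min(3, entry.useful + 1)
        else:
            i = pc & self.index_mask
            self.base[i] = min(3, self.base[i] + 1) if taken else max(0, self.base[i] - 1)
        if predicted != taken and provider + 1 < len(self.tables):
            # Simplified allocation: always claim the slot in the next longer table.
            index, tag = self._slot(pc, provider + 1)
            self.tables[provider + 1][index] = TaggedEntry(tag, 0 if taken else -1)
        max_len = max(self.history_lengths)
        self.history = ((self.history << 1) | int(taken)) & ((1 << max_len) - 1)
        return predicted


if __name__ == "__main__":
    predictor = TinyTage()
    outcomes = [i % 2 == 0 for i in range(200)]       # strictly alternating branch
    misses = sum(predictor.update(0x40, o) != o for o in outcomes)
    print("mispredictions on an alternating branch:", misses)
```

On a strictly alternating branch, a lone two-bit counter keeps flipping between its two weak states, while the tagged entries in this sketch key on recent history and settle after a handful of mispredictions.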

The geometric history lengths — often something like 0, 5, 14, 32, 63, 110, 200, 360 — let the predictor capture both short-range and very-long-range correlations. Some branches need only the last few branches' outcomes; some need 200 or more.
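For illustration, a geometric series of history lengths can be generated from the shortest length, the longest length, and the number of tagged tables. The helper below is a sketch; the function name and the example parameters are assumptions, and shipping predictors typically hand-tune the final values, so the result will not reproduce the list above exactly.

```python
def geometric_history_lengths(shortest, longest, tables):
    """History length for each tagged table: L(i) = round(shortest * alpha**i).

    alpha is chosen so that the last table lands on `longest`.
    """
    if tables < 2:
        raise ValueError("need at least two tagged tables")
    alpha = (longest / shortest) ** (1.0 / (tables - 1))
    return [int(shortest * alpha ** i + 0.5) for i in range(tables)]


if __name__ == "__main__":
    # Assumed example: 7 tagged tables spanning histories of 5 to 360 branches.
    print(geometric_history_lengths(5, 360, 7))
```

Because each table's history is a constant factor longer than the previous one, a small number of tables covers a range from a few branches to several hundred.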

A modern high-end TAGE has 8-12 tagged tables, total budget of tens of KB. Mispredict rates on standard benchmarks are typically below 3%, often below 1.5%.

Perceptron Predictors

Perceptron predictors, originally proposed by Jiménez and Lin (2001), use a single-layer perceptron neural network. Each branch has a vector of weights, one per branch in the history. The prediction is the sign of the dot product of weights and history (where each history bit is +1 or −1).

A perceptron predictor scales linearly with history length (vs. the exponential resource demands of pure two-level predictors). Modern variants like Hashed Perceptron and Multiperspective Perceptron achieve accuracy competitive with TAGE.
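A minimal sketch of the dot-product prediction and its training rule follows. The class name, table size, and 8-bit weight clamp are assumptions for illustration; the training threshold uses the floor(1.93 * h + 14) rule of thumb from the original perceptron predictor work.

```python
class PerceptronPredictor:
    """Toy perceptron branch predictor: one weight vector per table slot."""

    def __init__(self, history_length=16, table_size=256):
        self.theta = int(1.93 * history_length + 14)    # training threshold
        self.weights = [[0] * (history_length + 1) for _ in range(table_size)]
        self.history = [-1] * history_length            # +1 taken, -1 not taken

    @staticmethod
    def _clamp(w):
        return max(-128, min(127, w))                   # assumed 8-bit weights

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        # w[0] is the bias weight; the rest pair up with history bits.
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the resolved outcome; return the prediction that was made."""
        y = self._output(pc)
        predicted = y >= 0
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if predicted != taken or abs(y) <= self.theta:
            w = self.weights[pc % len(self.weights)]
            w[0] = self._clamp(w[0] + t)
            for i, xi in enumerate(self.history, start=1):
                w[i] = self._clamp(w[i] + t * xi)
        self.history = [t] + self.history[:-1]          # newest outcome first
        return predicted


if __name__ == "__main__":
    predictor = PerceptronPredictor()
    outcomes = [i % 2 == 0 for i in range(300)]
    misses = sum(predictor.update(0x1234, o) != o for o in outcomes)
    print("mispredictions on an alternating branch:", misses)
```

The storage per branch grows by one weight per extra history bit, which is the linear scaling described above.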

AMD has used perceptron-based predictors in Zen-family cores. The Zen 4 predictor reportedly combines perceptron-style learning with TAGE-like geometric history.

Hybrid Predictors

Production cores use multiple predictors and a meta-predictor that chooses among them per branch. Different branches respond best to different predictor styles; the meta-predictor learns this.

An example partitioning:

TAGE for branches with strong history correlation.
Bimodal for branches that need no history.
Loop predictor for loop-controlling branches (predicts iteration count).
Statistical Corrector for branches that are statistically biased but only weakly correlated with history, where it can override a low-confidence TAGE prediction.
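As one concrete component from the list above, here is a sketch of a loop predictor that learns a trip count and predicts the exit. The class name, the "same count twice in a row" confidence rule, and returning None to defer to another predictor are assumptions chosen for brevity, not a description of any shipping design.

```python
class LoopPredictor:
    """Toy loop predictor for a backward branch taken N times, then not taken."""

    def __init__(self):
        self.trip_count = {}   # pc -> taken iterations seen before the last exit
        self.current = {}      # pc -> taken iterations so far in this execution
        self.confident = {}    # pc -> True once the trip count has repeated

    def predict(self, pc):
        """Return True/False when confident, or None to defer to another predictor."""
        if not self.confident.get(pc):
            return None
        return self.current.get(pc, 0) < self.trip_count[pc]

    def update(self, pc, taken):
        count = self.current.get(pc, 0)
        if taken:
            self.current[pc] = count + 1
        else:
            # Loop exit: confident only if this trip count matches the previous one.
            self.confident[pc] = (self.trip_count.get(pc) == count)
            self.trip_count[pc] = count
            self.current[pc] = 0


if __name__ == "__main__":
    lp = LoopPredictor()
    for run in range(3):
        for outcome in [True] * 7 + [False]:
            print(run, lp.predict(0x400), outcome)
            lp.update(0x400, outcome)
```

A meta-predictor in this style would use the loop predictor's answer whenever it is not None and fall back to TAGE or the bimodal table otherwise.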

Indirect Branch Prediction

Indirect branches (branches whose target is computed) are predicted separately. The dominant design is ITTAGE (Indirect Target TAGE), the TAGE structure adapted for predicting targets rather than directions. Each tagged entry stores a predicted target.

A complementary structure, the Branch Target Buffer (BTB), caches recent branch targets keyed by PC. The BTB is consulted in parallel with the direction predictor. Modern BTBs have thousands to tens of thousands of entries.

Return addresses get their own structure: the Return Address Stack (RAS), a hardware stack tracking call/return pairs. Push on call, pop on return. A typical RAS has 16-32 entries; deep recursion can overflow.
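The overflow behavior can be seen in a small model. This sketch assumes a circular-buffer RAS in which a push onto a full stack overwrites the oldest entry; the class name and that overflow policy are assumptions, and real designs differ in how they handle overflow and recovery.

```python
class ReturnAddressStack:
    """Toy circular return-address stack."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = [0] * depth
        self.top = 0                       # pushes minus pops; wraps modulo depth

    def push(self, return_address):        # on CALL
        self.entries[self.top % self.depth] = return_address
        self.top += 1

    def pop(self):                         # on RET: predicted return target
        self.top -= 1
        return self.entries[self.top % self.depth]


if __name__ == "__main__":
    ras = ReturnAddressStack(depth=4)
    real_returns = [0x1000 + 8 * i for i in range(6)]   # recursion 6 deep
    for addr in real_returns:
        ras.push(addr)
    correct = sum(ras.pop() == addr for addr in reversed(real_returns))
    print("correctly predicted returns:", correct, "of", len(real_returns))
```

With a depth of 4 and six nested calls, the innermost four returns are predicted correctly and the outer two are not, because their entries were overwritten.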

Front-End Bandwidth

A wide core (8-decode) needs the front end to deliver 8 instructions per cycle. The branch predictor has to keep up: if a branch is predicted every few instructions on average, the predictor must produce a prediction every cycle without stalling fetch. Decoupled fetch lets the predictor run ahead, generating a queue of predictions that fetch consumes.

When the predictor produces the wrong prediction, the front end must redirect — but the wrong instructions are already in the pipeline. They will be flushed when the misprediction is detected later. Limiting the cost of mispredicts is one of the main motivations for fast resolution paths (Chapter 23, branch resolution).
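A back-of-the-envelope model makes the cost of mispredicts concrete. The formula is the standard first-order estimate (extra cycles per instruction = branch fraction × mispredict rate × flush penalty); every number in the example is an assumption picked for illustration, not a measurement of any core.

```python
def mispredict_overhead_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


def mpki(branch_fraction, mispredict_rate):
    """Mispredictions per 1000 instructions."""
    return 1000.0 * branch_fraction * mispredict_rate


if __name__ == "__main__":
    # Assumed workload: 20% branches, 2% mispredicted, 15-cycle redirect penalty,
    # on a core that would otherwise sustain 4 instructions per cycle.
    base_cpi = 0.25
    overhead = mispredict_overhead_cpi(0.20, 0.02, 15)
    print(f"MPKI: {mpki(0.20, 0.02):.1f}")
    print(f"effective IPC: {1.0 / (base_cpi + overhead):.2f}")
```

Under these assumptions the core drops from 4 IPC to roughly 3.2, which is why even a small reduction in mispredict rate or redirect latency is worth considerable hardware.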

02.Microarchitectural Side Channels

Now to the second half of the chapter: speculative-execution attacks.

A side channel is a way for an attacker to learn information they shouldn't have, by observing some indirect signal — timing, power, electromagnetic emissions, cache state. Microarchitectural side channels exploit the CPU's internal optimizations that leak information through cache state, branch predictor state, or other shared microarchitectural resources.

The basic primitive: an attacker can observe whether a particular cache line was recently accessed by measuring the time to access it themselves. Cached lines respond quickly; uncached lines take longer (a measurable difference, often 50-200 cycles).

This timing primitive predates Spectre by years. Flush+Reload (Yarom and Falkner, 2014) and Prime+Probe are well-known cache side-channel techniques used to break crypto implementations.
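The Flush+Reload idea can be modeled without real hardware timing. The sketch below is a simulation only: ToyCache, its latency constants, and the threshold are assumptions standing in for a real cache and a cycle counter.

```python
class ToyCache:
    """Simulated cache that reports a latency for each access."""

    HIT_CYCLES = 40      # assumed latency of a cached line
    MISS_CYCLES = 200    # assumed latency of an uncached line

    def __init__(self, line_size=64):
        self.line_size = line_size
        self.lines = set()

    def access(self, address):
        line = address // self.line_size
        latency = self.HIT_CYCLES if line in self.lines else self.MISS_CYCLES
        self.lines.add(line)             # the access itself fills the line
        return latency

    def flush(self, address):
        self.lines.discard(address // self.line_size)


def flush_reload(cache, address, victim, threshold=100):
    """Return True if `victim` touched the line holding `address`."""
    cache.flush(address)                         # 1. evict the monitored line
    victim()                                     # 2. let the victim run
    return cache.access(address) < threshold     # 3. a fast reload means it was touched


if __name__ == "__main__":
    cache = ToyCache()
    shared = 0x7000
    print(flush_reload(cache, shared, lambda: cache.access(shared)))
    print(flush_reload(cache, shared, lambda: None))
```

On real hardware the flush would be an instruction such as CLFLUSH and the latency would come from a timestamp counter; the three-step structure is the same.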

What changed in 2018 was the realization that the CPU's speculative execution could be coerced into bringing secret data into the cache, even from regions the attacker should not be able to read. The attacker then uses the timing primitive to learn what was speculated.

03.Meltdown

Meltdown (Lipp et al., 2018) exploits a specific Intel design choice: when a load executes speculatively and accesses a page the user does not have permission to read, the access fault is deferred until retirement. During the speculation window, the load returns the data (from cache or memory) and dependent instructions can use it. When the load retires, the fault fires and the speculation is squashed — but cache state changes from the dependent instructions remain visible.

The classic Meltdown gadget:

Assembly

movzx rax, byte [kernel_address]   ; faulting load: transiently reads one secret byte
shl   rax, 12                      ; rax = secret * 4096 (one page per byte value)
mov   rbx, [probe_array + rax]     ; touches the probe_array page selected by the secret

The first instruction faults (user can't read kernel memory), but speculatively executes anyway. Dependent instructions touch a cache line indexed by the secret byte. After the squash, the attacker measures access times to the 256 page-spaced entries of probe_array; whichever is fast reveals the byte.

Meltdown affected most Intel out-of-order CPUs that shipped before the in-silicon fixes (Coffee Lake and earlier, reaching back well before Sandy Bridge) and some ARM cores (Cortex-A75 in particular). AMD CPUs were largely unaffected because their permission check happens earlier in the pipeline: the load doesn't return data on a privilege violation.

Mitigation: KPTI

The deployed mitigation for Meltdown is Kernel Page Table Isolation (KPTI). In normal execution, every process's page table includes a kernel mapping (the upper canonical region) for fast syscall handling. KPTI removes this: each process gets two page tables, a "user" table that maps only the small amount of kernel code needed to enter and leave the kernel, and a "kernel" table that maps both user and kernel space. On every syscall, interrupt, and trap, the kernel switches between them.

The cost: switching CR3 invalidates the TLB (mostly mitigated by PCID/ASID where available); each kernel entry/exit has additional overhead. Workloads with high syscall rates (fileservers, databases) saw 5-15% slowdowns; compute-bound workloads barely noticed.

Newer Intel CPUs (Cascade Lake and later) fix the underlying issue in silicon, eliminating the need for KPTI. Linux detects this via the RDCL_NO bit in the IA32_ARCH_CAPABILITIES MSR and leaves KPTI off on safe hardware.

04.Spectre

Spectre (Kocher et al., 2018) is broader than Meltdown. The key insight: the branch predictor can be trained by the attacker, and a victim's speculative execution under the trained prediction can leak data.

The original Spectre paper described two variants:

Spectre v1: Bounds-Check Bypass

if (x < array1_size)
    y = array2[array1[x] * 4096];

Suppose x is attacker-controlled and array1_size is not in the cache, so the comparison takes a long time to resolve. The attacker first calls the code repeatedly with in-bounds values of x to train the branch predictor toward "taken". While the check is still pending, the CPU speculatively executes the body under that prediction. With x at or beyond array1_size, the speculative read of array1[x] accesses memory outside the array, potentially a secret. Then array2[secret * 4096] touches a cache line selected by the secret.

When the bounds check resolves (false), the speculation is squashed. But the cache state changes persist. The attacker measures access times to array2 to recover the secret.
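The train, flush, mis-speculate, probe sequence can be followed in a toy model. This is a simulation, not an exploit: ToyVictim, its one-bit predictor, and the set of "cached pages" are assumptions that stand in for real predictor and cache state, and nothing here measures real timing.

```python
class ToyVictim:
    """Models `if (x < array1_size) y = array2[array1[x] * 4096];`."""

    def __init__(self, array1, secret):
        self.array1_size = len(array1)
        self.memory = bytes(array1) + bytes(secret)   # secret sits right after array1
        self.predict_taken = False                    # 1-bit predictor for the check
        self.cached_pages = set()                     # array2 pages currently cached

    def call(self, x):
        in_bounds = x < self.array1_size
        if in_bounds or self.predict_taken:
            # Body runs architecturally (in bounds) or only transiently
            # (out of bounds but predicted taken); the cache fill happens either way.
            value = self.memory[x]
            self.cached_pages.add(value)              # models array2[value * 4096]
        self.predict_taken = in_bounds                # predictor learns the outcome

    def flush_probe_array(self):
        self.cached_pages.clear()

    def probe_is_fast(self, page):
        return page in self.cached_pages


def recover_byte(victim, offset_past_array):
    victim.call(0)                                    # train with an in-bounds index
    victim.flush_probe_array()                        # evict all of array2
    victim.call(victim.array1_size + offset_past_array)   # out-of-bounds call
    hits = [p for p in range(256) if victim.probe_is_fast(p)]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    victim = ToyVictim(array1=range(16), secret=b"key!")
    leaked = bytes(recover_byte(victim, i) for i in range(4))
    print(leaked)
```

Note that the out-of-bounds call never returns the secret architecturally; the only trace it leaves is which probe page became "fast".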

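Spectre v1 affects essentially all out-of-order CPUs from major vendors. Mitigation: insert a speculation barrier between the bounds check and the dependent load (LFENCE on x86; CSDB on ARM, used together with a conditional select on the index). RISC-V's fence.i only synchronizes instruction fetch with earlier stores and is not a speculation barrier, and the base ISA does not define a dedicated one. The alternative is array index masking: after the bounds check, AND the index with a mask that forces out-of-range values into range (x & (size-1) when size is a power of two), so that even a mis-speculated access stays inside the array.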
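For sizes that are not a power of two, the mask can be computed without a branch. The sketch below emulates 64-bit arithmetic in Python to show the bit trick, modeled on the generic approach behind Linux's array_index_nospec; the function names are assumptions, and Python itself offers no guarantees about speculation, so this only demonstrates the arithmetic.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def index_mask(index, size):
    """All ones if 0 <= index < size, else zero; computed without branching.

    Assumes index and size are both below 2**63.
    """
    # (size - 1 - index) wraps to a value with the top bit set when index >= size.
    combined = (index | ((size - 1 - index) & WORD_MASK)) & WORD_MASK
    in_bounds_bit = ((~combined) & WORD_MASK) >> (WORD_BITS - 1)   # 1 if in bounds
    return (-in_bounds_bit) & WORD_MASK                            # 0xFFFF... or 0


def clamp_index(index, size):
    """Index to use after the bounds check; out-of-range values collapse to 0."""
    return index & index_mask(index, size)


if __name__ == "__main__":
    print(hex(index_mask(3, 10)), clamp_index(3, 10))
    print(hex(index_mask(10, 10)), clamp_index(10, 10))
```

Because the mask is derived from data rather than from a predicted branch, a mis-speculated path still sees an in-range index.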

Spectre v2: Branch Target Injection

The attacker trains the indirect branch predictor in their own process so that it predicts an attacker-chosen address in the victim's address space as the target of an indirect branch. When the victim runs, the branch predictor (shared between contexts in some designs) speculatively redirects the victim's indirect branch to that "gadget" address. The gadget speculatively reads secret data and leaks it via cache state.

Spectre v2 is more dangerous than v1 because it can exploit branches the victim never intended to be exploitable.

Mitigations:

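IBRS (Indirect Branch Restricted Speculation, Intel/AMD): a bit in the SPEC_CTRL MSR that prevents less-privileged code from steering indirect branch prediction in more-privileged code. High performance cost on early implementations.
IBPB (Indirect Branch Predictor Barrier): a command, issued by the OS on context switches, that prevents earlier code from influencing indirect branch predictions made afterwards.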
STIBP (Single Thread Indirect Branch Predictors): prevent SMT siblings from sharing predictor state.
Retpoline: a software pattern that converts indirect branches into a return-based sequence, exploiting the RAS for predictability. Retpolines avoid the leaky indirect predictor altogether.

A typical retpoline:

Assembly

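# Replace: jmp *%rax
call    .Ltrampoline        # pushes the address of .Lcapture
.Lcapture:                  # speculation after RET lands here and spins
    pause
    lfence
    jmp     .Lcapture
.Ltrampoline:
    mov     %rax, (%rsp)    # overwrite the return address with the real target
    ret                     # architecturally jumps to the address in %rax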

The CALL pushes a return address (to .Lcapture); the trampoline overwrites it with the desired target; the RET pops the new target. The RAS predicts the RET to go to .Lcapture (the next instruction after CALL), which is harmless. The actual target is loaded from memory, not predicted.

Newer Intel CPUs have eIBRS (enhanced IBRS), an always-on hardware mitigation that's cheaper than the original IBRS. AMD has AutoIBRS in Zen 4.

Spectre v4: Speculative Store Bypass

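Disclosed in May 2018: when a load executes before an older store whose address is not yet resolved, the store-to-load forwarding and memory disambiguation logic (Chapter 26) may speculate that the two do not alias and let the load read stale data. If the stale data is sensitive and dependent instructions leak it via the cache, similar attacks ensue. Mitigated by the SSBD (Speculative Store Bypass Disable) control on Intel and AMD and by the SSBB/PSSBB barriers on ARM.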

05.The Broader Family

After Spectre and Meltdown, dozens of related vulnerabilities have been disclosed. A non-exhaustive list:

MDS (Microarchitectural Data Sampling), 2019. Variants RIDL, ZombieLoad, Fallout, Store-to-Leak Forwarding. Internal CPU buffers (line-fill buffers, store buffers, load ports) leak data across security boundaries. Affected various Intel CPUs through Cascade Lake.

L1TF (L1 Terminal Fault) / Foreshadow, 2018. Speculative execution past a non-present PTE allowed reading L1 cache contents, including SGX enclave data and other VMs. Defeats SGX's confidentiality.

TAA (TSX Asynchronous Abort), 2019. TSX transactions abort but speculatively executed code leaks data.

SRBDS (Special Register Buffer Data Sampling), 2020. RDRAND/RDSEED leak via shared buffer.

Zenbleed, 2023. AMD Zen 2: the YMM registers can leak across processes due to a register-rename bug.

Inception / SRSO (Speculative Return Stack Overflow), 2023. AMD Zen: training the RAS can cause speculative returns to attacker-chosen addresses.

Downfall, 2023. Intel: GATHER instructions on certain CPUs leak data via the SIMD register file's transient state.

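Reptar, 2023. Intel: a redundant REX prefix on rep movsb, on cores with fast short repeat move, puts the core into an inconsistent state (hangs, machine checks, potential privilege escalation). A CPU erratum rather than a speculation side channel, but surfaced by the same kind of microarchitectural research.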

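Indirector / GhostRace, 2024. Indirector: high-precision branch target injection against the indirect branch predictor and BTB of recent Intel cores. GhostRace: speculative race conditions, where mis-speculated branches inside synchronization primitives let a transient path enter a critical section.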

The pattern: every microarchitectural optimization that crosses what should be a security boundary becomes a candidate for side-channel attack. Mitigations layer on; performance overhead accumulates; new variants keep appearing.

06.SGX, SEV, and Side Channels

Trusted execution environments are particularly affected. SGX enclaves and SEV/TDX confidential VMs both rely on the assumption that the hardware enforces isolation. Spectre-class attacks can violate this assumption in subtle ways.

Several papers (Foreshadow, MDS variants, SGAxe) have exfiltrated SGX attestation keys. Intel issued microcode updates and (in some cases) deprecated SGX on consumer hardware.

SEV-SNP added countermeasures for several side channels but cannot eliminate them all. The vendors' threat models for confidential VMs explicitly place many side channels out of scope. In practice this means workloads with very strong security needs may not be safe even on hardware-attested confidential computing.

07.Mitigation Mindset

The collective response across Intel, AMD, ARM, and others has converged on a few patterns:

Hardware fixes for Meltdown-class issues. Permission checks earlier in the pipeline; explicit squashing of dependent ops on a faulting load. Now standard.

Predictor state isolation. SMT siblings don't share predictor state; context switches barrier predictor state; predictor entries are tagged with security context.

Microcode updates. Patch existing silicon to add fences, clear buffers on context switches, restrict speculation.

Software hardening. Compilers insert speculative barriers in security-sensitive code; OS uses techniques like retpoline, KPTI, and various flushes; user code uses constant-time crypto patterns.

Architecture-level patterns. New instructions for explicit speculation control (e.g., ARM's CSDB / SSBB / PSSBB / DSB SY / ISB, Intel's LFENCE in specific roles, RISC-V's eventual hardening primitives).

The cost has been real: 5-15% performance loss in syscall-heavy workloads from KPTI; smaller but measurable losses from retpolines and various flushes; increased complexity in compiler and OS.

08.Structural Solutions

Some research explores rebuilding speculation to be safe by construction. Examples:

InvisiSpec, SafeSpec, MuonTrap (academic research): speculative loads access a separate "speculation buffer" that is invisible to other agents until the load retires. On squash, the buffer is discarded with no state effect.

Delay-on-Miss: speculative loads that miss in cache are delayed until the speculation resolves. Removes the cache-state side-channel for misses, at a perf cost.

Constant-time by construction: hardware that guarantees constant-time execution for marked instructions (e.g., ARM DIT — Data Independent Timing — bit, which forces certain ops to be data-independent in latency).

These have not become standard but inform how new microarchitectures are designed.

09.Comparison Across Vendors

Vendor	Primary Approach	Notes
Intel	Microcode + silicon fixes	Most heavily affected by Meltdown/MDS class. eIBRS in modern CPUs.
AMD	Less Meltdown exposure, but Spectre v2 / Inception affected	AutoIBRS in Zen 4+.
ARM	Hardware fixes + DIT	Less affected on Cortex-A53 / A55 (in-order); affected on A75 / Apple cores.
Apple	Aggressive silicon fixes; tight microarchitectural disclosure	Generally less affected; M1 had some Spectre v1 / data sampling reports.
RISC-V	Mostly research; commercial cores adopting standard mitigations	Younger ecosystem; some H ext. cores adopt strict speculation barriers.

10.Implications

Spectre and friends marked a paradigm shift. Previously, "memory isolation" and "permission enforcement" were thought to be hardware-guaranteed properties. This class of attacks shows that microarchitectural state (caches, predictors, ports, buffers) can leak data even when the architectural state is correctly protected.

The implications:

Crypto code must be constant-time at every level — including not having data-dependent branches or memory accesses, ever.
Browsers run untrusted JavaScript: SharedArrayBuffer was temporarily disabled and high-resolution timers were coarsened to make timing measurements harder; site isolation (each site in its own process) became standard.
Trusted execution boundaries are softer than expected: SGX deprecated on consumer hardware; SEV-SNP and TDX explicit about side-channel scope.
Performance vs. security trade-off has been pushed back into hardware design: future cores are more conservative about speculation across security boundaries.

The full lessons are still being absorbed. New variants are likely. The architectural community now views speculative execution as a feature with security implications, not just a performance feature.

11.Summary

Modern branch predictors (TAGE, perceptron, hybrids) achieve mispredict rates of a few percent or less on standard benchmarks; they are among the most sophisticated learning structures in CPUs, correlating each prediction with up to hundreds of bits of branch history. Indirect branch prediction, return-address stacks, and BTBs round out front-end prediction.

But speculative execution — long viewed as a transparent optimization — turned out to leak data through microarchitectural side channels. Spectre, Meltdown, MDS, L1TF, Zenbleed, Inception, Downfall, and many others form a family of attacks exploiting the gap between architectural and microarchitectural state. Mitigations layer up — KPTI, retpoline, IBRS/eIBRS, microcode updates, software hardening — at measurable but tolerable performance cost. The attack class is not closed; new variants continue to be discovered.

The next chapter shifts to physical concerns: power, thermal, and the physics of running a fast CPU. We've talked about cycles and pipelines; now we examine what actually happens at the silicon level when those pipelines run.
