Part IV·Microarchitecture·Chapter 28 of 62

Part IVMicroarchitecture

Instruction-Level Parallelism

May 16, 2026·18 min read·advanced

Part V developed the machinery a single processor uses to execute many instructions concurrently: pipelining, superscalar issue, out-of-order scheduling, register renaming, branch prediction, memory disambiguation. All of those techniques exploit the same underlying resource: instruction-level parallelism (ILP), the parallelism that exists within a single sequential instruction stream among instructions that happen not to depend on one another.

ILP is what makes a wide deep modern processor faster than a narrow shallow one running the same program. The program is sequential as written; the hardware finds independence inside it and runs the independent pieces in parallel. The amount of ILP a program offers, and how much of it the hardware can extract, sets the upper bound on single-thread performance.

This chapter steps back from the implementation techniques of Part V and asks the broader questions. What is ILP, exactly? How is it limited? How much is there in real programs? What do compilers do to expose it? What are the techniques — beyond out-of-order execution — that have been tried? And why has single-thread performance roughly stagnated in recent years even as transistor counts have continued to grow?

01.What ILP Is

ILP is the property of a sequential program that lets some of its instructions execute concurrently without violating any dependencies. The instructions are written one after another, but their dependencies form a graph, not a linear chain. Instructions that lie on independent branches of the graph can execute in any order, including simultaneously.

A simple example:

int a = x + y;     // (1)
int b = z - w;     // (2)
int c = a * b;     // (3)

Instructions 1 and 2 have no dependence on each other; they can execute in parallel. Instruction 3 depends on both. The dependence graph is:

Figure: Dependence graph for three instructions: nodes 1 and 2 are independent, and node 3 depends on both, so it cannot issue until both have produced their results

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
  \draw[thick, fill=white] (0, 0) circle (0.45); \node at (0, 0) {(1)};
  \draw[thick, fill=white] (0, -2) circle (0.45); \node at (0, -2) {(2)};
  \draw[thick, fill=white] (3, -1) circle (0.45); \node at (3, -1) {(3)};
  \draw[->] (0.4, -0.1) -- (2.6, -0.9);
  \draw[->] (0.4, -1.9) -- (2.6, -1.1);

The graph has three instructions but only two levels of depth: 1 and 2 at the first level, 3 at the second. A processor with two ALUs could run 1 and 2 in one cycle and 3 in the next, finishing in two cycles instead of three. The ILP available to extract here is 1.5 (three instructions over two cycles).

ILP is bounded by two things:

Critical-path depth. The longest dependency chain in the graph determines the minimum time, regardless of how many parallel resources are available. If the program has a 1000-instruction dependency chain, even infinite hardware cannot complete it in less than 1000 cycles.
Width of the graph. At any given level, only as many instructions can run concurrently as there are independent ones. If the graph is narrow (few independent instructions at each level), the processor's wide pipeline is partially idle.

The ratio of total instructions to critical-path length is sometimes called the program's inherent ILP or available parallelism. It is a property of the program, not of any particular hardware.

02.Sources of Parallelism

Programs vary widely in how much ILP they offer. Several common sources:

Independent computation in a loop body. A loop body that performs several independent operations on each iteration's data has parallelism within each iteration:

for (int i = 0; i < n; i++) {
    a[i] = x[i] + y[i];
    b[i] = z[i] * w[i];
    c[i] = a[i] + b[i];
}

Within one iteration, the addition and multiplication are independent and can run in parallel. The third operation depends on both and runs after.

Independent iterations of a loop. A loop whose iterations are independent of each other has parallelism across iterations:

for (int i = 0; i < n; i++) {
    a[i] = x[i] + y[i];
}

Iteration $i$ does not depend on iteration $i-1$ . In principle, all iterations can execute in parallel. ILP techniques (loop unrolling, software pipelining) try to overlap iterations within a single thread; data parallelism (Chapter 29) and thread parallelism (Chapter 30) parallelize across iterations more aggressively.

Independent function calls. When several function calls have no dependencies between them, they can run in parallel.

Independent statements. Even outside loops, many programs have stretches of independent statements that can be reordered or run concurrently.

The amount of ILP available is highly workload-dependent. A scientific computation traversing dense arrays may have ILP of 10-100. A pointer-chasing program walking a linked list has ILP near 1: every iteration depends on the previous. Database operations, with their mix of computation and memory access, fall somewhere in between.

03.Measuring ILP

How much ILP does a real program have? The classic measurement, due to Wall (1991) and Postiff et al., simulates an ideal machine — one with perfect branch prediction, infinite execution units, infinite registers, and instant memory — and measures the resulting IPC (instructions per cycle). The number is an upper bound on what any real implementation could achieve.

Wall's results, on a mix of integer and floating-point benchmarks, found ILP in the range of 5-50 with varying assumptions. Programs with regular structure (numerical kernels, matrix multiplication, FFT) showed higher ILP than irregular control-flow-heavy code (compilers, interpreters, OS kernels).

The ILP exploitable by a real processor is much lower. The MIT limit study by Lam and Wilson (1992) showed that real-world ILP is bounded by:

Branch misprediction. Even small misprediction rates cap ILP because mispredictions force serialization at the branch boundary.
Memory aliasing. Loads and stores that may alias must be ordered, limiting parallelism.
Window size. A finite ROB cannot see far enough ahead to find distant parallelism.
Limited execution units. Even when parallelism exists, the hardware can only run as many ops per cycle as it has execution units.

A modern wide OoO processor extracts effective IPC of 2-4 on typical software, sometimes more on regular code, sometimes much less on memory-bound or branchy code. The ratio of extracted IPC to inherent ILP is a measure of how good the implementation is at finding parallelism, and how good the program is at exposing it.

04.Compiler Techniques for Exposing ILP

The compiler is the program's first optimizer. Even before the hardware tries to extract parallelism, the compiler reorganizes the code to maximize the parallelism the hardware can see.

Loop Unrolling

A small loop, repeated, has a per-iteration overhead (the branch, the increment) and limits the parallelism the hardware can find within a single iteration. Unrolling several iterations into one larger body amortizes the overhead and exposes more parallelism:

// original
for (int i = 0; i < n; i++) {
    a[i] = x[i] + y[i];
}

// unrolled by 4
for (int i = 0; i < n; i += 4) {
    a[i+0] = x[i+0] + y[i+0];
    a[i+1] = x[i+1] + y[i+1];
    a[i+2] = x[i+2] + y[i+2];
    a[i+3] = x[i+3] + y[i+3];
}

Four independent additions per iteration, instead of one. Modern processors with multiple ALUs can execute them simultaneously. The branch overhead is paid every fourth iteration instead of every iteration.

The cost of unrolling is code size: the binary is larger. Larger code may miss the I-cache or the µop cache more often, hurting performance. Compilers tune the unroll factor for each loop based on heuristics.

Software Pipelining

A more sophisticated transformation: software pipelining overlaps the execution of multiple iterations within a single loop body. Each iteration's first operation overlaps with the previous iteration's second, and the previous-previous iteration's third, and so on, like an instruction-level pipeline written into the program text.

// original
for (int i = 0; i < n; i++) {
    int t = load(a + i);
    int u = compute(t);
    store(b + i, u);
}

// software pipelined (sketch)
int t0 = load(a + 0);            // prologue
int u0 = compute(t0);
int t1 = load(a + 1);
for (int i = 2; i < n; i++) {
    store(b + i - 2, u0);        // store from 2 iters ago
    u0 = compute(t1);             // compute from 1 iter ago
    t1 = load(a + i);             // load this iter
}
store(b + n - 2, u0);            // epilogue
store(b + n - 1, compute(t1));

Each iteration of the steady-state loop now contains a load, a compute, and a store from three different original iterations. The dependencies are between iterations: the store waits for two-iterations-ago's compute, the compute waits for one-iterations-ago's load. The independent operations within each steady-state iteration can run in parallel.

Software pipelining is essential on simple in-order processors that cannot reorder dynamically. On modern OoO processors it is less critical — the hardware does similar reordering automatically — but it still helps by enlarging the effective window of independent work the hardware sees.

Instruction Scheduling

The compiler reorders instructions to fit a model of the target processor's pipeline. Instructions with long latency (loads, divides, FP operations) are scheduled early, so dependent instructions can wait without blocking the pipeline. Independent instructions are placed close together to fill the wide issue slots.

A compiler's instruction-scheduling pass has a model of the target processor — its issue width, its execution-port assignments, its instruction latencies — and it reorders instructions to fit that model. Modern compilers like GCC and LLVM have detailed schedulers for each major target. The output is the same program, expressed in machine instructions that match the target's pipeline.

Vectorization

A loop that performs the same operation on many independent data elements can be transformed to use vector (SIMD) instructions, which operate on multiple data elements per instruction. We will treat vectorization in detail in Chapter 29; for now, note that it is one of the compiler's most powerful tools for extracting parallelism. A loop summing 1000 floats can be transformed to use 8-wide SIMD operations, processing 8 floats per instruction, multiplying throughput by ~8.

Constant Propagation, Strength Reduction, and Friends

Many compiler optimizations indirectly help ILP by reducing the number of operations or breaking up dependency chains:

Constant propagation replaces a variable known to hold a constant with that constant, often allowing further optimizations.
Strength reduction replaces expensive operations with cheaper equivalents (multiplication by a constant becomes shifts and adds).
Common-subexpression elimination computes a shared value once instead of repeatedly.
Loop-invariant code motion moves code that does not depend on the loop variable out of the loop.

Each of these reduces the work in the program's critical path or shortens dependency chains, both of which improve ILP.

05.Hardware Techniques Beyond Out-of-Order

Out-of-order execution is the dominant ILP-extraction technique, but it is not the only one. Several alternatives have been tried.

VLIW: Very Long Instruction Word

VLIW machines push all the scheduling work into the compiler. The hardware is wide and parallel, but it does no dynamic scheduling: each very long instruction word contains several operations, and the hardware just dispatches them in lockstep, one slot to each functional unit.

Plain Text

A VLIW instruction word, 5 slots:
  | ALU op | ALU op | LOAD/STORE | FP op | BRANCH |

The compiler is responsible for filling each slot with an independent operation. If it cannot find one, the slot is filled with a NOP.

VLIW's advantages: the hardware is simpler (no rename, no scheduler, no ROB), the cycle time is faster, and the energy efficiency is higher. Its disadvantages: the compiler must know everything about the target's parallelism, including memory-aliasing behavior and runtime variability, which it cannot. NOPs accumulate when ILP runs out; binary compatibility across different VLIW widths is hard.

The Intel Itanium (mid-2000s) was a famous VLIW design. It struggled commercially: compilers couldn't extract enough ILP statically to match what OoO machines extracted dynamically, and recompilation was needed for every micro-architectural change. VLIW lives on in DSPs and certain embedded processors (TI's C66x family, Qualcomm's Hexagon DSP) where workloads are regular and recompilation is acceptable.

Trace Caches and Trace Scheduling

A trace cache stores instructions not in their original program-counter order but in executed order, including taken branches. If the same trace is taken repeatedly, the trace cache delivers the instructions in their predicted execution order, smoothing the front end and exposing more parallelism.

Trace scheduling, the compiler-side cousin, identifies a likely path through the program and schedules instructions across basic-block boundaries to fit that path. The resulting code is faster on the predicted path, slower on the unpredicted path. Combined with profile-guided optimization, it can improve ILP exposure significantly.

The Pentium 4 had a trace cache; modern Intel processors use a µop cache (Chapter 27) that combines some of the benefits without being a true trace cache.

Speculation Beyond Branches

OoO machines speculate past branches, but they could speculate further:

Value speculation. Predict a load's result before it returns and proceed; if the prediction is wrong, replay. Researched extensively but rarely deployed because mispredictions are too frequent and replays too costly.
Control speculation. Hoist instructions across branches at the compiler level, marking them so they can be undone. The Itanium ISA had explicit predicated execution and control speculation features for exactly this.
Address speculation. A load whose address is not yet known might be issued speculatively against a predicted address. Again, extensively researched.

These techniques have not made it into mainstream high-performance designs because the win/loss tradeoff has not been favorable: the win on correct speculation is modest, and the loss on incorrect speculation (replay, energy waste) is significant.

Helper Threads and Run-Ahead

A different angle: while the main thread is stalled on a cache miss, run another thread on the same core that prefetches the data the main thread will need. The main thread, when it resumes, finds the data in cache and runs faster.

Helper threads can be explicit (programmer- or compiler-generated) or implicit (the hardware itself spawns a speculative continuation when the main thread stalls). Implicit run-ahead has been studied; some early Sun chips did this. The technique is not common in current designs, but the idea recurs.

06.Pollack's Rule and the Economics of Wider Cores

A rough empirical observation, attributed to Fred Pollack of Intel, captures why wider single cores have such steep diminishing returns. Pollack's rule states that the performance of a processor design scales roughly as the square root of its complexity (measured in transistors or area):

$\text{performance} \propto \sqrt{\text{area}}.$

If a 4-wide core occupies area $A$ , an 8-wide core occupies somewhere around $4A$ but delivers only about $2\times$ the per-thread performance of the 4-wide. Power scales worse than area, because the additional logic is on the active path; the wider core also runs at slightly lower frequency because of longer wires and bigger structures. The result is that single-thread performance per unit area declines as cores grow wider, and single-thread performance per watt declines even faster.

The rule is not a law of physics; it is a regularity that has held across decades of high-performance CPU design and that captures the cumulative effect of all the diminishing-returns mechanisms we have described — the limited ILP in real programs, the difficulty of finding more parallelism without a larger window, the wire-delay penalty of bigger structures, the verification cost of more aggressive scheduling.

For a fixed transistor budget, the implication is clear: instead of one core of area $4A$ , build four cores of area $A$ . The total throughput on parallel workloads is roughly $4 \times \sqrt{A}$ versus $\sqrt{4A} = 2\sqrt{A}$ for the single big core — a factor-of-two advantage for the multi-core layout, in addition to the power-density advantage. This calculation, more than any architectural argument, drove the multi-core transition of the mid-2000s. It also explains why even "big" modern cores rarely exceed 8-wide: beyond that point, the area cost of the extra width is not justified by the modest ILP gains, and the transistors are better spent on more cores or larger caches.

A closely related observation is that heterogeneous designs, mixing big and little cores on the same die, can outperform homogeneous designs on workloads with a mix of single-thread and parallel demand. We will return to this in Chapter 30.

07.The ILP Wall

For decades, the recipe for faster processors was: more transistors, more aggressive ILP extraction, faster clocks. By the mid-2000s, this recipe ran into trouble.

Frequency stagnation. Clock speeds plateaued around 3-4 GHz in the mid-2000s. Higher frequencies caused unsustainable power and heat. The "easy" frequency improvements ran out.

ILP saturation. Even with deeper pipelines, larger ROBs, and wider issue, real programs revealed only finite parallelism. The marginal IPC gain from doubling the ROB became smaller and smaller. Pollack's rule — performance scales as the square root of complexity — quantifies the diminishing returns.

Power efficiency. Aggressive ILP extraction (deep pipelines, large ROBs, complex schedulers) costs power disproportionately. Each fractional IPC gain costs more watts than the last.

The industry's response was the multi-core revolution of the mid-2000s. Rather than building bigger and bigger single cores, chip designers put multiple medium-sized cores on a die. The total transistor budget grew, but it was spent on more cores rather than wider/deeper individual ones. Single-thread performance growth slowed dramatically; multi-thread performance grew with the core count.

The shift had software implications: applications that ran on one core were now stuck at single-core speed, while parallel applications scaled. Programmers had to learn to write parallel software. This is the world we have lived in since roughly 2005.

08.Single-Thread Performance Today

Single-thread performance has continued to improve, but slowly. Apple's M-series, AMD's Zen lineage, and Intel's recent generations have each delivered modest IPC gains and frequency gains, but nothing like the doublings of the late 1990s.

Where do the recent gains come from?

Larger ROBs and issue queues. More in-flight instructions, more visible parallelism.
Better branch predictors. TAGE-based designs reach 97-98% accuracy; further gains are at the margins.
Wider front ends. Apple's M-series decodes 8 instructions per cycle.
Smarter caches. Larger L1 and L2, better prefetching.
Lower-latency execution. Smaller chips and better processes give shorter cycle times.
Architectural extensions. New instructions (SIMD widening, vector extensions, specialized crypto and ML ops) shorten the dynamic instruction count.

But the era of doubling single-thread performance every two years is over. The growth in compute capability now comes mostly from parallelism (more cores, wider SIMD, GPUs and accelerators) rather than from extracting more ILP from a single thread.

09.When ILP Runs Out

The diminishing returns from ILP set the stage for the rest of Part VI. If a single thread cannot run faster, the alternatives are:

More threads. Run multiple threads simultaneously, either on separate cores or interleaved on a single core (SMT, simultaneous multithreading).
More data per instruction. Use SIMD/vector instructions to process multiple data elements at once.
More cores. Replicate the single-core machinery many times on a chip.
Specialized accelerators. Hand off specific workloads (graphics, ML, crypto) to fixed-function hardware.

Chapter 29 covers data-level parallelism: the SIMD and vector techniques that exploit the regular parallelism in numeric and graphics workloads. Chapter 30 covers thread-level parallelism: multi-core CPUs, simultaneous multithreading, and the programming models that target them. Chapter 31 covers the cache coherence and memory consistency that make multi-threaded programming tractable.

The ILP techniques of Part V are not obsolete — every modern core is OoO, with deep pipelines and aggressive speculation — but they are no longer the only lever, or even the dominant one, for getting more performance.

10.Summary

Instruction-level parallelism is the parallelism inherent in a sequential program: instructions whose dependencies allow them to run concurrently. The amount of ILP available depends on the program's dependency graph, varying from near 1 (pointer chasing) to dozens or hundreds (regular numerical code). The compiler exposes ILP through unrolling, software pipelining, instruction scheduling, vectorization, and a host of classic optimizations. The hardware extracts it through the techniques of Part V: pipelining, superscalar issue, OoO execution, register renaming, branch prediction.

The field has matured. The remaining gains from pure ILP are incremental, costly in power, and bounded by program structure. Performance growth has shifted toward exploiting parallelism the program does not express in a single instruction stream — across data elements, across threads, across cores, across machines.

The next chapter takes the first of those steps: data-level parallelism, where a single instruction operates on many data elements simultaneously.

Book mode

	int a = x + y; // (1)
	int b = z - w; // (2)
	int c = a * b; // (3)

	\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
	\draw[thick, fill=white] (0, 0) circle (0.45); \node at (0, 0) {(1)};
	\draw[thick, fill=white] (0, -2) circle (0.45); \node at (0, -2) {(2)};
	\draw[thick, fill=white] (3, -1) circle (0.45); \node at (3, -1) {(3)};
	\draw[->] (0.4, -0.1) -- (2.6, -0.9);
	\draw[->] (0.4, -1.9) -- (2.6, -1.1);

	for (int i = 0; i < n; i++) {
	a[i] = x[i] + y[i];
	b[i] = z[i] * w[i];
	c[i] = a[i] + b[i];
	}

	A VLIW instruction word, 5 slots:
	\| ALU op \| ALU op \| LOAD/STORE \| FP op \| BRANCH \|