Part IV·Microarchitecture·Chapter 24 of 62

Superscalar Execution

May 16, 2026·17 min read·intermediate

A scalar pipelined processor completes at most one instruction per cycle. To go further, the processor must do more than one thing at a time: issue, execute, and complete multiple instructions per cycle. This is superscalar execution. The hardware widens the pipeline so that each stage handles several instructions in parallel, raising peak throughput by a factor equal to the pipeline's width.

The idea is conceptually simple. Instead of one ALU, have two or three. Instead of one decode slot, have four. Instead of one writeback per cycle, have several. If the front end can supply enough instructions and the back end has enough independent work, the throughput rises proportionally to the width.

The execution is anything but simple. Issuing several instructions per cycle multiplies the number of inter-instruction interactions. Hazards become more numerous and more elaborate. Register-file ports must scale with width. Data dependencies between instructions issued in the same cycle have to be detected and handled. The designer who succeeded at building a 1-wide pipeline now has to build something several times more complex without slowing the cycle time.

This chapter develops superscalar execution as the next layer above pipelining. We start with the static, in-order superscalar machines that introduced the idea, then look at the issues — fetch and decode bandwidth, scheduling, register-file ports, dependency checking — that any wide design must solve. Out-of-order execution, the technique that makes wide designs really effective, is the subject of Chapter 25.

01.What Width Means

A processor's width is the maximum number of instructions it can issue per cycle. Width applies to several places along the pipeline:

Fetch width: instructions fetched from the I-cache per cycle.
Decode width: instructions decoded per cycle.
Issue width: instructions sent to execution units per cycle.
Execute width: the number of execution units that can operate concurrently.
Retire width: instructions completed per cycle.

These widths do not have to be the same. A processor might fetch 16 bytes per cycle (roughly 4-6 x86 instructions of average length), decode 4 instructions per cycle, issue 6 µops per cycle, execute on 10 execution ports, and retire 8 µops per cycle. Each stage's width is sized to what a typical workload needs at that point in the pipeline.

A "4-wide" processor usually means the issue width is 4. The other widths are sized to keep that issue rate fed and drained.

A useful refinement of the iron law:

$\text{IPC}_{\max} = \min(\text{fetch}, \text{decode}, \text{issue}, \text{retire})$

That is, peak IPC is set by the narrowest stage. Real IPC falls below this bound because of stalls, dependencies, and other inefficiencies, but the bound is a hard ceiling. To raise the ceiling, the stages must widen together.
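As a minimal sketch of this bound, the function below takes the minimum over four stage widths. The function name and the example widths are illustrative assumptions, and the sketch treats instructions and µops as the same unit, which a real x86 core does not.

```python
def ipc_ceiling(fetch: float, decode: float, issue: float, retire: float) -> float:
    """Upper bound on sustained IPC: the width of the narrowest stage."""
    return min(fetch, decode, issue, retire)


# Hypothetical stage widths, in instructions (or µops) per cycle.
balanced = ipc_ceiling(fetch=4, decode=4, issue=4, retire=4)
lopsided = ipc_ceiling(fetch=4, decode=4, issue=6, retire=8)

print(f"balanced design ceiling: {balanced}")
print(f"lopsided design ceiling: {lopsided}")
```

In the second configuration the wider issue and retire stages do not raise the ceiling; they only add headroom for draining bursts.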

02.Fetch and Decode Bandwidth

The front end must supply enough instructions per cycle to feed a wide back end. This is harder than it sounds.

Fixed-Width ISAs

For RISC ISAs with fixed-width instructions (RISC-V's 32-bit instructions, AArch64's 32-bit instructions), the fetch problem is straightforward. The I-cache delivers a cache line of bytes (typically 64 bytes), which contains 16 instructions. The fetch unit picks the right starting offset within the line based on the PC's low bits and forwards the next $W$ instructions to the decode stage, where $W$ is the fetch width.

The catch is taken branches. If a branch in the middle of the fetch group is taken, the instructions after the branch are discarded. A branch in instruction position 0 of a 4-wide fetch wastes 3 of the 4 fetched instructions. Average usable fetch bandwidth is therefore less than the peak; a typical number is 75-85% of peak after branches.
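A rough way to see where a figure like this comes from is to assume that each fetch slot independently holds a taken branch with some probability, and that everything after the first taken branch is discarded. The sketch below computes the expected number of useful slots under that assumption. The probabilities are illustrative, not measurements, and the model ignores fetch-group alignment and misprediction.

```python
def expected_useful(width: int, p_taken: float) -> float:
    """Expected useful instructions per fetch group.

    Assumes each slot independently holds a taken branch with probability
    p_taken and that all slots after the first taken branch are wasted.
    """
    # Slot k (0-based) is useful only if none of the k slots before it
    # held a taken branch; the taken branch itself still counts as useful.
    return sum((1.0 - p_taken) ** k for k in range(width))


for p in (0.05, 0.10, 0.15):
    useful = expected_useful(4, p)
    print(f"p_taken={p:.2f}: {useful:.2f} of 4 slots useful ({useful / 4:.0%})")
```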

Some designs use trace caches or µop caches that store sequences of instructions in their decoded form, eliminating the need to re-fetch and re-decode them. Intel's Sandy Bridge introduced a µop cache that became standard in subsequent generations.

Variable-Length ISAs

For x86, with its variable-length instructions (1 to 15 bytes), fetch is much harder. The fetch unit cannot tell how long an instruction is until it has begun to decode it. The front end has to scan the fetched bytes, identify instruction boundaries, and dispatch the resulting instructions to the decoders.

Modern x86 cores include a length-decode stage that pre-scans the fetched bytes for instruction boundaries. The result is then fed to several parallel decoders. Intel's front ends have typically paired one complex decoder, which can handle any instruction (including those that produce multiple µops), with several simple decoders that handle only single-µop instructions.

To avoid paying the variable-length-decode cost on every loop iteration, modern Intel and AMD cores cache the decoded µops. The first time a sequence of bytes is fetched, it is decoded normally and the resulting µops are stored in the µop cache (also called DSB, Decoded Stream Buffer, on Intel). Subsequent fetches of the same instructions hit the µop cache and skip the variable-length decode entirely. The µop cache typically delivers 6-8 µops per cycle, more than the legacy decoders, and at lower power.
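The caching behavior can be sketched as a lookup table keyed by fetch address. This is a toy model: the class name, the `fake_decode` stand-in for the legacy decoders, and the unbounded capacity are all assumptions, and real µop caches are set-associative structures with eviction and per-line µop limits.

```python
from typing import Callable, Dict, List


class UopCache:
    """Toy decoded-µop cache keyed by fetch address (no capacity limit, no eviction)."""

    def __init__(self, legacy_decode: Callable[[int], List[str]]) -> None:
        self._decode = legacy_decode
        self._lines: Dict[int, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, pc: int) -> List[str]:
        if pc in self._lines:
            self.hits += 1          # decoded µops are reused; decoders stay idle
            return self._lines[pc]
        self.misses += 1            # first sighting: pay the full decode cost
        uops = self._decode(pc)
        self._lines[pc] = uops
        return uops


def fake_decode(pc: int) -> List[str]:
    """Hypothetical stand-in for the length-decode and decode stages."""
    return [f"uop@{pc:#x}.0", f"uop@{pc:#x}.1"]


cache = UopCache(fake_decode)
for _ in range(100):                # a loop whose body spans two fetch blocks
    for pc in (0x400, 0x410):
        cache.fetch(pc)

print(f"misses={cache.misses}, hits={cache.hits}")
```

Only the first iteration of the loop goes through the decoders; every later iteration is served from the cache.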

The Front-End Bandwidth Wall

The front end is bandwidth-limited even before we count branches. A 4-wide processor needs about 16 bytes of instructions per cycle: exactly 4 instructions on a 32-bit fixed-width ISA, or roughly 4 or more instructions of average length on x86. The I-cache must therefore deliver at least 16 bytes per cycle, and the prediction infrastructure must produce a target every cycle.

For very wide processors (8-wide and beyond), the front end is often the bottleneck. The I-cache and predictor cannot deliver enough useful bytes per cycle to feed the back end. This is one reason why width does not scale indefinitely. Apple's M-series chips, with their 8-wide decode, devote enormous transistor budgets to the front end to keep up.

03.Static Scheduling: In-Order Superscalar

The simplest superscalar design issues multiple instructions per cycle but only when they are ready and independent. The hardware does not reorder instructions; it just looks at the next several in program order and issues as many as it can.

A 2-wide in-order superscalar processor in operation:

Assembly

add  x1, x2, x3      # group 1, issue slot 0
add  x4, x5, x6      # group 1, issue slot 1
mul  x7, x1, x4      # group 2, issue slot 0 (depends on both above)
sub  x8, x9, x10     # group 2, issue slot 1

Group 1 issues two independent adds in cycle N. Group 2 issues a mul (which depends on x1 and x4 from group 1) and a sub (independent) in cycle N+1, after both adds have completed (or after their results have been forwarded).

This works only when the program offers enough parallelism between adjacent instructions. If the very next instruction depends on the current one, the processor cannot issue them in the same cycle and the second slot goes empty:

Assembly

add  x1, x2, x3      # issue alone
add  x4, x1, x5      # waits because it depends on the first

The slot used by the second instruction in the previous example is empty here. The processor's effective IPC is well below its peak width.
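The grouping rule in the two examples above can be sketched as a small function. It assumes single-cycle latencies, so the only thing that ends a group is a full set of slots or a read-after-write dependency on an older instruction in the same group. The tuple encoding of instructions and the function name are illustrative choices.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, sources)


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Split a program into in-order issue groups of at most `width` instructions."""
    groups: List[List[Instr]] = []
    i = 0
    while i < len(program):
        group = [program[i]]
        written = {program[i][1]}          # destinations produced inside this group
        j = i + 1
        while j < len(program) and len(group) < width:
            _, dest, srcs = program[j]
            if any(s in written for s in srcs):
                break                      # depends on an older group member: stop here
            group.append(program[j])
            written.add(dest)
            j += 1
        groups.append(group)
        i = j
    return groups


independent = [
    ("add", "x1", ("x2", "x3")),
    ("add", "x4", ("x5", "x6")),
    ("mul", "x7", ("x1", "x4")),
    ("sub", "x8", ("x9", "x10")),
]
dependent = [
    ("add", "x1", ("x2", "x3")),
    ("add", "x4", ("x1", "x5")),
]

print([len(g) for g in issue_groups(independent, width=2)])
print([len(g) for g in issue_groups(dependent, width=2)])
```

The first program fills both slots in each group, while the second issues one instruction per cycle despite the 2-wide hardware.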

Static scheduling, when the compiler has enough freedom, can rearrange instructions to fit the issue slots. The compiler emits independent instructions in adjacent positions, deliberately scheduling the code for the target processor's width and latencies:

Assembly

# original
ld   x1, 0(x10)
add  x4, x1, x5

# scheduled
ld   x1, 0(x10)
add  x6, x7, x8     # filler instruction
add  x4, x1, x5

If the compiler can find enough independent work to fill the slots, an in-order superscalar processor can approach its peak IPC.

In-order superscalar designs were the entry point into multi-issue execution. The Intel Pentium (1993) was an early example: 2-wide in-order. The early ARM Cortex-A7 and Cortex-A53 are in-order with 2-wide issue. The first RISC-V designs were in-order. They are simple, area-efficient, and power-efficient, but their performance ceiling is set by the compiler's scheduling skill and the program's parallelism. Real workloads often have just enough irregular dependencies that in-order designs leave many slots empty.

04.Dependency Checking in Hardware

For a wide in-order processor, the hardware must check whether a group of instructions can issue together. The check is straightforward but grows quickly with width.

For a 4-wide group, the hardware compares each instruction's source registers against the destination registers of every older instruction in the group:

Instruction 0: no checks (it's the oldest).
Instruction 1: 2 source-register reads, each compared against 1 destination.
Instruction 2: 2 source-register reads, each compared against 2 destinations.
Instruction 3: 2 source-register reads, each compared against 3 destinations.

In total: 0 + 2 + 4 + 6 = 12 register-number comparisons. For an 8-wide group: 0 + 2 + 4 + 6 + 8 + 10 + 12 + 14 = 56 comparisons. The number scales roughly with $W^2$ , where $W$ is the width.
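The pattern has a closed form. Assuming every instruction has two source registers and one destination, instruction $i$ (counting from 0) needs $2i$ comparisons, so the total for a group of width $W$ is:

$C(W) = \sum_{i=0}^{W-1} 2i = 2 \cdot \frac{(W-1)\,W}{2} = W(W-1)$

This gives $C(4) = 12$ and $C(8) = 56$, matching the counts above, and makes the quadratic growth explicit.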

If any source register matches an older destination in the same group, the issue logic must either stall the dependent instruction (issuing only the older ones in this cycle) or, in some designs, wire the older instruction's result directly into the dependent instruction's input via in-group forwarding.
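The check itself can be sketched as follows. The function compares every source against every older destination, as the parallel comparators in hardware would, rather than stopping at the first match. It models the stall policy only, not in-group forwarding, and the encoding and names are illustrative assumptions.

```python
from typing import List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination, sources)


def check_group(group: List[Instr]) -> Tuple[int, int]:
    """Return (instructions that can issue together, comparisons performed)."""
    comparisons = 0
    first_blocked = len(group)
    for i, (_, srcs) in enumerate(group):
        for src in srcs:
            for older_dest, _ in group[:i]:
                comparisons += 1                    # one comparator per (source, older dest) pair
                if src == older_dest:
                    first_blocked = min(first_blocked, i)
    # In-order issue: the first dependent instruction and everything younger must wait.
    return first_blocked, comparisons


group = [
    ("x1", ("x2", "x3")),
    ("x4", ("x5", "x6")),
    ("x7", ("x1", "x4")),   # reads x1 and x4, both written earlier in the group
    ("x8", ("x9", "x10")),
]
can_issue, comparisons = check_group(group)
print(f"{can_issue} of {len(group)} can issue; {comparisons} comparisons")
```

Note that the fourth instruction is independent but still waits, because an in-order machine does not issue past a stalled instruction.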

This $O(W^2)$ scaling is one of the reasons very wide in-order designs are uncommon. Most go up to 4 wide; beyond that, the dependency-check logic becomes a critical-path concern.

05.Multiple Execution Units

A wide processor needs multiple execution units to run several instructions concurrently. A typical modern back end might have:

4 integer ALUs.
2 load units and 1 store unit.
1 or 2 integer multipliers.
1 or 2 dividers.
4 floating-point / SIMD units (each capable of FP add, FP multiply, or vector operations).
1 or 2 branch units.

These units are organized into issue ports (Intel calls them "execution ports", AMD "pipes"). Each port can dispatch one instruction per cycle. An instruction's port assignment depends on its type: an integer add can go to any of the integer ALU ports, a load goes to one of the load ports, etc.

A modern Intel core might have 10 execution ports, structured roughly as:

Port	Capabilities
0	Integer ALU, FP/SIMD, branch
1	Integer ALU, integer multiply, FP/SIMD
2	Load
3	Load
4	Store data
5	Integer ALU, FP/SIMD
6	Integer ALU, branch
7	Store address
8	Store address
9	Store data

(The exact structure varies by generation.) A program that uses a mix of operations distributes naturally across ports; a program that uses only one type (say, only integer adds) is bottlenecked by the number of ports that can do that type.
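The bottleneck argument can be sketched with a toy port map. The six-port layout below is hypothetical and is not meant to match the table or any real core. The greedy binding policy is also an assumption, since real schedulers use more careful heuristics.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical six-port machine: port name -> operation classes it accepts.
PORTS: Dict[str, Set[str]] = {
    "p0": {"alu", "fp"},
    "p1": {"alu", "fp"},
    "p2": {"load"},
    "p3": {"load"},
    "p4": {"store"},
    "p5": {"alu"},
}


def peak_per_cycle(op_class: str) -> int:
    """Most operations of one class that can start in a single cycle."""
    return sum(op_class in caps for caps in PORTS.values())


def bind(ops: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Greedy binding: each op takes the first free port that supports it."""
    free = dict(PORTS)
    assignment: List[Tuple[str, Optional[str]]] = []
    for op in ops:
        port = next((name for name, caps in free.items() if op in caps), None)
        if port is not None:
            del free[port]              # a port accepts one operation per cycle
        assignment.append((op, port))   # port is None if the op must wait
    return assignment


mixed = bind(["alu", "load", "fp", "store"])
adds_only = bind(["alu"] * 4)
print(mixed)
print(adds_only)
```

The mixed group binds all four operations, while the fourth add in the adds-only group finds no free ALU port and has to wait a cycle.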

The scheduler — the logic that decides which instruction goes to which port each cycle — has to be fast and small enough to fit in a single cycle. A scheduler that takes two cycles to make an issue decision adds two cycles to the back end's pipeline, hurting performance.

For an in-order machine, the scheduler is simple: take the next several instructions and issue each one to a port that can handle it, subject to dependency constraints. For an out-of-order machine (Chapter 25), the scheduler is among the most complex pieces of logic on the chip.

06.Register-File Ports

The register file must serve all the in-flight instructions. A 4-wide processor that issues 4 ALU ops per cycle needs 8 read ports (2 sources per ALU op) and 4 write ports. Counting load values being forwarded, store values being read, and other paths, the actual port count is even higher.

Each register-file port costs area and power. In a multi-ported design, area grows roughly with the square of the port count, because every additional port adds wires through the entire array; for a fixed area budget, capacity therefore drops as ports grow.
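A first-order sketch of this scaling, assuming two sources and one destination per operation and an area that grows with the square of the total port count. Both functions and their defaults are illustrative; real register files deviate from the pure quadratic model.

```python
from typing import Tuple


def regfile_ports(width: int, srcs_per_op: int = 2, dests_per_op: int = 1) -> Tuple[int, int]:
    """Read and write ports needed to issue `width` ALU operations per cycle."""
    return width * srcs_per_op, width * dests_per_op


def relative_area(width: int, baseline_width: int = 1) -> float:
    """Area relative to a baseline design, assuming area ~ (total ports)^2."""
    ports = sum(regfile_ports(width))
    base_ports = sum(regfile_ports(baseline_width))
    return (ports / base_ports) ** 2


for w in (1, 2, 4, 8):
    reads, writes = regfile_ports(w)
    print(f"{w}-wide: {reads} read + {writes} write ports, "
          f"about {relative_area(w):.0f}x the 1-wide area")
```

Under this model a 4-wide register file with the same capacity costs about 16 times the area of a single-issue one, which is why the techniques below try to reduce the ports per array.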

For very wide designs, a single multi-ported register file becomes impractical. Several solutions exist.

Banked register files. Split the register file into multiple banks, each with fewer ports. Instructions are routed to whichever bank holds their operands. This works well when register usage is well-distributed across banks, but introduces conflicts when several instructions need the same bank simultaneously.

Clustered execution. Group execution units into clusters, each with its own register file. Instructions and operands are routed to the cluster best suited to handle them. Forwarding between clusters is slower than within a cluster, so the scheduler tries to keep dependent instructions in the same cluster. The Alpha 21264 (1998) used a two-cluster design; Apple's M-series uses some clustered structure as well.

Physical register files in OoO designs. With register renaming (Chapter 25), the physical register file is much larger than the architectural one (perhaps 200+ entries vs 32 architectural registers). A large physical register file with many ports is a major engineering challenge. Modern designs use a combination of techniques: banked physical register files, separate FP and integer registers, and replication of frequently-read entries.

07.Forwarding in a Wide Pipeline

The forwarding network we saw in Chapter 22 — wires from later pipeline registers back to the EX inputs — gets much bigger in a wide superscalar.

For a 4-wide pipeline with several pipeline stages where a value might be available, the number of forwarding paths to each ALU input is large. Each ALU input needs to be able to take its value from:

The register file (the default).
Any of the 4 ALU outputs from the previous cycle.
Any of the 4 ALU outputs from two cycles ago (for instructions still in MEM or WB).
Any of the load-result paths.

The total number of forwarding paths is roughly $W \times \text{(number of forwarded sources)}$, and because the number of forwarded sources itself grows with $W$, the total grows roughly as $W^2$. Each path is a wire crossing through the entire execution datapath. Wire delay in modern processes is significant; long forwarding wires can become critical paths.
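The source list above can be turned into a rough count. The sketch assumes two operand inputs per ALU, two bypassed stages, and two load-result paths; the load-path count and the function names are assumptions made for illustration.

```python
def sources_per_input(width: int, bypass_stages: int = 2, load_paths: int = 2) -> int:
    """Mux inputs feeding one ALU operand under full forwarding."""
    register_file = 1
    alu_results = width * bypass_stages   # every ALU output, at each bypassed stage
    return register_file + alu_results + load_paths


def total_forwarding_paths(width: int, bypass_stages: int = 2, load_paths: int = 2) -> int:
    """Paths across all ALU operand inputs (two operands per ALU)."""
    alu_inputs = 2 * width
    return alu_inputs * sources_per_input(width, bypass_stages, load_paths)


for w in (1, 2, 4, 8):
    print(f"{w}-wide: {sources_per_input(w)} sources per input, "
          f"{total_forwarding_paths(w)} paths in total")
```

Doubling the width from 4 to 8 more than triples the path count in this model, since both the number of inputs and the number of sources per input grow.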

To bound the wire complexity, designers limit the forwarding network: not every output forwards to every input. Some forwarding is between adjacent ports only; others are full broadcast. The scheduler is aware of which forwarding paths exist and accounts for them when deciding when to issue.

08.Clustered Execution

A fully-connected forwarding network across $W$ pipes scales as $W^2$ in wires and as worse than that in physical layout: every ALU output has to reach every ALU input, every load-result path has to reach every register-file write port, and the wires must cross the active datapath. At small widths (2 or 4) the cost is tolerable; at 8 or wider, the cost in area, power, and critical-path delay is enough to bound the design.

Clustered designs split the back end into two or more groups, each with its own register-file copy and its own forwarding network. Within a cluster, forwarding is fast (short wires, single cycle); between clusters, forwarding takes an extra cycle to cross the boundary, and the register-file copies must be kept synchronized by writing every result into both. The scheduler then tries to issue dependent instructions to the same cluster, exploiting the locality of register dependencies in real code.
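One way to picture the steering decision is a simple heuristic: send an instruction to the cluster that produced its first in-flight source, and otherwise to the least-loaded cluster. The heuristic below is a hypothetical sketch, not the policy of any particular processor, and it counts each cross-cluster operand as one slow forward.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination, sources)


def steer(program: List[Instr], n_clusters: int = 2) -> Tuple[List[int], int]:
    """Return (cluster chosen per instruction, number of cross-cluster forwards)."""
    home: Dict[str, int] = {}          # register -> cluster that last produced it
    load = [0] * n_clusters            # instructions assigned to each cluster
    placement: List[int] = []
    cross_forwards = 0
    for dest, srcs in program:
        producers = [home[s] for s in srcs if s in home]
        if producers:
            cluster = producers[0]             # follow the first known producer
        else:
            cluster = load.index(min(load))    # no dependency: balance the load
        cross_forwards += sum(1 for c in producers if c != cluster)
        load[cluster] += 1
        home[dest] = cluster
        placement.append(cluster)
    return placement, cross_forwards


program = [
    ("x1", ("x2", "x3")),
    ("x4", ("x5", "x6")),
    ("x7", ("x1", "x4")),   # operands come from two different clusters
    ("x8", ("x9", "x10")),
]
placement, cross_forwards = steer(program)
print(placement, cross_forwards)
```

The third instruction has producers in both clusters, so whichever cluster it lands in, one operand pays the extra cross-cluster cycle.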

The Alpha 21264 (1998) was the first prominent commercial clustered design, with two integer clusters of two pipes each. Its IPC suffered slightly from the cross-cluster latency but its frequency benefited substantially from the shorter wires. Several modern designs use related techniques: AMD's Bulldozer and its descendants used clustered integer pipes; Apple's M-series cores reportedly use a clustered organization at their 8-wide back end; many GPU shader cores are clustered by design. Intel's recent Xeon and Core designs partition the integer and FP/SIMD pipes into separately-scheduled clusters, partly for the same reason.

The clustering decision is one of the more consequential micro-architectural choices in a wide design. It trades extra latency on cross-cluster dependencies for greater total width and shorter wires within each cluster, and the right balance depends on the workload mix and the process technology's wire-delay characteristics.

09.Limitations of In-Order Superscalar

In-order superscalar designs have a fundamental ceiling: they can only run as fast as the program's local parallelism allows. If a long-latency operation — a cache miss, a divide, an FP square root — is on the path, the dependent instructions stall behind it, and the wide back end goes idle.

Consider:

Assembly

ld    x1, 0(x10)         # cache miss, ~300 cycles
add   x2, x1, x3         # depends on x1
mul   x4, x5, x6         # independent
sub   x7, x8, x9         # independent
xor   x11, x12, x13      # independent

The load misses in the cache and waits 300 cycles for memory. The next instruction depends on x1 and stalls behind the load. In an in-order design, all subsequent instructions also stall, even though mul, sub, and xor are independent of x1 and could execute right away. The wide pipeline runs at zero IPC for the entire 300-cycle window.
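The gap between what must wait and what does wait can be made explicit. The sketch below finds the instructions that truly depend on the missing load, directly or transitively, and compares them with the set an in-order machine blocks. The encoding and function names are illustrative assumptions.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, sources)


def true_dependents(program: List[Instr], missing_reg: str) -> List[int]:
    """Indices of instructions that need the missing value, directly or transitively."""
    tainted = {missing_reg}
    dependents: List[int] = []
    for idx, (_, dest, srcs) in enumerate(program):
        if any(s in tainted for s in srcs):
            tainted.add(dest)          # its result is late too
            dependents.append(idx)
        else:
            tainted.discard(dest)      # register overwritten with a ready value
    return dependents


def blocked_in_order(program: List[Instr], missing_reg: str) -> List[int]:
    """In-order issue blocks the first dependent and every younger instruction."""
    dependents = true_dependents(program, missing_reg)
    return list(range(dependents[0], len(program))) if dependents else []


after_load = [
    ("add", "x2", ("x1", "x3")),
    ("mul", "x4", ("x5", "x6")),
    ("sub", "x7", ("x8", "x9")),
    ("xor", "x11", ("x12", "x13")),
]
print("truly dependent:", true_dependents(after_load, "x1"))
print("blocked in order:", blocked_in_order(after_load, "x1"))
```

Only one of the four instructions actually needs x1; the other three are blocked purely by program order.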

Modern programs have these long stalls frequently. Cache misses to DRAM are common; pointer chasing in linked data structures generates cascades of misses; memory-bound workloads are fundamentally limited. An in-order processor cannot hide any of this latency.

The fix is to let the processor reorder instructions: pull independent younger instructions past the stalled older one, execute them while the load is in flight, and only stall when there is genuinely nothing else to do. This is out-of-order execution, the subject of Chapter 25.

In-order superscalar processors still have their place — in low-power cores, in microcontrollers, in cores where simplicity and predictability matter more than peak throughput. But essentially every high-performance core since the late 1990s has been out-of-order. The simplest path to high IPC is to combine wide issue with out-of-order scheduling.

10.A Concrete Example: 4-Wide In-Order

To make the mechanics concrete, consider a 4-wide in-order machine running:

Assembly

add  x1, x2, x3
ld   x4, 0(x10)
sub  x5, x1, x6
mul  x7, x8, x9
xor  x11, x4, x12
add  x13, x14, x15

The issue logic looks at the first 4 instructions and checks for dependencies:

Instruction 1: independent. Issue.
Instruction 2: independent. Issue.
Instruction 3: depends on x1 (from instruction 1). Cannot issue this cycle (assume a result cannot be forwarded to an instruction issued in the same cycle). Stop here.

Cycle N: instructions 1, 2 issue. Instructions 3, 4 wait. Two of four slots used.

In cycle N+1, the front end advances. Instruction 1's result is now available via forwarding. The issue logic looks at instructions 3, 4, 5, 6:

Instruction 3: x1 forwarded. Issue.
Instruction 4: independent. Issue.
Instruction 5: depends on x4, which is being loaded. Load result not yet available (assume 3-cycle load latency). Cannot issue. Stop.

Cycle N+1: instructions 3, 4 issue. Instructions 5, 6 wait. Two of four slots used.

Cycle N+2: nothing new can issue (still waiting for load). Zero of four slots used.

Cycle N+3: load completes, x4 available. Instruction 5 can issue, then 6. Two of four slots used.

Total: 6 instructions issued over a window of 4 cycles, where the peak would have been 16. Effective IPC = 6/4 = 1.5, well below the 4-wide peak. This is typical of in-order superscalar performance: the width is wasted whenever dependencies prevent dense packing.
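The walkthrough can be reproduced with a small simulator. It is a sketch under the same assumptions as the text: a 3-cycle load, no same-cycle forwarding, and strict in-order issue. It additionally assumes every other operation, including mul, has a 1-cycle latency, which does not matter here because nothing reads x7. Cycle 0 corresponds to cycle N.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, sources)

LATENCY = {"ld": 3}  # cycles until a result can be forwarded; everything else takes 1


def simulate_in_order(program: List[Instr], width: int) -> List[int]:
    """Return the issue cycle of each instruction on an in-order machine."""
    ready_at: Dict[str, int] = {}      # register -> first cycle its value is usable
    issue_cycle: List[int] = []
    cycle = 0
    while len(issue_cycle) < len(program):
        slots_used = 0
        while slots_used < width and len(issue_cycle) < len(program):
            op, dest, srcs = program[len(issue_cycle)]
            if any(ready_at.get(s, 0) > cycle for s in srcs):
                break                  # a stalled instruction blocks everything younger
            ready_at[dest] = cycle + LATENCY.get(op, 1)
            issue_cycle.append(cycle)
            slots_used += 1
        cycle += 1
    return issue_cycle


program = [
    ("add", "x1", ("x2", "x3")),
    ("ld", "x4", ("x10",)),
    ("sub", "x5", ("x1", "x6")),
    ("mul", "x7", ("x8", "x9")),
    ("xor", "x11", ("x4", "x12")),
    ("add", "x13", ("x14", "x15")),
]
cycles = simulate_in_order(program, width=4)
total_cycles = cycles[-1] + 1
ipc = len(program) / total_cycles
print(cycles, f"IPC = {ipc}")
```

The issue cycles come out as two instructions in cycle 0, two in cycle 1, none in cycle 2, and two in cycle 3: six instructions in four cycles, an IPC of 1.5.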

Out-of-order execution would let the processor pull instruction 6 forward, executing it while waiting for the load. We will see how in the next chapter.

11.Summary

Superscalar execution widens the pipeline so multiple instructions occupy each stage. A 4-wide processor can in principle complete 4 instructions per cycle, multiplying throughput by 4 over a single-issue pipeline. The hardware cost is significant: more execution units, more register-file ports, more forwarding paths, more dependency-check logic. Some of these costs grow linearly with width; others, such as dependency checking and forwarding, grow roughly with its square.

In-order superscalar designs issue several instructions per cycle in program order, stalling whenever the next instruction has an unresolved dependency. They achieve good performance on regular code with abundant local parallelism, but suffer on programs with frequent stalls — cache misses, long-latency operations, irregular dependency patterns. The ceiling on IPC is bounded by what the compiler's static scheduling can extract.

Out-of-order execution removes that ceiling by letting the hardware reorder instructions dynamically, executing independent younger instructions while older ones wait. The price is a much more elaborate back end. Chapter 25 develops it in full.
