Part IV: Microarchitecture

Load/Store Micro-Architecture

May 16, 2026·24 min read·intermediate

Of all the operations a processor performs, memory accesses are the most fragile. They are slow (a cache miss costs hundreds of cycles), they have complex ordering requirements (the program's correctness can depend on the order in which loads and stores observe memory), and they interact with every other piece of the machine — caches, TLBs, virtual memory, coherence protocols, the architectural memory model. The out-of-order machinery of Chapter 25 makes register operations fast, but applying the same techniques to memory operations requires a substantial extra apparatus.

This chapter zooms in on the memory side of the back end. We look at the load/store queue, the structure that tracks in-flight memory operations; memory disambiguation, the problem of deciding when a load and an older store may alias; store-to-load forwarding, the optimization that lets a load read from a store still in flight; and the techniques modern processors use to speculate aggressively on memory while still preserving the architectural memory model. The chapter also describes how the data cache integrates into the back end and what kinds of replays and recoveries happen when memory speculation fails.

01. Why Memory Is Hard

Register operations are easy because the operands are named. Two instructions that write x1 are obvious to detect; two that read x1 are obvious to satisfy. Renaming and the issue queue handle them straightforwardly.

Memory operations are not named in the same way. A store at address [x10] and a load at address [x11] may or may not access the same byte of memory; the answer depends on the runtime values of x10 and x11. The hardware cannot tell at decode time whether a load needs to wait for an older store. It can only know after both addresses are computed.

This is memory aliasing ambiguity, and it complicates several things:

  1. Ordering. A load must see the value of the most recent older store to the same address. Out-of-order execution cannot let the load skip past that store unless it knows they do not alias.
  2. Speculation. A processor may speculate that a load does not alias any older store and let it execute early. If the speculation turns out wrong, the load and its dependents must be replayed.
  3. Forwarding. A store still in flight (not yet committed to memory) may have produced the value a younger load needs. The hardware can forward the value directly, but only if the addresses match.
  4. The memory model. The architecture specifies what orderings are visible to other CPUs. The hardware must respect those constraints even when reordering memory operations within a single core.

A large fraction of an OoO processor's complexity sits in the memory subsystem to handle these issues correctly and quickly.

02. The Load/Store Queue

In-flight memory operations live in a structure called the load/store queue (LSQ). Some designs split this into two: a load queue (LQ) holding loads and a store queue (SQ) holding stores. The terminology and exact structure vary, but the function is the same.

Each LSQ entry holds:

  • The kind of operation (load or store).
  • The age (program order, typically encoded as a ROB index).
  • The memory address (filled in once address generation completes).
  • The size (1, 2, 4, 8 bytes, or wider for vector loads).
  • The data value (for stores; the value to be written).
  • Status flags: address-resolved, data-ready, completed, retired, etc.
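The fields above map naturally onto a small record type. The following is a minimal Python sketch of one entry, not a description of any real design: the class and field names are illustrative assumptions, and real hardware packs these fields into fixed-width bit vectors.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OpKind(Enum):
    LOAD = "load"
    STORE = "store"


@dataclass
class LSQEntry:
    """One in-flight memory operation (all names here are hypothetical)."""
    kind: OpKind
    rob_index: int                  # age: position in program order
    size: int                       # access width in bytes
    address: Optional[int] = None   # unknown until address generation completes
    data: Optional[int] = None      # stores only: the value to be written
    completed: bool = False
    retired: bool = False

    @property
    def address_resolved(self) -> bool:
        return self.address is not None

    @property
    def data_ready(self) -> bool:
        return self.data is not None


# A store enters the queue with neither its address nor its data known.
store = LSQEntry(kind=OpKind.STORE, rob_index=12, size=8)
store.address = 0x7FFF_0008   # filled in after address generation
store.data = 42               # filled in once the source register is ready
```

Deriving the address-resolved and data-ready flags from the optional fields keeps the sketch short; hardware would hold them as explicit status bits.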

Memory operations enter the LSQ at rename time, in program order. They leave the queue at different times depending on type:

  • Stores stay in the SQ until they retire from the ROB. Only then are they written to the cache (or sent to a write buffer below the cache). This in-order retirement of stores is what makes precise exceptions possible: if an older instruction faults, the store has not yet committed and can be discarded.
  • Loads complete (return their value to the destination physical register) as soon as their address and data are available, even before retirement. They stay in the LQ until they retire — not for the data, but to track them for ordering and recovery purposes.

The LSQ's size, like the ROB's, is a major performance lever. A larger LSQ allows more memory operations in flight, which means more memory-level parallelism (more cache misses outstanding simultaneously). Modern LSQs hold 100-200+ entries split between loads and stores.

03. Address Generation and Translation

A memory operation's lifetime in the back end has several phases:

  1. Issue. The instruction issues from the issue queue when its source registers (the base and index for the address) are ready.
  2. Address generation. A dedicated address generation unit (AGU) computes the effective address from the base, index, scale, and displacement. This is a simple add, taking 1 cycle. After AGU, the address is known.
  3. Translation. The virtual address is translated to a physical address by the TLB. We will cover the TLB in detail in Chapter 50; for now, assume it is a small cache of recent translations and produces a physical address in 1 cycle (or stalls on a TLB miss while the page-table walker fetches the translation).
  4. Cache access. The physical address is sent to the L1 data cache. The cache returns the data in 3-5 cycles for a hit, or initiates a miss-handling sequence that takes much longer for a miss.
  5. Writeback (for loads). The loaded value is written to the destination physical register and broadcast for waking up dependents.
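The address-generation and translation steps can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, the scale values of 1, 2, 4, or 8 follow x86-style addressing, and the 4 KiB page size is an assumption.

```python
MASK64 = (1 << 64) - 1   # addresses wrap at 64 bits
PAGE_SHIFT = 12          # assumed 4 KiB pages


def effective_address(base, index=0, scale=1, disp=0):
    """Compute base + index * scale + disp, as an AGU would."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK64


def split_virtual_address(va):
    """Split an address into (virtual page number, page offset).

    The TLB translates only the page number; the offset passes through
    unchanged into the physical address.
    """
    return va >> PAGE_SHIFT, va & ((1 << PAGE_SHIFT) - 1)


ea = effective_address(0x1000, index=3, scale=8, disp=-8)
vpn, offset = split_virtual_address(ea)
```

Here `ea` is 0x1010: 0x1000 plus 3 * 8 minus 8. Its page number is 1 and its page offset is 0x10.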

For a store, address generation, translation, and a partial cache lookup happen at issue time, but the store does not actually write the cache until it retires.

The AGU is a special functional unit that can be a separate execution port or share a port with simple ALU operations (since the address calculation is just an integer add). High-performance cores have multiple AGU ports — typically two or three, paired with two or three load ports and one or two store-address ports.

04. Memory Disambiguation

When a load issues, several older stores may still be in the SQ with their addresses not yet computed. The hardware must decide:

  • Does the load alias any older unresolved store? If so, the load should wait.
  • Does the load alias any older resolved store? If so, the load should forward from that store rather than read the cache.
  • Are all older stores resolved and non-aliasing? Then the load can read the cache freely.

The simplest correct policy is to wait: a load does not issue until all older stores have computed their addresses. This always preserves correctness, but it serializes loads behind stores excessively, hurting performance.

Modern processors instead speculate. When a load issues, it checks against all older resolved stores in the SQ. If no resolved store aliases, the load proceeds. The hardware bets that no older unresolved store will eventually alias.

When does this bet go wrong? When an older store, after the load has executed, computes its address and finds that it aliases the load. The load read its data from somewhere (the cache, perhaps, or an even older store), but the correct data was the value of this older store, which the load failed to forward from. The load and all its dependents are now incorrect.

The hardware detects this by, when a store's address is finally known, checking against younger loads that have already executed. If a match is found, the offending load and all instructions that depend on it (transitively) are squashed and re-executed.

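The detection structure is sometimes called the memory order buffer (MOB). It is closely related to the store-to-load forwarding network but searches in the opposite direction: forwarding has a load look up older stores, while violation detection has a store look up younger loads. It is a CAM (content-addressable memory) where loads' addresses are stored when they execute, and stores compare their addresses against all younger entries when they resolve.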
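The check a store performs when its address resolves can be sketched as follows. This is a simplified Python model under stated assumptions: the names are hypothetical, ages are plain integers with no ROB wraparound, and the model does not account for a load that correctly forwarded from a store younger than the resolving one (which real hardware would not treat as a violation).

```python
from dataclasses import dataclass


@dataclass
class ExecutedLoad:
    rob_index: int   # age in program order (larger means younger)
    address: int
    size: int


def overlaps(a_addr, a_size, b_addr, b_size):
    """True if byte ranges [a, a+size) and [b, b+size) intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def find_violation(store_rob, store_addr, store_size, executed_loads):
    """Return the age of the oldest younger load that overlaps the store.

    Everything from that load onward would be squashed and re-executed.
    Returns None when no already-executed younger load aliases the store.
    """
    victims = [
        ld.rob_index
        for ld in executed_loads
        if ld.rob_index > store_rob
        and overlaps(store_addr, store_size, ld.address, ld.size)
    ]
    return min(victims) if victims else None


loads = [
    ExecutedLoad(rob_index=5, address=0x100, size=8),   # older than the store
    ExecutedLoad(rob_index=9, address=0x104, size=4),   # younger, overlaps
    ExecutedLoad(rob_index=11, address=0x200, size=8),  # younger, no overlap
]
squash_from = find_violation(7, 0x100, 8, loads)
```

The list comprehension plays the role of the CAM: hardware performs all the comparisons in parallel in a single lookup.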

The speculation is right most of the time. Real programs rarely have a load and an older unresolved store that genuinely alias — the compiler usually arranges things so that aliasing pairs are close together and resolved in order. The cost of an occasional load-replay is much less than the cost of always serializing loads behind stores.

Some recent processors take this further with a memory dependence predictor (sometimes called a store-set predictor). Loads and stores that have been observed to alias in the past are recorded; future instances of those loads are forced to wait for the corresponding stores. Loads not in the predictor's table speculate freely. This catches the small set of repeatedly-aliasing operations (which would otherwise replay over and over) without paying the cost of always-waiting on the rest.

05. Store-to-Load Forwarding

A store still in the SQ has not yet written to the cache, but its data is sitting in the SQ entry. A younger load to the same address can read directly from the SQ rather than waiting for the store to retire and the cache to be updated.

This store-to-load forwarding is essential for performance. A typical pattern: a function pushes a value to the stack and immediately uses it in the called function:

Assembly
mov [rsp - 8], rax ; spill rax to the stack
...
mov rcx, [rsp - 8] ; reload it

The reload from [rsp - 8] should not have to wait for the store to write to the cache. With forwarding, the SQ holds the value, and the load reads it from there — typically in 4-6 cycles, similar to a cache hit.

For forwarding to work, every byte the load reads must be covered by a single older store: the youngest older store that overlaps the load. Several complications arise:

  • Partial overlap. If a 4-byte load reads bytes 0-3 of a location and an older 8-byte store wrote bytes 0-7 of the same location, the forwarding can in principle extract the right 4 bytes from the store data. Most modern processors handle this case.
  • Multi-store overlap. If a load reads bytes that come from two different older stores (rare but possible), forwarding becomes much harder. Most processors detect this case and stall the load until both stores commit.
  • Cross-line. If the load straddles a cache-line boundary and the load's two halves come from different stores, similarly hard. Most processors handle aligned forwarding well and stall on misaligned cases.
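The cases above can be summarised as a decision function. The Python sketch below is a simplification with hypothetical names: it assumes little-endian byte order, assumes all older stores have resolved addresses and ready data, and ignores cache-line boundaries.

```python
def forward(load_addr, load_size, older_stores):
    """Decide how a load is satisfied.

    older_stores: (addr, size, data) tuples, oldest first.
    Returns ("forward", value), ("stall", None), or ("cache", None).
    """
    load_end = load_addr + load_size
    for addr, size, data in reversed(older_stores):  # youngest store first
        end = addr + size
        if end <= load_addr or load_end <= addr:
            continue                      # no overlap: look at older stores
        if addr <= load_addr and load_end <= end:
            # The store covers the whole load: extract the bytes.
            shift = 8 * (load_addr - addr)
            mask = (1 << (8 * load_size)) - 1
            return "forward", (data >> shift) & mask
        # The store supplies only some of the bytes: cannot forward.
        return "stall", None
    return "cache", None


stores = [(0x100, 8, 0x1122334455667788)]
low_half = forward(0x100, 4, stores)    # bytes 0-3 of the store
high_half = forward(0x104, 4, stores)   # bytes 4-7 of the store
straddle = forward(0x104, 8, stores)    # needs bytes the store lacks
```

Searching from the youngest store backward matters: if two older stores wrote the same address, the load must see the more recent one.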

Forwarding misses (cases where the load cannot forward and must wait) are a real performance pothole. Compilers try to avoid emitting code patterns that hit these limitations (e.g., overlapping stores of different sizes followed by a load that spans both).

06. The Memory Model and Speculation

The architecture's memory model dictates what orderings of memory operations are visible to other CPUs. We will cover memory models in depth in Chapter 31; for this chapter, the key fact is that the memory model constrains how aggressively the back end can reorder memory operations.

Three representative points on the spectrum, from strictest to most relaxed, are:

  • Sequential consistency (SC). Every CPU sees memory operations in some single global order, and within each CPU's stream, in program order. Very strict; very easy to reason about.
  • Total store ordering (TSO). As in x86: stores from one CPU are seen in program order by every other CPU, but a CPU can see its own loads pass its own older stores. Slightly relaxed; hardware can hold stores in a buffer.
  • Weak / release consistency. As in AArch64 and RISC-V: very few constraints by default; the program inserts explicit fences or uses atomic instructions to enforce ordering when needed. Very relaxed; hardware can reorder freely.

Most modern processors are TSO or weaker. This gives the back end freedom to:

  • Issue loads early (past older loads and even past older stores, as long as no alias is detected).
  • Hold stores in a buffer until they retire and even after retirement.
  • Forward stores to loads from the buffer.

But the model still imposes constraints. On x86 (TSO), a load must not appear to pass an older load: loads must appear to execute in program order, even when the hardware speculatively executes them out of order. On AArch64 (weak), even loads can be reordered, but acquire/release semantics on atomic and fence instructions impose ordering selectively.

The hardware enforces these constraints by tracking the relative order of memory operations in the LSQ and inserting barriers in the queue at appropriate points. A barrier instruction holds up younger operations until all older ones have completed; a release-store waits for older loads and stores; an acquire-load makes younger operations wait for it.

A subtle issue arises with speculation: if a load executes early (speculatively, before all older operations have settled) and another CPU then writes the same location before the load reaches its program-order position, the load may have observed a value that the memory model does not allow. To detect this, the LSQ snoops cache-coherence traffic from other CPUs: if the cache line a load read is invalidated by another CPU before the load retires, the load may have observed a stale value, and on TSO/SC architectures it must be replayed. This is sometimes called cache snoop replay or a memory-ordering machine clear.
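The snoop check can be sketched as a scan of the load queue on each incoming invalidation. The Python below is a conservative illustration that follows the rule as stated here (any executed, unretired load to the invalidated line is replayed). The 64-byte line size and all names are assumptions, and loads are assumed not to straddle lines.

```python
LINE_BYTES = 64   # assumed cache-line size


def line_of(addr):
    return addr // LINE_BYTES


def snoop_hits(invalidated_addr, load_queue):
    """Return the ages of loads that must be replayed after an invalidation.

    load_queue: (rob_index, address, executed) tuples for loads that have
    not yet retired. A load that has not executed yet has observed nothing,
    so it is unaffected.
    """
    target = line_of(invalidated_addr)
    return [
        rob_index
        for rob_index, address, executed in load_queue
        if executed and line_of(address) == target
    ]


lq = [
    (3, 0x1000, True),    # same line, already executed: replay
    (4, 0x1040, True),    # next line over: unaffected
    (6, 0x1008, False),   # same line but not yet executed: unaffected
    (8, 0x1010, True),    # same line, already executed: replay
]
replay = snoop_hits(0x1020, lq)
```

In practice the machine would squash from the oldest affected load onward, which here is the load with age 3.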

The Spectre/Meltdown family of attacks (Chapter 51) exploits speculative memory operations: even a transient load that reads a value and is later squashed can leave traces in the cache. Squashing keeps the architectural state from being affected, but the micro-architectural state (cache contents) still carries information across.

07. Stores and the Write Buffer

A store that has retired no longer occupies an SQ entry, but it has not necessarily been committed to the cache yet. Most processors have a small write buffer below the SQ that holds retired stores, allowing them to drain to the cache lazily.

The benefit: the back end can retire many stores quickly even if the cache is busy with other work. A store that retires goes into the write buffer; the write buffer drains to the cache opportunistically over the next few cycles.

The cost: extra coherence machinery to make these in-flight stores visible to other CPUs as required by the memory model. On TSO, the write buffer must enforce that all retired-but-unwritten stores from a single CPU are seen in order by other CPUs. On weak models, the write buffer can reorder freely, but explicit fences must drain it.

The write buffer is also where store-to-load forwarding can occur for retired stores. A young load may forward from a store that has retired but not yet drained. Most processors do this transparently.
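A toy model makes the write buffer's behaviour concrete. The Python sketch below assumes TSO-style in-order draining, treats every access as one aligned word, and uses a dictionary to stand in for the cache; the class and method names are hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Retired stores waiting to drain to the cache, oldest first."""

    def __init__(self):
        self.entries = deque()   # (address, value) pairs in program order
        self.cache = {}          # stand-in for the L1 data cache

    def retire_store(self, addr, value):
        self.entries.append((addr, value))

    def drain_one(self):
        """Write the oldest buffered store to the cache (FIFO order)."""
        if self.entries:
            addr, value = self.entries.popleft()
            self.cache[addr] = value

    def load(self, addr):
        """Forward from the youngest matching store, else read the cache."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return self.cache.get(addr, 0)

    def fence(self):
        """Drain everything, as an explicit fence would require."""
        while self.entries:
            self.drain_one()


wb = WriteBuffer()
wb.retire_store(0x10, 1)
wb.retire_store(0x10, 2)
wb.retire_store(0x20, 3)
seen_before_drain = wb.load(0x10)   # forwarded from the buffer
```

The load returns 2 even though the cache has not yet been written: the core always sees its own most recent store, while other CPUs see nothing until the buffer drains.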

08. Cache Hierarchy and the Back End

The data cache, especially the L1, is intimate with the back end. The L1 D-cache typically:

  • Is 32-64 KB, 8-way associative.
  • Has 3-5 cycles of access latency.
  • Has 2-3 read ports (matching the load-port count).
  • Has 1-2 write ports (matching the store-port count).
  • Indexes with virtual address bits and tags with physical address bits (a virtually-indexed, physically-tagged or VIPT design that lets the index lookup proceed in parallel with TLB translation).
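The VIPT arrangement works without extra alias handling only when all the index bits fall inside the page offset, which is the case when one way of the cache is no larger than a page. The short Python check below illustrates the arithmetic; the 4 KiB page size and 64-byte line size are assumptions, and the function names are hypothetical.

```python
def index_and_offset_bits(cache_bytes, ways, line_bytes=64):
    """Return (set-index bits, line-offset bits) for a set-associative cache.

    Sizes are assumed to be powers of two.
    """
    sets = cache_bytes // (ways * line_bytes)
    return sets.bit_length() - 1, line_bytes.bit_length() - 1


def vipt_alias_free(cache_bytes, ways, page_bytes=4096):
    """True if index + offset bits all lie within the page offset.

    Equivalent to asking whether one way (cache size / associativity)
    fits within a single page.
    """
    return cache_bytes // ways <= page_bytes


KIB = 1024
bits_32k = index_and_offset_bits(32 * KIB, ways=8)   # 64 sets, 64-byte lines
ok_32k_8way = vipt_alias_free(32 * KIB, ways=8)
ok_64k_8way = vipt_alias_free(64 * KIB, ways=8)
ok_64k_16way = vipt_alias_free(64 * KIB, ways=16)
```

With these assumptions a 32 KB, 8-way cache uses 6 index bits and 6 offset bits, exactly the 12 bits of a 4 KiB page offset. A 64 KB, 8-way cache needs one index bit beyond the page offset, so it requires either higher associativity or additional alias handling.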

The L1 D-cache delivers data to loads on a hit. On a miss, it allocates a miss status holding register (MSHR) that tracks the outstanding miss. The miss is sent to the L2 cache; when the L2 returns the line, the L1 fills the line and forwards the data to the waiting load.

While the miss is in flight, the load occupies its LSQ entry and waits for the data. Other loads to the same line can hit on the MSHR — they don't need their own miss; they share the in-flight one. Loads to different missing lines can have their own MSHRs, up to the maximum number the cache supports (typically 8-16).
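The allocate-or-merge behaviour of the MSHRs can be sketched in Python as follows. This is an illustrative model with hypothetical names, an assumed 64-byte line size, and a default of 8 entries chosen from the range quoted above.

```python
class MSHRFile:
    """Tracks outstanding L1 misses, one entry per missing cache line."""

    def __init__(self, num_entries=8, line_bytes=64):
        self.num_entries = num_entries
        self.line_bytes = line_bytes
        self.pending = {}   # line number -> list of waiting load ids

    def miss(self, load_id, addr):
        line = addr // self.line_bytes
        if line in self.pending:
            # A miss to this line is already in flight: share it.
            self.pending[line].append(load_id)
            return "merged"
        if len(self.pending) >= self.num_entries:
            # No free MSHR: the load must wait and retry later.
            return "stall"
        self.pending[line] = [load_id]
        return "allocated"

    def fill(self, addr):
        """The line has arrived: free the entry and wake its waiters."""
        return self.pending.pop(addr // self.line_bytes, [])


mshrs = MSHRFile(num_entries=2)
results = [
    mshrs.miss(1, 0x1000),   # first miss to line 0x1000
    mshrs.miss(2, 0x1008),   # same line: merges
    mshrs.miss(3, 0x2000),   # different line: second entry
    mshrs.miss(4, 0x3000),   # third line, but only two entries exist
]
woken = mshrs.fill(0x1000)
```

The number of entries bounds the number of distinct lines that can be outstanding at once, and therefore the memory-level parallelism the cache can sustain.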

This is where memory-level parallelism comes from: if the OoO back end can find independent loads and issue them past a stalled one, multiple cache misses can be in flight concurrently. The total memory latency is amortized across the misses, and the effective bandwidth is multiplied.

09. Prefetching from the Back End

The back end can also issue prefetches: speculative loads that bring data into the cache without writing to a register. Whether a prefetch occupies an LSQ slot varies by design; either way, its result only populates the cache.

Hardware prefetchers automatically issue prefetches based on observed access patterns (Chapter 17). Some ISAs also expose prefetch instructions to software (PREFETCHT0 on x86, PRFM on AArch64, PREFETCH.R/W on RISC-V). The compiler can emit these to bring data in ahead of when it will be needed.

A prefetch's job is to hide latency, not to deliver a value to a register. The prefetched data sits in the cache; the actual load that uses it later finds the data already present and hits.

The line between a prefetcher and the back end's speculative loads is fuzzy. An OoO processor with a deep ROB and aggressive load speculation is, in effect, a sophisticated software-driven prefetcher: it issues real loads earlier than they would normally execute, and the cache fills happen ahead of when the dependent code actually needs the data.

10. Memory Reordering and Replays

A modern back end issues loads aggressively, sometimes wrong, and recovers when needed. The recovery is a replay of the load and its dependents.

Several kinds of replays exist:

  • Memory ordering violation. A younger load executed before an older store; the store, when its address resolved, was found to alias. The load and its dependents are squashed and replayed.
  • Cache snoop replay. A load executed and got a value; another CPU then invalidated the line; the architectural memory model would have required the load to see the new value. Squash and replay.
  • Cache miss data replay. A load was speculatively assumed to hit the cache (and dependents were issued accordingly), but it actually missed; dependents have to be replayed when the data finally arrives.
  • TLB miss replay. A load's translation missed in the TLB; the load waits while the page-table walker fetches the translation, and the load and its dependents are replayed when the translation arrives.

Modern processors have intricate replay infrastructure to handle all these cases efficiently. Replays are part of normal operation; the question is whether they happen rarely enough that the overhead is small.

A processor that experiences too many replays — too many speculative cache misses, too many memory-ordering violations — is paying for the OoO machinery without getting its benefit. Some workloads (e.g., those with frequent atomic operations on contested memory locations) see this happen, and OoO actually hurts performance compared to a simpler in-order machine.

11. Memory Dependence Prediction

The conservative policy described in Section 04 has the load wait until every older store's address is known before it executes. The wait is conservative (it guarantees correctness), but it is also a major source of lost performance, because most stores do not alias with following loads. Modern processors aggressively speculate that loads do not alias and issue them as soon as their own address is ready. This is load speculation, and it works most of the time.

When it does not work, the cost is high. A memory-ordering violation forces a replay of the load and its entire dependent chain, sometimes a substantial fraction of the in-flight work. A program in which a particular load reliably aliases a particular store — say, a stack-spill pattern where a value is stored and immediately reloaded — will hit this replay every time, and the speculation hurts more than it helps.

The response is memory dependence prediction. A small predictor learns, per static load (and sometimes per static store), whether the load is likely to alias an older in-flight store. The most cited design is Chrysos and Emer's store-set predictor (1998), in which each load is associated with a set of older stores it has been observed to alias with, and the load is delayed until those stores have computed their addresses. Simpler variants keep only a small table of "this load aliased; wait" hints. AMD's Zen and recent Intel cores use proprietary refinements on the same idea.

The predictor is updated on every memory-ordering violation: the load that mis-speculated is marked as dependent on the offending store, and future executions of the same load wait for that store. The predictor decays over time so that an old alias relationship does not permanently throttle a load. The aggregate effect is to make the speculative path successful for the great majority of loads while keeping the few problematic ones from suffering repeated squashes.
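The train-on-violation and decay behaviour can be illustrated with a small table indexed by load PC. The Python sketch below is a deliberately simplified hint table, not the full store-set design: each load remembers at most one store, the confidence counter and its default of 3 are assumptions, and all names are hypothetical.

```python
class DependencePredictor:
    """Per-load hint: which store (if any) this load should wait for."""

    def __init__(self, max_confidence=3):
        self.max_confidence = max_confidence
        self.table = {}   # load PC -> (store PC, confidence)

    def predict(self, load_pc):
        """Return the store PC to wait for, or None to speculate freely."""
        entry = self.table.get(load_pc)
        return entry[0] if entry else None

    def on_violation(self, load_pc, store_pc):
        """A memory-ordering violation occurred: record the dependence."""
        self.table[load_pc] = (store_pc, self.max_confidence)

    def on_no_alias(self, load_pc):
        """The load waited but the store did not alias: decay the hint."""
        entry = self.table.get(load_pc)
        if entry is None:
            return
        store_pc, confidence = entry
        if confidence <= 1:
            del self.table[load_pc]
        else:
            self.table[load_pc] = (store_pc, confidence - 1)


predictor = DependencePredictor(max_confidence=2)
before = predictor.predict(0x400)        # unknown load: speculate
predictor.on_violation(0x400, 0x3F0)
after = predictor.predict(0x400)         # now waits for the store at 0x3F0
```

Two consecutive calls to `on_no_alias(0x400)` would remove the entry again, which is the decay that keeps a stale alias relationship from throttling the load forever.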

The analogous mechanism for store-to-load forwarding is forwarding prediction: a small structure predicts, before the store's address has been computed, whether the load will forward from an older store, so that the load can be issued or held accordingly. Failed forwarding (the load issued speculatively, then turned out to need a value from the store buffer that arrived too late) is a particularly expensive replay; predicting it pays off.

These predictors are part of the same family as branch predictors but with a different signal. The architectural lesson is that out-of-order memory execution is fundamentally speculative, and like every speculation in modern hardware, it is steered by a learning predictor that improves as the program runs.

12. Atomic Operations and Memory Barriers

The load/store queue handles ordinary loads and stores; atomic operations and memory barriers require additional machinery and are worth treating separately.

An atomic read-modify-write (RMW), such as x86's lock-prefixed instructions, AArch64's LDADD/SWP/CAS family, or RISC-V's AMO instructions, must read a memory location, compute a new value, and write it back, with the guarantee that no other agent observes a partial state. Implementing this in an OoO machine is delicate. The naive implementation blocks all other memory traffic for the duration of the RMW, which serialises far more than the one operation. Modern designs treat an RMW as a special operation in the LSQ that:

  1. Acquires the cache line in the modified coherence state (Chapter 31).
  2. Performs the read, the computation, and the write while holding the line in the modified state.
  3. Releases the line, making the new value visible to other CPUs.

Atomicity comes from holding the line locked against external probes for the duration of the RMW. The line is briefly blocked in the L1, with snoop responses delayed until the RMW completes. The cost is a cache-line lock and unlock around each atomic, which is much faster than the lock prefix on older x86 processors (implemented as a bus lock that locked the entire memory bus) but still substantially slower than an ordinary store.

A load-linked / store-conditional (LL/SC) implementation, used on AArch64 (LDXR/STXR) and RISC-V (LR/SC), achieves the same effect with a different protocol. The load-linked records the loaded line in a small per-CPU monitor; the store-conditional checks the monitor and writes only if no other CPU has modified the line in the interim. The monitor is invalidated by any incoming coherence message for the line, so a contended LL/SC sequence may fail and must be retried. AArch64 also includes a non-LL/SC atomic family (CAS, LDADD, SWP from the LSE extension) for cases where the LL/SC retry is too costly.
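The monitor protocol can be sketched as a tiny state machine. The Python below is an illustration with hypothetical names: the monitor tracks one line at a time, the line size is assumed to be 64 bytes, and a dictionary stands in for memory.

```python
class ExclusiveMonitor:
    """Per-CPU monitor backing a load-linked / store-conditional pair."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.line = None   # line currently being monitored, if any

    def load_linked(self, memory, addr):
        self.line = addr // self.line_bytes
        return memory.get(addr, 0)

    def snoop(self, addr):
        """An incoming coherence message for this line clears the monitor."""
        if self.line == addr // self.line_bytes:
            self.line = None

    def store_conditional(self, memory, addr, value):
        """Write only if the monitor is still armed for this line."""
        ok = self.line == addr // self.line_bytes
        self.line = None   # the monitor is consumed either way
        if ok:
            memory[addr] = value
        return ok


mem = {0x40: 5}
mon = ExclusiveMonitor()

# Uncontended increment: succeeds on the first try.
v = mon.load_linked(mem, 0x40)
first = mon.store_conditional(mem, 0x40, v + 1)

# Contended increment: another CPU touches the same line in between.
v = mon.load_linked(mem, 0x40)
mon.snoop(0x48)
second = mon.store_conditional(mem, 0x40, v + 1)
```

The first attempt succeeds and leaves 6 in memory; the second fails and leaves memory unchanged, so software would loop back to the load-linked and retry. Note that the snoop was to a different address within the same line, which is why false sharing can make LL/SC sequences fail.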

Memory barriers — x86's MFENCE, AArch64's DMB/DSB, RISC-V's FENCE — do not perform memory operations themselves; they constrain how the LSQ may reorder them. A typical barrier:

  1. Stalls the front end from issuing further memory ops past the barrier into the LSQ until the LSQ ahead of the barrier has drained.
  2. Forces the store buffer to flush all stores older than the barrier to the cache before any younger store can be issued.
  3. May force in-flight loads to wait for any pending coherence messages to be processed.

The LSQ tracks the position of each barrier as a special pseudo-entry, and ordering checks consult it when deciding whether a younger memory op may proceed. Barriers are expensive — they serialise a portion of the memory pipeline — which is why programs that use them in inner loops (incorrectly-coded synchronization, debugging primitives) can be substantially slower than the same program with the barriers in their proper place.
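The ordering check against a barrier pseudo-entry can be sketched as below. This Python model covers only a full barrier (one that orders every older operation before every younger one); the queue layout and names are hypothetical, and real designs distinguish several barrier flavours.

```python
def may_proceed(position, queue):
    """Decide whether the memory op at `position` may execute.

    queue: entries in program order, each a dict with a "kind"
    ("load", "store", or "barrier") and a "done" flag. An op is blocked
    if any older barrier still has incomplete operations ahead of it.
    """
    for i in range(position):
        if queue[i]["kind"] != "barrier":
            continue
        older_than_barrier = queue[:i]
        if not all(entry["done"] for entry in older_than_barrier):
            return False
    return True


lsq = [
    {"kind": "load", "done": False},
    {"kind": "barrier", "done": False},
    {"kind": "load", "done": False},
]
blocked = may_proceed(2, lsq)      # older load has not completed
lsq[0]["done"] = True
released = may_proceed(2, lsq)     # everything ahead of the barrier is done
```

The load ahead of the barrier is never blocked by it; only operations younger than the barrier consult the pseudo-entry.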

The interaction of atomics and barriers with the LSQ, the store buffer, and the coherence machinery is the hardest part of memory-system design and one of the largest sources of subtle bugs in real chips. The architectural memory model (Chapter 31) defines what the program is allowed to observe; the LSQ and its supporting structures are what enforce it.

13. A Concrete Example

Walk through the lifecycle of two operations on a modern OoO processor:

Assembly
1: sd x1, 0(x10) # store value x1 to *x10
2: ld x2, 0(x11) # load *x11 into x2
3: add x3, x2, x4 # uses x2

Assume x10 == x11 (the load aliases the store), and the cache does not currently hold the line.

Cycle 1-3: Front end fetches and decodes 1, 2, 3. Rename allocates physical registers, ROB entries, and LSQ entries.

Cycle 4: Issue queue dispatches.

  • Instruction 1 (store) issues to the store-address port and the store-data port. AGU computes x10. SQ entry filled with address. Data (x1) waits if x1 is not yet ready, otherwise filled too.
  • Instruction 2 (load) issues to a load port. AGU computes x11.
  • Instruction 3 waits — depends on x2.

Cycle 5: Address comparison happens.

  • The load's address [x11] matches the SQ entry for the store.
  • If the store's data is ready in the SQ, forward it: the load gets x1's value directly from the SQ. No cache access needed.
  • If the store's data is not yet ready, the load waits in the LSQ.

Cycle 6: Forwarded data is written to x2's physical register. Instruction 3 wakes up.

Cycle 7: Instruction 3 issues, executes.

Cycle 8 (or whenever): Instruction 1 retires. The store data is sent to the write buffer. The cache line for [x10] is fetched (cache miss to L2 or beyond). When the line arrives, the store is committed.

The load was completed before the store committed. The load got the right value via forwarding. Out-of-order memory execution has hidden the cache miss for the store, and the load completed in just a few cycles.

If the store and load did not alias (different addresses), the load would have proceeded directly to the cache, missing, and waited the full memory latency. But it would not have waited behind the store; it would have waited only for its own data. Memory-level parallelism in action.

14. Summary

The memory subsystem of an OoO processor is large, complex, and central to performance. The load/store queue tracks all in-flight memory operations; address generation and TLB translation produce the physical address; the cache supplies data to loads or absorbs data from stores. Memory disambiguation handles the question of whether a load aliases an older store; speculation lets loads execute aggressively, with a replay mechanism for the cases where speculation was wrong. Store-to-load forwarding lets a load read a value from an older store still in the SQ, avoiding the cache round-trip.

The architectural memory model imposes constraints on how aggressively the back end can reorder memory operations; the hardware enforces them through a combination of LSQ ordering tracking, snoop-based replay, and barrier instructions. Miss status holding registers let the cache track multiple outstanding misses simultaneously, exposing memory-level parallelism that hides the latency of memory.

The combination of all this machinery — large ROBs, large LSQs, aggressive speculation, multiple AGUs, multi-port caches, large write buffers, hardware prefetchers — is what turns a wide deep pipeline into a fast machine on memory-bound workloads. Together with the OoO core covered in Chapter 25, the memory subsystem is what we mean when we say a processor is high performance.

The final piece of the front end story remains: how complex instructions are decomposed into the simple µops the back end expects. Chapter 27 covers decode and microcode.
