Cache Coherence and Memory Consistency
May 16, 2026·25 min read·advanced
A program running on a single core sees memory as a simple thing: when it writes, the next read returns what it wrote. Multi-core hardware is not so simple. Each core has its own L1 cache holding…
A program running on a single core sees memory as a simple thing: when it writes, the next read returns what it wrote. Multi-core hardware is not so simple. Each core has its own L1 cache holding copies of memory locations. When core 0 writes a location and core 1 reads the same location, what does core 1 see? The answer depends on the cache coherence protocol that keeps the copies consistent and the memory consistency model that defines what orderings of memory operations are visible across cores.
These two concepts are often confused but are distinct. Coherence is about individual memory locations: at any given time, what value does each cache hold for this location, and how do writes propagate? Consistency is about orderings of memory operations: when one core does write A then write B, in what orders may another core observe those writes? Coherence is typically transparent to software (the hardware handles it); consistency is exposed by the architecture and constrains how programs must be written.
This chapter develops both. We start with coherence: the MESI protocol that has been the workhorse of multi-core caches for decades, its variants, the coherence traffic on the interconnect, and the implementation challenges. Then we turn to consistency: sequential consistency as the conceptual baseline, the relaxed models (TSO, weak, release-acquire) that real architectures provide, and the synchronization primitives programs use to establish ordering when needed.
01. Cache Coherence: The Problem
Core 0 reads location X, getting its L1 a copy of the cache line containing X. Core 1 also reads X, getting its own L1 a copy. Both copies are identical to the value in main memory.
Core 0 then writes X. Now core 0's copy differs from core 1's copy and from memory. If core 1 reads X next, it gets the stale value. If core 0 reads X, it gets the new value. The two cores see different values for the same location.
This is incoherence. The hardware must ensure that, eventually, all caches agree on the current value of X, and that the agreement happens in some defined order.
A practical definition: the cache hierarchy is coherent if, for any single memory location:
- Write propagation. A write by one core eventually becomes visible to all other cores.
- Write serialization. All cores see writes to the same location in the same order. If core 0 writes 1 to X and core 1 writes 2 to X, every observer either sees the order (1, 2) or (2, 1), but no observer sees one order while another sees the other.
These are local-per-location properties. Coherence does not say anything about orderings across different locations — that is the consistency model's job.
02. Snoop-Based Coherence
The classic mechanism is snooping: every cache observes (snoops) memory transactions on a shared bus or interconnect, and updates its own state accordingly. When core 0 writes to a location whose line is also cached by core 1, core 1's cache sees the snoop and either invalidates its copy or updates it.
The shared bus that snooping originally relied on is no longer practical for many cores; modern designs use a switched interconnect that delivers snoops to the relevant caches without flooding the network. The protocol logic is the same; only the physical realization differs.
MESI: The Standard Protocol
The dominant protocol is MESI: each cache line is in one of four states:
- Modified (M). The cache has the only copy of the line. The line is dirty (modified relative to memory). To service a read by another cache, this cache must transfer the line.
- Exclusive (E). The cache has the only copy of the line. The line is clean (matches memory). The cache can write without telling anyone (transitioning to M); other caches don't have the line.
- Shared (S). The cache has a copy of the line; other caches may also have copies. The line is clean. To write, the cache must invalidate other copies.
- Invalid (I). The cache does not have a valid copy of the line. Reads must fetch from elsewhere.
Each cache line tag has 2 bits of state information (one for each MESI state, or 2 bits encoding the four states). The state transitions are driven by local actions (this cache's reads and writes) and by snooped actions (other caches' reads and writes).
The transition diagram, in summary:
| Current | Action | Next | Notes |
|---|---|---|---|
| I | local read miss | E or S | E if no other cache has it; S if shared |
| I | local write miss | M | invalidate any other copies |
| S | local write | M | invalidate any other copies |
| E | local write | M | no traffic needed |
| M | local read/write | M | no change |
| E | snooped read | S | another cache is taking a copy |
| M | snooped read | S | transfer the modified line, downgrade |
| S | snooped write/invalidate | I | invalidate our copy |
| E | snooped write/invalidate | I | invalidate our copy |
| M | snooped write/invalidate | I | transfer modified line, then invalidate |
The key property: at most one cache is in M or E for a given line at any time. Multiple caches can be in S simultaneously. Any cache wanting to write must first acquire M state, which involves invalidating all other copies.
MESI Variants
MOESI. Adds an Owned state: a cache that has the only modified copy but is also serving reads to other caches. Useful when a line is modified and then frequently read by other cores; the modifying cache stays in O state and other caches are in S, sharing the data without write-back to memory until eviction. Used by AMD.
MESIF. Adds a Forward state: among caches in S state, one is designated as the "forwarder" — the one that responds to read requests on behalf of the group. Avoids redundant responses. Used by Intel.
These variants reduce traffic in specific patterns but the core MESI logic is universal. The choice of variant is implementation-specific and not usually visible to software.
03. Directory-Based Coherence
For systems with many cores or multiple sockets, snooping becomes prohibitive: every cache transaction would need to be broadcast to every cache. A directory-based protocol is used instead.
A directory is a structure that, for each cache line in memory, records which caches currently have copies and in what state. When a core wants to access a line, it asks the directory; the directory looks up which caches have copies and arranges the necessary invalidations or transfers.
The directory is typically distributed: each memory controller maintains the directory entries for the lines it serves. When a core sends a read or write request, it goes to the home memory controller, which consults its directory and forwards as needed.
Directory protocols are more complex than snooping but scale much further. They are standard in large multi-socket servers and many-core chips (Intel's mesh-based servers, AMD EPYC, IBM POWER large systems). The directory itself is a significant hardware structure — its size scales with memory size and the number of caches that may share a line.
A hybrid approach is common: snooping within a single chip (where the interconnect is fast and the number of cores is moderate) and directory-based across sockets.
04. Coherence Traffic
The cost of coherence is the bandwidth and latency of the snoop and invalidation traffic. Several patterns are particularly expensive.
Write-after-read sharing. Many cores read a line, then one core writes it. The write must invalidate all the readers' copies. If 64 cores have the line, 64 invalidates are sent.
Ping-pong. Two cores alternately write a line. Each write invalidates the other core's copy; the other core's next access misses and pulls the line back; the cycle repeats. The line bounces between the two caches at coherence-protocol speed (typically a few hundred cycles per round trip).
False sharing. Two cores access different variables that happen to lie in the same cache line. Even though there is no logical sharing, the hardware-level sharing causes ping-pong as if the variables overlapped. A common bug: putting two thread-local counters in the same struct, so they share a line.
Atomic contention. A heavily-contested atomic variable (e.g., a single global counter) becomes a coherence-traffic bottleneck. The line ping-pongs between every contesting core.
These patterns are often the dominant performance cost of multi-threaded code. Eliminating false sharing (by padding shared structures), reducing contention (sharded counters, thread-local accumulators), and avoiding ping-pong (clear ownership boundaries) are standard performance-tuning techniques.
A tool like Intel's cache coherence telemetry or Linux's perf c2c can identify cache-line-level contention in a running program, helping pinpoint these patterns.
05. Inclusive, Exclusive, and Non-Inclusive Hierarchies
The coherence machinery interacts in subtle ways with the inclusion policy of the cache hierarchy mentioned in Chapter 16. The choice has real consequences for both correctness and performance.
In an inclusive hierarchy (every line in L1 is also in L2, every line in L2 is also in L3), the largest level acts as a complete coherence directory for the cache contents below it. An incoming snoop or invalidate from another core can be answered by checking only the L3: if the L3 does not hold the line, no inner cache can either, and the snoop completes without disturbing the L1 or L2. This is a major simplification — inner caches are insulated from external snoop traffic — and is why Intel server chips have used inclusive L3s since Nehalem.
The price of inclusion is capacity: the L3 must hold a copy of everything in the L1s and L2s below it, so the effective capacity for unique data is L3 size minus the sum of L2 sizes. With many cores and large L2s, the L3 spends a substantial fraction of its capacity duplicating data already cached elsewhere. A worse problem is back-invalidation: when a line is evicted from the L3 (because the L3 is set-associative and a new line takes its slot), inclusion forces the corresponding line to be invalidated in every L1 and L2 too, regardless of whether those caches were finished with it. Pathological access patterns where the L3 is heavily contended can cause inner caches to lose lines they would otherwise have kept, hurting performance.
An exclusive hierarchy stores each line in exactly one level: a line in L1 is not in L2, and vice versa. Eviction from the inner cache is a fill into the outer; promotion from outer to inner is a swap. AMD's K7/K8 family used exclusive L1/L2; current AMD designs use a non-inclusive L3 that is neither inclusive nor exclusive but managed flexibly. Exclusive hierarchies have larger effective capacity (the levels add up rather than overlap) but require more complex eviction logic and more snoop traffic, since the largest level no longer acts as a coherence directory.
The non-inclusive middle path tracks coherence through a separate snoop filter or directory structure that records which lines may be cached anywhere in the system, without requiring those lines to be physically held in the L3. The snoop filter is a small, dedicated structure with one entry per cacheable line; it answers external snoops and triggers invalidations only when a line is genuinely cached. This is the dominant approach in modern Intel mesh-based servers (Skylake-SP and successors): the L3 is non-inclusive, and a separate snoop filter handles coherence. The architectural pattern is exactly what we will see again at the multi-socket level, where directory-based protocols achieve the same effect at a coarser granularity.
We will return to inclusion policies more thoroughly in Chapter 50; for the consistency story, the takeaway is that the structure of the cache hierarchy and the structure of the coherence protocol are deeply entangled, and the design choices in one constrain the choices in the other.
06. Memory Consistency: The Model Problem
Coherence solves the problem of agreement on individual locations. But programs care about orderings across locations. Consider two cores running concurrently:
| Core 0: Core 1: | |
| store 1 to X load Y -> r1 | |
| store 1 to Y load X -> r2 |
If X and Y were initially 0, what values can r1 and r2 take?
Several possibilities:
- r1 = 0, r2 = 0. Both loads happen before both stores. (Reasonable if core 1 ran first.)
- r1 = 1, r2 = 1. Both loads see both stores. (Reasonable if both stores happened before both loads.)
- r1 = 0, r2 = 1. The X store completed before core 1's loads, but the Y store had not. (Reasonable if core 0's stores propagated in order.)
- r1 = 1, r2 = 0. Core 1 saw the Y store but not the X store.
The fourth case is the interesting one. It can happen if the two stores propagate to core 1 in reverse order — Y first, then X. The intuition that core 0's stores happen "in program order" is correct from core 0's perspective, but other cores may see them in a different order, depending on the hardware's coherence protocol and write-buffer behavior.
This is what a memory consistency model formalizes: which orderings of operations are allowed.
07. Sequential Consistency
The simplest model is sequential consistency (SC), defined by Lamport (1979):
The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
In other words: imagine an interleaving of all cores' operations into a single global sequence, with each core's operations appearing in program order. The execution is SC if some such interleaving produces the observed behavior.
In our example: case 4 (r1 = 1, r2 = 0) is not SC. There is no interleaving consistent with program order that produces it. (Try: any interleaving has X stored before Y; for r2 = 0, the load of X must happen before the store of X; for r1 = 1, the load of Y must happen after the store of Y. So load Y > store Y, store Y > store X (program order), store X > load X, load X > load Y (program order on core 1). Cycle, contradiction.)
So under SC, the four-case enumeration drops to three. Programs are easier to reason about under SC: orderings within each thread are preserved, and any inconsistency must come from interleaving.
The downside of SC: the hardware cannot reorder memory operations even when the program would not notice. Every store must be visible to all cores before any subsequent operation can proceed. Every load must wait for older stores to complete (or for some equivalent guarantee). The performance cost is severe.
Real architectures have therefore moved to relaxed memory models, where the hardware is allowed to reorder some memory operations, in exchange for explicit synchronization instructions when ordering is needed.
08. Relaxed Models
Different architectures adopt different relaxations. The choices form a spectrum.
Total Store Ordering (TSO): x86
x86 (and the older SPARC) use TSO. The relaxation: a load may execute before older stores have committed (specifically, before older stores by the same core have become visible to other cores). All other orderings are preserved.
Concretely: a core has a write buffer that holds retired stores until they drain to the L1 cache. Loads can be served from the buffer (forwarded from in-flight stores) or skip past it (if no buffered store aliases the load's address). A load on core 0 may therefore see core 0's recent stores even though those stores have not yet propagated to core 1.
TSO admits the case 4 outcome above: core 0 retires both stores; both sit in core 0's write buffer; core 1 reads X (still 0) and Y (still 0)... wait, that gives 0,0, not 1,0. Let me rework.
Actually, TSO admits a subtler outcome. Consider this asymmetric example:
| Core 0: Core 1: | |
| store 1 to X store 1 to Y | |
| load Y -> r1 load X -> r2 |
Under SC: at most one of r1, r2 is 0 (the loads must see at least one of the stores). Under TSO: both r1 and r2 can be 0, because each core's load can be reordered before the other core's store has propagated. The store sits in the local write buffer; the load reads memory directly and sees the old value.
This is the classic store-buffer reordering allowed by TSO. It can surprise programmers who expect SC.
To restore SC behavior, x86 provides the MFENCE instruction: a full memory fence that drains the write buffer and waits for all older operations to complete before younger ones may proceed. Programs that need SC semantics insert MFENCE between the store and the load.
TSO is the strongest relaxation in common use. Most x86 software relies on TSO without explicit fences, and the architecture's relatively strong guarantees are part of why x86 is considered "easy" to program for compared with weaker models.
Weak Models: ARM, RISC-V
ARM (AArch64) and RISC-V use weak memory models. Almost any reordering is allowed, including loads passing other loads, stores passing stores, and loads passing stores in either direction. Hardware is free to reorder memory operations as long as single-thread program semantics are preserved (which only requires honoring within-thread RAW dependencies).
Without explicit synchronization, programs can observe almost any outcome — including ones that surprise SC-trained intuitions. For example:
| Core 0: Core 1: | |
| store 1 to X load Y -> r1 | |
| store 1 to Y load X -> r2 |
Under SC: if r1 = 1, then r2 = 1. Under TSO: same. Under a weak model: r1 = 1, r2 = 0 is allowed, because core 1 may reorder the loads (it reads Y first, sees 1; then reads X, sees 0 because core 0's first store had not yet propagated even though core 0's second store had).
To restore ordering, weak architectures provide several barrier instructions:
DMB(data memory barrier): orders memory operations preceding it before memory operations following it. Various flavors restrict to specific operation types (loads-only, stores-only, full).DSB(data synchronization barrier): stronger; waits for all preceding operations to complete before allowing any following operations.ISB(instruction synchronization barrier): flushes the pipeline.
RISC-V has analogous instructions: FENCE, with operands selecting which kinds of operations to order.
In typical programs, explicit barriers are rarely used directly. Instead, the program uses synchronization primitives (locks, atomics with acquire-release semantics) that internally use the right barriers.
Release-Acquire Semantics
A more refined model, popular in C++ atomics and used by many modern architectures, is release-acquire:
- A release operation (typically a store, often the unlock of a mutex) ensures that all preceding operations from this thread are visible before the release becomes visible.
- An acquire operation (typically a load, often the lock of a mutex) ensures that all subsequent operations from this thread happen after the acquire's result is observed.
Together, a release on one thread followed by an acquire that reads its value on another thread ensures that everything the first thread did before the release is visible to the second thread after the acquire.
Release-acquire is the standard model for synchronization primitives. It is weaker than SC but strong enough for the vast majority of multi-threaded programming idioms.
ARM AArch64 supports release-acquire directly with load-acquire and store-release instructions:
| ldar x0, [x1] ; load with acquire semantics | |
| stlr x2, [x3] ; store with release semantics |
These are cheaper than full barriers and match the C++ memory_order_acquire and memory_order_release exactly. The compiler can emit them when the source code uses std::atomic operations with the appropriate ordering.
RISC-V has similar variants on its atomic instructions: lr.aq (load-reserved with acquire), sc.rl (store-conditional with release), and combinations.
09. Litmus Tests and Formal Memory Models
The per-architecture descriptions above (TSO, weak, release-acquire) are informal English. For the past two decades, the academic and industrial community has worked on formal memory models that specify, with mathematical precision, exactly which executions a program may produce. The formalization matters because the differences between models are subtle, and informal text routinely fails to anticipate edge cases that real hardware implementations exhibit.
The primary tool is the litmus test: a small program with a few threads, a few memory operations each, and a question about which final register values are observable. The test we used above for store-buffer reordering —
| Core 0: Core 1: | |
| store 1 to X store 1 to Y | |
| load Y -> r1 load X -> r2 |
— is a litmus test, conventionally called SB (Store Buffer). Several dozen such tests cover the full space of two- and three-thread reorderings: MP (Message Passing), LB (Load Buffer), WRC (Write-to-Read Causality), IRIW (Independent Reads of Independent Writes), and so on. For each test, the architecture's memory model says whether the surprising outcome is allowed.
Formal memory models are written in two main styles. Operational models describe an abstract machine — with explicit per-thread store buffers, partial-order propagation rules, and so on — whose behaviours are by construction the allowed ones. Axiomatic models describe a set of constraints on the happens-before and coherence relations across operations and define an execution as allowed if there is a relation satisfying the constraints. Both are equivalent in the cases that matter; axiomatic models tend to be easier for programmers to reason about, while operational models are easier to translate into hardware.
ARM's AArch64 memory model has had multiple formalizations — ARMv8 flowing and POP models from the early days, and the current axiomatic AArch64 memory model maintained by ARM. RISC-V's RVWMO (Weak Memory Ordering) is axiomatic and was developed alongside the ISA. Intel and AMD have published formal models of x86-TSO; the academic x86-TSO paper by Sewell and others (2010) is the canonical reference.
The practical use of formal models is that tools can verify concurrent algorithms against them. Given a litmus test, tools like herd7, litmus7, or dat3m enumerate the executions allowed by a model and report which ones are observable. For programmers writing lock-free code, this is the only reliable way to know whether an algorithm is actually correct on a target architecture; informal reasoning has been wrong often enough that the field has come to rely on the formal tools.
A related effort is the formalization of the language-level memory models that compilers target: the C/C++ memory model, the Java memory model, the LLVM memory model. The compiler's job is to map source-level orderings (acquire, release, seq_cst) onto hardware-level instructions (LDAR, STLR, FENCE, MFENCE) in a way that preserves all the language model's guarantees on every target architecture. This translation is non-trivial; bugs in compiler memory-model handling have been a recurring source of subtle concurrency issues.
The lesson for working programmers is that memory models, once a niche academic interest, have become essential infrastructure. Languages, compilers, libraries, and hardware all expose them, and tools to reason about them rigorously now exist. Concurrent code that targets weak memory models without using these tools or relying on well-vetted libraries is gambling with correctness.
10. The C++ Memory Model
C++ (since C++11) defines a memory model that programs can target. The model is built around std::atomic types with explicit memory orderings:
memory_order_relaxed— no ordering guarantees beyond atomicity.memory_order_acquire— ordering for loads (subsequent operations don't move before).memory_order_release— ordering for stores (preceding operations don't move after).memory_order_acq_rel— both, for read-modify-write operations.memory_order_seq_cst— sequential consistency (the default).
The compiler maps these orderings to the appropriate hardware primitives:
- On x86 (TSO), most orderings need no extra fence; only
seq_cststores require a fence. - On AArch64 (weak),
acquiremaps toLDAR,releasetoSTLR,seq_cstto combinations. - On RISC-V, similar mapping with the .aq and .rl suffixes.
This abstraction lets portable programs write the correct ordering once and have the compiler emit hardware-appropriate code. It is the basis of essentially all modern concurrent programming in C++ and (with similar models) in Rust, Java, C#, and other languages.
11. Implementation: Memory Ordering Buffers
The hardware enforces the architectural memory model through structures we have already met (Chapter 26): the load queue, store queue, and the snoop/invalidation logic.
Under TSO, the LSQ ensures that:
- Stores from one core are visible to all other cores in program order. The store queue drains to the cache in FIFO order.
- Loads on a single core are not reordered with each other (loads remain in program order).
- Loads can pass older stores to different addresses (store-load reordering allowed).
Under a weak model:
- Stores can be reordered if they target different addresses.
- Loads can be reordered if they target different addresses.
- Loads can pass stores freely.
- The barrier and acquire-release instructions are the only constraints.
The LSQ tracks the program-order position of each operation and uses snoop responses to detect when a load's value has been invalidated before the load retires (the cache snoop replay of Chapter 26). On TSO architectures, this catches violations like a load that read a value before another core's store invalidated the line; the load and its dependents are squashed and replayed.
Implementing TSO on top of an aggressively speculative OoO core is non-trivial. The cost is a few percent of performance for the extra ordering checks, plus occasional replays. Implementing a weak model is cheaper, but the software side bears more responsibility for ordering.
12. Examples of Memory-Ordering Bugs
A few canonical bugs illustrate the issues.
The double-checked locking pattern. A common mistake when initializing a lazy singleton:
static MyClass* instance = nullptr;
static std::mutex m;
MyClass* get_instance() {
if (instance == nullptr) { // (1) unprotected check
std::lock_guard<std::mutex> lock(m);
if (instance == nullptr) {
instance = new MyClass(); // (2) unprotected publish
}
}
return instance;
}The thread that constructs the object writes:
- The MyClass internal data.
- The pointer
instance.
A reader thread that reaches the unprotected check (1) and sees instance != nullptr then reads *instance. Under a weak memory model, the reader can see the pointer before seeing the object's data (because the writes can be reordered as observed). The reader sees a partially-constructed object.
The fix: make instance a std::atomic<MyClass*> and use memory_order_release for the store, memory_order_acquire for the load. The compiler emits the right barriers / fences.
Peterson's algorithm bug. Peterson's mutual-exclusion algorithm relies on SC. It works on x86 (TSO) but is broken on a weak model without explicit barriers — the loads of the other thread's flag may be reordered before the local store. This bug is a classic in concurrency textbooks.
Reordering across non-atomic accesses. The compiler can also reorder accesses to non-atomic variables. A program that reads/writes a shared variable without using std::atomic or other synchronization has data races, and the compiler is free to reorder, eliminate, or duplicate the accesses. The result is often surprisingly broken, with bugs that appear only at high optimization levels or on specific hardware. The C++ standard says data races are undefined behavior; the program may do anything.
These bugs are a major reason concurrent programming is hard. The programmer's intuition often assumes SC; the hardware delivers something weaker; the gap manifests as rare, hard-to-reproduce bugs.
13. The Practical Picture
Most programmers writing multi-threaded code do not interact with memory models directly. They use:
- Locks (mutexes). A lock implementation internally uses atomic primitives with appropriate barriers. As long as the program uses locks correctly (no data races), the memory model is invisible.
- Atomic libraries.
std::atomicin C++,java.util.concurrent.atomicin Java,std::sync::atomicin Rust. These provide the right barriers automatically. - Concurrent data structures. Lock-free queues, hash maps, and so on, implemented by experts who understand the memory model. Application code uses them without worrying about ordering.
- Higher-level frameworks. Task graphs, message passing, futures/promises — frameworks that abstract the synchronization away.
Direct manipulation of memory orderings is reserved for performance-critical low-level code: kernels, runtimes, lock-free libraries, very hot paths.
The hardware's job is to make the standard primitives fast and the bug-free patterns correct. Modern processors, with their LR/SC or LDXR/STXR atomic mechanisms, their cache coherence protocols, and their explicit acquire-release instructions, do this well. The cost of a typical lock acquire-release on uncontested memory is a few tens of cycles; under contention, it is dominated by the coherence traffic and the wait for the lock to become free.
14. Summary
Cache coherence keeps multiple caches consistent on individual memory locations: when one core writes, all cores eventually see the new value, and all cores see the same order of writes to that location. The MESI protocol (and its variants MOESI, MESIF) is the standard implementation, with snooping for small systems and directory-based protocols for large ones. The cost of coherence shows up as cache-line ping-pong, false sharing, and atomic contention — major performance pathologies in multi-threaded code.
Memory consistency models specify what orderings of operations across different locations are visible to other cores. Sequential consistency is the simplest and strictest; TSO (x86) is moderately relaxed; weak models (ARM, RISC-V) are most relaxed. Programs that need ordering guarantees beyond what the model provides must use synchronization primitives — barriers, atomics with acquire-release semantics, locks — that the hardware supports with specific instructions.
The memory model is one of the few low-level architectural details that consistently leaks into high-level programming. Concurrent code must be aware of the model — directly via atomics or indirectly via well-designed libraries — to be correct.
This concludes Part VI. Parts I through VI have built the conceptual framework: from digital logic to the von Neumann machine, from ISAs to micro-architectures, and from single-thread parallelism to multi-thread coordination. Part VII begins a tour of specific real-world architectures, starting with x86-64.