Part I Foundations

Basic CPU Organization

May 16, 2026·36 min read·beginner

The previous chapter sketched the high-level organization of a computer: a CPU on one side, memory on the other, and an interconnect carrying instructions and data between them. We are now going to step inside the CPU. The goal of this chapter is to show, with no hand-waving, what a processor is actually made of at the level just above gates and just below assembly language. By the end you should be able to draw the basic blocks of a working CPU, name the wires that connect them, and describe what happens to each value as an instruction passes through.

We will keep the discussion mostly at the organization level rather than the micro-architecture level. The processor we sketch here is a simple, single-cycle machine without pipelines, caches, or branch predictors. It is not what any modern chip looks like in detail. But every modern chip contains, somewhere inside, a structure that maps onto the picture we are about to build, and the names introduced here — datapath, control unit, program counter, register file, ALU — will recur unchanged for the rest of the book.

01.The Datapath

The datapath is the part of the CPU through which data physically flows. It is the collection of registers, multiplexers, adders, shifters, memory ports, and wires that move bits around and combine them. The datapath does not, by itself, decide what to do. It is the stage on which instructions perform; the control unit, discussed in the next section, is the director.

A useful way to picture the datapath is to ask: where can a bit live, and how can it get from one place to another? In the simple machine we are building, a bit can live in the program counter, in the instruction register, in one of the general-purpose registers, in a memory location, or on a wire passing through a functional unit. Every instruction the processor executes is, at this level, a particular pattern of bit movements among these places.

A first-cut block diagram of a single-cycle datapath looks like this:

Figure: Single-cycle datapath with PC, instruction memory, IR, register file read and write ports, sign-extended immediate, ALU, and data memory in a feed-forward chain

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.8cm, align=center}]
  % Origin (0,0) at top-left.
  \node[blk] (pc)   at (4, -0.5) {PC};
  \node[blk] (im)   at (4, -2)   {I-mem};
  \node[blk] (ir)   at (4, -3.5) {IR};
  \node[blk] (rf)   at (1.5, -5) {register file (read)};
  \node[blk] (imm)  at (6.5, -5) {immediate / sign-extend};
  \node[blk] (alu)  at (4, -6.5) {ALU};
  \node[blk] (dm)   at (4, -8)   {D-mem};
  \node[blk] (wb)   at (4, -9.5) {register file (write)};
  \draw[->] (pc) -- (im) node[midway, right, font=\footnotesize] {instr addr};
  \draw[->] (im) -- (ir) node[midway, right, font=\footnotesize] {instr bits};
  \draw[->] (ir.south) -- (rf.north);
  \draw[->] (ir.south) -- (imm.north);
  \draw[->] (rf.south) -- (alu.north -| rf);
  \draw[->] (imm.south) -- (alu.north -| imm);
  \draw[->] (alu) -- (dm);
  \draw[->] (dm) -- (wb);
\end{tikzpicture}

The arrows are the wires; the boxes are the storage and functional units. Notice three things. First, every box that holds state — the program counter, the instruction register, the register file, the data memory — is updated on a clock edge. Second, between two state-holding elements there is some chunk of combinational logic doing the actual work. Third, certain connections are shown as if they were single wires, but they are in fact wide buses: 32 or 64 wires running in parallel.

The control unit, not yet drawn, sits above all this and tells each block what to do on each cycle: which register to read, whether the ALU should add or subtract, whether the memory operation is a load or a store, where the result should go. We will draw it in shortly, after we have introduced the rest of the pieces.

The reason for separating datapath from control is largely organizational. The datapath is the part that moves and operates on data; the control unit is the part that decides which operations to perform. In practice the two are deeply intertwined — every multiplexer in the datapath has its select line coming from the control unit, every register has its enable signal coming from there — but the conceptual split is enormously useful when reasoning about a processor. It lets you ask "is there a path for this bit to get there?" without simultaneously asking "and what tells it to take that path?"

02.The Control Unit

The control unit is the part of the CPU that interprets the current instruction and produces the control signals that drive the datapath. It is, in the language of Chapter 5, a (sometimes very large) finite state machine. Its inputs are the bits of the instruction currently in the instruction register, plus a few status signals from the datapath such as the zero output of the ALU. Its outputs are the dozens or hundreds of control signals that flow into every multiplexer, register, and functional unit.

For a simple single-cycle machine, the control unit is purely combinational. The instruction goes in, and the control signals come out, all on the same cycle. For multi-cycle and pipelined machines the control unit becomes sequential, with its own state register, but the principle is the same.

A small example will make this concrete. Suppose our processor has a 32-bit instruction format with a 6-bit opcode in the top bits, and suppose three of the possible opcodes are ADD, LOAD, and STORE. The control unit, on seeing the opcode, must decide:

whether the ALU should perform an addition or simply pass through one of its inputs;
whether the register file should write back a value at the end of the cycle, and if so, from where (the ALU output or the memory output);
whether the data memory should perform a read, a write, or nothing;
which input to the ALU's left operand mux is selected (a register or the program counter);
which input to the ALU's right operand mux is selected (a register or an immediate).

A small slice of the control unit's truth table looks like this:

Opcode	ALUop	RegWrite	MemRead	MemWrite	MemToReg	ALUSrc
ADD	add	1	0	0	0	0 (register)
LOAD	add	1	1	0	1	1 (immediate)
STORE	add	0	0	1	—	1 (immediate)

Each row says what the control signals should be when the corresponding opcode is in the instruction register. For ADD, the ALU adds two registers and writes the result to a register. For LOAD, the ALU adds a base register and an immediate to compute an address, the data memory is read, and the loaded value is written to the destination register. For STORE, the ALU again computes an address, but no register is written and the memory is written instead. The "—" entry for MemToReg on a store reflects that, since RegWrite is 0, the value of MemToReg does not matter.
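The table can be transcribed almost directly into a lookup. The following is a minimal sketch in Python, not a hardware description. Using mnemonic strings as keys (the text gives no numeric opcode encodings) and `None` for the don't-care entry are assumptions of the sketch.

```python
# Control signals transcribed from the truth table above.
# None stands for the "don't care" entry (MemToReg on a store).
CONTROL = {
    "ADD":   dict(ALUop="add", RegWrite=1, MemRead=0, MemWrite=0, MemToReg=0,    ALUSrc=0),
    "LOAD":  dict(ALUop="add", RegWrite=1, MemRead=1, MemWrite=0, MemToReg=1,    ALUSrc=1),
    "STORE": dict(ALUop="add", RegWrite=0, MemRead=0, MemWrite=1, MemToReg=None, ALUSrc=1),
}


def decode(opcode):
    """Combinational decode: the same opcode always yields the same signals."""
    return CONTROL[opcode]


load_signals = decode("LOAD")
print(load_signals)
```

Because `decode` keeps no state between calls, it behaves like the purely combinational control unit of the single-cycle machine.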

There are two main implementation styles for control units, and you will see references to both throughout the book.

A hardwired control unit is built as ordinary combinational logic. The opcode and other instruction bits go in as inputs, and gates produce the control signals as outputs. Hardwired control is fast and small for simple ISAs, and it is what nearly every RISC processor and most modern CPUs use for the bulk of their instructions.

A microprogrammed control unit treats each instruction as a tiny program of its own. Inside the control unit lives a small ROM, the control store, holding sequences of micro-instructions, each of which directly specifies one cycle's worth of control signals. The opcode of the visible instruction picks a starting address in the control store, and the unit walks through the corresponding micro-instructions one per cycle. Microprogramming was invented by Maurice Wilkes in 1951 and was the dominant style of complex CPUs from the 1960s through the 1980s; modern x86 processors still use it for their most complex instructions, but rely on hardwired logic for the common simple ones.
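The walk through the control store can be sketched in a few lines of Python. Everything concrete here is hypothetical: the micro-instruction fields, the `last` end-of-sequence marker, the two-step LOAD, and the `DISPATCH` table are illustrative assumptions, not the layout of any real control store.

```python
# Hypothetical control store: each entry holds one cycle's control signals.
CONTROL_STORE = [
    # addresses 0-1: an assumed two-step LOAD
    {"ALUop": "add", "MemRead": 0, "RegWrite": 0, "last": False},  # compute address
    {"ALUop": None,  "MemRead": 1, "RegWrite": 1, "last": True},   # read memory, write register
    # address 2: an assumed one-step ADD
    {"ALUop": "add", "MemRead": 0, "RegWrite": 1, "last": True},
]

# The visible opcode selects a starting address in the control store.
DISPATCH = {"LOAD": 0, "ADD": 2}


def run_microprogram(opcode):
    """Return the micro-instructions issued for one visible instruction, one per cycle."""
    upc = DISPATCH[opcode]          # micro-program counter
    issued = []
    while True:
        micro = CONTROL_STORE[upc]
        issued.append(micro)
        if micro["last"]:
            return issued
        upc += 1                    # step to the next micro-instruction


print(len(run_microprogram("LOAD")), len(run_microprogram("ADD")))
```

Adding an instruction in this style means adding rows to the store and one dispatch entry, which is the flexibility the text attributes to microprogramming.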

The choice between the two styles is a tradeoff. Hardwired control is faster and uses less area for simple ISAs. Microprogrammed control is more flexible — it makes it easy to add new instructions, fix bugs, and implement very complex multi-step operations — but is slower for simple operations because of the extra layer of indirection. Real processors today often combine the two: a fast hardwired path for common simple instructions and a microcoded path for rare or complex ones.

We will return to microcode in detail in Chapter 27. For now, the important point is that the control unit is the source of every signal that decides what the datapath does on each cycle.

03.Program Counter

The program counter, or PC, is the register that holds the address of the next instruction to be fetched. It is one of the few special-purpose registers that the architecture itself names. (The others vary by ISA: a stack pointer, a status register, sometimes a link register.) The program counter is the engine of the fetch–execute loop, because updating its value is what causes the CPU to move from one instruction to the next.

In ordinary execution, the program counter is updated by simply adding the size of the current instruction. On an ISA with fixed-width 32-bit instructions, such as base RISC-V or ARM AArch64, this means adding 4 bytes for each instruction. On a variable-width ISA like x86-64, the increment is whatever length the decoder determined the current instruction to be. The PC update is performed by a small adder dedicated to that purpose, with one input wired to the current PC and the other to a constant equal to the instruction size.

Figure: PC update path: a dedicated adder produces PC plus instruction size, and a PC mux selects between the sequential value and a branch target

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.7cm}]
  \node[blk] (pc)  at (3, -0.5) {PC};
  \node[blk] (add) at (3, -2)   {+4 (instr size)};
  \node[blk] (mux) at (3, -3.5) {PC mux};
  \draw[->] (pc) -- (add);
  \draw[->] (add) -- (mux);
  \draw[->] (mux.south) -- (3, -4.5) -- (0.5, -4.5) -- (0.5, -0.5) -- (pc.west);
  \node[anchor=west, font=\footnotesize] at (4.5, -3.5) {$\leftarrow$ branch target};
  \node[anchor=west, font=\footnotesize] at (4.5, -4.5) {(back to PC on next clock edge)};
\end{tikzpicture}

When a branch or jump instruction is executed, the PC is loaded with a different value. The new value comes from the datapath — it might be the result of a small calculation involving the current PC and an offset (a PC-relative branch), or it might be a value read from a register (an indirect jump), or it might be an absolute address embedded in the instruction. A multiplexer in front of the PC selects between the incremented value and the branch target, and the control unit decides which to take based on the instruction and on whatever condition the branch tests.
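The PC adder and the multiplexer in front of the PC can be modeled as two small functions. This is a sketch under stated assumptions: fixed 4-byte instructions and a 32-bit address space. The names `next_pc` and `pc_relative_target` are made up for illustration.

```python
INSTR_SIZE = 4            # bytes; assumes fixed-width 32-bit instructions
MASK32 = 0xFFFFFFFF       # assumes a 32-bit address space


def pc_relative_target(pc, offset):
    """Branch target for a PC-relative branch; offset may be negative."""
    return (pc + offset) & MASK32


def next_pc(pc, take_branch, branch_target):
    """Model of the PC mux: sequential value or branch target."""
    sequential = (pc + INSTR_SIZE) & MASK32       # dedicated PC adder
    return branch_target if take_branch else sequential


pc = 0x1000
print(hex(next_pc(pc, False, 0)))                               # 0x1004
print(hex(next_pc(pc, True, pc_relative_target(pc, -8))))       # 0xff8
```

The `take_branch` argument plays the role of the select line that the control unit drives from the opcode and the branch condition.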

Two architectural details of the program counter deserve attention.

The first is that the PC is read by some instructions as a source of data. PC-relative addressing, which lets a program reference a location at a known offset from itself, is used for most references to code and global data in position-independent code and for nearly every direct branch in any modern ISA. Some architectures expose the PC as an ordinary readable register; others provide special instructions or addressing modes to access it. RISC-V uses an instruction called auipc (add upper immediate to PC) for this purpose; x86-64 added RIP-relative addressing for the same reason. ARM exposes the PC as register R15 in AArch32 and uses dedicated PC-relative load instructions in AArch64.

The second is that "the program counter" in a pipelined or out-of-order processor is no longer a single, well-defined value. The fetch stage has its own notion of the PC; the decode stage works on an instruction whose address is somewhat older; the execute stage may be working on something older still; and the retirement stage has the architectural PC of the most recently completed instruction. We will say much more about this in Chapter 22 and beyond. For the simple machine of this chapter, all of these collapse to a single register, but the multiplicity of "PC values" inside a real processor is a source of subtle bugs and clever tricks.

04.Instruction Register

The instruction register, or IR, holds the instruction currently being executed. It is loaded once per instruction by the fetch step: the program counter is sent to memory, the memory returns the bits, and those bits are latched into the IR. From there, the bits flow into the control unit, which interprets them, and into various parts of the datapath, which extract operands and addresses from them.

A typical 32-bit instruction in a fixed-width ISA might be carved into fields like this:

Figure: A 32-bit fixed-width instruction divided into opcode, two source registers, a destination register, and an immediate field, with bit ranges labeled

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  % Five fields with widths proportional to bit-counts
  % Origin (0,0) top-left of the bit field bar
  % opcode 6 bits, rs1 5, rs2 5, rd 5, immediate 11; total 32 bits
  \draw (0,-1) rectangle (2,-0.2);   \node at (1, -0.6) {opcode};   \node[font=\tiny] at (1, 0)    {31..26};
  \draw (2,-1) rectangle (3.4,-0.2); \node at (2.7, -0.6) {rs1};   \node[font=\tiny] at (2.7, 0)  {25..21};
  \draw (3.4,-1) rectangle (4.8,-0.2);\node at (4.1, -0.6) {rs2};   \node[font=\tiny] at (4.1, 0)  {20..16};
  \draw (4.8,-1) rectangle (6.2,-0.2);\node at (5.5, -0.6) {rd};    \node[font=\tiny] at (5.5, 0)  {15..11};
  \draw (6.2,-1) rectangle (8.8,-0.2);\node at (7.5, -0.6) {immediate}; \node[font=\tiny] at (7.5, 0) {10..0};
\end{tikzpicture}

The opcode says what kind of instruction this is. The rs1 and rs2 fields name the source registers. The rd field names the destination register. The immediate field, when present, encodes a small constant that the instruction can use directly. (Different instruction formats lay these fields out differently — load and store instructions carry a different immediate, branches a different one again. We will return to formats in Chapter 11, when we discuss the ISA proper.)
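Extracting these fields is a matter of shifting and masking. The sketch below follows the bit ranges in the figure (opcode 31..26, rs1 25..21, rs2 20..16, rd 15..11, immediate 10..0). The helper names `encode_fields` and `decode_fields` are illustrative, and the layout is this chapter's example format rather than that of any particular real ISA.

```python
def encode_fields(opcode, rs1, rs2, rd, imm):
    """Pack the five fields into one 32-bit instruction word."""
    return ((opcode & 0x3F) << 26) | ((rs1 & 0x1F) << 21) | \
           ((rs2 & 0x1F) << 16) | ((rd & 0x1F) << 11) | (imm & 0x7FF)


def decode_fields(word):
    """Split a 32-bit instruction word into its fields (immediate left unsigned)."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26, 6 bits
        "rs1":    (word >> 21) & 0x1F,   # bits 25..21, 5 bits
        "rs2":    (word >> 16) & 0x1F,   # bits 20..16, 5 bits
        "rd":     (word >> 11) & 0x1F,   # bits 15..11, 5 bits
        "imm":    word & 0x7FF,          # bits 10..0, 11 bits
    }


word = encode_fields(opcode=0x2A, rs1=3, rs2=4, rd=5, imm=0x123)
print(decode_fields(word))
```

In hardware no shifting takes place: each field is simply a fixed bundle of wires taken from the instruction register.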

The instruction register is a passive holder. It has no smarts of its own. Its value is interpreted by:

the control unit, which examines the opcode and, in many ISAs, a few additional function-code bits, to decide what to do;
the register file, whose read-port addresses come directly from the rs1 and rs2 fields and whose write-port address comes from rd;
the immediate generator, which extracts the appropriate immediate field, sign-extends it if necessary, and presents it to the ALU;
the branch logic, which uses the immediate field of branch instructions, combined with the PC, to compute the branch target.
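Sign extension, the job of the immediate generator, is easy to get wrong, so a small sketch may help. It assumes the 11-bit immediate of the example format above and a PC-relative branch whose offset is added to the PC directly. Real ISAs often scale or rearrange the immediate, which this sketch ignores.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def branch_target(pc, imm_field, bits=11):
    """PC-relative target: PC plus the sign-extended immediate (32-bit wraparound assumed)."""
    return (pc + sign_extend(imm_field, bits)) & 0xFFFFFFFF


print(sign_extend(0x7FF, 11))            # -1
print(sign_extend(0x3FF, 11))            # 1023
print(hex(branch_target(0x1000, 0x7FC))) # 0xffc, since 0x7FC encodes -4
```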

In a single-cycle implementation, the IR is loaded at the start of the cycle and remains stable for the whole cycle while the rest of the datapath uses its bits. In a pipelined implementation, each pipeline stage has its own pipeline register that carries forward the parts of the instruction it still needs. In neither case is the IR doing computation; it is just a place where the instruction's bits sit while they are being acted upon.

05.Register File

The register file is a small, fast bank of storage internal to the CPU. It holds the architectural general-purpose registers: the named storage that the programmer (and the compiler) sees. The number of registers varies by ISA: 16 in classic ARM AArch32, 32 in RISC-V, 31 in AArch64 (x0 through x30, with the remaining register number encoding either the zero register or the stack pointer), 16 in x86-64, hundreds in some VLIW machines. The width of each register is the word size of the processor: 32 bits or 64 bits in modern designs.

A register file is typically implemented as an array of D flip-flops with read and write ports built around it. A read port takes a register number as input and produces the contents of that register on its output. A write port takes a register number, a data value, and a write-enable signal as inputs, and writes the data to the named register on the next clock edge if the enable is asserted.

For a typical RISC instruction such as add rd, rs1, rs2, the register file must produce two source values in a single cycle and accept one destination value at the end of that cycle. So the register file usually has two read ports and one write port. Wider machines that issue multiple instructions per cycle need more: a four-wide superscalar processor that can issue four arithmetic instructions per cycle needs eight read ports and four write ports. Register-file design becomes one of the most demanding parts of a high-performance CPU because of this multi-ported requirement.

A small detail with large consequences: many RISC ISAs hardwire one register to the value zero. Reads from that register always return zero, and writes to it are silently discarded. This costs almost nothing in hardware but gives the assembly programmer a "free constant zero" without using any of the immediate-encoding budget, and it makes a number of common idioms more efficient. RISC-V and MIPS use register 0 for this purpose; AArch64 gets the same effect from register number 31, which most instructions interpret as the zero register (XZR or WZR). x86-64 has no zero register: none of its 16 GPRs is fixed.
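The port behaviour described in this section can be modeled in a few lines. The class below is a behavioural sketch only, assuming a RISC-V-style layout (32 registers of 64 bits, with register 0 reading as zero). The `RegisterFile` name and its method names are invented for illustration.

```python
class RegisterFile:
    """Two combinational read ports, one write port that updates on the clock edge."""

    def __init__(self, count=32, width=64):
        self.mask = (1 << width) - 1
        self.regs = [0] * count

    def read(self, addr1, addr2):
        """Both read ports deliver their values in the same cycle."""
        return self.regs[addr1], self.regs[addr2]

    def clock_edge(self, write_addr, write_data, write_enable):
        """Write port: takes effect only if enabled; writes to register 0 are discarded."""
        if write_enable and write_addr != 0:
            self.regs[write_addr] = write_data & self.mask


rf = RegisterFile()
rf.clock_edge(5, 42, True)     # normal write
rf.clock_edge(0, 99, True)     # silently discarded
rf.clock_edge(6, 7, False)     # write enable low: no effect
print(rf.read(5, 0), rf.read(6, 0))   # (42, 0) (0, 0)
```

Register 0 is never stored to, so it keeps its initial value of zero; a hardware implementation would typically not build storage for it at all.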

A schematic view of a 32-register register file with two read ports and one write port:

Figure: Register file with two read ports and one write port: five address and data inputs at the top, two read-data outputs at the bottom

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round]
  \draw[thick, fill=white] (0, -2) rectangle (10, -3.4);
  \node[align=center] at (5, -2.7) {register file\\(32 $\times$ 64-bit registers)};
  % Inputs at top
  \node at (1, 0) {read addr 1};
  \node at (3, 0) {read addr 2};
  \node at (5, 0) {write addr};
  \node at (7, 0) {write data};
  \node at (9, 0) {write enable};
  \draw[->] (1, -0.3) -- (1, -2);
  \draw[->] (3, -0.3) -- (3, -2);
  \draw[->] (5, -0.3) -- (5, -2);
  \draw[->] (7, -0.3) -- (7, -2);
  \draw[->] (9, -0.3) -- (9, -2);
  % Outputs at bottom
  \draw[->] (1, -3.4) -- (1, -4.5);
  \draw[->] (3, -3.4) -- (3, -4.5);
  \node at (1, -4.8) {read data 1};
  \node at (3, -4.8) {read data 2};
\end{tikzpicture}

The register file is the small, fast tier of storage closest to the ALU. Almost every arithmetic instruction reads two values from it and writes one back. Almost every load reads its base address from it and writes the loaded value back. Almost every store reads both its base address and the value to be stored from it. The bandwidth of the register file therefore directly limits the bandwidth of the datapath, which is one reason it is implemented in custom logic rather than as ordinary memory.

06.Special-Purpose Registers

The general-purpose register file is the most heavily used storage in the CPU, but it is not the only register storage. A working processor has a small constellation of special-purpose registers — each one dedicated to a particular role, sometimes named explicitly in the ISA, sometimes hidden from software entirely. The exact set varies by architecture, but a working architect should recognize the common ones.

The program counter and instruction register are the most fundamental of these and have already been covered above. They are special because they sit directly on the fetch path; the program counter is the address from which to fetch, and the instruction register is the value that came back.

The stack pointer holds the address of the top of the call stack. Every ISA either has a dedicated stack-pointer register (rsp on x86-64, SP on AArch64) or a software convention assigning the role to one of the GPRs (x2, written sp, on RISC-V). Push and pop instructions, where they exist, implicitly read and update the stack pointer; on architectures without explicit push/pop, the same effect is achieved by an ordinary register update plus a load or store. The stack pointer is the spine of the calling convention covered in Chapter 14.

The link register holds the return address of the current function call on architectures where call instructions write the return address into a register rather than pushing it onto the stack. AArch64 dedicates x30 (also called LR) to this purpose; RISC-V's jal writes the return address to a destination register that is conventionally x1. Function returns then jump to the value held in this register. x86-64 takes the older approach of pushing the return address onto the stack on call and popping it on ret, so it has no link register at all. The choice between the two styles affects how cheap a leaf-function call is and how easily a stack trace can be reconstructed.

The frame pointer holds the base address of the current function's stack frame, used to access local variables and saved registers at known offsets. It is sometimes a hardware-recognized register (like rbp on x86-64 in older calling conventions) and sometimes just a software convention; many modern compilers omit it entirely to free up an extra GPR.

The status register or flags register holds a small collection of single-bit values describing the most recent ALU result and the current operating mode. The arithmetic flags (zero, negative, carry, overflow) are produced as a side effect of ALU operations on architectures that have them: by most arithmetic instructions on x86-64, and only by explicitly flag-setting variants such as adds and subs on AArch64. Other bits in the status register record whether interrupts are enabled, what privilege level the CPU is running at, what the floating-point rounding mode is, and so on. The full list varies enormously: x86-64's RFLAGS defines more than a dozen flags, AArch64 splits the equivalent state across NZCV, DAIF, SPSR_*, and other system registers, and RISC-V dispenses with arithmetic flags entirely (its conditional branches compare two registers directly). The Spectre/Meltdown story of Chapter 51 is in part a story of what happens when speculative execution runs ahead of the privilege checks that this state is supposed to enforce.

The control and status registers of a privileged ISA hold the configuration of the CPU itself: the address of the page-table root, the addresses of the various exception handlers, the masks of which interrupts are enabled, the IDs of the current process and address space, and the like. These registers are not usually visible to ordinary application code; reading or writing them requires a privileged instruction such as mrs/msr on AArch64 or csrr/csrw on RISC-V. We will return to them in Chapter 15 when we discuss exceptions and traps and again in Chapter 44 for the RISC-V privileged architecture.

Finally, the floating-point and vector register files are separate banks of registers used by the floating-point and SIMD execution units. They are typically wider than the integer registers: 64 bits for double-precision scalars, 128 to 512 bits for SIMD vectors, and potentially more for ARM's Scalable Vector Extension (SVE) and the RISC-V Vector extension. They have their own read and write ports, often share the integer load–store pipeline for memory traffic, and have their own status fields recording exceptions like invalid, underflow, overflow, and inexact. We will encounter them again in Chapter 29 (data-level parallelism) and in the architecture-specific chapters of Parts VII–IX.

07.Condition Codes vs. Compare-and-Branch

A particularly important architectural choice that hides inside the discussion of flags is how the CPU expresses conditions for branching. Two main styles exist, and they shape the surrounding micro-architecture in subtle ways.

In the condition-code or flags-based style, arithmetic instructions set a small group of status bits (zero, negative, carry, overflow), either always or in explicitly flag-setting variants, and conditional branches consult those bits without doing any arithmetic of their own. To branch when two values are equal, software writes

Plain Text

cmp  rax, rbx     ; subtracts, sets flags, discards numeric result
je   target       ; branches if zero flag is set

x86-64 and AArch64 are the dominant flags-based architectures. The advantage is that several conditional branches can use the result of a single comparison, and very compact loop idioms (dec rcx; jnz top) become possible. The disadvantage is that the flags become a global piece of state that every flag-setting instruction must produce and every conditional branch must consume. Out-of-order processors have to rename the flags as if they were a register and track dependencies on them across speculation; preserving the flags across a function call requires saving and restoring them.

In the compare-and-branch style, conditional branches name their two operands directly and do their own comparison. RISC-V's beq rs1, rs2, label and MIPS's beq are canonical examples. There are no architectural arithmetic flags at all; an instruction like "branch if less than" includes its own subtractor in the branch logic. The advantage is that conditions are local to the instruction that needs them; one comparison cannot accidentally interfere with another, and renaming and dependency tracking become simpler. The disadvantage is a slightly larger instruction encoding and the loss of the very compact decrement-and-branch idiom.
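The difference between the two styles can be made concrete with a toy model. In this sketch the flags style passes through an intermediate piece of state (a dictionary standing in for the flags register), while the compare-and-branch style does not. Only the zero flag is modeled, and all function names are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1


def cmp_sets_flags(a, b):
    """Flags style, step 1: subtract, keep only the flags, discard the difference."""
    difference = (a - b) & MASK64
    return {"Z": difference == 0}


def je_taken(flags):
    """Flags style, step 2: the branch looks only at the saved flag."""
    return flags["Z"]


def beq_taken(rs1_value, rs2_value):
    """Compare-and-branch style: the branch compares its own two operands."""
    return rs1_value == rs2_value


flags = cmp_sets_flags(7, 7)     # shared state that a later instruction could overwrite
print(je_taken(flags), beq_taken(7, 7))   # True True
```

The `flags` value is exactly the global state that an out-of-order core must rename and track; `beq_taken` has no such dependency beyond its two register operands.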

Neither style is decisively better. Most modern processors include hardware tricks that paper over the differences: x86-64's macro-op fusion combines a cmp/test with a following conditional branch into a single internal operation, and AArch64's cbz/cbnz instructions provide compare-and-branch behaviour for the common case of comparing against zero. The point worth retaining is that the choice between flags and compare-and-branch is a real ISA-level decision, not just a notational one, and it propagates into every layer of the implementation.

08.Multiple Register Files and Register Banks

The single, monolithic register file pictured earlier is a simplification. Most real high-performance processors split their architectural registers across several banks for both correctness and performance.

At the architectural level, integer GPRs, floating-point registers, and vector registers are usually held in separate register files because they have different widths, different read/write rates, and different consumers. A 64-bit integer register file does not need 512-bit ports; a 512-bit vector register file does not need single-cycle access from every integer pipeline. Splitting the storage matches the ports to the demand.

Within a single register file, modern designs often use banked implementations to provide more ports at lower cost. A four-issue CPU that nominally needs eight read ports may instead have a register file split into two banks of four ports each, with logic that detects when more reads target one bank in a cycle than that bank has ports and delays the excess reads by a cycle. The detection-and-delay logic costs less area than a true eight-port file would.
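One way to picture the detection-and-delay logic is as a per-cycle port allocator. The sketch below assumes that registers are interleaved across banks by register number and that a conflict arises when a bank receives more reads than it has ports. Both are assumptions made for illustration, and real designs differ in how they assign registers to banks.

```python
from collections import Counter


def schedule_reads(read_regs, banks=2, ports_per_bank=4):
    """Split one cycle's register reads into (served now, delayed to the next cycle)."""
    used = Counter()
    served, delayed = [], []
    for reg in read_regs:
        bank = reg % banks                 # assumed interleaving of registers over banks
        if used[bank] < ports_per_bank:
            used[bank] += 1
            served.append(reg)
        else:
            delayed.append(reg)            # bank conflict: no free port this cycle
    return served, delayed


print(schedule_reads([0, 1, 2, 3, 4, 5, 6, 7]))   # evenly spread: nothing delayed
print(schedule_reads([0, 2, 4, 6, 8, 1, 3, 5]))   # five reads hit bank 0: one delayed
```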

Some architectures multiply the register file by privilege level. AArch64 provides a separate stack pointer for each exception level (SP_EL0, SP_EL1, SP_EL2, SP_EL3), so that an exception handler runs on its own stack without disturbing the interrupted thread. ARM AArch32 went further, providing banked general-purpose registers in IRQ, FIQ, and supervisor modes; the FIQ-mode banking was deliberately designed so that interrupt handlers could begin execution without saving any state. We will return to these mechanisms in Chapter 15 (exceptions) and in the AArch64 chapters of Part VIII.

The deepest extension of this idea is register renaming, in which the architectural register names visible to software are decoupled from a much larger pool of physical registers used by the hardware. Every architectural write to x1 allocates a new physical register; reads of x1 are routed to whichever physical register most recently bound the name. Renaming eliminates a class of false dependencies between independent instructions and is essential to any out-of-order processor. We will treat it carefully in Chapter 25; here it is enough to know that the simple register file of this chapter is, in a high-end processor, only the visible tier of a much more elaborate underlying structure.
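The core of the idea fits in a very small sketch: a table from architectural names to physical registers plus a free list. This is deliberately incomplete, since it never returns physical registers to the free list and has no recovery from mis-speculation, and the sizes (32 architectural, 64 physical) and the `RenameTable` name are assumptions.

```python
class RenameTable:
    """Maps architectural register names to physical register numbers."""

    def __init__(self, arch_regs=32, phys_regs=64):
        self.map = list(range(arch_regs))               # arch i starts in phys i
        self.free = list(range(arch_regs, phys_regs))   # unallocated physical registers

    def read(self, arch):
        """A source operand uses whichever physical register currently holds the name."""
        return self.map[arch]

    def write(self, arch):
        """A destination operand gets a fresh physical register."""
        phys = self.free.pop(0)
        self.map[arch] = phys
        return phys


table = RenameTable()
first = table.write(1)     # first write to x1
second = table.write(1)    # an independent, later write to x1
print(first, second, table.read(1))   # 32 33 33
```

Because the two writes to x1 land in different physical registers, an instruction still reading the first value is not disturbed by the second write, which is the false dependency that renaming removes.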

09.ALU

The arithmetic and logic unit, or ALU, is the centerpiece of the datapath. It is a combinational block that takes two operands and a small operation code and produces a result and a few status outputs. Its operations include:

arithmetic: add, subtract, sometimes increment and decrement;
logical: AND, OR, XOR, NOR, NOT;
shifts: left, right logical, right arithmetic;
comparisons: less-than, equal, greater-than (often expressed as a subtraction with the result thrown away).

A first-cut block diagram of a simple ALU:

Figure: ALU drawn as a trapezoid with operand A and operand B at the top, ALUop control on the side, and result and status flags at the bottom

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
  % ALU drawn as a trapezoid (manual polygon), origin top-left of bounding box
  % Top edge wider, bottom narrower, point down
  \draw[thick, fill=white]
    (1, -1) -- (5, -1) -- (4.5, -2) -- (4, -3) -- (2, -3) -- (1.5, -2) -- cycle;
  \node[align=center] at (3, -2) {ALU\\\footnotesize add, sub, AND, OR,\\\footnotesize XOR, shift, compare};
  % Operand A (top-left input)
  \draw[->] (1.7, 0) -- (1.7, -1);
  \node at (1.7, 0.3) {operand A};
  % Operand B (top-right input)
  \draw[->] (4.3, 0) -- (4.3, -1);
  \node at (4.3, 0.3) {operand B};
  % ALUop control (right side)
  \draw[->] (6, -1.5) -- (4.7, -1.5);
  \node[anchor=west, font=\footnotesize] at (6, -1.5) {ALUop};
  % Result (bottom-left)
  \draw[->] (2.5, -3) -- (2.5, -4);
  \node at (2.5, -4.3) {result};
  % Flags (bottom-right)
  \draw[->] (3.5, -3) -- (3.5, -4);
  \node[align=center] at (3.5, -4.5) {flags\\\footnotesize(zero, carry, neg, ovf)};
\end{tikzpicture}

Inside, the ALU is a collection of the building blocks from Chapter 4 — adders, shifters, logic gates — with multiplexers selecting which one's output to forward. The ALUop signal, generated by the control unit from the instruction's opcode and function bits, picks the operation. The result is the answer; the flags (sometimes called the condition codes) summarize properties of the result that branch instructions and conditional moves use.
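A behavioural sketch of this selection follows. It is not a gate-level model: the dictionary lookup stands in for the output multiplexer, the operation names are invented, only the zero and negative flags are produced, and taking the shift amount modulo the word width is an assumption borrowed from common RISC practice.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def to_signed(x):
    """Reinterpret a WIDTH-bit pattern as a two's-complement integer."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


# One entry per functional block; ALUop selects which output is forwarded.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: a << (b % WIDTH),              # shift left logical
    "srl": lambda a, b: a >> (b % WIDTH),              # shift right logical
    "sra": lambda a, b: to_signed(a) >> (b % WIDTH),   # shift right arithmetic
}


def alu(aluop, a, b):
    """Return (result, flags) for one combinational ALU evaluation."""
    a &= MASK
    b &= MASK
    result = OPS[aluop](a, b) & MASK
    flags = {"zero": result == 0, "neg": bool(result >> (WIDTH - 1))}
    return result, flags


print(alu("add", 2, 3))
print(alu("sub", 3, 3))
```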

The width of the ALU is the word size of the processor: 32 or 64 bits in modern designs. On a 64-bit processor, the ALU adds two 64-bit numbers in a single cycle (or two, depending on the depth of the adder). Multiplication and division are usually separate units, because their delays are too large to fit in the same cycle as ordinary addition and because they are used much less frequently. Floating-point arithmetic almost always lives in a separate floating-point unit with its own register file and its own pipeline.

A few subtleties about the ALU are worth flagging here, even though we will return to them later.

The ALU does not, by itself, distinguish signed from unsigned arithmetic for addition and subtraction; as we saw in Chapter 2, two's complement makes them the same operation. The distinction shows up only in the interpretation of the carry and overflow flags. A flags-based ISA exposes two different sets of branch instructions to consult these flags, so a "branch if less than" for unsigned numbers checks the carry, while one for signed numbers checks a combination of the negative and overflow flags.
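A worked sketch shows how one subtraction serves both interpretations. It assumes the x86 convention in which the carry flag after a subtraction means "borrow" (AArch64 uses the inverted sense), and the function names are illustrative.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def sub_flags(a, b):
    """Flags produced by computing a - b on WIDTH-bit operands."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "zero": result == 0,
        "neg": bool(result & SIGN),
        "carry": a < b,   # borrow out of the top bit (x86 sense; an assumption)
        # signed overflow: operands differ in sign and the result's sign differs from a's
        "ovf": bool((a ^ b) & (a ^ result) & SIGN),
    }


def unsigned_less(flags):
    return flags["carry"]


def signed_less(flags):
    return flags["neg"] != flags["ovf"]


f = sub_flags(MASK, 1)    # MASK is the largest unsigned value, or -1 when read as signed
print(unsigned_less(f), signed_less(f))   # False True
```

The same bit pattern compares as greater than 1 when treated as unsigned and as less than 1 when treated as signed, and the only difference is which flags the branch consults.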

The ALU is often the longest-delay block in the datapath. A 64-bit adder, even a sophisticated prefix design, takes a measurable fraction of the clock period. In a single-cycle machine, the clock period must accommodate an instruction fetch, a register-file read, an ALU operation, a data-memory access (for loads), and a register-file write, all in series. As clock frequencies climbed in the 1990s, this serial budget became unaffordable, which is part of why virtually every modern processor is pipelined.
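The serial budget is simple arithmetic, shown here for the longest path through a single-cycle machine (a load). The delay figures are hypothetical round numbers chosen only to make the sum easy to follow; they are not measurements of any real design.

```python
# Hypothetical per-block delays in picoseconds (illustrative, not measured).
delays_ps = {
    "instruction fetch": 200,
    "register read": 100,
    "ALU": 200,
    "data memory": 200,
    "register write": 100,
}

# In a single-cycle machine the clock period must cover every block in series.
period_ps = sum(delays_ps.values())
max_freq_mhz = 1e6 / period_ps      # 1 / (period in ps) expressed in MHz

print(period_ps, max_freq_mhz)      # 800 1250.0
```

Under these assumed numbers the slowest block takes 200 ps, yet the clock cannot run faster than one cycle per 800 ps, which is the gap that pipelining sets out to close.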

The ALU is typically used by far more than just arithmetic instructions. Branches use it to compare values. Loads and stores use it to compute addresses (base + offset). Even a move from one register to another might be implemented as add rd, rs, x0 — adding the source to register zero. Reusing one general-purpose ALU keeps the datapath small.

10.Load and Store Paths

The last pieces of the datapath we need are the load path and the store path, the wires and logic that connect the CPU to the data memory.

In the simple machine of this chapter, the data memory has a single port. To perform a load, the CPU drives an address onto the memory's address input and a "read" signal on its control input; on the next cycle (or, in our single-cycle model, after a delay within the same cycle) the memory drives the requested value onto its data output, and the CPU latches it into the destination register. To perform a store, the CPU drives the address, drives the data, and asserts a "write" signal; the memory updates the named location on the appropriate edge.

Figure: Memory access path: ALU output drives the data memory address, the writeback mux chooses between load data and the ALU bypass, then writes the register file

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.8cm}]
  \node[blk] (alu) at (3, -0.5) {ALU};
  \node[blk] (dm)  at (3, -2)   {D-mem};
  \node[blk] (wb)  at (3, -3.5) {writeback mux};
  \node[blk] (rf)  at (3, -5)   {register file (write)};
  \draw[->] (alu) -- (dm) node[midway, right, font=\footnotesize] {address};
  \draw[->] (dm)  -- (wb) node[midway, right, font=\footnotesize] {load data};
  \draw[->] (wb)  -- (rf);
  % bypass ALU result around D-mem to writeback mux
  \draw[->] (alu.east) -- (6.5, -0.5) -- (6.5, -3.5) -- (wb.east);
  \node[anchor=west, font=\footnotesize] at (4.5, -2) {$\leftarrow$ store data};
\end{tikzpicture}

A few small details deserve naming.

The address calculation for a load or store is performed by the ALU, just as for an arithmetic instruction. A typical addressing mode is base + immediate offset: the instruction specifies a base register and a small constant, and the address is their sum. The ALU does this addition; the result goes to the memory's address port instead of to the register-file write port.

The immediate field in a load or store instruction is fed to the ALU's right input through the immediate generator. Different ISAs use different layouts of the immediate to keep the encoding compact, but functionally the immediate is just sign-extended to the full word width and fed in alongside the base register.
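As a rough illustration of base + immediate addressing, the following Python sketch sign-extends an immediate field and adds it to a base register value. The 12-bit immediate width and the 64-bit word are assumptions made for the example, and both function names are hypothetical.

```python
WIDTH = 64                       # assumed word size
MASK = (1 << WIDTH) - 1


def sign_extend(value, bits):
    """Sign-extend a bits-wide field to WIDTH bits.

    The result is returned as an unsigned Python int holding the
    two's-complement bit pattern."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & MASK


def effective_address(base, imm_field, imm_bits=12):
    """Model the ALU add that produces a load or store address."""
    return (base + sign_extend(imm_field, imm_bits)) & MASK
```

With a base of `0x1000`, an immediate field of `0x008` gives address `0x1008`, while a field of `0xFFC` (the 12-bit encoding of -4) gives `0xFFC`.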

The writeback mux at the register file's write port selects between two possible sources: the ALU's output (for ordinary arithmetic instructions) or the data memory's output (for loads). The control unit's MemToReg signal is the select line for this mux.

The byte enables on the memory port handle loads and stores of sizes smaller than the word. A load-byte instruction reads a single byte from memory and zero- or sign-extends it to a full word. A store-byte writes only the relevant byte of the addressed word, leaving the others unchanged. The memory's interface includes a small byte-enable mask for stores, plus a few bits of control to tell loads how to extend their result.
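The byte-enable mechanism can be sketched in Python as follows. The sketch assumes an 8-byte word, little-endian byte lanes (lane 0 is the least significant byte), and a byte-enable mask with one bit per lane; the class and function names are invented for illustration.

```python
WORD_BYTES = 8                   # assumed 64-bit memory word
WORD_MASK = (1 << (8 * WORD_BYTES)) - 1


class DataMemory:
    """Word-organized memory with a per-byte write enable."""

    def __init__(self, num_words):
        self.words = [0] * num_words

    def store(self, addr, data, byte_enable):
        """Update only the byte lanes whose enable bit is set."""
        index = addr // WORD_BYTES
        old = self.words[index]
        new = 0
        for lane in range(WORD_BYTES):
            source = data if (byte_enable >> lane) & 1 else old
            new |= source & (0xFF << (8 * lane))
        self.words[index] = new

    def load_byte(self, addr, signed):
        """Read one byte and zero- or sign-extend it to a full word."""
        word = self.words[addr // WORD_BYTES]
        byte = (word >> (8 * (addr % WORD_BYTES))) & 0xFF
        if signed and byte & 0x80:
            byte |= ~0xFF & WORD_MASK   # fill the upper bits with ones
        return byte


def store_byte(mem, addr, value):
    """A store-byte: shift the byte into its lane, enable only that lane."""
    lane = addr % WORD_BYTES
    mem.store(addr, (value & 0xFF) << (8 * lane), 1 << lane)
```

Writing `0xAB` to byte address 2 of a word that holds `0x1122334455667788` changes only the third byte, leaving `0x1122334455AB7788`; loading that byte back returns `0xAB` when zero-extended and `0xFFFFFFFFFFFFFFAB` when sign-extended.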

In a real machine, the data memory is not a simple, single-cycle structure. It is the top of a memory hierarchy: a level-1 data cache backed by a level-2 cache, a last-level cache, and main memory beyond. Loads that miss in the L1 cache take many cycles to satisfy; stores are buffered and may reach the cache well after the instruction that issued them has completed. We will spend most of Part IV on these issues. The basic load and store paths sketched here, however, remain unchanged in spirit. The CPU drives an address and a request; eventually it gets a response.

11.Putting It Together: A Single-Cycle Machine

We can now draw the basic CPU one more time, with the control unit included and every block named.

Figure: Complete single-cycle CPU with control unit on top driving the datapath: PC, I-mem, decode, register file, ALU muxes, ALU, D-mem, and writeback mux

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.2cm, minimum height=0.7cm, align=center}]
  % Origin (0,0) at top-left.
  % Control unit at top spans full width
  \node[blk, minimum width=11cm] (cu) at (5.5, -0.5) {control unit (combinational)};
  \node[font=\footnotesize] at (5.5, -1.4) {ALUop / RegWrite / MemRead / MemWrite / MemToReg / ALUSrc / BranchOp};
  % Datapath row 1: PC, I-mem, IR, register file
  \node[blk] (pc)  at (1, -3) {PC};
  \node[blk] (im)  at (3.5, -3) {I-mem};
  \node[blk] (ir)  at (6.5, -3) {IR / decode};
  \node[blk] (rf)  at (9.5, -3) {register file};
  % Muxes
  \node[blk] (muxA) at (8.5, -5) {ALU mux A};
  \node[blk] (muxB) at (6, -5) {ALU mux B};
  % ALU
  \node[blk] (alu) at (7.25, -6.5) {ALU};
  \node[blk] (dm)  at (7.25, -8) {D-mem};
  \node[blk] (wb)  at (7.25, -9.5) {writeback mux};
  % PC mux
  \node[blk] (pcm) at (1, -5.5) {PC mux};
  % Connections
  \draw[->] (pc) -- (im);
  \draw[->] (im) -- (ir);
  \draw[->] (ir) -- (rf);
  \draw[->] (rf.south) -- (muxA.north);
  \draw[->] (ir.south) -- (muxB.north);
  \draw[->] (muxA.south) -- (alu.north east);
  \draw[->] (muxB.south) -- (alu.north west);
  \draw[->] (alu) -- (dm);
  \draw[->] (dm) -- (wb);
  \draw[->] (alu.east) -- (10.5, -6.5) -- (10.5, -9.5) -- (wb.east);
  \draw[->] (wb.west) -- (3, -9.5) -- (3, -3) -- (rf.west);
  \draw[->] (pcm.north) -- (pc.south);
  \draw[->] (alu.west) -- (1, -6.5) -- (pcm.south);
\end{tikzpicture}

This is, in outline, a complete simple processor. Every box is something we have already discussed. The control unit, on each cycle, looks at the instruction in the IR and produces a vector of control signals that drives every multiplexer and every enable. The datapath, in response, moves the right bits along the right wires and either writes a result back into a register or writes a value into memory.

A typical instruction passes through this picture as follows. The current PC indexes the instruction memory; the returned bits are latched into the IR at the next clock edge (in single-cycle designs, conceptually within the same cycle). The IR's fields fan out: the opcode goes to the control unit, the source-register fields to the register file's read ports, the destination-register field to the write port, the immediate field to the immediate generator. The register file reads two source values; the ALU operates on one of them and either the other source or an immediate; the result flows either to the data memory's address port (for a load or store) or to the register-file write port (for an arithmetic instruction); for loads, the memory's data is selected by the writeback mux instead. Finally, the PC is updated, either to PC+4 or to a branch target, and the cycle repeats.
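The walk-through above can be mimicked by a toy simulator. The Python sketch below is a deliberately simplified model under several assumptions: instructions are already-decoded tuples rather than bit patterns, the five opcodes are hypothetical, data memory is a dictionary keyed by address, register 0 always reads as zero, and the branch target is computed by a separate adder rather than by the main ALU. The control-signal names follow the figure.

```python
from collections import namedtuple

MASK = (1 << 64) - 1
Instr = namedtuple("Instr", "op rd rs1 rs2 imm")

# Control "unit": one row of signals per (hypothetical) opcode.
#           ALUop  RegWrite MemRead MemWrite MemToReg ALUSrc Branch
CONTROL = {
    "add":   ("add", True,  False, False, False, False, False),
    "addi":  ("add", True,  False, False, False, True,  False),
    "load":  ("add", True,  True,  False, True,  True,  False),
    "store": ("add", False, False, True,  False, True,  False),
    "beq":   ("sub", False, False, False, False, False, True),
}


def step(pc, regs, imem, dmem):
    """Execute one instruction and return the next PC."""
    instr = imem[pc // 4]                           # fetch
    (alu_op, reg_write, mem_read, mem_write,
     mem_to_reg, alu_src, branch) = CONTROL[instr.op]   # decode
    a = regs[instr.rs1]                             # register-file read
    rs2_value = regs[instr.rs2]
    b = (instr.imm & MASK) if alu_src else rs2_value    # ALU mux B
    alu_out = ((a + b) if alu_op == "add" else (a - b)) & MASK
    load_data = dmem.get(alu_out, 0) if mem_read else 0
    if mem_write:
        dmem[alu_out] = rs2_value                   # store path
    if reg_write and instr.rd != 0:                 # register 0 stays zero
        regs[instr.rd] = load_data if mem_to_reg else alu_out  # writeback mux
    taken = branch and alu_out == 0
    return pc + instr.imm if taken else pc + 4      # PC mux


def run(program):
    """Run until the PC walks off the end of the program."""
    regs, dmem, pc = [0] * 32, {}, 0
    while pc // 4 < len(program):
        pc = step(pc, regs, program, dmem)
    return regs, dmem


program = [
    Instr("addi", 1, 0, 0, 5),     # x1 = 5
    Instr("addi", 2, 0, 0, 7),     # x2 = 7
    Instr("add", 3, 1, 2, 0),      # x3 = x1 + x2
    Instr("store", 0, 0, 3, 16),   # mem[16] = x3
    Instr("load", 4, 0, 0, 16),    # x4 = mem[16]
    Instr("beq", 0, 3, 4, 8),      # if x3 == x4, skip the next instruction
    Instr("addi", 5, 0, 0, 1),     # skipped when the branch is taken
    Instr("addi", 6, 0, 0, 2),     # x6 = 2
]
```

Calling `run(program)` exercises every path in the figure: the arithmetic instructions go through the ALU and straight to writeback, the store and load use the ALU output as an address, and the branch uses the ALU's zero result to steer the PC mux.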

What we have built is functional but slow. Every instruction takes one full clock cycle, and that cycle must be long enough to accommodate the instruction-memory access, the register-file read, the ALU operation, the data-memory access, and the register-file write, all in series. The clock frequency of a single-cycle design is therefore set by the worst-case instruction. Real processors split this work into pipeline stages, so that different parts of different instructions can proceed simultaneously, and they add caches, out-of-order execution, and many other refinements on top. The basic skeleton, however, never goes away. Every CPU you will ever meet has, somewhere inside, a datapath, a control unit, a program counter, an instruction register, a register file, an ALU, and a load–store path. This chapter has named each of them.

12.Clocking and the CPU's Critical Path

The single-cycle picture suppresses one of the most important physical facts about a working processor: every block in it takes time. The instruction memory does not produce its output the instant its address arrives; it takes a few hundred picoseconds at best. The register file does not produce its read data instantly either, and neither does the ALU. The clock period of any synchronous CPU must be long enough to allow signals to propagate from one set of flip-flops, through whatever combinational logic the cycle assigns to them, and into the next set of flip-flops with their setup margin satisfied. The longest such path is the critical path, and on a single-cycle design it is essentially the entire datapath.

For the simple machine of this chapter, the critical path is the one a load instruction takes, passing through the following blocks in series:

Plain Text

PC → I-mem → IR (decode) → register-file read → ALU → D-mem → register-file write

With realistic per-block delays — say, 200 ps for the I-memory access, 150 ps for register-file read, 250 ps for a 64-bit ALU, 200 ps for the D-memory access, and 100 ps for setup at the destination flip-flop — the total is on the order of 900 ps, which limits the clock to about 1.1 GHz. Modern processors run several times faster than that, and they do so largely by attacking this path: pipelining splits it into stages each of which is much shorter, allowing a much faster clock; caches replace the slow main-memory access with a fast on-chip access; bypass networks let an instruction's result flow directly into the next instruction's ALU input without going through the register file. Each of these techniques is the subject of a later chapter; the reason to mention them here is to make explicit that the simple datapath of this chapter is also a slow datapath, and that the rest of the book is in large part the story of what to do about that.
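The arithmetic behind these figures is easy to check. The Python sketch below sums the illustrative per-block delays quoted above and converts the period to a frequency. The final lines add a rough, idealized bound for a pipelined version, assuming one stage per block and the same 100 ps setup cost charged to every stage; that assumption is ours, made only to show the direction of the effect.

```python
# Illustrative per-block delays from the text, in picoseconds.
BLOCK_DELAYS_PS = {
    "I-mem access": 200,
    "register-file read": 150,
    "ALU": 250,
    "D-mem access": 200,
}
SETUP_PS = 100                   # setup at the destination flip-flop


def max_clock_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Single-cycle: every block sits in series within one clock period.
single_cycle_period_ps = sum(BLOCK_DELAYS_PS.values()) + SETUP_PS
single_cycle_ghz = max_clock_ghz(single_cycle_period_ps)

# Idealized pipeline (assumption: one stage per block, each stage pays
# the setup cost): the period is set by the slowest stage alone.
pipelined_period_ps = max(BLOCK_DELAYS_PS.values()) + SETUP_PS
pipelined_ghz = max_clock_ghz(pipelined_period_ps)

print(f"single-cycle: {single_cycle_period_ps} ps, {single_cycle_ghz:.2f} GHz")
print(f"pipelined:    {pipelined_period_ps} ps, {pipelined_ghz:.2f} GHz")
```

Under these numbers the single-cycle period is 900 ps (about 1.11 GHz), while the idealized pipelined period is 350 ps (about 2.86 GHz), bounded by the ALU stage.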

A related issue is that not every block on a chip runs at the CPU's main clock frequency. The L2 and L3 caches typically run at a fraction of the core clock, the memory controller runs at the DRAM clock, and the I/O subsystem runs at frequencies set by external standards. Every clock domain crossing inside the CPU requires the careful synchronizer discipline introduced in Chapter 5. From the architectural picture we have built, these crossings are invisible — a load instruction either gets its data or it does not — but they are part of why the latency of a memory access is so much larger than the latency of a register-file read, and they will reappear in Part IV when we look at the memory hierarchy in detail.

13.Summary

The CPU is built from two complementary parts. The datapath is the network of registers, multiplexers, adders, and wires through which data flows; it does not, by itself, decide what to do. The control unit is the finite state machine that interprets the current instruction and produces the control signals that steer the datapath cycle by cycle. The program counter holds the address of the next instruction; the instruction register holds the instruction currently being executed; the register file holds the architectural general-purpose registers, while a constellation of special-purpose registers — stack pointer, link register, frame pointer, status/flags, and privileged control registers — fills out the architectural state, often spread across multiple banks for performance and exception handling. The ALU performs the arithmetic, logical, and shift operations on which almost every instruction relies, and the choice between condition-code and compare-and-branch styles for expressing conditional control flow shapes the surrounding micro-architecture. The load–store path connects the processor to data memory. Wrapping all of this is a synchronous clock whose period is bounded below by the longest path through the datapath — the critical path — the management of which motivates much of the architectural sophistication in later chapters. Together these blocks form a complete, if simple, processor: the kind of machine you could build on an FPGA over a long weekend, and the kind of skeleton that the rest of this book will progressively refine into the modern CPU.

Chapter 8 takes the next step, walking through what happens during a single instruction's execution in finer detail and beginning to think about how to make the cycle faster.

	cmp rax, rbx ; subtracts, sets flags, discards numeric result
	je target ; branches if zero flag is set