Part IFoundations

Digital Building Blocks

May 16, 2026·36 min read·beginner

The previous chapter showed that any Boolean function can in principle be assembled from a handful of gates. In practice, no working engineer designs a 32-bit adder by drawing thirty-two columns of NAND gates by hand. Real digital design proceeds at a higher level of abstraction, in terms of blocks: small, reusable circuits that perform a single, named function. This chapter introduces the blocks that show up most often in computer architecture. Each one is itself just a clever arrangement of gates, but once you know its truth table and its rough internal structure, you can use it as a unit in larger drawings without thinking about transistors at all.

The blocks fall into two camps. The combinational blocks — adders, comparators, multiplexers, encoders, decoders — compute pure functions of their inputs. The sequential blocks — latches, flip-flops, registers, counters — remember things. The chapter ends with the issue that ties them together: the question of when signals are sampled and how a design copes with the fact that no real wire is instantaneous.

01.Adders and Subtractors

Addition is the workhorse operation of any processor, and the adder is one of the oldest and most studied blocks in digital design. The simplest version, a half adder, takes two single-bit inputs $A$ and $B$ and produces a sum bit $S$ and a carry bit $C_{\text{out}}$ .

$A$	$B$	$S$	$C_{\text{out}}$
0	0	0	0
0	1	1	0
1	0	1	0
1	1	0	1

Reading the truth table, $S = A \oplus B$ and $C_{\text{out}} = A \cdot B$ . So a half adder is one XOR gate and one AND gate, in parallel. Its name says everything: it is half of what we really need, because it has no input for a carry coming in from a previous column.

A full adder fixes that. It takes three single-bit inputs — two operand bits and a carry-in $C_{\text{in}}$ — and produces a sum and a carry-out.

$A$	$B$	$C_{\text{in}}$	$S$	$C_{\text{out}}$
0	0	0	0	0
0	0	1	1	0
0	1	0	1	0
0	1	1	0	1
1	0	0	1	0
1	0	1	0	1
1	1	0	0	1
1	1	1	1	1

The two outputs simplify to

$S = A \oplus B \oplus C_{\text{in}}, \qquad C_{\text{out}} = A B + (A \oplus B) C_{\text{in}}.$

You can also read the carry as "the carry-out is 1 when at least two of the three inputs are 1," which is the majority function from the last chapter.

Ripple-carry addition

To add multi-bit numbers, you chain $n$ full adders together, feeding each column's carry-out into the next column's carry-in. The carry into the lowest column is set to 0 for plain addition. This arrangement is called a ripple-carry adder, because the carry literally ripples from the least significant bit up to the most significant.

Figure: 4-bit ripple-carry adder: four full adders chained right to left, with each carry-out feeding the next column's carry-in

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  fa/.style={draw, thick, fill=white, minimum width=1.4cm, minimum height=1cm}]
  % Origin (0,0) at top-left. FA3 leftmost, FA0 rightmost.
  % Inputs A_i B_i
  \node at (0.7, 0) {$A_3\ B_3$};
  \node at (2.7, 0) {$A_2\ B_2$};
  \node at (4.7, 0) {$A_1\ B_1$};
  \node at (6.7, 0) {$A_0\ B_0$};
  % FA boxes
  \node[fa] (FA3) at (0.7, -1.5) {FA};
  \node[fa] (FA2) at (2.7, -1.5) {FA};
  \node[fa] (FA1) at (4.7, -1.5) {FA};
  \node[fa] (FA0) at (6.7, -1.5) {FA};
  % Sums S_i below
  \node at (0.7, -3) {$S_3$};
  \node at (2.7, -3) {$S_2$};
  \node at (4.7, -3) {$S_1$};
  \node at (6.7, -3) {$S_0$};
  % Vertical arrows in/out
  \draw[->] (0.7,-0.3) -- (FA3.north);
  \draw[->] (2.7,-0.3) -- (FA2.north);
  \draw[->] (4.7,-0.3) -- (FA1.north);
  \draw[->] (6.7,-0.3) -- (FA0.north);
  \draw[->] (FA3.south) -- (0.7,-2.7);
  \draw[->] (FA2.south) -- (2.7,-2.7);
  \draw[->] (FA1.south) -- (4.7,-2.7);
  \draw[->] (FA0.south) -- (6.7,-2.7);
  % Carry chain (right to left)
  \draw[->] (7.7, -1.5) -- (FA0.east); \node[right] at (7.7, -1.5) {$0$};
  \draw[->] (FA0.west) -- (FA1.east);
  \draw[->] (FA1.west) -- (FA2.east);
  \draw[->] (FA2.west) -- (FA3.east);
  \draw[->] (FA3.west) -- (-0.3, -1.5); \node[left] at (-0.3, -1.5) {$C_{\text{out}}$};
\end{tikzpicture}

Ripple-carry adders are easy to draw and easy to build, and they explain why addition is fundamentally a serial operation in the worst case: the high bit of the sum cannot settle until the carry from every lower bit has finished propagating through the chain. For an $n$ -bit ripple-carry adder, the worst-case delay is roughly $n$ times the delay of one full adder. For an 8-bit number this is fine; for a 64-bit number it is unacceptably slow on a fast clock.

Faster adders

To break the carry chain, designers use carry-lookahead adders, carry-skip adders, carry-select adders, and prefix adders such as the Kogge–Stone or Brent–Kung designs. The core trick in all of them is to compute, in parallel, two signals for each bit: a generate signal $G_i = A_i B_i$ that is 1 if column $i$ produces a carry on its own, and a propagate signal $P_i = A_i \oplus B_i$ that is 1 if column $i$ would pass an incoming carry through. Once these are known, the carry into column $i+1$ can be written without waiting for the lower bits to finish:

$C_{i+1} = G_i + P_i C_i.$

Recursive expansion of this equation, together with some clever circuitry, lets the carry network settle in time proportional to $\log_2 n$ rather than $n$ . Practical 64-bit adders inside modern processors use prefix structures that finish in well under a nanosecond. We will not draw out the gate-level details — that is a topic for a digital design course — but the principle is worth keeping in mind, because it recurs throughout architecture: when you cannot make a serial chain shorter, restructure the problem to expose parallelism.

Subtraction

A separate subtractor block can be built directly from the truth table of subtraction with a borrow-in and a borrow-out, but no real machine bothers. Two's complement, introduced in Chapter 2, lets a single adder handle both addition and subtraction. To compute $A - B$ , the hardware sets the carry-in of the lowest column to 1 and feeds $\overline{B}$ instead of $B$ through the adder. Since two's-complement negation is "flip all the bits and add one," and the carry-in supplies that "add one," the result is exactly $A + (-B) = A - B$ .

Figure: Adder-subtractor: an add/sub control bit drives XOR gates on B and the adder's carry-in, so the same adder computes A plus B or A minus B

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
  \node[draw, thick, fill=white, minimum width=1.4cm, minimum height=0.8cm] (xor) at (2, -1.5) {XOR};
  \node[draw, thick, fill=white, minimum width=1.6cm, minimum height=1.2cm] (add) at (5, -1) {Adder};
  % Inputs
  \node at (0, -2) {$B$};
  \node at (0, -0.5) {$A$};
  \node at (2, 0) {add/sub};
  \node at (7, -1) {result};
  % B -> XOR
  \draw[->] (0.2, -2) -- (xor.west |- 0,-2) -- (xor.west);
  % XOR out -> adder bottom input
  \draw[->] (xor.east) -- (5, -1.5) -- (add.south);
  % A -> adder top input
  \draw[->] (0.2, -0.5) -- (add.west);
  % control -> XOR
  \draw[->] (2, -0.2) -- (xor.north);
  % control -> adder carry-in (right side top)
  \draw[->] (2, -0.2) -- (3, -0.2) -- (3, -0.5) -- (add.north -| 3,0);
  \draw[->] (3,-0.5) -- (add.north);
  \draw[->] (add.east) -- (6.5, -1);
\end{tikzpicture}

A row of XOR gates conditionally inverts $B$ . With the control bit at 0, $B$ passes through unchanged and the carry-in is 0, giving plain addition. With the control bit at 1, $B$ is inverted and the carry-in is 1, giving subtraction. One adder, one row of XORs, and one control wire — that is the entire integer arithmetic core of a simple processor.

02.Comparators

A comparator answers questions about two numbers: are they equal, is one greater than the other, less than the other? At one bit, equality is a single XNOR — the output is 1 when the bits match. At $n$ bits, equality is the AND of all $n$ XNOR outputs:

$A = B \iff \bigwedge_{i=0}^{n-1} \overline{A_i \oplus B_i}.$

Magnitude comparison is more interesting. The cleanest construction reuses the adder we already have. To check whether $A < B$ for unsigned numbers, compute $A - B$ and look at the carry-out: if there was a borrow, $A$ was smaller. For signed numbers, you have to also consider whether the subtraction overflowed, because a negative result with overflow actually means $A$ was the larger number.

In practice, modern processors fold comparison into the ALU: every arithmetic operation sets a small set of flags — zero, negative, carry, overflow — that summarize the result. A separate dedicated compare instruction is just a subtraction whose output is thrown away, used only for the flags it leaves behind. This is why x86 and ARM have so much logic devoted to flags; they are the standard interface between the arithmetic block and the branch logic that comes later.

03.Shifters and Barrel Shifters

Alongside addition, the ability to shift bits left or right is one of the most heavily used operations in any processor. A left shift by one position is multiplication by 2; a right shift by one is integer division by 2 (with the rounding rules from Chapter 2 depending on whether the shift is logical or arithmetic). More general shifts and rotates appear in cryptography, hashing, bit-field extraction, address calculation, and packing or unpacking compressed data.

The simplest shifter is a single-bit shifter, which moves every bit one position left or right and fills the vacated end with a 0 (or, for an arithmetic right shift, with the sign bit). Cascading $n$ such stages gives an $n$ -bit shift, but the resulting circuit is wasteful: it has $n$ levels of logic regardless of the shift amount, and uses $n$ separate stages even when most shifts in real code are by small constants.

The standard answer is a barrel shifter, a combinational network that shifts by any amount from 0 to $n-1$ in a single pass. The classical design uses $\log_2 n$ stages, where stage $i$ either passes its input through unchanged or shifts it by $2^i$ positions, controlled by bit $i$ of the shift amount. To shift by 13, for example, the stages enabled are those for $8 + 4 + 1$ . Each stage is a row of 2-to-1 multiplexers; the entire shifter for a 64-bit machine has six stages and 384 multiplexers. The depth grows only logarithmically with the word width, which is exactly what you want for a high-frequency processor.

Figure: Barrel shifter for a 16-bit word built as four cascaded stages, each shifting by 0 or by a power-of-two amount controlled by one bit of the shift count

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  stage/.style={draw, thick, fill=white, minimum width=3.6cm, minimum height=0.9cm}]
  \node[stage] (s0) at (0, 0)  {stage 0: shift by 0 or 1};
  \node[stage] (s1) at (0, -1.4) {stage 1: shift by 0 or 2};
  \node[stage] (s2) at (0, -2.8) {stage 2: shift by 0 or 4};
  \node[stage] (s3) at (0, -4.2) {stage 3: shift by 0 or 8};
  \draw[->] (s0.south) -- (s1.north);
  \draw[->] (s1.south) -- (s2.north);
  \draw[->] (s2.south) -- (s3.north);
  \node[anchor=west] at (2.3, 0)    {ctrl bit 0};
  \node[anchor=west] at (2.3, -1.4) {ctrl bit 1};
  \node[anchor=west] at (2.3, -2.8) {ctrl bit 2};
  \node[anchor=west] at (2.3, -4.2) {ctrl bit 3};
  \node at (-2.6, -2.1) {input $\to$};
  \node at (2.7, -5.1) {(16-bit shifter)};
\end{tikzpicture}

A funnel shifter generalizes the barrel shifter to operate on a concatenated pair of words. Conceptually, two $n$ -bit values are placed end-to-end and a $n$ -bit window is extracted from any starting position. With one trivial wiring change, the same hardware then implements left shifts, right shifts, and rotates: a rotate is just a funnel shift where both halves are the same word. Most modern ISAs expose this with a small family of shift instructions whose differences are in how the vacated bits are filled and whether the dropped bits wrap around.

For a few applications — most notably the priority encoder embedded inside floating-point normalization — the shifter is paired with a count-leading-zeros unit so that a value can be shifted into a canonical form in a single cycle.

04.Multipliers

Multiplication is, conceptually, the same long-multiplication algorithm taught in primary school: shift the multiplicand by the position of each bit of the multiplier, AND the result with that bit, and add up all the resulting partial products. For an $n$ -bit by $n$ -bit multiply, this gives $n$ partial products of width up to $2n$ bits. The hardware question is how to add them together quickly.

The most direct construction is an array multiplier, sometimes called a shift-and-add multiplier, in which the partial products are added by a regular two-dimensional array of full adders. The layout is highly regular and easy to draw, but its delay grows linearly with the word width, and a 64-bit multiplier built this way is too slow for a high-frequency processor.

Faster designs attack the problem in two ways. Booth's algorithm and its modern radix-4 variant reduce the number of partial products by recoding adjacent bits of the multiplier so that runs of 1s become a single subtraction followed by a single addition. Radix-4 Booth encoding cuts the partial-product count roughly in half, at the cost of some signed-versus-unsigned book-keeping.

The surviving partial products are then summed with a Wallace tree or Dadda tree: a network of carry-save adders that reduce three operands to two at every level, finishing in time logarithmic in the number of operands. A final fast adder — a Kogge–Stone or similar prefix structure — collapses the last two rows into a single product. The total depth of a 64-bit multiplier built this way is on the order of a dozen carry-save levels plus one final adder, fitting comfortably into a single pipeline stage on a modern processor or two stages on a more conservative one.

For very small word widths, designers often choose a different point on the curve and build a serial multiplier that processes one bit of the multiplier per cycle. A serial multiplier needs only a single adder and a shift register, but takes $n$ cycles per multiplication. The tradeoff is the classic one between area and time, and the right choice depends on whether multiplication is a frequent operation or a rare one.

05.Division and Other Iterative Operations

Division is harder than multiplication. The schoolbook long division algorithm has no straightforward parallel form: each digit of the quotient depends on the remainder produced by all the higher-order digits. Most ISAs accept this and implement integer division as a multi-cycle iterative operation, with one or a few quotient bits produced per cycle. Two algorithm families dominate.

Restoring and non-restoring division mimic the long-division algorithm in hardware. At each step, the divisor is subtracted from the current remainder; depending on the sign, the result is either kept or rolled back, and the quotient bit is set accordingly. A 64-bit divide takes 64 cycles in the simplest design; radix-4 and radix-8 variants produce two or three quotient bits per cycle in exchange for a more complex per-step decision.

SRT division — named after Sweeney, Robertson, and Tocher — generalizes this idea by allowing each quotient digit to be drawn from a redundant set including negative values, which lets the per-step subtraction be approximate and therefore faster. SRT was the algorithm at the heart of the famous 1994 Pentium FDIV bug: a handful of entries in a quotient-selection lookup table were transcribed incorrectly, producing wrong results for a small fraction of divisions and costing Intel roughly half a billion dollars in recall expenses. Division algorithms are not exotic curiosities; they are part of the production microarchitecture of every machine you will use.

For floating-point division and square root, modern processors often use Newton–Raphson or Goldschmidt iteration: an initial approximation, drawn from a small lookup table, is refined by a quadratically converging iteration that reduces the error by a factor each step. Three or four iterations are enough for double-precision accuracy. The arithmetic at each step is dominated by multiplications, which the chip already has fast hardware for, so the overall latency can be lower than a fully iterative SRT divider despite needing more arithmetic per step.

The broader pattern — iterative operations that take many cycles, share hardware with simpler operations, and are sequenced by a small finite-state machine — also covers integer remainder, square root, and various transcendental functions when they are not handed off to a separate floating-point unit.

06.The Arithmetic Logic Unit

The operations covered so far — addition, subtraction, comparison, shifting, and the bitwise Booleans from the previous chapter — are usually packaged into a single combinational block called the arithmetic logic unit, or ALU. An ALU has two data inputs (often called $A$ and $B$ ), an operation-select control input, and a single data output, plus the small set of flag outputs we mentioned for comparators.

Inside, an ALU is the cleanest possible application of the multiplexer: each candidate operation is wired up in parallel, and a multiplexer driven by the operation-select picks one to forward to the output.

Figure: ALU internals: adder, logic unit, and shifter run in parallel on A and B, and an op-select multiplexer forwards the chosen result to the output

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  unit/.style={draw, thick, fill=white, minimum width=2cm, minimum height=0.7cm}]
  \node at (0, 0) {$A$};  \node at (0, -1) {$B$};
  \node[unit] (add) at (3, 0.3) {Adder/Sub};
  \node[unit] (log) at (3, -0.7) {Logic};
  \node[unit] (sh)  at (3, -1.7) {Shifter};
  \node[draw, thick, fill=white, minimum width=1.2cm, minimum height=2.4cm] (mux) at (6, -0.7) {MUX};
  \node at (8, -0.7) {result};
  \draw[->] (0.3, 0) -- (add.west);
  \draw[->] (0.3, 0) -- (log.west);
  \draw[->] (0.3, 0) -- (sh.west);
  \draw[->] (0.3, -1) -- (add.west);
  \draw[->] (0.3, -1) -- (log.west);
  \draw[->] (0.3, -1) -- (sh.west);
  \draw[->] (add.east) -- (mux.west |- add);
  \draw[->] (log.east) -- (mux.west |- log);
  \draw[->] (sh.east)  -- (mux.west |- sh);
  \draw[->] (mux.east) -- (7.7, -0.7);
  \node at (6, 0.9) {op-select};
  \draw[->] (6, 0.6) -- (mux.north);
\end{tikzpicture}

The whole structure is purely combinational: given fixed inputs and a fixed operation-select, the output settles to the corresponding result after the worst-case path through the slowest selected unit. The flags are produced as a side effect of the adder/subtractor and a small amount of post-processing. Larger ALUs add multiplication, division, count-leading-zeros, and bit-manipulation extensions; smaller ones omit the shifter entirely and rely on a separate shift instruction sequence. Every general-purpose processor in this book has an ALU at its heart, and Chapter 7 will show how it sits at the centre of the datapath.

07.Multiplexers and Demultiplexers

A multiplexer, or mux, is a digital switch. It has $2^k$ data inputs, $k$ select inputs, and one output. The select inputs choose which data input is routed to the output.

The simplest case is the 2-to-1 mux: two data inputs $A$ and $B$ , one select $S$ , and an output $Y$ .

$S$	$Y$
0	$A$
1	$B$

The Boolean expression is

$Y = \overline{S} \cdot A + S \cdot B,$

which costs two ANDs, one OR, and one inverter. A 4-to-1 mux has two select bits and four data inputs; an 8-to-1 mux has three select bits and eight inputs; in general, a $2^k$ -to-1 mux is built either from a tree of 2-to-1 muxes (depth $k$ ) or as a single layer of AND-OR logic.

The mux is the most heavily used block in computer architecture. Every register file, every cache, every datapath has muxes selecting between sources. When you read in Chapter 7 that the ALU's left input is "either the register file output or the program counter," what the hardware actually contains is a mux with the two as data inputs and a control bit chosen by the decoder. Most of what we draw later as a single line on a block diagram is, on the chip, a wide mux with a few control signals.

A demultiplexer, or demux, is the reverse operation. One data input is routed to one of $2^k$ outputs, with the others driven to 0. Demuxes are less common as standalone blocks, because the same job is usually done by a mux on the receiving side or by enable signals on the destinations, but they show up in memory write paths and in network-style interconnects.

A useful idiom: an $n$ -input mux with a constant 1 on every data input and an enable on its output is functionally equivalent to a $k$ -to- $2^k$ decoder, which leads us neatly to the next block.

08.Encoders and Decoders

A decoder takes a $k$ -bit binary input and asserts exactly one of its $2^k$ output lines, the one whose index matches the input. A 2-to-4 decoder, for example, has inputs $A_1 A_0$ and outputs $Y_0 Y_1 Y_2 Y_3$ , with

$Y_i = 1 \iff A_1 A_0 = i.$

Decoders are the standard way to translate an address into a select line. Inside a register file, the register-number field of an instruction goes into a decoder whose outputs are the read-enable lines for the individual registers. Inside a memory, the address goes into a decoder whose outputs are the row-select lines for the memory cells. Wherever you see "this 4-bit field selects one of sixteen things," there is a 4-to-16 decoder doing the work.

A real decoder usually has an additional enable input that, when low, forces all outputs to 0. Enable inputs let designers chain small decoders into larger ones: an 8-to-256 decoder can be built from a 3-to-8 decoder choosing among eight 5-to-32 decoders, with the 3-to-8 decoder's outputs serving as enable signals.

An encoder is the inverse: $2^k$ inputs and $k$ outputs, where the output is the binary index of whichever input is high. A plain encoder behaves badly if more than one input is high, so practical designs use a priority encoder, which produces the index of the highest-numbered (or lowest-numbered, depending on convention) input that is high, and an extra "any input high" output to distinguish a real result from "all zeros."

Priority encoders show up in interrupt controllers, where many devices may simultaneously request service and the controller has to pick one to deliver to the CPU. They also appear in floating-point hardware, where finding the position of the leading 1 in a number is part of normalizing the result. Most ISAs expose this as a count leading zeros instruction, which is a priority encoder with a little extra arithmetic.

09.Latches and Flip-Flops

We now cross from combinational to sequential logic. A latch or flip-flop is the basic 1-bit memory cell, the element that lets a circuit remember things from one moment to the next.

The SR latch

The most primitive memory element is the SR latch, two cross-coupled NOR gates introduced briefly in the last chapter:

Figure: SR latch with two cross-coupled NOR gates, showing the S and R inputs, the Q and Q-bar outputs, and the feedback taps that close the loop

LaTeX

\begin{tikzpicture}[circuit logic US,
  every circuit symbol/.append style={thick, draw, fill=white,
    minimum height=0.7cm, minimum width=0.85cm},
  font=\small, line cap=round]

  \node[nor gate] (N1) at (4, -1) {};
  \node[nor gate] (N2) at (4, -3) {};

  \draw (N1.input 1) -- ++(-1.5, 0) node[left] {$S$};
  \draw (N2.input 2) -- ++(-1.5, 0) node[left] {$R$};

  \draw (N1.output) -- ++(2, 0) node[right] {$Q$};
  \draw (N2.output) -- ++(2, 0) node[right] {$\overline{Q}$};

  \coordinate (tap1) at ($(N1.output)+(0.5,0)$);
  \filldraw (tap1) circle (1.2pt);
  \draw (tap1) -- ($(tap1)+(0,-1.5)$) -| (N2.input 1);

  \coordinate (tap2) at ($(N2.output)+(0.5,0)$);
  \filldraw (tap2) circle (1.2pt);
  \draw (tap2) -- ($(tap2)+(0,1.5)$) -| (N1.input 2);
\end{tikzpicture}

A pulse on $S$ forces $Q$ to 1; a pulse on $R$ forces $Q$ to 0; with both inputs at 0, the latch holds whichever state it was last driven to. The bare SR latch has the unpleasant property that asserting $S$ and $R$ simultaneously produces a forbidden state, and it has no notion of when to listen — any glitch on $S$ or $R$ is captured. Both flaws are addressed by adding a clock.

The gated D latch

The gated D latch has a single data input $D$ and a clock or enable input $E$ . While $E$ is high, the output $Q$ follows $D$ — the latch is transparent. While $E$ is low, the latch holds whatever value $D$ had at the moment $E$ went low. The forbidden state is gone, and the timing is at least partly controlled.

A D latch is enough for some applications but is awkward for sequential logic that uses feedback. If a latch's output feeds back into a circuit whose output drives the same latch's input, then while $E$ is high the loop is closed and the value can race around uncontrollably. Synchronous design, the topic of Chapter 5, almost universally avoids transparent latches for general-purpose state.

The edge-triggered D flip-flop

The dominant memory element in modern design is the edge-triggered D flip-flop. It has the same external interface as a D latch — a data input $D$ , a clock input, and an output $Q$ — but it samples $D$ only at the instant of a clock edge, typically the rising edge. The rest of the time, the output is held constant regardless of how $D$ wiggles. The classic implementation is a master–slave pair: two latches in series, one transparent on the clock low phase and the other on the clock high phase, so that the value flowing through them is captured exactly once per cycle.

The behavior is summarized by a timing diagram rather than a truth table:

Figure: D flip-flop timing diagram: Q samples D on each rising clock edge and holds that value until the next edge

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  % Origin at top-left. Time goes right.
  % Labels
  \node[anchor=east] at (-0.2, 0) {clock};
  \node[anchor=east] at (-0.2, -1.5) {$D$};
  \node[anchor=east] at (-0.2, -3) {$Q$};
  % Clock waveform: 4 cycles, period 1.4, hi at top y=0.4 lo at y=-0.4
  \draw[thick] (0,-0.4) -- (0,0.4) -- (0.7,0.4) -- (0.7,-0.4) -- (1.4,-0.4) -- (1.4,0.4) -- (2.1,0.4) -- (2.1,-0.4) -- (2.8,-0.4) -- (2.8,0.4) -- (3.5,0.4) -- (3.5,-0.4) -- (4.2,-0.4) -- (4.2,0.4) -- (4.9,0.4) -- (4.9,-0.4) -- (5.6,-0.4);
  % D waveform: low until t=1.6, high until t=3.2, low after
  \draw[thick] (0,-1.9) -- (1.6,-1.9) -- (1.6,-1.1) -- (3.2,-1.1) -- (3.2,-1.9) -- (5.6,-1.9);
  % Q: samples on rising edges at t=0,1.4,2.8,4.2; reflects D at sampling time
  % rising-edge values: t=0:D=0, t=1.4:D=0, t=2.8:D=1, t=4.2:D=1 -> Q=0,0,0,1,1
  \draw[thick] (0,-3.4) -- (2.8,-3.4) -- (2.8,-2.6) -- (5.6,-2.6);
  % rising-edge marks
  \draw[dashed,gray] (0,0.6) -- (0,-3.6);
  \draw[dashed,gray] (1.4,0.6) -- (1.4,-3.6);
  \draw[dashed,gray] (2.8,0.6) -- (2.8,-3.6);
  \draw[dashed,gray] (4.2,0.6) -- (4.2,-3.6);
  \node[font=\tiny] at (2.8,-3.9) {rising edge};
\end{tikzpicture}

There are also JK flip-flops and T flip-flops with slightly different control inputs, and there are flip-flops with set and reset inputs that override the clock for initialization. In contemporary digital design, the D flip-flop is by far the most common; the others are mostly historical or used in specialized counters.

Setup, hold, and metastability

A flip-flop does not capture data instantaneously. For correct operation, the data input must be stable during a window around the clock edge. The data must arrive at least a setup time $t_{\text{setup}}$ before the edge, and it must remain steady for at least a hold time $t_{\text{hold}}$ after the edge. If either is violated, the flip-flop can enter a metastable state in which its output hovers between 0 and 1 for an unpredictable period before resolving.

These three timing parameters — setup, hold, and the clock-to-output delay $t_{\text{cq}}$ — are the entire reason a digital chip has a maximum clock frequency. We will return to them at the end of this chapter and again in Chapter 5.

10.Registers and Counters

A register is just $n$ flip-flops sharing a common clock, treated as a unit. An 8-bit register stores an 8-bit value; a 64-bit register stores a 64-bit value. Almost every register in a processor's register file is a bank of D flip-flops with read and write ports built around them. There is nothing more to a register than that.

Two variants of register show up often.

A shift register is a chain of flip-flops where each one's output feeds the next one's input. On every clock edge, the value shifts by one position. Shift registers are the natural way to implement serial communication (taking parallel data and feeding it out one bit at a time, or vice versa) and to implement multiplication and division by powers of two.

Figure: Four-stage shift register: a chain of D flip-flops driven by a common clock, with the input shifting one position to the right each cycle

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  ff/.style={draw, thick, fill=white, minimum width=1.1cm, minimum height=1cm}]
  \node at (-0.5, 0) {in};
  \node[ff] (FF1) at (1, 0) {FF};
  \node[ff] (FF2) at (3, 0) {FF};
  \node[ff] (FF3) at (5, 0) {FF};
  \node[ff] (FF4) at (7, 0) {FF};
  \node at (8.5, 0) {out};
  \draw[->] (-0.2, 0) -- (FF1.west);
  \draw[->] (FF1.east) -- (FF2.west);
  \draw[->] (FF2.east) -- (FF3.west);
  \draw[->] (FF3.east) -- (FF4.west);
  \draw[->] (FF4.east) -- (8.2, 0);
  \node[font=\footnotesize] at (4, -1.5) {(all driven by the same clock)};
\end{tikzpicture}

A counter is a register whose value increments (or decrements) on each clock edge. The simplest counter is a register whose input is its own output plus 1, computed by an adder. More clever designs save gates by using flip-flops with built-in toggle behavior, but the principle is the same. Counters are everywhere — they implement program counters (the address of the next instruction), loop counters in hardware controllers, prescalers in timers, and refresh counters in DRAM.

A particularly common pattern is the counter with parallel load, which can either increment its value or accept a new value from an external source. The program counter in any processor is exactly this: most of the time it increments, but on a branch or jump it is loaded with a new address. Internally it is a register, an adder hardwired to add the instruction width, and a 2-to-1 mux selecting between the incremented value and the branch target — the same handful of blocks we have already met.

11.Tri-State Buffers and Buses

All the gates discussed so far have outputs that are always actively driving either a 0 or a 1. If two such outputs were wired together and disagreed, one would try to pull the wire low while the other tried to pull it high, and the result would be a short circuit through both gates. Connecting outputs together is therefore forbidden in ordinary logic.

A tri-state buffer relaxes this restriction. It has a data input, an output-enable control, and a data output that takes one of three values: 0, 1, or high-impedance (often written Z). When the output-enable is asserted, the buffer drives its output to match its input. When it is deasserted, the buffer disconnects its output entirely, neither pulling high nor low. Many tri-state outputs can safely share a wire as long as the system guarantees that at most one of them has its enable asserted at any given time.

Figure: Tri-state buffer symbol with a data input, an output-enable control on top, and a shared bus wire that the buffer drives only when enabled

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
  \draw[thick] (0,0) -- (0.6, 0.4) -- (0.6, -0.4) -- cycle;
  \draw[->] (-0.6, 0) -- (0,0);
  \draw[->] (0.6, 0) -- (1.4, 0);
  \draw[->] (0.3, 0.6) -- (0.3, 0.2);
  \node[anchor=east] at (-0.6, 0) {in};
  \node[anchor=west] at (1.4, 0) {bus wire};
  \node[anchor=south] at (0.3, 0.6) {OE};

This trick is how a shared bus works. A bus is a set of wires that many devices want to read from and write to. Each device connects to the bus through a tri-state buffer, and a small piece of arbitration logic ensures that only one driver is enabled at any moment. Memory chips, peripheral controllers, and entire backplanes were historically organized this way.

Inside modern integrated circuits, the role of tri-state buses has shrunk dramatically. Long internal tri-state lines are slow, hard to time, and difficult to test, so designers prefer dedicated unidirectional wires combined with multiplexers — logically equivalent but easier for synthesis tools to analyze. Tri-state buffers survive at chip boundaries, where bidirectional pads connect to off-chip buses, and in a few specialized internal structures such as register-file read ports. Even where they are no longer used, they remain part of the standard vocabulary because external standards like memory and I/O buses are described in tri-state terms.

A closely related idea is the open-drain (or open-collector) output, which can actively pull low but never actively pulls high; an external resistor pulls the wire high when no driver is asserting it. Open-drain wires implement a wired-AND: any driver pulling low forces the wire low, and the wire is high only when every driver is releasing it. This is the standard convention on shared signaling lines such as the interrupt-request lines on early PCs and the data line of the I²C bus.

12.Parity Generators and Checkers

The error-detecting and error-correcting codes from Chapter 2 do not appear by magic; they are computed by small combinational blocks that surround every storage and transport path that needs them.

The simplest is a parity generator: an XOR tree that takes $n$ data bits and produces a single parity bit equal to their exclusive-OR. A balanced tree finishes in $\log_2 n$ levels of two-input XOR gates, producing the parity bit of a 64-bit word in six gate delays. A matching parity checker XORs the received data and the received parity bit together; the result is 0 when no single-bit error is present and 1 when one has occurred.

More sophisticated codes — Hamming and SECDED in caches and main memory, CRC on packets and storage blocks — are computed by structurally similar blocks: networks of XOR gates, sometimes augmented with shift registers for serial CRC computation. The point worth retaining is that error coding adds a fixed amount of latency and a small amount of silicon to whichever path it protects, and that this overhead is the price of the reliability guarantees Chapter 2 introduced.

13.Memory Arrays as Building Blocks

We have so far treated memory as a vague upstream resource that holds programs and data. At the building-block level, memory is just a regular array of single-bit storage cells with an address decoder choosing which row to read or write. Three flavours appear so often that they deserve names now.

A register file is a small, fast memory built from flip-flops, with one or more read ports and one or more write ports. The address decoder is a binary-to-one-hot decoder of the kind we saw earlier; a multiplexer selects the chosen row's contents onto each read-port output. A 32-entry, 64-bit register file with two read ports and one write port is a standard ingredient of every modern processor and is described in detail in Chapter 7.

A static random-access memory (SRAM) array uses a denser cell, typically six transistors, in which a pair of cross-coupled inverters holds one bit. SRAM is slower than a register file because the cell drives a longer wire and is read through a sense amplifier rather than a full-strength buffer, but it is much denser — a million bits of SRAM fit in roughly the area of a few thousand flip-flops. Caches are built from SRAM, as are many smaller on-chip buffers.

A read-only memory (ROM) array stores fixed values determined at fabrication or programming time. Each cell is little more than a wire that is either present or absent at the intersection of a row and column line. ROMs are used for boot firmware, microcode, and any lookup table whose contents do not change at runtime.

DRAM, flash, and the other technologies that hold the bulk of a system's data are organized similarly but with very different cell structures. We will treat them at length in Chapter 18. At the building-block level, the abstraction is the same: an address goes in through a decoder, and a value comes out through a sense network. The important point now is that memory, like the adder and the multiplexer, is just another digital block, with a defined latency and a defined interface, that the rest of the design sits on top of.

14.Timing, Delay, and Synchronization

The last topic of the chapter is the one that ties everything together and that beginners most often get wrong: digital signals do not change instantaneously, and the design has to cope with that fact.

Propagation delay

Every gate has a propagation delay, the time between an input change and the corresponding output change. In modern silicon this is on the order of tens of picoseconds per gate. The delay accumulates along a path: a chain of ten gates has roughly ten times the delay of one. A combinational network's worst-case path — the longest sequence of gates from any input to any output — is called its critical path.

Real signals are also messy during their transitions. While a circuit is settling toward its final value, intermediate signals can flip back and forth, producing glitches. A glitch on the input of an asynchronous latch can be captured as a real value, which is one of the reasons designers prefer synchronous flip-flops.

The clock-period equation

In a synchronous design, every state element is driven by the same clock signal. Each clock cycle works as follows: at the rising edge, every flip-flop captures its input. The captured values propagate through the combinational logic between flip-flops. Before the next rising edge, the new values must arrive at the inputs of the downstream flip-flops, settle, and meet the setup-time requirement.

This gives the most important inequality in digital design:

$T_{\text{clk}} \geq t_{\text{cq}} + t_{\text{logic, max}} + t_{\text{setup}} + t_{\text{skew}}.$

In words, the clock period must be at least the time for the source flip-flop to produce its output ( $t_{\text{cq}}$ ), plus the worst-case combinational delay between flip-flops ( $t_{\text{logic, max}}$ ), plus the destination flip-flop's setup requirement, plus a margin for the clock skew — the small differences in arrival time of the clock at different flip-flops. The reciprocal of $T_{\text{clk}}$ is the maximum clock frequency.

There is a corresponding inequality for hold time:

$t_{\text{cq}} + t_{\text{logic, min}} \geq t_{\text{hold}} + t_{\text{skew}}.$

This one says the fastest path through the logic must still be slow enough that data does not race past the destination flip-flop before its hold-time window closes. Hold violations cannot be fixed by slowing the clock down; they require physical changes to the circuit, often by inserting deliberate delay.

These two inequalities are the source of essentially every speed and timing concern in synchronous design, and Chapter 5 will treat them as the basic discipline of RTL.

Synchronization across clock domains

A real chip almost never has just one clock. The CPU runs at one frequency, the memory controller at another, the I/O at a third, and various peripherals at frequencies determined by external standards. When a signal crosses from one clock domain to another, simply wiring it across is dangerous. The receiving flip-flop sees the signal change at an arbitrary time relative to its own clock, and is at risk of metastability.

The standard remedy is a synchronizer, two flip-flops in series in the destination clock domain. The first flip-flop may go metastable on a given edge, but it has nearly a full clock period to resolve before the second flip-flop samples its output. With reasonable parameters, the probability of the synchronizer producing a metastable output to the rest of the design becomes astronomically small — typical mean time between failure numbers are millions of years. Even so, synchronizers are only safe for single-bit control signals; multi-bit data crossing clock domains needs a proper asynchronous FIFO or handshake protocol, because two synchronizers in parallel can resolve at different clock edges and produce inconsistent values.

These rules will not concern us much in the rest of the book, since most discussion of architecture happens within a single clock domain. But every real machine you will meet has dozens of clock domains internally, and a chunk of every chip's silicon area is devoted to handling them correctly. It is worth knowing that the problem exists and that there is a standard solution, even if we will not work through the details.

15.Summary

The digital building blocks of this chapter are the vocabulary in which the rest of computer architecture is written. Adders combine bits with carries, and a single adder, augmented with a row of XORs, performs both addition and subtraction. Comparators reduce to subtractions whose results are interpreted through arithmetic flags. Barrel and funnel shifters slide bits left, right, or around in logarithmic depth, and full-blown multipliers and dividers extend the same idea to fast partial-product reduction and iterative quotient generation. Packaged together, these arithmetic units form the ALU at the heart of every processor. Multiplexers select among inputs and are the most ubiquitous block in any datapath. Decoders translate addresses into select lines; encoders summarize where a 1 has appeared in a wide bus. Tri-state buffers and open-drain outputs let many drivers share a single wire, although on-chip designs now prefer multiplexers for the same job. Parity generators, CRC engines, and similar XOR-tree blocks compute the error-detecting and error-correcting bits that make storage and transport reliable. Latches and flip-flops are the elementary memory cells, and groups of flip-flops form registers, shift registers, counters, register files, SRAMs, and ROMs. Tying all of this together is the discipline of timing: every gate has a delay, every flip-flop has setup and hold requirements, and the clock period must be long enough to honor them all on the worst-case path.

In Chapter 5 we will assemble these blocks into the framework of synchronous design, in which a single clock orchestrates the entire chip and finite state machines describe its behavior at a level higher than gates and lower than instructions.

Book mode

	\begin{tikzpicture}[font=\small, >=Stealth, line cap=round]
	\draw[thick] (0,0) -- (0.6, 0.4) -- (0.6, -0.4) -- cycle;
	\draw[->] (-0.6, 0) -- (0,0);
	\draw[->] (0.6, 0) -- (1.4, 0);
	\draw[->] (0.3, 0.6) -- (0.3, 0.2);
	\node[anchor=east] at (-0.6, 0) {in};
	\node[anchor=west] at (1.4, 0) {bus wire};
	\node[anchor=south] at (0.3, 0.6) {OE};