Digital Design Fundamentals
May 16, 2026·35 min read·beginner
The previous chapter introduced the building blocks: adders, multiplexers, flip-flops, and the timing rules that govern them. This chapter is about how those blocks are organized into entire chips. The leap from a single flip-flop to a working processor is not a leap of new physics; it is a leap of discipline. Modern digital design works because the industry settled, decades ago, on a small set of conventions that make complex chips tractable to design, simulate, and verify. Without those conventions, every clock edge would be a fresh adventure.
We will start with the central convention — synchronous design — and build outward to clocks and resets, finite state machines, the register-transfer-level style in which real designs are actually written, the simulation methodology that supports them, the hardware description languages that express them, and finally the two physical fabrics on which they run: FPGAs and ASICs.
01. Synchronous Design
A digital system is synchronous when every state-holding element in it is updated by the same clock signal, and all combinational logic is sandwiched between flip-flops driven by that clock. Stated this baldly, the rule sounds restrictive, but it is the foundation on which essentially every commercial chip is built. Synchronous design is the reason a billion-transistor processor can be designed by humans at all.
The reason for the discipline becomes clear when you consider the alternative. In an asynchronous circuit, signals can change at any time, race through unpredictable amounts of logic, and arrive at memory elements at any moment. Reasoning about correctness then requires accounting for every possible interleaving of input changes, every glitch on every internal wire, and every hazard introduced by differing path delays. Verification becomes intractable; small errors cause intermittent failures that show up only on certain silicon, at certain temperatures, or after certain workloads.
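Synchronous design eliminates this entire class of problems by imposing a simple contract. State changes only at clock edges. Between edges, the combinational logic has a defined budget of time to produce its result, set by the clock period. As long as no path through the combinational logic exceeds that budget, the design behaves predictably regardless of glitches, gate-delay variations, or input timing. The complex problem of timing is reduced to one inequality per path:

$$T_{clk} \ge t_{cq} + t_{comb} + t_{setup}$$

Here $T_{clk}$ is the clock period, $t_{cq}$ the clock-to-output delay of the launching flip-flop, $t_{comb}$ the delay of the combinational path, and $t_{setup}$ the setup time of the capturing flip-flop.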
A static-timing-analysis tool can check this inequality on every path through the design (millions of them, in a real chip) without ever simulating a single waveform. The tool's verdict is, with reasonable assumptions about manufacturing variation and operating conditions, definitive. This is why design teams do not sign off a chip until timing closes: static timing analysis offers a guarantee about correctness, and nobody is willing to give it up.
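As a worked instance of the per-path check, take the usual setup form of the inequality, with $t_{cq}$ the launching flip-flop's clock-to-output delay, $t_{comb}$ the combinational delay, and $t_{setup}$ the capturing flip-flop's setup time. The numbers are illustrative assumptions, not figures from any real process: a 250 ps clock period, $t_{cq} = 40$ ps, and $t_{setup} = 30$ ps.

$$
\begin{aligned}
T_{clk} &\ge t_{cq} + t_{comb} + t_{setup} \\
250\ \text{ps} &\ge 40\ \text{ps} + t_{comb} + 30\ \text{ps} \\
t_{comb} &\le 180\ \text{ps}
\end{aligned}
$$

Under these assumptions, a path whose logic settles in 150 ps has 30 ps of positive slack, while a path that needs 200 ps misses by 20 ps and has to be shortened, pipelined, or run at a slower clock.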
The price of synchronous design is a slight loss of theoretical performance. A perfectly tuned asynchronous circuit could, in principle, run faster than the worst-case clock period, because each path would proceed at its own pace. In practice, no one knows how to design and verify such circuits at scale, and the lost performance is more than recovered by the architectural cleverness — pipelining, parallelism, deep speculation — that synchronous design makes feasible.
A few small parts of a real chip are deliberately not synchronous. Reset logic, asynchronous FIFOs that bridge clock domains, and certain analog interfaces all step outside the discipline. They are designed by specialists, treated with great care, and isolated from the rest of the chip behind synchronizers. Everything else — the entire datapath, the register file, the cache, the branch predictor — is synchronous.
02. Clocking and Resets
If the clock is the heartbeat of a synchronous chip, the clock distribution network is its circulatory system. The clock signal must arrive at hundreds of thousands or millions of flip-flops, and ideally at all of them at the same instant. In reality, the differences in arrival time, called clock skew, are unavoidable and must be budgeted for.
Clock distribution
A naive design with a single buffer driving every flip-flop would fail immediately: the buffer would be too weak to drive the load, the wires would be too long, and the skew across the chip would dwarf the clock period. Real chips use elaborate clock trees that fan the signal out through a hierarchy of progressively smaller buffers, with the wires laid out so that every leaf in the tree is roughly the same physical distance from the root. High-end designs sometimes use clock meshes instead, a grid of strong drivers that average out the skew at the cost of higher power.
Skew comes in two flavors. Useful skew is intentional and can actually relax timing on critical paths by giving certain flip-flops a slightly later clock. Uncontrolled skew is the variation that remains after all the careful balancing, and it eats directly into the clock period. A chip running at 4 GHz has a clock period of 250 picoseconds; if 30 of those picoseconds are lost to skew, fully 12 percent of the period is gone before any logic has done anything.
Almost every modern chip has more than one clock. A processor core may run at several gigahertz, the memory controller at a fraction of that, and the various I/O blocks at frequencies dictated by external standards. The boundary between two such clock domains, called a clock domain crossing, requires synchronizers, as we saw at the end of the last chapter. Verifying these crossings is an entire subdiscipline of digital design, with its own dedicated tools.
Reset
When a chip first powers on, the values inside its flip-flops are random. To get the design into a known state, every flip-flop has a reset input that, when asserted, forces the flip-flop to a known value — usually 0, but sometimes 1 for flip-flops where the natural idle state is 1.
There are two styles of reset, and the choice between them matters.
A synchronous reset is sampled along with the data input on the clock edge. It behaves like an ordinary input to the flip-flop and timing closes on it the same way. The drawback is that it requires the clock to be running before reset can take effect, which is not always the case during power-on.
An asynchronous reset takes effect immediately, regardless of the clock. It can force the flip-flop to its reset value even when no clock is present, which makes it robust during power-up. The drawback is that releasing an asynchronous reset is tricky: if the reset signal goes inactive too close to a clock edge, different flip-flops in the design may exit reset on different cycles, leaving the chip in an inconsistent state.
The standard compromise is asynchronous-assert, synchronous-deassert reset. The reset is asserted asynchronously to guarantee that every flip-flop is forced into its known state immediately, and the release of reset is passed through a synchronizer so that all flip-flops emerge from reset on the same clock edge. This idiom, in some form, is used by virtually every commercial design.
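The asynchronous-assert, synchronous-deassert idiom can be sketched in a few lines of Verilog. This is a minimal illustration rather than a production reset controller: the module names `reset_sync` and `counter_arst`, the active-low polarity, and the two-stage depth are all assumptions made for the example.

```verilog
// Reset bridge: asserts asynchronously, releases synchronously (illustrative sketch).
//   rst_n_in  : raw active-low reset, may change at any time
//   rst_n_out : conditioned active-low reset for the rest of the clock domain
module reset_sync (
    input  wire clk,
    input  wire rst_n_in,
    output wire rst_n_out
);
    reg [1:0] sync;   // two-stage synchronizer

    always @(posedge clk or negedge rst_n_in) begin
        if (!rst_n_in)
            sync <= 2'b00;            // assert immediately, no clock required
        else
            sync <= {sync[0], 1'b1};  // release ripples through two flip-flops
    end

    assign rst_n_out = sync[1];
endmodule

// A downstream register that uses the conditioned reset.
module counter_arst (
    input  wire       clk,
    input  wire       rst_n,   // connect to rst_n_out of reset_sync
    output reg  [7:0] count
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            count <= 8'd0;
        else
            count <= count + 8'd1;
    end
endmodule
```

Because `rst_n_out` only rises in response to a clock edge, every flip-flop it drives sees the release with a full cycle of margin, provided the reset tree itself meets timing.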
Reset distribution has its own timing problem, often more severe than the clock's, because reset is usually a slower signal driven across the whole chip. Modern designs typically have a tree of reset synchronizers at appropriate points, much like clock buffers, and verification checks that reset reaches every flip-flop in a controlled order.
03. Finite State Machines
Many digital systems are most naturally described not in terms of bit-level equations but in terms of modes or states. A traffic light is in a green, yellow, or red state; a memory controller is idle, reading, writing, or refreshing; a processor's pipeline is in an active or stalled condition. The mathematical framework that captures this style of design is the finite state machine, or FSM.
An FSM consists of a finite set of states, an input alphabet, an output alphabet, a transition function that says which state to enter next given the current state and the input, and an output function that says what to produce. In hardware, the current state is held in a state register — a row of flip-flops — and two combinational blocks, the next-state logic and the output logic, surround it.
Figure: FSM structure: next-state logic and output logic surround a state register, with current state fed back to the next-state logic and inputs entering on the left
\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=1cm, align=center}]
% Origin (0,0) at top-left.
\node[blk] (ns) at (4, -1) {next-state logic};
\node[blk] (reg) at (4, -3.5) {state register};
\node[blk] (out) at (8.5, -3.5) {output logic};
\node at (0, -1) {inputs};
\node at (12, -3.5) {outputs};
\draw[->] (0.6, -1) -- (ns.west);
\draw[->] (ns.south) -- (reg.north) node[midway, right] {D};
\draw[->] (reg.east) -- (out.west) node[midway, above] {state};
\draw[->] (out.east) -- (11.4, -3.5);
% feedback state -> ns left
\draw[->] (reg.west) -- (1.5, -3.5) -- (1.5, -1) -- (ns.west);
\end{tikzpicture}

Two flavors of FSM are distinguished by where their outputs come from. In a Moore machine, the outputs depend only on the current state. In a Mealy machine, the outputs depend on both the current state and the inputs. Moore outputs are slower to react (they change only after the state register updates), but they are immune to glitches on the input, because the combinational logic only sees a stable state. Mealy outputs can react in the same cycle as the input change but are at risk of glitching as the inputs settle. A practical design often mixes the two, choosing whichever is appropriate for each output signal.
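The difference is easiest to see when one small machine offers both kinds of output. The sketch below is a hypothetical rising-edge detector (the module and signal names are invented for illustration, and `sig` is assumed to be already synchronous to `clk`). Its Mealy output fires in the cycle the input rises; its Moore output fires one cycle later but comes straight from a flip-flop.

```verilog
module edge_detect (
    input  wire clk,
    input  wire reset,        // synchronous, active high
    input  wire sig,          // input being watched
    output wire pulse_mealy,  // high in the same cycle that sig rises
    output wire pulse_moore   // high one cycle later, driven by a flip-flop
);
    reg prev;   // state bit: value of sig at the previous clock edge
    reg rose;   // state bit: a rising edge was captured at the last clock edge

    always @(posedge clk) begin
        if (reset) begin
            prev <= 1'b0;
            rose <= 1'b0;
        end else begin
            prev <= sig;
            rose <= sig & ~prev;
        end
    end

    assign pulse_mealy = sig & ~prev;  // function of state and input
    assign pulse_moore = rose;         // function of state only
endmodule
```

Any glitch on `sig` passes straight through to `pulse_mealy`, whereas `pulse_moore` can only change at a clock edge.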
A small example will fix the idea. Consider a coin-operated machine that releases a product after receiving two coins. Define three states:
- Idle ($S_0$): waiting for the first coin.
- One ($S_1$): one coin received.
- Vend ($S_2$): two coins received; release product, then return to Idle.
The state diagram looks like this:
Figure: State diagram for a three-state vending controller: S0 to S1 to S2 on each coin, S2 back to S0 on reset or vend, and a self-loop on S0 when no coin arrives
\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
state/.style={draw, thick, circle, fill=white, minimum size=1.1cm}]
% Origin: top-left. States laid out horizontally.
\node[state] (S0) at (1, -2) {$S_0$};
\node[state] (S1) at (4, -2) {$S_1$};
\node[state] (S2) at (7, -2) {$S_2$};
% S0 -> S1
\draw[->] (S0) -- node[above] {coin} (S1);
% S1 -> S2
\draw[->] (S1) -- node[above] {coin} (S2);
% S2 -> S0 (curved below)
\draw[->] (S2) to[bend left=40] node[below] {reset / vend} (S0);
% S0 self-loop above
\draw[->] (S0) to[out=120, in=60, looseness=8] node[above] {no coin} (S0);
\end{tikzpicture}

The corresponding state register holds two bits, encoding $S_0$, $S_1$, and $S_2$ (for instance as 00, 01, and 10). The next-state logic reduces to a small Boolean expression. The output, vend, is asserted when the state is $S_2$. This sort of description maps almost literally onto a few lines of HDL, as we will see shortly.
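One possible Verilog rendering of the coin machine follows. It is a sketch under stated assumptions: the encoding 00/01/10, the synchronous active-high `reset`, the one-cycle `coin` pulse, and the module name `vend_fsm` are choices made for the example, and a coin that arrives during the Vend state is simply ignored.

```verilog
module vend_fsm (
    input  wire clk,
    input  wire reset,   // synchronous, active high
    input  wire coin,    // high for one cycle per coin
    output wire vend
);
    localparam [1:0] S0 = 2'b00,  // Idle
                     S1 = 2'b01,  // One
                     S2 = 2'b10;  // Vend

    reg [1:0] state, next_state;

    // State register
    always @(posedge clk) begin
        if (reset) state <= S0;
        else       state <= next_state;
    end

    // Next-state logic
    always @(*) begin
        next_state = state;                // default: stay put
        case (state)
            S0: if (coin) next_state = S1;
            S1: if (coin) next_state = S2;
            S2: next_state = S0;           // vend for one cycle, then idle
            default: next_state = S0;      // recover from the unused code 11
        endcase
    end

    // Output logic (Moore): depends on the state only
    assign vend = (state == S2);
endmodule
```

The three blocks correspond one-to-one with the state register, next-state logic, and output logic of the FSM structure described earlier.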
State assignment, the choice of which bit pattern represents which state, has a real effect on the size and speed of the resulting hardware. Binary encoding uses $\lceil \log_2 N \rceil$ flip-flops for $N$ states, which is compact but produces complex next-state logic. One-hot encoding uses one flip-flop per state, with exactly one bit set at any time, which is wasteful in flip-flops but produces simple, fast next-state and output logic. Gray encoding, where adjacent states differ in a single bit, can reduce switching activity and is sometimes preferred for low-power designs. Modern synthesis tools usually pick an encoding automatically based on the optimization goal.
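To make the trade-off concrete, here is the coin-operated machine from the earlier example written with a hand-coded one-hot state register. The module name and the bit assignment are assumptions for illustration. In practice a designer would normally write symbolic states and let the synthesis tool choose the encoding.

```verilog
module vend_fsm_onehot (
    input  wire clk,
    input  wire reset,   // synchronous, active high
    input  wire coin,    // high for one cycle per coin
    output wire vend
);
    // One flip-flop per state:
    //   state[0] = Idle, state[1] = One, state[2] = Vend
    reg [2:0] state;

    always @(posedge clk) begin
        if (reset) begin
            state <= 3'b001;                                      // start in Idle
        end else begin
            state[0] <= (state[0] & ~coin) | state[2];            // stay idle, or return after vend
            state[1] <= (state[0] &  coin) | (state[1] & ~coin);  // first coin, or still waiting
            state[2] <=  state[1] &  coin;                        // second coin
        end
    end

    // The output is a single state bit: no decoding logic at all.
    assign vend = state[2];
endmodule
```

Three flip-flops are used instead of two, but each next-state bit is a two-level expression and the output needs no decoder, which is the speed advantage the text describes.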
Real chips contain thousands of FSMs, large and small. The processor's main control unit, the cache controller, the memory arbiter, the interrupt controller, the USB transceiver — every one of them is one or more FSMs. Mastery of FSM design is the first transferable skill of a digital designer.
04. RTL Design
The level of abstraction at which most digital design happens today is register-transfer level, almost universally abbreviated RTL. The name describes the model: a digital system is viewed as a collection of registers, with combinational logic between them describing how data flows from one register to the next on each clock cycle.
The RTL view sits between two extremes. It is higher than the gate level, because the designer does not specify individual gates; the synthesis tool does that. It is lower than purely behavioral or algorithmic descriptions, because the designer does specify, in cycle-by-cycle detail, where every value lives at every clock edge. The contract is precise: an RTL design says exactly which flip-flops hold which values at the start of each cycle, and exactly what combinational function computes the next state. The synthesis tool is then free to choose any gate-level implementation that produces the same behavior and meets timing.
A classic RTL pattern is a structure like this, written in a generic pseudo-HDL:
```verilog
always @(posedge clk) begin
    if (reset)
        count <= 0;            // synchronous reset
    else if (enable)
        count <= count + 1;    // increment only when enabled
end
```
This block describes a register count that resets to 0 and otherwise increments by 1 on each clock edge when enabled. The synthesis tool turns it into the appropriate flip-flops, an adder, a multiplexer to choose between the new value and the held value, and the routing required to connect them. The designer never mentions any of those gates explicitly.
A complete RTL design is an interconnection of such blocks, typically organized into modules with clearly defined inputs and outputs. The art lies in choosing the right partition of the design into modules, deciding which signals are wires (combinational) and which are registers (sequential), and writing each block in a style the synthesis tool can interpret unambiguously.
There are a few stylistic rules that experienced designers follow without thinking. Combinational logic and sequential logic are kept in separate blocks: one type of always block for flip-flop updates, another for combinational equations. State machines are written with a clear structure: a state register block, a next-state logic block, and an output logic block. Asynchronous loops — combinational paths that close on themselves without a flip-flop — are forbidden. Latches, the transparent kind that follow their input while a control signal is high, are forbidden in the main datapath, because they are easy to inadvertently produce and they break static timing analysis. Adhering to these conventions is what makes a design synthesizable, simulatable, and ultimately fabricatable.
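The latch rule is the one most often broken by accident, so a small example may help. In the sketch below (module and signal names are hypothetical), the combinational block assigns `y` a default before the `case` statement. Without that default, the `sel == 2'b11` case would leave `y` unassigned, and the synthesis tool would infer a latch to hold the old value.

```verilog
module mux3 (
    input  wire [1:0] sel,
    input  wire [7:0] a, b, c,
    output reg  [7:0] y
);
    // Combinational block: every path through it assigns y,
    // so the tool builds a multiplexer and no latch.
    always @(*) begin
        y = 8'd0;            // default covers sel == 2'b11
        case (sel)
            2'b00: y = a;
            2'b01: y = b;
            2'b10: y = c;
        endcase
    end
endmodule
```

Blocking assignments (`=`) in the combinational block and non-blocking assignments (`<=`) in clocked blocks are the usual way to keep the two kinds of logic visibly separate.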
It is worth pausing to appreciate how dramatic an abstraction RTL is. A modern processor core might be described in a few hundred thousand lines of RTL, which the synthesis tool turns into tens or hundreds of millions of gates. The leverage from one line of HDL to thousands of transistors is what allows a small team to design a complete CPU. None of it works without the synchronous discipline of the previous section — RTL is, at its core, a notation for describing what the synchronous discipline allows.
05. Datapath and Control
A particularly important structuring principle inside RTL designs is the separation of datapath from control. The datapath is the collection of registers, multiplexers, ALUs, shifters, memories, and the wires between them — the parts of the design that carry values from one place to another and operate on them. The control logic, usually one or more finite state machines, decides on each cycle which multiplexer selects which input, which register loads its value, and which functional unit produces a useful result. Datapath answers the question what is connected to what; control answers what should happen this cycle.
Figure: Datapath and control split: the control FSM issues control signals down to the datapath and receives status flags back, while data flows horizontally through the datapath
\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
blk/.style={draw, thick, fill=white, minimum width=3cm, minimum height=1cm, align=center}]
\node[blk] (ctrl) at (0, 0) {control FSM};
\node[blk] (dp) at (0, -2.4) {datapath\\(regs, ALU, mux)};
\draw[->] (ctrl.south) -- (dp.north) node[midway, right] {control signals};
\draw[->] (dp.west) -- ($(dp.west)+(-1.4,0)$) |- (ctrl.west) node[midway, left, align=right] {status\\(flags, valid)};
\draw[->] ($(dp.east)+(1.4,0)$) -- (dp.east) node[midway, above] {data in};
\draw[->] (dp.east) -- ($(dp.east)+(1.4,0)$) node[midway, below] {};
\node[anchor=west] at ($(dp.east)+(0.2,-0.4)$) {data out};
\end{tikzpicture}

This split is more than tidy bookkeeping. It localizes change. A bug in arithmetic shows up in the datapath; a bug in sequencing shows up in the control. Optimization decisions (widening a path, adding a forwarding network, gating a clock) affect one side or the other but rarely both. Verification benefits as well: the datapath is largely combinational and lends itself to formal equivalence checking, while the control is sequential and benefits from FSM-aware coverage and assertions.
A related discipline is the valid/ready handshake that connects modules with their own internal control. The producer raises a valid signal when it has data to offer; the consumer raises a ready signal when it can accept; the data transfers on cycles where both are high. Variants of this two-wire protocol — AXI-Stream on ARM systems, TileLink on RISC-V, Avalon-ST on Altera FPGAs — dominate modern on-chip interconnects, and they all rely on the same separation of data flow from control flow that this section names.
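The handshake rule (data moves only on cycles where valid and ready are both high) can be illustrated with a single register stage that sits between a producer and a consumer. This is a generic sketch, not the signalling of AXI-Stream or any other named standard. The module name `vr_stage` and its port names are assumptions.

```verilog
module vr_stage #(
    parameter WIDTH = 8
) (
    input  wire             clk,
    input  wire             reset,      // synchronous, active high
    // upstream (producer) side
    input  wire             in_valid,
    output wire             in_ready,
    input  wire [WIDTH-1:0] in_data,
    // downstream (consumer) side
    output reg              out_valid,
    input  wire             out_ready,
    output reg  [WIDTH-1:0] out_data
);
    // A new item can be accepted when the register is empty
    // or when its current item is being taken this cycle.
    assign in_ready = ~out_valid | out_ready;

    always @(posedge clk) begin
        if (reset) begin
            out_valid <= 1'b0;
        end else if (in_ready) begin
            out_valid <= in_valid;      // goes empty if nothing is offered
            if (in_valid)
                out_data <= in_data;    // transfer: in_valid && in_ready
        end
    end
endmodule
```

When the consumer stalls (`out_ready` low) while the stage is full, `in_ready` drops and the stored item is held unchanged, so no data is lost or duplicated.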
06. Pipeline Registers and Retiming
A timing inequality from the previous chapter showed that the clock period is bounded below by the longest combinational path between two flip-flops. The straightforward way to allow a higher clock frequency is therefore to break that path: insert a flip-flop in the middle, splitting one long stage into two shorter ones. The flip-flops that perform this role are called pipeline registers, and the resulting structure is a pipeline.
A simple two-stage pipeline that performs $A + B$ followed by an OR with a constant might look like this:
Figure: Two-stage pipeline computing A plus B then OR with a constant, with pipeline registers separating the ADD and OR combinational blocks
\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
reg/.style={draw, thick, fill=white, minimum width=0.8cm, minimum height=1.2cm},
blk/.style={draw, thick, fill=white, minimum width=1.4cm, minimum height=1cm}]
\node[reg] (R1) at (0, 0) {R};
\node[blk] (ADD) at (2, 0) {ADD};
\node[reg] (R2) at (4, 0) {R};
\node[blk] (OR) at (6, 0) {OR};
\node[reg] (R3) at (8, 0) {R};
\draw[->] (-1,0) -- (R1.west);
\draw[->] (R1.east) -- (ADD.west);
\draw[->] (ADD.east) -- (R2.west);
\draw[->] (R2.east) -- (OR.west);
\draw[->] (OR.east) -- (R3.west);
\draw[->] (R3.east) -- (9, 0);
\end{tikzpicture}

The pipeline doubles the number of cycles a value spends in the design but cuts the longest stage roughly in half, allowing the clock to run roughly twice as fast. The throughput (results per unit of wall-clock time) nearly doubles even though the latency for any single value gets slightly worse.
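A short calculation shows why the gain is "nearly" double and why latency gets worse. Assume, purely for illustration, that the unpipelined logic takes 1000 ps, that it splits evenly into two 500 ps stages, and that each register adds 100 ps of overhead (clock-to-output delay plus setup time).

$$
\begin{aligned}
T_{1} &= 1000 + 100 = 1100\ \text{ps}, & \text{latency}_{1} &= 1100\ \text{ps} \\
T_{2} &= 500 + 100 = 600\ \text{ps}, & \text{latency}_{2} &= 2 \times 600 = 1200\ \text{ps} \\
\frac{\text{throughput}_{2}}{\text{throughput}_{1}} &= \frac{T_{1}}{T_{2}} = \frac{1100}{600} \approx 1.83
\end{aligned}
$$

The register overhead is paid once per stage, which is what keeps the speedup below 2 and pushes the latency from 1100 ps up to 1200 ps.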
This chapter does not develop pipelining as a processor design technique; that is the subject of Chapter 22. The point here is narrower: at the RTL level, a pipeline is just an ordinary synchronous design with extra flip-flops, and balancing the lengths of the resulting stages is a routine optimization called retiming. Modern synthesis tools can perform retiming automatically, moving flip-flops across combinational logic to balance stage delays without changing the design's externally visible cycle behavior.
The one subtlety pipelining adds at this level is that data passing through a pipeline must be accompanied by control and valid signals that travel through their own pipeline registers in lockstep. A single missing flip-flop in the control path will leave a data item arriving at the wrong consumer on the wrong cycle, and the bug is rarely caught by simple stimulus.
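A sketch of the two-stage ADD/OR pipeline with a valid bit travelling alongside the data is shown below. The module name, the 8-bit width, and the `MASK` constant are assumptions made for the example. The input register from the figure is omitted, on the assumption that `a` and `b` already come from flip-flops upstream.

```verilog
module add_or_pipe #(
    parameter [7:0] MASK = 8'h0F    // hypothetical constant for the OR stage
) (
    input  wire       clk,
    input  wire       reset,        // synchronous, active high
    input  wire       in_valid,
    input  wire [7:0] a, b,
    output reg        out_valid,
    output reg  [7:0] result
);
    reg       s1_valid;   // valid bit that accompanies s1_sum
    reg [7:0] s1_sum;     // pipeline register between ADD and OR

    always @(posedge clk) begin
        // Datapath registers need no reset: the valid bits qualify them.
        s1_sum <= a + b;            // stage 1: ADD (wraps at 8 bits)
        result <= s1_sum | MASK;    // stage 2: OR with the constant

        // Control path: exactly as many registers as the data path.
        if (reset) begin
            s1_valid  <= 1'b0;
            out_valid <= 1'b0;
        end else begin
            s1_valid  <= in_valid;
            out_valid <= s1_valid;
        end
    end
endmodule
```

Deleting the `s1_valid` register and driving `out_valid` directly from `in_valid` would reproduce the bug the text warns about: `out_valid` would be raised one cycle before the matching `result` is available.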
07. Simulation and Testbenches
A design that has not been simulated has not been verified. Hardware is far less forgiving than software: there is no debugger attached to a fabricated chip, no printf to insert in a wire, and certainly no opportunity to recompile and try again after the masks are made. The cost of taping out a flawed design at modern process nodes is measured in millions of dollars and many months of schedule. Catching bugs in simulation, before silicon, is therefore not optional.
A simulator is a program that takes an HDL description of a design, together with a description of stimulus to apply to it, and computes the resulting waveforms. The simulator advances a virtual clock, evaluates the design's logic at each event, and records the values of every signal over time. The output is typically a waveform file that the designer can open in a viewer, scroll through, and inspect for correctness.
The companion to the design is the testbench, a separate HDL module written specifically to exercise the design under test. A testbench has no inputs and no outputs of its own; it is purely an environment. Inside, it instantiates the design, generates a clock, applies stimulus to the design's inputs, and observes its outputs. A good testbench also checks the outputs automatically, comparing them against expected values and reporting any discrepancies, rather than leaving the designer to eyeball waveforms.
A skeleton testbench in pseudo-HDL might look like this:
```verilog
`timescale 1ns/1ps              // makes #5 mean 5 ns

module tb_counter;
    reg        clk = 0;
    reg        reset, enable;
    wire [7:0] count;

    // Design under test (the counter module itself is defined elsewhere)
    counter dut(.clk(clk), .reset(reset),
                .enable(enable), .count(count));

    always #5 clk = ~clk;       // 10 ns period: 100 MHz clock

    // Stimulus: hold reset, release it, then enable counting for 20 cycles
    initial begin
        reset = 1; enable = 0;
        #20  reset  = 0;
        #10  enable = 1;
        #200 enable = 0;
        #50  $finish;
    end

    // Automatic check: 20 enabled cycles must never push count past 20
    always @(posedge clk)
        if (count > 8'd20)
            $display("ERROR: count out of range: %d", count);
endmodule
```

The level of sophistication of testbenches has grown enormously over the past two decades. Early testbenches were ad hoc collections of stimuli; modern verification environments built on the Universal Verification Methodology (UVM) layer object-oriented infrastructure on top of SystemVerilog to provide constrained-random stimulus generation, scoreboards that automatically check outputs, functional coverage that measures which behaviors have been exercised, and reusable verification components for common interfaces. A serious processor verification environment may be as large as the design itself.
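The idea of a scoreboard can be shown in miniature without any UVM machinery. The sketch below pairs a small counter with a testbench that keeps its own reference value and compares the two every cycle. The `counter` module is an assumed implementation consistent with the earlier `count` example, and `tb_counter_check`, `expected`, and `errors` are names invented here.

```verilog
`timescale 1ns/1ps

// Design under test: an 8-bit counter with synchronous reset and enable.
module counter (
    input  wire       clk,
    input  wire       reset,
    input  wire       enable,
    output reg  [7:0] count
);
    always @(posedge clk) begin
        if (reset)       count <= 8'd0;
        else if (enable) count <= count + 8'd1;
    end
endmodule

module tb_counter_check;
    reg        clk = 0;
    reg        reset, enable;
    wire [7:0] count;
    reg  [7:0] expected;      // reference model of the counter
    integer    errors = 0;

    counter dut (.clk(clk), .reset(reset), .enable(enable), .count(count));

    always #5 clk = ~clk;     // 100 MHz clock

    // Reference model, kept separate from the DUT.
    always @(posedge clk) begin
        if (reset)       expected <= 8'd0;
        else if (enable) expected <= expected + 8'd1;
    end

    // Scoreboard: compare on the falling edge, after both have settled.
    always @(negedge clk) begin
        if (count !== expected) begin
            errors = errors + 1;
            $display("MISMATCH at %0t: count=%0d expected=%0d",
                     $time, count, expected);
        end
    end

    initial begin
        reset = 1; enable = 0;
        #20  reset  = 0;
        #10  enable = 1;
        #200 enable = 0;
        #50;
        if (errors == 0) $display("PASS: final count = %0d", count);
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

In a real environment the reference model would be written independently of the design, often at a higher level of abstraction, so that the same misunderstanding is unlikely to appear in both.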
Beyond ordinary simulation, several other techniques contribute to verification. Linting tools statically check the HDL for stylistic mistakes and likely bugs. Formal verification uses mathematical proof techniques to show that certain properties hold for all possible inputs, rather than just the ones the testbench happens to exercise. Equivalence checking confirms that the synthesized gate-level netlist behaves identically to the source RTL. Emulation maps the design onto a large array of FPGAs to simulate at near-real speed, useful for booting an entire operating system before silicon arrives. Each of these techniques addresses a class of bugs that simulation alone cannot reach economically.
The economics of verification are stark. In a typical chip project, verification consumes more engineer-time than design, often by a factor of two or three. The investment is justified by the alternative: a single respin of a complex chip can cost tens of millions of dollars and slip a product by half a year. Time spent in simulation is the cheapest form of insurance available.
08. HDLs: Verilog, SystemVerilog, and VHDL
A hardware description language, or HDL, is the medium in which RTL designs are written. Three matter in practice today.
Verilog
Verilog was created in the 1980s by Phil Moorby at Gateway Design Automation, originally as a simulation language. It was standardized as IEEE 1364 and has gone through several revisions since. Verilog's syntax is C-like, and its semantics revolve around two kinds of construct: continuous assignments to wires, which describe combinational logic, and procedural blocks (always blocks) that describe sequential or combinational behavior depending on their sensitivity list.
A small Verilog example, an 8-bit register with synchronous reset and enable:
```verilog
module reg8(
    input  wire       clk,
    input  wire       reset,
    input  wire       enable,
    input  wire [7:0] d,
    output reg  [7:0] q
);
    // Clocked block: q changes only on the rising edge of clk
    always @(posedge clk) begin
        if (reset)
            q <= 8'd0;      // synchronous reset has priority
        else if (enable)
            q <= d;         // load new data; otherwise q holds its value
    end
endmodule
```

Verilog earned its place in the industry through simplicity and concision. Its weaknesses (a permissive type system, ambiguous semantics in corners of the language, and a limited toolkit for verification) are well known and motivated its successor.
SystemVerilog
SystemVerilog, standardized as IEEE 1800 in 2005, is a substantial extension of Verilog. It adds stronger types (a single logic type in place of the weakly typed reg/wire distinction), packed and unpacked structures, enumerations, interfaces that bundle signals into named groups, classes for object-oriented testbench code, constrained-random generation, functional coverage, and assertions. The design portion of SystemVerilog is mostly a cleanup and extension of Verilog; the verification portion is essentially a new language layered on top.
In contemporary practice, design code is written in a SystemVerilog dialect that uses a few of the new features (especially logic, enum, and interface) but mostly looks like cleaned-up Verilog. Verification code uses the full language, including classes and constrained-random features. Most commercial simulators and synthesis tools accept SystemVerilog as a matter of course, and Verilog as a subset.
VHDL
VHDL, the VHSIC Hardware Description Language, was developed in the 1980s under contract from the U.S. Department of Defense and was standardized as IEEE 1076. Its syntax is Ada-like — verbose, strongly typed, and explicit. The same register in VHDL is several times longer to write than in Verilog, but the type system catches a class of errors at compile time that Verilog allows through to simulation.
```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity reg8 is
    port (
        clk    : in  std_logic;
        reset  : in  std_logic;
        enable : in  std_logic;
        d      : in  std_logic_vector(7 downto 0);
        q      : out std_logic_vector(7 downto 0)
    );
end entity;

architecture rtl of reg8 is
    signal q_int : std_logic_vector(7 downto 0);
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                q_int <= (others => '0');
            elsif enable = '1' then
                q_int <= d;
            end if;
        end if;
    end process;

    q <= q_int;
end architecture;
```

VHDL is dominant in European industry, in defense and aerospace work, and in many academic curricula. Verilog and SystemVerilog dominate in the United States and in most of the commercial semiconductor industry, particularly in processor design. There is no compelling technical case for one over the other; the choice is mostly a matter of regional and organizational tradition.
Newer alternatives
Several newer languages — Chisel, SpinalHDL, Bluespec, Amaranth, and others — attempt to raise the level of abstraction further by embedding hardware description in a general-purpose programming language (Scala, Python) and using the host language's expressiveness to generate Verilog. They have a real foothold in academic and open-source projects, particularly the RISC-V community, and a growing but still modest presence in industry. We will mention them again when we discuss RISC-V in Part IX.
09. FPGA and ASIC Basics
Once a design is written in HDL and verified in simulation, it has to run on something physical. The two main choices are an FPGA and an ASIC, and the difference between them shapes the entire economics of the project.
ASICs
An ASIC, application-specific integrated circuit, is a chip designed and fabricated for a particular purpose. The design's logic, after synthesis, is mapped onto a layout of transistors that is etched into silicon at a foundry. The result is a chip that does exactly what it was designed to do, as fast and efficiently as the process technology allows.
The ASIC flow is roughly: write RTL, simulate, synthesize to a gate-level netlist, place and route the netlist onto a die, perform static timing analysis, generate a set of photolithographic masks, and send the masks to a foundry. The foundry fabricates wafers, dices them into individual chips, and ships them back to be packaged and tested.
The strengths of ASICs are immense. Performance is the highest available; power per operation is the lowest; chip area is the smallest. Almost every commercial processor — CPUs, GPUs, mobile SoCs, custom AI accelerators — is an ASIC.
The weaknesses are also immense, and they are economic rather than technical. Mask sets at modern process nodes cost millions of dollars, and a single bug can require new masks. The process is one-shot: once a chip is fabricated, it cannot be modified. The path from RTL to working silicon takes many months. ASICs are appropriate for products with high enough volume, or high enough performance demands, that the up-front cost can be amortized.
FPGAs
An FPGA, field-programmable gate array, is a chip whose logic is reconfigurable. Internally, an FPGA is a sea of small programmable elements called lookup tables (LUTs), each able to implement any Boolean function of a few inputs, together with flip-flops, dedicated arithmetic hardware, embedded memory blocks, and a programmable interconnect that wires them together. A configuration bitstream, generated by FPGA-specific tools from the same RTL that an ASIC flow would consume, programs the LUTs, the flip-flops, and the interconnect to realize the desired design.
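A lookup table is simply a tiny memory addressed by its inputs, which is why it can realize any Boolean function of those inputs. The behavioral model below illustrates the idea for four inputs. It is a teaching sketch, not any vendor's primitive, and the names `lut4`, `INIT`, and `addr` are assumptions.

```verilog
// Behavioral model of a 4-input lookup table (illustrative only).
module lut4 #(
    parameter [15:0] INIT = 16'h8000   // truth table; 16'h8000 is a 4-input AND
) (
    input  wire [3:0] addr,            // the four logic inputs
    output wire       out
);
    wire [15:0] truth = INIT;          // 16 entries, one per input combination

    // The inputs select one bit of the truth table.
    assign out = truth[addr];
endmodule
```

Changing only the parameter changes the function: with `INIT = 16'h6996` the same hardware computes the XOR (odd parity) of its four inputs. In a real FPGA those 16 bits come from the configuration bitstream.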
The strengths of FPGAs mirror the weaknesses of ASICs. A new bitstream can be loaded in seconds, so a bug found after deployment is often fixable in the field. Up-front cost is dominated by the price of the FPGA chip itself, with no mask costs. The path from RTL to a working prototype is hours, not months.
The weaknesses mirror the ASIC's strengths. An FPGA implementation is typically several times slower than the same design in an ASIC, consumes substantially more power, and uses chip area much less efficiently because the programmable fabric carries an overhead of routing and configuration that a fixed layout does not need. FPGAs are appropriate for low-volume products, for systems whose requirements change over time (network switches, software-defined radio, signal processing pipelines), and as prototypes of designs that will eventually become ASICs.
The two technologies are not strictly competitors. Almost every ASIC processor design is first prototyped on an FPGA — sometimes a large array of FPGAs working together — to validate the design at near-real speeds and to bring up the operating system and software stack before committing to silicon. The same RTL, with minor differences in the choice of memories and clocks, runs on both targets.
Other technologies
A few other physical realizations are worth mentioning briefly. Structured ASICs sit between FPGAs and full-custom ASICs, offering a mostly fixed layout customized only in the upper metal layers, which dramatically reduces mask costs. CPLDs (complex programmable logic devices) are smaller, simpler cousins of FPGAs, often used for board-level glue logic. eFPGAs are FPGA blocks embedded inside an otherwise fixed ASIC, giving the best of both: a high-performance fixed substrate with a small reconfigurable region for late changes or customer-specific features. None of these change the picture in any fundamental way; they are points along a spectrum from "completely fixed" to "completely programmable."
10. Power in Digital Design
Until now we have measured a circuit's quality almost entirely in speed and area. The third axis, power, is at least as important and increasingly the dominant constraint at the leading edge of process technology. Chapter 52 will treat power, thermal, and physical limits in depth; this section introduces the bare minimum needed to read RTL with power in mind.
In CMOS, total power has two parts. Dynamic power is consumed each time a signal switches, charging or discharging the parasitic capacitance on its wire. To a first approximation,
$$P_{\text{dyn}} = \alpha \, C \, V_{DD}^{2} \, f$$
where $\alpha$ is the average switching activity, $C$ is the capacitance being switched, $V_{DD}$ is the supply voltage, and $f$ is the clock frequency. Static power, also called leakage, is consumed continuously by transistors that conduct a small current even when nominally off. Leakage was negligible at older process nodes; on modern nodes it is comparable to dynamic power and rises sharply with temperature.
The formula explains the gross design moves of the past two decades. Voltage appears squared, so reducing $V_{DD}$ saves more than reducing $f$; hence aggressive dynamic voltage and frequency scaling, in which a chip lowers both when full speed is not needed. Activity can be reduced by clock gating: feeding the clock to a register only on cycles when its value will actually change, leaving it untouched otherwise. Sufficiently inactive blocks can be power-gated: their supply is disconnected entirely, eliminating both dynamic and leakage power at the cost of a wake-up delay. All three techniques are inserted at the RTL level today, often automatically by synthesis tools, but they require a discipline of writing HDL that exposes when state needs to change and when it does not.
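A few lines of arithmetic show why voltage is the most valuable knob. The sketch below evaluates the first-order model (activity times switched capacitance times supply voltage squared times clock frequency) at a hypothetical operating point; the capacitance, activity, voltage, and frequency values are illustrative assumptions, and clock gating is modelled crudely as a lower activity factor.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS dynamic power in watts: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating point: 1 nF of switched capacitance,
# 20% average activity, 1.0 V supply, 2 GHz clock.
baseline = dynamic_power(0.2, 1e-9, 1.0, 2e9)

# Halving the frequency alone halves the power.
half_freq = dynamic_power(0.2, 1e-9, 1.0, 1e9)

# DVFS: the lower frequency also tolerates a lower supply (0.8 V assumed here).
dvfs = dynamic_power(0.2, 1e-9, 0.8, 1e9)

# Clock gating, modelled as halving the activity factor at the baseline point.
gated = dynamic_power(0.1, 1e-9, 1.0, 2e9)

for name, watts in [("baseline", baseline), ("half frequency", half_freq),
                    ("DVFS", dvfs), ("clock gated", gated)]:
    print(f"{name:15s} {watts:.3f} W  ({watts / baseline:.0%} of baseline)")
```

Under these assumed numbers, dropping the supply by 20% on top of the halved frequency cuts power to roughly a third of the baseline rather than a half, which is the quadratic term at work.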
11. Design for Test
A chip that cannot be tested cannot be sold. Modern silicon contains so many flip-flops, wires, and transistors that no realistic test program can reach all of them through ordinary functional inputs. The discipline of design for test, abbreviated DFT, modifies the design itself to make defects detectable.
The foundational technique is scan. Every flip-flop in the design is replaced by a scan flip-flop that has, in addition to its normal data input, a second scan input fed by the output of the previous flip-flop in a long chain. In test mode, the chain behaves as a giant shift register: a tester shifts a known pattern in at one end, switches to functional mode for one or more clock cycles so that the flip-flops capture the logic's response, and shifts the resulting state out the other end for inspection. With scan, every flip-flop in the chip becomes both observable and controllable, even ones buried in the middle of a complex datapath. Automatic test-pattern generation (ATPG) tools then compute, from the gate-level netlist, a compact set of patterns that detects as many single-stuck-at faults as possible.
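The load, capture, unload sequence can be sketched in a few lines. The model below is a behavioral toy, not a gate-level description: the chain is a Python list, and the combinational logic (good_logic and faulty_logic) is a hypothetical four-bit function invented for the example, with one output stuck at 0 in the faulty version.

```python
def shift(chain, scan_in_bits):
    """Clock the chain in test mode once per input bit.

    Returns the new chain contents and the bits that emerged at scan-out.
    """
    chain = list(chain)
    shifted_out = []
    for bit in scan_in_bits:
        shifted_out.append(chain[-1])   # scan-out is the last flop's output
        chain = [bit] + chain[:-1]      # each flop takes its predecessor's value
    return chain, shifted_out

def capture(chain, logic):
    """One functional clock: every flop loads the combinational logic's output."""
    return list(logic(chain))

def run_scan_test(pattern, logic):
    """Load a pattern, run one functional cycle, and unload the response."""
    n = len(pattern)
    chain, _ = shift([0] * n, reversed(pattern))  # first bit in ends up last
    chain = capture(chain, logic)
    _, out = shift(chain, [0] * n)
    return list(reversed(out))                    # restore flop order

def good_logic(s):
    """Hypothetical fault-free logic between the flops."""
    return [s[0] ^ s[1], s[1] & s[2], s[2] | s[3], 1 - s[0]]

def faulty_logic(s):
    """Same logic with the AND gate's output stuck at 0."""
    return [s[0] ^ s[1], 0, s[2] | s[3], 1 - s[0]]

pattern = [0, 1, 1, 0]            # drives both AND inputs to 1
expected = run_scan_test(pattern, good_logic)
observed = run_scan_test(pattern, faulty_logic)
print("fault detected:", expected != observed)
```

The pattern is chosen so that the fault-free AND output is 1; a stuck-at-0 defect therefore shows up as a mismatch in the unloaded response. Finding such patterns automatically for every fault site is the job the text assigns to ATPG.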
Larger blocks of regular structure, such as caches, register files, and large SRAMs, are tested by built-in self-test (BIST) circuits embedded next to them. A BIST controller generates a systematic sequence of writes and reads and compares the results against expected values, all on-chip, at full speed. The BIST result is exposed to external test equipment as a single pass/fail bit per array, which is far cheaper to capture than streaming every cell's contents off-chip.
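A memory BIST sequence can be illustrated with a march test, one common family of write/read sequences. The sketch below uses MATS+, picked here for brevity and not necessarily the sequence any particular BIST controller implements; the FaultyMemory class is a hypothetical fixture that models stuck-at cells.

```python
class FaultyMemory:
    """One-bit-wide memory model with optional stuck-at cells (test fixture)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})   # address -> forced value

    def write(self, addr, value):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

def mats_plus(mem, size):
    """MATS+ march test: (w0); ascending (r0, w1); descending (r1, w0).

    Returns True if every read matched, i.e. the single pass/fail bit.
    """
    for addr in range(size):               # initialize every cell to 0
        mem.write(addr, 0)
    for addr in range(size):               # ascending: expect 0, then write 1
        if mem.read(addr) != 0:
            return False
        mem.write(addr, 1)
    for addr in reversed(range(size)):     # descending: expect 1, then write 0
        if mem.read(addr) != 1:
            return False
        mem.write(addr, 0)
    return True

SIZE = 8
print("good array passes:", mats_plus(FaultyMemory(SIZE), SIZE))
print("stuck-at-1 cell passes:", mats_plus(FaultyMemory(SIZE, {3: 1}), SIZE))
```

The whole test touches each cell a fixed number of times, so its length grows linearly with the array size, and the only thing reported off-chip is the Boolean result.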
The standard interface through which a tester talks to a chip's scan chains, BIST controllers, and other test infrastructure is JTAG, formally IEEE 1149.1. JTAG defines a small four- or five-pin serial interface (TCK, TMS, TDI, TDO, optionally TRST) and a state machine that accepts test commands. The same interface is widely repurposed for ordinary debugging: most modern processors expose their registers, breakpoints, and memory through JTAG or a closely related debug port, and the debug probes used by software developers connect through these pins. DFT, in other words, is not just a manufacturing concern; it is the low-level interface through which silicon is brought up after fabrication and debugged in the field.
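The state machine that JTAG defines is small enough to write out in full. The sketch below encodes the sixteen-state TAP controller as a transition table indexed by the TMS value sampled on each rising TCK edge; the state names are spelled in a Python-friendly style chosen for this example, and the table only tracks states, without modelling the instruction or data registers.

```python
# state: (next state when TMS=0, next state when TMS=1)
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}

def walk(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK rising edge."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state

# Holding TMS high for five clocks resets the controller from any state,
# which is why the TRST pin is optional.
assert all(walk(s, [1] * 5) == "TEST_LOGIC_RESET" for s in TAP_NEXT)

# From reset, TMS = 0, 1, 0, 0 reaches the state in which data bits are
# shifted from TDI toward TDO.
print(walk("TEST_LOGIC_RESET", [0, 1, 0, 0]))
```

A debug probe does little more than drive TMS through walks like these while streaming bits on TDI and sampling TDO.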
12. Physical Design and PVT Variation
Synthesis turns RTL into a netlist of gates. The netlist still has to be turned into a layout — a set of geometric shapes describing where each transistor and wire physically goes on the silicon die. This stage is physical design, and it is where many of the timing assumptions in static timing analysis are first put to a real test.
Physical design proceeds through several steps. Floorplanning decides where major blocks (cores, caches, I/O pads) sit on the die and how the chip is partitioned. Placement finds positions for each individual standard cell within a block. Clock-tree synthesis builds the balanced distribution network discussed earlier. Routing draws the wires that connect cell pins, choosing among the multiple metal layers a modern process provides. Parasitic extraction measures the resulting wire resistance and capacitance, which dominate timing on long interconnects. Finally, sign-off timing analysis re-checks every path with these measured parasitics and flags any that no longer meet the clock period.
The gates and wires whose timing the analysis checks are not, however, fixed objects. Their delays vary with three external conditions, collectively called PVT:
- Process variation captures the small differences between individual fabricated chips. Two dice from the same wafer can run at meaningfully different speeds because the lithography has imperfectly transferred the masks, the ion implants vary slightly, and so on. The fastest dice from a wafer are sometimes binned and sold as premium parts.
- Voltage variation captures the fact that a chip's supply is never exactly at its nominal value: it sags under heavy switching, droops on power-supply transitions, and can be deliberately scaled by DVFS.
- Temperature variation captures the strong dependence of transistor speed and leakage on operating temperature, which itself depends on workload and cooling.
A design that closes timing only at typical conditions will fail in chips that come out at the slow end of the process distribution, run at low voltage, or get hot. Sign-off therefore checks every path at multiple corners, combinations such as (slow process, low voltage, high temperature) and (fast process, high voltage, low temperature), and accepts the design only if all corners meet timing. This is multi-corner multi-mode (MCMM) analysis: the same checks are repeated for every corner and every operating mode, which multiplies the analysis effort and makes timing closure on a complex chip considerably more expensive.
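The corner sweep can be pictured as a loop over every combination of process, voltage, and temperature. The sketch below is a toy model under loud assumptions: the derating factors are hypothetical multipliers on a path's typical delay (real flows use characterized library data for each corner), slower-when-hot follows the description above, and only the maximum-delay (setup) check against the clock period is modelled.

```python
from itertools import product

# Hypothetical delay multipliers relative to the typical corner (1.0 = nominal).
PROCESS = {"slow": 1.15, "typical": 1.00, "fast": 0.88}
VOLTAGE = {"low": 1.10, "nominal": 1.00, "high": 0.93}
TEMPERATURE = {"cold": 0.95, "room": 1.00, "hot": 1.08}

def path_delay_ns(nominal_ns, corner):
    """Scale a path's typical delay by the factors for one (P, V, T) corner."""
    process, voltage, temperature = corner
    return (nominal_ns * PROCESS[process] * VOLTAGE[voltage]
            * TEMPERATURE[temperature])

def failing_corners(nominal_ns, clock_period_ns):
    """Return every corner at which the path no longer fits in the clock period."""
    return [corner for corner in product(PROCESS, VOLTAGE, TEMPERATURE)
            if path_delay_ns(nominal_ns, corner) > clock_period_ns]

# A path with 10% slack at typical conditions against a 1.0 ns clock.
failures = failing_corners(0.9, 1.0)
print(len(failures), "of 27 corners fail, including",
      ("slow", "low", "hot") in failures)
```

A path that looks comfortable at the typical corner fails at the slow, low-voltage, hot corner under these assumed factors, which is the situation the multi-corner check exists to catch.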
The same analysis flows track other physical concerns. Signal integrity checks for crosstalk between adjacent wires. IR drop analysis confirms that the power-distribution network can supply enough current without the local supply voltage sagging too far. Electromigration analysis checks that no wire carries a current density high enough to gradually displace its metal atoms, and eventually open or short the wire, within the chip's intended lifetime. Each of these is a separate analysis and each can send the designer back to the RTL or to placement to fix problems.
13. IP Reuse and Design Hierarchy
A modern system-on-chip is rarely designed from scratch. Most of its silicon is occupied by intellectual-property blocks — IP, in the industry's shorthand — that are licensed from their authors, integrated into the chip, and configured for the target application. A typical mobile SoC contains a CPU complex licensed from ARM, a GPU from another vendor, USB and PCIe controllers from a third, memory controllers from a fourth, and a hundred smaller blocks of varying provenance.
IP comes in two main forms. Soft IP is delivered as RTL source code, sometimes encrypted. The licensee runs synthesis, place-and-route, and timing closure themselves, which gives flexibility but transfers responsibility for closing the block. Hard IP is delivered as a finished, characterized layout for a specific process technology. The licensee drops the block into their floorplan and connects to its pins, with timing already guaranteed. Hard IP is faster to integrate but locks the design to a particular process node.
For either kind, integration depends on standardized interfaces. AMBA (the Advanced Microcontroller Bus Architecture) and its modern members AXI, AHB, and APB are the dominant interconnect protocols on ARM-centric SoCs. TileLink plays the same role in the RISC-V ecosystem. Wishbone and Avalon appear elsewhere. Each protocol defines the wires, the handshakes, and the ordering rules that any compatible IP block must obey, so that blocks from different vendors can be wired together without renegotiating the interface from scratch.
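As one illustration of what a handshake rule looks like, the sketch below models the valid/ready style that AXI uses on each of its channels: the sender holds VALID high and its data stable, and a transfer completes on the clock edge where the receiver's READY is also high. The function name and the cycle-by-cycle list representation are assumptions made for this example, not part of any protocol specification.

```python
def simulate_handshake(items, ready_per_cycle):
    """Cycle-by-cycle model of a valid/ready channel.

    The producer offers items in order and keeps VALID asserted, with the
    same data, until a cycle in which READY is also high. Returns a list of
    (cycle, item) pairs, one per completed transfer.
    """
    pending = list(items)
    transfers = []
    for cycle, ready in enumerate(ready_per_cycle):
        valid = bool(pending)          # VALID is high while data is waiting
        if valid and ready:            # transfer happens only when both are high
            transfers.append((cycle, pending.pop(0)))
    return transfers

# The consumer stalls on cycles 0, 3 and 4; nothing is lost or reordered.
result = simulate_handshake(["a", "b", "c"], [0, 1, 1, 0, 0, 1, 1])
print(result)
```

Because either side can stall without coordinating in advance, two blocks that both obey the rule can be connected without knowing anything about each other's internal timing.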
The cultural consequence is significant. A modern chip designer is, more than anything else, an integrator: the value added is not in inventing every adder and FIFO from scratch but in choosing the right blocks, configuring them for the workload, connecting them through the chosen interconnect, and verifying the whole. The blocks of this chapter and the previous one are still the vocabulary, but most working RTL is glue around licensed IP, and most of the RTL that is genuinely new in any given project is itself destined to become tomorrow's IP.
14. Summary
Synchronous design is the conceptual framework that makes complex digital systems tractable: every state-holding element is updated by a common clock, all combinational logic lives between flip-flops, and timing closure becomes a question of bounding worst-case paths. Clock distribution and reset distribution are the practical concerns that turn this clean idea into a working chip. Finite state machines describe the high-level behavior of control logic in terms of states, transitions, and outputs, and map straightforwardly onto a state register surrounded by combinational blocks. RTL is the level of abstraction at which most real designs are written: a description in terms of registers and the combinational functions that update them, leaving the synthesis tool to produce gates. Inside RTL, designers separate datapath from control, balance pipelines against the clock period through retiming, and write code that exposes opportunities for clock gating and other power-saving transformations.
Simulation and testbenches verify these designs before silicon, with progressively heavier methodologies (UVM, formal verification, emulation) for progressively more complex chips. Verilog, SystemVerilog, and VHDL are the dominant HDLs, with newer host-embedded languages making inroads in some communities. Designs run physically either on FPGAs, where logic is reconfigurable but performance is modest, or on ASICs, where logic is fixed but performance is the best available, with structured ASICs, CPLDs, and eFPGAs filling intermediate niches.
Surrounding the logical design is the engineering of power (dynamic, leakage, gating), test (scan, BIST, JTAG), and physical realization (floorplanning, place-and-route, multi-corner timing closure under PVT variation). And surrounding that, in turn, is the modern reality that most working silicon is built by integrating licensed IP blocks across standardized interconnects rather than designing every block from scratch.
This concludes Part I. We have moved from the philosophical foundation — what a computer is — through the representation of data, the algebra of logic, the building blocks of digital design, and the discipline that ties them together. With this background, Part II will assemble the blocks into the recognizable parts of a working processor: the datapath, the control unit, the memory hierarchy, and the instruction cycle that brings them all to life.