Part I: Foundations

The Instruction Cycle

May 16, 2026·31 min read·beginner

The previous chapter introduced the basic blocks of a CPU as a set of static drawings: a datapath full of registers and functional units, a control unit telling each piece what to do, a load–store path connecting the processor to memory. This chapter sets the picture in motion. We will follow a single instruction from the moment its address sits in the program counter to the moment its result is committed somewhere visible, and then we will see how the same sequence of activities can be packaged into different implementation styles.

The sequence of activities is called the instruction cycle, sometimes the fetch–decode–execute cycle, and sometimes simply the cycle. It is the pulse of every general-purpose computer ever built. By the end of the chapter you should be able to walk through the cycle stage by stage, recognize each stage in any of the architectures we will study later, and explain why a designer might choose to perform the whole cycle in one clock period or to spread it across several.

01.Fetch

The first stage of the cycle is fetch. Its job is simple: bring the next instruction's bits from memory into the CPU and update the program counter so that the next fetch will read the next instruction.

The fetch step has three sub-activities, and they happen in roughly this order on any reasonable implementation.

The CPU presents the value of the program counter to the instruction memory's address port. In a modified Harvard machine, this is the L1 instruction cache; on a miss, the request propagates down the hierarchy until a line is returned. In a strict von Neumann implementation, the PC is sent to a unified memory port, possibly competing with data accesses for the same path.

The memory returns the bits at that address. For a fixed-width 32-bit ISA, this is exactly four bytes. For a variable-width ISA such as x86-64, the fetch unit reads a fixed-width chunk — perhaps 16 or 32 bytes — and a separate length-decode step picks out the boundaries of the instructions inside it. The variable-length case is genuinely harder than the fixed-width one, and we will return to it when we discuss x86-64 in detail.

The returned bits are latched into the instruction register, and the program counter is updated. In ordinary execution, the update is PC ← PC + L, where L is the size of the instruction just fetched. In the case of a branch or jump, the update will eventually be replaced by a different value, but that does not happen until later in the cycle.

A small but important point: in a simple machine, the program counter update is performed by a dedicated adder, not by the main ALU. This is so common that it is worth highlighting. The reason is that the ALU is needed in a later stage of the cycle for arithmetic on the instruction's operands; if the same ALU were responsible for incrementing the PC, the two activities would have to take turns, which is precisely what we are trying to avoid. The PC adder is small (it adds a constant to a register) and cheap, so giving it its own hardware costs nearly nothing.
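The three fetch sub-activities can be sketched in a few lines of Python. This is a minimal model, not a description of any particular machine: the function name `fetch`, the little-endian byte order, and the use of a `bytes` object as instruction memory are all assumptions made for illustration.

```python
def fetch(memory, pc, length=4):
    """Model one fetch on a fixed-width ISA: return (instruction_word, next_pc)."""
    raw = memory[pc:pc + length]            # PC drives the instruction memory
    if len(raw) != length:
        raise IndexError(f"fetch past end of memory at PC={pc:#x}")
    instruction = int.from_bytes(raw, "little")   # assumed little-endian; latched into IR
    next_pc = pc + length                   # dedicated PC adder: PC <- PC + L
    return instruction, next_pc


# Two arbitrary 4-byte words standing in for a program image.
program = bytes.fromhex("13050000" "93051000")
```

Note that `next_pc` is computed without reference to the instruction's contents, which is exactly why a small dedicated adder suffices on a fixed-width ISA.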

Figure: Instruction fetch with parallel PC update: the PC drives instruction memory into IR while a dedicated adder computes PC plus instruction length to update the PC

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.8cm}]
  \node[blk] (pc) at (3, -0.5) {PC};
  \node[blk] (im) at (3, -2)   {instruction memory};
  \node[blk] (ir) at (3, -3.5) {IR};
  \draw[->] (pc) -- (im);
  \draw[->] (im) -- (ir) node[midway, right, font=\footnotesize] {instr bits};
  \node[font=\footnotesize] at (3, -4.5) {meanwhile, in parallel:};
  \node[blk] (pc2)  at (1.5, -6) {PC};
  \node[blk, minimum width=1.2cm] (plus) at (3.5, -6) {$+L$};
  \node[font=\footnotesize] at (5.5, -6) {back to PC};
  \draw[->] (pc2) -- (plus);
  \draw[->] (plus) -- (4.7, -6);
\end{tikzpicture}

The fetch step contains the entire reason for the existence of the program counter. Without a PC, the CPU would have nowhere to look for the next instruction. The PC is, in a real sense, the only piece of state the CPU strictly needs to keep going; everything else is in service of doing useful work between fetches.

02.Decode

Once the instruction is in the IR, the next step is decode. Decoding is the act of interpreting the bits of the instruction to figure out what the rest of the cycle should do.

Decode has two faces. There is a logical aspect, in which the instruction's fields are extracted and routed to the right places, and there is a control aspect, in which the opcode is translated into a vector of control signals that steers the datapath through the rest of the cycle.

On a fixed-width RISC ISA, decoding is largely a matter of looking at the appropriate bits and routing them. For a typical instruction format

Figure: A 32-bit RISC instruction divided into opcode, two source registers, a destination register, and an immediate field, with bit ranges labeled

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  \draw (0,-1) rectangle (2,-0.2);    \node at (1, -0.6) {opcode};   \node[font=\tiny] at (1, 0)    {31..26};
  \draw (2,-1) rectangle (3.4,-0.2);  \node at (2.7, -0.6) {rs1};    \node[font=\tiny] at (2.7, 0)  {25..21};
  \draw (3.4,-1) rectangle (4.8,-0.2);\node at (4.1, -0.6) {rs2};    \node[font=\tiny] at (4.1, 0)  {20..16};
  \draw (4.8,-1) rectangle (6.2,-0.2);\node at (5.5, -0.6) {rd};     \node[font=\tiny] at (5.5, 0)  {15..11};
  \draw (6.2,-1) rectangle (8.8,-0.2);\node at (7.5, -0.6) {immediate}; \node[font=\tiny] at (7.5, 0) {10..0};
\end{tikzpicture}

the decode step does the following in a single cycle:

it sends the opcode (and on some ISAs a few additional function bits) to the control unit, which produces the cycle's control signals;
it sends the rs1 and rs2 fields to the read-port address inputs of the register file, so that the source operands begin to be read;
it sends the rd field to the write-port address input of the register file, ready for use later in the cycle;
it sends the immediate bits to the immediate generator, which sign- or zero-extends them to the full word width.

In a modified Harvard machine with a clean RISC ISA, this is more or less the entire decode story. The instruction format is regular, the fields are in fixed positions, and the control unit's job is mostly a handful of small Boolean equations.
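The field extraction described above amounts to a handful of shifts and masks. The sketch below uses the bit ranges from the figure (opcode 31..26, rs1 25..21, rs2 20..16, rd 15..11, immediate 10..0); the function names and the choice to sign-extend the immediate unconditionally are assumptions, since the text leaves the sign- versus zero-extension choice to the instruction.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_fields(word):
    """Split a 32-bit instruction word into the fields shown in the figure."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26, sent to the control unit
        "rs1":    (word >> 21) & 0x1F,   # bits 25..21, register-file read port 1
        "rs2":    (word >> 16) & 0x1F,   # bits 20..16, register-file read port 2
        "rd":     (word >> 11) & 0x1F,   # bits 15..11, register-file write port
        "imm":    sign_extend(word & 0x7FF, 11),  # bits 10..0, immediate generator
    }
```

In hardware all five extractions happen in parallel, because each field is simply a fixed bundle of wires.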

On a variable-width CISC ISA such as x86-64, decode is dramatically more involved. The instruction may be one byte long or fifteen bytes long. There may be optional prefixes that change the meaning of the opcode, optional ModR/M bytes that select the addressing mode, optional SIB bytes that help with complex addressing, and optional immediates and displacements of varying widths. Determining how long the current instruction is, and which bytes form which fields, is itself a multi-step calculation. Modern x86-64 designs perform this work in a multi-stage front-end pipeline and translate each architectural instruction into one or more internal micro-operations of a much more regular form, which the back end then executes. We will treat all of this in detail in Part VII; for now, the point is that the decoding workload can differ by an order of magnitude between RISC and CISC, even though the conceptual role of the stage is the same.

In every case, the output of decode is the same kind of thing: a fully populated set of control signals, plus the operands ready to flow into the next stage. From this point on, the rest of the cycle does not care what the original instruction was; it only sees the control signals and the values on the wires.

03.Execute

The execute stage is where the operation specified by the instruction actually happens. For most instructions, this is a single step through the ALU; for memory instructions, the ALU computes the address, and the actual memory access happens in the next stage; for branches, the ALU evaluates the branch condition.

A handful of common cases will make the role of execute clear.

For an arithmetic instruction such as add rd, rs1, rs2, execute reads the two source operands from the register file (which decode kicked off), feeds them to the ALU, and selects the addition operation. The ALU produces the result on its output, and the zero, negative, carry, and overflow flags settle along with it. The result will be written to rd in the writeback stage; the flags may or may not be written to a flags register, depending on the ISA.

For an immediate instruction such as addi rd, rs1, imm, the right input of the ALU comes from the immediate generator instead of from the register file. The control unit has set the ALUSrc mux accordingly, so the same ALU performs the same addition; only the source of the second operand has changed.

For a load instruction such as ld rd, imm(rs1), execute is used to compute the effective address. The ALU adds the base register and the immediate offset. The result is the address from which the memory access in the next stage will read; the loaded value will eventually go into rd.

For a store instruction such as st rs2, imm(rs1), execute again computes the effective address as rs1 + imm. The value to be written, in rs2, has already been read from the register file in decode, and it will be presented to the data memory along with the address in the memory-access stage.

For a conditional branch such as beq rs1, rs2, label, execute uses the ALU to subtract one source from the other so that the zero flag indicates equality. If the flag matches the condition the instruction tests, the branch is taken, and the program counter for the next cycle will be loaded with PC + offset instead of PC + L.

For a jump such as jal rd, label, the target is PC + offset and there is no comparison to make; the ALU may compute PC + L so that it can be saved to rd as the return address. (The exact details vary by ISA — jal is the RISC-V example — but the spirit is the same.)
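The cases above can be collected into one small function. This is a sketch under stated assumptions: a 32-bit word, branch and jump offsets measured from the PC of the instruction itself, and a string mnemonic standing in for the decoded control signals. The name `execute` and its parameter list are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def execute(op, a, b, imm, pc, length=4):
    """Return (alu_result, next_pc) for one instruction.

    a and b are the values read from rs1 and rs2; imm is the extended immediate.
    """
    next_pc = pc + length                  # default: fall through
    if op == "add":
        result = (a + b) & MASK32
    elif op == "addi":
        result = (a + imm) & MASK32        # ALUSrc selects the immediate
    elif op in ("ld", "st"):
        result = (a + imm) & MASK32        # effective address for the next stage
    elif op == "beq":
        result = (a - b) & MASK32          # zero result means the operands are equal
        if result == 0:
            next_pc = pc + imm             # branch taken
    elif op == "jal":
        result = pc + length               # return address, destined for rd
        next_pc = pc + imm
    else:
        raise ValueError(f"unknown operation {op!r}")
    return result, next_pc
```

The same adder serves four of the six cases; only the operand sources and the use made of the result differ.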

The defining feature of execute is that, for most instructions, this is the cycle in which "the work" of the instruction is done. Almost every other stage either prepares the inputs for execute or commits the result that execute produced. In a deeply pipelined machine, execute is often the most intricate stage, with multiple parallel functional units, forwarding networks, and bypass paths. But the conceptual role does not change: take the operands and the operation, do the operation, and pass the result on.

04.Memory Access

Not every instruction touches memory in the cycle after execute, but the ones that do — the loads and stores — give this stage its name. The memory access stage is responsible for actually completing the data-side memory transaction whose address was computed in execute.

For a load, the address from execute drives the data memory's address port. The memory returns the requested value, which the writeback stage will deposit into the destination register. In a real processor with a cache hierarchy, this stage is where the cache is consulted; on a miss, the load is held until the requested line returns from a lower level. In a simple single-cycle machine, the cache is idealized as a fast SRAM and the access completes in one cycle.

For a store, the address from execute and the data from the register file are both presented to the data memory, and a write transaction is performed. There is nothing to write back to the register file, and the writeback stage is effectively a no-op for stores.

For arithmetic instructions and branches, the memory access stage does nothing. Some implementations literally insert a "no operation" in this stage of the pipeline; others arrange the stage to be skipped. In either case, no memory transaction occurs.
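As a minimal sketch, the three behaviours of this stage fit in one function. The data memory is modelled as a Python dictionary from address to value, which idealizes away caches and misses; the function name and signature are assumptions for illustration.

```python
def memory_access(op, address, store_value, dmem):
    """Return the loaded value for a load; return None for everything else."""
    if op == "ld":
        return dmem[address]            # value handed on to writeback
    if op == "st":
        dmem[address] = store_value     # write transaction; nothing to write back
    return None                         # arithmetic and branches: stage is idle
```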

A subtlety worth flagging: in a strict von Neumann implementation that performs both instruction fetch and data memory access on the same memory port, this stage and the fetch stage of the next instruction would compete. This is the very contention we discussed in Chapter 6 as the von Neumann bottleneck, and it is part of why virtually every modern high-performance processor has a modified Harvard organization with separate instruction and data caches at the top of the memory hierarchy.

A second subtlety: the memory hierarchy itself imposes ordering rules that simple single-cycle machines can ignore but that high-performance designs must enforce carefully. A load issued after a store to the same address must see the value the store wrote, even if the cache, the write buffer, and the load–store queue would otherwise allow them to slip past each other. We will return to these issues in Part V when we discuss the load–store micro-architecture.

05.Writeback

The writeback stage is the last stage at which an instruction can change architectural state. For most instructions, this means writing a value into the destination register named by the instruction's rd field.

For an arithmetic instruction, the value written is the ALU's output. For a load, the value is the data returned by the memory access stage. The writeback mux, controlled by the MemToReg signal we met in the previous chapter, selects between the two. The register file's write enable is asserted, and the value is latched into the named register on the next clock edge.

A handful of instructions have nothing to write back. Stores have committed their effect during memory access; a writeback for a store is a no-op. Branches that are not taken have written nothing of consequence; a taken branch has committed its effect by updating the program counter. Some flags-setting instructions have no register destination, only a flags update, and they write only to the flags register.
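The writeback mux and the write enable can be modelled in a couple of lines. `mem_to_reg` corresponds to the MemToReg signal named above; `reg_write` is an assumed name for the register file's write enable, and the register file is modelled as a plain list.

```python
def writeback(regs, rd, alu_out, mem_data, mem_to_reg, reg_write):
    """Update the register file in place and return it."""
    if reg_write:                                   # stores and branches leave this low
        regs[rd] = mem_data if mem_to_reg else alu_out
    return regs
```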

In a real processor, the writeback stage has more work than this short description suggests. Out-of-order machines do not simply write back results into the register file as instructions complete; they retire them through a reorder buffer that keeps the architectural state precise even when execution has happened in a different order. Speculative results are squashed if the speculation turns out to be wrong, and the writeback only commits values along the correctly predicted path. We will see this machinery in Chapter 25. For the simple machine of this chapter, writeback is the straightforward act of dropping a value into a named register.

A useful way to think of the cycle as a whole is as the propagation of a single instruction through five logical roles:

Figure: The five logical roles of an instruction cycle: fetch, decode, execute, memory, and writeback, each captioned with the work it performs

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2.2cm, minimum height=0.8cm}]
  \node[blk] (f) at (1.4, -0.5)  {FETCH};
  \node[blk] (d) at (4.0, -0.5)  {DECODE};
  \node[blk] (e) at (6.6, -0.5)  {EXECUTE};
  \node[blk] (m) at (9.2, -0.5)  {MEMORY};
  \node[blk] (w) at (11.8, -0.5) {WRITEBACK};
  \draw[->] (f) -- (d);
  \draw[->] (d) -- (e);
  \draw[->] (e) -- (m);
  \draw[->] (m) -- (w);
  \node[align=center, font=\footnotesize] at (1.4, -1.8)  {bring instr\\into IR};
  \node[align=center, font=\footnotesize] at (4.0, -1.8)  {interpret\\\& route operands};
  \node[align=center, font=\footnotesize] at (6.6, -1.8)  {do the operation\\in ALU};
  \node[align=center, font=\footnotesize] at (9.2, -1.8)  {read or write\\data memory};
  \node[align=center, font=\footnotesize] at (11.8, -1.8) {commit result\\to register / PC};
\end{tikzpicture}

Every instruction visits these stages in this order, even if some stages are no-ops for a given instruction. In a single-cycle implementation, all five are squeezed into one clock period. In a multi-cycle or pipelined implementation, they are spread over several cycles. The last two sections of this chapter take up the single-cycle and multi-cycle styles; the sections in between look more closely at how the cycle is sequenced, checked, and kept fed.

06.Sequencing the Cycle: Hardwired and Microprogrammed Control

The five stages of the cycle have to be sequenced by something. In a single-cycle machine the sequencing is implicit — every stage's logic settles within one clock period — but in a multi-cycle or pipelined machine, an explicit mechanism decides which step happens on which cycle. We touched on the two main styles in the previous chapter; the cycle is the right place to revisit them.

A hardwired sequencer is a finite state machine, often simple enough that it has only a handful of states corresponding to the stages of the cycle. The opcode, together with a few derived signals, picks among different state sequences for different kinds of instructions: an add skips the memory-access state, a store skips the writeback state, and a multi-cycle instruction such as an iteratively implemented multiply walks through several execute states before moving on. Hardwired control is fast and area-efficient when the number of distinct sequences is small, which is one reason most modern RISC cores use it for the bulk of their instructions.
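A hardwired sequencer of this kind reduces to a next-state function of the current state and the opcode. The sketch below is illustrative only: the state names and mnemonics are assumptions, and a real design would encode them as a few bits of state and Boolean equations rather than string comparisons.

```python
def next_state(state, op):
    """Next-state function of a five-state hardwired sequencer."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return "execute"
    if state == "execute":
        if op in ("ld", "st"):
            return "memory"
        if op == "beq":
            return "fetch"          # branches finish after execute
        return "writeback"          # arithmetic skips the memory state
    if state == "memory":
        return "writeback" if op == "ld" else "fetch"   # stores skip writeback
    if state == "writeback":
        return "fetch"
    raise ValueError(f"unknown state {state!r}")


def state_sequence(op):
    """List the states one instruction visits before the next fetch begins."""
    sequence = ["fetch"]
    while True:
        following = next_state(sequence[-1], op)
        if following == "fetch":
            return sequence
        sequence.append(following)
```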

A microprogrammed sequencer treats the cycle itself as a small program. A control store — a small ROM — holds micro-instructions, each of which directly specifies the control signals for one cycle. The opcode picks a starting address in the control store, and a small micro-PC walks through the corresponding micro-instructions one per cycle. Microprogramming makes it cheap to add complex multi-step instructions: rewriting an entry in the control store is much easier than redesigning a state machine.
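The same sequencing expressed as a microprogram might look like the following. The control-store layout, the "last step" flag, and the dispatch table are all hypothetical; fetch and decode are assumed to be shared steps that run before dispatch and are left out for brevity.

```python
# Each entry: (control signals for this cycle, is this the routine's last step?)
CONTROL_STORE = [
    ("execute",   False),   # 0: add routine
    ("writeback", True),    # 1
    ("execute",   False),   # 2: ld routine
    ("memory",    False),   # 3
    ("writeback", True),    # 4
    ("execute",   False),   # 5: st routine
    ("memory",    True),    # 6
]

# The opcode selects a starting address in the control store.
START_ADDRESS = {"add": 0, "ld": 2, "st": 5}


def run_microprogram(opcode):
    """Walk the micro-PC through one routine, one micro-instruction per cycle."""
    micro_pc = START_ADDRESS[opcode]
    issued = []
    while True:
        signals, last = CONTROL_STORE[micro_pc]
        issued.append(signals)
        if last:
            return issued
        micro_pc += 1
```

Adding a new multi-step instruction here means appending rows to `CONTROL_STORE` and one entry to `START_ADDRESS`, which is the flexibility the text describes.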

The two styles are not mutually exclusive. Modern x86-64 processors decode their most common simple instructions with a hardwired front end and decode rare or complex instructions — string operations, segment-register manipulation, far calls, and the like — by jumping into a microprogrammed sequencer. The micro-instructions produced by either path are eventually executed on the same back-end pipeline. Chapter 27 returns to this in detail; for the cycle picture, the relevant fact is that something has to drive the cycle through its stages, and that something is itself either a small FSM or a small program in a small ROM.

07.Variable-Length Instructions and Decode Throughput

The decode stage of the cycle has very different shape on different ISAs, and that difference shows up in the front-end pipeline of every real processor.

On a fixed-width RISC ISA, the address of one instruction plus its width gives the address of the next, with no decoding required. The fetch stage can stream in a continuous block of instructions, and decode of all of them can proceed in parallel; a four-wide superscalar processor can decode four instructions in a single cycle without breaking a sweat.

On a variable-length CISC ISA such as x86-64, the length of each instruction is itself the output of a small decode step. An instruction may begin with zero, one, or several prefix bytes ahead of the actual opcode; the opcode may then take one to three bytes; a ModR/M byte may follow that selects the operands and addressing mode; an SIB byte may follow that to refine the addressing; a displacement of one, two, or four bytes may follow that; and finally an immediate of one, two, four, or eight bytes may follow. In the worst case an instruction is fifteen bytes long; in the best case it is one byte. To decode several instructions per cycle, the processor must determine all the boundaries simultaneously, which is a non-trivial parallel computation.
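The serial nature of the boundary problem shows up even in a toy encoding. The sketch below does not use the x86-64 format: it assumes a made-up ISA in which the top two bits of an instruction's first byte give its length (1 to 4 bytes). Both function names are hypothetical.

```python
def instruction_length(first_byte):
    """Toy encoding (not x86): top two bits of the first byte encode length 1..4."""
    return (first_byte >> 6) + 1


def find_boundaries(code):
    """Return the start offset of every instruction in a byte string."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        # The next start is unknown until this instruction's length is decoded.
        offset += instruction_length(code[offset])
    return starts
```

Each loop iteration depends on the previous one, which is what a wide decoder has to break, for example by speculatively computing a length at every byte offset and then selecting the valid chain.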

Real x86 processors attack this with a combination of techniques, three of which are worth naming here. First, some designs attach a small pre-decode unit to the L1 instruction cache that annotates each cached byte with whether it is the start of an instruction, the end, or in the middle, so that the boundaries are pre-computed once per cache line and reused on every fetch. Second, a macro-op queue holds already-decoded instructions for the back end, decoupling the variable-length front end from the regular-width back end. Third, the micro-op cache, which we will meet in Chapter 27, short-circuits decoding entirely for hot loops by remembering the previously decoded micro-ops directly.

None of this affects the logical role of the decode stage — it still maps an instruction onto a vector of control signals — but it dramatically affects the cost of that stage. The reason fixed-width ISAs are sometimes called "easier to decode" is exactly this: the front-end pipeline of a wide RISC processor can be a few stages, while the front end of a comparable x86-64 processor is typically substantially deeper.

08.Interrupt and Exception Checks Within the Cycle

The cycle as drawn so far assumes nothing ever goes wrong. In real hardware, almost every stage is also the place where some kind of exception can be detected, and the cycle has to incorporate a check for these conditions before it commits any architectural state.

During fetch, the address driven onto the instruction memory may fault. The PC may point at a non-existent address, or at a page marked non-executable, or at a region the current privilege level cannot read. Any of these conditions raises an exception that prevents the instruction from being delivered to decode at all.

During decode, the instruction's bits may not correspond to any defined opcode, or they may correspond to one that is illegal in the current privilege level (a wrmsr or csrw issued from user mode, for example). The decoder raises an illegal instruction or privilege exception in those cases, and the cycle is redirected to the appropriate handler.

During execute, arithmetic exceptions can fire: integer division by zero, signed overflow on architectures that trap on it, floating-point exceptions if floating-point logic is part of the same path. Address calculation can produce a misaligned access on architectures that fault on misalignment.

During memory access, the data-side memory access may fault for the same kinds of reasons as the instruction fetch: missing translation, wrong privilege, write to a read-only region, access to a region the operating system has not mapped at all. Cache misses delay the cycle but do not normally raise an exception.

During writeback, very few exceptions arise, but interrupts that have been pending throughout the cycle are usually checked here and taken at the boundary between this instruction and the next. The discipline of checking exceptions at this stage is what allows the writeback to act as the commit point of the instruction: if no exception fires, the instruction's effects become permanent; if one fires, the architectural state is left as if the instruction had not run.

For a simple in-order processor this discipline is straightforward. For an out-of-order processor it is one of the central challenges of the design, because instructions following the faulting one may already have executed; their results have to be discarded before the exception is taken. Chapter 15 develops the exception model in detail, and Chapter 25 explains how out-of-order processors preserve precise exceptions in the face of speculative execution. The point worth keeping for now is that the cycle is not just a sequence of useful operations: it is also a sequence of checks, and an instruction's effects are allowed to become permanent only once those checks have passed.
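The commit discipline can be sketched in software as "work on a copy, publish only on success". This is a loose analogy rather than a hardware description: the `Trap` class, `commit_or_trap`, and the example stage `execute_div` are hypothetical names, and architectural state is reduced to a dictionary of registers.

```python
class Trap(Exception):
    """Raised by any stage that detects a fault."""


def commit_or_trap(regs, stages):
    """Run the stages on a staged copy; return (new_state, cause)."""
    staged = dict(regs)
    try:
        for stage in stages:
            stage(staged)
    except Trap as trap:
        return regs, str(trap)      # state is left as if the instruction never ran
    return staged, None             # no fault: the staged effects become permanent


def execute_div(state):
    """Example stage: r0 <- r1 / r2, trapping on a zero divisor."""
    if state["r2"] == 0:
        raise Trap("divide-by-zero")
    state["r0"] = state["r1"] // state["r2"]
```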

09.Instruction Prefetch and the Front End

The fetch stage as described pulls one instruction from memory each cycle, just in time to be executed. This is fine when the I-memory is fast and the program follows a straight path, but on real machines neither assumption holds: main memory is hundreds of cycles away, and even an L1 instruction cache is several cycles away. Without help, the cycle would stall constantly waiting for instructions.

The help comes from the front end, a collection of mechanisms that aim to keep a buffer of decoded instructions waiting for the back end at all times.

An instruction prefetcher issues fetches ahead of the current PC, on the assumption that the program will continue past it. A simple sequential prefetcher reads the next several cache lines in advance; a more sophisticated one tracks recent fetch patterns and issues prefetches matching them. The prefetched lines populate the I-cache so that, by the time the program counter actually catches up, the bytes are already on chip.

A branch predictor decides, before a branch has been decoded, which way it is likely to go, so that fetch can continue down the predicted path without stalling. The simplest predictors track a single bit of history per branch; modern designs use multi-level tables indexed by hash functions of the recent branch history. We will give branch prediction its own chapter (Chapter 23); for the cycle picture, the consequence is that fetch is speculative. Bytes are pulled into the front end on the assumption that the predicted path is correct, and the back end has to be prepared to throw them away when a misprediction is detected.
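The single-bit scheme mentioned above is small enough to model directly. In this sketch the predictor table is a dictionary keyed by branch PC, with an assumed default of "not taken" for branches it has never seen; real hardware would use a fixed-size table indexed by some of the PC bits.

```python
class OneBitPredictor:
    """Predict that each branch will do what it did last time."""

    def __init__(self):
        self.last_outcome = {}

    def predict(self, pc):
        return self.last_outcome.get(pc, False)   # assumed default: not taken

    def update(self, pc, taken):
        self.last_outcome[pc] = taken


def count_mispredictions(predictor, pc, outcomes):
    """Replay a sequence of actual outcomes for one branch and count the misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

For a loop branch that is taken four times and then falls through, this predictor misses twice per loop execution: once on exit, and once again on the first iteration of the next run.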

A return-address stack is a small specialized predictor for the most predictable kind of indirect branch: the function return. The stack records the return address of each call, and on the next return, it predicts that the target will be the value on top of the stack. With a sufficiently deep stack and well-behaved code, return prediction approaches 100% accuracy.
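A return-address stack is a bounded stack with one policy decision: what to do on overflow. The sketch below assumes the oldest entry is discarded, which is one common choice but not the only one; the class and method names are illustrative.

```python
class ReturnAddressStack:
    """Fixed-depth stack of predicted return targets."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)            # assumed policy: drop the oldest entry
        self.entries.append(return_address)

    def predict_return(self):
        """Pop and return the predicted target, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None
```

Call chains deeper than `depth` lose their outermost return addresses, which is why the accuracy claim depends on a sufficiently deep stack.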

All of this machinery sits on top of the simple fetch stage of this chapter. It is invisible to the architectural picture — from outside, the CPU still appears to fetch one instruction per cycle from the address held in the program counter — but the cycle would simply not run at modern speeds without it. Chapter 22 will treat the front end as a unit; here, the relevant point is that the apparently humble fetch step is, in any high-performance machine, the front-line user of the most aggressive speculation in the chip.

10.Predication and Conditional Execution

The cycle as drawn assumes a strict distinction between taken and not-taken branches. Some ISAs blur that distinction by allowing every instruction to carry its own condition, executing or not executing based on the current state of the flags. This style is called predicated execution, and it changes the texture of the cycle in interesting ways.

The extreme example is ARM AArch32, in which a four-bit condition field on nearly every instruction names one of sixteen condition codes (EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL, NV). An addne r0, r1, r2 adds only when the zero flag is clear; when the zero flag is set, the instruction has no effect. A short conditional block can be compiled to two predicated instructions with no branch at all:

Plain Text

    cmp     r0, #0
    addgt   r1, r1, #1     ; r1 += 1 if r0 > 0
    sublt   r1, r1, #1     ; r1 -= 1 if r0 < 0

A more conservative version of the idea, present in most modern ISAs, is the conditional move or conditional select instruction (cmov on x86, csel on AArch64). A csel reads two source registers and writes one of them to the destination based on the flags, doing in one instruction what a small if-then-else block of branches would otherwise do.

From the cycle's point of view, predicated and conditional-select instructions look just like ordinary instructions: they pass through fetch, decode, execute, and writeback like everything else. The difference is that the writeback stage's update is gated on the predicate; if the predicate is false, the destination register is left unchanged. The benefit is that short conditional sequences avoid branches entirely, which removes the risk of branch mispredictions. The cost is that work done by not-executed instructions still consumes pipeline slots, so heavy use of predication on long conditional regions can be slower than a branch the predictor handles correctly.
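The gating of writeback on a predicate can be modelled as follows. The sketch replays the three-instruction AArch32 sequence shown earlier, with the flags simplified to the two signed comparisons the example needs; `predicated_write`, `run_example`, and `csel` are hypothetical helper names.

```python
def predicated_write(old_value, new_value, predicate):
    """The result is always computed; the predicate only gates the write."""
    return new_value if predicate else old_value


def run_example(r0, r1):
    """Model: cmp r0, #0 / addgt r1, r1, #1 / sublt r1, r1, #1."""
    gt, lt = r0 > 0, r0 < 0                      # simplified view of the flags
    r1 = predicated_write(r1, r1 + 1, gt)        # addgt
    r1 = predicated_write(r1, r1 - 1, lt)        # sublt
    return r1


def csel(value_if_true, value_if_false, condition):
    """Conditional select: always writes, but chooses which source to write."""
    return value_if_true if condition else value_if_false
```

Both the add and the subtract occupy a pipeline slot on every pass, whichever way the comparison goes, which is the cost the text describes.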

Which style is right depends on the predictability of the branch in question. AArch64 dropped most of the AArch32 predication in favour of a smaller set of more selective conditional-select instructions, on the judgment that aggressive branch prediction now handles most short branches well enough that wholesale predication is no longer the right tradeoff. We will return to predication and its interaction with speculation in Chapters 22 and 23.

11.Single-Cycle Design

In a single-cycle implementation, the entire instruction cycle — fetch, decode, execute, memory access, writeback — completes in a single clock period. Every clock edge starts a new instruction, and each instruction is fully done before the next one begins.

The schematic appeal of this design is that it maps almost directly onto the block diagram from Chapter 7. There is no notion of "stages" in the hardware itself; there is just a datapath, and on each clock cycle, the relevant signals propagate from PC through I-memory through the IR through register-file reads through the ALU through D-memory through the writeback mux back to the register file. The control unit is purely combinational, producing all the cycle's control signals from the IR.

The schematic appeal disappears the moment we ask how fast the clock can be. The clock period of a single-cycle design must be at least the sum of the delays through every block on the longest path:

$T_{\text{clk}} \ge t_{\text{Imem}} + t_{\text{regfile read}} + t_{\text{ALU}} + t_{\text{Dmem}} + t_{\text{regfile write setup}}.$

The longest instruction is typically a load: it needs the instruction memory access, the register read, the ALU (for address calculation), the data memory access, and the register-file write. Every instruction, even those that do not access memory, must wait this long, because the clock period is set globally by the worst case. A simple arithmetic instruction that could in principle finish much faster is paying for the load's data-memory access time anyway.
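A small calculation makes the worst-case argument concrete. The block delays below are invented round numbers chosen only for illustration, not measurements of any real process or design, and the per-instruction paths follow the descriptions in this chapter.

```python
# Hypothetical block delays in picoseconds (illustrative values only).
DELAY_PS = {"imem": 250, "reg_read": 150, "alu": 200, "dmem": 250, "reg_write": 50}

# Blocks on each instruction's path through the single-cycle datapath.
PATH = {
    "ld":  ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "st":  ["imem", "reg_read", "alu", "dmem"],
    "add": ["imem", "reg_read", "alu", "reg_write"],
    "beq": ["imem", "reg_read", "alu"],
}


def path_delay(op):
    """Sum the delays of the blocks one instruction passes through."""
    return sum(DELAY_PS[block] for block in PATH[op])


def single_cycle_clock_period():
    """The clock must accommodate the slowest instruction."""
    return max(path_delay(op) for op in PATH)
```

With these numbers an add needs 650 ps but is charged the load's 900 ps on every cycle.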

This single fact is enough to rule out single-cycle designs for any high-performance processor. A modern L1 cache access alone takes many gate delays, typically several clock cycles at current clock rates; a register-file read adds more; an ALU addition adds more again. Adding them all in series produces a clock period far longer than designers are willing to accept.

Single-cycle designs are still useful as teaching tools and as the simplest possible starting point for a real processor. The first FPGA CPU project most students build is almost always a single-cycle machine — it is small, it is easy to understand, and it makes the role of each block visible in the timing diagram. Once that machine works, the natural next step is to break the cycle into stages.

12.Multi-Cycle Design

In a multi-cycle implementation, each instruction is broken into several shorter steps, and each step takes one clock cycle. The clock period is now set by the longest step, not by the longest whole instruction. Different instructions take different numbers of cycles to complete, depending on which steps they need.

A typical multi-cycle layout for the simple machine of this chapter assigns one cycle to each of the five logical stages, and each instruction visits only the stages it needs:

| Instruction | Stages visited | Cycles |
|---|---|---|
| `add rd, rs1, rs2` | fetch, decode, execute, writeback | 4 |
| `addi rd, rs1, imm` | fetch, decode, execute, writeback | 4 |
| `ld rd, imm(rs1)` | fetch, decode, execute, memory, writeback | 5 |
| `st rs2, imm(rs1)` | fetch, decode, execute, memory | 4 |
| `beq rs1, rs2, lbl` | fetch, decode, execute | 3 |
| `jal rd, lbl` | fetch, decode, execute, writeback | 4 |

The hardware that supports this style of execution differs from a single-cycle datapath in a few important ways.

State must be added to remember intermediate results between cycles. The address driven to the data memory in the memory access stage was computed by the ALU in the previous cycle; somewhere a register has to hold it across the cycle boundary. The same applies to the value read from memory before it is written back. Multi-cycle designs typically introduce an MAR (memory address register), an MDR (memory data register), and one or more ALU output registers for exactly this purpose.

The control unit becomes a finite state machine rather than a combinational block. Where a single-cycle control unit produced one cycle's control signals from one opcode, a multi-cycle unit walks through a sequence of states, producing different signals in each. The opcode picks among several possible state sequences. The state register inside the control unit is, in effect, "where in the instruction's execution are we?".
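
One way to picture the state machine is as a next-state function that takes the current state and the opcode. The sketch below assumes the five-state layout of the table earlier in this section; `next_state`, `trace`, and the state names are hypothetical, and a real control unit would also emit control signals in each state, which is omitted here.

```python
# Where each opcode goes once execute is finished (assumed from the table
# of cycles per instruction; names are illustrative only).
AFTER_EXECUTE = {
    "add": "writeback", "addi": "writeback", "jal": "writeback",
    "ld": "memory", "st": "memory",
    "beq": "fetch",  # a branch is done after execute
}

def next_state(state, opcode):
    """Next state of the control FSM; 'fetch' means the instruction is done."""
    if state == "fetch":
        return "decode"      # the opcode is not yet known here
    if state == "decode":
        return "execute"
    if state == "execute":
        return AFTER_EXECUTE[opcode]
    if state == "memory":
        return "writeback" if opcode == "ld" else "fetch"
    if state == "writeback":
        return "fetch"
    raise ValueError(f"unknown state: {state}")

def trace(opcode):
    """List the states one instruction visits, starting at fetch."""
    states = ["fetch"]
    while True:
        nxt = next_state(states[-1], opcode)
        if nxt == "fetch":
            return states
        states.append(nxt)

print(trace("ld"))   # ['fetch', 'decode', 'execute', 'memory', 'writeback']
print(trace("beq"))  # ['fetch', 'decode', 'execute']
```

Note that the opcode only influences the transition out of execute and out of memory, which matches the idea that the opcode picks among several possible state sequences.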

The same hardware is reused across cycles. In a single-cycle design, the ALU is used for one operation per instruction; in a multi-cycle design, the ALU is used in execute, and may also be used to increment the PC during fetch, and possibly to compute branch targets. The single ALU is multiplexed across multiple roles by sharing it across cycles, which saves area at the cost of taking more cycles per instruction.
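
The interplay between the shared ALU and the inter-cycle registers can be sketched by walking a single load through its five cycles. This is an illustration under stated assumptions, not a description of any particular machine: instructions are assumed to be four bytes wide, `mar` and `mdr` are plain local variables standing in for the MAR and MDR, and the function and parameter names are invented for the example.

```python
def alu(op, a, b):
    """The single shared ALU; only addition is modelled in this sketch."""
    if op == "add":
        return a + b
    raise ValueError(f"unsupported ALU op: {op}")

def run_load(pc, regs, dmem, rd, rs1, imm):
    """Walk `ld rd, imm(rs1)` through five cycles and return the new PC."""
    # Cycle 1 (fetch): the ALU is borrowed to compute PC + 4
    # (4-byte instructions are an assumption of this sketch).
    pc = alu("add", pc, 4)
    # Cycle 2 (decode): the base register is read into an operand latch.
    a = regs[rs1]
    # Cycle 3 (execute): the same ALU computes the effective address,
    # which is held in MAR across the cycle boundary.
    mar = alu("add", a, imm)
    # Cycle 4 (memory): the value read from memory is held in MDR.
    mdr = dmem[mar]
    # Cycle 5 (writeback): MDR is copied into the destination register.
    regs[rd] = mdr
    return pc

regs = {1: 0x100, 2: 0}   # hypothetical register contents
dmem = {0x108: 42}        # hypothetical data memory
new_pc = run_load(0, regs, dmem, rd=2, rs1=1, imm=8)
print(new_pc, regs[2])    # prints 4 42
```

The ALU is called twice for one instruction, once in fetch and once in execute, which is exactly the sharing that a single-cycle datapath cannot do without a second adder.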

The advantage of multi-cycle design is that the clock period can be much shorter, because each cycle does less work. The disadvantage is that each instruction takes several clock cycles to complete, so the cycles per instruction (CPI), a metric we will meet in Chapter 10, rises from 1 to a typical value between 3 and 5. Whether this is a net performance win depends on whether the shorter clock period outweighs the higher CPI; in practice, for any moderately complex memory hierarchy, it does.
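
The trade-off can be made concrete with a little arithmetic: execution time is instruction count times CPI times clock period. The sketch below uses made-up numbers throughout. The step delays and the instruction mix are assumptions chosen for illustration, not measurements, and the function names are hypothetical. The cycle counts come from the table earlier in this section.

```python
# Hypothetical delay of each step in picoseconds (perfectly balanced here).
STEP_DELAY_PS = {"fetch": 200, "decode": 200, "execute": 200,
                 "memory": 200, "writeback": 200}

# Cycles per instruction in the multi-cycle design (from the table).
CYCLES = {"add": 4, "addi": 4, "ld": 5, "st": 4, "beq": 3, "jal": 4}

# Hypothetical mix: how many of each instruction per 100 executed.
MIX = {"ld": 25, "st": 10, "beq": 15, "add": 50}

def single_cycle_time_ps(mix, delays):
    """CPI is 1, but the period must cover every step in series."""
    period = sum(delays.values())
    return sum(mix.values()) * period

def multi_cycle_time_ps(mix, delays, cycles):
    """The period is set by the longest step; CPI depends on the mix."""
    period = max(delays.values())
    total = sum(count * cycles[op] for op, count in mix.items())
    return total * period

print(single_cycle_time_ps(MIX, STEP_DELAY_PS))        # prints 100000
print(multi_cycle_time_ps(MIX, STEP_DELAY_PS, CYCLES))  # prints 82000
```

With these numbers the multi-cycle machine needs 410 cycles of 200 ps against 100 cycles of 1000 ps, an average CPI of 4.1. The outcome is sensitive to how evenly the work is divided among the steps: the multi-cycle period is set by the slowest step, so badly unbalanced steps erode the advantage.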

Multi-cycle designs were standard in the 1980s and remain common today in microcontrollers, where simplicity and low gate count matter more than raw speed. They are the natural stepping stone between a single-cycle teaching CPU and a pipelined high-performance design. The pipelined design, which we will study in detail in Chapter 22, takes the multi-cycle idea one step further: rather than letting one instruction occupy each stage in turn while the other stages sit idle, it lets several instructions be in flight simultaneously, each in a different stage, so that on every clock cycle one instruction enters fetch and one instruction completes writeback. In the ideal case, instruction throughput rises to one instruction per cycle, even though each instruction still takes several cycles to traverse the pipeline.

To see the difference clearly, consider three instructions executed in each style:

Plain Text

Single-cycle:
  cycle 1:  I1 [F D E M W]
  cycle 2:  I2 [F D E M W]
  cycle 3:  I3 [F D E M W]
  total:    3 cycles (but each cycle is very long)

Multi-cycle (no overlap):
  cycle 1:  I1 F
  cycle 2:  I1 D
  cycle 3:  I1 E
  cycle 4:  I1 M
  cycle 5:  I1 W
  cycle 6:  I2 F
  ...
  total:    15 cycles if each instruction uses all five steps (but each cycle is short)

Pipelined (Chapter 22 preview):
  cycle 1:  I1 F
  cycle 2:  I1 D   I2 F
  cycle 3:  I1 E   I2 D   I3 F
  cycle 4:  I1 M   I2 E   I3 D
  cycle 5:  I1 W   I2 M   I3 E
  cycle 6:         I2 W   I3 M
  cycle 7:                I3 W
  total:    7 cycles, with 3 instructions completed

The pipelined picture is what every modern general-purpose processor approximates. It keeps the short clock period of the multi-cycle design and wins back a throughput of one instruction per cycle by overlapping the stages of consecutive instructions. The hazards and corner cases that arise when instructions are in flight at the same time will occupy us in Chapter 22 and beyond.
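
The cycle counts in the three diagrams follow simple formulas, sketched below for n instructions and k stages. The sketch assumes an ideal pipeline with no stalls and, for the multi-cycle case, that every instruction uses all k steps; the function names are illustrative.

```python
def single_cycle_cycles(n):
    """One (long) cycle per instruction."""
    return n

def multi_cycle_cycles(n, k=5):
    """k short cycles per instruction, assuming each uses every step."""
    return n * k

def pipelined_cycles(n, k=5):
    """k cycles to fill the pipeline, then one completion per cycle."""
    return k + n - 1 if n > 0 else 0

def pipeline_schedule(n, stages="FDEMW"):
    """Return, for each cycle, which instruction occupies which stage."""
    k = len(stages)
    rows = []
    for cycle in range(1, n + k):          # cycles 1 .. n + k - 1
        row = []
        for i in range(1, n + 1):
            s = cycle - i                  # stage index of instruction i
            if 0 <= s < k:
                row.append(f"I{i} {stages[s]}")
        rows.append(row)
    return rows

print(single_cycle_cycles(3), multi_cycle_cycles(3), pipelined_cycles(3))
# prints 3 15 7
for number, row in enumerate(pipeline_schedule(3), start=1):
    print(f"cycle {number}: " + "   ".join(row))
```

As n grows, the k - 1 cycles spent filling the pipeline become negligible, which is why the throughput approaches one instruction per cycle.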

13.Summary

The instruction cycle is the heartbeat of every general-purpose computer. Its five logical stages — fetch, decode, execute, memory access, writeback — describe the journey of every instruction from a pattern of bits in memory to a change in architectural state. The fetch stage brings the instruction into the CPU and updates the program counter, leaning on the front-end machinery of prefetchers, branch predictors, and return-address stacks to keep instructions flowing. The decode stage interprets the instruction, routes its fields, and produces the cycle's control signals; the cost of doing so depends sharply on whether the ISA uses fixed- or variable-width encoding. The execute stage uses the ALU to perform arithmetic, evaluate branch conditions, or compute addresses, and on architectures that support predication or conditional select it may also gate its result on the current flags. The memory access stage completes loads and stores. The writeback stage commits the result to the named destination register or to the program counter, and it is here that the cycle's accumulated exception checks finally decide whether the instruction's effects become permanent or are discarded. Sequencing the cycle is itself either a hardwired finite state machine or a small microprogram, and modern processors blend the two. A single-cycle implementation packs all of this into one long clock period; a multi-cycle implementation spreads it over several short ones; a pipelined implementation, our subject in Part V, overlaps the stages of many instructions to combine the short clock period of the multi-cycle design with the high throughput of the single-cycle one.

Chapter 9 turns from the CPU itself to the I/O subsystem, and asks how the same fetch–execute machinery interacts with devices that live outside the processor's neat synchronous world.


	cmp r0, #0
	addgt r1, r1, #1 ; r1 += 1 if r0 > 0
	sublt r1, r1, #1 ; r1 -= 1 if r0 < 0