Input/Output Organization
May 16, 2026·31 min read·beginner
So far the discussion of computer organization has stayed inside a comfortable, regular world. The CPU runs at a steady clock; memory answers requests in a few nanoseconds; programs and data have neat fixed sizes. The moment we step beyond this world — the moment the computer has to talk to a keyboard, a network card, a disk, or a sensor — almost all of those nice properties break down. Devices operate at speeds millions of times slower than the processor. They generate events whose timing the program does not control. They have their own internal state machines, their own buffer sizes, their own quirks of behavior. Bringing them into the synchronous, predictable framework of the CPU is the job of the input/output subsystem, and it is the subject of this chapter.
We will look at two main families of techniques. The first is the question of how the CPU addresses devices: with the same load and store instructions it uses for memory, or with separate I/O-only instructions. The second is the question of when the CPU notices that a device needs attention: by repeatedly checking, by waiting for the device to interrupt it, or by handing the work off entirely to a separate engine. Each technique has its place, and modern systems use all of them simultaneously.
01. Memory-Mapped I/O
The cleaner of the two main addressing strategies is memory-mapped I/O. The idea is to assign each device's registers a range of addresses in the same address space the CPU uses for ordinary memory. From the program's point of view, reading a status bit from a network card looks exactly like reading a byte from RAM. The same load and store instructions, the same addressing modes, the same calling conventions — everything generalizes.
A typical memory-mapped device has a small set of device registers, sometimes called MMIO registers (memory-mapped I/O registers). They are not RAM; they are flip-flops or latches inside the device that the CPU happens to be able to read and write through a memory-style interface. A simplified UART (a serial port) might expose three registers:
| address | name | type | meaning |
|---|---|---|---|
| 0xFE00_0000 | data | R/W | read: incoming byte; write: outgoing byte |
| 0xFE00_0004 | status | R | bit 0: rx ready; bit 1: tx empty |
| 0xFE00_0008 | control | R/W | bit 0: rx-int en; bit 1: tx-int en; bit 2: rx en; bit 3: tx en |
To send a byte, software writes the byte to the data register. To receive one, software waits for status bit 0 to read as 1 and then reads from data. The CPU does not need any special instructions; ordinary ld and st (or mov on x86) reach the device exactly as they reach RAM.
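As a rough illustration of the register table above, the following C sketch overlays a structure on the three registers and wraps the send and receive steps in two helpers. The 32-bit register width and the names `uart_regs`, `uart_try_send`, and `uart_try_recv` are assumptions made for this example; the table gives only the offsets and bit meanings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the table: data at +0, status at +4, control at +8.
   The 32-bit width of each register is an assumption of this sketch. */
typedef struct {
    volatile uint32_t data;    /* read: incoming byte; write: outgoing byte */
    volatile uint32_t status;  /* bit 0: rx ready; bit 1: tx empty */
    volatile uint32_t control; /* bits 0..3: rx-int en, tx-int en, rx en, tx en */
} uart_regs;

static_assert(offsetof(uart_regs, status) == 4, "status must sit at offset 4");
static_assert(offsetof(uart_regs, control) == 8, "control must sit at offset 8");

enum { UART_RX_READY = 1 << 0, UART_TX_EMPTY = 1 << 1 };

/* On real hardware a driver would point a uart_regs * at 0xFE000000;
   here the caller passes any uart_regs, so the code can run on a host. */

/* Try to send one byte; returns false if the transmitter is still busy. */
bool uart_try_send(uart_regs *u, uint8_t byte) {
    assert(u != NULL);
    if ((u->status & UART_TX_EMPTY) == 0) return false;
    u->data = byte;            /* an ordinary store reaches the device */
    return true;
}

/* Try to receive one byte; returns false if nothing has arrived. */
bool uart_try_recv(uart_regs *u, uint8_t *out) {
    assert(u != NULL && out != NULL);
    if ((u->status & UART_RX_READY) == 0) return false;
    *out = (uint8_t)u->data;   /* an ordinary load reads the device */
    return true;
}
```

The `volatile` qualifier is what tells the compiler that each access must actually be performed; without it, repeated reads of `status` could legally be folded into one.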
The interconnect, of course, has to know which addresses lead to RAM and which to the I/O subsystem. This is the job of an address decoder. Each transaction's address is examined by the decoder, which routes the request to the appropriate target: the memory controller, an I/O bridge, a particular device. The decoder is usually configured at boot time by firmware, which discovers the devices and their address ranges, often by walking a tree of buses such as PCI Express.
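The decoder's routing decision can be modelled in software as a lookup over address windows. The sketch below is only a model: real decoders are combinational hardware, and every window except the UART base address is a made-up value for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TARGET_NONE, TARGET_RAM, TARGET_UART, TARGET_IO_BRIDGE } target_t;

typedef struct {
    uint32_t base;    /* first address of the window */
    uint32_t size;    /* window length in bytes */
    target_t target;  /* where matching transactions are routed */
} window_t;

/* Hypothetical map: only the UART base (0xFE00_0000) comes from the text. */
static const window_t address_map[] = {
    { 0x00000000u, 0x80000000u, TARGET_RAM },
    { 0xFE000000u, 0x00001000u, TARGET_UART },
    { 0xFF000000u, 0x00100000u, TARGET_IO_BRIDGE },
};

/* Route one transaction address to its target. */
target_t decode(uint32_t addr) {
    for (size_t i = 0; i < sizeof address_map / sizeof address_map[0]; i++) {
        const window_t *w = &address_map[i];
        assert(w->size != 0);  /* a zero-length window could never match */
        /* addr - base wraps around for addr < base, so one compare suffices */
        if (addr - w->base < w->size)
            return w->target;
    }
    return TARGET_NONE;        /* no target claims this address */
}
```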
Memory-mapped I/O has several genuine advantages.
The CPU needs no special I/O instructions. A simpler ISA is easier to implement, easier to teach, and easier to compile to. RISC architectures including ARM and RISC-V use memory-mapped I/O exclusively for this reason.
All of the CPU's addressing modes work for I/O. A driver can index into a device's register file with a base register and offset just like any other memory access. Compiler tricks like volatile pointers, pointer arithmetic, and even structures laid out over MMIO registers all work.
Page-based memory protection covers I/O access as a side effect. Mapping a device's registers into a process's page table at the right privilege level lets the operating system control which programs are allowed to talk to which devices, using the same machinery that protects ordinary memory. We will see how this works in Chapter 19 on virtual memory.
The disadvantages are subtle but real. Because device registers live in the same address space as memory, the CPU's caching and reordering machinery must be told to behave differently for them. A read from a device's status register must actually go to the device every time; the CPU cannot cache it, because the device may have changed the bit since the previous read. A write to a control register must reach the device in the right order, because writing the wrong value at the wrong time can crash the device or corrupt data. The processor and the operating system therefore mark MMIO regions as non-cacheable and impose strict ordering through memory-mapping attributes and barrier instructions. We will see these mechanisms again in Chapter 19 (virtual memory and protection) and Chapter 31 (memory consistency).
In practice, memory-mapped I/O is the dominant approach in essentially all modern systems. Even x86, which historically supported a separate I/O space, uses MMIO for nearly all modern devices.
02. Port-Mapped I/O
The older alternative, surviving mostly on x86, is port-mapped I/O, sometimes called isolated I/O. Devices live in a separate address space — the I/O port space — that is reached only by special instructions. On x86, those instructions are IN and OUT:
```asm
; read a byte from I/O port 0x60 (the legacy keyboard controller)
in al, 0x60
; write a byte to I/O port 0x70
mov al, 5
out 0x70, al
```
The CPU drives port numbers on the same address lines it uses for memory, but marks the transaction with a separate control signal (the M/IO pin on the 8086; a distinct transaction type in modern bus protocols) that distinguishes I/O accesses from memory accesses. A device on an I/O port simply does not appear in the regular memory address space; software cannot reach it without using the I/O instructions.
Port-mapped I/O made sense when address spaces were small. On the original 8086, with 20-bit addresses giving 1 MiB of memory, devoting parts of memory to I/O would have wasted scarce address space. A separate 16-bit port space, addressing 64 KiB of devices, sidestepped that problem and kept the memory map clean. Most early PC peripherals — keyboard controller, programmable interrupt controller, timer, DMA controller, serial ports, parallel ports — sat in the port space.
Modern x86-64 still supports IN and OUT for backward compatibility, and the port space still exists. But for any device introduced in the last twenty years — graphics cards, USB controllers, NVMe drives, network adapters — communication is through MMIO. The port space is essentially a museum of legacy controllers.
The conceptual point worth taking from port-mapped I/O is the realization that the choice between port-mapped and memory-mapped is, at heart, a packaging decision. Both schemes simply expose a set of registers to the CPU. The differences — separate vs. unified address space, special vs. general instructions — are matters of cost and convenience. Once a program has the address (or port number) of a register, talking to it looks much the same in either case.
03. Polling and Interrupts
Suppose the CPU has issued a command to a slow device — say, asking a serial port to transmit a byte — and now needs to know when the device is ready for more work. There are two basic strategies.
Polling
The simplest is polling. The CPU repeatedly reads the device's status register and checks whether the bit it cares about has changed.
```c
// software polling loop
while ((*UART_STATUS & TX_EMPTY) == 0) {
    /* spin */
}
*UART_DATA = next_byte;
```
Polling is conceptually trivial: no extra hardware mechanism is needed beyond the ability to read the status register, which we already have through MMIO. It has predictable timing, because the polling code is in full control of when checks happen. And it avoids the considerable complexity of asynchronous event handling.
Its drawbacks are equally obvious. The CPU spends real time on the polling loop, time it could have used for useful work. If the device is much slower than the CPU — and it nearly always is — the polling loop consumes cycles that have no chance of seeing a change in status. Polling a network card might mean spinning for a million cycles before a single packet arrives. Polling is acceptable when the device is so fast that the spin is short, when responsiveness is critical and there is no other work to do, or when the simplicity is worth the inefficiency. It is unacceptable as the general-purpose mechanism for I/O.
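One common way to limit the cost of a polling loop is to bound it, so that a device that never becomes ready cannot hang the CPU forever. The helper below is a minimal sketch of that idea; the name `poll_until_set` and the spin-count budget are assumptions of this example, not part of any particular driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Spin until (*reg & mask) is nonzero or the budget runs out.
   The volatile qualifier forces a fresh read on every iteration. */
bool poll_until_set(const volatile uint32_t *reg, uint32_t mask,
                    unsigned long max_spins) {
    assert(reg != NULL && mask != 0);
    for (unsigned long i = 0; i < max_spins; i++) {
        if ((*reg & mask) != 0)
            return true;   /* the device reported ready */
    }
    return false;          /* timed out: the caller decides what to do next */
}
```

A caller would pass the address of the status register and the bit of interest, then either write the next byte or report a timeout depending on the result.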
Interrupts
The more flexible strategy is the interrupt. Instead of repeatedly asking a device whether it needs attention, the CPU lets the device tell it. When the device's status changes — a byte has been received, a transmission has completed, a disk operation has finished — the device asserts a signal that causes the CPU to suspend whatever it is doing, run a special routine called an interrupt handler or interrupt service routine (ISR), and then resume.
The hardware mechanism is straightforward in outline.
The CPU has one or more interrupt request input lines. Devices that need attention assert these lines.
When such a line is asserted, the CPU finishes its current instruction (or reaches a clean stopping point in pipelined designs) and, instead of fetching the next instruction in the program, transfers control to the interrupt handler.
To find the handler, the CPU consults a table, variously called the interrupt vector table, the IDT on x86, or the exception vector table on ARM, indexed by the source of the interrupt. Each entry holds the address of the corresponding handler.
Before jumping to the handler, the CPU saves enough state to be able to resume the interrupted program afterward: at minimum the program counter and the flags, often more. Some architectures push this state onto a stack; others copy it into special saved-state registers.
The handler runs. It typically reads a few of the device's registers to learn what happened, processes the event (perhaps copying a received byte into a buffer, or signaling a higher-level subsystem), acknowledges the interrupt at the device and at the interrupt controller, and returns.
A special return-from-interrupt instruction (IRET on x86, eret on ARM, mret/sret on RISC-V) restores the saved state and resumes the interrupted program at the exact instruction that was about to execute when the interrupt arrived.
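The sequence of steps above can be mimicked in ordinary C as a table of function pointers plus a save and restore of processor state. This is a software model for intuition only; real hardware performs these steps itself, and the structure `cpu_state`, the eight-entry table, and the choice to clear the flags on entry are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 8

typedef void (*isr_t)(void);

typedef struct {
    uint32_t pc;     /* program counter of the interrupted program */
    uint32_t flags;  /* status flags */
} cpu_state;

static isr_t vector_table[NUM_VECTORS];  /* one handler address per source */
static unsigned handled_count;           /* bumped by the example handler */

/* Example handler: a real one would read and acknowledge the device. */
void timer_isr(void) { handled_count++; }

void install_handler(unsigned vector, isr_t handler) {
    assert(vector < NUM_VECTORS);
    vector_table[vector] = handler;
}

/* Model of what happens when interrupt `vector` is accepted. */
void take_interrupt(cpu_state *cpu, unsigned vector) {
    assert(cpu != NULL && vector < NUM_VECTORS);
    assert(vector_table[vector] != NULL);
    cpu_state saved = *cpu;   /* save the PC and flags */
    cpu->flags = 0;           /* handler starts with clean flags (assumption) */
    vector_table[vector]();   /* index the table and run the handler */
    *cpu = saved;             /* return-from-interrupt: restore and resume */
}
```

After `take_interrupt` returns, the saved program counter and flags are exactly what they were before, which is the property that lets the interrupted program continue unaware.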
The advantage over polling is decisive. The CPU does not need to know when an event will happen, only how to handle it once it does. While no events are pending, the CPU runs application code at full speed; when an event arrives, the handler runs briefly and control returns. Modern operating systems use interrupts as the basic event mechanism for almost everything: the timer interrupt that drives scheduling, the keyboard and mouse interrupts that drive input, the disk and network interrupts that drive I/O.
Interrupts come with their own complications. Code running inside an interrupt handler cannot itself be preempted in the same way ordinary code can; an interrupt that arrives while a handler is running has to be either deferred or carefully nested. Many architectures distinguish between maskable interrupts, which the kernel can defer, and non-maskable interrupts, which it cannot. Latency from interrupt arrival to handler entry is a key real-time metric, especially in embedded systems. We will return to these topics in Chapter 15 (where we discuss exceptions and traps in general) and Chapter 57 (where we discuss real-time constraints).
A typical timing diagram of an interrupted program looks like this:
Figure: Interrupt timing: the program runs I1 through I3, jumps into a handler that saves, executes, restores, and returns, then resumes at I4 through I6
\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round]
\draw[thick] (0, 0) -- (3, 0);
\node at (1.5, 0.4) {program: I1 I2 I3};
\draw[thick] (7, 0) -- (10, 0);
\node at (8.5, 0.4) {program: I4 I5 I6};
\draw[thick, fill=white] (3.4, -2.1) rectangle (6.6, -0.9);
\node[align=center] at (5, -1.5) {handler\\save / body / restore / return};
\draw[->] (3, 0) -- (3.4, -0.9);
\draw[->] (6.6, -0.9) -- (7, 0);
\end{tikzpicture}
The interrupt arrived between I3 and I4, the handler ran to completion, and execution resumed at I4 as if nothing had happened. This is the basic illusion the interrupt mechanism preserves: from the program's perspective, control flows linearly, regardless of how often the hardware diverted into a handler.
04. DMA
For devices that move large amounts of data — disks, network cards, graphics cards, sound cards — even interrupt-driven I/O is too inefficient. If the CPU has to be involved in transferring every byte from a device to memory, an interrupt on every byte (or even every cache line) would consume the entire processor. The solution is direct memory access, or DMA: a separate piece of hardware that performs bulk transfers between a device and main memory without the CPU's involvement.
A DMA-capable device contains, or is paired with, a small engine called a DMA controller. The CPU programs the controller with a description of the transfer it wants performed: a source address (a device register or a memory location), a destination address, a length, perhaps a stride or a scatter–gather list. The CPU then issues a "go" command and turns its attention to other work. The DMA controller arbitrates for the memory bus and performs the transfer one cache line or one bus transaction at a time, until the entire block has been moved. When it is done, it raises an interrupt to tell the CPU.
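The programming sequence just described (fill in source, destination, and length, then issue "go") can be sketched against a simulated controller. The register block `dma_regs` below is hypothetical and not modelled on any real device, and `dma_engine_run` stands in for hardware that would really run concurrently with the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA controller registers, for illustration only. */
typedef struct {
    const uint8_t *src;  /* source address */
    uint8_t *dst;        /* destination address */
    size_t len;          /* number of bytes to move */
    bool go;             /* set by the CPU to start the transfer */
    bool irq_pending;    /* set by the controller on completion */
} dma_regs;

/* CPU side: describe the transfer, then issue "go". */
void dma_start(dma_regs *d, const void *src, void *dst, size_t len) {
    assert(d != NULL && src != NULL && dst != NULL);
    d->src = src;
    d->dst = dst;
    d->len = len;
    d->irq_pending = false;
    d->go = true;        /* from here on the CPU is free to do other work */
}

/* Stand-in for the hardware engine. Real hardware would move the block one
   bus transaction at a time; a single memcpy models the end result. */
void dma_engine_run(dma_regs *d) {
    assert(d != NULL);
    if (!d->go) return;
    memcpy(d->dst, d->src, d->len);
    d->go = false;
    d->irq_pending = true;   /* the completion interrupt */
}
```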
Figure: DMA transfer: the CPU programs the device, the DMA controller moves data between device and memory without CPU involvement, then raises an interrupt on completion
\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
blk/.style={draw, thick, fill=white, minimum width=2.6cm, minimum height=0.8cm}]
\node[blk] (cpu) at (1, -0.5) {CPU};
\node[blk] (dev) at (9, -0.5) {Device};
\node[blk] (dma) at (5, -2.5) {DMA controller};
\node[blk] (mem) at (5, -4.5) {Memory};
\draw[->] (cpu) -- (dev) node[midway, above, font=\footnotesize] {1. program DMA};
\draw[->] (dev.south) -- (dma.east);
\draw[<->] (dma) -- (mem) node[midway, right, font=\footnotesize] {2. transfer (no CPU)};
\draw[->] (dev.west) -- (cpu.east) node[midway, below, font=\footnotesize] {3. interrupt on completion};
\end{tikzpicture}
DMA changes the cost structure of I/O dramatically. The CPU is involved only at the start and end of a transfer; the actual movement of data, which may be megabytes or gigabytes long, happens in parallel with the CPU running other code. Depending on the platform, a modern machine can move tens to hundreds of gigabytes per second between devices and memory through DMA, with the processor involved only in setup and completion.
DMA also introduces complications that did not exist in CPU-driven I/O.
The DMA controller and the CPU are now both potential masters of the memory subsystem. They have to share the memory bus, and they may compete for cache lines. The interconnect must arbitrate between them, and the cache hierarchy must be coherent enough that DMA writes are visible to the CPU's later reads and vice versa. We will return to coherence in Chapter 31; for now, the important fact is that this issue exists.
DMA breaks the simple model in which all memory accesses come from the CPU and therefore go through the CPU's memory management unit. A virtual address programmed into a DMA controller is meaningless, because the device has no way to translate it. Either the operating system has to give the controller a physical address (and pin the underlying memory pages so they cannot be moved or swapped), or the system needs an IOMMU, a separate translation engine that performs virtual-to-physical translation for DMA traffic. Modern systems generally include an IOMMU, partly for this reason and partly to prevent malicious or buggy devices from writing to arbitrary memory.
DMA introduces cache coherence concerns. If the CPU has cached part of the buffer that the DMA controller is writing into, the CPU's later read of the buffer would return stale cached data. Modern cache-coherent fabrics (CHI on ARM, the QPI/UPI/Infinity Fabric coherent links on x86, CCIX, CXL.cache on emerging systems) extend coherence to DMA traffic so the OS does not have to perform manual cache invalidations. On older or smaller systems, the OS still has to issue explicit cache flush operations around DMA buffers.
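To make the IOMMU's two roles (translation and protection) concrete, here is a toy single-level translation table in C. Real IOMMUs use multi-level tables and per-device contexts; the 4 KiB page size, the 16-page address space, and all names here are assumptions chosen to keep the sketch small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define IOMMU_PAGES 16   /* size of this toy I/O address space, in pages */

typedef struct {
    bool valid;          /* is this I/O page mapped at all? */
    bool writable;       /* may the device write through it? */
    uint64_t phys_page;  /* physical page number it maps to */
} iommu_entry;

typedef struct { iommu_entry table[IOMMU_PAGES]; } iommu;

/* OS side: grant the device access to one physical page. */
void iommu_map(iommu *m, uint32_t io_page, uint64_t phys_page, bool writable) {
    assert(m != NULL && io_page < IOMMU_PAGES);
    m->table[io_page] = (iommu_entry){ true, writable, phys_page };
}

/* Hardware side: translate a device-issued address. Returns false (a fault)
   if the page is unmapped or the access violates its permissions. */
bool iommu_translate(const iommu *m, uint32_t io_addr, bool is_write,
                     uint64_t *phys) {
    assert(m != NULL && phys != NULL);
    uint32_t page = io_addr >> PAGE_SHIFT;
    if (page >= IOMMU_PAGES || !m->table[page].valid) return false;
    if (is_write && !m->table[page].writable) return false;
    *phys = (m->table[page].phys_page << PAGE_SHIFT)
          | (io_addr & (PAGE_SIZE - 1));
    return true;
}
```

A device that issues a write to a read-only or unmapped page gets a fault instead of a physical address, which is how a buggy or malicious device is kept away from arbitrary memory.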
A useful taxonomy of I/O transfer styles, then, is this:
| Mode | Per-byte CPU cost | Latency | Best for |
|---|---|---|---|
| Polled | High | Lowest | Tiny, fast, time-critical transfers |
| Interrupt-driven | Moderate | Low–moderate | Small, occasional transfers (keyboard, mouse) |
| DMA + interrupt | Negligible | Higher startup cost, low per-byte | Bulk transfers (disk, network, GPU) |
Real systems use all three. A keyboard delivers keystrokes through interrupts, one at a time. A network card delivers packets through DMA into ring buffers and signals an interrupt when packets have been deposited. A disk transfers blocks through DMA. A polled I/O loop might be used for the highest-throughput, lowest-latency networking, where even one interrupt per packet is too expensive. The job of the operating system is to use each mechanism where it fits.
05. Device Registers
We have referred several times to device registers without examining them closely. Let us pause briefly on what is actually inside a modern device.
A device, from the CPU's point of view, is a small computer of its own. It has a state machine that does whatever the device exists to do — moving data between a magnetic platter and an interface, framing and de-framing network packets, scanning a matrix of pixels. The CPU communicates with this internal state machine through a window of registers that are exposed in MMIO space.
Several kinds of registers are common.
Data registers hold values being moved between the device and the CPU. A simple UART has a transmit data register and a receive data register, each one byte wide. A higher-throughput device might have a FIFO of dozens or hundreds of entries, accessed through a single MMIO address that auto-advances on each read or write.
Status registers hold flags that describe the device's current condition. Ready, busy, error, FIFO full, FIFO empty, interrupt pending: every device has its own collection. Status registers are read-only from the CPU's side, though reading some bits may have side effects (for example, reading a UART's data register typically clears the rx ready status bit).
Control registers allow the CPU to configure the device or issue commands. Setting bits to enable transmitters and receivers, choosing baud rates, selecting modes, kicking off transfers — all of these are control register writes.
Address and length registers for DMA-capable devices hold the parameters of pending or in-progress transfers. Programming a transfer means writing source and destination addresses, transfer length, and possibly a transfer-size or stride into these registers, then writing a "go" bit in the control register.
Interrupt registers describe pending interrupts and let software acknowledge them. A modern device may have dozens of distinct interrupt sources, with mask and pending registers identifying which is asserting.
Device drivers are, at heart, programs that read and write device registers in the right order to make the device do what the operating system wants. The complexity of a real driver — the Linux drivers for a modern NVMe SSD, a 100-gigabit network card, or an Nvidia GPU run to tens of thousands of lines — comes from the elaborate sequences of register accesses that the device's specification requires, not from any deep computational complexity.
06. Device Discovery and Enumeration
The driver model assumes that the operating system already knows which devices are present, what kind they are, and where their registers live in the memory map. On the earliest computers, the answer was hardwired: a particular controller always sat at a particular address, and software was compiled with that address baked in. Modern systems have moved decisively away from this model, because the same operating-system binary must boot on machines with very different sets of devices.
Three mechanisms dominate.
On PCI Express systems, the bus itself is self-describing. Each device exposes a small region of standardized configuration space — 256 bytes in legacy PCI, 4 KiB in PCIe — with fields identifying the vendor, the device class, the supported features, and a set of base address registers (BARs) describing the address ranges the device wants in the memory map. At boot, firmware (or the OS) walks the tree of PCIe switches, reads each device's configuration space, allocates address ranges to its BARs by writing them, and records the resulting map. A driver later attaches to a device by matching its vendor and device IDs against a table; once attached, it reads the assigned BAR addresses and proceeds. The whole process, called PCI enumeration, is the reason you can plug a brand-new GPU into a working computer and have it appear without recompiling the kernel.
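One detail of enumeration that is easy to show in code is how software learns the size of the window a BAR wants: it writes all ones, reads the value back, and decodes which address bits the device refused to accept. The sketch below runs that probe against a simulated 32-bit memory BAR; `sim_bar` and its helpers are hypothetical stand-ins for real configuration-space accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAR_FLAG_MASK 0xFu  /* low four bits of a memory BAR are type flags */

/* Toy model of one 32-bit memory BAR. A real BAR hard-wires its low address
   bits to zero according to the window size the device requests. */
typedef struct {
    uint32_t value;  /* current contents */
    uint32_t size;   /* requested size: a power of two, at least 16 */
} sim_bar;

void bar_write(sim_bar *b, uint32_t v) {
    /* only address bits at or above the size are writable; flags are fixed */
    b->value = (v & ~(b->size - 1)) | (b->value & BAR_FLAG_MASK);
}

uint32_t bar_read(const sim_bar *b) { return b->value; }

/* Sizing probe: write all ones, read back, decode, restore the old value. */
uint32_t bar_probe_size(sim_bar *b) {
    assert(b != NULL);
    uint32_t original = bar_read(b);
    bar_write(b, 0xFFFFFFFFu);
    uint32_t readback = bar_read(b) & ~BAR_FLAG_MASK;
    bar_write(b, original);
    return ~readback + 1u;  /* lowest writable address bit gives the size */
}
```

For a 4 KiB window the read-back is `0xFFFFF000`, whose complement plus one is `0x1000`. Firmware then picks a free, suitably aligned range of that size and writes its base into the BAR.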
On embedded systems where there is no enumerable bus, the equivalent role is played by a device tree. A device tree is a small, structured data file (the canonical format is the flattened device tree, derived from Open Firmware and adopted by Linux) that describes every device in the system: its type, its register addresses, its interrupt numbers, its clock and power domains, and any board-specific properties. Firmware passes the device tree to the operating system at boot. The kernel walks the tree, instantiates a driver for each entry it can match, and configures the device using the parameters the tree supplies. Most modern ARM and RISC-V Linux systems use this approach.
On PC-compatible platforms, the Advanced Configuration and Power Interface (ACPI) plays a similar role. ACPI tables list devices that are not enumerable by PCI (fixed motherboard hardware, embedded controllers, thermal zones) and also describe power-management capabilities, processor topology, and interrupt routing. ACPI also includes a small bytecode language, AML, in which the firmware supplies device-specific control methods for the operating system to interpret, so that this behaviour does not have to be hardcoded in every operating system.
At the level of individual bus cycles, none of this matters: a driver eventually issues loads and stores against device registers exactly as before. The discovery mechanisms simply ensure that drivers know where those registers are. The reason to mention them in this chapter is that I/O organization is incomplete without them: a system that knows how to talk to a device but does not know which devices are present is still a long way from booting.
07. Buffering, FIFOs, and Flow Control
The naive picture of I/O, in which a device produces or consumes one byte at a time and the CPU services every byte, makes good sense for slow devices like keyboards but breaks down completely for fast ones. A 10-gigabit Ethernet adapter running at line rate can deliver a minimum-size frame roughly every 67 ns (84 bytes on the wire once preamble and inter-frame gap are counted), and even full-size 1500-byte frames arrive about every 1.2 µs; an interrupt for every frame would consume more cycles than the CPU has to give. The bridge between fast devices and the relatively slow software that drives them is buffering.
The simplest buffer is a small FIFO (first-in-first-out queue) inside the device itself. A UART with a 16-byte transmit FIFO, for example, lets software write up to sixteen bytes at once and then move on; the device clocks them out one at a time and raises an interrupt only when the FIFO is empty (or has fallen below a programmable threshold). Receive FIFOs play the symmetric role: incoming bytes accumulate, and the device interrupts only after several have arrived or after a timeout. The number of interrupts per byte falls dramatically, sometimes by an order of magnitude or more, with no loss of throughput.
For higher-rate devices the FIFO grows into a ring buffer or descriptor ring in main memory. The CPU and the device share a circular array of descriptors, each describing one transfer; the device walks the ring autonomously through DMA, advancing its own head pointer, while software walks behind, advancing a tail pointer as it consumes completed entries. A single interrupt covers many transfers, and software can batch its responses. Modern network cards, NVMe SSDs, and GPUs all use ring-based interfaces; the differences between them are largely in the descriptor format.
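A descriptor ring of the kind described above might look roughly like the following. This is a single-threaded model with a simulated device; the descriptor fields, the ring size, and the `done` flag protocol are assumptions of the sketch, since every real device defines its own descriptor format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u  /* power of two, so indices wrap with a mask */

typedef struct {
    uint32_t length;  /* bytes the device deposited for this entry */
    bool done;        /* set by the device when the entry is complete */
} descriptor;

typedef struct {
    descriptor desc[RING_SIZE];
    uint32_t head;    /* next slot the device will fill */
    uint32_t tail;    /* next slot software will consume */
} ring;

/* Device side (simulated): complete one entry unless the ring is full. */
bool ring_device_complete(ring *r, uint32_t length) {
    assert(r != NULL);
    if (r->head - r->tail == RING_SIZE) return false;  /* software is behind */
    descriptor *d = &r->desc[r->head & (RING_SIZE - 1)];
    d->length = length;
    d->done = true;
    r->head++;
    return true;
}

/* Software side: consume every completed entry in one pass, as a single
   interrupt handler invocation might. Returns the entries processed. */
unsigned ring_drain(ring *r, uint32_t *total_bytes) {
    assert(r != NULL && total_bytes != NULL);
    unsigned n = 0;
    while (r->tail != r->head && r->desc[r->tail & (RING_SIZE - 1)].done) {
        descriptor *d = &r->desc[r->tail & (RING_SIZE - 1)];
        *total_bytes += d->length;
        d->done = false;  /* hand the slot back to the device */
        r->tail++;
        n++;
    }
    return n;
}
```

Because `ring_drain` handles every completed entry it finds, one interrupt can cover many transfers, which is the batching effect the text describes.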
A related idea is double buffering (or ping-pong buffering): two buffers are alternated so that one can be filled by hardware while the other is processed by software, with the roles swapped on each cycle. This is the standard way to avoid stalls in audio playback, video capture, and any other steady-state streaming application. Triple and N-way buffering generalize the idea when the consumer's processing time is irregular.
Flow control is the discipline that prevents producers from overrunning consumers. On any FIFO or ring, when the buffer fills, the producer must either stop, drop data, or apply backpressure to its source. Hardware buses include explicit flow-control signals — the ready line of a valid/ready handshake, the credit counters of a credit-based interconnect, the XOFF byte of a serial protocol — to make this safe. Drivers in turn implement higher-level flow control by leaving headroom in the ring and by signalling the upper layers (the network stack, the filesystem) when room runs short. Without flow control, fast devices either lose data, deadlock the bus, or send their software into pathological behaviour; with it, throughput remains usable across orders-of-magnitude speed differences between participants.
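Credit-based flow control, one of the schemes mentioned above, is simple enough to sketch in a few lines: the sender holds one credit per free receiver slot and may transmit only while it has credits left. The names and the single-counter model below are assumptions for illustration; real interconnects track credits per traffic class and exchange them in dedicated messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned credits;  /* free receiver slots the sender knows about */
    unsigned sent;     /* items the sender has transmitted so far */
} credit_link;

void link_init(credit_link *l, unsigned receiver_slots) {
    assert(l != NULL);
    l->credits = receiver_slots;
    l->sent = 0;
}

/* Sender: transmit only when a credit is available. */
bool link_try_send(credit_link *l) {
    assert(l != NULL);
    if (l->credits == 0) return false;  /* backpressure: the producer waits */
    l->credits--;
    l->sent++;
    return true;
}

/* Receiver: hand back one credit each time it frees a buffer slot. */
void link_return_credit(credit_link *l) {
    assert(l != NULL);
    l->credits++;
}
```

The receiver can never be overrun, because the sender stops by itself the moment the advertised space is used up.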
08. MMIO Ordering, Caching, and Barriers
Most of the time, a load or store completes far enough away from any other transaction that ordering does not matter. Memory-mapped I/O is one of the few situations where it does, and the discipline that protects it deserves its own section.
A store to a device's control register and a subsequent store to its data register often must reach the device in that order, because the control register sets up the meaning of the data register. A read from a device register may itself have a side effect: reading a UART data register clears the receive-ready bit; reading a counter register may latch a snapshot for the next read. The CPU's normal optimizations (cache it, reorder it, combine it with neighbouring accesses, drop it if it looks redundant) are all wrong for this kind of traffic.
The operating system therefore marks MMIO regions of the address space with strict attributes when it sets up the page tables. Common attribute classes include device memory (no caching, no speculative reads, no write combining, accesses occur in program order to that region), strongly ordered (the strictest level, used for the most timing-sensitive devices), and write-combining (uncached, but writes may be buffered, merged into larger bursts, and delivered out of program order until a fence drains them; useful for graphics framebuffers). The exact menu varies by architecture; the AArch64 device-nGnRnE through device-GRE spectrum is the canonical example.
Even with strict region attributes, software sometimes needs explicit synchronization. A driver that programs a DMA descriptor and then writes a kick register to start the transfer must ensure that the descriptor write has reached memory before the kick reaches the device; otherwise the device may begin reading a half-written descriptor. The instruction that enforces this is a memory barrier or fence (mfence and sfence on x86-64, dmb and dsb on AArch64, fence on RISC-V), and a single appropriately placed barrier is enough to fix what would otherwise be a deeply mysterious race condition. We will return to memory ordering in detail in Chapter 31; the point worth flagging here is that I/O code is one of the standard places where a programmer first meets it.
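The placement of the barrier in the descriptor-then-kick sequence can be shown in portable C. This is only a sketch of where the barrier goes: the device is simulated, the structure names are invented, and the C11 `atomic_thread_fence` is used as a stand-in, whereas a real driver would use the I/O write barrier its platform or kernel provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t buffer_addr;  /* where the data to transfer lives */
    uint32_t length;       /* how many bytes to transfer */
} dma_descriptor;

/* Hypothetical device: a descriptor in memory plus a kick register. */
typedef struct {
    dma_descriptor desc;     /* ordinary memory, read by the device via DMA */
    volatile uint32_t kick;  /* would be an MMIO register on real hardware */
} sim_device;

void submit(sim_device *dev, uint64_t addr, uint32_t len) {
    assert(dev != NULL && len != 0);
    /* Step 1: fill in the descriptor. */
    dev->desc.buffer_addr = addr;
    dev->desc.length = len;
    /* Step 2: keep the descriptor stores ahead of the kick. The C11 fence
       is a placeholder for the platform's real I/O write barrier. */
    atomic_thread_fence(memory_order_release);
    /* Step 3: tell the device to go. */
    dev->kick = 1;
}
```

Without step 2, both the compiler and the processor would be free to let the kick overtake the descriptor stores.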
A related complication is MMIO write posting. To improve throughput, modern interconnects often allow the CPU to consider a write "done" as soon as it has been handed off to the bus, even if the device has not yet accepted it. A subsequent read from the same device acts as an implicit barrier: the read cannot complete until prior writes have, so a driver that wants to know whether a previous write has reached the device can issue a follow-up read of any non-side-effecting register. This write-then-read idiom appears in virtually every device driver and is a common source of subtle bugs when omitted.
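The effect of write posting, and the reason a read-back flushes it, can be seen in a toy model where writes sit in a queue until a read forces them through. The queue depth and all names are assumptions; real interconnects implement this in hardware with their own ordering rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POST_DEPTH 4

/* Toy posted-write path: writes queue up in the interconnect and reach the
   device later; a read has to push them out first. */
typedef struct {
    uint32_t queued[POST_DEPTH];
    unsigned count;       /* writes accepted by the bus, not yet delivered */
    uint32_t device_reg;  /* the value the device has actually received */
} posted_bus;

void bus_write(posted_bus *b, uint32_t value) {
    assert(b != NULL && b->count < POST_DEPTH);
    b->queued[b->count++] = value;  /* the CPU considers the write done here */
}

uint32_t bus_read(posted_bus *b) {
    assert(b != NULL);
    for (unsigned i = 0; i < b->count; i++)  /* drain earlier writes in order */
        b->device_reg = b->queued[i];
    b->count = 0;
    return b->device_reg;
}

/* Write-then-read idiom: after this returns, the write has landed. */
void write_and_flush(posted_bus *b, uint32_t value) {
    bus_write(b, value);
    (void)bus_read(b);
}
```

After a plain `bus_write`, `device_reg` still holds its old value; only the read inside `write_and_flush` guarantees delivery.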
09. I/O Errors and Reliability
Real devices fail. A disk reports an unreadable sector. A network adapter detects a CRC error on an incoming packet. A PCIe link drops a transaction because of a transient fault. The I/O subsystem has to detect these conditions, report them to software, and — where possible — recover.
At the device level, every nontrivial controller exposes a status register with error bits and an interrupt path for asynchronous error notifications. The driver reads the status register on every interrupt, distinguishes successful completions from errors, retries or fails the affected request as appropriate, and — if the error is severe enough — escalates to the operating system kernel.
At the link level, modern interconnects include their own error-detection and recovery mechanisms. PCIe defines Advanced Error Reporting (AER), in which links record correctable errors (single-bit ECC corrections, retried packets) and uncorrectable ones (failed retries, malformed packets), and report them through a standardized hierarchy of registers and interrupts. Memory channels include ECC on the data path; storage controllers add CRC on every block; networks add CRC at multiple layers and end-to-end checksums in the protocol.
At the system level, the operating system orchestrates recovery. It may retry a failed I/O once or twice in case the failure was transient, and only then surface an error to the application. It may reset a hung device by power-cycling it, often through ACPI methods or board-specific GPIOs. It may go further and isolate a failing device through the IOMMU so that it cannot corrupt the rest of the system while it is being investigated. The most reliable systems — mainframes, fault-tolerant servers — add still more, including hot-pluggable I/O cards, redundant paths to storage, and the ability to retire failing components without rebooting.
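The retry policy described here, in which transient failures are retried a bounded number of times and fatal ones are reported at once, can be sketched as a small wrapper. The three-way status code and the `flaky_read` test device are inventions for this example rather than any operating system's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { IO_OK, IO_TRANSIENT, IO_FATAL } io_status;
typedef io_status (*io_op)(void *ctx);

/* Retry transient failures up to max_attempts times; stop immediately on
   success or on a fatal error. Returns the status of the last attempt. */
io_status io_with_retry(io_op op, void *ctx, unsigned max_attempts) {
    assert(op != NULL && max_attempts > 0);
    io_status s = IO_FATAL;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        s = op(ctx);
        if (s != IO_TRANSIENT) break;  /* done, or not worth retrying */
    }
    return s;
}

/* Simulated device: fails transiently *ctx more times, then succeeds. */
io_status flaky_read(void *ctx) {
    unsigned *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return IO_TRANSIENT;
    }
    return IO_OK;
}
```

Only when the retry budget is exhausted does the caller see `IO_TRANSIENT`, at which point it would surface the error or escalate to a device reset.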
This layered, fault-aware view of I/O is increasingly the norm rather than the exception. As systems push more work through their interconnects — high-bandwidth memory, accelerators, smart NICs — the assumption that I/O "just works" becomes harder to sustain, and architectures invest more silicon in detecting, containing, and recovering from the inevitable failures. Chapter 53 will treat reliability and validation as a topic in its own right; here it is enough to know that error reporting and recovery is part of the I/O organization, not an afterthought.
10. Interrupt Controllers
The interrupt mechanism described earlier assumed the CPU has a small, fixed number of interrupt request inputs. In a real system there are dozens or hundreds of devices that may need to interrupt the CPU, far more than a CPU can have dedicated input pins for. The hardware that bridges the gap is the interrupt controller.
An interrupt controller sits between the devices and the CPU's interrupt input. Each device's interrupt request is wired to one of the controller's input lines. The controller maintains, internally, the state of which interrupts are pending and which are enabled. When at least one enabled interrupt is pending, it asserts a single interrupt request to the CPU. When the CPU acknowledges the interrupt, the controller delivers a number — the interrupt vector — that identifies which source caused the interrupt. The CPU uses that number to index into the interrupt vector table and reach the right handler.
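The pending and enabled state, the single request line, and the acknowledge step can be modelled with two bitmasks. This sketch assumes 32 sources and treats the lowest-numbered source as the highest priority; both choices, and all the function names, are assumptions rather than properties of any particular controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t pending;  /* bit n set: source n has requested service */
    uint32_t enabled;  /* bit n set: source n is allowed to interrupt */
} intc;

/* Device side: source `n` asserts its request line. */
void intc_raise(intc *c, unsigned source) {
    assert(c != NULL && source < 32);
    c->pending |= UINT32_C(1) << source;
}

/* The single request line from the controller to the CPU. */
int intc_irq_asserted(const intc *c) {
    assert(c != NULL);
    return (c->pending & c->enabled) != 0;
}

/* CPU acknowledge: return the vector of the lowest-numbered active source
   and clear its pending bit, or -1 if nothing is active. */
int intc_acknowledge(intc *c) {
    assert(c != NULL);
    uint32_t active = c->pending & c->enabled;
    for (int v = 0; v < 32; v++) {
        if (active & (UINT32_C(1) << v)) {
            c->pending &= ~(UINT32_C(1) << v);
            return v;
        }
    }
    return -1;
}
```

A source that is pending but not enabled stays recorded in `pending` without ever reaching the CPU, which is how masking defers an interrupt rather than losing it.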
The classical example is the 8259 PIC of the original PC architecture, which provided eight interrupt inputs (with two cascaded units giving fifteen usable inputs in the AT). Modern x86 systems use a much more elaborate scheme called the APIC (advanced programmable interrupt controller), with a local APIC in each CPU core and an I/O APIC (or, in newer systems, a streamlined message-signaled interrupt mechanism) collecting interrupts from devices. ARM systems use the GIC (generic interrupt controller), which has gone through several major revisions; GICv3 and GICv4 are the ones most often found in current systems. RISC-V defines the PLIC (platform-level interrupt controller) and the newer AIA (advanced interrupt architecture).
The features these controllers provide go well beyond simple aggregation. They typically support:
- Priority levels, so that a more important interrupt can preempt the handler of a less important one.
- Masking, so that software can selectively disable individual sources without disabling all interrupts.
- Routing, so that an interrupt can be steered to a particular CPU core (or to whichever core is least busy).
- Message-signaled interrupts, in which the device sends a memory write to a special address rather than asserting a wire; this scales much better as the number of devices grows.
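Priority and masking interact in a way that is easy to see in a small model. The sketch below assumes that a larger number means a more urgent interrupt and that a handler is preempted only by a strictly higher priority; both conventions, and all the names used, are illustrative choices rather than the behavior of any specific controller.

```python
class PriorityController:
    """Toy model of masking and priority-based preemption."""

    def __init__(self, priorities):
        self.priorities = dict(priorities)  # source name -> priority (>= 0)
        self.pending = set()
        self.masked = set()
        self.in_service = []  # stack of sources whose handlers are running

    def raise_irq(self, source):
        self.pending.add(source)

    def mask(self, source):
        self.masked.add(source)

    def unmask(self, source):
        self.masked.discard(source)

    def next_to_deliver(self):
        """Return the source to deliver now, or None if nothing may preempt."""
        candidates = [s for s in self.pending if s not in self.masked]
        if not candidates:
            return None
        # Highest priority first; the name breaks ties deterministically.
        best = max(candidates, key=lambda s: (self.priorities[s], s))
        current = self.priorities[self.in_service[-1]] if self.in_service else -1
        if self.priorities[best] <= current:
            return None  # the running handler is at least as important
        return best

    def deliver(self):
        """Deliver one interrupt if allowed, pushing it onto the in-service stack."""
        source = self.next_to_deliver()
        if source is not None:
            self.pending.discard(source)
            self.in_service.append(source)
        return source

    def end_of_interrupt(self):
        """Handler finished: pop it so lower-priority sources can be delivered."""
        return self.in_service.pop()
```

With priorities `{"timer": 3, "disk": 2, "uart": 1}`, a `uart` request that arrives while the `disk` handler runs stays pending, while a `timer` request preempts it. Masking `uart` keeps it pending even when nothing else is in service.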
The interrupt controller is, in effect, a small piece of dedicated hardware whose only job is managing interrupt traffic. Its complexity is hidden from device drivers, which see only the simple "this device interrupts me" mechanism, but it is essential to making interrupts work in a system with many cores and many devices.
11. Bus Arbitration
The last topic of this chapter brings us back to the interconnect. We have been speaking as if "the CPU sends a request to memory" or "the device sends data to memory" were single, atomic acts. In a system with multiple potential masters of the bus — multiple CPU cores, several DMA controllers, perhaps an IOMMU and a graphics processor — the question of which master gets to use the bus on any given cycle requires resolution. That resolution is bus arbitration.
Several arbitration policies exist, each with different fairness and performance properties.
Fixed-priority arbitration assigns each master a permanent rank. When two masters request the bus, the higher-ranked one wins. Fixed priority is simple and predictable, but it can starve low-priority masters under heavy load: a higher-ranked master that requests the bus on every cycle wins every time.
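A fixed-priority arbiter reduces to a scan down the rank list. The following sketch uses hypothetical master names (`cpu`, `dma0`, `dma1`) and a made-up request schedule to make the starvation effect visible.

```python
def fixed_priority_grant(requests, rank):
    """Return the highest-ranked requesting master, or None.

    requests: set of master names requesting the bus this cycle.
    rank: list of master names, highest priority first.
    """
    for master in rank:
        if master in requests:
            return master
    return None


def simulate(rank, request_schedule):
    """Arbitrate once per cycle; return how many grants each master received."""
    grants = {master: 0 for master in rank}
    for requests in request_schedule:
        winner = fixed_priority_grant(requests, rank)
        if winner is not None:
            grants[winner] += 1
    return grants
```

If `cpu` and `dma1` both request the bus on each of ten cycles, `simulate(["cpu", "dma0", "dma1"], [{"cpu", "dma1"}] * 10)` grants every cycle to `cpu` and none to `dma1`.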
Round-robin arbitration rotates priority among the masters. After a master is granted the bus, it goes to the bottom of the priority list. This guarantees fairness but can be slower for the master that does the most work, since it has to wait its turn.
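The rotation can be modeled with a list whose front holds the master that currently has the highest priority. The class and master names below are illustrative assumptions.

```python
class RoundRobinArbiter:
    """Toy round-robin arbiter: the most recent winner drops to the bottom."""

    def __init__(self, masters):
        self.order = list(masters)  # front of the list = highest priority now

    def grant(self, requests):
        """Grant the first requesting master in the current order, then rotate it."""
        for master in self.order:
            if master in requests:
                self.order.remove(master)
                self.order.append(master)  # winner moves to the bottom
                return master
        return None
```

When every master requests on every cycle, grants cycle through the masters in turn. When only some request, the idle ones are skipped and the active ones still alternate, so no requester waits behind more than the other active masters.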
Weighted arbitration schemes give high-bandwidth masters more turns. Out of every five slots, the CPU might receive two and each of two DMA controllers one, with the remaining slot left for another master such as a graphics processor.
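One simple way to realize weighted arbitration is a repeating slot table. The sketch below is one possible design, not a standard algorithm from any particular bus: the weights, the fifth master `gpu`, and the choice to hand an unused slot to the next requesting owner in the table are all assumptions made for illustration.

```python
def build_slot_table(weights):
    """Expand {master: weight} into a repeating list of slot owners."""
    table = []
    for master, weight in weights.items():
        table.extend([master] * weight)
    return table


class WeightedArbiter:
    """Toy slot-table arbiter that skips slots whose owner is not requesting."""

    def __init__(self, weights):
        self.table = build_slot_table(weights)
        self.position = 0  # index of the next slot to consider

    def grant(self, requests):
        """Grant the first slot owner, starting at the current slot, that is requesting."""
        n = len(self.table)
        for offset in range(n):
            index = (self.position + offset) % n
            if self.table[index] in requests:
                self.position = (index + 1) % n
                return self.table[index]
        return None
```

With weights `{"cpu": 2, "dma0": 1, "dma1": 1, "gpu": 1}` and every master requesting, ten cycles yield four grants for `cpu` and two for each of the others. Skipping idle slots keeps the bus busy, at the cost of making each master's share depend on what the others are doing.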
Quality-of-service schemes let software annotate transactions with priorities or deadlines, and the arbiter respects those annotations. A real-time audio stream might be annotated as latency-critical, while bulk file copies are annotated as throughput-tolerant.
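As a rough illustration of deadline annotations, the sketch below grants the pending transaction with the earliest deadline. Earliest-deadline-first is only one policy an arbiter could apply to such annotations, and the class name, master names, and deadline values are hypothetical.

```python
import heapq


class DeadlineArbiter:
    """Toy QoS arbiter: the pending transaction with the earliest deadline wins."""

    def __init__(self):
        self._queue = []  # min-heap of (deadline, sequence number, master)
        self._seq = 0     # tie-breaker: equal deadlines keep submission order

    def submit(self, master, deadline):
        """Queue a transaction annotated with a deadline (smaller = more urgent)."""
        heapq.heappush(self._queue, (deadline, self._seq, master))
        self._seq += 1

    def grant(self):
        """Return the master of the most urgent pending transaction, or None."""
        if not self._queue:
            return None
        _deadline, _seq, master = heapq.heappop(self._queue)
        return master
```

If a bulk file copy is submitted with a deadline of 1000, an audio stream with 50, and a network card with 200, the grants come out as audio, network card, file copy, regardless of the order of submission.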
In a classical shared-bus system, arbitration was a centralized function: a single arbiter received requests from all masters and granted the bus to one at a time. In a modern point-to-point fabric, arbitration is distributed among the switches and routers in the interconnect. Each switch arbitrates among the streams arriving at it, often using its own local policy. The overall effect is the same, in that at any moment each shared resource is granted to at most one user, but the implementation is fragmented across many small arbiters rather than concentrated in one.
Bus arbitration matters more in I/O than in pure CPU-and-memory traffic, because I/O introduces masters with very different characteristics. A real-time audio device cannot tolerate jitter; a streaming network card values throughput over latency; a tightly coupled accelerator may need bursts of high bandwidth. Designing arbitration so that all of these coexist on the same fabric without making any of them unhappy is one of the harder problems in system architecture.
12. Summary
The I/O subsystem exists to bridge the gap between the CPU's neat synchronous world and the messier external one. Devices are reached either through memory-mapped I/O, in which their registers occupy ranges of the regular address space, or through port-mapped I/O, in which they live in a separate space accessed by special instructions; modern systems use the former almost exclusively. The CPU notices that a device needs attention in one of three ways: by polling its status, which is simple but wasteful; by responding to interrupts, which the device asserts when something happens; or by handing whole transfers to a DMA engine that moves data while the CPU runs other code. Each strategy has its place, and a real system uses all three. Behind every device lies a set of MMIO registers (data, status, control, address, length, interrupt) through which a driver does its work, and an enumeration mechanism such as PCI configuration space, a device tree, or ACPI tables tells the operating system which devices are present and where those registers live. Buffering by FIFOs, ring descriptors, and double-buffered DMA queues, together with explicit flow-control protocols, lets devices and software exchange data smoothly across enormous speed differences. MMIO traffic is given strict region attributes that disable caching and reordering, with explicit memory barriers in driver code wherever ordering matters. Errors are detected at multiple layers (device status registers, PCIe AER, ECC on memory and links, end-to-end checksums), and recovery is orchestrated by the operating system. Interrupt controllers aggregate the requests of many devices into a tractable interface for the CPU and provide priority, masking, and routing. And underneath everything, bus arbitration ensures that the shared interconnect is shared in a way that respects fairness and performance.
In Chapter 10 we will take stock of what we have built and ask how to talk meaningfully about its performance — what we measure, how we compare implementations, and where the bottlenecks tend to hide.