Part IFoundations

Performance Basics

May 16, 2026·32 min read·beginner

Up to now, this book has used words like "fast" and "slow" without ever defining them. The previous chapters established what a computer is and how it is organized; the chapters from here on will, in…

Up to now, this book has used words like "fast" and "slow" without ever defining them. The previous chapters established what a computer is and how it is organized; the chapters from here on will, in one way or another, be devoted to making it run faster. Before we can measure progress, we need a vocabulary for talking about performance. That vocabulary is the subject of this chapter.

The vocabulary is older and more carefully developed than a beginner often suspects. Decades of architects have argued about how to compare machines, and the arguments have produced a small set of metrics — latency, throughput, clock rate, cycles per instruction, speedup — that capture most of what matters. They have also produced a very useful warning, in the form of Amdahl's law, that the part of a system you do not optimize can dominate the part you do. And they have produced an entire industry of benchmarks and performance counters that turn the metrics into things you can actually measure on real hardware. By the end of the chapter you should be able to read a performance claim about a processor and know what it means, what it does not mean, and what questions to ask next.

01. Latency and Throughput

Two distinct numbers describe the speed of any system that processes work: how long a single piece of work takes, and how many pieces of work it can finish per unit of time. The first is latency; the second is throughput. They are related, but they are not the same thing, and confusing them is one of the most common mistakes a beginner makes when reasoning about performance.

Latency is the time from when a request enters a system to when its result emerges. For a single instruction, latency is the number of clock cycles (or seconds) between the start of its execution and the moment its result is available. For a network packet, latency is the time from sending the first byte to receiving an acknowledgment. For a disk read, latency is the time between issuing the request and getting back the data. Latency is what a user feels when waiting for a single thing to happen.

Throughput is the rate at which work is produced. For a CPU, throughput is the number of instructions completed per second. For a disk, throughput is the number of bytes transferred per second. For a network, throughput is the number of packets per second or bits per second. Throughput is what a system's overall capacity looks like over time.

A simple analogy: latency is the time it takes a single car to travel a road; throughput is the number of cars per minute that pass any point on the road. A wider road can carry more cars per minute (higher throughput) without changing how long any single car's journey takes (latency). A faster road reduces the journey time (lower latency) and, all else being equal, increases the throughput too.

The two metrics often diverge sharply in real systems.

A pipelined CPU has high throughput but the same latency as an unpipelined one for any single instruction. An instruction still takes five cycles to traverse the pipeline; what changes is that five instructions can be in flight at once, so one finishes per cycle.

A network with a satellite hop has high throughput (hundreds of megabits per second) but high latency (around 250 milliseconds round trip). You can transfer a movie quickly, but you cannot play a fast-paced video game over the link.

A multi-core processor running embarrassingly parallel work has high throughput (many tasks per second) but the latency of each individual task is unchanged from the single-core case. Adding cores does not make any one task faster.

When evaluating a system, it is essential to ask which metric matters for the workload at hand. A web server cares about throughput in requests per second, but each request also has a latency budget. A scientific simulation cares about throughput in floating-point operations per second, but a single dependency chain through the simulation cares about latency. Real systems usually have to balance both.

02. Clock Rate

The most visible performance number on any processor is its clock rate or clock frequency, measured in hertz. A 3.5 GHz processor has a clock that ticks 3.5×1093.5 \times 10^9 times per second, and the period of each tick is

Tclk=13.5×109286 picoseconds.T_{\text{clk}} = \frac{1}{3.5 \times 10^9} \approx 286 \text{ picoseconds}.

The clock rate sets the upper bound on how fast the processor can do anything, because every state change inside the processor happens on a clock edge. A larger clock rate means more cycles per second, and other things being equal, more work done per second.

For two reasons, however, the clock rate alone is a poor measure of performance.

The first reason is that "other things being equal" is rarely true. Doubling the clock rate while halving the amount of work done per cycle leaves throughput unchanged. The 1990s and 2000s saw a vivid demonstration of this when Intel pushed the Pentium 4 to extremely high clock rates by deepening its pipeline; the resulting chips ran at clock rates impressive on the box but did less work per cycle than slower-clocked competitors and consumed enormous amounts of power. The industry collectively concluded that the simple "more megahertz" race was over and started competing on more meaningful metrics.

The second reason is that the clock rate is constrained by physics. The clock period must be at least as long as the worst-case path through the combinational logic between any pair of flip-flops, plus the setup, hold, and skew margins we met in Chapter 4. Modern processes give very fast transistors, but the speed of light is not negotiable: signals must still travel across the chip, and at 5 GHz a signal can move only about 6 cm per cycle in a vacuum, less in silicon. Designs that have pushed the clock rate aggressively have always paid for it in power, in pipeline depth, in cache miss penalty, and in the difficulty of timing closure.

That said, the clock rate is one of the inputs to a more meaningful metric, which we turn to next.

03. CPI and IPC

To compare two processors fairly, we need a metric that combines the clock rate with the amount of work the processor accomplishes per cycle. Two such metrics are universally used.

Cycles per instruction, or CPI, is the average number of clock cycles required to complete one instruction. For a perfectly pipelined RISC processor that issues one instruction per cycle and never stalls, CPI is exactly 1. Real processors have CPI greater than 1 because of pipeline stalls, cache misses, branch mispredictions, and dependencies. A CPI of 2 means the processor takes, on average, two cycles per instruction.

The total time to execute a program is the product

Tprogram=Ninstructions×CPI×Tclk.T_{\text{program}} = N_{\text{instructions}} \times \text{CPI} \times T_{\text{clk}}.

This equation, sometimes called the performance equation, is the most useful single tool in the performance toolkit. It separates the three things you can change to make a program faster: fewer instructions (better algorithm or better compiler), fewer cycles per instruction (better micro-architecture), or shorter clock period (faster process or fewer logic levels per cycle).

Instructions per cycle, or IPC, is the reciprocal of CPI. An IPC of 2 means the processor completes, on average, two instructions per cycle. IPC is more flattering for high-performance designs, because they can issue multiple instructions per cycle and so achieve IPC values well above 1. A modern out-of-order processor running well-behaved code can sustain an IPC of 3 to 5; a simple in-order processor running a memory-bound workload may struggle to reach 0.5.

CPI and IPC are averages, and like all averages they hide variation. Different parts of a program have different CPIs. Code that hits in the L1 cache, has predictable branches, and has plenty of independent instructions runs at the processor's peak IPC. Code that misses the cache constantly, mispredicts every branch, or has long dependency chains drops to a fraction of that. When you read that a CPU has "an IPC of 4," what is meant is usually peak issue width, which is the absolute maximum the processor can achieve under ideal conditions, not the average it sustains on real workloads.

Decomposing CPI into its components is a standard technique:

CPI=CPIideal+stalls per instruction.\text{CPI} = \text{CPI}_{\text{ideal}} + \text{stalls per instruction}.

The "stalls per instruction" term breaks down further into contributions from each source: instruction-fetch stalls, data-cache stalls, branch mispredictions, structural hazards, and so on. We will see this decomposition again in Chapter 22 when we discuss pipelining and again in Chapter 54 when we discuss performance analysis.

04. Speedup and Bottlenecks

When we change something about a system — make the cache larger, add a branch predictor, switch to a better compiler — we want to quantify how much faster it became. The standard metric is speedup.

Speedup is the ratio of the old execution time to the new execution time:

S=ToldTnew.S = \frac{T_{\text{old}}}{T_{\text{new}}}.

A speedup of 2 means the new system runs the same workload in half the time. A speedup of 1 means no improvement. A speedup less than 1 (a slowdown) means the change made things worse.

Speedup is meaningful only relative to a baseline and a workload. "X is 1.4× faster than Y" needs to be qualified: faster on which workload? With which compiler? At which clock setting? Two systems can have speedups of 0.7, 1.1, and 2.5 relative to a third on different benchmarks, and a fair comparison requires reporting all of them, not just the most flattering.

A closely related concept is the bottleneck. In any pipeline of work, one stage is always slower than the others, and that stage limits the total throughput. The bottleneck might be the CPU itself; it might be memory bandwidth; it might be disk I/O or network latency. Improving any other stage produces no speedup at all. Improving the bottleneck produces real speedup, but only until some other stage becomes the new bottleneck.

A useful diagnostic question for any optimization effort is therefore: what is the bottleneck right now? If you do not know, you do not yet know what to fix. Profiling tools, performance counters, and the techniques of Chapter 54 are designed precisely to answer this question. Without that information, "optimizing" usually means "spending effort on whichever part the engineer happens to find interesting," which often achieves nothing.

A simple example. Suppose a program spends 80 percent of its time in matrix multiplication and 20 percent in input parsing. Speeding up the matrix multiplication by 2× reduces the total time from 100 to 60 (the matrix part dropped from 80 to 40, the parsing remained at 20), giving a speedup of 100/60 ≈ 1.67×. Speeding up the parsing by 100× reduces the total time from 100 to 80.2, a speedup of only 1.25×. In both cases, what was not sped up still dominates. This phenomenon has its own name and its own famous formulation, which we turn to now.

05. Units of Performance: MIPS, FLOPS, IOPS, Bandwidth

The performance equation gives us time, the most fundamental quantity. Many practitioners, however, prefer to talk in rates — work performed per unit time — and a small but persistent vocabulary of units has grown up around these rates. Each measures something genuinely useful, and each can mislead if used carelessly.

MIPS — millions of instructions per second — is the oldest such unit. It is computed as the count of instructions retired divided by the elapsed time. The trouble is that an "instruction" is not a constant unit of work: a 64-bit multiply does more than a single-byte move, and an x86-64 instruction with a complex addressing mode and a memory operand can subsume several RISC instructions. Comparing MIPS across different ISAs is therefore close to meaningless. The unit survives because it is easy to measure and because, within a single ISA, it is a reasonable summary of how fast a processor is dispatching work.

FLOPS — floating-point operations per second — is the standard unit for scientific and machine-learning workloads. The peak FLOPS of a modern CPU core is the product of its clock rate, the number of floating-point lanes per SIMD register, the number of fused-multiply-adds it can issue per cycle, and a factor of two if FMAs are counted as two operations (a multiply and an add). For a 3 GHz core with two 256-bit FMAs per cycle (eight double-precision lanes each), the peak is

3 GHz×2 FMAs×8 lanes×2 ops/FMA=96 GFLOPS.3 \text{ GHz} \times 2 \text{ FMAs} \times 8 \text{ lanes} \times 2 \text{ ops/FMA} = 96 \text{ GFLOPS}.

A full server chip with sixty-four such cores has a peak of about 6 TFLOPS, and modern GPUs reach well past 50 TFLOPS for double precision and several hundred for the reduced precisions used in machine learning. Sustained FLOPS is almost always far below peak, because real code is rarely able to keep all the lanes busy.

IOPS — I/O operations per second — is the storage world's analogue: the number of read or write requests a device can sustain. A spinning disk delivers about 100 IOPS, a SATA SSD about 10 000 IOPS, an NVMe SSD several million IOPS at small block sizes. Like MIPS, IOPS is sensitive to what counts as an operation: a 4-KiB random read and a 1-MiB sequential read are not interchangeable.

Bandwidth measures the rate at which bytes flow through some channel — from memory to the CPU, from the CPU to a network adapter, from a GPU's HBM to its compute units. It is reported in bytes per second (or, traditionally for networking, bits per second), and it is the single most important quantity for a memory-bound or I/O-bound workload. Modern chips quote impressive bandwidth numbers — several hundred gigabytes per second for desktop DDR5 systems, a terabyte per second for HBM-equipped GPUs and accelerators — but those are aggregate peaks across many parallel channels, not the bandwidth seen by any single thread.

The trap that catches every newcomer is the assumption that high peak implies high sustained. Reaching peak FLOPS requires keeping every floating-point lane busy every cycle; reaching peak bandwidth requires issuing the right pattern of accesses to the right memory channels. A program that does neither sees only a fraction of the rated number, and the architectural skill is largely the skill of identifying which limit is binding and arranging code so that as much of it as possible is exercised.

06. The Roofline Model

A particularly useful framework for thinking about whether a workload is bound by computation or by memory traffic is the roofline model, introduced by Williams, Waterman, and Patterson in 2009. It plots achievable performance as a function of arithmetic intensity, the ratio of computation to memory traffic.

Let π\pi be the peak floating-point throughput of a machine, in FLOPS, and let β\beta be its peak memory bandwidth, in bytes per second. The arithmetic intensity II of a kernel is the ratio of FLOPs performed to bytes loaded from memory, in FLOPs per byte. The achievable performance PP for that kernel is then bounded by

Pmin(π,βI).P \leq \min(\pi, \beta \cdot I).

Graphed on log-log axes with II on the horizontal axis and PP on the vertical, the bound forms a roof: a sloped line P=βIP = \beta I on the left, where the kernel is memory-bound, joined at the ridge point I=π/βI^* = \pi / \beta to a flat ceiling P=πP = \pi on the right, where the kernel is compute-bound. Kernels with intensity well below II^* cannot reach peak performance no matter how cleverly they are coded, because there is not enough computation per loaded byte to keep the units busy. Kernels with intensity well above II^* are limited by the floating-point units, not by the memory system.

For a CPU with π=96\pi = 96 GFLOPS and β=25\beta = 25 GB/s, the ridge point is I=96/253.8I^* = 96 / 25 \approx 3.8 FLOPs per byte. A dense matrix-multiply with II in the tens of FLOPs per byte sits comfortably on the compute-bound side; a sparse matrix-vector multiply with II near 0.25 sits firmly on the memory-bound side. No amount of compiler tuning will move the sparse multiply onto the flat ceiling; the only way to do that is to restructure the algorithm to reuse loaded data.

The model can be refined to add intermediate roofs for cache levels (each level has its own bandwidth and ridge point) and ceilings below peak that account for missed micro-architectural opportunities such as scalar code, no fused-multiply-add use, or thread serialization. Refined or not, the roofline gives a single picture that explains why a kernel is slow and points at the right intervention. We will return to it briefly in Chapter 54 when we treat performance analysis in earnest, and again throughout Parts V and VI.

07. Energy and Power Efficiency

In the era of unconstrained clock scaling, performance was the only metric that mattered. That era ended in the mid-2000s, and modern processors are designed at least as much for efficiency as for raw speed. The vocabulary deserves its own treatment, because the easy assumption that fast equals good no longer holds.

Power is the rate of energy consumption, measured in watts. A modern desktop CPU dissipates 65–250 watts under load; a server CPU may push past 350 watts; a smartphone application processor budgets 2–6 watts; a deeply embedded microcontroller may run on milliwatts. Power matters because of three downstream constraints: heat (which the cooling system must remove, sometimes at greater capital cost than the CPU itself), battery life (in mobile devices), and electricity cost (in data centers, where power and cooling can dominate the operating budget).

Energy is power integrated over time, measured in joules. The total energy a computation consumes is the metric a battery actually cares about; a phone that runs at high power for a short time may drain its battery less than one that runs at lower power for a longer time. The rate of energy consumption is power, so a faster processor that finishes the work sooner can sometimes consume less energy than a slower one even if its instantaneous power is higher — the race-to-sleep strategy used by many mobile chips.

Performance per watt is the most common efficiency metric. Two systems doing the same work can be compared by dividing their throughput by their average power; the higher number wins. The Green500 ranks supercomputers this way, and it is the metric mobile-chip designers optimize most aggressively.

For a more nuanced view, the energy-delay product EDP=ET\text{EDP} = E \cdot T penalizes designs that save energy by going extremely slowly, and the energy-delay-squared product ED2P=ET2\text{ED}^2\text{P} = E \cdot T^2 penalizes them more heavily. ED²P is approximately invariant under voltage scaling, which makes it a popular metric for comparing micro-architectural alternatives independently of operating point.

We will not pursue power and energy further in this chapter — Chapter 52 is dedicated to the topic — but the basic vocabulary belongs alongside throughput and latency from the start. Any modern performance argument that ignores energy is incomplete.

08. Tail Latency, Percentiles, and Variability

The latency we have been discussing has, so far, been a single number: "the time taken by an operation". In real systems, the time taken by each operation varies, sometimes enormously. Two systems with the same average latency can offer very different user experiences: one delivers nearly all responses within a tight band, the other has a fast common case but occasional very long stalls. The vocabulary for talking about this is the vocabulary of distributions.

A latency distribution is usually summarized by its percentiles. The 50th percentile, or median, is the value that half the operations beat; the 99th percentile is the value that 99 percent of operations beat; the 99.9th and 99.99th percentiles capture the rare worst cases. The high percentiles are collectively called tail latency, and they often dominate the user-visible behaviour of a system: a webpage built from a hundred backend requests fires a hundred dice on each load, and any single tail event ruins the whole page's response time.

Mean latency, the most common single-number summary, is a poor proxy for tail latency in any system with substantial variability. A distribution with a 1-millisecond mean and a 1-millisecond standard deviation is well-behaved; one with a 1-millisecond mean but a heavy tail of 100-millisecond outliers is not, and the difference is invisible to the mean. Modern monitoring systems therefore report multiple percentiles — typically 50, 95, 99, and 99.9 — rather than averages alone.

Measuring the tail accurately is itself nontrivial. A naive sampler that records every NNth operation can miss exactly the rare events that matter; better systems record every operation and use streaming algorithms (HDR histograms, t-digests) to summarize the distribution at low memory cost.

09. Little's Law and Queueing

A simple but powerful result, due to John Little in 1961, ties together throughput, latency, and the average number of in-flight requests. Little's Law states

L=λW,L = \lambda W,

where λ\lambda is the average rate at which requests arrive, WW is the average time each request spends in the system, and LL is the average number of requests in the system at any instant. The result holds in remarkable generality: it does not depend on the distribution of arrival times, the distribution of service times, or the queueing discipline. As long as the system is in steady state, the equation holds.

Little's Law is the right tool for many architectural sanity checks. A memory system that takes 100 ns per request and is asked to deliver 10 GB/s in 64-byte cache lines must have λ=156\lambda = 156 million requests per second; with W=100W = 100 ns, the average number of in-flight requests must be L=λW15.6L = \lambda W \approx 15.6. A processor that can only have eight outstanding cache misses at a time is therefore limited by Little's Law to roughly 8/W=808 / W = 80 million misses per second, and at 64 bytes apiece, to about 5 GB/s of effective bandwidth. The peak bandwidth of the memory system is irrelevant if the processor cannot keep enough requests in flight to fill the pipeline.

This is the deep architectural reason that out-of-order execution and aggressive load buffers exist: they raise LL on the processor side so that the available memory bandwidth can be exploited. We will see Little's Law return in many guises throughout the book, especially in Chapters 17 and 26.

A related body of theory, queueing theory, deals with the way latency and queue lengths blow up as utilization approaches 100 percent. The single most important qualitative result is that for any system with any variability in arrival or service times, average latency rises very sharply as utilization approaches the saturation point. A system run at 50 percent utilization has small queues and short waits; the same system at 90 percent utilization may have queues an order of magnitude longer; at 99 percent, queues become unbounded in expectation. This is why production systems are rarely run anywhere near their theoretical capacity, and why "add 10 percent more headroom" is the most common response when latency targets begin to slip.

10. Strong Scaling, Weak Scaling, and Means

When a workload is run on a machine with more resources — more cores, more memory bandwidth, more accelerators — there are two distinct ways to ask whether it scales. Strong scaling holds the problem size fixed and asks how speedup varies with resources; it is the regime in which Amdahl's law is the relevant bound, and the serial fraction is the limit. Weak scaling holds the work per resource fixed and grows the problem with the machine; here Gustafson's law applies, and the achievable speedup is essentially unbounded provided the parallel fraction grows.

Most real workloads sit somewhere between the two ideals, and the choice of which to report can flatter or punish a design. A scientific simulation reported under strong scaling will show diminishing returns past some core count; the same simulation run on a finer grid, reported under weak scaling, will show near-linear speedup well past that point. Honest performance reporting names the scaling regime explicitly.

A related point is the choice of mean when summarizing performance across a benchmark suite. The arithmetic mean of running times is appropriate when running times are the quantities of interest. The harmonic mean of rates is the right choice if you want the mean rate to correspond to total work over total time. The geometric mean of normalized speedups is the standard way to combine ratios across benchmarks of widely differing absolute sizes; SPEC uses it precisely because it does not let a single very fast benchmark dominate the score, and because it is invariant under the choice of baseline. Using the wrong mean for the wrong quantity is one of the most common mistakes in performance papers; a SPEC summary that reported arithmetic mean of normalized speedups would systematically favour processors that excelled at the slowest benchmarks.

11. Wall-Clock Time, CPU Time, and System Time

A last category of metric is operational rather than architectural: when a tool reports "the program took TT seconds", what exactly is being measured? Three answers are common.

Wall-clock time, also called elapsed time or real time, is the actual time that passed between starting the program and finishing it. It is what the user experiences and the right metric for almost any user-facing concern.

CPU time is the time the CPU was actually executing the program's instructions, summed across all the cores it ran on. CPU time can be larger than wall-clock time for parallel programs (eight cores running for one second yields eight CPU-seconds of one wall-second) and smaller for programs that spend significant time waiting on I/O, locks, or sleep.

CPU time is itself usually broken down into user time, spent executing the program's own instructions, and system time, spent inside the kernel on behalf of the program (in system calls, page-fault handlers, and so on). A program with high system time relative to user time is often bottlenecked on I/O or on heavy interaction with the operating system, and the tuning intervention is usually different from one that reduces user time.

The time command on every Unix system reports all three:

Plain Text
$ time ./a.out
real 0m3.527s
user 0m12.115s
sys 0m0.342s

Here real is the wall-clock time, user is the total CPU time across cores, and sys is the kernel's share. The 12.115 user-seconds spread over 3.527 wall-seconds tells us, by simple division, that the program used about 3.4 cores' worth of compute on average — a useful figure no single number could provide.

This distinction will return in Chapter 54 when we discuss profiling. The point worth keeping for now is that time is not a single quantity, and a careful performance discussion almost always needs to name which time it means.

12. Amdahl's Law

In a 1967 paper, Gene Amdahl argued that the speedup achievable from improving part of a system is fundamentally limited by the part that is not improved. The argument is so concise and so important that it deserves to be developed carefully.

Suppose a program takes time TT to execute. Suppose a fraction ff of that time is amenable to some optimization, and a fraction 1f1 - f is not. Apply the optimization, achieving a speedup ss on the affected fraction. The new execution time is

Tnew=(1f)T+fsT.T_{\text{new}} = (1 - f) T + \frac{f}{s} T.

The total speedup is

S=TTnew=1(1f)+fs.S = \frac{T}{T_{\text{new}}} = \frac{1}{(1 - f) + \frac{f}{s}}.

This is Amdahl's law. It is one of the few formulas in computer architecture worth committing to memory.

The qualitative consequences are striking. As ss \to \infty — that is, as the optimization makes its target instantaneous — the speedup approaches

Smax=11f.S_{\max} = \frac{1}{1 - f}.

If f=0.9f = 0.9, the maximum possible speedup is 10.1=10\frac{1}{0.1} = 10, no matter how dramatically the 90 percent is improved. If f=0.5f = 0.5, the maximum is 2. If f=0.99f = 0.99, the maximum is 100. The unimproved part of the program sets a hard ceiling on the achievable speedup.

A small worked example will fix the intuition. Suppose a program takes 100 seconds, of which 60 are spent in computations that can be parallelized across many cores and 40 are inherently serial. With one core, the time is 100 seconds. With four cores, the parallel part takes 60/4=1560 / 4 = 15 seconds and the serial part still takes 40, giving 55 seconds total — a speedup of about 1.82. With sixteen cores, the parallel part takes 60/16=3.7560 / 16 = 3.75 seconds, but the serial part still takes 40, giving 43.75 seconds — a speedup of about 2.29. With infinitely many cores, the parallel part takes essentially zero, but the serial part still takes 40 seconds, capping the speedup at 100/40=2.5100 / 40 = 2.5.

This is the deepest lesson of Amdahl's law: serial portions of a program limit parallel speedup, and the limit cannot be escaped by adding more cores. It applies in many other settings too. A faster cache cannot speed up code that does not use the cache. A larger pipeline cannot speed up code that constantly stalls. A smarter branch predictor cannot help code that has no branches.

A related observation is Gustafson's law, which provides a counterargument applicable when the problem size scales with the resources rather than staying fixed. If the parallel portion of the workload grows as more cores are available — for example, if a scientific simulation uses extra cores to refine its grid rather than to run the same simulation faster — then the achievable speedup is

S=(1f)+sfS = (1 - f) + s f

and is not bounded above. Gustafson's law is the optimist's version of Amdahl's; both are correct, and which applies depends on whether the problem size is fixed or growing.

In modern practice, Amdahl's law is the more commonly cited of the two, because it captures the disappointment that a great many parallel projects experience when they discover that their carefully parallelized 95 percent leaves the unparallelized 5 percent as the bottleneck. The lesson is the same one we drew from the discussion of bottlenecks above: optimize where the time is.

13. Benchmarks

The metrics above describe performance abstractly. To measure performance concretely, we need representative workloads — benchmarks — and a discipline for running them.

A benchmark is a program (or a suite of programs) chosen to stand in for some class of real workloads. The hope is that a processor's performance on the benchmark predicts its performance on programs of the same general kind. Whether the hope is realized depends entirely on how representative the benchmark is.

Several benchmark suites have shaped processor evaluation over the decades.

SPEC CPU (currently CPU2017) is the dominant suite for measuring CPU performance on integer and floating-point workloads. It includes compilers, perl interpreters, fluid-dynamics simulations, image-recognition kernels, and a couple of dozen other programs assembled to be representative of "what a CPU is asked to do" in scientific and engineering settings. SPEC reports a geometric mean of normalized speedups, giving a single overall score per machine.

TPC (Transaction Processing Council) benchmarks measure database performance. TPC-C, TPC-H, and TPC-DS are widely cited in the data-management world.

MLPerf is a more recent suite that measures machine-learning training and inference performance, both on CPUs and on accelerators.

LINPACK measures dense linear algebra throughput and is the basis for the Top500 ranking of supercomputers.

Synthetic benchmarks such as Dhrystone, Whetstone, and CoreMark are simple, small programs designed to run quickly and stress particular operations. They are widely used in embedded systems, where running a full SPEC suite is impractical, but they are notoriously easy to game.

A few rules of thumb apply to all benchmarking.

A benchmark predicts your performance only if your workload looks like the benchmark. A score on SPEC CPU2017 says little about how a processor will perform on a graphics-heavy game or a database server.

Geometric means are usually the right way to combine speedups across benchmarks; arithmetic means give too much weight to the slowest case. SPEC and most other suites use geometric means for this reason.

Vendors optimize for benchmarks. Compilers are tuned to recognize common benchmark patterns; processors are tuned to perform well on the workloads vendors expect to be measured against. This is not necessarily dishonest, but it does mean that benchmark performance is sometimes better than performance on novel workloads.

The most reliable measurement is your own workload on the hardware you intend to use. A benchmark is a proxy; the workload itself is the truth.

14. Performance Counters

Modern processors include hardware that helps measure their own performance. Performance counters are special-purpose registers, embedded in the CPU, that count the occurrence of specific events — instructions retired, clock cycles elapsed, cache misses at each level, branches mispredicted, TLB misses, floating-point operations, and many more. They are exposed to software through model-specific registers, and on every modern operating system there are tools — perf on Linux, vtune on Intel-specific tooling, Instruments on macOS, xperf and Performance Toolkit on Windows — that read the counters and present the results.

A typical session with a tool like perf produces a report like this:

Plain Text
Performance counter stats for './a.out':
12,345,678 cycles # 3.500 GHz
9,876,543 instructions # 0.80 insn per cycle
123,456 cache-misses # 1.247 % of all cache refs
234,567 branches
12,345 branch-misses # 5.27 % of all branches
3.527 seconds time elapsed

The counter values themselves are raw integers; the secondary numbers ("3.500 GHz", "0.80 insn per cycle", "1.247 %") are derived ratios. From a report like this, an experienced eye can quickly diagnose where the program is spending its time. An IPC of 0.80 with a high branch-misprediction rate suggests the program is spending most of its time on misspeculated paths; reducing branches would likely help. An IPC of 0.30 with very high cache-miss rates suggests the program is memory-bound; restructuring data access patterns is the right intervention. An IPC of 3.5 with low miss rates suggests the program is already running near peak throughput, and improvements will require algorithmic changes rather than micro-architectural tuning.

Performance counters are also the foundation of profiling. Sampling profilers periodically interrupt the program — typically driven by a timer or a counter overflow — and record the program counter at the moment of interruption. After many samples, the distribution of program counters reveals where the program spends its time. Instruction-level profiling, which uses precise event counters, can attribute cache misses, branch mispredictions, and other expensive events to the specific instructions that caused them, allowing very fine-grained diagnosis.

We will return to performance analysis in detail in Chapter 54. For now, the point is that the metrics described in this chapter are not abstract: every modern processor measures them on its own behalf, and the tools to inspect those measurements are freely available on every operating system.

15. Summary

Performance is described by two complementary metrics: latency, the time for a single piece of work, and throughput, the rate at which work is completed. The clock rate sets the upper bound on the speed of a processor's internal activities, but it is meaningless on its own; it must be combined with the average number of cycles per instruction (CPI) or its reciprocal IPC to give a meaningful picture, captured in the performance equation T=N×CPI×TclkT = N \times \text{CPI} \times T_{\text{clk}}. Speedup measures the relative improvement of one configuration over another, and is always meaningful only relative to a baseline and a workload. Standard rate units — MIPS, FLOPS, IOPS, and bandwidth — each capture some aspect of throughput, and the roofline model relates achievable performance to the arithmetic intensity of a kernel, dividing workloads cleanly into compute-bound and memory-bound regimes. Power and energy round out the picture; performance per watt and the energy-delay products give us a vocabulary for efficiency that the era of unbounded clock scaling no longer permits us to ignore. Tail latency and percentiles describe the distribution of response times in a way that simple averages cannot, and Little's Law ties throughput, latency, and concurrency together with a relation that holds in great generality and recurs throughout architectural reasoning. Amdahl's law warns that the unimproved fraction of a program limits the achievable speedup, no matter how aggressively the rest is optimized; Gustafson's law gives a more optimistic picture when problem size scales with available resources, and the strong-versus-weak-scaling distinction names the choice between the two regimes. Benchmarks attempt to estimate performance on representative workloads, with mixed success that depends on how well the benchmark resembles the work you actually do, and the choice of mean — arithmetic, harmonic, or geometric — is a real one that materially changes the conclusions. Wall-clock, user, and system times offer three distinct views of how long a program took, each illuminating a different bottleneck. Performance counters, present on every modern processor, let us measure all of this directly, turning architectural reasoning into observable evidence.

This concludes Part II. We have built up a working organization for a computer: a CPU made of datapath and control unit, a memory subsystem feeding it through a bus, an I/O subsystem to talk to the outside world, and a vocabulary for measuring how well it all performs. Part III turns to the interface between hardware and software: the instruction set architecture that the CPU exposes to the programmer and the compiler.

Book mode
computer-architecturecomputer-organization
Was this helpful?