Part VII·Advanced and Frontier·Chapter 54 of 62

Performance Analysis

May 16, 2026·12 min read·advanced

This chapter is about understanding performance: how to measure it, how to interpret what you measure, and how to identify where time is being spent. It's referenced from Chapter 10 (performance basics) and Chapter 27 (decode and microcode). Where Chapter 10 introduced the foundational metrics (throughput, latency, IPC, MIPS, etc.), this chapter is about practical performance engineering — the tools and methodology used in production work.

01.The Top-Down Performance Methodology

Modern Intel and AMD CPUs expose a structured way to attribute pipeline stalls. The Top-Down Microarchitecture Analysis (TMA), pioneered by Yasin (Intel, 2014), partitions every pipeline slot into one of four categories:

Retiring: the slot delivered a useful instruction.
Bad Speculation: the slot was filled but the work was squashed (mispredicted branches, machine clears).
Front-End Bound: the slot was empty because the front end didn't deliver a uop.
Back-End Bound: the slot was empty because the back end couldn't accept a uop (resource stalls).

A core can issue up to $W$ uops per cycle (its issue width), so over $N$ cycles there are $W \cdot N$ slots in total. Each slot falls into exactly one category, and the resulting percentages show where the bottlenecks are.

The methodology then drills down. If you're back-end bound, is it memory bound or core bound? If memory bound, which level — L1, L2, L3, DRAM? If core bound, which execution port or scheduler? Each level has its own performance counters.

perf stat --topdown on Linux shows the level-1 metrics on CPUs that support them. pmu-tools (toplev.py from Andi Kleen) walks the full top-down hierarchy. Intel VTune and AMD μProf provide GUI front-ends.

A typical top-down analysis on a numerical workload might show:

Retiring: 18% — surprisingly low.
Bad Spec: 2% — branch prediction is fine.
FE Bound: 5% — front end keeps up.
BE Bound: 75% — heavily back-end bound.
- of which Memory Bound: 65%.
  - L3 Bound: 40% — many L3 hits.
  - DRAM Bound: 25% — many DRAM accesses.
- Core Bound: 10%.

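This points clearly at memory: most stalled slots are waiting on L3 or DRAM rather than on execution resources (the next level of the hierarchy separates bandwidth limits from latency limits). Optimization should focus on reducing memory traffic: better data layout, better algorithms, blocking for cache.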
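The slot accounting behind such a breakdown can be sketched in a few lines. This is only an illustration: the function name and the slot counts are hypothetical (chosen to reproduce the level-1 percentages in the example above), and real tools derive these values from hardware events rather than from hand-entered numbers.

```python
def topdown_level1(width, cycles, retiring, bad_spec, fe_bound):
    """Split the W * N pipeline slots into the four level-1 categories.

    Back-end bound is taken as the remainder, since every slot
    belongs to exactly one category.
    """
    total_slots = width * cycles
    be_bound = total_slots - retiring - bad_spec - fe_bound
    counts = {
        "retiring": retiring,
        "bad_speculation": bad_spec,
        "frontend_bound": fe_bound,
        "backend_bound": be_bound,
    }
    return {name: slots / total_slots for name, slots in counts.items()}


# Hypothetical run: a 4-wide core over 1,000,000 cycles = 4,000,000 slots.
breakdown = topdown_level1(
    width=4, cycles=1_000_000,
    retiring=720_000, bad_spec=80_000, fe_bound=200_000,
)
for name, fraction in breakdown.items():
    print(f"{name:16s} {fraction:6.1%}")
```

Because the categories partition the slots, the four fractions always sum to 1; a breakdown that does not is a sign of a measurement or multiplexing problem.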

A different workload might show 60% BE Bound, of which 50 points are Core Bound (a saturated divider, perhaps). The optimization there is to reduce divisions, for example by multiplying by a precomputed reciprocal.

02.Performance Counters

CPUs include hardware performance monitoring counters (PMCs) — registers that increment on specific events:

Instructions retired.
Cycles.
L1 / L2 / L3 cache accesses and misses.
TLB misses.
Branch mispredicts.
Specific stall cycles by reason.
Specific resource utilization (port pressure on a particular execution port).
Memory bandwidth.

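A modern CPU exposes hundreds of counter events but has only a handful of physical counters per core (typically a few fixed-function counters plus roughly 4-8 programmable ones). When more events are requested than counters exist, software multiplexes: it runs with one set of events for a fraction of the time, switches to another set, and scales the counts to estimate full-run totals.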

perf list shows available events. perf stat -e cache-misses,cache-references workload measures specific events.
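When events are multiplexed, each raw count covers only part of the run and has to be extrapolated. A minimal sketch of that scaling, assuming the tool reports how long the event was enabled and how long it was actually scheduled on a counter (the function name and the numbers are hypothetical):

```python
def scale_multiplexed(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-run count for an event that was scheduled
    on a physical counter for only part of the run."""
    if time_running_ns == 0:
        raise ValueError("event was never scheduled on a counter")
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical: the event was counted for 250 ms of a 1 s run.
estimate = scale_multiplexed(
    raw_count=1_200_000,
    time_enabled_ns=1_000_000_000,
    time_running_ns=250_000_000,
)
print(f"estimated full-run count: {estimate:,.0f}")
```

The estimate assumes the event rate was the same while the counter was not watching. For workloads with distinct phases that assumption can fail, so it is safer to request few enough events that no multiplexing is needed.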

Sampling

Beyond simple counts, PMCs support sampling: every $N$th event triggers an interrupt that records the instruction pointer (and the call stack, if frame pointers, unwind information, or LBR are available). Over a run, you accumulate a statistical profile of where events occur.

perf record -e cycles workload does cycle-based sampling — the standard CPU profile. perf record -e cache-misses workload profiles cache misses to specific code locations. Both approaches give actionable information.

Modern hardware supports PEBS (Precise Event-Based Sampling, Intel) and IBS (Instruction-Based Sampling, AMD): the recorded instruction pointer identifies the instruction that actually caused the event, rather than one a few instructions later, as can happen with plain counter-overflow interrupts (an effect known as skid). Precise sampling is essential for accurate attribution.

Last Branch Records

LBR (Last Branch Records) is a CPU feature that keeps the last N taken branches in a hardware buffer. Combined with PMC sampling, LBR enables call-stack-aware profiling without runtime instrumentation: each sample includes the recent control-flow context. perf record --call-graph lbr and similar tools rely on it. Intel Processor Trace is a separate feature that records complete control flow rather than a short branch history.

03.Profiling Tools

A non-exhaustive list of common tools:

perf (Linux): the kernel-integrated profiler. Powerful, scriptable, relatively low overhead.

Bash

perf stat ./workload          # summary statistics
perf record ./workload        # sampling profile
perf report                   # interactive report
perf annotate                 # source-level breakdown
perf top                      # live profile of running system

Intel VTune Profiler: a comprehensive Intel-specific tool with strong UI. Microarchitectural analysis, threading analysis, memory access analysis, GPU offload analysis.

AMD μProf: AMD's equivalent. Especially good for AMD-specific events.

ARM Streamline / Performance Studio: ARM's profiler for Cortex-A platforms.

eBPF / bpftrace: dynamic instrumentation in the Linux kernel. Lets you attach probes to kernel and user functions, aggregate counts, and build latency histograms. Brendan Gregg's books and tools make heavy use of BPF and bpftrace.

FlameGraphs (Brendan Gregg): a stack-trace visualization. Each function's time-on-CPU is a horizontal bar; called functions stack on top. The width is time spent. Pattern-recognition friendly.
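Before a flame graph can be drawn, the sampled stacks are usually collapsed into one line per unique stack with a sample count. A minimal sketch of that folding step, assuming each sample is already a list of function names ordered root first (the sample data and function names are made up):

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into 'root;...;leaf count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, each listed root first.
samples = [
    ["main", "parse", "read_token"],
    ["main", "parse", "read_token"],
    ["main", "parse"],
    ["main", "render", "blend"],
]
folded = fold_stacks(samples)
for line in folded:
    print(line)
```

The width of each bar in the final graph is proportional to these counts: here main spans all four samples, parse three of them, and read_token two.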

Time Profile vs. Off-CPU Profile. A standard profile shows where the program spends time running. An off-CPU profile shows where it spends time waiting (blocked on I/O, locks, syscalls). Both views are needed for full understanding.

04.Microbenchmarking

Often you want to measure something specific: how long does this operation take? Done naively, microbenchmarks are notoriously misleading.

Common pitfalls:

Compiler removes the work. If the result of the operation isn't used, the optimizer eliminates it. Write the result to a volatile, output it, or pass it through a "DoNotOptimize"-style helper such as benchmark::DoNotOptimize in Google Benchmark.

Inlining changes things. A function called once for the benchmark may be inlined; in production, the same call site may not inline. Account for both cases.

Cache effects. First call sees a cold cache; later calls see a warm cache. Decide which you're measuring.

Frequency scaling. Modern CPUs change frequency mid-run. Measure cycles, not wall time, when comparing implementations.

Branch prediction. Patterns repeat in microbenchmarks; predictors learn them. Real workloads may have less predictability.

Throughput vs. latency. Are you measuring how fast a single op completes (latency) or how fast many ops can run in parallel (throughput)? The two can differ by an order of magnitude or more.

Use libraries for benchmarking. Google Benchmark (C++), Hyperfine (CLI), Criterion (Rust), JMH (Java) — all provide statistical rigor: warmup, multiple runs, outlier detection, confidence intervals.

A first attempt at a latency measurement:

volatile uint64_t result;
for (int i = 0; i < 1000000; i++) {
    result = expensive_op(input);
}

But this measures throughput more than latency (ops can pipeline). For latency, create a serial dependency:

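uint64_t x = initial;
for (int i = 0; i < 1000000; i++) {
    x = expensive_op(x);      /* input is the previous result */
}
volatile uint64_t sink = x;   /* keep the chain live; one store, outside the loop */

Now each operation depends on the previous one, so successive operations cannot overlap, and time per iteration approximates latency. Keeping x itself non-volatile matters: a volatile x would force a store and reload on every iteration and add that round trip to the measured chain.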

For throughput, use independent ops:

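uint64_t a = 0, b = 0, c = 0, d = 0;      /* four independent chains */
for (int i = 0; i < 1000000; i++) {
    a = op(a, x); b = op(b, y); c = op(c, z); d = op(d, w);   /* x, y, z, w: loop-invariant inputs */
}
volatile uint64_t sink = a ^ b ^ c ^ d;   /* consume all four results */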

Multiple independent dependency chains let the OoO core overlap them.
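How many independent chains are enough follows from the operation's latency and peak throughput: the core needs roughly latency × throughput operations in flight to stay busy. A small sketch of that arithmetic, using hypothetical figures (a 4-cycle operation that can start twice per cycle) and illustrative function names:

```python
import math


def chains_to_saturate(latency_cycles, ops_per_cycle):
    """Independent chains needed to keep the unit busy:
    operations in flight = latency * throughput."""
    return math.ceil(latency_cycles * ops_per_cycle)


def achievable_throughput(chains, latency_cycles, ops_per_cycle):
    """Ops per cycle with a given number of chains, capped at the
    hardware's peak rate."""
    return min(ops_per_cycle, chains / latency_cycles)


# Hypothetical operation: 4-cycle latency, 2 issued per cycle at peak.
print(chains_to_saturate(4, 2))        # chains needed to reach peak
print(achievable_throughput(4, 4, 2))  # ops/cycle with only 4 chains
```

With too few chains, a "throughput" benchmark is still partly measuring latency, which is one reason hand-written microbenchmarks can disagree with published instruction tables.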

05.Roofline Analysis

The roofline model (Williams, Waterman, Patterson 2009) frames performance against two limits:

Compute roof: peak FLOPs per second the CPU can deliver.
Bandwidth roof: peak bytes per second from memory.

A workload's arithmetic intensity (FLOPs per byte from memory) determines which roof binds it. Plot performance vs. arithmetic intensity on log-log axes:

For low arithmetic intensity (memory-bound), bandwidth limits performance: attainable performance is intensity × bandwidth, which is a line of slope 1 (45°) on log-log axes.
For high arithmetic intensity (compute-bound), the compute roof flat-lines performance.

The "knee" at the intersection is the arithmetic intensity above which the workload is compute-bound.

For a CPU with peak 800 GFLOP/s FP64 and 200 GB/s memory bandwidth, the knee is at AI = 800 / 200 = 4 FLOPs/byte. Code with AI < 4 is bandwidth-limited; code with AI > 4 is compute-limited.
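The roofline bound is simply the minimum of the two roofs. A minimal sketch using the peak figures from the example above (the function names and the sampled intensities are illustrative):

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the
    bandwidth roof at arithmetic intensity `ai` (FLOPs per byte)."""
    return min(peak_gflops, ai * bandwidth_gbs)


def knee(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the two roofs intersect."""
    return peak_gflops / bandwidth_gbs


PEAK_GFLOPS = 800.0    # compute roof from the example
BANDWIDTH_GBS = 200.0  # bandwidth roof from the example

print(f"knee at {knee(PEAK_GFLOPS, BANDWIDTH_GBS)} FLOPs/byte")
for ai in (0.25, 1.0, 4.0, 16.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"AI = {ai:5.2f} -> at most {bound:6.1f} GFLOP/s")
```

A kernel at AI = 0.25 can reach at most 50 GFLOP/s on this machine, about 6% of peak, no matter how well its arithmetic is tuned.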

Roofline plots are excellent for:

Setting realistic performance targets.
Identifying which optimization matters: if you're bandwidth-bound, faster compute won't help.
Comparing multiple kernels on the same chart.

Intel Advisor and NVIDIA Nsight Compute both produce roofline plots. Manual roofline analysis with measured intensity and known peaks gives the same information.

06.Memory Profiling

Memory access patterns dominate performance for many workloads. Tools and metrics:

Cache miss rates at each level (PMCs).

Memory bandwidth utilization vs. peak (PMCs report bytes transferred; compare to peak).

TLB miss rates, especially for workloads with large working sets that exceed TLB coverage.

Cache line bouncing between cores: tools like Intel VTune detect "Memory Bound: HitM" — loads where the line was modified in another cache, requiring coherence transfer.

False sharing detection: VTune's threading analysis or perf c2c finds cases where two cores write to different variables that happen to share a cache line.

For deeper analysis, hardware traces can be captured (Intel PT, ARM CoreSight, vendor tracing) and analyzed offline for patterns. Slow but extremely informative.
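Two of the metrics above reduce to simple arithmetic once the raw numbers are known. The sketch below computes TLB coverage (entries × page size) and bandwidth utilization against peak. All figures are hypothetical, including the 1536-entry TLB and the 2 MiB page size used for comparison:

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3
GB = 10**9


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory a TLB can map at once without missing."""
    return entries * page_size_bytes


def bandwidth_utilization(bytes_transferred, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth actually used."""
    return bytes_transferred / seconds / peak_bytes_per_s


# Hypothetical 1536-entry TLB with two page sizes.
reach_4k = tlb_reach_bytes(1536, 4 * KIB)
reach_2m = tlb_reach_bytes(1536, 2 * MIB)
print(f"4 KiB pages: {reach_4k / MIB:.0f} MiB of coverage")
print(f"2 MiB pages: {reach_2m / GIB:.0f} GiB of coverage")

# Hypothetical counters: 150 GB moved in 1 s on a 200 GB/s system.
util = bandwidth_utilization(150 * GB, 1.0, 200 * GB)
print(f"bandwidth utilization: {util:.0%}")
```

A working set well beyond the computed coverage is a hint to check TLB miss counters, and utilization close to 100% suggests the workload sits on the bandwidth roof.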

07.I/O and Latency Profiling

For workloads bottlenecked outside the CPU:

iostat / sar: disk I/O statistics.
bcc/bpftrace tools: attach to syscalls, measure latencies, distribution histograms.
flamegraphs of off-CPU time: where does the program block?
dtrace (Solaris/macOS/FreeBSD): comprehensive dynamic tracing.

Latency-sensitive workloads (web servers, financial systems) care about percentiles, not means: p50 latency might be fine but p99 latency could be terrible. Tools must track distributions.
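A short illustration of why the mean hides tail latency, using made-up request latencies and a simple nearest-rank percentile (one of several percentile definitions; the helper name is illustrative):

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [10] * 98 + [500] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean = {mean_ms:.1f} ms, p50 = {p50_ms} ms, p99 = {p99_ms} ms")
```

The mean (19.8 ms) looks healthy and the median is 10 ms, yet one request in fifty takes half a second. Only the distribution shows that.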

08.Threading Profiling

Concurrent code has its own analysis needs:

Lock contention: time threads spend waiting on locks. perf lock on Linux; lock profiling in VTune.
Imbalanced load: parallel work that doesn't divide evenly.
Scaling efficiency: speedup vs. ideal as core count grows.
NUMA effects: memory access latency changes if a thread is moved across NUMA boundaries.

Amdahl's Law (Chapter 30) sets an upper bound: serial fractions limit overall speedup. Profiling identifies the serial fractions to attack.
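The bound and the scaling-efficiency metric from the list above can be computed directly. A minimal sketch, assuming a hypothetical workload whose parallel fraction is 95%:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup when only `parallel_fraction` of the
    work scales with the number of cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def scaling_efficiency(parallel_fraction, n_cores):
    """Achieved speedup relative to the ideal speedup of n_cores."""
    return amdahl_speedup(parallel_fraction, n_cores) / n_cores


# Hypothetical workload: 95% of the run time parallelizes.
for cores in (4, 16, 64):
    s = amdahl_speedup(0.95, cores)
    e = scaling_efficiency(0.95, cores)
    print(f"{cores:3d} cores: speedup {s:5.2f}x, efficiency {e:.0%}")
```

Even with unlimited cores the speedup cannot exceed 1 / 0.05 = 20x, which is why shrinking the serial 5% often pays off more than adding hardware.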

09.CPU-GPU and Heterogeneous Profiling

For systems with GPUs, NPUs, or other accelerators:

NVIDIA Nsight Compute / Systems: kernel-level and system-level profiling for NVIDIA GPUs.
rocprof: AMD GPU profiling.
Intel VTune with GPU offload analysis.
Apple Instruments: CPU + GPU + Metal profiling on macOS / iOS.

Heterogeneous workloads often spend significant time moving data between processors. Profiling must capture this — not just kernel time on each device.

10.Continuous Profiling

For production fleets, continuous profiling has become standard practice. A small sampling overhead (e.g., 100 Hz of stack sampling) is added to every server; results are aggregated centrally; dashboards show CPU usage by function across the fleet.

Tools: Google's Cloud Profiler, Pyroscope, Parca, Polar Signals. The overhead is typically 1-2% of CPU, and the insight is enormous: you can statistically attribute CPU time across the whole fleet to specific functions.

For optimization at scale: a small percentage win in a hot function across thousands of servers is large absolute savings. Continuous profiling identifies these targets that local testing wouldn't surface.
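A back-of-the-envelope version of that argument, with entirely hypothetical numbers and the simplifying assumption that fleet capacity is CPU-bound, so freed CPU time translates directly into servers:

```python
def servers_saved(fleet_size, function_cpu_share, time_reduction):
    """Server-equivalents freed by making one function faster.

    function_cpu_share: fraction of fleet CPU spent in the function.
    time_reduction: fraction of that function's CPU time removed.
    """
    return fleet_size * function_cpu_share * time_reduction


# Hypothetical: 10,000 servers, a function using 3% of fleet CPU,
# made 20% cheaper by an optimization.
saved = servers_saved(10_000, 0.03, 0.20)
print(f"about {saved:.0f} server-equivalents freed")
```

A 20% improvement to a function that takes 3% of CPU is barely measurable on a single machine, but here it frees roughly 60 servers.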

11.Tracing and Causal Profiling

Tracing captures sequences of events: function entries/exits, syscalls, locks acquired, network packets sent. Perfetto, Chromium tracing, ftrace, eBPF can all generate traces.

Causal profiling (COZ; Curtsinger and Berger, 2015) is a clever variant: rather than measuring where time is spent, it estimates which optimizations would have impact. By briefly slowing down all other threads whenever the target code runs, COZ emulates a "virtual speedup" of that code and infers the end-to-end gain that would result if it really were faster. This identifies functions worth optimizing, sometimes counter-intuitive ones that consume modest time but block parallelism.

12.Microarchitectural Profiling Specifics

For deep microarchitectural understanding:

Port pressure. Modern x86 has 4-12 execution ports; saturating one bottlenecks the rest. PMCs report per-port utilization. If port 5 is consistently 100% utilized, look for instructions that go only to port 5 (some shuffles, complex shifts).

Issue queue full. If the scheduler can't accept new uops, the front end stalls. Counter: UOPS_ISSUED.STALL_CYCLES on Intel.

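ROB full. The Reorder Buffer fills, blocking allocation of new uops. Counter: BACKEND_BOUND.RESOURCE (exact event names vary by microarchitecture; perf list shows what a given CPU exposes).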

Memory ordering machine clears. The CPU detected a memory-ordering violation during out-of-order execution and squashed the affected work. Typical trigger: memory shared between cores with frequent writes.

Self-modifying code clears. Code that writes to its own instruction stream causes a full pipeline flush. Optimized code avoids it.

Front-end starvation. Branch mispredicts, instruction cache misses, decode stalls. Counter: FRONTEND_RETIRED categories.

For ARM and RISC-V, similar counters exist (somewhat varied between vendors); the principles transfer.
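Per-port utilization, as used in the port-pressure item above, is dispatched uops divided by cycles. A small sketch with hypothetical counts, assuming each port accepts at most one uop per cycle (the function names and the 95% threshold are illustrative choices):

```python
def port_utilization(uops_per_port, cycles):
    """Fraction of cycles in which each port dispatched a uop."""
    return {port: uops / cycles for port, uops in uops_per_port.items()}


def saturated_ports(utilization, threshold=0.95):
    """Ports at or above the utilization threshold."""
    return sorted(p for p, u in utilization.items() if u >= threshold)


# Hypothetical per-port dispatch counts over 1,000,000 cycles.
counts = {0: 400_000, 1: 350_000, 5: 990_000, 6: 300_000}
util = port_utilization(counts, cycles=1_000_000)
for port, u in sorted(util.items()):
    print(f"port {port}: {u:.0%}")
print("saturated:", saturated_ports(util))
```

In this made-up profile port 5 is busy 99% of the time while the others idle, so the next step would be to find which instructions can execute only there.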

13.Performance Modeling

Beyond profiling, performance modeling predicts performance from architectural parameters and workload characteristics. Models range:

Analytical: closed-form formulas. Simple but capture only basic effects.
Statistical: ML-based predictions from features. Works for some queries but is opaque.
Simulation: cycle-level (gem5, ZSim) or trace-driven. Detailed but slow.

Models are used for:

Pre-silicon what-if analysis: should we double the cache?
Workload characterization: how does this code respond to architectural changes?
Capacity planning: how many servers do we need?

The best models blend approaches: simulation for accuracy on critical kernels, analytical for big-picture sizing, profiling for ground truth.
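As an example of the analytical end of that range, a classic first-order model adds a memory-stall term to a base CPI. The sketch below uses hypothetical parameters and deliberately ignores overlap between misses, so it is a rough upper estimate and not a prediction for any real core:

```python
def effective_cpi(base_cpi, misses_per_instruction, miss_penalty_cycles):
    """First-order model: every miss stalls for the full penalty."""
    return base_cpi + misses_per_instruction * miss_penalty_cycles


def runtime_seconds(instructions, cpi, frequency_hz):
    """Run time implied by an instruction count, CPI, and clock."""
    return instructions * cpi / frequency_hz


# Hypothetical core: base CPI 0.5, 200-cycle miss penalty, 2.5 GHz.
# What-if question: what happens if a larger cache halves the miss rate?
for misses_per_instr in (0.010, 0.005):
    cpi = effective_cpi(0.5, misses_per_instr, 200)
    t = runtime_seconds(10_000_000_000, cpi, 2.5e9)
    print(f"{misses_per_instr * 1000:.0f} misses/kinstr: "
          f"CPI {cpi:.2f}, {t:.1f} s")
```

In this model, halving the miss rate takes CPI from 2.5 to 1.5. Real out-of-order cores overlap some misses, which is the kind of effect that pushes detailed studies toward simulation.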

14.Common Pitfalls

A list of frequent mistakes:

Optimizing the wrong thing. Without measurement, intuition is often wrong. Profile first.

Measuring the wrong thing. Wall time, CPU time, cycles, instructions — these can diverge. Pick the right metric for the question.

Single-run measurements. Performance varies. Run multiple times; report distributions.

Frequency drift. Multi-second runs can span several frequency states. Pin the frequency for benchmarks (for example with cpupower frequency-set), or measure cycles instead of wall time.

Cold vs. warm cache. Are you measuring first run or steady state?

Microbenchmark artifacts. Trivial loops achieve unrealistic IPC because everything is ideal: the code fits in the uop cache, branches predict perfectly, and data stays in L1. Real code has more complexity.

Improper attribution. With ordinary interrupt-based sampling, an event is attributed to the IP at the moment the interrupt is taken, which may be several instructions past the one that caused it. Use PEBS / IBS for precision.

Comparing across machines without normalization. Two machines might differ in TLB size, cache size, frequency, or chip generation. Normalize when comparing.

15.Summary

Performance analysis is a craft. The hardware exposes hundreds of counter events; the question is which ones to look at and how to interpret them. Top-down methodology gives a structured starting point. Sampling profiling identifies hot code; PEBS/IBS attribute precisely. Roofline frames whether you're compute-bound or bandwidth-bound. Microarchitectural counters drill into specific stalls. FlameGraphs visualize stacks; off-CPU profiles show waiting time; tracing captures sequences; continuous profiling at production scale aggregates fleet-wide patterns.

Good performance work combines these tools with measurement discipline: run multiple times, control frequency, use precise counters, beware of microbenchmark artifacts. The goal isn't peak benchmark numbers but understanding — knowing why a workload performs as it does and where the leverage is.

The next chapter is about modern packaging: how multiple chips, dies, and substrates are integrated. The era of single monolithic dies is ending; chiplets and 3D-stacked memory are reshaping how we think about chip-level architecture.
