Part VIII: Appendices

Suggested Labs and Projects

May 16, 2026·14 min read·intermediate

The material in this book is most useful when grounded in hands-on practice. This appendix lists projects that exercise the concepts of the main text. They range from quick exercises (an hour or two) to substantial projects (weeks or months). Each project lists what it teaches, what tools and prerequisites help, and what variations or extensions are worth pursuing.

The labs are grouped by theme rather than by chapter. Some require specific hardware; many can be done on any modern Linux laptop.

01. Group 1: Bit-Level and Number-System Practice

Lab D.1: Bitwise Operation Library

Build a small C library that implements: popcount, clz, ctz, bit_reverse, byte_swap, and count_trailing_set_bits, all without using the corresponding intrinsics. Then time them against the intrinsic-using versions.

Teaches: Boolean algebra in code, bit manipulation idioms, the value of dedicated hardware instructions.

Extensions: Compare your software popcount to __builtin_popcount performance. Write a SWAR (SIMD Within A Register) popcount and measure. Read Hacker's Delight chapter 5.
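
As a starting point for the SWAR extension, here is a minimal sketch of a 32-bit popcount written two ways, plus a trailing-zero count built on top of it. The function names (popcount_loop, popcount_swar, ctz_soft) are illustrative, not part of any library, and the code assumes 32-bit unsigned arithmetic via uint32_t.

```c
#include <stdint.h>

/* Reference version: clear the lowest set bit until none remain. */
unsigned popcount_loop(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        n++;
    }
    return n;
}

/* SWAR version: sum bits in parallel in 2-, 4-, then 8-bit fields. */
unsigned popcount_swar(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Trailing zeros: isolate the lowest set bit, then count the bits below it.
   Returns 32 for x == 0 (a convention chosen for this sketch). */
unsigned ctz_soft(uint32_t x) {
    if (x == 0) return 32;
    return popcount_swar((x & -x) - 1);
}
```

When timing these against __builtin_popcount, check the generated assembly first: an optimizing compiler may recognize either pattern and emit the hardware instruction anyway.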

Lab D.2: Two's Complement Calculator

Implement a tool that takes any integer and width (8, 16, 32, 64 bits) and prints binary, hex, signed-decimal, and unsigned-decimal interpretations side-by-side. Add an "interpret as floating-point" mode (treat the bits as IEEE 754 single or double).

Teaches: number-system conversions, floating-point representation.
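
The core of the calculator is reinterpreting one bit pattern under different rules. The sketch below shows one way to do the signed and floating-point readings; the helper names are hypothetical, width is assumed to be 8, 16, 32, or 64, and float/double are assumed to be IEEE 754 binary32/binary64 (true on mainstream platforms, not guaranteed by the C standard).

```c
#include <stdint.h>
#include <string.h>

/* Keep only the low `width` bits. */
uint64_t truncate_bits(uint64_t raw, unsigned width) {
    return width >= 64 ? raw : raw & ((UINT64_C(1) << width) - 1);
}

/* Two's complement reading of the low `width` bits. */
int64_t as_signed(uint64_t raw, unsigned width) {
    uint64_t v = truncate_bits(raw, width);
    uint64_t sign = UINT64_C(1) << (width - 1);
    uint64_t ext = (v ^ sign) - sign;   /* sign-extends in unsigned arithmetic */
    int64_t out;
    memcpy(&out, &ext, sizeof out);     /* avoids an implementation-defined cast */
    return out;
}

/* Reinterpret 32 bits as a float; memcpy is the well-defined way to pun. */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret 64 bits as a double. */
double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, as_signed(0xFF, 8) reads the pattern as -1, while the unsigned reading of the same eight bits is 255.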

02. Group 2: Cache and Memory Hierarchy

Lab D.3: Memory Hierarchy Discovery

Write a C program that allocates an array of size $S$ and reads each cache line in order, $N$ times, then computes the average access time. Plot access time vs. $S$ from $S = 4$ KiB to $S = 1$ GiB.

You should see plateaus corresponding to L1, L2, L3, and DRAM. The transition points reveal the cache sizes; the plateau heights reveal the cache latencies.

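// Sketch: measure() stands in for the timed access loop you write
for (size_t size = 4096; size <= ((size_t)1 << 30); size *= 2) {
    char *buf = malloc(size);
    // Touch one byte per cache line. Sequential order is easy for the
    // hardware prefetcher to hide, so also try a random (shuffled) order.
    measure(buf, size);
    free(buf);
}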

Teaches: cache hierarchy, latency, the practical reality of "memory is slow".

Extensions: vary stride to detect cache line size (transitions at 64 bytes); use clflush to evict lines and mfence to order the timing reads, then compare cold and warm access times; compare DRAM vs. NVMe access by mmap'ing a file.
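
One common way to realize the "random shuffle order" variant is pointer chasing: link one node per cache line into a single random cycle, then time a long chain of dependent loads. The sketch below is one possible implementation under stated assumptions: a 64-byte line (LINE_BYTES), POSIX clock_gettime, and rand() as the random source. The names node, link_random_cycle, and chase_ns_per_access are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64   /* assumed cache line size */

typedef struct node {
    struct node *next;
    char pad[LINE_BYTES - sizeof(struct node *)];
} node;

/* Link n nodes into one random cycle (Sattolo's algorithm on an index
   permutation), so a walk visits every node before repeating. */
int link_random_cycle(node *nodes, size_t n) {
    if (n == 0) return -1;
    size_t *perm = malloc(n * sizeof *perm);
    if (!perm) return -1;
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i yields a single cycle */
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++) nodes[i].next = &nodes[perm[i]];
    free(perm);
    return 0;
}

/* Average nanoseconds per dependent load over a buffer of `bytes` bytes.
   Returns a negative value on failure. */
double chase_ns_per_access(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(node);
    if (n == 0 || steps == 0) return -1.0;
    node *nodes = malloc(n * sizeof *nodes);
    if (!nodes) return -1.0;
    if (link_random_cycle(nodes, n) != 0) { free(nodes); return -1.0; }

    struct timespec t0, t1;
    node *p = nodes;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = p->next;  /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    node *volatile sink = p;   /* keep the chain from being optimized away */
    (void)sink;
    free(nodes);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Calling chase_ns_per_access for each size in the doubling loop gives the points to plot. Because every load depends on the previous one, the result approximates latency rather than bandwidth.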

Lab D.4: Cache Simulator

Implement a simulator that takes a stream of memory addresses (load/store) and reports hits/misses for a configurable cache: associativity, line size, total size, replacement policy (LRU, FIFO, random, NRU).

Generate the trace from a real program using pin (Intel Pin tool) or perf record + post-processing.

Teaches: cache mechanics, replacement policies, the value of associativity.

Extensions: simulate a multi-level hierarchy; add write-back vs. write-through; add a TLB and measure walk overhead.
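
A minimal sketch of the simulator's core, assuming a set-associative cache with LRU replacement only. The struct layout and function names are hypothetical; FIFO, random, and NRU policies, and the load/store distinction, are left out to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    unsigned sets, ways, line_bytes;
    uint64_t *tags;      /* sets * ways entries */
    uint64_t *last_use;  /* LRU timestamps; 0 marks an invalid way */
    uint64_t clock, hits, misses;
} cache;

int cache_init(cache *c, unsigned total_bytes, unsigned ways, unsigned line_bytes) {
    if (ways == 0 || line_bytes == 0) return -1;
    c->ways = ways;
    c->line_bytes = line_bytes;
    c->sets = total_bytes / (ways * line_bytes);
    c->clock = c->hits = c->misses = 0;
    size_t n = (size_t)c->sets * ways;
    c->tags = calloc(n ? n : 1, sizeof *c->tags);
    c->last_use = calloc(n ? n : 1, sizeof *c->last_use);
    return (c->sets && c->tags && c->last_use) ? 0 : -1;
}

void cache_free(cache *c) {
    free(c->tags);
    free(c->last_use);
}

/* Simulate one access; returns true on a hit. */
bool cache_access(cache *c, uint64_t addr) {
    uint64_t line = addr / c->line_bytes;
    uint64_t set = line % c->sets;
    uint64_t tag = line / c->sets;
    uint64_t *tags = c->tags + set * c->ways;
    uint64_t *used = c->last_use + set * c->ways;
    unsigned victim = 0;

    c->clock++;
    for (unsigned w = 0; w < c->ways; w++) {
        if (used[w] != 0 && tags[w] == tag) {   /* valid and matching: hit */
            used[w] = c->clock;
            c->hits++;
            return true;
        }
        if (used[w] < used[victim]) victim = w; /* oldest (or invalid) way */
    }
    tags[victim] = tag;                         /* miss: fill the LRU way */
    used[victim] = c->clock;
    c->misses++;
    return false;
}
```

Feeding the address trace through cache_access and reading hits and misses at the end gives the basic report; other replacement policies only change how victim is chosen.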

Lab D.5: False Sharing

Run two threads that each update their own counter, first with both counters on the same cache line and then with each counter on its own line (padded to at least 64 bytes). Measure the difference in run time.

Teaches: cache coherence costs, false sharing, the importance of layout.

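// Bad version: both counters share one cache line
struct shared { volatile long a, b; } counters;
// Good version: each counter aligned to its own 64-byte line (C11).
// Alignment, unlike manual padding, does not depend on sizeof(long).
struct padded { _Alignas(64) volatile long val; };
struct padded padded_counters[2];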

Lab D.6: TLB Behavior

Allocate a large array, walk through it with stride equal to one page (4 KiB), and measure the access time as the number of pages touched grows. The transition reveals TLB size.

Teaches: TLB hierarchy, page-walk overhead.

Extensions: use huge pages (2 MiB) and observe the difference; profile a real workload's TLB miss rate with perf stat -e dTLB-load-misses,iTLB-load-misses.

03. Group 3: Branches and Prediction

Lab D.7: Branch Predictor Behavior

Write a function that branches based on a pattern in an array. Make the pattern: (a) all true, (b) all false, (c) random, (d) regular alternating. Measure performance for each.

Teaches: branch prediction in practice; how "random" data hurts performance.

Extensions: sort the array and re-measure. (The classic "why is it faster on sorted data" demonstration.)
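
The sorted-versus-random comparison can be set up with a few small helpers. This is a sketch, not a complete benchmark: the names are hypothetical, timing is left to you, and rand() with a fixed seed stands in for a better random source.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The branch under test: taken whenever data[i] >= threshold. */
uint64_t sum_above(const uint8_t *data, size_t n, uint8_t threshold) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold) sum += data[i];
    }
    return sum;
}

/* Pattern (c): pseudo-random bytes, reproducible for a given seed. */
void fill_random(uint8_t *data, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) data[i] = (uint8_t)(rand() & 0xFF);
}

static int cmp_u8(const void *a, const void *b) {
    uint8_t x = *(const uint8_t *)a, y = *(const uint8_t *)b;
    return (x > y) - (x < y);
}

/* Sorting turns the random pattern into one long not-taken run
   followed by one long taken run. */
void sort_bytes(uint8_t *data, size_t n) {
    qsort(data, n, sizeof *data, cmp_u8);
}
```

Time sum_above over the same buffer before and after sort_bytes; the result is identical, only the branch pattern changes. At higher optimization levels the compiler may replace the branch with a conditional move or vectorize the loop, which hides the effect, so inspect the assembly or try a lower optimization level.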

Lab D.8: Branchless Code

Take a small if-else (e.g., max, abs, clamp) and write three implementations: with a branch, with conditional move (compiler-driven via ?:), and explicitly branchless using bit manipulation. Measure on predictable vs. unpredictable inputs.

Teaches: predication; branchless idioms; when each strategy wins.
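
For max and abs, the three styles might look like the sketch below. The function names are illustrative. Whether max_branch and max_select actually compile to a branch or a conditional move is up to the compiler, and abs_bits assumes that right-shifting a negative int32_t is an arithmetic shift, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <stdint.h>

/* 1. Explicit branch. */
int32_t max_branch(int32_t a, int32_t b) {
    if (a < b) return b;
    return a;
}

/* 2. Ternary: a natural candidate for a conditional move. */
int32_t max_select(int32_t a, int32_t b) {
    return a < b ? b : a;
}

/* 3. Bit manipulation: mask is all ones when a < b, else zero. */
int32_t max_bits(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);
    return a ^ ((a ^ b) & mask);   /* picks b when mask is set, else a */
}

/* Branchless abs: m is -1 for negative x, 0 otherwise.
   Like abs(), the result is undefined for INT32_MIN. */
int32_t abs_bits(int32_t x) {
    int32_t m = x >> 31;
    return (x ^ m) - m;
}
```

All three max variants return the same values, so a benchmark only needs to swap the function and the input distribution.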

04. Group 4: Microarchitecture and ILP

Lab D.9: Pipeline Stalls

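Write a sequence of dependent instructions (each uses the previous one's result) and a sequence of independent instructions doing the same total work. Measure both. The dependent chain is limited by instruction latency, while the independent version lets the out-of-order core overlap operations, so the gap shows how much latency OoO execution can hide when independent work is available.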

// Dependent: serial chain
for (i = 0; i < N; i++) x = x * a + b;
// Independent: 4-way unrolled with separate accumulators
for (i = 0; i < N; i += 4) {
    x0 = x0 * a + b;
    x1 = x1 * a + b;
    x2 = x2 * a + b;
    x3 = x3 * a + b;
}

Teaches: ILP, dependency chains, why unrolling helps even on OoO machines.

Lab D.10: 5-Stage Pipelined Simulator

Implement a cycle-accurate simulator of a 5-stage pipeline (IF, ID, EX, MEM, WB) for a subset of RISC-V. Handle hazards: stalls for load-use, forwarding for ALU-to-ALU, branch flushes.

Teaches: pipeline mechanics deeply; forwarding; hazards.

Extensions: add branch prediction; add a multiplier with multi-cycle latency; extend to a 7-stage pipeline.
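
The hazard logic at the heart of this lab is small. Below is a sketch of load-use stall detection and forwarding selection in the textbook style; the instr struct and its fields are hypothetical simplifications of a decoded RISC-V instruction, not a real decoder.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t rd, rs1, rs2;        /* register numbers 0..31 */
    bool reads_rs1, reads_rs2;   /* which source fields are actually used */
    bool is_load, writes_rd;
} instr;

/* Load-use hazard: the instruction in ID needs a value that the load
   currently in EX will not have until the end of MEM. */
bool must_stall(const instr *in_id, const instr *in_ex) {
    if (!in_ex->is_load || in_ex->rd == 0) return false;  /* x0 is always zero */
    return (in_id->reads_rs1 && in_id->rs1 == in_ex->rd) ||
           (in_id->reads_rs2 && in_id->rs2 == in_ex->rd);
}

typedef enum { FWD_NONE, FWD_FROM_EX_MEM, FWD_FROM_MEM_WB } fwd_src;

/* Pick the forwarding source for one EX operand. The younger producer
   (in MEM) wins over the older one (in WB). A load in MEM has no data
   yet, which is why must_stall() exists. */
fwd_src forward_select(uint8_t src, const instr *in_mem, const instr *in_wb) {
    if (src == 0) return FWD_NONE;
    if (in_mem->writes_rd && !in_mem->is_load && in_mem->rd == src)
        return FWD_FROM_EX_MEM;
    if (in_wb->writes_rd && in_wb->rd == src)
        return FWD_FROM_MEM_WB;
    return FWD_NONE;
}
```

In a cycle loop, must_stall would be evaluated in the decode stage and forward_select once per source operand in the execute stage.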

Lab D.11: Out-of-Order Simulator

Implement an OoO simulator: front-end fetch and rename; reservation stations; ROB; functional units with various latencies; in-order retirement. Run small programs through it and compare to in-order.

Teaches: Tomasulo's algorithm, register renaming, ROB management.

Tools: gem5 is a real research-grade simulator if you want to skip the implementation and study an existing one.

05. Group 5: Performance Profiling

Lab D.12: perf Profiling Tour

Take any non-trivial C/C++ program (a parser, a small simulator, etc.). Run:

perf stat -d ./prog               # cache and basic stats
perf stat -e branches,branch-misses ./prog
perf record -g ./prog
perf report                       # interactive call graph
perf annotate                     # see hot assembly

Identify the hottest function and the main bottleneck (instructions, cache, branches?).

Teaches: profiling workflow, reading perf counters, finding bottlenecks.

Lab D.13: Top-Down Analysis

Use Intel VTune (or AMD uProf, or perf stat -e with the right counters) to do a Top-Down Microarchitecture Analysis on a workload. Classify it as front-end-bound, back-end-bound, retiring, or bad-speculation.

Teaches: TMA methodology; how to read modern microarchitectural counters.

Lab D.14: Roofline Model

Pick a numerical kernel (matrix multiply, stencil, FFT). Compute its arithmetic intensity (FLOPs per byte). Plot vs. the machine's peak FLOP/s and peak memory bandwidth (the "roofline"). Determine whether the kernel is compute-bound or memory-bound. Optimize accordingly.

Teaches: arithmetic intensity, the memory wall, optimization strategy.
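
The roofline itself is a one-line formula: attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity. A sketch follows, with hypothetical function names. The matmul intensity uses an idealized traffic model (each of the three n-by-n double matrices crosses the memory bus exactly once), which real kernels only approach with good blocking.

```c
/* Attainable GFLOP/s for a kernel with the given arithmetic intensity. */
double roofline_gflops(double peak_gflops, double peak_gbytes_per_s,
                       double flops_per_byte) {
    double memory_roof = peak_gbytes_per_s * flops_per_byte;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Intensity at which the two roofs meet; kernels below it are memory-bound. */
double ridge_point(double peak_gflops, double peak_gbytes_per_s) {
    return peak_gflops / peak_gbytes_per_s;
}

/* Idealized n-by-n double matmul: 2n^3 FLOPs over 3 * 8 * n^2 bytes. */
double matmul_intensity_ideal(double n) {
    return (2.0 * n * n * n) / (3.0 * 8.0 * n * n);
}
```

With made-up machine numbers of 100 GFLOP/s and 20 GB/s, the ridge point is 5 FLOPs per byte: a kernel at 0.25 FLOPs per byte tops out at 5 GFLOP/s, while one at 10 FLOPs per byte can reach the full 100.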

06. Group 6: Assembly and Compilers

Lab D.15: godbolt Tour

Use Compiler Explorer (godbolt.org). Compile the same C function across:

GCC vs. Clang
x86-64 vs. AArch64 vs. RISC-V
-O0 vs. -O1 vs. -O2 vs. -O3
-march=skylake vs. -march=znver3 (x86-64); -mcpu=cortex-a78 (AArch64)

Note where they differ.

Teaches: ISA differences, compiler differences, optimization-flag effects.

Lab D.16: Hand-Written SIMD

Write a function (e.g., array sum, dot product) using x86-64 SSE intrinsics, then AVX2, then AVX-512. Compare against autovectorized scalar code. Measure.

Teaches: SIMD programming; the relationship between width and performance.

Extensions: rewrite in NEON for AArch64 or RVV for RISC-V (on QEMU).

Lab D.17: Assembly from Scratch

Write a complete program in pure assembly: hello world that uses the Linux write syscall directly. Then a slightly more complex program: read a file, capitalize each character, write to stdout.

Teaches: ABI, system calls, hand-rolled assembly mechanics.

07. Group 7: Operating-System and Architecture Interface

Lab D.18: Page Walker

Write a small program that reads /proc/self/pagemap and /proc/self/maps to print the physical page each virtual page maps to.

Teaches: virtual memory layout, page tables, kernel exposure of MMU state.
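
Each /proc/self/pagemap entry is a 64-bit word indexed by virtual page number. The sketch below reads one entry and decodes the fields documented by the kernel: bit 63 is the present flag, bit 62 the swapped flag, and bits 0 to 54 hold the page frame number when the page is present. The helper names are hypothetical. On recent kernels the PFN field reads as zero unless the process has CAP_SYS_ADMIN, so run as root to see real frame numbers.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT  (UINT64_C(1) << 63)
#define PM_SWAPPED  (UINT64_C(1) << 62)
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)   /* bits 0..54 */

int pagemap_present(uint64_t entry) {
    return (entry & PM_PRESENT) != 0;
}

/* Page frame number, or 0 if the page is not present in RAM. */
uint64_t pagemap_pfn(uint64_t entry) {
    return pagemap_present(entry) ? (entry & PM_PFN_MASK) : 0;
}

/* Read the pagemap entry for the page containing vaddr.
   Returns 0 on success, -1 on any failure. */
int pagemap_read(const void *vaddr, uint64_t *entry) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return -1;
    /* One 8-byte entry per virtual page, so seek to vpn * 8. */
    off_t off = (off_t)((uintptr_t)vaddr / (uintptr_t)page)
              * (off_t)sizeof *entry;
    ssize_t got = pread(fd, entry, sizeof *entry, off);
    close(fd);
    return got == (ssize_t)sizeof *entry ? 0 : -1;
}
```

Walking the address ranges parsed from /proc/self/maps and calling pagemap_read once per page gives the virtual-to-physical listing the lab asks for.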

Lab D.19: Mini Bootloader

Write a 16-bit real-mode bootloader (for x86) that prints "Hello" and halts. Boot it via QEMU (qemu-system-x86_64 -fda boot.img).

Teaches: BIOS boot process (deprecated but still illustrative); real mode.

Extensions: progress to UEFI by writing an EDK2-based application that prints to the framebuffer.

Lab D.20: Tiny Kernel

Build a kernel that boots, sets up paging, enables interrupts, handles a timer interrupt, and runs a single user-mode process. RISC-V is a popular choice: start from xv6-riscv (MIT) or write one from scratch for QEMU's RISC-V virt machine.

Teaches: privilege transitions, interrupt handling, virtual memory setup, syscall mechanism.

Lab D.21: System-Call Tracer

Use ptrace (Linux) to write a strace-like tool that prints every syscall a child process makes.

Teaches: syscall mechanics, ptrace API, kernel-userspace interface.

08. Group 8: Concurrency and Memory Models

Lab D.22: Lock-Free Queue

Implement a single-producer, single-consumer lock-free queue using only atomic loads and stores. Test it with one producer thread and one consumer thread, and run the tests under TSan (and ASan for memory errors).

Teaches: memory ordering, atomic operations, the pitfalls of lock-free code.

Extensions: extend to MPMC; benchmark against a mutex-based queue; test on AArch64 and look for bugs that its weaker memory ordering exposes but that x86's stronger ordering hides.
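
One common shape for the SPSC queue is a fixed-size ring with free-running head and tail indices, using C11 acquire/release ordering. The sketch below is one such design under stated assumptions: the names and the capacity of 1024 are arbitrary, the capacity must be a power of two, and exactly one thread calls ring_push while exactly one other calls ring_pop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 1024   /* must be a power of two */

typedef struct {
    int slots[RING_CAP];
    _Atomic size_t head;   /* next slot to read; written only by the consumer */
    _Atomic size_t tail;   /* next slot to write; written only by the producer */
} spsc_ring;

void ring_init(spsc_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
bool ring_push(spsc_ring *r, int v) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP) return false;
    r->slots[tail % RING_CAP] = v;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(spsc_ring *r, int *out) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = r->slots[head % RING_CAP];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Weakening the release stores to relaxed is a useful experiment: the code will often still appear to work on x86 and may fail on AArch64. Note also that head and tail sit next to each other here, so they likely share a cache line; separating them is a natural follow-up to the false-sharing lab.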

Lab D.23: Memory-Ordering Litmus Tests

Use litmus7 (or a hand-rolled equivalent) to run classical memory-model tests (store buffering, IRIW, etc.) on x86-64 and AArch64. Observe the differences.

Teaches: memory consistency models, why they matter.

Lab D.24: Read-Copy-Update

Implement a simplified RCU (Read-Copy-Update) for a linked list. Multiple readers traverse without locks; writers update via copy-then-replace.

Teaches: advanced concurrency; the relationship between hardware ordering and concurrent algorithms.

09. Group 9: GPUs and Accelerators

Lab D.25: First CUDA / HIP Kernel

Implement vector addition, then matrix multiplication, in CUDA. Compare to CPU. Then optimize: shared memory tiling, coalesced access, occupancy tuning.

Teaches: SIMT model, GPU memory hierarchy, kernel launch overhead.

Extensions: rewrite with cuBLAS; use Nsight to profile; port to HIP for AMD.

Lab D.26: cuBLAS vs. Naive

Compare your hand-written gemm to cuBLAS. Measure the gap. Read about how cuBLAS exploits tensor cores; write a tensor-core gemm using wmma or CUTLASS.

Teaches: the gap between naive and optimized GPU code; the importance of vendor libraries.

Lab D.27: Triton Kernel

Write a softmax or layernorm kernel in OpenAI's Triton DSL. Compare to a hand-written CUDA equivalent.

Teaches: high-level GPU programming; kernel autotuning.

10. Group 10: Embedded and Real-Time

Lab D.28: Bare-Metal Cortex-M Blink

On a Cortex-M0/M3/M4 board (Nucleo, BluePill), write a bare-metal C program (no RTOS) that toggles an LED via direct GPIO register manipulation. Measure the loop period with a logic analyzer.

Teaches: embedded boot sequence, memory-mapped I/O, the lack of an OS.

Lab D.29: FreeRTOS Tasks

On the same board, run two FreeRTOS tasks at different priorities. Demonstrate preemption.

Teaches: RTOS basics, task scheduling, context switching.

Lab D.30: WCET Analysis

Write a small loop with conditional branches and analyze its worst-case execution time:

(a) Theoretically, by counting instructions. (b) Empirically, by running it many times. (c) On a Cortex-M, by enabling DWT cycle counter and recording timing.

Teaches: WCET concepts; the gap between average and worst case.

11. Group 11: Reconfigurable

Lab D.31: FPGA Hello-World

Use an inexpensive FPGA dev board (Lattice iCE40, Xilinx Spartan-7, Tang Nano) and the open-source toolchain (Yosys + nextpnr for iCE40 / ECP5). Implement a counter that drives an LED. Then a UART transmitter. Then a simple calculator.

Teaches: hardware design language (Verilog or VHDL or SpinalHDL), synthesis flow.

Lab D.32: Soft RISC-V Core

Synthesize an open-source RISC-V soft core (PicoRV32, VexRiscv, Ibex) onto your FPGA. Run a hello-world program through it.

Teaches: end-to-end CPU implementation; the simplicity of RISC-V.

Extensions: customize the core (add an instruction; add a peripheral); benchmark against the unmodified version.

12. Group 12: Reading Real Hardware

Lab D.33: Apple Silicon / Snapdragon Reverse Engineering

Read the publicly available Apple Silicon architecture analyses (e.g., AnandTech, the Asahi Linux project). Identify which microarchitectural features Apple's "Firestorm/Avalanche/Everest" cores have. Compare to ARM's reference Cortex-X cores.

Teaches: how to read modern SoC analysis; the relationship between published specs and reality.

Lab D.34: Intel/AMD Optimization Manual

Read selected sections of Intel's Optimization Reference Manual or AMD's equivalent. Pick a topic (memory disambiguation, branch prediction, micro-fusion) and write a short summary.

Teaches: how vendors document their microarchitecture; the level of detail available.

Lab D.35: Disassemble Production Code

Take a binary you use (any small CLI tool: ls, cat, a compiled Python interpreter). Disassemble part of it. Identify: prologue patterns, library calls, vectorized loops, branch-heavy code.

Teaches: reading real-world assembly; recognizing compiler patterns in the wild.

13. Group 13: Putting It All Together

Lab D.36: Benchmark Suite

Pick a workload (a compression algorithm, a parser, a sort). Implement it. Benchmark across:

Compilers (GCC, Clang)
Architectures (x86-64 vs. AArch64)
Optimization levels
Compile flags (-march=native vs. baseline)
Vector intrinsics vs. autovectorized

Document the results.

Teaches: empirical performance methodology; the variability of "performance".

Lab D.37: Tiny C Compiler / JIT

Build a small JIT compiler that takes simple expressions and emits x86-64 machine code at runtime. Allocate the code buffer with mmap (MAP_PRIVATE | MAP_ANONYMOUS), make it executable with PROT_EXEC, and call into the generated code.

Teaches: dynamic code generation; encoding x86-64 by hand; the JIT path.
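
The smallest possible version of this lab emits a function that returns a constant. The sketch below hand-encodes mov eax, imm32 followed by ret, copies the bytes into an anonymous mapping, flips the mapping to read+execute, and calls it. The function name is hypothetical. It assumes x86-64 on a POSIX system and returns -1 elsewhere. It also maps the buffer writable first and then switches to executable with mprotect, since some systems refuse pages that are writable and executable at once.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Generate and run "return value;". Returns 0 on success, -1 on failure
   or on a non-x86-64 target. */
int jit_run_constant(int32_t value, int32_t *result) {
#if defined(__x86_64__) && defined(MAP_ANONYMOUS)
    uint8_t code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };  /* mov eax, imm32 ; ret */
    memcpy(&code[1], &value, sizeof value);         /* little-endian immediate */

    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    memcpy(mem, code, sizeof code);

    /* Drop write permission before executing. */
    if (mprotect(mem, sizeof code, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof code);
        return -1;
    }

    /* Converting a data pointer to a function pointer is outside ISO C
       but is the usual practice on POSIX systems. */
    int32_t (*fn)(void) = (int32_t (*)(void))mem;
    *result = fn();
    munmap(mem, sizeof code);
    return 0;
#else
    (void)value;
    (void)result;
    return -1;
#endif
}
```

From here, an expression JIT grows by emitting more instructions into the same buffer, for example loading operands into registers and emitting add or imul before the final ret.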

Lab D.38: Architectural Capstone

Pick a topic that interests you and go deep:

Implement a single-issue OoO simulator with branch prediction and a multi-level cache.
Build a packet processor on an FPGA (parse Ethernet/IP, swap MAC addresses, send back).
Port a small program to a non-mainstream ISA (for example, write a Z80 or MOS 6502 emulator and run BASIC).
Measure the energy efficiency of three different ISAs running the same workload (AArch64 on Apple Silicon vs. x86-64 on an Intel Xeon vs. RISC-V on an SBC).
Reproduce a paper from ISCA, MICRO, or HPCA — pick one with public data and try to replicate the results.

The capstone is yours to define.

14. Suggested Sequence

If you want a structured progression:

Start with Lab D.3 (cache discovery) — easy to do, immediately revealing.
Then Lab D.7 (branch predictor) — same machine, different angle.
Then Lab D.12 (perf profiling tour) — gives you a permanent tool for everything.
Then Lab D.15 (godbolt tour) — builds intuition for compiler output.
Then pick from each major group based on interest.

For a semester-length course:

Weeks 1-2: Labs D.1, D.2, D.3 (foundations + cache).
Weeks 3-4: Labs D.4 (cache simulator), D.7-D.8 (branches).
Weeks 5-6: Labs D.9-D.10 (ILP, pipelined simulator).
Weeks 7-8: Labs D.12, D.13, D.14 (profiling, top-down, roofline).
Weeks 9-10: Labs D.20 (tiny kernel) or D.25 (first CUDA).
Weeks 11-12: Lab D.22 or D.31 (concurrency or FPGA).
Weeks 13-14: Capstone (D.38).

15. Tools Reference

Tool	Purpose	Platforms
GCC, Clang	Compilation, assembly output	All
objdump	Disassembly	All
perf	Profiling, hardware counters	Linux
Intel VTune	Profiling, top-down analysis	x86-64 (free for non-commercial)
AMD uProf	Profiling	x86-64 (free)
Apple Instruments	Profiling	macOS
gem5	Research simulator	All (cross-platform)
QEMU	ISA emulation	All
godbolt.org / Compiler Explorer	Online assembly viewer	Web
Pin	Dynamic instrumentation	x86-64
Valgrind / Cachegrind	Memory/cache simulation	Linux
ftrace, eBPF	Kernel tracing	Linux
Yosys, nextpnr	Open-source FPGA toolchain	All
Vivado, Quartus	Vendor FPGA toolchains	Windows / Linux
Nsight	NVIDIA GPU profiling	NVIDIA
RGP	AMD GPU profiling	AMD

16. Closing

Architecture is a discipline of measurement. The labs in this appendix are meant to make the principles of the book tangible. After even a few of them, the abstractions of the main text — cache lines, pipeline hazards, memory barriers, branch prediction — stop being abstract and start being things you've personally watched cost or save microseconds.

Some of these labs are short hour-long exercises. Some can grow into research projects. Pick whichever fits your time and interest, and don't worry about doing them all. The benefit is in the engagement, not the completion.

Good luck.

This is the end of Computer Architecture from First Principles to Modern CPUs. Thank you for working through the material. The field continues to evolve rapidly; whatever you're studying or building, the foundations covered here should serve as ground that doesn't shift, even as new architectures, new accelerators, and new paradigms keep arriving.
