Suggested Labs and Projects
May 16, 2026·14 min read·intermediate
The material in this book is most useful when grounded in hands-on practice. This appendix lists projects that exercise the concepts of the main text. They range from quick exercises (an hour or two) to substantial projects (weeks or months). Each project lists what it teaches, what tools and prerequisites help, and what variations or extensions are worth pursuing.
The labs are grouped by theme rather than by chapter. Some require specific hardware; many can be done on any modern Linux laptop.
01. Group 1: Bit-Level and Number-System Practice
Lab D.1: Bitwise Operation Library
Build a small C library that implements: popcount, clz, ctz, bit_reverse, byte_swap, and count_trailing_set_bits, all without using the corresponding intrinsics. Then time them against the intrinsic-using versions.
Teaches: Boolean algebra in code, bit manipulation idioms, the value of dedicated hardware instructions.
Extensions: Compare your software popcount to __builtin_popcount performance. Write a SWAR (SIMD Within A Register) popcount and measure. Read Hacker's Delight chapter 5.
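As a starting point, here is a minimal sketch of three of the routines for 32-bit inputs. The function names (`popcount_naive`, `popcount_swar`, `clz32`) are illustrative choices, not part of any library, and the remaining routines (ctz, bit reversal, byte swap) are left to the lab.

```c
#include <stdint.h>

/* Naive popcount: clear the lowest set bit until none remain. */
unsigned popcount_naive(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        n++;
    }
    return n;
}

/* SWAR popcount: sum bits in parallel in 2-, 4-, then 8-bit fields. */
unsigned popcount_swar(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Count leading zeros by binary search; returns 32 for x == 0. */
unsigned clz32(uint32_t x) {
    if (x == 0) return 32;
    unsigned n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

Timing these against `__builtin_popcount` and `__builtin_clz` (GCC and Clang) is the measurement half of the lab. Check the generated assembly first, since an optimizing compiler may recognize some of these idioms and emit the hardware instruction anyway.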
Lab D.2: Two's Complement Calculator
Implement a tool that takes any integer and width (8, 16, 32, 64 bits) and prints binary, hex, signed-decimal, and unsigned-decimal interpretations side-by-side. Add an "interpret as floating-point" mode (treat the bits as IEEE 754 single or double).
Teaches: number-system conversions, floating-point representation.
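The core of the tool is a handful of reinterpretation helpers. The sketch below is one possible shape for them; the names (`as_unsigned`, `as_signed`, `bits_to_float`, `bits_to_double`) are hypothetical, and it assumes the platform's `float` and `double` are IEEE 754 binary32 and binary64, which holds on all mainstream targets.

```c
#include <stdint.h>
#include <string.h>

/* Keep only the low `width` bits (width in 1..64). */
uint64_t as_unsigned(uint64_t bits, unsigned width) {
    return width >= 64 ? bits : bits & ((UINT64_C(1) << width) - 1);
}

/* Interpret the low `width` bits as a two's complement value. */
int64_t as_signed(uint64_t bits, unsigned width) {
    uint64_t v = as_unsigned(bits, width);
    if (width >= 64) return (int64_t)v;
    uint64_t sign = UINT64_C(1) << (width - 1);
    /* XOR then subtract the sign bit: sign-extends without branching. */
    return (int64_t)((v ^ sign) - sign);
}

/* Reinterpret 32 raw bits as a float; memcpy avoids aliasing problems. */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret 64 raw bits as a double. */
double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, `as_signed(0xFF, 8)` yields -1 while `as_unsigned(0xFF, 8)` yields 255, and `bits_to_float(0x3F800000)` yields 1.0. Printing the binary and hex columns is then a formatting exercise.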
02. Group 2: Cache and Memory Hierarchy
Lab D.3: Memory Hierarchy Discovery
Write a C program that allocates an array of size N, reads each cache line in order M times, then computes the average access time. Plot access time vs. N for N from 4 KiB to 1 GiB.
You should see plateaus corresponding to L1, L2, L3, and DRAM. The transition points reveal the cache sizes; the plateau heights reveal the cache latencies.
```c
// Pseudocode: measure() stands in for the timed access loop
for (size_t size = 4 * 1024; size <= (size_t)1 << 30; size *= 2) {
    char *buf = malloc(size);
    // Stride through buf so each access misses the most recently loaded line,
    // or visit the lines in a randomly shuffled order
    measure(buf, size);
    free(buf);
}
```
Teaches: cache hierarchy, latency, the practical reality of "memory is slow".
Extensions: vary stride to detect cache line size (the transition typically falls at 64 bytes); use clflush and mfence to evict lines and order the timed accesses; compare DRAM vs. NVMe access by mmap'ing a file.
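One common way to keep the hardware prefetcher from hiding the latency is to link the cache lines into a random cycle and chase pointers through it. The sketch below shows that setup under stated assumptions: a 64-byte line (`LINE_BYTES`), a simple xorshift generator for shuffling, and hypothetical helper names (`alloc_nodes`, `link_random_cycle`, `chase`). The timing wrapper is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE_BYTES 64   /* assumed cache-line size; adjust for your machine */

/* One node per cache line, so every hop touches a different line. */
typedef struct node {
    struct node *next;
    char pad[LINE_BYTES - sizeof(struct node *)];
} node;

/* Allocate n line-aligned nodes (C11 aligned_alloc); NULL on failure. */
node *alloc_nodes(size_t n) {
    return aligned_alloc(LINE_BYTES, n * sizeof(node));
}

/* Small xorshift PRNG (state must be nonzero). */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Link buf[0..n-1] into a single random cycle (Sattolo's algorithm). */
void link_random_cycle(node *buf, size_t n, uint64_t seed) {
    uint64_t state = seed ? seed : 1;
    for (size_t i = 0; i < n; i++)
        buf[i].next = &buf[i];
    for (size_t i = n; i-- > 1; ) {
        size_t j = (size_t)(xorshift64(&state) % i);   /* j in [0, i) */
        node *tmp = buf[i].next;
        buf[i].next = buf[j].next;
        buf[j].next = tmp;
    }
}

/* Follow `steps` dependent loads; each load must finish before the next starts. */
node *chase(node *start, size_t steps) {
    node *p = start;
    while (steps--)
        p = p->next;
    return p;
}
```

To measure, wrap a call such as `chase(buf, 10 * n)` between two `clock_gettime(CLOCK_MONOTONIC, ...)` readings and divide the elapsed time by the step count. Use the returned pointer (print it, for instance) so the compiler cannot discard the loop.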
Lab D.4: Cache Simulator
Implement a simulator that takes a stream of memory addresses (load/store) and reports hits/misses for a configurable cache: associativity, line size, total size, replacement policy (LRU, FIFO, random, NRU).
Generate the trace from a real program using pin (Intel Pin tool) or perf record + post-processing.
Teaches: cache mechanics, replacement policies, the value of associativity.
Extensions: simulate a multi-level hierarchy; add write-back vs. write-through; add a TLB and measure walk overhead.
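A minimal sketch of the simulator's core, assuming a single-level set-associative cache with LRU replacement tracked by timestamps. The type and function names (`cache`, `cache_init`, `cache_access`, `cache_free`) are illustrative, and loads and stores are treated identically.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t tag;
    uint64_t last_used;   /* timestamp of the most recent access, for LRU */
    int valid;
} cache_line;

typedef struct {
    unsigned sets, ways, line_bytes;
    uint64_t clock, hits, misses;
    cache_line *lines;    /* sets * ways entries, grouped by set */
} cache;

/* Returns 0 on success, -1 on allocation failure. All sizes must be nonzero. */
int cache_init(cache *c, unsigned sets, unsigned ways, unsigned line_bytes) {
    c->sets = sets;
    c->ways = ways;
    c->line_bytes = line_bytes;
    c->clock = c->hits = c->misses = 0;
    c->lines = calloc((size_t)sets * ways, sizeof *c->lines);
    return c->lines ? 0 : -1;
}

/* Simulate one access; returns 1 on a hit, 0 on a miss. */
int cache_access(cache *c, uint64_t addr) {
    uint64_t block = addr / c->line_bytes;
    uint64_t tag = block / c->sets;
    cache_line *set = &c->lines[(size_t)(block % c->sets) * c->ways];
    cache_line *victim = &set[0];
    c->clock++;
    for (unsigned w = 0; w < c->ways; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = c->clock;
            c->hits++;
            return 1;
        }
        /* Prefer an invalid way; otherwise the least recently used one. */
        if (victim->valid &&
            (!set[w].valid || set[w].last_used < victim->last_used))
            victim = &set[w];
    }
    victim->valid = 1;
    victim->tag = tag;
    victim->last_used = c->clock;
    c->misses++;
    return 0;
}

void cache_free(cache *c) {
    free(c->lines);
    c->lines = NULL;
}
```

Feeding it a trace is a loop over addresses calling `cache_access`. The other replacement policies fit in the same structure by changing only how `victim` is chosen.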
Lab D.5: False Sharing Demonstration
Have two threads update separate counters, first with both counters on one cache line, then with each counter on its own line (padded to 64 bytes or more). Measure the difference in run time.
Teaches: cache coherence costs, false sharing, the importance of layout.
```c
// Bad version: both counters share one cache line
struct shared { volatile long a, b; } shared_counters;

// Good version: counters sit 64 bytes apart (assumes 8-byte long, 64-byte lines)
struct padded { volatile long val; char pad[56]; };
struct padded padded_counters[2];
```
Lab D.6: TLB Behavior
Allocate a large array, walk through it with stride equal to one page (4 KiB), and measure the access time as the number of pages touched grows. The transition reveals TLB size.
Teaches: TLB hierarchy, page-walk overhead.
Extensions: use huge pages (2 MiB) and observe the difference; profile a real workload's TLB miss rate with perf stat -e dTLB-load-misses,iTLB-load-misses.
03. Group 3: Branches and Prediction
Lab D.7: Branch Predictor Behavior
Write a function that branches based on a pattern in an array. Make the pattern: (a) all true, (b) all false, (c) random, (d) regular alternating. Measure performance for each.
Teaches: branch prediction in practice; how "random" data hurts performance.
Extensions: sort the array and re-measure. (The classic "why is it faster on sorted data" demonstration.)
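A minimal sketch of the kernel under test and the four input patterns. The threshold of 128 and the names (`sum_large`, `fill_pattern`, the `pattern` enum) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { PAT_ALL_TRUE, PAT_ALL_FALSE, PAT_RANDOM, PAT_ALTERNATING } pattern;

/* Fill data so the test `data[i] >= 128` follows the chosen pattern. */
void fill_pattern(uint8_t *data, size_t n, pattern p, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        switch (p) {
        case PAT_ALL_TRUE:    data[i] = 200; break;
        case PAT_ALL_FALSE:   data[i] = 50;  break;
        case PAT_RANDOM:      data[i] = (uint8_t)(rand() & 0xFF); break;
        case PAT_ALTERNATING: data[i] = (i & 1) ? 200 : 50; break;
        }
    }
}

/* Sum only the elements >= 128; the `if` is the branch being measured. */
int64_t sum_large(const uint8_t *data, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)
            sum += data[i];
    }
    return sum;
}
```

Time `sum_large` over a large array (tens of megabytes) for each pattern. At higher optimization levels the compiler may replace the branch with a conditional move or vectorize the loop, which removes the effect, so inspect the assembly or compile with a lower level such as `-O1` if all four patterns run at the same speed.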
Lab D.8: Branchless Code
Take a small if-else (e.g., max, abs, clamp) and write three implementations: with a branch, with conditional move (compiler-driven via ?:), and explicitly branchless using bit manipulation. Measure on predictable vs. unpredictable inputs.
Teaches: predication; branchless idioms; when each strategy wins.
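The three styles can be sketched for `max` as follows, with mask-based `abs` and `clamp` alongside. The function names are illustrative. Whether the `?:` version actually compiles to a conditional move depends on the compiler and flags, so treat that as something to verify in the assembly rather than assume.

```c
#include <stdint.h>

/* 1. Explicit branch. */
int32_t max_branch(int32_t a, int32_t b) {
    if (a > b) return a;
    return b;
}

/* 2. Ternary: compilers often (not always) emit a conditional move. */
int32_t max_select(int32_t a, int32_t b) {
    return a > b ? a : b;
}

/* 3. Bit manipulation: build an all-ones or all-zeros mask from the compare. */
int32_t max_bits(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a > b);   /* -1 if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}

int32_t min_bits(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);
    return (a & mask) | (b & ~mask);
}

/* Absolute value; like abs(), undefined for INT32_MIN. */
int32_t abs_bits(int32_t x) {
    int32_t mask = -(int32_t)(x < 0);   /* -1 if negative, else 0 */
    return (x ^ mask) - mask;           /* negative: ~x + 1; otherwise x */
}

/* Clamp x into [lo, hi] without branches; assumes lo <= hi. */
int32_t clamp_bits(int32_t x, int32_t lo, int32_t hi) {
    return min_bits(max_bits(x, lo), hi);
}
```

On predictable inputs the branching version is usually hard to beat; the lab is about finding where the crossover lies on unpredictable ones.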
04. Group 4: Microarchitecture and ILP
Lab D.9: Pipeline Stalls
Write a sequence of dependent instructions (each uses the previous one's result) and a sequence of independent instructions doing the same total work. Measure both. The difference shows OoO's ability to hide latency.
```c
// Dependent: serial chain
for (i = 0; i < N; i++) x = x * a + b;

// Independent: 4-way unrolled with separate accumulators
for (i = 0; i < N; i += 4) {
    x0 = x0 * a + b;
    x1 = x1 * a + b;
    x2 = x2 * a + b;
    x3 = x3 * a + b;
}
```
Teaches: ILP, dependency chains, why unrolling helps even on OoO machines.
Lab D.10: 5-Stage Pipelined Simulator
Implement a cycle-accurate simulator of a 5-stage pipeline (IF, ID, EX, MEM, WB) for a subset of RISC-V. Handle hazards: stalls for load-use, forwarding for ALU-to-ALU, branch flushes.
Teaches: pipeline mechanics deeply; forwarding; hazards.
Extensions: add branch prediction; add a multiplier with multi-cycle latency; extend to a 7-stage pipeline.
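The hazard logic at the heart of the simulator is small. Here is a sketch of the load-use stall check and the forwarding choice for one operand, in the style of the classic textbook five-stage design. The `pipe_reg` struct and its field names are hypothetical, and the sketch assumes the decoder sets unused source-register fields to 0 (register x0).

```c
#include <stdint.h>

/* Hypothetical subset of a pipeline register's control fields. */
typedef struct {
    uint8_t rd;         /* destination register (0 = x0, never forwarded) */
    uint8_t rs1, rs2;   /* source registers (0 if unused) */
    int reg_write;      /* instruction writes rd */
    int mem_read;       /* instruction is a load */
} pipe_reg;

/* Load-use hazard: the instruction in EX is a load whose destination is a
   source of the instruction in ID. The pipeline must insert one bubble. */
int load_use_stall(const pipe_reg *id_ex, const pipe_reg *if_id) {
    return id_ex->mem_read && id_ex->rd != 0 &&
           (id_ex->rd == if_id->rs1 || id_ex->rd == if_id->rs2);
}

typedef enum { FWD_NONE, FWD_FROM_EX_MEM, FWD_FROM_MEM_WB } fwd_src;

/* Choose where the EX stage takes operand register `rs` from.
   EX/MEM holds the newer result, so it takes priority over MEM/WB. */
fwd_src forward_select(uint8_t rs, const pipe_reg *ex_mem, const pipe_reg *mem_wb) {
    if (rs == 0) return FWD_NONE;
    if (ex_mem->reg_write && ex_mem->rd == rs) return FWD_FROM_EX_MEM;
    if (mem_wb->reg_write && mem_wb->rd == rs) return FWD_FROM_MEM_WB;
    return FWD_NONE;
}
```

Call `forward_select` once per source operand each cycle. Branch flushes are a separate mechanism that clears the younger pipeline registers when a taken branch resolves.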
Lab D.11: Out-of-Order Simulator
Implement an OoO simulator: front-end fetch and rename; reservation stations; ROB; functional units with various latencies; in-order retirement. Run small programs through it and compare to in-order.
Teaches: Tomasulo's algorithm, register renaming, ROB management.
Tools: gem5 is a real research-grade simulator if you want to skip the implementation and study an existing one.
05. Group 5: Performance Profiling
Lab D.12: perf Profiling Tour
Take any non-trivial C/C++ program (a parser, a small simulator, etc.). Run:
```sh
perf stat -d ./prog                          # cache and basic stats
perf stat -e branches,branch-misses ./prog   # branch counts and mispredictions
perf record -g ./prog                        # sample with call graphs
perf report                                  # interactive call graph
perf annotate                                # see hot assembly
```
Identify the hottest function and the main bottleneck (instructions, cache, branches?).
Teaches: profiling workflow, reading perf counters, finding bottlenecks.
Lab D.13: Top-Down Analysis
Use Intel VTune (or AMD uProf, or perf stat -e with the right counters) to do a Top-Down Microarchitecture Analysis on a workload. Classify it as front-end-bound, back-end-bound, retiring, or bad-speculation.
Teaches: TMA methodology; how to read modern microarchitectural counters.
Lab D.14: Roofline Model
Pick a numerical kernel (matrix multiply, stencil, FFT). Compute its arithmetic intensity (FLOPs per byte). Plot vs. the machine's peak FLOP/s and peak memory bandwidth (the "roofline"). Determine whether the kernel is compute-bound or memory-bound. Optimize accordingly.
Teaches: arithmetic intensity, the memory wall, optimization strategy.
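The roofline itself is one line of arithmetic: attainable performance is the smaller of peak compute and arithmetic intensity times peak bandwidth. The sketch below encodes that, plus an idealized intensity for a naive n×n double-precision matrix multiply that counts only compulsory traffic (each matrix moved once). The function names and the traffic model are assumptions for illustration; real kernels move more data than this lower bound.

```c
#include <stddef.h>

/* Attainable FLOP/s = min(peak FLOP/s, intensity * peak bytes/s). */
double roofline_attainable(double peak_flops, double peak_bw_bytes, double intensity) {
    double memory_bound = intensity * peak_bw_bytes;
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}

/* Ridge point: the intensity (FLOPs per byte) where the two roofs meet. */
double roofline_ridge(double peak_flops, double peak_bw_bytes) {
    return peak_flops / peak_bw_bytes;
}

/* Idealized intensity of n x n x n double matmul:
   2n^3 FLOPs over 3 matrices of n^2 doubles, each moved once. */
double matmul_intensity(size_t n) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    double bytes = 3.0 * (double)n * (double)n * sizeof(double);
    return flops / bytes;
}
```

On a hypothetical machine with 1 TFLOP/s peak and 100 GB/s bandwidth, the ridge sits at 10 FLOPs per byte: a kernel at intensity 2 is capped at 200 GFLOP/s by memory, while one at intensity 50 can in principle reach the compute roof.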
06. Group 6: Assembly and Compilers
Lab D.15: godbolt Tour
Use Compiler Explorer (godbolt.org). Compile the same C function across:
- GCC vs. Clang
- x86-64 vs. AArch64 vs. RISC-V
- -O0 vs. -O1 vs. -O2 vs. -O3
- -march=skylake vs. -march=znver3 (x86-64; -mcpu is not the right flag there) vs. -mcpu=cortex-a78 (AArch64)
Note where they differ.
Teaches: ISA differences, compiler differences, optimization-flag effects.
Lab D.16: Hand-Written SIMD
Write a function (e.g., array sum, dot product) using x86-64 SSE intrinsics, then AVX2, then AVX-512. Compare against autovectorized scalar code. Measure.
Teaches: SIMD programming; the relationship between width and performance.
Extensions: rewrite in NEON for AArch64 or RVV for RISC-V (on QEMU).
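As a first step, an array sum with SSE intrinsics might look like the sketch below. The names `sum_scalar` and `sum_sse` are illustrative; the code uses unaligned loads so it works on any input pointer, and it falls back to the scalar loop when the compiler does not define `__SSE__`.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Baseline scalar sum. */
float sum_scalar(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* SSE sum: four partial sums in one 128-bit register, then a scalar tail. */
float sum_sse(const float *a, size_t n) {
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));   /* unaligned load */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; i++)                                /* leftover elements */
        sum += a[i];
    return sum;
#else
    return sum_scalar(a, n);
#endif
}
```

Because floating-point addition is not associative, the vector version can differ from the scalar one in the last bits on general data. That reordering is also why compilers will not autovectorize the scalar loop without a flag such as `-ffast-math`.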
Lab D.17: Assembly from Scratch
Write a complete program in pure assembly: hello world that uses the Linux write syscall directly. Then a slightly more complex program: read a file, capitalize each character, write to stdout.
Teaches: ABI, system calls, hand-rolled assembly mechanics.
07. Group 7: Operating-System and Architecture Interface
Lab D.18: Page Walker
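Write a small program that reads /proc/self/maps and /proc/self/pagemap to print the physical page frame each virtual page maps to. On current kernels, reading frame numbers requires CAP_SYS_ADMIN (run as root); without it the frame-number field reads as zero.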
Teaches: virtual memory layout, page tables, kernel exposure of MMU state.
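Each pagemap entry is a 64-bit word indexed by virtual page number. A minimal sketch of reading and decoding one entry follows, using the kernel's documented layout (bit 63 present, bit 62 swapped, bits 0-54 the page frame number when present). The struct and function names are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

typedef struct {
    int present;     /* page is resident in RAM */
    int swapped;     /* page is in swap */
    uint64_t pfn;    /* page frame number; 0 if absent or hidden by the kernel */
} pagemap_entry;

/* Decode one raw 64-bit pagemap word. */
pagemap_entry pagemap_decode(uint64_t raw) {
    pagemap_entry e;
    e.present = (int)((raw >> 63) & 1);
    e.swapped = (int)((raw >> 62) & 1);
    e.pfn = e.present ? (raw & ((UINT64_C(1) << 55) - 1)) : 0;
    return e;
}

/* Read the raw entry for the page containing vaddr.
   Returns 0 on success, -1 on any failure. */
int pagemap_lookup(const void *vaddr, uint64_t *raw_out) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return -1;
    /* One 8-byte entry per virtual page. */
    off_t off = (off_t)((uintptr_t)vaddr / (uintptr_t)page) * (off_t)sizeof(uint64_t);
    int ok = lseek(fd, off, SEEK_SET) == off &&
             read(fd, raw_out, sizeof *raw_out) == (ssize_t)sizeof *raw_out;
    close(fd);
    return ok ? 0 : -1;
}
```

Touch the page before looking it up (write a byte to it), since an untouched page has no physical frame yet. Multiplying the frame number by the page size gives the physical address of the frame.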
Lab D.19: Mini Bootloader
Write a 16-bit real-mode bootloader (for x86) that prints "Hello" and halts. Boot it via QEMU (qemu-system-x86_64 -fda boot.img).
Teaches: BIOS boot process (deprecated but still illustrative); real mode.
Extensions: progress to UEFI by writing an EDK2-based application that prints to the framebuffer.
Lab D.20: Tiny Kernel
Build a kernel that boots, sets up paging, enables interrupts, handles a timer interrupt, and runs a single user-mode process. RISC-V is a popular choice: start from xv6-riscv (MIT) or write one from scratch for QEMU's RISC-V virt machine.
Teaches: privilege transitions, interrupt handling, virtual memory setup, syscall mechanism.
Lab D.21: System-Call Tracer
Use ptrace (Linux) to write a strace-like tool that prints every syscall a child process makes.
Teaches: syscall mechanics, ptrace API, kernel-userspace interface.
08. Group 8: Concurrency and Memory Models
Lab D.22: Lock-Free Queue
Implement a single-producer, single-consumer lock-free queue using only atomic loads and stores. Test with multiple threads and ASan/TSan.
Teaches: memory ordering, atomic operations, the pitfalls of lock-free code.
Extensions: extend to MPMC; benchmark against a mutex-based queue; test on AArch64 and observe issues that don't appear on x86 due to weaker ordering.
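A minimal sketch of the single-producer, single-consumer ring buffer using C11 atomics. The capacity `QCAP`, the `int` payload, and the function names are assumptions for illustration. The release store on one index pairs with the acquire load of the same index on the other side, which is what makes the buffer contents visible.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QCAP 16   /* capacity; must be a power of two */

typedef struct {
    _Atomic size_t head;   /* next slot to read; written only by the consumer */
    _Atomic size_t tail;   /* next slot to write; written only by the producer */
    int buf[QCAP];
} spsc_queue;

void spsc_init(spsc_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side. Returns 1 on success, 0 if the queue is full. */
int spsc_push(spsc_queue *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP)
        return 0;
    q->buf[tail & (QCAP - 1)] = value;
    /* Release: the slot write above becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side. Returns 1 and stores into *out, or 0 if the queue is empty. */
int spsc_pop(spsc_queue *q, int *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return 0;
    *out = q->buf[head & (QCAP - 1)];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}
```

Weakening either release or acquire to relaxed is a useful experiment: it will typically still pass on x86-64 and may fail on AArch64. Note also that `head` and `tail` share a cache line here, so padding them apart is a natural optimization to measure.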
Lab D.23: Memory-Ordering Litmus Tests
Use litmus7 (or a hand-rolled equivalent) to run classical memory-model tests (store buffering, IRIW, etc.) on x86-64 and AArch64. Observe the differences.
Teaches: memory consistency models, why they matter.
Lab D.24: Read-Copy-Update
Implement a simplified RCU (Read-Copy-Update) for a linked list. Multiple readers traverse without locks; writers update via copy-then-replace.
Teaches: advanced concurrency; the relationship between hardware ordering and concurrent algorithms.
09. Group 9: GPUs and Accelerators
Lab D.25: First CUDA / HIP Kernel
Implement vector addition, then matrix multiplication, in CUDA. Compare to CPU. Then optimize: shared memory tiling, coalesced access, occupancy tuning.
Teaches: SIMT model, GPU memory hierarchy, kernel launch overhead.
Extensions: rewrite with cuBLAS; use Nsight to profile; port to HIP for AMD.
Lab D.26: cuBLAS vs. Naive
Compare your hand-written gemm to cuBLAS. Measure the gap. Read about how cuBLAS exploits tensor cores; write a tensor-core gemm using wmma or CUTLASS.
Teaches: the gap between naive and optimized GPU code; the importance of vendor libraries.
Lab D.27: Triton Kernel
Write a softmax or layernorm kernel in OpenAI's Triton DSL. Compare to a hand-written CUDA equivalent.
Teaches: high-level GPU programming; kernel autotuning.
10. Group 10: Embedded and Real-Time
Lab D.28: Bare-Metal Cortex-M Blink
On a Cortex-M0/M3/M4 board (Nucleo, BluePill), write a bare-metal C program (no RTOS) that toggles an LED via direct GPIO register manipulation. Measure the loop period with a logic analyzer.
Teaches: embedded boot sequence, memory-mapped I/O, the lack of an OS.
Lab D.29: FreeRTOS Tasks
On the same board, run two FreeRTOS tasks at different priorities. Demonstrate preemption.
Teaches: RTOS basics, task scheduling, context switching.
Lab D.30: WCET Analysis
Write a small loop with conditional branches and analyze its worst-case execution time:
- Theoretically, by counting instructions.
- Empirically, by running it many times.
- On a Cortex-M, by enabling the DWT cycle counter and recording timing (available on Cortex-M3 and later; the Cortex-M0 has no cycle counter).
Teaches: WCET concepts; the gap between average and worst case.
11. Group 11: Reconfigurable
Lab D.31: FPGA Hello-World
Use an inexpensive FPGA dev board (Lattice iCE40, Xilinx Spartan-7, Tang Nano) and the open-source toolchain (Yosys + nextpnr for iCE40 / ECP5). Implement a counter that drives an LED. Then a UART transmitter. Then a simple calculator.
Teaches: hardware description languages (Verilog, VHDL, or SpinalHDL), synthesis flow.
Lab D.32: Soft RISC-V Core
Synthesize an open-source RISC-V soft core (PicoRV32, VexRiscv, Ibex) onto your FPGA. Run a hello-world program through it.
Teaches: end-to-end CPU implementation; the simplicity of RISC-V.
Extensions: customize the core (add an instruction; add a peripheral); benchmark against the unmodified version.
12. Group 12: Reading Real Hardware
Lab D.33: Apple Silicon / Snapdragon Reverse Engineering
Read the publicly available Apple Silicon architecture analyses (e.g., AnandTech, the Asahi Linux project). Identify which microarchitectural features Apple's "Firestorm/Avalanche/Everest" cores have. Compare to Arm's reference Cortex-X cores.
Teaches: how to read modern SoC analysis; the relationship between published specs and reality.
Lab D.34: Intel/AMD Optimization Manual
Read selected sections of Intel's Optimization Reference Manual or AMD's equivalent. Pick a topic (memory disambiguation, branch prediction, micro-fusion) and write a short summary.
Teaches: how vendors document their microarchitecture; the level of detail available.
Lab D.35: Disassemble Production Code
Take a binary you use (any small CLI tool: ls, cat, a compiled Python interpreter). Disassemble part of it. Identify: prologue patterns, library calls, vectorized loops, branch-heavy code.
Teaches: reading real-world assembly; recognizing compiler patterns in the wild.
13. Group 13: Putting It All Together
Lab D.36: Benchmark Suite
Pick a workload (a compression algorithm, a parser, a sort). Implement it. Benchmark across:
- Compilers (GCC, Clang)
- Architectures (x86-64 vs. AArch64)
- Optimization levels
- Compile flags (-march=native vs. baseline)
- Vector intrinsics vs. autovectorized
Document the results.
Teaches: empirical performance methodology; the variability of "performance".
Lab D.37: Tiny C Compiler / JIT
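Build a small JIT compiler that takes simple expressions and emits x86-64 machine code at runtime. Allocate the code buffer with mmap, using protection PROT_READ | PROT_WRITE | PROT_EXEC and flags MAP_PRIVATE | MAP_ANONYMOUS (executability is a protection bit, not a MAP_ flag), and call into the generated code.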
Teaches: dynamic code generation; encoding x86-64 by hand; the JIT path.
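The smallest possible "JIT" emits a function that returns a constant. The sketch below, for x86-64 Linux only, hand-encodes `mov eax, imm32; ret`, copies it into an anonymous mapping, and calls it. The function name `jit_return_const` is hypothetical. It maps the memory writable first and switches it to executable with `mprotect`, a stricter variant of a single read-write-execute mapping that some hardened systems require.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Generate and run a function returning `value`.
   Returns 0 and stores the result in *out, or -1 if unsupported or on error. */
int jit_return_const(int32_t value, int32_t *out) {
#if defined(__x86_64__) && defined(__linux__)
    /* B8 imm32 = mov eax, imm32 ; C3 = ret */
    unsigned char code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    memcpy(&code[1], &value, sizeof value);   /* x86 is little-endian */

    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memcpy(mem, code, sizeof code);

    /* Flip the page from writable to executable before calling into it. */
    if (mprotect(mem, sizeof code, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof code);
        return -1;
    }

    /* Object-to-function pointer conversion: fine on POSIX systems. */
    int32_t (*fn)(void) = (int32_t (*)(void))mem;
    *out = fn();
    munmap(mem, sizeof code);
    return 0;
#else
    (void)value;
    (void)out;
    return -1;
#endif
}
```

From here the lab grows by emitting longer byte sequences: load arguments from `edi` and `esi` per the System V calling convention, then add arithmetic instructions as the expression tree is walked.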
Lab D.38: Architectural Capstone
Pick a topic that interests you and go deep:
- Implement a single-issue OoO simulator with branch prediction and a multi-level cache.
- Build a packet processor on an FPGA (parse Ethernet/IP, swap MAC addresses, send back).
- Port a small program to a non-mainstream ISA (write a Z80 or MOS 6502 emulator and run BASIC).
- Measure the energy efficiency of three different ISAs running the same workload (Apple Silicon vs. Intel Xeon vs. RISC-V SBC).
- Reproduce a paper from ISCA, MICRO, or HPCA — pick one with public data and try to replicate the results.
The capstone is yours to define.
14. Suggested Sequence
If you want a structured progression:
- Start with Lab D.3 (cache discovery) — easy to do, immediately revealing.
- Then Lab D.7 (branch predictor) — same machine, different angle.
- Then Lab D.12 (perf profiling tour) — gives you a permanent tool for everything.
- Then Lab D.15 (godbolt tour) — builds intuition for compiler output.
- Then pick from each major group based on interest.
For a semester-length course:
- Weeks 1-2: Labs D.1, D.2, D.3 (foundations + cache).
- Weeks 3-4: Labs D.4 (cache simulator), D.7-D.8 (branches).
- Weeks 5-6: Labs D.9-D.10 (ILP, pipelined simulator).
- Weeks 7-8: Labs D.12, D.13, D.14 (profiling, top-down, roofline).
- Weeks 9-10: Labs D.20 (tiny kernel) or D.25 (first CUDA).
- Weeks 11-12: Lab D.22 or D.31 (concurrency or FPGA).
- Weeks 13-14: Capstone (D.38).
15. Tools Reference
| Tool | Purpose | Platforms |
|---|---|---|
| GCC, Clang | Compilation, assembly output | All |
| objdump | Disassembly | All |
| perf | Profiling, hardware counters | Linux |
| Intel VTune | Profiling, top-down analysis | x86-64 (free) |
| AMD uProf | Profiling | x86-64 (free) |
| Apple Instruments | Profiling | macOS |
| gem5 | Research simulator | All (cross-platform) |
| QEMU | ISA emulation | All |
| godbolt.org / Compiler Explorer | Online assembly viewer | Web |
| Pin | Dynamic instrumentation | x86-64 |
| Valgrind / Cachegrind | Memory/cache simulation | Linux |
| ftrace, eBPF | Kernel tracing | Linux |
| Yosys, nextpnr | Open-source FPGA toolchain | All |
| Vivado, Quartus | Vendor FPGA toolchains | Windows / Linux |
| Nsight | NVIDIA GPU profiling | NVIDIA |
| RGP | AMD GPU profiling | AMD |
16. Closing
Architecture is a discipline of measurement. The labs in this appendix are meant to make the principles of the book tangible. After even a few of them, the abstractions of the main text — cache lines, pipeline hazards, memory barriers, branch prediction — stop being abstract and start being things you've personally watched cost or save microseconds.
Some of these labs are short hour-long exercises. Some can grow into research projects. Pick whichever fits your time and interest, and don't worry about doing them all. The benefit is in the engagement, not the completion.
Good luck.
This is the end of Computer Architecture from First Principles to Modern CPUs. Thank you for working through the material. The field continues to evolve rapidly; whatever you're studying or building, the foundations covered here should serve as ground that doesn't shift, even as new architectures, new accelerators, and new paradigms keep arriving.