Part VII: Advanced and Frontier

Embedded and Real-Time Systems

May 16, 2026·14 min read·advanced

Most CPUs in the world are not in laptops, phones, or servers. They are in microcontrollers — embedded in appliances, vehicles, industrial machinery, medical devices, sensors, toys, and countless other products. The annual production of embedded CPUs vastly exceeds the production of application processors. Their architectural concerns are different: deterministic latency over peak performance, low power over throughput, hard reliability over flexibility, fixed function over generality.

This chapter covers embedded and real-time systems: what makes them different, how their CPUs are designed, what real-time means, and how the embedded world interacts with the application-class world we've spent most of the book on.

01. What "Embedded" Means

The word embedded is defined more by context than by technology. An embedded system is a computer embedded inside a larger device, performing a specific function. The user does not perceive a computer; they perceive a thermostat, a car, a pacemaker.

Common attributes of embedded systems:

  • Specialized purpose: runs a fixed program (or small set of programs).
  • Limited resources: kilobytes to megabytes of memory; a few milliwatts to a few watts of power.
  • Deterministic behavior: the system must respond predictably to inputs.
  • Long service life: 10-25 years is normal; some industrial systems last 50+ years.
  • Robust to environment: temperature extremes, vibration, EMI, radiation.
  • Cost-sensitive: pennies matter when shipping millions of units.
  • Often safety- or mission-critical: failures have real-world consequences.

The CPUs in these systems span a wide range:

  • 8-bit microcontrollers (8051, AVR, PIC): still in production after decades. Tens of MHz, kilobytes of RAM, often under a dollar in volume.
  • 16-bit (MSP430, some PIC): niche; mostly displaced by 32-bit ARM.
  • 32-bit microcontrollers (ARM Cortex-M, RISC-V, MIPS, others): the dominant category today. Tens to hundreds of MHz, tens of KB to a few MB of RAM, from under a dollar to several dollars in volume.
  • Application-class embedded (Cortex-A, RISC-V S-mode): Linux-capable embedded; smart appliances, industrial control gateways, automotive infotainment.

This chapter focuses on the microcontroller end and the real-time aspects that make it distinct.

02. The Cortex-M Family

ARM's Cortex-M series dominates the 32-bit microcontroller market. The lineup:

  • Cortex-M0 / M0+: minimal cores; ARMv6-M ISA (Thumb only); tens of MHz; very low gate counts (on the order of ten thousand gates in a minimal configuration). Used in low-end MCUs, sensors, secure elements.
  • Cortex-M3 / M4: ARMv7-M with broader ISA, optional FPU and DSP extensions. The workhorse for general MCU applications.
  • Cortex-M7: dual-issue, higher frequency (typically up to ~600 MHz, with a few parts clocked higher); aggressive for an MCU.
  • Cortex-M23 / M33: ARMv8-M with TrustZone-M; security-focused.
  • Cortex-M55 / M85: ARMv8.1-M with Helium (M-Profile Vector Extension) for ML and DSP on MCUs.

Cortex-M cores share characteristics that make them suitable for embedded use:

  • Thumb-only ISA: 16-bit / 32-bit instructions for code density.
  • No MMU: instead, an optional MPU (Memory Protection Unit) for region-based protection.
  • Fast interrupt entry: hardware automatically saves a subset of registers (R0-R3, R12, LR, PC, xPSR) on exception entry, allowing immediate handler execution.
  • Tail-chaining: pending interrupts can chain without unwinding the saved state, saving cycles.
  • Deterministic instruction timing (mostly): few instructions have data-dependent latency.
  • Bit-banding (optional on M3/M4; not available on M7): a memory region where each bit is aliased to a 32-bit word, enabling atomic single-bit access.
  • NVIC (Nested Vectored Interrupt Controller): up to 240 external interrupts on M3/M4/M7 (up to 32 on M0/M0+), with priorities and nesting.

The combination delivers interrupt latency on the order of a dozen cycles (well under a microsecond at typical clock rates), predictable behavior, low power, and full 32-bit performance, all in a tiny silicon area.
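To make the bit-banding bullet concrete, here is a minimal Python sketch of the alias address arithmetic. It assumes the standard Cortex-M3/M4 memory map (1 MB bit-band regions at 0x20000000 and 0x40000000, aliased at 0x22000000 and 0x42000000); the function name `bitband_alias` is illustrative and not part of any vendor library.

```python
# Sketch: compute the bit-band alias address for one bit (Cortex-M3/M4 map assumed).
SRAM_BASE, SRAM_ALIAS = 0x2000_0000, 0x2200_0000
PERIPH_BASE, PERIPH_ALIAS = 0x4000_0000, 0x4200_0000
REGION_SIZE = 0x0010_0000  # each bit-band region covers 1 MB


def bitband_alias(byte_addr: int, bit: int) -> int:
    """Return the alias word address for bit `bit` of the byte at `byte_addr`."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    for base, alias in ((SRAM_BASE, SRAM_ALIAS), (PERIPH_BASE, PERIPH_ALIAS)):
        if base <= byte_addr < base + REGION_SIZE:
            # Each byte expands to 32 alias bytes; each bit to one 4-byte word.
            return alias + (byte_addr - base) * 32 + bit * 4
    raise ValueError("address is outside both bit-band regions")


print(hex(bitband_alias(0x2000_0001, 3)))
```

On hardware that implements the feature, a store of 1 or 0 to the returned alias word sets or clears that single bit without a software read-modify-write sequence.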

03. The Embedded RISC-V Story

RISC-V has rapidly gained ground in the microcontroller market. The advantages:

  • No royalty: critical for low-margin parts.
  • Customizable: vendors add custom instructions for their domain (DSP, crypto, security).
  • Open specification: the ISA spec is published under a permissive license, so anyone can implement it.

RISC-V microcontrollers from Espressif (ESP32-C series), GigaDevice (GD32V), WCH (CH32V), SiFive, and many others compete directly with Cortex-M, while Microchip's PIC64 family targets the 64-bit end.

A typical 32-bit RISC-V MCU implements RV32IMC (integer + multiply/divide + compressed) plus optional extensions. The Zicsr extension provides system control and status registers, similar in role to Cortex-M's special registers. Interrupts use the M-mode trap mechanism (Chapter 44). Some chips add the CLIC (Core-Local Interrupt Controller) for vectored, priority-based interrupts comparable to NVIC.
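The "C" in RV32IMC means 16-bit and 32-bit instructions are mixed in one stream, and a decoder tells them apart from the low bits of the first halfword. The following is a small Python sketch of that length rule as defined by the base ISA encoding; the function name is illustrative.

```python
def rv_instruction_length(first_halfword: int) -> int:
    """Length in bytes of a RISC-V instruction, given its first 16 bits.

    Handles only the 16-bit (compressed) and 32-bit encodings that an
    RV32IMC core executes; longer encodings are rejected.
    """
    if first_halfword & 0b11 != 0b11:
        return 2  # low two bits are not 11: compressed instruction
    if first_halfword & 0b11100 != 0b11100:
        return 4  # low bits 11, bits [4:2] not 111: standard 32-bit
    raise ValueError("encoding longer than 32 bits")


print(rv_instruction_length(0x0001))  # c.nop
print(rv_instruction_length(0x0013))  # low half of addi x0, x0, 0
```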

For real-time work, determinism can be easier to verify on a simple RISC-V core, which has fewer microarchitectural surprises. This is a property of the specific implementation rather than of the ISA itself.

04. Real-Time Systems

A real-time system is one in which correctness depends not only on producing the right answer but on producing it within a deadline. The deadlines come from physics: a controller for an inverter must update its outputs before the next switching cycle; an airbag must inflate within milliseconds of impact; a quadcopter's stabilization loop must run at hundreds of Hz to stay aloft.

Real-time is not the same as fast. A system that always responds in 100 ms is real-time if the deadline is 200 ms. A system that responds in 1 ms most of the time but occasionally takes 500 ms is not real-time against that same deadline, because the worst case violates it.
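The two systems in that comparison can be written down directly. This sketch uses made-up response-time samples that mirror the numbers above; note that measured samples only show that a deadline was missed, they cannot prove it never will be.

```python
def meets_deadline(response_times_ms, deadline_ms):
    """A deadline is met only if the slowest observed response meets it."""
    return max(response_times_ms) <= deadline_ms


def mean(values):
    return sum(values) / len(values)


# Hypothetical samples: one slow-but-steady system, one fast-but-bursty one.
steady = [100.0] * 1000
bursty = [1.0] * 999 + [500.0]

for name, samples in (("steady", steady), ("bursty", bursty)):
    print(name, "mean:", mean(samples), "ok:", meets_deadline(samples, 200.0))
```

The bursty system has by far the better average, yet it is the one that fails the 200 ms deadline.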

Hard, Firm, Soft

Three categories:

Hard real-time: missing a deadline is catastrophic. Avionics flight control, automotive safety systems, medical infusion pumps. Must always meet deadlines; designed and verified to do so.

Firm real-time: missing occasional deadlines is acceptable but degrades quality. Streaming media, robotic motion control. Late results are useless but not dangerous.

Soft real-time: deadlines are best-effort. UI responsiveness, web servers. Misses cause user dissatisfaction but no harm.

The architectural and verification effort scales with the category. Hard real-time systems may require formal verification of timing; soft real-time systems are tuned and tested.

05. What Makes Real-Time Hard

The architectural features that make CPUs fast on average can introduce unbounded variability:

Caches: a hit takes nanoseconds; a miss takes hundreds of nanoseconds. The worst-case execution time (WCET) of any code that touches memory must assume cache misses.

Branch prediction: a correct prediction costs almost nothing; a misprediction costs many cycles. Unless the analysis can prove how the predictor behaves, WCET must assume misprediction.

Out-of-order execution: improves average performance but complicates timing analysis.

TLB: a TLB miss triggers a page-table walk, which adds further cycles that must be bounded.

Memory controller scheduling: DRAM accesses can be reordered for bandwidth, increasing variability.

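Interrupts: a handler that preempts a task adds its own execution time to that task's response time, and critical sections that mask interrupts delay the handlers in turn.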

Shared resources in multi-core: cache coherence, bus contention, shared memory controllers all introduce inter-core interference.

For hard real-time, the WCET must be analytically bounded. This drives toward simpler architectures: in-order pipelines, deterministic caches (or no cache), no branch prediction (or predictable patterns), no aggressive memory reordering.

06. Worst-Case Execution Time Analysis

WCET analysis is its own field. Approaches:

Static analysis: examine the code (and the architecture) and compute upper bounds on execution time for each path. Tools: aiT (AbsInt), Bound-T, Heptane.

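Measurement-based: run the code with inputs believed to be worst-case, measure, and argue that the observed maximum (usually plus a safety margin) bounds the true worst case. Cheaper, but less reliable, because the true worst-case path may never be exercised.

Hybrid: combine static analysis of the program's path structure with measured execution times for short code segments.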

For a Cortex-M3 running deterministic code (no caching, no branch prediction), WCET analysis is tractable: count cycles for each instruction, sum the longest path. For a Cortex-M7 with branch prediction and a small cache, harder. For a Cortex-A class CPU with full OoO and large caches, very hard — generally only safe with extensive workload-specific testing.
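The "count cycles, sum the longest path" idea can be sketched as a longest-path search over a control-flow graph. The Python below is a toy model, not how aiT or similar tools work internally: the block names and cycle costs are invented, the graph is assumed to be acyclic, and the loop is folded into a single block whose cost is its per-iteration cost times an assumed iteration bound of 8.

```python
from functools import lru_cache


def wcet_cycles(cost, successors, entry):
    """Longest-path cycle count from `entry` through an acyclic CFG."""

    @lru_cache(maxsize=None)
    def longest(block):
        nexts = successors.get(block, ())
        # Worst case: this block plus the most expensive continuation.
        return cost[block] + max((longest(n) for n in nexts), default=0)

    return longest(entry)


# Hypothetical per-block cycle counts for a small function.
cost = {"entry": 4, "then": 10, "else": 3, "loop_x8": 8 * 6, "exit": 2}
successors = {
    "entry": ("then", "else"),
    "then": ("loop_x8",),
    "else": ("loop_x8",),
    "loop_x8": ("exit",),
}

print(wcet_cycles(cost, successors, "entry"))
```

The bound is only as good as the per-block costs, which is why caches and branch predictors make the analysis harder: they make a block's cost depend on execution history.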

07. Real-Time Operating Systems

A real-time operating system (RTOS) prioritizes deterministic latency over the throughput optimizations of a general-purpose OS. Key features:

Priority-based preemptive scheduling: the highest-priority ready task runs immediately; lower-priority tasks are preempted on demand.

Bounded latency: the time from an event (interrupt, message) to the corresponding task running is bounded.
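One common way to check such a bound under fixed-priority preemptive scheduling is classical response-time analysis, which the chapter does not cover in detail. The sketch below assumes independent periodic tasks, deadlines equal to periods, no blocking, and zero scheduler overhead; the task set is invented and times are integers in arbitrary units.

```python
def response_time(tasks, i):
    """Worst-case response time of tasks[i], or None if it misses its period.

    `tasks` is a list of (wcet, period) tuples sorted from highest to
    lowest priority. Iterates R = C_i + sum(ceil(R / T_j) * C_j) over all
    higher-priority tasks j until R stops changing.
    """
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        # -(-r // t_j) is integer ceiling division.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:i])
        r_next = c_i + interference
        if r_next > t_i:
            return None  # deadline (assumed equal to period) is exceeded
        if r_next == r:
            return r
        r = r_next


# Hypothetical task set: (worst-case execution time, period).
tasks = [(1, 4), (2, 6), (3, 12)]
print([response_time(tasks, i) for i in range(len(tasks))])
```

The lowest-priority task finishes within 10 time units in the worst case even though its own work is only 3, because higher-priority tasks preempt it several times.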

Priority inheritance: when a low-priority task holds a resource a high-priority task needs, the low-priority task temporarily inherits the higher priority, so medium-priority tasks cannot preempt it. This bounds the duration of priority inversion.

Small footprint: kernels in tens of kilobytes, sometimes less.

Deterministic memory management: typically no virtual memory, no on-demand paging, fixed-size pools rather than general malloc.
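A fixed-size pool is simple enough to sketch in a few lines. The Python model below illustrates the idea only (an RTOS would implement it in C over a static array); the class and method names are invented. Allocation and release are constant-time and cannot fragment, which is what makes their timing predictable.

```python
class BlockPool:
    """Fixed-size block allocator over one statically sized buffer."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self._storage = bytearray(block_size * block_count)
        self._free = list(range(block_count))  # stack of free block indices

    def alloc(self):
        """Return a free block index, or None when the pool is exhausted."""
        return self._free.pop() if self._free else None

    def free(self, index):
        """Return a block to the pool (no double-free check in this sketch)."""
        self._free.append(index)

    def view(self, index):
        """Writable view of one block's bytes."""
        start = index * self.block_size
        return memoryview(self._storage)[start:start + self.block_size]


pool = BlockPool(block_size=32, block_count=2)
first, second = pool.alloc(), pool.alloc()
print(first, second, pool.alloc())  # third allocation fails predictably
```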

Common RTOSes:

  • FreeRTOS: open-source, widely used on Cortex-M and small RISC-V. Acquired by AWS in 2017.
  • Zephyr: Linux Foundation open-source RTOS; broad architecture support; Bluetooth, networking, drivers; widely adopted.
  • VxWorks: Wind River; aerospace, defense, industrial.
  • QNX: BlackBerry; automotive, medical.
  • ThreadX / Azure RTOS: acquired by Microsoft in 2019, then contributed to the Eclipse Foundation as Eclipse ThreadX.
  • RTEMS: open-source, used in space missions.
  • uC/OS-III: Micrium / Silicon Labs.
  • mbed OS, NuttX, RIOT: open-source alternatives.

Some POSIX-flavored RTOSes (QNX, VxWorks with POSIX API) ease porting between Linux and real-time. Others have their own APIs.

The Linux Real-Time Patchset

For workloads that need real-time behavior but want full Linux capabilities, the PREEMPT_RT patchset (its core merged into mainline Linux as of version 6.12) makes the kernel preemptible at most points, converts most spinlocks to sleepable locks with priority inheritance, and reduces interrupt-disabled regions. The result: bounded latency on the order of tens of microseconds on suitable hardware, with full Linux above.

Used in industrial automation (factory robots), audio production, some automotive systems. Not as deterministic as a true RTOS but vastly more flexible.

08. Power Constraints

Many embedded systems run on batteries or harvested power. The system must do its work and then sleep — for as long as possible.

Sleep Modes

A typical MCU has multiple sleep states:

  • Run mode: full speed.
  • Sleep: CPU clock stopped; peripherals running.
  • Deep sleep: most clocks stopped; only specific wake sources active (RTC, certain pins).
  • Standby / Stop: most of chip powered off; SRAM may or may not be retained.
  • Shutdown: only RTC and reset logic active.

Power consumption ranges over four orders of magnitude: a Cortex-M0+ might draw 1 mA in run mode at 16 MHz, 100 µA in sleep, 1 µA in deep sleep, 100 nA in shutdown.

For battery-powered devices (sensors, wearables, IoT), the duty cycle of run vs. sleep dominates energy consumption. Architectures and software cooperatively minimize active time:

  • Wake briefly (microseconds) on event.
  • Process the event quickly.
  • Return to deepest sleep.
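The effect of duty cycle on battery life is plain arithmetic. The sketch below reuses the illustrative currents quoted earlier (1 mA in run mode, 1 µA in deep sleep) and assumes a 220 mAh cell, a figure chosen only for the example; it ignores self-discharge, wake-up transients, and regulator losses.

```python
def average_current_a(i_run_a, i_sleep_a, duty):
    """Time-weighted average current for a given fraction of time awake."""
    return duty * i_run_a + (1.0 - duty) * i_sleep_a


def battery_life_hours(capacity_mah, avg_current_a):
    """Idealized runtime: capacity divided by average draw."""
    return capacity_mah / (avg_current_a * 1000.0)


I_RUN = 1e-3      # 1 mA, run mode
I_SLEEP = 1e-6    # 1 uA, deep sleep
CAPACITY = 220.0  # mAh, assumed cell capacity

for duty in (1.0, 0.01, 0.001):
    avg = average_current_a(I_RUN, I_SLEEP, duty)
    years = battery_life_hours(CAPACITY, avg) / (24 * 365)
    print(f"duty {duty:>6}: {avg * 1e6:8.2f} uA average, {years:6.2f} years")
```

At a 0.1% duty cycle the run and sleep contributions are about equal, so beyond that point lowering the sleep current matters as much as shortening the active time.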

Some MCUs include "FlexClock" or similar features: peripherals can run while the CPU is off and wake it only when they detect something interesting. ARM's "Sleep on Exit" lets the core return directly to sleep when the last pending interrupt handler finishes, without passing back through the main loop.

Energy Harvesting

A growing class: devices that harvest energy from the environment — solar, thermal gradients, vibration, RF — and run intermittently when energy is available. RFID tags, structural sensors, some medical implants.

These have unique computational needs: state must persist through power loss, computation must be checkpoint-able, the program must adapt to unpredictable energy availability.
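A toy model of checkpointed computation shows the shape of such programs. Everything here is a simulation under stated assumptions: the `nvm` dictionary stands in for FRAM or flash, `energy_budget` is an invented knob for how many steps run before power fails, and the commit is treated as atomic, which a real device has to arrange explicitly (for example with double buffering).

```python
class PowerLoss(Exception):
    """Simulated loss of power in the middle of a computation."""


def checkpointed_sum(samples, nvm, energy_budget):
    """Sum `samples`, committing progress to `nvm` after every element."""
    index = nvm.get("index", 0)  # resume from the last committed state
    total = nvm.get("total", 0)
    while index < len(samples):
        if energy_budget == 0:
            raise PowerLoss
        total += samples[index]
        index += 1
        nvm.update(index=index, total=total)  # checkpoint (assumed atomic)
        energy_budget -= 1
    return total


def run_until_done(samples, budgets):
    """Retry across simulated power cycles; None if energy runs out first."""
    nvm = {}  # survives each simulated power loss
    for budget in budgets:
        try:
            return checkpointed_sum(samples, nvm, budget)
        except PowerLoss:
            continue
    return None


print(run_until_done([1, 2, 3, 4, 5], budgets=[2, 0, 1, 5]))
```

Checkpointing after every element is the most conservative choice; real systems trade checkpoint cost against the amount of work lost per outage.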

09. Safety-Critical Systems

For systems where failure is dangerous, design and verification go beyond ordinary engineering:

Formal methods: mathematical proofs of correctness for critical components.

Lockstep redundancy: dual cores running the same program, with their outputs compared every cycle. Disagreement triggers a safety reaction. Used in automotive (ARM Cortex-R5F and Cortex-R52 lockstep modes; PowerPC e200 dual-core lockstep).

Triple modular redundancy: three identical units; majority voting catches single-point failures. Used in space and avionics.
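The voter at the heart of triple modular redundancy is a bitwise majority function. This Python sketch models it on integers; in hardware it is typically a few gates per bit, and the function names here are illustrative.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three words: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_vote_with_flag(a, b, c):
    """Return (voted value, True if any unit disagreed)."""
    return tmr_vote(a, b, c), not (a == b == c)


# One unit suffers bit flips; the other two outvote it.
print(tmr_vote_with_flag(0b1010, 0b1010, 0b0110))
```

The disagreement flag matters in practice: a masked fault should still be logged, since a second fault in another unit would no longer be correctable.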

Diverse implementation: two different teams write the software with different tool chains; results compared at runtime. Catches design bugs.

Watchdog timers: hardware that resets the system if not regularly kicked.
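A watchdog's behavior can be modeled as a down-counter that software must reload before it reaches zero. The class below is a simulation with invented names, not a driver for any particular MCU's watchdog peripheral.

```python
class Watchdog:
    """Simulated watchdog: expires if not kicked within `timeout_ticks`."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = timeout_ticks
        self.resets = 0

    def kick(self):
        """Reload the counter; healthy firmware calls this periodically."""
        self.counter = self.timeout_ticks

    def tick(self):
        """Advance one timer tick; return True if this tick forced a reset."""
        self.counter -= 1
        if self.counter == 0:
            self.resets += 1
            self.counter = self.timeout_ticks  # counter restarts after reset
            return True
        return False


wd = Watchdog(timeout_ticks=3)
for _ in range(10):  # healthy main loop: kick on every iteration
    wd.kick()
    wd.tick()
print("resets while healthy:", wd.resets)
```

If the main loop hangs and the kicks stop, the third unanswered tick triggers the reset.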

Memory protection: MPU enforces task isolation even without virtual memory.

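Certification: standards like ISO 26262 (automotive), DO-178C (avionics), IEC 62304 (medical), IEC 61508 (industrial) impose strict process and verification requirements. Each defines its own integrity levels (ASIL in ISO 26262, SIL in IEC 61508, DAL in DO-178C, safety classes in IEC 62304), and hardware and software must be developed and assessed to the level the application demands.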

The cost: developing safety-certified software can cost an order of magnitude more than equivalent ordinary software. The market reflects this: certified RTOSes, certified compilers, and certified hardware components are all expensive.

10. Security in Embedded

A relatively recent concern: IoT devices and industrial control systems have become attack targets. Security in embedded:

Secure boot: chained verification from on-die ROM to application code.

TrustZone-M (ARMv8-M): hardware-enforced separation between secure and non-secure code, allowing trusted services on a microcontroller.

PSA (Platform Security Architecture): ARM's framework for IoT security; defines APIs for crypto, attestation, secure storage.

Hardware root of trust: dedicated secure elements (e.g., NXP EdgeLock, Microchip ATECC) provide key storage and crypto with anti-tamper features.

Secure firmware update: signed updates with rollback protection; A/B partition schemes for safe upgrade.
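The two checks in a secure update, authenticity and rollback protection, can be sketched with the Python standard library. One loud assumption: HMAC-SHA256 with a shared key stands in for the signature here. Real devices normally verify an asymmetric signature (so the device holds only a public key), and the function names and 4-byte version field are invented for the example.

```python
import hashlib
import hmac


def sign_update(image: bytes, version: int, key: bytes) -> bytes:
    """Vendor side: authenticate the version number together with the image."""
    message = version.to_bytes(4, "big") + image
    return hmac.new(key, message, hashlib.sha256).digest()


def accept_update(image, version, tag, key, installed_version):
    """Device side: accept only authentic images newer than what is installed."""
    expected = sign_update(image, version, key)
    if not hmac.compare_digest(expected, tag):
        return False  # image or version was tampered with
    return version > installed_version  # rollback protection


key = b"\x01" * 32  # hypothetical device key
image = b"firmware-bytes"
tag = sign_update(image, 5, key)
print(accept_update(image, 5, tag, key, installed_version=4))
```

Binding the version number into the authenticated message is what stops an attacker from relabeling an old, vulnerable image as a new one.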

The IoT security landscape has been embarrassing: billions of devices with default passwords, unpatched vulnerabilities, and no update mechanism. Standards and regulations like ETSI EN 303 645, the EU Cyber Resilience Act, and NIST IR 8259 in the US are tightening requirements.

11. Communication Buses

Embedded systems live on buses different from those in PCs:

I²C: two-wire low-speed bus; sensors, EEPROMs, simple peripherals.

SPI: synchronous serial; faster than I²C; flash memory, displays, ADCs.

UART: classic asynchronous serial; debugging, GPS modules, simple comms.

CAN / CAN-FD: automotive-grade differential bus; up to 1 Mbps (classic CAN), typically 5-8 Mbps in the CAN-FD data phase; reliable, multi-master, used in cars and industrial equipment.

LIN: low-speed automotive bus; cheaper than CAN, used for non-critical functions.

Modbus: industrial control protocol over RS-485 or TCP.

Industrial Ethernet (Profinet, EtherCAT, Ethernet/IP): time-critical Ethernet variants for factory automation.

MIPI (CSI for cameras, DSI for displays, I3C): mobile-industry interfaces.

1-Wire: parasite-powered single-wire bus for very simple sensors (e.g., temperature ICs).

The selection of bus depends on bandwidth, distance, noise environment, power, cost, and protocol compatibility with existing infrastructure.
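As one concrete piece of the protocols listed above, Modbus RTU frames on RS-485 end with a CRC-16 so that receivers can detect corruption on a noisy line. The sketch below implements that CRC bit by bit (initial value 0xFFFF, reflected polynomial 0xA001, low byte transmitted first); the helper name `append_crc` is illustrative.

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16 as used by Modbus RTU, computed one bit at a time."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def append_crc(payload: bytes) -> bytes:
    """Build a frame: payload followed by the CRC, low byte first."""
    return payload + crc16_modbus(payload).to_bytes(2, "little")


# Read-holding-registers request: unit 1, function 3, address 0, count 10.
frame = append_crc(bytes([0x01, 0x03, 0x00, 0x00, 0x00, 0x0A]))
print(frame.hex(" "))
```

A receiver can run the same function over the whole frame, CRC bytes included, and treat any nonzero result as a corrupted frame. Production code usually replaces the inner loop with a 256-entry lookup table.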

12. DSP and Signal Processing

Many embedded systems process signals: audio, control loops, sensor fusion, RF. Specialized features:

MAC instructions: multiply-accumulate in single cycles, fundamental for FIR/IIR filters and FFTs.

Saturating arithmetic: prevents overflow wraparound that would cause discontinuities in signals.

Fractional / fixed-point: Q15, Q31 formats for non-floating-point DSP.
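Q15 and saturation are easiest to see with numbers. This Python model represents Q15 values as plain integers in the range -32768 to 32767 (value = integer / 32768); the helper names are illustrative, and on a DSP-capable core these operations map to dedicated saturating instructions instead.

```python
Q15_MAX, Q15_MIN = 32767, -32768


def saturate_q15(x):
    """Clamp to the Q15 range instead of wrapping around."""
    return max(Q15_MIN, min(Q15_MAX, x))


def float_to_q15(x):
    """Convert a float in roughly [-1, 1) to Q15, saturating at the ends."""
    return saturate_q15(round(x * 32768))


def q15_add(a, b):
    return saturate_q15(a + b)


def q15_mul(a, b):
    """Product has 30 fractional bits; shift back to 15 and saturate."""
    return saturate_q15((a * b) >> 15)


half = float_to_q15(0.5)
print(q15_mul(half, half))         # 0.25 in Q15
print(q15_add(30000, 10000))       # clamps at the positive limit
print(q15_mul(Q15_MIN, Q15_MIN))   # -1 * -1 cannot be represented; clamps
```

With wraparound arithmetic, the addition above would produce a large negative number, which in an audio or control signal shows up as a sharp discontinuity.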

Circular buffer addressing: hardware-assisted modular addressing for filter delay lines.
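In software, circular addressing is a modulo on the index, which is the work that hardware-assisted addressing removes from the inner loop. The sketch below is a plain floating-point FIR filter with a circular delay line; the class name is invented, and a real DSP implementation would use fixed-point MAC instructions.

```python
class FirFilter:
    """FIR filter whose delay line is a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)
        self.head = 0  # index of the most recent sample

    def step(self, sample):
        """Push one input sample and return one output sample."""
        n = len(self.delay)
        self.head = (self.head + 1) % n  # advance and wrap; no data is moved
        self.delay[self.head] = sample
        acc = 0.0
        for k, c in enumerate(self.coeffs):
            # coeffs[k] multiplies the sample from k steps ago.
            acc += c * self.delay[(self.head - k) % n]
        return acc


fir = FirFilter([1.0, 2.0, 3.0])
print([fir.step(x) for x in (1.0, 0.0, 0.0, 0.0)])  # impulse response
```

Feeding an impulse returns the coefficients in order, which is a quick sanity check that the wraparound indexing is correct.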

Helium / MVE on Cortex-M55, M85: SIMD vector instructions for ML and DSP on microcontrollers.

Specialized DSPs: TI C5000 / C6000, Cadence Tensilica, Qualcomm Hexagon — full DSP architectures, sometimes integrated as a coprocessor in larger SoCs.

The line between general-purpose MCU and DSP has blurred. Modern Cortex-M4/M7 with DSP extensions handle most embedded DSP work; standalone DSP chips persist where peak performance per watt is critical (cellular baseband, automotive radar).

13. Industrial vs. Consumer

Even within embedded, there's a cleavage:

Consumer: phones, wearables, smart-home devices. High-volume, fast turnover, moderate reliability requirements, low margins.

Industrial: factory controllers, medical devices, automotive, infrastructure. Long product lifecycles (10-25 years of supply), strict reliability and certification, higher margins.

Industrial parts often feature:

  • Wide temperature ranges (-40 to +85°C, sometimes -55 to +125°C).
  • Extended supply commitments (15-20 year availability).
  • Lower clock speeds, simpler architectures, mature processes.
  • Better documentation, longer-term support.

A Cortex-M0+ at 32 MHz on a 130 nm process is uncool but reliable, available, and adequate for many industrial controllers.

14. The Embedded Toolchain

Embedded development uses a distinct toolchain:

  • Cross-compilers: arm-none-eabi-gcc, riscv32-unknown-elf-gcc, IAR, Keil ARM Compiler, Green Hills.
  • Debuggers: J-Link, ST-Link, OpenOCD, Lauterbach TRACE32.
  • Programmers / flashers: in-circuit programming via SWD (ARM), JTAG, or vendor-specific.
  • RTOS-aware debugging: visualize task states, queues, semaphores.
  • Logic analyzers and oscilloscopes: for hardware debug.
  • Hardware-in-the-loop (HIL) test: simulate the surrounding environment.

The development cycle is slower than application software — physical hardware is involved, debugging is more invasive, certification adds documentation overhead. Compensations: the code is smaller, the system simpler, the constraints clearer.

15. Summary

Embedded and real-time systems account for the dominant share of CPUs by volume, but their architectural concerns differ from those of laptops and servers. Determinism, low power, robustness, and cost dominate over peak performance. Cortex-M and increasingly RISC-V serve the 32-bit microcontroller market; specialized DSPs and FPGAs fill niches. RTOSes provide bounded latency and predictable scheduling; PREEMPT_RT extends real-time capability into Linux.

Real-time correctness depends on bounded worst-case timing, not just average performance — leading to architectural choices that look unfashionable from a server-CPU perspective: in-order pipelines, no caches, no branch prediction, deterministic memory access. These choices make WCET analysis tractable.

Safety-critical systems add layers of redundancy, certification, and verification that change the engineering economics significantly. Power-constrained systems live and die by sleep mode efficiency. Industrial systems prize longevity over freshness.

Embedded is an enormous and underappreciated part of computer architecture. Most CPUs in the world live here.

The final chapter of the main text looks at reconfigurable and emerging architectures: FPGAs, neuromorphic computing, quantum, and other paths that may shape the next decades of computing.
