Part VI·Systems and Software·Chapter 46 of 62

The OS Interface

May 16, 2026·17 min read·advanced

This chapter examines the hardware-software interface between the CPU and the operating system. Earlier chapters touched on this in passing — Chapter 15 introduced exceptions and traps; Chapters 34, 39, and 44 covered system architecture for x86-64, AArch64, and RISC-V respectively. Now we step back and consider the OS interface as a unified topic: what the CPU provides to the OS, how the OS uses those facilities, and how user programs interact with the kernel.

The OS interface is where hardware and software meet most intimately. The contract here is older than any specific OS — Unix and Multics established the basic shape in the 1960s and 1970s, and modern systems are recognizable descendants. The CPU provides privilege levels, memory protection, traps, and interrupts; the OS uses these to multiplex hardware among processes, isolate them from each other, and mediate access to shared resources.

01. What the OS Needs from the Hardware

A general-purpose OS like Linux, FreeBSD, Windows, or macOS depends on a small but critical set of hardware features:

Privilege separation. At least two privilege levels — user and supervisor — so that user code cannot directly manipulate hardware or other processes' memory. All three of our reference ISAs provide this: x86-64 rings 0/3, AArch64 EL0/EL1, RISC-V U/S modes.

Memory protection. A way to give each process its own address space, so processes cannot read or write each other's memory. This is virtual memory, with hardware page tables and a TLB. Without it, multitasking with isolation is impractical.

Traps and interrupts. A mechanism for the CPU to enter the kernel on faults (page fault, illegal instruction, arithmetic error), on system calls (deliberate kernel entry from user code), and on external events (timer, I/O completion).

Atomic operations. For synchronization between threads and between the kernel and user code: compare-and-swap, atomic increment, load-linked/store-conditional. Chapter 30 covered these.

Timer. A regular interrupt, typically a periodic tick or a programmable next-event timer, so the OS can preempt running threads and update its sense of time.

Inter-processor interrupts. On multi-core systems, the ability for one core to signal another, so the kernel can synchronize across cores (TLB shootdowns, scheduling decisions).

Cache and memory ordering primitives. Barriers and explicit cache management for the cases where the kernel needs precise control: starting another core, modifying executable code, performing DMA.

Performance counters. For introspection, profiling, and adaptive scheduling.

All three reference ISAs provide these features. The interface is universal in spirit; only the details differ.

02. Address Spaces

Every process has an address space. The kernel also has its own address space. On modern systems the two are typically merged: the process's view of memory has the user portion (low addresses) and a kernel portion (high addresses) that is only accessible when the CPU is in privileged mode. This merging means a system call can transition from user to kernel without changing the page table — the same translations are still valid, just with different access rights.

The split is conventional:

Linux x86-64: with 4-level paging (48-bit virtual addresses), user space is the lower half, 0x0 to 0x00007FFFFFFFFFFF (47 bits); kernel space is the upper canonical half, from 0xFFFF800000000000 upward.
Linux AArch64: typically 48-bit virtual addresses; user TTBR0_EL1 covers 0x0 to 0x0000FFFFFFFFFFFF; kernel TTBR1_EL1 covers 0xFFFF000000000000 upward.
Linux RISC-V (Sv39): user space 0x0 to 0x0000003FFFFFFFFF; kernel space 0xFFFFFFC000000000 upward.
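The x86-64 and Sv39 splits above follow one rule: every bit above the top implemented address bit must be a copy of that bit. As a rough sketch of that rule (classify_addr and addr_class are illustrative names, not kernel APIs), an address can be sorted into the user half, the kernel half, or the unusable hole between them. AArch64 is deliberately left out, because there TTBR0_EL1 and TTBR1_EL1 each cover a full 48-bit range.

```c
#include <stdint.h>

/* Illustrative helper types; these are not kernel APIs. */
typedef enum { ADDR_USER, ADDR_KERNEL, ADDR_NONCANONICAL } addr_class;

/* Classify a virtual address for a layout with `va_bits` implemented bits
 * (48 for 4-level x86-64, 39 for RISC-V Sv39). Bits va_bits-1 .. 63 must
 * be all zeros (user half) or all ones (kernel half). */
addr_class classify_addr(uint64_t va, unsigned va_bits)
{
    uint64_t upper    = va >> (va_bits - 1);          /* top bit and above */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* same width, all set */

    if (upper == 0)
        return ADDR_USER;         /* low half of the address space */
    if (upper == all_ones)
        return ADDR_KERNEL;       /* high half of the address space */
    return ADDR_NONCANONICAL;     /* hole between the two halves */
}
```

With va_bits set to 48, the address 0x0000800000000000 (one past the top of user space) falls in the hole, which is why no mapping can ever exist there.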

The physical memory the kernel manages is mapped into the kernel portion of every process's address space, so the kernel can dereference any kernel pointer regardless of which process is running.

A consequence: the page tables for every process include the kernel portion. To save memory, this is often done by sharing the kernel-portion page-table pages across all processes — only the user portion has per-process entries. The kernel's top-level page table entries for the kernel half are the same in every process's page table.

This shared-kernel-mapping arrangement was disrupted on x86 by the Spectre/Meltdown disclosures (Chapter 51). Meltdown let user code speculatively read kernel memory that was mapped into its address space but marked supervisor-only. The mitigation, KPTI (Kernel Page Table Isolation), gives each process two sets of page tables: a "user" set that omits almost all of the kernel mapping, and a "kernel" set used only while in the kernel. Switching between them on every kernel entry and exit has a measurable performance cost, but it closes the leak.

03. System Calls

A system call is the deliberate transition from user code into the kernel to request a service. The mechanics differ across ISAs but follow a uniform pattern:

User code prepares arguments in registers (per the syscall ABI).
User code identifies the desired service (a syscall number, in some register).
User code executes the syscall instruction.
Hardware traps to the kernel entry point with privilege elevated.
Kernel saves user state, runs handler, restores state.
Hardware returns to user code.
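From C, this sequence is normally hidden behind libc. As a minimal sketch, assuming Linux and a libc that provides the generic syscall() wrapper, the helpers below (raw_getpid and raw_write are illustrative names) pass a syscall number and arguments to the wrapper, which places them in the ABI registers and executes the trap instruction for whichever ISA the program was built for.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes syscall() under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall() */

/* Ask the kernel for our PID through the generic entry point instead of
 * the dedicated getpid() wrapper. Returns the PID, or -1 with errno set. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Write `count` bytes from `buf` to `fd` with a raw write syscall.
 * Returns the number of bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

Note that the wrapper converts the kernel's negative error return into -1 plus errno; the raw register-level convention differs from what C callers see.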

x86-64 System Calls

Linux on x86-64 uses SYSCALL:

Assembly

mov  $1, %rax        # syscall number (1 = write)
mov  $1, %rdi        # arg0: fd
lea  msg(%rip), %rsi # arg1: buf
mov  $5, %rdx        # arg2: count
syscall              # enter kernel

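The Linux x86-64 syscall ABI uses %rax for the number; %rdi, %rsi, %rdx, %r10, %r8, and %r9 for up to six arguments; and %rax for the return value. %r10 takes the place of the function-call ABI's %rcx because SYSCALL overwrites %rcx with the return address. SYSCALL also overwrites %r11 with the saved RFLAGS.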

The kernel entry point is configured by the LSTAR MSR. On SYSCALL, the CPU jumps there with privilege elevated to ring 0.

AArch64 System Calls

Linux on AArch64 uses SVC:

Assembly

mov  x8, #64          # syscall number (64 = write)
mov  x0, #1           # arg0: fd
adr  x1, msg          # arg1: buf
mov  x2, #5           # arg2: count
svc  #0               # enter kernel

AArch64 ABI: x8 holds the number, x0-x5 hold up to six arguments, and x0 holds the return value. SVC traps to the EL1 synchronous exception vector, and ESR_EL1 records the exception class as "SVC instruction".

RISC-V System Calls

Linux on RISC-V uses ECALL:

Assembly

li   a7, 64           # syscall number
li   a0, 1            # arg0
la   a1, msg          # arg1
li   a2, 5            # arg2
ecall                 # enter kernel

RISC-V ABI: a7 holds the number, a0-a5 hold arguments, and a0 receives the return value. ECALL traps to the handler whose address is in stvec, with scause set to 8 ("environment call from U-mode").

What Happens on Entry

The kernel's syscall entry has three jobs: save the user state (so it can be restored later), validate arguments, dispatch to the handler.

A simplified Linux x86-64 entry:

Assembly

entry_SYSCALL_64:
    swapgs                      # swap GS base for kernel GS
    mov  %rsp, %gs:cpu_user_rsp # save user RSP
    mov  %gs:cpu_kernel_rsp, %rsp # load kernel RSP
    push  $__USER_DS             # build pt_regs frame
    push  %gs:cpu_user_rsp
    push  %r11                   # rflags
    push  $__USER_CS
    push  %rcx                   # user RIP
    push  %rax                   # syscall number
    push  %rdi                   # ... and all the rest
    ...
    call  do_syscall_64           # C handler
    ...
    swapgs
    sysretq

The actual code is more elaborate, with instrumentation hooks, KPTI page-table switches, and security mitigations. The structure is uniform: enter kernel, build a pt_regs frame on the kernel stack containing all user registers, dispatch via syscall number to the handler, return.

Syscall Tables

The kernel maintains a table mapping syscall number → handler function. On Linux x86-64, this is sys_call_table. Conceptually, each handler receives the syscall arguments as ordinary C function arguments, in the same order as the syscall ABI (recent kernels pass a pointer to the saved pt_regs and let a generated wrapper unpack the arguments).

A handler is a regular C function:

long sys_write(unsigned int fd, const char __user *buf, size_t count) {
    // ...
}

The __user annotation marks pointers that come from user space and must be validated and accessed via copy_from_user / copy_to_user (which check that the address is in the user range and handle page faults gracefully).
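The table lookup itself can be modeled in a few lines of user-space C. This is only a sketch of the idea: demo_table, demo_dispatch, and the two toy handlers are hypothetical names, and the real sys_call_table is generated by the kernel build. The point it shows is the bounds check, which matters because the number arrives from untrusted user code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler signature for this model: three integer arguments. */
typedef long (*syscall_fn)(long a0, long a1, long a2);

static long demo_add(long a, long b, long c)    { (void)c; return a + b; }
static long demo_negate(long a, long b, long c) { (void)b; (void)c; return -a; }

/* The table: index is the syscall number, value is the handler. */
static const syscall_fn demo_table[] = {
    [0] = demo_add,
    [1] = demo_negate,
};

#define DEMO_NR_SYSCALLS (sizeof demo_table / sizeof demo_table[0])

/* Validate the number before indexing, then call through the table.
 * Unknown numbers yield -ENOSYS, matching the kernel's convention. */
long demo_dispatch(unsigned long nr, long a0, long a1, long a2)
{
    if (nr >= DEMO_NR_SYSCALLS || demo_table[nr] == NULL)
        return -ENOSYS;
    return demo_table[nr](a0, a1, a2);
}
```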

Returning to User

The return path is the inverse of entry: restore registers from the pt_regs frame, then issue the return instruction (SYSRET on x86-64, ERET on AArch64, SRET on RISC-V). The CPU drops privilege and resumes user code at the saved PC.

Some return paths take a slow path that handles signals, rescheduling, and notifications. The fast path is critical: a no-op syscall (e.g., getpid) on a modern Linux system takes roughly 80-100 ns end-to-end, depending on the CPU and on which security mitigations are enabled. That figure is effectively the floor on syscall cost.
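The round-trip cost is easy to measure from user space. The sketch below, assuming Linux, times a loop of raw getpid calls with CLOCK_MONOTONIC and reports the average per call; avg_getpid_ns and now_ns are illustrative names. The result includes loop overhead and will vary with the machine, the kernel configuration, and any sandboxing in place.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes syscall() under strict -std= modes */
#endif
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <time.h>          /* clock_gettime */
#include <unistd.h>        /* syscall() */

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock nanoseconds per getpid syscall over `iters`
 * back-to-back calls. Returns 0 if iters is 0. */
double avg_getpid_ns(unsigned iters)
{
    if (iters == 0)
        return 0.0;

    uint64_t start = now_ns();
    for (unsigned i = 0; i < iters; i++)
        syscall(SYS_getpid);   /* one full user -> kernel -> user trip */
    uint64_t end = now_ns();

    return (double)(end - start) / iters;
}
```

Calling avg_getpid_ns(1000000) a few times and taking the smallest result gives a reasonable estimate of the fast-path cost on a given system.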

04. The vDSO and vsyscall

Some kernel services do so little work that the cost of the syscall mechanism itself would dominate. gettimeofday, clock_gettime, and a few others are called millions of times per second by some applications, and trapping to the kernel and back on every call would cost far more than the work being requested.

The solution is the vDSO (virtual Dynamic Shared Object). The kernel maps a small shared library into every user process's address space, containing implementations of these hot calls. The vDSO code directly reads kernel-published data: a page of clock parameters that the kernel updates and maps read-only into user space. No syscall is involved.

int clock_gettime(clockid_t clk_id, struct timespec *tp);

The libc dispatches to the vDSO version, which on x86-64 reads the TSC (via RDTSC) and the kernel-published TSC-to-real-time mapping. On AArch64 it reads CNTVCT_EL0 and the equivalent calibration data.

The vDSO is one of the cleanest examples of hardware-software co-design: the OS exports specific data (with appropriate atomicity) and trusts user code to read it correctly. No security implication — the data is published, not protected.
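One common way to publish data "with appropriate atomicity" to lock-free readers is a sequence counter (seqlock), and Linux's vDSO data uses a scheme of this general kind. The sketch below is a user-space model under that assumption: struct time_page and both functions are hypothetical, and the real vDSO data layout is kernel-internal and differs. The writer makes the counter odd while updating; the reader retries until it sees the same even value before and after its reads.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of a kernel-published time page. */
struct time_page {
    atomic_uint      seq;          /* odd while an update is in progress */
    _Atomic uint64_t base_ns;      /* wall time at the last update */
    _Atomic uint64_t base_cycles;  /* cycle counter at the last update */
};

/* Writer side (the kernel's role in this model). */
void time_page_publish(struct time_page *p, uint64_t ns, uint64_t cycles)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);   /* odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->base_ns, ns, memory_order_relaxed);
    atomic_store_explicit(&p->base_cycles, cycles, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);   /* even */
}

/* Reader side (the vDSO's role): loop until a consistent snapshot. */
void time_page_read(struct time_page *p, uint64_t *ns, uint64_t *cycles)
{
    unsigned s1, s2;
    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        *ns = atomic_load_explicit(&p->base_ns, memory_order_relaxed);
        *cycles = atomic_load_explicit(&p->base_cycles, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);   /* retry if an update overlapped */
}
```

The reader never writes to the shared page, which is what allows the kernel to map it read-only into user space.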

05. Signals

A signal is the OS's mechanism for delivering asynchronous events to a process: a child died (SIGCHLD), a segmentation fault occurred (SIGSEGV), a user pressed Ctrl-C (SIGINT), an alarm went off (SIGALRM), and so on. Signals are conceptually similar to interrupts at the OS level — events delivered out of band that interrupt the regular flow.

When a signal is delivered to a process, the kernel must make the process run the signal handler. The mechanism on Linux:

On the next return-to-user from the kernel (e.g., when a syscall completes), the kernel checks for pending signals.
If there is one, the kernel modifies the user state on the stack: it pushes a signal frame containing the saved register state and arranges for control to return to the signal handler instead of where the user was running.
The signal handler executes; when it returns, control passes to a small trampoline (supplied by libc or the vDSO) that invokes sigreturn, a special syscall that restores the saved state from the signal frame.

The signal handler runs at user privilege, on a (configurable) signal stack. The implementation is delicate — getting all the registers (including FPU and SIMD state) saved and restored correctly across signals requires careful coordination between kernel and libc.

Signal delivery cost is high (microseconds, not nanoseconds). For high-frequency event delivery, modern systems use other mechanisms: epoll, io_uring, eventfd.

06. Process Scheduling

The OS scheduler selects which process (or thread) runs on each core. The CPU provides only the timer interrupt; the policy is in software. The interaction with hardware:

Timer interrupt. A periodic tick (e.g., HZ=250 or 1000 on Linux) or a one-shot programmable interrupt (a "tickless" or "NO_HZ" kernel). On the tick, the scheduler updates accounting, checks if the current task should be preempted, and decides whether to schedule a different task.

Context switch. When the scheduler decides to switch tasks, it saves the outgoing task's CPU state (registers, FP/SIMD state, FS/GS base on x86, TLS register on AArch64/RISC-V) and restores the incoming task's state. Then it returns to user mode running the new task.

The cost of a context switch:

Saving and restoring integer registers: trivial (~10 instructions).
Saving and restoring FP/SIMD state: more (several KB on AVX-512 systems; XSAVE/XRSTOR or equivalents).
Switching page tables: writing CR3 (x86-64), TTBR0 (AArch64), or satp (RISC-V).
The TLB and cache effects of running new code: indirect, often more expensive than the register saves.

Optimizations:

Lazy FPU save. Don't save FP state unless the new task uses FP. On x86, this used the CR0.TS bit and a #NM trap on first FP use, until a Spectre-class speculative-execution leak (LazyFP) made eager save the safe choice.
PCID / ASID. Avoid TLB flushes by tagging TLB entries with an address-space ID. Switching CR3/TTBR0 doesn't invalidate other ASIDs' entries.
xstate compaction. Save only the FP state actually used (e.g., skip ZMM registers if AVX-512 hasn't been touched).

A thread context switch on a modern Linux/x86-64 system takes 1-2 µs in the kernel path; total time including TLB and cache effects can be much more depending on working set.
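The save-and-restore at the heart of a context switch can be imitated in user space with the ucontext functions (getcontext, makecontext, swapcontext), which glibc provides. This is only an analogy, not how the kernel switches tasks: there is no privilege change and no page-table switch. The names sched_ctx, task_ctx, and run_switch_demo are illustrative.

```c
#include <ucontext.h>

static ucontext_t sched_ctx, task_ctx;   /* saved register state */
static int trace[4];                     /* order in which code ran */
static int trace_len;

static void task_body(void)
{
    trace[trace_len++] = 1;
    swapcontext(&task_ctx, &sched_ctx);  /* yield: save task, resume scheduler */
    trace[trace_len++] = 3;
    /* returning from here resumes uc_link, i.e. sched_ctx */
}

/* Switch into a task twice and record the interleaving.
 * Returns 1 if execution alternated scheduler/task as expected. */
int run_switch_demo(void)
{
    static char stack[64 * 1024];        /* the task's private stack */

    if (getcontext(&task_ctx) != 0)
        return 0;
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = sizeof stack;
    task_ctx.uc_link = &sched_ctx;       /* where to go when the task ends */
    makecontext(&task_ctx, task_body, 0);

    trace_len = 0;
    swapcontext(&sched_ctx, &task_ctx);  /* run the task until it yields */
    trace[trace_len++] = 2;
    swapcontext(&sched_ctx, &task_ctx);  /* resume the task after its yield */
    trace[trace_len++] = 4;

    return trace_len == 4 && trace[0] == 1 && trace[1] == 2 &&
           trace[2] == 3 && trace[3] == 4;
}
```

Each swapcontext stores the current registers and stack pointer in one structure and loads them from another, which is the same shape as the kernel's switch minus the FP/SIMD, TLS, and page-table handling described above.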

07. Memory Management

The kernel manages the physical memory of the system, deciding which physical pages back which virtual pages. The hardware provides the MMU; everything above is software.

Key operations:

Allocation. When a process needs more memory (mmap, brk, malloc that grew the heap), the kernel reserves virtual address ranges. Physical pages are usually allocated lazily — only when the user actually accesses the memory does the page-fault handler allocate a physical page and map it.

Page faults. When user code accesses an unmapped or improperly-permissioned page, the MMU raises a fault. The kernel's page-fault handler distinguishes cases:

The page is mapped logically but not physically (demand paging): allocate, map, return.
The page is in swap: read from disk, allocate, map, return.
The page is copy-on-write (after fork): allocate a copy, map RW.
The page is a file mapping not yet faulted in: allocate, fill from file, map.
The access is genuinely invalid (out of range, wrong permissions): deliver SIGSEGV.

Page replacement. When physical memory is tight, the kernel reclaims pages: evicting clean pages, writing dirty pages to swap, shrinking caches. Algorithms like the Linux LRU lists balance these pressures.

TLB shootdown. When the kernel changes a page-table entry that may be cached in another core's TLB, that core must invalidate its TLB entry before the next access. On x86 and RISC-V, this is done by an IPI to the affected cores, which run an INVLPG / SFENCE.VMA in the IPI handler. On AArch64, the TLBI instruction broadcasts invalidations to all cores in the inner-shareable domain — no IPI needed.

The TLB shootdown cost is one of the persistent issues with massively multithreaded workloads. A page-table modification on a 64-core system may need 63 IPIs and acknowledgments, taking microseconds. Various optimizations exist — batching, lazy invalidation, the AArch64 broadcast mechanism — but the problem remains fundamental.

08. File Descriptors and the I/O Path

The classic Unix abstraction: everything is a file. Files, sockets, pipes, devices — all are accessed through file descriptors. The kernel maintains a per-process descriptor table mapping integer fds to underlying objects.

The hardware's role here is minimal. The kernel does the work; the CPU just runs the kernel. The CPU's interaction is via interrupts from devices (network card, NVMe, USB) that the kernel handles in interrupt context.

For high-performance I/O, modern systems use mechanisms that minimize syscall overhead:

io_uring on Linux: shared ring buffers between user and kernel, allowing batched submission and completion of I/O without syscalls per operation.
DPDK for networking: bypasses the kernel entirely, with user-space drivers polling NICs.
SPDK for storage: user-space NVMe drivers.

These all use the CPU's MMU and interrupt mechanisms but minimize the per-operation syscall cost.

09. Threading

On modern systems, threads within a process share an address space but have their own register state (and thus their own stack). The kernel sees threads largely the same as processes, with shared memory.

The CPU's contribution to threading is minimal:

Per-thread registers (clearly).
A thread-local-storage register: FS base on x86-64, TPIDR_EL0 on AArch64, tp (x4) on RISC-V. The kernel sets this on context switch; threads use it to access TLS.
Atomic instructions for lockless data structures and locks.
Memory ordering primitives.

Linux threads use the clone syscall, a generalized fork that lets the caller specify which resources to share between parent and child. POSIX threads (pthreads) implement on top of clone with the appropriate flags.

A switch between two threads of the same process is essentially the same as a process context switch, except that no page-table change is needed (same address space). This is faster: the existing TLB entries stay valid, and the two threads often share much of their cached working set.
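The difference between shared memory and per-thread state shows up directly in C. In this sketch, which uses POSIX threads and C11 atomics (run_thread_demo, worker, and the two counters are illustrative names), both threads increment the same global through the shared address space, while each has its own copy of a _Thread_local variable reached through the TLS register. Older toolchains may need the -pthread flag to build it.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int shared_hits;           /* one copy, seen by every thread */
static _Thread_local int private_hits;   /* one copy per thread, via TLS */

static void *worker(void *arg)
{
    int *out = arg;
    for (int i = 0; i < 1000; i++) {
        atomic_fetch_add(&shared_hits, 1);   /* contended, needs an atomic */
        private_hits++;                      /* private, plain increment */
    }
    *out = private_hits;                     /* report this thread's count */
    return NULL;
}

/* Run two workers. Returns 1 if each saw a private count of 1000 and the
 * shared count reached 2000, 0 otherwise. */
int run_thread_demo(void)
{
    pthread_t t1, t2;
    int c1 = 0, c2 = 0;

    atomic_store(&shared_hits, 0);
    if (pthread_create(&t1, NULL, worker, &c1) != 0)
        return 0;
    if (pthread_create(&t2, NULL, worker, &c2) != 0) {
        pthread_join(t1, NULL);
        return 0;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    return c1 == 1000 && c2 == 1000 && atomic_load(&shared_hits) == 2000;
}
```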

10. NUMA and CPU Affinity

On multi-socket systems and even single-socket many-core chips, memory access latency depends on which core is accessing which memory bank. This is NUMA (Non-Uniform Memory Access): each CPU socket (or, in recent AMD Zen and Intel Sapphire Rapids designs, each NUMA node within a socket) has its own memory controller. Accessing memory attached to the local controller is fast; accessing remote memory is slower, typically by a factor of 1.5-3×.

The OS exposes NUMA to applications:

CPU affinity (sched_setaffinity): pin a thread to specific cores, so it doesn't migrate.
Memory policy (mbind, numa_alloc_onnode): allocate memory on a specific NUMA node.
Auto-balancing: the kernel migrates pages to follow accessing threads.

For latency-sensitive workloads (databases, HPC, high-frequency trading), explicit NUMA management is standard practice. For general workloads, kernel auto-balancing is usually enough.

11. Power Management

CPUs expose multiple power states; the OS manages transitions:

P-states (Performance states). Frequency and voltage settings. Lower-numbered P-states run faster but consume more power; P0 is the fastest. Modern CPUs scale frequency dynamically; the OS constrains this by selecting a policy (on Linux, a cpufreq governor such as powersave, performance, ondemand, or schedutil).

C-states (Idle states). Increasingly deep idle modes. C0 is running; C1 halts the core; deeper states (C2, C3, C6 on x86; WFI/WFE plus retention states on AArch64) save more power but take longer to wake. The OS chooses idle depth based on predicted idle duration (cpuidle governor).

S-states (System states). S0 = running; S3 = suspend to RAM; S4 = suspend to disk (hibernate); S5 = off. Mostly for laptops.

The OS interface for these is typically MSRs (x86), system registers (AArch64), or SBI calls (RISC-V) plus instructions like MWAIT, HLT, WFI. ACPI provides a higher-level abstraction for power and thermal management on most systems.

12. Accounting and Telemetry

The kernel keeps track of resource usage per process: CPU time, memory, file I/O, network. Sources:

The timer tick attributes elapsed time to the running task.
Performance counters (with appropriate kernel support) measure cycles, instructions, cache misses per task.
Page-fault counts and swap activity track memory pressure.
Per-syscall accounting records I/O.

The perf tool on Linux uses the CPU's performance counters to provide detailed per-process and per-function profiling. The interface is perf_event_open; the kernel multiplexes counters across tasks and reports results.

13. OS Interface Comparison

Feature	x86-64 (Linux)	AArch64 (Linux)	RISC-V (Linux)
Syscall instruction	SYSCALL	SVC #0	ECALL
Syscall number	rax	x8	a7
Args	rdi, rsi, rdx, r10, r8, r9	x0-x5	a0-a5
Return	rax	x0	a0
Trap dispatch	LSTAR MSR	VBAR_EL1	stvec
Page-table base	CR3	TTBR0_EL1, TTBR1_EL1	satp
TLS register	FS base	TPIDR_EL0	tp (x4)
Atomic primitives	LOCK prefix, CMPXCHG	LDXR/STXR, LSE	LR/SC, AMO
TLB invalidate	INVLPG + IPI	TLBI (IS variants, broadcast)	SFENCE.VMA + IPI
Inter-processor interrupt	APIC ICR write	GICv3 SGI	SBI sbi_send_ipi

Despite very different mechanics, the OS interfaces are isomorphic in capability. Linux ports across these ISAs without conceptual changes — only the arch-specific assembly and headers change.

14. Summary

The OS interface is the negotiated boundary between hardware and the OS kernel: privilege levels, virtual memory, traps, syscalls, signals, threading, scheduling, NUMA, power management. The CPU provides primitive mechanisms (one trap instruction, page tables, a timer); the OS layers on the abstractions users actually want (processes, files, sockets, threads).

This interface is decades old in spirit. The Multics and early Unix designs established the shape; modern systems are refinements, not revolutions. Each ISA has its own dialect for the same conversation: SYSCALL/SVC/ECALL, CR3/TTBR0/satp, INVLPG/TLBI/SFENCE.VMA. A kernel programmer who knows one quickly learns the others.

The next chapter looks below the OS at the firmware: how the system boots from cold reset, what runs before the kernel, how UEFI and Device Tree describe the hardware to the OS, and how secure boot anchors trust.
