Virtualization
May 16, 2026·15 min read·advanced
This chapter is about running multiple operating systems on a single machine. Virtualization is one of the most consequential developments in computing of the last twenty-five years. It's the foundation of cloud computing, the layer beneath most container deployments, and the substrate on which much modern infrastructure runs. The hardware support that makes efficient virtualization possible (Intel VT-x, AMD-V/SVM, ARM virtualization extensions, RISC-V's H extension) is one of the larger architectural additions of the post-2000 era.
We'll cover the spectrum: what virtualization means, why it was originally hard, how hardware extensions made it efficient, what hypervisors look like, and the related but distinct topic of containers.
01. What Virtualization Is
A virtual machine (VM) is a software-emulated computer that can run an operating system. The OS running inside a VM is the guest; the layer providing the VM is the hypervisor (or virtual machine monitor, VMM). The hardware running the hypervisor is the host.
Virtualization allows:
- Server consolidation: many VMs on one physical machine, each isolated.
- Cloud computing: AWS EC2, Azure VMs, Google Compute Engine — VMs as a service.
- Development and testing: run multiple OSes on a developer workstation.
- Live migration: move a running VM from one host to another with minimal interruption.
- Snapshot and clone: capture VM state and revert; clone identical VMs.
- Security isolation: contain malware or untrusted code in a VM.
The performance overhead, with modern hardware extensions, is small: typically around 5% for compute-bound workloads, somewhat more for I/O-heavy ones.
02. Type 1 vs. Type 2 Hypervisors
Type 1 hypervisors (bare-metal): run directly on hardware, with no host OS. Examples: VMware ESXi, Microsoft Hyper-V (in its native role), Xen, KVM (when considered as part of the Linux kernel).
Type 2 hypervisors (hosted): run as an application on a host OS. Examples: VMware Workstation, VirtualBox, QEMU (without KVM), Parallels Desktop.
The distinction has blurred. KVM converts Linux into a Type 1 hypervisor — Linux is the hypervisor. Hyper-V on Windows technically runs Windows as a privileged guest under the Hyper-V hypervisor.
For our purposes, the distinction matters less than the technology underneath. We focus on hardware-assisted virtualization with KVM as the canonical example.
03. The Classical Virtualization Problem
Before hardware support, virtualizing x86 efficiently was famously hard. Popek and Goldberg's 1974 paper formalized the requirements: an architecture is "classically virtualizable" if all sensitive instructions (those that read or write privileged state, or behave differently in privileged vs. unprivileged mode) are also privileged (trap when executed in unprivileged mode).
Original x86 failed this test. Some instructions silently behaved differently in user mode — for example, POPF (pop flags) ignored interrupt-flag changes in user mode rather than trapping. A guest OS running in user mode would execute POPF expecting to enable/disable interrupts, get neither a trap nor the desired effect, and proceed in a corrupted state.
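The Popek and Goldberg condition can be read as a set inclusion: the sensitive instructions must be a subset of the privileged ones. The sketch below checks that inclusion for a small, heavily abridged instruction list; the sets and function names are illustrative assumptions, not a complete x86 classification.

```python
# Abridged, illustrative instruction sets; not a full x86 classification.
SENSITIVE = {"POPF", "LGDT", "MOV_TO_CR3", "HLT"}
PRIVILEGED = {"LGDT", "MOV_TO_CR3", "HLT"}  # these trap when run in user mode


def virtualization_holes(sensitive: set[str], privileged: set[str]) -> set[str]:
    """Return sensitive instructions that do NOT trap in unprivileged mode."""
    return sensitive - privileged


def is_classically_virtualizable(sensitive: set[str], privileged: set[str]) -> bool:
    """Popek-Goldberg condition: every sensitive instruction is privileged."""
    return not virtualization_holes(sensitive, privileged)


holes = virtualization_holes(SENSITIVE, PRIVILEGED)
```

With these sets, `holes` contains only `POPF`, the instruction a trap-and-emulate hypervisor never gets a chance to intercept.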
VMware's solution (in 1999): binary translation. The hypervisor scans the guest kernel's instructions and replaces problematic ones on the fly with equivalent sequences that emulate the desired effect or trap into the hypervisor. Combined with shadow page tables (the hypervisor maintains real page tables that map guest virtual to host physical, applying both the guest's translation and its own), this made x86 virtualization possible but not cheap.
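A shadow page table is the composition of two mappings. The sketch below models both as dictionaries of page numbers and builds the combined guest-virtual to host-physical table; the names and the flat single-level layout are simplifying assumptions, since real shadow tables are multi-level and maintained incrementally.

```python
def build_shadow_table(guest_pt: dict[int, int], host_map: dict[int, int]) -> dict[int, int]:
    """Compose guest-virtual -> guest-physical with guest-physical -> host-physical."""
    shadow = {}
    for gva_page, gpa_page in guest_pt.items():
        if gpa_page in host_map:  # skip guest pages the hypervisor has not backed
            shadow[gva_page] = host_map[gpa_page]
    return shadow


guest_pt = {0x10: 0x1, 0x11: 0x2, 0x12: 0x7}  # guest's own page table
host_map = {0x1: 0x80, 0x2: 0x93}             # hypervisor's placement of guest memory
shadow = build_shadow_table(guest_pt, host_map)
```

The expense comes from keeping `shadow` current: every guest write to `guest_pt` has to be intercepted so the composed table can be updated.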
Around the same time, the academic community developed paravirtualization (Xen, ~2003): modify the guest OS to call hypervisor "hypercalls" instead of executing privileged instructions. This avoided the trap-and-emulate cost but required guest OS changes.
Both approaches were workable but imperfect. Hardware vendors recognized the demand for cleaner virtualization and added it.
04. Hardware Virtualization Extensions
Intel introduced VT-x in 2005; AMD followed with SVM (also called AMD-V), which shipped in 2006. ARM added virtualization extensions to ARMv7-A and built them into ARMv8-A as exception level EL2, which remains optional for implementations. RISC-V added the H (Hypervisor) extension as an optional but standardized extension.
The unifying design is:
- A new privilege level for the hypervisor, more privileged than the guest OS's most-privileged level.
- A guest mode, in which the CPU runs guest code with semantics close to bare metal but with hooks for the hypervisor.
- Configurable trap conditions: the hypervisor selects which guest events cause exits to the hypervisor.
- Two-stage memory translation: guest virtual → guest physical → host physical, with both stages walked by the MMU.
- Virtualized interrupts: efficient delivery without exiting to the hypervisor.
We look at each ISA's specifics, then unify.
Intel VT-x
VT-x adds:
- VMX root operation (the hypervisor) and VMX non-root operation (the guest). Both have rings 0-3, doubling the privilege state space.
- VMCS (Virtual Machine Control Structure): a per-VCPU control block with guest state, host state, control fields, and exit information. The hypervisor configures it before entering the guest.
- VMXON, VMXOFF: enable / disable VMX mode.
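- VMLAUNCH, VMRESUME: enter the guest (VMLAUNCH for the first entry on a VMCS, VMRESUME for later ones).
- VMCALL: executed by the guest to force an exit into the hypervisor (a hypercall). The VM exit itself is a hardware event, not an instruction.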
- EPT (Extended Page Tables): second-stage translation. The MMU walks the guest's page tables (CR3-based) for guest-virtual to guest-physical, then walks the EPT (configured by the hypervisor) for guest-physical to host-physical.
A typical guest entry sequence: hypervisor populates VMCS with guest state; executes VMLAUNCH; CPU loads guest state and begins executing in non-root operation; eventually a VM exit occurs (e.g., guest executes a sensitive instruction, takes an external interrupt, attempts a forbidden access); CPU saves guest state to VMCS and reenters root operation at the hypervisor's exit handler.
The set of conditions that cause exits is configurable. For most operations, no exit is needed — the guest runs natively. Only the cases that the hypervisor must handle (e.g., I/O port access, certain CR writes, EPT violations) cause exits.
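The effect of configurable exit conditions can be modeled as a filter over the guest's event stream. In this sketch, `ExitControls` is a hypothetical stand-in for the control fields of a VMCS, and the event names are invented labels rather than architectural exit reasons.

```python
from dataclasses import dataclass, field


@dataclass
class ExitControls:
    """Hypothetical stand-in for the exit-control fields of a VMCS."""
    exiting_events: set[str] = field(default_factory=set)


def run_guest(events: list[str], controls: ExitControls) -> list[str]:
    """Return the events that caused a VM exit; all others ran natively."""
    return [event for event in events if event in controls.exiting_events]


controls = ExitControls({"io_port", "cr_write", "ept_violation"})
trace = ["add", "load", "io_port", "store", "syscall", "ept_violation"]
exits = run_guest(trace, controls)
```

Only two of the six events in `trace` reach the hypervisor; the rest, including the guest's own system call, never leave non-root operation.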
AMD SVM
AMD's SVM is conceptually similar to VT-x with different naming:
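- SVME (Secure Virtual Machine Enable): a bit in the EFER MSR that turns on SVM.
- VMRUN: enters the guest; the exit back to the host (#VMEXIT) is a hardware event. VMSAVE and VMLOAD save and load additional state that VMRUN does not switch.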
- VMCB (Virtual Machine Control Block): equivalent to VMCS.
- NPT (Nested Page Tables): equivalent to EPT.
- ASID for guests: each guest gets a tag in TLB entries.
The mental model is the same. Linux's KVM has a unified abstraction layer (kvm-intel.ko and kvm-amd.ko providing the per-vendor backend, kvm.ko providing the common framework).
ARM Virtualization
ARMv8 made virtualization a first-class feature. The privilege levels are:
- EL2: hypervisor.
- EL1: OS kernel (host or guest).
- EL0: user.
When EL2 is enabled, the hypervisor runs there. To run a guest, EL2 sets up the guest's CPU context (registers, page tables, system register values), configures HCR_EL2 (Hypervisor Configuration Register) for trap conditions, and executes ERET to drop into the guest's EL1.
The guest runs in EL1 / EL0, with system register accesses and certain instructions trapped to EL2 according to HCR_EL2. Stage-2 translation is configured via VTTBR_EL2 (Virtualization Translation Table Base Register) and VTCR_EL2.
ARMv8.1 added VHE (Virtualization Host Extensions), a feature that allows the host kernel to run in EL2 directly with the same semantics as EL1, simplifying KVM-style hypervisors. With VHE, the host kernel already runs at EL2, so entering a guest or handling a guest exit no longer requires the host to switch between EL1 and EL2.
RISC-V H Extension
The H extension adds:
- HS-mode: hypervisor-extended supervisor mode.
- VS-mode: virtual supervisor (guest kernel).
- VU-mode: virtual user (guest user).
- hgatp: stage-2 translation base.
- hstatus, hideleg, hedeleg, hvip, htval, htinst: hypervisor CSRs.
Trap routing: traps in VS/VU can be routed to HS-mode (the hypervisor) or, where the hypervisor delegates them, directly back to VS-mode (avoiding a hypervisor round-trip for guest-internal traps).
The model is conceptually analogous to ARM's: a privilege level (HS) above the guest's most-privileged level (VS), with stage-2 translation and configurable trap conditions.
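The trap routing described above can be sketched as a two-step delegation check. This is a simplified model that covers only synchronous exceptions raised below M-mode; the function name is hypothetical, and the cause numbers used in the example (8 for an environment call from U or VU-mode, 12 for an instruction page fault, 20 for an instruction guest-page fault) follow the privileged specification as I understand it.

```python
def route_trap(cause: int, virt: bool, medeleg: int, hedeleg: int) -> str:
    """Return the mode that handles an exception raised below M-mode.

    virt is True when the trap was raised in VS-mode or VU-mode.
    """
    bit = 1 << cause
    if not medeleg & bit:
        return "M"   # machine mode kept this cause for itself
    if virt and hedeleg & bit:
        return "VS"  # hypervisor delegated it straight to the guest kernel
    return "HS"      # hypervisor handles it


ECALL_FROM_U, INSTR_PAGE_FAULT, INSTR_GUEST_PAGE_FAULT = 8, 12, 20
medeleg = (1 << ECALL_FROM_U) | (1 << INSTR_PAGE_FAULT) | (1 << INSTR_GUEST_PAGE_FAULT)
hedeleg = (1 << ECALL_FROM_U) | (1 << INSTR_PAGE_FAULT)

guest_syscall = route_trap(ECALL_FROM_U, True, medeleg, hedeleg)
stage2_fault = route_trap(INSTR_GUEST_PAGE_FAULT, True, medeleg, hedeleg)
```

Here a guest system call lands in VS-mode without a hypervisor round-trip, while a stage-2 fault goes to HS-mode, which owns the stage-2 tables.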
05. Two-Stage Memory Translation
The most performance-critical hardware addition is two-stage translation. Without it, the hypervisor must shadow the guest's page tables: every guest update to its own page tables must be intercepted, validated, and translated into host page table updates. This was the largest source of virtualization overhead in pre-EPT systems.
With two-stage translation:
- The guest manages its own page tables (guest virtual → guest physical).
- The hypervisor manages a separate set of tables (guest physical → host physical).
- The MMU walks both, on demand. A TLB entry caches the combined translation.
Guest OS modifications to its page tables are no longer special — they're just memory writes the hypervisor doesn't need to see.
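A toy model makes the division of ownership concrete. The class below is a sketch under simplifying assumptions: both stages are flat dictionaries of page numbers, and the TLB is an unbounded cache keyed by guest-virtual page; none of the names correspond to a real interface.

```python
class TwoStageMmu:
    """Toy MMU on page numbers: stage1 is guest-owned, stage2 hypervisor-owned."""

    def __init__(self, stage1: dict[int, int], stage2: dict[int, int]) -> None:
        self.stage1 = stage1
        self.stage2 = stage2
        self.tlb: dict[int, int] = {}
        self.walks = 0

    def translate(self, gva_page: int) -> int:
        if gva_page in self.tlb:          # TLB hit: combined translation cached
            return self.tlb[gva_page]
        self.walks += 1                   # TLB miss: walk both stages
        gpa_page = self.stage1[gva_page]
        hpa_page = self.stage2[gpa_page]
        self.tlb[gva_page] = hpa_page
        return hpa_page

    def guest_invalidate(self, gva_page: int) -> None:
        self.tlb.pop(gva_page, None)


mmu = TwoStageMmu(stage1={0x10: 0x1}, stage2={0x1: 0x80, 0x2: 0x93})
first = mmu.translate(0x10)   # miss
again = mmu.translate(0x10)   # hit
mmu.stage1[0x10] = 0x2        # guest remaps the page with a plain memory write
mmu.guest_invalidate(0x10)    # and invalidates the stale entry, as on bare metal
remapped = mmu.translate(0x10)
```

The guest's remap touches only `stage1`; nothing in the hypervisor-owned `stage2` changes, and no interception is modeled or needed.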
The cost: a TLB miss now requires walking both sets of tables. In the worst case, each step of walking the guest's table requires walking the hypervisor's table to find the physical address of the next guest table. For a 4-level guest table and a 4-level host table, each of the 4 guest levels costs 4 host-table accesses plus 1 read of the guest entry, and the final guest-physical address needs 4 more host-table accesses: 4 × (4 + 1) + 4 = 24 memory accesses for a single TLB miss. In practice, page-walk caches (Chapter 19) absorb most of this, and TLB hit rates dominate runtime.
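Counting each memory reference explicitly, including the stage-2 walk for the final guest-physical address, gives the worst-case tally for a nested walk. The function below is a small sketch of that count; it assumes no page-walk caching at all.

```python
def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory accesses for one TLB miss under two-stage translation."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += host_levels  # translate the guest-physical address of this guest table
        accesses += 1            # read the guest page-table entry itself
    accesses += host_levels      # translate the final guest-physical data address
    return accesses


four_by_four = nested_walk_accesses(4, 4)
```

For 4 guest levels and 4 host levels this yields 24, which matches the closed form (n + 1)(m + 1) - 1.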
06. Trap and Emulate
For events that must involve the hypervisor — I/O port access, hypercalls, sensitive system register accesses, hardware faults — the trap-and-emulate model still applies, just at a lower frequency:
- Guest executes the sensitive operation.
- Hardware exits to the hypervisor with exit reason information.
- Hypervisor decodes the operation, performs the equivalent action (perhaps emulating a device, perhaps forwarding to a device the guest is allowed to access).
- Hypervisor returns to the guest, which resumes after the trapping instruction.
The number of exits per second a guest causes is a key performance metric. Compute-bound guests cause very few; I/O-bound guests can cause tens of thousands per second. Modern designs aim to keep exits to a minimum.
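The four steps above amount to a dispatch loop in the hypervisor. The sketch below replays a recorded list of exits instead of running a real guest; `SerialPort`, the exit dictionaries, and the reason strings are hypothetical illustrations, not any hypervisor's actual structures.

```python
class SerialPort:
    """Hypothetical emulated device: collects bytes written to one I/O port."""

    def __init__(self) -> None:
        self.output = bytearray()

    def write(self, value: int) -> None:
        self.output.append(value & 0xFF)


def handle_exit(exit_info: dict, devices: dict[int, SerialPort]) -> bool:
    """Emulate one exit. Return False when the guest should stop running."""
    reason = exit_info["reason"]
    if reason == "io_out":
        devices[exit_info["port"]].write(exit_info["value"])
        return True
    if reason == "hlt":
        return False
    raise RuntimeError(f"unhandled exit reason: {reason}")


def run(exits: list[dict], devices: dict[int, SerialPort]) -> int:
    """Process exits in order; return how many were handled."""
    handled = 0
    for exit_info in exits:
        handled += 1
        if not handle_exit(exit_info, devices):
            break
    return handled


serial = SerialPort()
recorded = [
    {"reason": "io_out", "port": 0x3F8, "value": ord("h")},
    {"reason": "io_out", "port": 0x3F8, "value": ord("i")},
    {"reason": "hlt"},
]
handled = run(recorded, {0x3F8: serial})
```

Each byte written to the emulated port costs one full exit, which is why per-byte device emulation scales poorly.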
07. Virtual Devices
A guest sees a set of "virtual" devices: a virtual NIC, a virtual disk, a virtual graphics card. These are emulated by the hypervisor. The simplest approach: the hypervisor pretends to be a real (often old, well-supported) device. The guest's existing driver works. Examples: emulated Realtek 8139 NIC, IDE disk controller.
Faster: paravirtualized devices. The guest runs a special driver that knows it's in a VM and uses an efficient ring-buffer interface to the hypervisor. virtio is the dominant standard: virtio-net for networking, virtio-blk for storage, virtio-gpu for graphics.
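The ring-buffer idea can be shown with a simplified single-producer, single-consumer ring. This is a sketch of the general technique only: it is not the virtio virtqueue layout, and the class and method names are assumptions made for illustration.

```python
class Ring:
    """Fixed-size ring shared by a producer (guest driver) and a consumer (hypervisor)."""

    def __init__(self, size: int) -> None:
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # next slot the producer fills
        self.tail = 0  # next slot the consumer drains

    def push(self, item) -> bool:
        if self.head - self.tail == len(self.slots):
            return False  # ring full; producer must wait
        self.slots[self.head & self.mask] = item
        self.head += 1
        return True

    def drain(self) -> list:
        """Consume everything queued so far in one pass."""
        items = []
        while self.tail != self.head:
            items.append(self.slots[self.tail & self.mask])
            self.tail += 1
        return items


ring = Ring(4)
accepted = [ring.push(packet) for packet in range(5)]
batch = ring.drain()
```

The guest can queue several requests and notify the hypervisor once, so a single exit covers the whole `batch` rather than one exit per request.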
Faster still: device passthrough. The hypervisor hands a physical device directly to the guest. The guest uses its real driver, talking to real hardware. This is gated by IOMMU support — see below.
IOMMU and SR-IOV
An IOMMU (Input-Output Memory Management Unit) translates DMA addresses just as the CPU's MMU translates memory addresses. With an IOMMU:
- Devices doing DMA see "I/O virtual addresses" that the IOMMU translates to physical.
- The hypervisor configures the IOMMU per-device, so a passed-through device can only DMA to the guest's memory.
Without an IOMMU, a passthrough device could DMA anywhere, escaping the VM's isolation. With an IOMMU, passthrough is safe.
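The isolation property follows from per-device translation tables: an address with no mapping faults instead of reaching memory. The sketch below models that behavior with a hypothetical `Iommu` class, flat page-number tables, and 4 KiB pages; real IOMMUs use multi-level tables and report faults to the host differently.

```python
PAGE_SIZE = 4096


class Iommu:
    """Toy IOMMU: one table of {iova_page: phys_page} per device."""

    def __init__(self) -> None:
        self.tables: dict[str, dict[int, int]] = {}

    def map(self, device: str, iova_page: int, phys_page: int) -> None:
        self.tables.setdefault(device, {})[iova_page] = phys_page

    def translate(self, device: str, iova: int) -> int:
        page, offset = divmod(iova, PAGE_SIZE)
        try:
            phys_page = self.tables[device][page]
        except KeyError:
            raise PermissionError(f"DMA fault: {device} has no mapping for {iova:#x}") from None
        return phys_page * PAGE_SIZE + offset


iommu = Iommu()
iommu.map("nic0", 0x10, 0x8000)  # hypervisor exposes one guest page to the NIC
allowed = iommu.translate("nic0", 0x10 * PAGE_SIZE + 4)
```

A DMA from `nic0` to any other page, or from a device with no table at all, raises `PermissionError` in this model rather than touching host memory.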
Intel's IOMMU is VT-d; AMD's is AMD-Vi or just IOMMU. ARM has SMMU (System MMU). RISC-V has the IOMMU specification.
SR-IOV (Single Root I/O Virtualization): a PCIe feature where one physical device exposes multiple "virtual functions" (VFs) that can be passed through individually to different guests. A 100 Gbps NIC might expose 64 VFs, each assignable to a guest, each receiving its share of bandwidth. This avoids the hypervisor entirely on the data path.
08. Interrupt Virtualization
Delivering interrupts to a guest without exiting to the hypervisor requires hardware help. Intel's APICv (APIC virtualization) and AMD's AVIC (Advanced Virtual Interrupt Controller) let the hypervisor configure the interrupt controller such that guest reads from / writes to the APIC don't cause exits. A pending interrupt for the guest is delivered directly into the guest VCPU when the guest is running.
ARM has GICv3 / GICv4 with virtualization features: vCPU interfaces, virtual SGIs and LPIs.
RISC-V's AIA (Advanced Interrupt Architecture), which includes the IMSIC (Incoming MSI Controller), supports virtualization with similar mechanisms.
These features cut tens of thousands of exits per second from I/O-heavy workloads.
09. Nested Virtualization
Nested virtualization runs a guest hypervisor inside another hypervisor. It is used for cloud-in-cloud scenarios and for hypervisor development. Both VT-x and AMD SVM support nested virtualization, though it's complex. The outer hypervisor must emulate the VMX/SVM instructions for the inner hypervisor, allowing the inner hypervisor to "create VMs" that are actually configured into the outer hypervisor's VMCS (or VMCB on AMD).
ARM's virtualization is friendlier to nesting (with NV / NV2 features in ARMv8.3+). RISC-V H extension is designed to support nesting cleanly.
10. KVM and QEMU
On Linux, the dominant hypervisor is KVM (Kernel-based Virtual Machine), upstream since 2007. KVM is a kernel module that provides the low-level hypervisor functions: VCPU management, memory translation setup, exit handling. It does not provide device emulation; for that, KVM is paired with QEMU.
QEMU is a complete machine emulator that can run as a VM monitor on top of KVM. The architecture:
- The user runs QEMU with arguments specifying VM resources.
- QEMU opens /dev/kvm, creates a VM, allocates VCPU file descriptors.
- For each VCPU, QEMU spawns a thread that issues the KVM_RUN ioctl on the VCPU file descriptor; the thread enters guest mode, runs guest code, and returns when an exit needs handling.
- Exits that QEMU can handle in user space (most I/O) are returned to the QEMU thread; QEMU emulates the device and re-enters.
- Exits that need fast handling (e.g., MMIO to a paravirtualized device) can be handled in the kernel via the eventfd mechanism.
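The first step of that sequence, opening /dev/kvm and talking to it through ioctls, can be tried from user space. The sketch below only queries the API version. It assumes the generic Linux ioctl number encoding used on x86-64 and arm64, and the helper names `_io` and `probe_kvm` are my own; creating a VM and VCPUs would follow the same pattern with further ioctls.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte used by the KVM API


def _io(ioc_type: int, nr: int) -> int:
    """Encode a no-argument ioctl number (generic Linux layout; an assumption here)."""
    return (ioc_type << 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)
KVM_CREATE_VM = _io(KVMIO, 0x01)


def probe_kvm(path: str = "/dev/kvm"):
    """Return the KVM API version, or None if KVM is unavailable to this process."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None  # no KVM module, no device node, or no permission
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)


version = probe_kvm()
```

On a host with KVM available, the kernel documentation gives 12 as the stable API version; on a machine without KVM, `version` is simply `None`.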
KVM, usually paired with QEMU, is the basis of most Linux-based virtualization: OpenStack, KubeVirt, and libvirt-managed VMs. Large clouds often keep KVM but replace QEMU: GCE pairs KVM with Google's own user-space VMM, and AWS Nitro is a custom hypervisor with KVM heritage. Other VMMs built on KVM include Cloud Hypervisor (Rust, modern, minimal) and Firecracker (used by AWS Lambda, also Rust, focused on microVMs for serverless).
11. Containers
Closely related but distinct: containers. A container is a packaged process (or process group) running on a shared kernel, isolated by namespaces and constrained by control groups (cgroups). Docker, Podman, Kubernetes pods — all containers.
Containers and VMs differ:
| Aspect | VM | Container |
|---|---|---|
| Kernel | Per-VM | Shared with host |
| Boot time | Seconds | Milliseconds |
| Memory overhead | Per-VM kernel + system | Per-process |
| Isolation strength | Strong (hardware boundary) | Weaker (namespace boundary) |
| OS flexibility | Any OS | Same kernel ABI |
Containers are great for packaging applications, less suited to running untrusted code, and unable to run an OS with a different kernel. Modern systems often combine the two: Kata Containers run each container in a microVM for stronger isolation while keeping container-style management.
Containers rely on kernel namespaces and cgroups, which are purely software features, so the CPU contributes nothing container-specific. No special hardware support is needed (though some hardware features like ARM PAC or Intel CET enhance container security).
12. Confidential Computing
A more recent thread: confidential VMs that protect against the hypervisor itself.
- AMD SEV / SEV-ES / SEV-SNP: encrypted memory for VMs, with attestation.
- Intel TDX (Trust Domain Extensions): similar concept, post-SGX.
- ARM CCA (Confidential Compute Architecture): RME (Realm Management Extension) for confidential VMs.
- RISC-V CoVE (Confidential VM Extension): in-progress.
The threat model: the cloud provider's hypervisor might be compromised or untrustworthy. Confidential VMs encrypt their memory with keys not known to the hypervisor and use attestation to prove their integrity to remote parties. Even if the hypervisor is malicious, it cannot read VM data.
This is a substantial architectural addition. The hardware provides memory encryption per-VM, integrity protection on memory, and a measurement / attestation framework. The hypervisor still schedules the VM but cannot inspect its contents.
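The measurement half of attestation can be illustrated with a hash chain over the VM's initial memory. This is a generic sketch only: the chaining scheme, the function names, and the use of SHA-256 are assumptions for illustration and do not reproduce the SEV-SNP, TDX, or CCA measurement formats, and the signing of the report by hardware-held keys is omitted.

```python
import hashlib
import hmac


def measure(pages: list[bytes]) -> bytes:
    """Fold each initial page into a running digest, in load order."""
    digest = bytes(32)
    for page in pages:
        digest = hashlib.sha256(digest + page).digest()
    return digest


def verify(reported: bytes, expected_pages: list[bytes]) -> bool:
    """Remote party recomputes the measurement from a known-good image."""
    return hmac.compare_digest(reported, measure(expected_pages))


known_good = [b"bootloader", b"kernel", b"initrd"]
tampered = [b"bootloader", b"kernel-with-backdoor", b"initrd"]
report = measure(known_good)
```

A remote party that knows the expected image accepts `report`, while a measurement taken over `tampered` fails the comparison.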
13. Virtualization Performance
A typical compute-bound workload (say, compiling a large project) sees ~5% overhead vs. bare metal. I/O-bound workloads can see more, depending on the quality of the paravirtualized devices and whether IOMMU passthrough is used. Memory access is essentially native as long as TLB hit rates stay high; the extra cost of two-stage translation is paid only on TLB misses.
Improvements over time:
- VT-x / AMD SVM: reduced trap-and-emulate overhead.
- EPT / NPT: eliminated shadow page table cost.
- APICv / AVIC: eliminated interrupt-related exits.
- virtio: reduced per-I/O-operation overhead.
- SR-IOV: removed the hypervisor from the I/O data path.
A modern KVM guest on modern hardware is within a few percent of bare metal for most workloads. The cases where virtualization is expensive are rare (very high exit rates, such as very heavy interrupt or I/O loads on emulated devices).
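A back-of-the-envelope model ties exit rate to overhead: the fraction of CPU time lost is roughly exits per second times cycles per exit, divided by clock frequency. The figures in the example (2,000 cycles per exit round trip, a 3 GHz core) are assumptions chosen for illustration, not measurements.

```python
def exit_overhead_fraction(exits_per_sec: float, cycles_per_exit: float, cpu_hz: float) -> float:
    """Fraction of one core's cycles spent entering and leaving the hypervisor."""
    return exits_per_sec * cycles_per_exit / cpu_hz


# Assumed figures: 50,000 exits/s, 2,000 cycles per exit, 3 GHz clock.
io_heavy = exit_overhead_fraction(50_000, 2_000, 3e9)
# Same assumed cost per exit, but a compute-bound guest with 500 exits/s.
compute_bound = exit_overhead_fraction(500, 2_000, 3e9)
```

Under these assumptions the I/O-heavy guest loses a little over 3% of a core to exits, and the compute-bound guest loses a negligible fraction, before counting the hypervisor's own handling work.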
14. Summary
Virtualization is a deep stack: privilege levels above the guest OS, two-stage translation, configurable trap conditions, virtualized interrupt controllers, IOMMUs for safe device passthrough, paravirtualized devices for efficient I/O. Hardware extensions across all major ISAs make the runtime cost small enough that virtualization is the default deployment model for cloud workloads.
KVM + QEMU on Linux is the dominant open-source stack; VMware ESXi, Microsoft Hyper-V, and proprietary hypervisors fill out the rest. Containers, while not virtualization in the same sense, are the pragmatic choice for packaging applications when guest-OS flexibility isn't needed. Confidential computing is the latest layer, addressing trust between guest and hypervisor.
The hardware features that make all this efficient are some of the largest architectural additions of the past two decades. They are also among the most consequential — without them, the cloud as we know it would be impractical.
The next chapter looks at the security side: how the OS isolates processes, how hardware and OS features (NX bits, SMEP, PXN, PAN, ASLR, MTE, MAC, etc.) defend against attacks, and where speculative execution attacks fit into this picture (with a forward reference to Chapter 51 for the deep dive on Spectre and Meltdown).