Part V·ISA Case Studies·Chapter 45 of 62

RISC-V Micro-Architecture

May 16, 2026·13 min read·advanced

This chapter surveys modern RISC-V micro-architecture: how high-performance cores are built, how they compare to AArch64 and x86-64 contemporaries, and where the strengths and weaknesses currently lie. Earlier chapters in Parts V and VI covered the general principles of out-of-order execution, branch prediction, caches, and SMT. This chapter is the application of those principles to RISC-V silicon.

In late 2024 and 2025 the RISC-V landscape is genuinely diverse: from microcontrollers running at hundreds of MHz to data-center server cores targeting Neoverse-class performance. We focus on the application-class end — cores capable of running Linux, with multi-issue out-of-order pipelines and serious cache hierarchies.

01.The Implementation Landscape

A non-exhaustive list of significant RISC-V cores in 2024-2025:

SiFive Performance series.

P670: 4-wide OoO, designed for client and edge inference. Comparable to Arm Cortex-A78 in target performance.
P870: 6-wide OoO. RVA22 profile, V 1.0, hypervisor extension. SiFive's flagship application core.
U-series and S-series: simpler in-order cores for embedded.

T-Head (Alibaba).

C906: simple in-order, microcontroller-class.
C910: 3-wide OoO, used in early RISC-V Linux SoCs (e.g., the Sipeed LicheePi 4A). Has an early non-ratified V 0.7.1 in some chips.
C920: 4-wide OoO with H extension and ratified V 1.0. The successor to the C910 in newer designs.
C930 / C950: announced higher-performance cores.

Ventana.

Veyron V1: 8-wide decode, large reorder buffer, server-class. Targeted at data-center workloads.
Veyron V2: announced; even wider, with H extension and V.

Tenstorrent.

Internal RISC-V cores for AI workloads (in their Wormhole / Blackhole accelerators), used as control processors and "Tensix" baby-cores.
Ascalon: announced 8-wide application-class core for general use.

SpacemiT.

K1 / X60: dual-issue in-order with V 1.0 + Zvfh. Used in BananaPi BPI-F3 and several other dev boards. The first widely available RVV 1.0 hardware for end users in 2024.

Andes Technology.

AX65: high-performance application core targeted at smartphones and edge devices.
D45 / N45: DSP-focused with vector extension.

Esperanto Technologies.

ET-SoC-1: 1088 small RISC-V cores with vector / tensor units, targeted at AI inference.

Microchip / Western Digital / Codasip / Imagination Technologies / NVIDIA (internal use): various other implementations, often customized.

This is a much wider field than ARM's or x86-64's, reflecting RISC-V's open licensing.

02.In-Order vs. Out-of-Order

The lower end of the RISC-V product spectrum is in-order — single-issue or dual-issue scalar pipelines with simple branch prediction. These cores (SiFive E-series, T-Head C906, Microchip's PIC64-RISC-V) target microcontrollers, IoT, and embedded use. Performance-per-MHz is modest (similar to ARM Cortex-M or Cortex-R).

For Linux-class workloads, OoO is essential. The cores worth detailed discussion are 3-wide and up. Below we step through the canonical pipeline of a representative high-end RISC-V core. The structure mirrors the AArch64 cores discussed in Chapter 41.

03.Front End

Fetch

Modern RISC-V high-performance cores fetch 32-64 bytes per cycle from a 64 KiB or larger L1 instruction cache. Because of the C extension, a 32-byte fetch contains anywhere from 8 to 16 instructions (depending on the mix of compressed and full instructions).
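The 8-to-16 range follows from how RISC-V marks instruction length: a 16-bit parcel whose two low bits are 11 begins a 32-bit instruction, and any other value marks a compressed one. The sketch below counts instruction starts in a fetch block under two simplifying assumptions that are not from the text above: the block begins on an instruction boundary, and the longer (48-bit and up) encodings are ignored. The function name is illustrative.

```python
def count_instructions(fetch_block: bytes) -> int:
    """Count whole instructions in a fetch block.

    Assumes the block starts on an instruction boundary and that only
    16-bit and 32-bit encodings occur (longer encodings are ignored).
    """
    count = 0
    offset = 0
    while offset + 2 <= len(fetch_block):
        parcel = int.from_bytes(fetch_block[offset:offset + 2], "little")
        # Low two bits == 0b11 marks a 32-bit instruction; anything else is 16-bit.
        size = 4 if (parcel & 0b11) == 0b11 else 2
        if offset + size > len(fetch_block):
            break  # instruction straddles the end of the fetch block
        count += 1
        offset += size
    return count


NOP = (0x00000013).to_bytes(4, "little")   # addi x0, x0, 0
C_NOP = (0x0001).to_bytes(2, "little")     # c.nop

all_full = NOP * 8             # 32 bytes of 32-bit instructions
all_compressed = C_NOP * 16    # 32 bytes of 16-bit instructions
mixed = NOP * 4 + C_NOP * 8    # 32 bytes, half and half
```

A real fetch unit has to do this length marking in parallel across the whole block, and also handle a 32-bit instruction that straddles two fetch blocks, which the loop above simply stops at.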

Fetch is decoupled from decode via a fetch queue. The front end runs ahead of the rest of the pipeline as long as the branch predictor stays accurate.

Branch Prediction

The branch predictor has the same components seen everywhere:

A direction predictor (TAGE-like or perceptron, multi-table).
An indirect target predictor (ITTAGE).
A return-address stack.
A branch target buffer.
Loop predictor optionally.
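For the return-address stack in particular, the RISC-V unprivileged specification defines hints based on which registers a jalr uses: x1 (ra) and x5 (t0) count as link registers. A minimal sketch of that decision, plus a toy stack, might look like the following. The class, its depth of 16, and its drop-oldest overflow policy are illustrative assumptions, not a description of any specific core.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0) are treated as link registers


def ras_action_jalr(rd: int, rs1: int) -> str:
    """Return the RAS action hinted by a jalr with the given rd and rs1."""
    rd_link = rd in LINK_REGS
    rs1_link = rs1 in LINK_REGS
    if not rd_link:
        return "pop" if rs1_link else "none"
    if not rs1_link or rd == rs1:
        return "push"
    return "pop_then_push"  # both are link registers and they differ


class ReturnAddressStack:
    """Toy fixed-depth RAS (hypothetical; overflow drops the oldest entry)."""

    def __init__(self, depth: int = 16):
        self.depth = depth
        self.entries = []

    def push(self, return_addr: int) -> None:
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_addr)

    def pop(self):
        # Returns None when the stack is empty (no prediction available).
        return self.entries.pop() if self.entries else None
```

For example, `ret` is `jalr x0, 0(x1)`, so `ras_action_jalr(0, 1)` yields a pop, while an indirect call through a0 (`jalr x1, 0(x10)`) yields a push.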

Sizes scale with the core: a small embedded core has BTB entries in the hundreds; a server core has tens of thousands. The Veyron V1's BTB is in the multi-thousand-entry range; SiFive P870's is similar.

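A subtle RISC-V point: the ISA has no condition flags. Conditional branches (beq, bne, blt, bge, and their unsigned forms) compare two registers directly, so the compare and the branch are one instruction rather than a flag-setting compare followed by a flag-reading branch. For the predictor this makes no difference: modern predictors work from the PC and branch history, not from what the branch tests. Where the absence of flags does cost instructions is in idioms that other ISAs build on flags, such as carry propagation in multi-word arithmetic or overflow checks. There the instruction count is sometimes higher than ARM's, which a wider front end compensates for.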

Decode

Decoders are typically 4-wide (P670, C920) up to 6-wide (P870) and 8-wide (Veyron V1, Ascalon). Each decoder slot can handle either a 32-bit or a 16-bit (compressed) instruction. C-extension instructions are typically expanded into their 32-bit equivalents at decode and then processed uniformly.
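As an illustration of the expansion step, the sketch below converts one compressed instruction, c.addi, into its 32-bit addi equivalent using the standard encodings. It handles only this single opcode; a real expander covers the whole C extension, and the function name is illustrative.

```python
def expand_c_addi(insn16: int) -> int:
    """Expand a 16-bit c.addi into the equivalent 32-bit addi rd, rd, imm."""
    if (insn16 & 0b11) != 0b01 or (insn16 >> 13) != 0b000:
        raise ValueError("not a c.addi encoding")
    rd = (insn16 >> 7) & 0x1F
    # imm[5] is bit 12, imm[4:0] is bits 6:2; the 6-bit value is sign-extended.
    imm = (((insn16 >> 12) & 1) << 5) | ((insn16 >> 2) & 0x1F)
    if imm & 0x20:
        imm -= 0x40
    # I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode (OP-IMM = 0x13).
    return ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0x13


# c.addi sp, -16 (0x1141) expands to addi sp, sp, -16 (0xff010113).
expanded = expand_c_addi(0x1141)
```

After this step the rest of the pipeline sees only 32-bit forms, which is why compressed code costs decode complexity but no back-end complexity.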

Some cores fuse common instruction pairs at decode. Examples:

lui + addi (form a 32-bit constant) fused into a single µop.
auipc + addi (PC-relative address) fused.
auipc + load (PC-relative load) fused.
slli + add (scaled-index) fused.

Without these fusions, RISC-V's instruction count for some operations would be higher than ARM's. With them, the µop count after decode is comparable.
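A decode-stage fusion check can be sketched as a pairwise scan over adjacent instructions. The version below is deliberately conservative: it fuses only when the second instruction both reads and overwrites the first one's destination, so the intermediate value never becomes architecturally visible. The Insn tuple, the pair table, and the greedy policy are illustrative assumptions; real cores differ in which pairs they fuse and under what conditions.

```python
from typing import List, NamedTuple, Tuple


class Insn(NamedTuple):
    op: str
    rd: int
    rs1: int      # 0 when unused (e.g. lui, auipc)
    rs2: int = 0  # 0 when unused (e.g. addi, ld)


# Pairs from the list above; "ld" stands in for any load.
FUSIBLE = {("lui", "addi"), ("auipc", "addi"), ("auipc", "ld"), ("slli", "add")}


def can_fuse(first: Insn, second: Insn) -> bool:
    if (first.op, second.op) not in FUSIBLE or first.rd == 0:
        return False
    reads_first = first.rd in (second.rs1, second.rs2)
    overwrites_first = second.rd == first.rd
    return reads_first and overwrites_first


def fuse(insns: List[Insn]) -> List[Tuple[Insn, ...]]:
    """Greedily group adjacent fusible pairs; each tuple is one micro-op."""
    uops = []
    i = 0
    while i < len(insns):
        if i + 1 < len(insns) and can_fuse(insns[i], insns[i + 1]):
            uops.append((insns[i], insns[i + 1]))
            i += 2
        else:
            uops.append((insns[i],))
            i += 1
    return uops


program = [
    Insn("lui", 10, 0), Insn("addi", 10, 10),       # 32-bit constant into a0
    Insn("slli", 11, 12), Insn("add", 11, 11, 13),  # scaled index into a1
    Insn("add", 5, 6, 7),                           # not part of any pair
]
```

Here five instructions become three micro-ops, which is the effect the paragraph above describes.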

Renaming

Standard register renaming. The architectural integer register file has 32 registers; high-end cores have physical register files of 200-400 entries (Veyron V1 at the high end). Floating-point and vector renaming is separate, with its own physical pool.

Several cores use zero-cycle moves for mv instructions — the rename table is updated to point both source and destination to the same physical register, no execution needed.
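The zero-cycle move can be sketched with a rename map plus reference counts on physical registers, so that a physical register shared by two architectural names is freed only when the last mapping goes away. This is a simplified model under stated assumptions: registers are released at rename time rather than at commit, x0 is not special-cased, and the sizes are arbitrary. None of it describes a particular core.

```python
class RenameTable:
    """Toy rename table with move elimination (hypothetical sizes)."""

    def __init__(self, num_arch: int = 32, num_phys: int = 64):
        self.map = list(range(num_arch))             # arch reg -> phys reg
        self.free = list(range(num_arch, num_phys))  # unallocated phys regs
        self.refcount = [0] * num_phys
        for phys in self.map:
            self.refcount[phys] += 1

    def _remap(self, rd: int, phys: int) -> None:
        old = self.map[rd]
        self.map[rd] = phys
        self.refcount[phys] += 1
        self.refcount[old] -= 1
        if self.refcount[old] == 0:
            self.free.append(old)  # real hardware defers this to commit

    def rename_dest(self, rd: int) -> int:
        """Ordinary instruction writing rd: allocate a fresh physical register."""
        new = self.free.pop(0)
        self._remap(rd, new)
        return new

    def rename_move(self, rd: int, rs: int) -> int:
        """mv rd, rs: alias rd to rs's physical register; nothing executes."""
        self._remap(rd, self.map[rs])
        return self.map[rd]
```

After `rename_move(6, 5)`, both x6 and x5 point at the same physical register, and that register returns to the free list only once both have been overwritten.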

04.Out-of-Order Engine

Reorder Buffer

ROB sizes:

C910: ~200 entries.
C920: 256 entries.
P670: 256 entries.
P870: 384 entries.
Veyron V1: ~600+ entries (server-class).
Ascalon: similarly large.

For comparison: AMD Zen 4 has ROB ~320; Intel Golden Cove ~512; ARM Cortex-X4 ~384; Apple M3 P-core ~600+. RISC-V cores have rapidly closed the ROB-size gap.

Scheduler (Issue Queues)

Most cores use distributed issue queues, one per execution-unit cluster, with typically 30-80 entries each. The wakeup-and-select logic is the standard scheduling mechanism described in Chapter 26.

Execution Units

A typical 4-wide OoO RISC-V core has:

4 integer ALUs (one or two with shifter; one with multiply; one with branch resolution).
2 load/store units (some configurations 1 load + 1 store; better cores 2 loads + 1 store, or 2 of each).
2 FP/SIMD units (fused multiply-add capable).
1 vector unit (in cores with V).

A 6-wide core (P870-class) might have 6 ALUs, 3 LSUs, 2-4 FP/vector pipes. The Veyron V1, at 8-wide, has more aggressive parallelism throughout.

Vector Engines

The V extension's implementation is where cores differ most. Two general approaches:

Vector microarchitecture A: dedicated vector pipe. A separate execution unit, possibly with its own register file. Vector operations issue in chunks (one VLEN per chunk), iterating internally for longer vectors. Examples: SpacemiT K1, T-Head C920.

Vector microarchitecture B: vectors on existing FP pipes. The vector unit shares physical lanes with the FP/SIMD unit, dispatching multiple cycles for longer vectors. Cheaper area; lower peak vector throughput.

Critical questions for V performance:

VLEN: how many bits per vector register? 128, 256, 512, etc.
DLEN (data path width): how many bits processed per cycle? May be less than VLEN; if VLEN=256 and DLEN=128, one operation takes 2 cycles.
LMUL handling: does the core use LMUL to enlarge effective registers, or to split into multiple uops? Typically LMUL=2 makes operations take twice as long.

The SpacemiT K1, for instance, has VLEN=256 and DLEN=128 — usable but not blazing. The Tenstorrent / Ventana flagships push DLEN higher.
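The interaction of VLEN, DLEN, and LMUL reduces to simple arithmetic. The sketch below assumes a pipelined vector unit in which each register of an LMUL group takes ceil(VLEN/DLEN) datapath passes, and it restricts LMUL to whole numbers; both are modeling assumptions, and the function names are illustrative. VLMAX = LMUL × VLEN / SEW is the standard definition from the V specification.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """Maximum elements per vector instruction: LMUL * VLEN / SEW."""
    return (vlen_bits * lmul) // sew_bits


def passes_per_vector_op(vlen_bits: int, dlen_bits: int, lmul: int = 1) -> int:
    """Datapath passes one vector instruction occupies (integer LMUL only)."""
    passes_per_register = -(-vlen_bits // dlen_bits)  # ceiling division
    return lmul * passes_per_register


# The K1-style configuration quoted above: VLEN=256, DLEN=128.
k1_lmul1 = passes_per_vector_op(256, 128)          # 2 passes
k1_lmul2 = passes_per_vector_op(256, 128, lmul=2)  # 4 passes
k1_fp32_elems = vlmax(256, 32)                     # 8 elements at SEW=32
```

This counts occupancy of the vector pipe, not latency: a longer-occupancy instruction can still overlap with others if the core has more than one vector pipe.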

Memory Subsystem

L1 caches are 32-64 KiB I and D, typically 4-8 way associative. L2 is either private per core or shared per cluster, 256 KiB - 2 MiB. Where an L3 is present, it is shared across cores and grows with the chip:

BananaPi BPI-F3 (SpacemiT K1): 1 MiB shared L2, no L3.
LicheePi 4A (T-Head C910 quad-core): 1 MiB shared L2.
Ventana V1 systems: large shared L3 (tens of MiB), server-class.
SiFive P870 reference systems: configurable, typical 2-4 MiB shared L3.
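The L1 sizes and associativities quoted above translate into set counts and address-bit splits as follows. The 64-byte line size is an assumption (it is common, but not stated in this chapter), and the function names are illustrative.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes power-of-two sizes; line_bytes=64 is an assumed default.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, size_bytes: int, ways: int, line_bytes: int = 64):
    """Split an address into (tag, set index, line offset)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


l1_32k_8way = cache_geometry(32 * 1024, 8)   # 64 sets
l1_64k_4way = cache_geometry(64 * 1024, 4)   # 256 sets
```

A 32 KiB, 8-way cache has 4 KiB per way, so its index and offset bits fit inside a 4 KiB page offset, which is one reason that particular combination is popular for L1 designs.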

Cache coherence uses the standard MESI/MOESI variants over a chip-internal interconnect. TileLink, which originated at UC Berkeley and is used by SiFive, is one common interconnect; AMBA CHI is also seen in some designs (especially when the IP integrates with the ARM ecosystem).

Memory consistency: RVWMO (RISC-V Weak Memory Ordering). Loads and stores can be reordered freely subject to dependencies; FENCE provides explicit barriers. The Ztso extension provides TSO (total store ordering, x86-style) for better x86 binary translation; some implementations (Ventana, possibly others) include it as an option.
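The practical difference between the two models shows up in the classic message-passing test: one thread writes data and then a flag, the other reads the flag and then the data. The sketch below is a toy operational model, not the formal RVWMO or Ztso definition: it lets each thread execute its accesses in any order the model allows, interleaves the threads over a single shared memory, and collects the observable results. With no fences or dependencies, the weak model allows "flag seen, data stale"; the TSO model does not. All names are illustrative.

```python
from itertools import permutations

# Operations are ("st", address, value) or ("ld", address, result_name).
T0 = [("st", "data", 1), ("st", "flag", 1)]
T1 = [("ld", "flag", "r_flag"), ("ld", "data", "r_data")]


def may_reorder(earlier, later, model):
    """May `later` execute before `earlier` (program order) in this toy model?"""
    if earlier[1] == later[1]:
        return False  # same address: keep program order
    if model == "wmo":
        return True   # no fences or dependencies in this test
    if model == "tso":
        return earlier[0] == "st" and later[0] == "ld"  # only store -> load
    raise ValueError(model)


def allowed_orders(thread, model):
    orders = []
    for perm in permutations(range(len(thread))):
        legal = all(
            may_reorder(thread[perm[j]], thread[perm[i]], model)
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
            if perm[i] > perm[j]  # a later op executed before an earlier one
        )
        if legal:
            orders.append([thread[k] for k in perm])
    return orders


def interleavings(a, b):
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(model):
    """Set of (r_flag, r_data) results reachable under the given model."""
    results = set()
    for order0 in allowed_orders(T0, model):
        for order1 in allowed_orders(T1, model):
            for trace in interleavings(order0, order1):
                mem = {"data": 0, "flag": 0}
                regs = {}
                for kind, addr, x in trace:
                    if kind == "st":
                        mem[addr] = x
                    else:
                        regs[x] = mem[addr]
                results.add((regs["r_flag"], regs["r_data"]))
    return results
```

On RVWMO hardware, ruling out the (1, 0) outcome takes a FENCE (or acquire/release annotations) in both threads; under Ztso the ordering comes for free, which is what makes it attractive for x86 binary translation.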

05.Concrete Walk-Throughs

SpacemiT K1 (BananaPi BPI-F3)

The most accessible RVV 1.0 hardware as of 2024. 8 cores of SpacemiT X60 design.

Dual-issue, in-order pipeline.
VLEN=256, DLEN=128 (some sources).
32 KiB L1I, 32 KiB L1D per core.
1 MiB L2 shared across the cluster.
1.6 GHz typical clock.

Per-core performance is roughly mid-range Cortex-A55 to A75 territory: respectable but not flagship. The vector unit is functional but not winning benchmarks against AVX-512 systems. As a development platform, however, the K1 is invaluable: it's the first widely available system with V 1.0 silicon, used by compiler and library teams to optimize RVV code.

T-Head C910 / C920

T-Head's C910 was the first widely deployed Linux-capable RISC-V core (used in the Sipeed LicheePi 4A and several Chinese SoCs).

3-wide OoO (C910), 4-wide (C920).
Older C910s had pre-ratification V 0.7.1 (incompatible with V 1.0).
32 KiB L1I, 64 KiB L1D.
1 MiB L2 shared.
1.85 GHz typical (LicheePi 4A); higher in C920.

The C920, which appears in designs competing with the BananaPi BPI-F3, is a meaningful upgrade: V 1.0, H extension, 4-wide. Performance approaches Cortex-A76 in some benchmarks, though it lags in others.

SiFive P870

SiFive's flagship application core, announced in late 2023, sampling in 2024.

6-wide decode, 6-wide issue.
ROB ~280 entries.
VLEN=128, DLEN=128.
Out-of-order load/store with multiple ports.
Hypervisor extension and full RVA22 profile.

SiFive positions the P870 against ARM's Cortex-A720 / A725 — i.e., mid-range mobile / client. Test systems show competitive results in scalar workloads.

Ventana Veyron V1

Targeted at servers and edge AI; chiplet-based for scaling core count.

8-wide OoO.
ROB likely 500+.
Large L1 caches (64 KiB each), private L2 (1 MiB+), shared L3.
Peak frequency 3.6 GHz quoted.

V1 is positioned against server cores in the ARM Neoverse N2/V2 class and Ampere's AmpereOne. Independent silicon-level benchmarks were limited as of late 2024; the architecture is promising but real-world numbers are still emerging.

Tenstorrent Ascalon

Announced in 2023 and presented in detail at Hot Chips. Designed to be a high-IPC application core.

8-wide decode and issue.
Aggressive front end with large BTB.
Vector unit with long vectors.

Ascalon is intended for servers and is part of Tenstorrent's strategy of using RISC-V as the host CPU for their AI accelerators. As of 2025, silicon was sampling.

06.Performance Comparison

Approximate single-thread performance, expressed per GHz so that cores running at different clock speeds can be compared. These numbers come from various benchmarks and should be taken as ballpark only.

Core	SPECint2017 (est., per GHz)	Notes
SpacemiT X60	~3-4	Mid-range mobile-class
T-Head C910	~3.5	Older 3-wide core
T-Head C920	~5	4-wide, V 1.0
SiFive P670	~5-6	4-wide
SiFive P870	~7-8	6-wide
Ventana V1	~9-10	8-wide server class
Tenstorrent Ascalon	~10+ (target)	8-wide
ARM Cortex-A78	~5	reference
ARM Cortex-A720	~6	reference
ARM Neoverse V2	~9	reference
Apple M3 P-core	~13	reference
Intel Raptor Cove	~10-11	reference
AMD Zen 4	~10	reference

The high-end RISC-V cores (P870, V1, Ascalon) are positioning to be competitive with ARM's mid-to-high range and Intel/AMD's contemporary x86 cores. There is currently no shipping RISC-V core that matches Apple's M-series in single-thread performance, but the gap is narrowing rapidly.

07.Software Stack Maturity

Performance is half the story; the software ecosystem determines whether the hardware is usable.

Compilers. GCC and LLVM both have first-class RISC-V back ends. Auto-vectorization for V is improving rapidly in LLVM (mature for typical patterns) and GCC (catching up). Hand-written intrinsics for V are now widely used in libraries.

Operating systems. Linux has had upstream RISC-V support for years; Debian, Fedora, openSUSE, and Ubuntu all ship RISC-V ports. Android RISC-V support is in early stages (Google's Cuttlefish runs on RISC-V; commercial Android phones on RISC-V are not yet shipping). Windows on RISC-V is experimental.

Libraries. glibc, OpenSSL, libjpeg, etc. all run on RISC-V. Optimization for V is uneven: hot kernels in OpenSSL, video codecs (FFmpeg), and ML libraries are getting hand-tuned RVV. General-purpose libraries are mostly still scalar.

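Toolchains. A typical RISC-V development setup uses cross-compilation (e.g., riscv64-linux-gnu-gcc on an x86 host) or native compilation on a slow RISC-V dev board. qemu-system-riscv64 is widely used for full-system emulation. On an x86 host it runs under software emulation (TCG); KVM acceleration applies only when the host is itself a RISC-V machine with the H extension.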

08.Where RISC-V Is and Isn't Competitive

Strengths.

Microcontrollers and embedded: thousands of designs already shipping; mature ecosystem.
Custom silicon: companies (NVIDIA, Western Digital, Tenstorrent, Apple's Secure Enclave-style coprocessors) use RISC-V for control and management cores in their SoCs.
Sovereign / national initiatives: China, India, Europe each have programs to develop RISC-V silicon for strategic reasons.
Research and academia: free, modifiable, with mature simulators (gem5, Spike).

Open challenges.

Single-thread performance: highest-end RISC-V cores trail Apple Mx and top Intel/AMD by ~30%.
Mobile ecosystem: Android and the surrounding software stack are nascent on RISC-V.
Server market: only Ventana and Tenstorrent are visibly targeting this; production deployments are minimal.
ISA fragmentation: profiles help, but in-the-wild RISC-V chips have varied extension subsets, complicating distribution.
Validation and quality: open-source cores have varied verification rigor; commercial cores (SiFive, Ventana) have professional QA but the whole field has less collective silicon-validation experience than ARM or x86.
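The fragmentation problem is easiest to see from the software side: a binary or installer has to check which extensions a given chip actually reports. A minimal sketch of that check, parsing an ISA string of the kind Linux reports (for example rv64imafdcv_zicbom_zba), is shown below. It assumes no version suffixes in the string, and the required set is a made-up application requirement, not the mandatory list of any ratified profile.

```python
def parse_isa(isa: str) -> set:
    """Parse an ISA string into a set of extension names.

    Assumes no version numbers (e.g. "i2p1") appear in the string.
    """
    isa = isa.lower()
    if not isa.startswith(("rv32", "rv64")):
        raise ValueError("unrecognized ISA string: " + isa)
    base, *multi = isa[4:].split("_")
    exts = set()
    for i, ch in enumerate(base):
        if ch in "zsx":
            # A multi-letter extension may follow the single letters directly.
            exts.add(base[i:])
            break
        if ch == "g":
            exts.update("imafd")  # G is shorthand for IMAFD + Zicsr + Zifencei
            exts.update({"zicsr", "zifencei"})
        else:
            exts.add(ch)
    exts.update(name for name in multi if name)
    return exts


def missing_extensions(isa: str, required: set) -> set:
    """Extensions in `required` that the ISA string does not advertise."""
    return required - parse_isa(isa)


# Hypothetical requirement for an RVV-optimized library build.
REQUIRED = {"i", "m", "a", "f", "d", "c", "v", "zba", "zbb"}
```

For instance, `missing_extensions("rv64imafdc_zicsr", REQUIRED)` reports that V, Zba, and Zbb are absent, which is the situation a distribution faces on a pre-V-1.0 board.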

The trajectory is clear: adoption is growing, performance is rising, the software ecosystem is maturing. Whether RISC-V displaces ARM in any given segment depends on factors beyond technology — licensing economics, geopolitics, ecosystem inertia.

09.Summary of Part IX

Part IX has covered RISC-V end to end:

Chapter 42 introduced RISC-V's history, design philosophy, naming, and ecosystem.
Chapter 43 went through the unprivileged ISA: base integer instructions, M, A, F/D, C, B, V extensions, and calling conventions.
Chapter 44 covered the privileged architecture: privilege modes, CSRs, traps, virtual memory (Sv39 et al.), PMP, SBI, the hypervisor extension, and the boot process.
Chapter 45 surveyed micro-architecture: the spectrum of cores from microcontrollers to flagship server-class designs, with concrete walk-throughs of contemporary silicon.

A reader who has worked through Parts I-IX has now seen the three major contemporary ISAs — x86-64, AArch64, and RISC-V — at architectural and micro-architectural depth, including their system-level definitions and concrete implementations. The strengths of each are now visible: x86-64's enormous installed base and binary compatibility, AArch64's clean modern design and dominant mobile ecosystem, RISC-V's openness and flexibility.

The remaining parts of the book step back to broader concerns: the operating system and firmware interface (Part X), advanced topics in caching, prediction, security, and physical implementation (Part XI), and the broader compute ecosystem beyond the CPU — GPUs, accelerators, embedded systems, and emerging architectures (Part XII). Several threads first introduced in earlier chapters — hardware-software co-design, security (Spectre and friends), the limits of microarchitecture — will be picked up and developed.

The next chapter, beginning Part X, is the operating-system view of the CPU: what the kernel sees, how syscalls work in detail, and how user/kernel transitions are implemented across the three ISAs we've now studied.