Part VII: Advanced and Frontier

Reconfigurable and Emerging Architectures

May 16, 2026·14 min read·advanced


The final chapter of the main text looks beyond conventional CPUs and accelerators to two groups of architectures: those already in commercial use but outside the mainstream (FPGAs, dataflow chips), and research or emerging technologies that may shape future computing (neuromorphic, quantum, optical, in-memory). Some have decades of development behind them; others are speculative. All of them illustrate that the conventional CPU is not the only path to compute, and that the post-Moore landscape may look quite different.

01. Reconfigurable Computing: FPGAs

A Field-Programmable Gate Array is a chip whose digital logic can be reprogrammed by configuration data after manufacturing. Where an ASIC bakes its logic permanently in silicon, an FPGA's logic is determined by a "bitstream" loaded into configuration memory. The same FPGA can be a video processor today and a network packet filter tomorrow.

The basic FPGA elements:

LUTs (Look-Up Tables): tiny RAMs (typically 6 inputs, 1 output) that implement arbitrary Boolean functions. A 6-input LUT stores 2^6 = 64 configuration bits, one per input combination; loading those bits selects the gate function it implements.

Flip-Flops: registers paired with LUTs to build sequential logic.

Routing fabric: a programmable interconnect that connects LUTs and FFs in arbitrary topologies. The routing matrix is most of the FPGA's silicon area.

Block RAMs: small dedicated RAMs (typically 18 Kb or 36 Kb each) for memory storage.

DSP slices: hardened multiply-accumulate units, more efficient than building multipliers from LUTs.

Hard IP: PCIe controllers, DDR memory controllers, Ethernet MACs, embedded ARM cores (in modern parts), all implemented as fixed-function blocks within the FPGA.
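To make the LUT idea concrete, here is a minimal software model, offered as a sketch rather than a description of any vendor's fabric. The class name `LUT` and its methods are hypothetical names chosen for illustration. The point it shows is that "programming" a LUT means filling in a truth table, and evaluation is a single indexed read.

```python
class LUT:
    """Software model of a k-input look-up table: 2**k config bits, one output bit."""

    def __init__(self, k, config_bits):
        assert 0 <= config_bits < (1 << (1 << k)), "config must fit in 2**k bits"
        self.k = k
        self.config_bits = config_bits  # plays the role of the bitstream contents

    @classmethod
    def from_function(cls, k, fn):
        """Fill the table by enumerating every input combination of fn."""
        config = 0
        for address in range(1 << k):
            inputs = [(address >> i) & 1 for i in range(k)]
            if fn(*inputs):
                config |= 1 << address
        return cls(k, config)

    def evaluate(self, *inputs):
        """The inputs form an address; the output is the stored bit at that address."""
        assert len(inputs) == self.k
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return (self.config_bits >> address) & 1


# The same structure becomes a different gate when its configuration changes.
xor6 = LUT.from_function(6, lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
and6 = LUT.from_function(6, lambda a, b, c, d, e, f: a & b & c & d & e & f)

print(xor6.evaluate(1, 0, 1, 1, 0, 0))   # odd number of ones -> 1
print(and6.evaluate(1, 1, 1, 1, 1, 1))   # all inputs high -> 1
print(bin(and6.config_bits).count("1"))  # only 1 of the 64 entries is set -> 1
```

A real FPGA tool flow does the equivalent of `from_function` for millions of LUTs and then has to solve the much harder placement and routing problem on top.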

A modern flagship FPGA (AMD/Xilinx Versal Premium, Intel/Altera Agilex 9) contains millions of LUTs, thousands of DSP slices, terabits per second of transceiver bandwidth, and embedded ARM cores. Power and cost are significant: a flagship FPGA can dissipate 100 W or more and cost tens of thousands of dollars.

When FPGAs Win

FPGAs occupy a niche between ASICs and CPUs/GPUs:

Low-volume custom logic: when you need ASIC-like specialization but the volume doesn't justify ASIC NRE costs.

Bit-level operations: image processing, networking, crypto. CPUs are slow at fine-grained bit manipulation; FPGAs do it natively.

Very deep pipelining: pipelines hundreds of stages long; throughput per area can exceed CPUs by orders of magnitude.

Latency-critical: high-frequency trading, certain control applications. An FPGA can respond to a network packet in roughly 1 µs or less, versus roughly 10-50 µs through a conventional software network stack on a CPU.

Hardware prototyping: design and test ASIC-bound logic before committing to silicon.

Networking: SmartNICs, packet inspection, stateful firewalls. Major cloud providers deploy FPGAs at scale: Microsoft uses them for packet processing (Catapult / AccelNet), and AWS rents them to customers as F1 instances.
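The deep-pipelining point in the list above comes down to simple cycle arithmetic. The sketch below assumes an idealized pipeline: one cycle per stage, a new item accepted every cycle, and no stalls. The function names are illustrative.

```python
def pipelined_cycles(num_items, num_stages):
    """Cycles to finish num_items when a new item enters the pipeline every cycle."""
    if num_items == 0:
        return 0
    # The first item needs num_stages cycles; each later item completes one cycle after it.
    return num_stages + (num_items - 1)


def sequential_cycles(num_items, num_stages):
    """Cycles if each item must clear every stage before the next one starts."""
    return num_items * num_stages


items, stages = 1_000_000, 200
print(pipelined_cycles(items, stages))   # 1000199
print(sequential_cycles(items, stages))  # 200000000
```

For long input streams the fill time of the pipeline is negligible, so throughput approaches one result per cycle regardless of how deep the pipeline is. Latency for any single item is still the full pipeline depth.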

When FPGAs Lose

For floating-point throughput, GPUs win comfortably. For general-purpose computation, CPUs win comfortably. FPGAs require specialized tools and expertise; "FPGA programming" is more like hardware design than software development. Iteration is slow (compiles take hours); debugging is painful (you can't printf in hardware).

The FPGA market is therefore mature but specialized. AMD's acquisition of Xilinx (2022) and Intel's acquisition of Altera (2015), which Intel has since moved to separate back out as a standalone company, reflect both the strategic importance and the limited overall size of the market.

Coarse-Grained Reconfigurable Arrays

Coarse-Grained Reconfigurable Arrays (CGRAs) are a research line aimed at improving on FPGA efficiency. Instead of bit-level LUTs, a CGRA has an array of word-level (8/16/32-bit) ALUs connected by a configurable interconnect. It is less flexible than an FPGA but much more efficient for arithmetic-heavy workloads.

Examples: Wave Computing (defunct), SambaNova's RDU, Cerebras's compute fabric, some MIT research designs. The idea overlaps with dataflow accelerators.

02. Dataflow Architectures

A different angle on reconfigurability: build the chip not around instructions but around dataflow graphs.

The classical von Neumann model executes instructions sequentially, fetching from memory. A dataflow architecture instead instantiates the computation graph in hardware: each node of the graph is a hardware function unit; edges between nodes are physical wires carrying values; the computation runs as data flows through.

For a given workload, this can be far more efficient — no instruction fetch overhead, no scheduler, no branch prediction. The cost: the chip must be "configured" for the workload (compiling the dataflow graph onto the hardware), which is a slow and complex process.
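The execution model described above can be sketched in software. This is a toy token-driven interpreter, not a model of any particular chip: the `Node`, `connect`, and `run` names are hypothetical. Each node fires as soon as all of its operand slots hold a value, and there is no program counter deciding the order.

```python
import operator
from collections import deque


class Node:
    """One function unit in the graph, with an operand latch per input edge."""

    def __init__(self, name, op, num_inputs):
        self.name = name
        self.op = op
        self.slots = [None] * num_inputs
        self.consumers = []  # (destination node, input port): the outgoing "wires"

    def ready(self):
        return all(value is not None for value in self.slots)


def connect(src, dst, port):
    src.consumers.append((dst, port))


def run(initial_tokens):
    """Deliver tokens until none remain; return results of nodes with no consumers."""
    results = {}
    tokens = deque(initial_tokens)
    while tokens:
        node, port, value = tokens.popleft()
        node.slots[port] = value
        if node.ready():  # firing rule: all operands present
            result = node.op(*node.slots)
            node.slots = [None] * len(node.slots)
            if not node.consumers:
                results[node.name] = result
            for dst, dst_port in node.consumers:
                tokens.append((dst, dst_port, result))
    return results


# Graph for (a + b) * (c - d)
add = Node("add", operator.add, 2)
sub = Node("sub", operator.sub, 2)
mul = Node("mul", operator.mul, 2)
connect(add, mul, 0)
connect(sub, mul, 1)

print(run([(add, 0, 2), (add, 1, 3), (sub, 0, 10), (sub, 1, 4)]))  # {'mul': 30}
```

In hardware the `add` and `sub` nodes would fire in parallel, since neither depends on the other; the queue here serializes them only because this is a single-threaded simulation.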

Commercial and research examples:

  • SambaNova RDU (Reconfigurable Dataflow Unit): full chip designed for AI dataflow.
  • Cerebras WSE: the wafer-scale chip is essentially a vast dataflow-friendly fabric.
  • Tenstorrent's tile-based architecture: each tile is a small RISC-V processor with vector / matrix engines connected by a routable mesh.
  • Groq LPU: deterministically scheduled; the compiler statically schedules every operation cycle by cycle.

Whether dataflow displaces conventional architectures or remains a niche depends on tooling: compiler support is the bottleneck, not hardware.

03. Neuromorphic Computing

A different paradigm: build chips that mimic the structure of biological neural networks. Spiking neural networks (SNNs) communicate via discrete spikes rather than continuous values. Synapses connect neurons with configurable weights; neurons accumulate input until they spike.
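The accumulate-until-spike behavior can be illustrated with a leaky integrate-and-fire neuron, one of the simplest spiking models. This is a minimal sketch; the parameter values (`weight`, `leak`, `threshold`) are arbitrary assumptions chosen to make the behavior visible, not values from any chip.

```python
def simulate_lif(input_spikes, weight=0.4, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    input_spikes: sequence of 0/1 values, one per time step.
    Returns the neuron's output spike train as a list of 0/1 values.
    """
    potential = 0.0
    output = []
    for spike in input_spikes:
        # The membrane potential decays each step and is bumped by weighted input.
        potential = potential * leak + weight * spike
        if potential >= threshold:
            output.append(1)
            potential = 0.0  # reset after firing
        else:
            output.append(0)
    return output


dense = [1, 1, 1, 0, 0, 1, 1, 1, 1]
sparse = [1, 0, 0, 1, 0, 0, 1, 0, 0]

print(simulate_lif(dense))   # [0, 0, 1, 0, 0, 0, 0, 1, 0]
print(simulate_lif(sparse))  # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Closely spaced inputs push the neuron over threshold, while the same number of widely spaced inputs leak away before they accumulate. When nothing spikes, nothing needs to be computed, which is where the event-driven energy savings come from.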

Neuromorphic chips:

  • IBM TrueNorth (2014): 1 million neurons, 256 million synapses; 70 mW power.
  • Intel Loihi / Loihi 2 (2017, 2021): research neuromorphic chip; spiking neurons with on-chip learning.
  • BrainChip Akida: commercial spiking neural network accelerator for edge AI.
  • SpiNNaker (2018): a massively parallel simulator of neural networks (1 million ARM cores).

The motivation: biological brains are extraordinarily energy-efficient (~20W for the human brain doing tasks far beyond GPU capability). Silicon brains modeled on this principle might achieve comparable efficiency.

Realities: neuromorphic computing has been promising for decades but has not yet found killer commercial workloads. Mainstream deep learning uses non-spiking, dense matrix multiplication, which traditional GPU/TPU accelerators handle well. Neuromorphic may shine in always-on, low-power inference (sensor processing, hearing aids) where its event-driven nature wins.

04. Analog and In-Memory Computing

Digital computation has dominated for decades, but the energy costs of moving data are pushing renewed interest in analog and in-memory approaches.

Analog compute: perform multiplication and addition in the analog domain using current and voltage rather than discrete bits. Mythic AI's analog matrix processor uses flash-memory cells as analog multipliers: weights are stored as conductances and inputs are applied as voltages, so each cell contributes a current equal to conductance times voltage (Ohm's law), and the currents summed on a shared output line are proportional to the dot product.

Pros: massive density, very low energy per operation. Cons: limited precision (analog noise), calibration challenges, manufacturing variability.
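The trade-off between density and precision can be seen in a small numerical model of an analog crossbar. This is an idealized sketch under stated assumptions: units are arbitrary, device variation is modeled as multiplicative Gaussian noise on each conductance, and the function name `crossbar_mvm` is hypothetical. It is not a model of any specific product.

```python
import random


def crossbar_mvm(conductances, voltages, noise_std=0.0, rng=None):
    """Column currents of a crossbar: I[j] = sum_i G[i][j] * V[i].

    conductances: rows are input lines, columns are output lines.
    noise_std: relative standard deviation of per-cell conductance error (assumed model).
    """
    rng = rng or random.Random(0)
    currents = []
    for j in range(len(conductances[0])):
        total = 0.0
        for i, v in enumerate(voltages):
            deviation = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            g = conductances[i][j] * (1.0 + deviation)
            total += g * v  # Ohm's law per cell, summed on the shared column wire
        currents.append(total)
    return currents


G = [[1.0, 0.5],
     [2.0, 0.0],
     [0.5, 1.5]]
V = [0.2, 0.1, 0.4]

ideal = crossbar_mvm(G, V)
noisy = crossbar_mvm(G, V, noise_std=0.05, rng=random.Random(42))

print([round(x, 6) for x in ideal])  # [0.6, 0.7]
print([round(x, 3) for x in noisy])  # close to the ideal values, but not equal
```

The whole matrix-vector product happens in one step in the physical device, which is the source of the energy advantage. The noisy run shows why such designs typically target low-precision inference rather than exact arithmetic.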

In-Memory Computing (PIM, Processing-in-Memory): do compute inside or adjacent to memory, eliminating the data-movement cost. Several flavors:

  • HBM-PIM (Samsung): logic units inside HBM stacks for accelerated AI inference.
  • PIM DIMMs (UPMEM; Samsung's AXDIMM) and GDDR6-AiM (SK Hynix): processing logic placed on DRAM modules or inside the DRAM chips themselves.
  • 3D-stacked logic-on-memory: future architectures with logic dies bonded to memory dies.

The motivation: in modern systems, moving a 64-bit value across a chip can cost more energy than the arithmetic on it. In-memory approaches eliminate (or reduce) this cost.

Adoption is still early. Many announced products have struggled to find broad applicability — programming models for in-memory compute are immature, and the architectural advantages haven't translated to clear cost wins yet.

05. Optical Computing

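Optical links do not win mainly on raw signal speed: light in glass fiber propagates at about two-thirds of c, while an electrical signal on a typical FR-4 PCB trace propagates at roughly half of c, so the latency difference is modest. The real advantages are bandwidth and loss. For large interconnects (chip-to-chip, rack-scale), optical links carry far more data over longer distances, at lower energy per bit, than electrical links.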

Optical interconnect is the most mature application:

  • Long-distance datacenter networking has been optical for decades (single-mode fiber, transceivers).
  • Short-distance within-rack and on-board optics are rapidly displacing copper at 400G and 800G.
  • Co-packaged optics: integrate the photonic transceivers next to (or on) the switch / accelerator chip itself. Reduces electrical-to-optical conversion costs.

Photonic compute is more speculative:

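  • Lightmatter, Lightelligence: matrix multiplication via photonic interferometers. (PsiQuantum also builds on silicon photonics, but for quantum computing rather than classical matrix multiplication.)
  • Optical neural networks: weight values encoded as fixed photonic interference patterns, so the multiply-accumulate happens as light propagates through the device and takes essentially no extra time; converting inputs and outputs between the electrical and optical domains still costs time and energy.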

The difficulty: precision, reconfigurability, integration with electrical control. Photonic compute will likely complement rather than replace electrical for the foreseeable future.

06. Quantum Computing

A fundamentally different model: quantum computers use qubits — quantum bits whose state is a superposition of 0 and 1 — and gates that operate on superpositions and entanglement. For certain problems, this offers exponential speedups over classical computation.
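A tiny state-vector simulation shows what superposition and entanglement mean operationally. This is a plain-Python sketch with hypothetical helper names (`kron`, `apply`); it prepares the standard two-qubit Bell state using a Hadamard gate followed by a CNOT.

```python
import math


def kron(a, b):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a)) for k in range(len(b))
    ]


def apply(gate, state):
    """Multiply a gate matrix by a state vector of amplitudes."""
    return [sum(g * amp for g, amp in zip(row, state)) for row in gate]


S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]            # Hadamard: creates an equal superposition
I2 = [[1, 0], [0, 1]]            # identity on one qubit
CNOT = [[1, 0, 0, 0],            # flips the second qubit when the first is 1
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

state = [1, 0, 0, 0]               # |00>, amplitudes ordered |00>, |01>, |10>, |11>
state = apply(kron(H, I2), state)  # Hadamard on the first qubit
state = apply(CNOT, state)         # entangle the two qubits

probabilities = [round(abs(amp) ** 2, 3) for amp in state]
print(probabilities)  # [0.5, 0.0, 0.0, 0.5]
```

The two qubits are only ever measured as 00 or 11, never 01 or 10. Note also that the state vector has 2^n entries for n qubits, which is why classical simulation of this kind stops scaling after a few dozen qubits.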

The current state (early 2026):

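  • Hardware: the largest superconducting devices (IBM) and neutral-atom arrays (Atom Computing) have passed 1,000 physical qubits. Google's superconducting chips and ion-trap systems (IonQ, Quantinuum) have far fewer qubits, on the order of tens to about a hundred, though trapped ions offer higher gate fidelities. Neutral-atom platforms (Atom Computing, QuEra) are gaining ground.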
  • Error rates: physical qubits have ~10⁻³ error per gate; quantum error correction promises ~10⁻⁹ or better at the cost of many physical qubits per logical qubit.
  • Logical qubits: a few demonstrated; full fault-tolerant computers with hundreds or thousands of logical qubits are still years away.
  • Applications: limited so far. Shor's algorithm (factoring) is the famous one but requires very large fault-tolerant systems. Quantum simulation of chemistry and materials is the most plausible near-term application. Optimization (variational algorithms) is being explored but with mixed results.
  • NISQ (Noisy Intermediate-Scale Quantum) era: current devices are too noisy for full algorithms but can run heuristic methods.
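The error-rate figures in the list above imply a large physical-to-logical overhead, which a back-of-the-envelope estimate makes visible. The sketch below uses a commonly quoted heuristic for the surface code, p_L ≈ A · (p / p_th)^floor((d+1)/2), with a qubit count of 2d² − 1 per logical qubit for a rotated distance-d patch. The constants (A = 0.03, p_th = 1%) are assumptions for illustration; real numbers depend on the code, decoder, and hardware.

```python
def logical_error_rate(p_phys, distance, p_threshold=1e-2, prefactor=0.03):
    """Heuristic surface-code logical error rate (assumed constants, not measured data)."""
    return prefactor * (p_phys / p_threshold) ** ((distance + 1) // 2)


def min_distance(p_phys, target):
    """Smallest odd code distance whose heuristic logical error rate meets the target."""
    distance = 3
    while logical_error_rate(p_phys, distance) > target:
        distance += 2
    return distance


def physical_qubits(distance):
    """Data plus measurement qubits in one rotated surface-code patch."""
    return 2 * distance ** 2 - 1


d = min_distance(p_phys=1e-3, target=1e-9)
print(d)                   # 15
print(physical_qubits(d))  # 449
```

Under these assumptions, getting from 10⁻³ to 10⁻⁹ costs several hundred physical qubits per logical qubit, before counting the extra qubits needed for routing and for preparing special states. That is why a machine with 1,000 physical qubits supports only a handful of logical ones.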

Quantum computing is real, but it is not a practical platform for general-purpose programs. It is an accelerator for specific problem types, like a GPU but more specialized. Most workloads (databases, web servers, graphics) will run on classical computers indefinitely.

The relationship between quantum and classical: a quantum computer needs a classical control plane to orchestrate gate operations and measurements. The classical control runs on conventional CPUs. Quantum supplements rather than replaces.

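For cryptography, the planning horizon is shorter. Once a cryptographically relevant fault-tolerant quantum computer exists (estimates range from 10 to 30+ years), Shor's algorithm breaks the most widely deployed public-key schemes (RSA, Diffie-Hellman, elliptic-curve). Data encrypted today can be recorded and decrypted later, so the transition to post-quantum cryptography (lattice-based schemes like Kyber, hash-based signatures) is happening now, ahead of the threat. NIST published its first finalized PQC standards in 2024: ML-KEM (derived from Kyber), ML-DSA, and the hash-based SLH-DSA.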
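To show why hash-based signatures resist quantum attack, here is a sketch of Lamport's one-time signature scheme, the simplest member of that family. It is not one of the NIST-standardized algorithms and is not suitable for production use: each key pair may sign only one message. Its security rests only on the hash function being hard to invert, with no reliance on factoring or discrete logarithms.

```python
import hashlib
import secrets


def sha256(data):
    return hashlib.sha256(data).digest()


def keygen():
    """Private key: 256 pairs of random secrets. Public key: their hashes."""
    private = [[secrets.token_bytes(32) for _ in range(2)] for _ in range(256)]
    public = [[sha256(secret) for secret in pair] for pair in private]
    return private, public


def message_bits(message):
    """The 256 bits of the message digest, most significant bit first."""
    digest = sha256(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message):
    """Reveal one secret from each pair, selected by the corresponding digest bit."""
    return [private[i][bit] for i, bit in enumerate(message_bits(message))]


def verify(public, message, signature):
    """Check that each revealed secret hashes to the matching public value."""
    if len(signature) != 256:
        return False
    return all(
        sha256(signature[i]) == public[i][bit]
        for i, bit in enumerate(message_bits(message))
    )


private_key, public_key = keygen()
signature = sign(private_key, b"firmware v1.2")

print(verify(public_key, b"firmware v1.2", signature))  # True
print(verify(public_key, b"firmware v6.6", signature))  # False
```

Signing reveals half of the private key, which is why a key must never be reused. Practical hash-based schemes build many one-time keys into a tree so that a single public key can cover many signatures.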

07. Chiplets and Heterogeneity Beyond AI

Chapter 55 covered chiplets in current CPU/GPU products. Looking forward, chiplets enable a more compositional model: pick best-of-breed dies and integrate them. The UCIe standard targets this. Future systems may include:

  • A CPU chiplet from one vendor.
  • A GPU chiplet from another.
  • An NPU chiplet from a third.
  • A networking chiplet (smart NIC, DPU).
  • A security chiplet.
  • HBM stacks for memory.

All bonded onto a shared interposer or substrate, presenting as one system. The economic implications are significant: smaller players can produce competitive chiplets without designing a full SoC, and customers can mix and match for specialized needs.

This is largely an aspiration as of 2026; commercial multi-vendor chiplet products are emerging slowly.

08. Beyond Moore: New Devices

Several new transistor and memory technologies are being explored:

GAA (Gate-All-Around) transistors: replacing FinFETs at the 3 nm to 2 nm nodes and beyond, depending on the foundry. Better leakage control, improved performance.

CFET (Complementary FET): stacking n-type and p-type transistors vertically. Density gains.

2D materials (graphene, MoS₂, etc.): research ongoing; not yet at production scale.

Spintronics: using electron spin rather than charge for logic and memory. MRAM (Magnetoresistive RAM) is a commercial product; spin-logic remains research.

Carbon nanotubes: room-temperature high-mobility transistors; MIT and others have demonstrated working processors but commercial scale-up is uncertain.

Memristors and ReRAM: programmable resistance for analog compute and dense non-volatile memory.

Phase-change memory (PCM): non-volatile, faster than NAND. Intel Optane (3D XPoint) is widely reported to have been based on this; the product line was discontinued, but research continues.

Each of these technologies has been "5-10 years away" for many years. Some will happen; predicting which is hard.

09. Software Implications

All this hardware diversity raises a software problem: how do programmers and compilers target it?

Domain-specific languages (DSLs) and frameworks: PyTorch, TensorFlow, JAX, Triton. High-level code lowers to whatever hardware is available. This is the dominant pattern in AI.

Heterogeneous programming models: SYCL, oneAPI, OpenCL, HIP. C++-centric, attempting cross-vendor compatibility.

Compiler infrastructure: MLIR (Multi-Level Intermediate Representation, from LLVM project) provides a framework for representing computations at multiple abstraction levels and lowering through transformations to specific hardware.

Auto-tuning and auto-scheduling: Halide, TVM, Triton's autotune. Compilers explore configurations (tile sizes, loop orders, thread counts) to find the best mapping to the target hardware.
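The configuration search these tools perform can be sketched as a simple loop: generate candidate configurations, time each one, keep the fastest. The example below is a toy in plain Python and does not use the API of Halide, TVM, or Triton; `blocked_matmul` and `autotune` are hypothetical names, and the only tunable parameter is a tile size.

```python
import time


def blocked_matmul(a, b, n, tile):
    """n x n matrix multiply with loop tiling; tile is the tunable parameter."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(candidates, n=48, repeats=2):
    """Time every candidate tile size and return (best_tile, all_timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):  # keep the minimum to reduce timing noise
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings


candidates = [4, 8, 16, 48]
best_tile, timings = autotune(candidates)
print("best tile size on this machine:", best_tile)
```

The winning tile size depends on the machine, so no expected output is shown. In interpreted Python the differences mostly reflect loop overhead rather than cache behavior; production auto-tuners run the same kind of search over compiled kernels, usually with a cost model to prune the space.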

The gap between hardware capability and software ability to use it remains the main bottleneck. Many of the hardware ideas in this chapter are constrained more by software immaturity than by hardware feasibility.

10. Where This Leaves the CPU

Despite all this diversity, the CPU is not going away. Specifically:

The general-purpose CPU runs the OS, the orchestration, the parts of every workload that don't fit the accelerator.

The CPU is the integration point. Even in heavily accelerated workloads, a substantial share of total system effort (by rough estimate, ~30-50%) still goes through the CPU: scheduling, network handling, file I/O, memory management, error handling, logging.

The CPU is the testing ground for new ideas. Vector instructions, dataflow ideas, in-memory hints — many of these will eventually land in CPUs, just as SIMD did in the 1990s.

The CPU anchors the most heterogeneous device of all, the modern SoC: CPU + GPU + NPU + media engines + Secure Enclave on Apple Silicon; a similar mix on Snapdragon; integrated GPU + NPU on AMD APUs. The CPU's surrounding ecosystem is what defines a modern SoC.

ISA evolution continues: SVE on ARM, AVX10 on Intel, vector and crypto extensions on RISC-V. CPUs add new capabilities continuously, even as accelerators take on more work.

11. Conclusion of the Main Text

We've reached the end of Part XII and the main body of the book. We started in Part I with binary numbers and Boolean logic and built up:

  • The basic computer organization (Part II) and instruction set architecture (Part III).
  • The memory hierarchy (Part IV) and the microarchitectural techniques that extract performance — pipelining, branch prediction, OoO, load/store buffers, microcode (Part V).
  • Parallelism in its various forms: ILP, DLP, TLP, with cache coherence and memory consistency (Part VI).
  • Three modern instruction sets in detail: x86-64 (Part VII), AArch64 (Part VIII), RISC-V (Part IX).
  • The system-software interface — OS, firmware, virtualization, security (Part X).
  • Advanced topics that revisited cache, prediction, power, reliability, performance, packaging (Part XI).
  • And finally, the world beyond the CPU: GPUs, accelerators, embedded systems, and the emerging architectures (Part XII).

If you've worked through all of it, you have a comprehensive view of computer architecture as a working discipline: how hardware actually behaves, what trade-offs drive design choices, what tools and methodologies practitioners use, and where the field is heading. You're equipped to read computer architecture papers, vendor optimization guides, and microarchitectural references; to reason about performance from physical first principles; to recognize how hardware shapes software possibilities and vice versa.

What remains is the appendices, which provide reference material: a mathematical and logical refresher (Appendix A), a guide to reading assembly across the three reference ISAs (Appendix B), an ISA comparison summary (Appendix C), and suggested labs and projects to anchor the material in hands-on practice (Appendix D).

Computer architecture has always evolved — from vacuum tubes through transistors and integrated circuits to today's billion-transistor chips. The next generation of breakthroughs will come from a mix of conventional scaling (whatever survives of Moore's Law), specialization (more accelerators, more diverse), packaging innovation (chiplets, optics), and new computational paradigms (in-memory, quantum, neuromorphic). The architects of those breakthroughs need the foundations this book has tried to provide — and they will write the next chapters that we cannot yet predict.

12. Summary

This final chapter has surveyed the architectures beyond conventional CPUs: FPGAs and CGRAs (reconfigurable logic), dataflow engines, neuromorphic chips, analog and in-memory compute, optical interconnects and computing, and quantum computers. Each addresses limitations of conventional architectures in different ways; each has its niches and its open challenges. None replaces the CPU, but together they comprise the post-Moore landscape of computing. The CPU itself continues to evolve — adding vector capability, security features, virtualization — and serves as the integration point for an increasingly heterogeneous compute environment.

Part XII has covered:

  • Chapter 56: GPUs and accelerators — SIMT execution, tensor cores, NPUs, fixed-function blocks, integration models, programming.
  • Chapter 57: Embedded and real-time systems — microcontrollers, RTOSes, WCET analysis, safety-critical certification, low power, embedded buses.
  • Chapter 58: Reconfigurable and emerging — FPGAs, dataflow, neuromorphic, in-memory, optical, quantum, post-Moore directions.

The appendices follow: math/logic refresher, reading assembly, ISA comparison, suggested labs.
