PRISM-Q
A Rust quantum circuit simulator built for speed.
PRISM-Q runs quantum circuits fast by matching each one to the right simulation strategy. It dispatches across eight CPU backends plus an optional CUDA path, optimizes circuits through a multi-pass fusion pipeline, and uses AVX2, FMA, and BMI2 SIMD in the inner loop. Input is OpenQASM 3.0, with backward-compatible 2.0 syntax. The same library handles a two-qubit Bell pair and a thousand-qubit Clifford circuit.
#![allow(unused)] fn main() { use prism_q::CircuitBuilder; let result = CircuitBuilder::new(2).h(0).cx(0, 1).run(42).unwrap(); let probs = result.probabilities.unwrap(); // |00> = 0.5, |11> = 0.5 }
What it does
- Eight CPU backends selected automatically per circuit: statevector, stabilizer (with factored and filtered variants), sparse, MPS, product state, tensor network, and dynamic factored split-state.
- Compiled shot samplers that sample without rebuilding the full statevector each shot, including noisy and detector/QEC paths.
- Clifford+T strategies (stabilizer rank, stochastic and deterministic Pauli propagation) for circuits beyond the reach of a dense statevector.
- Optional CUDA backend for statevector and stabilizer execution.
Where to go next
- Installation and Your First Circuit: a hands-on start.
- Architecture: backends, dispatch tree, fusion pipeline, and SIMD strategy.
- Glossary: terminology used across the documentation.
- API reference: generated Rust API documentation.
- Source and issues: the repository on GitHub.
Installation
PRISM-Q is a Rust library. Add it to a project with Cargo:
cargo add prism-q # Rayon parallelism + faer SVD (default)
cargo add prism-q --no-default-features # single-threaded, minimal dependencies
Or add it to Cargo.toml directly:
[dependencies]
prism-q = "0.16"
Feature flags
| Feature | Default | Enables |
|---|---|---|
parallel | yes | Rayon parallel kernels (≥14 qubits) and the faer SVD path for MPS |
serialization | no | serde derives for circuits and results |
gpu | no | Optional CUDA backend (see the GPU guide) |
The published benchmarks were taken with parallel enabled. Without it, 16+ qubit runs
fall back to single-threaded kernels and are not comparable to the baselines. Disable it
only when you need a minimal-dependency, single-threaded build.
Building from source
git clone https://github.com/AbeCoull/prism-q
cd prism-q
cargo build --release
Running the test suite
cargo nextest run --all-features # unit + integration tests
cargo test --doc --all-features # doctests
cargo clippy --all-targets --all-features -- -D warnings # lint
Use cargo test --all-features if cargo-nextest is unavailable.
Next: build Your First Circuit.
Your First Circuit
There are two ways to build a circuit: the fluent CircuitBuilder API, or by parsing
OpenQASM text.
With the builder
CircuitBuilder chains gate calls and runs the result. This builds a Bell pair, the
two-qubit entangled state (|00⟩ + |11⟩) / √2:
#![allow(unused)] fn main() { use prism_q::CircuitBuilder; let result = CircuitBuilder::new(2) .h(0) .cx(0, 1) .run(42) // seed = 42 .expect("simulation failed"); let probs = result.probabilities.expect("no probabilities"); for i in 0..probs.len() { let p = probs.get(i); if p > 1e-10 { println!("|{i:02b}> = {p:.4}"); } } // |00> = 0.5000 // |11> = 0.5000 }
A larger structurally similar circuit, the 5-qubit GHZ state, renders like this (diagram generated by PRISM-Q's own SVG renderer):
From OpenQASM
The same Bell pair, written in OpenQASM 3.0 and parsed:
#![allow(unused)] fn main() { use prism_q::circuit::openqasm; use prism_q::simulate; let qasm = r#" OPENQASM 3.0; include "stdgates.inc"; qubit[2] q; h q[0]; cx q[0], q[1]; "#; let circuit = openqasm::parse(qasm).expect("failed to parse QASM"); let result = simulate(&circuit).seed(42).run().expect("simulation failed"); }
run_qasm(qasm, seed) is a shortcut that parses and simulates in one call. See the
OpenQASM Support guide for the supported subset.
q[0] is the least significant bit, so x q[0] produces state index 1, not 2. Bitstrings
print most-significant qubit first.
Next: sample measurement outcomes in Shots and Sampling.
Shots and Sampling
Probabilities give you the exact distribution. Shots give you sampled measurement outcomes, the way real hardware reports results. PRISM-Q samples deterministically from a fixed seed.
Sampling shots
#![allow(unused)] fn main() { use prism_q::circuit::openqasm; use prism_q::simulate; let qasm = r#" OPENQASM 3.0; include "stdgates.inc"; qubit[2] q; bit[2] c; h q[0]; cx q[0], q[1]; c[0] = measure q[0]; c[1] = measure q[1]; "#; let circuit = openqasm::parse(qasm).expect("failed to parse QASM"); let result = simulate(&circuit).seed(42).shots(1024).expect("shots failed"); print!("{result}"); // ShotsResult implements Display }
The same seed always produces the same samples. Pass rand::random() as the seed for
non-deterministic sampling.
Counts and marginals
For large shot counts, you usually want aggregates rather than raw shots:
#![allow(unused)] fn main() { // Frequency histogram: bitstring -> count let counts = simulate(&circuit).seed(42).sample_counts(100_000).unwrap(); // Per-qubit P(measuring |1>), without the full joint distribution let marginals = simulate(&circuit).seed(42).marginals().unwrap(); }
sample_counts and shots route through PRISM-Q's compiled samplers, which propagate
measurements through the circuit instead of materializing the full statevector on every
shot. For Clifford circuits this scales to thousands of qubits. See
Compiled Samplers.
Noisy sampling
Attach a NoiseModel to sample under depolarizing or readout noise:
#![allow(unused)] fn main() { use prism_q::{simulate, BackendKind, NoiseModel}; let noise = NoiseModel::uniform_depolarizing(&circuit, 0.001); let result = simulate(&circuit) .backend(BackendKind::Statevector) .noise(noise) .seed(42) .shots(1024) .unwrap(); }
The Noise and QEC guide covers noise models and detector sampling in depth.
Next: learn how PRISM-Q picks a representation in Choosing a Backend.
Choosing a Backend
By default PRISM-Q inspects the circuit and picks a backend for you. You only need to choose explicitly when you know something the auto-dispatcher cannot infer, or when you are benchmarking a specific representation.
Let it choose
#![allow(unused)] fn main() { use prism_q::simulate; let result = simulate(&circuit).seed(42).run().unwrap(); // BackendKind::Auto }
Auto-dispatch walks this decision tree:
flowchart TD
A[Auto] --> E{Entangling gates?}
E -- none --> PS[ProductState]
E -- yes --> CL{All Clifford?}
CL -- yes --> STB[Stabilizer]
CL -- no --> MEM{Above memory limit?}
MEM -- "yes, sparse-friendly" --> SPR[Sparse]
MEM -- "yes, otherwise" --> MPS[MPS bond 256]
MEM -- no --> IND{Partial independence?}
IND -- yes --> FAC[Factored]
IND -- no --> SV[Statevector]
Choose explicitly
#![allow(unused)] fn main() { use prism_q::{simulate, BackendKind}; let result = simulate(&circuit) .backend(BackendKind::Stabilizer) .seed(42) .run() .unwrap(); }
Symptom to backend
| If your circuit... | Use | Why |
|---|---|---|
| Is Clifford-only (H, S, CX, CZ, ...) | Stabilizer | O(n²), scales to thousands of qubits |
| Has no entangling gates | ProductState | O(n), per-qubit state |
| Is dense and ≤ ~28 qubits | Statevector | Exact, fastest for the general case |
| Stays concentrated in few basis states | Sparse | O(k) in nonzero amplitudes |
| Has low entanglement but many qubits | Mps { max_bond_dim } | Polynomial memory in bond dim |
| Splits into independent sub-registers | Factored | Simulates blocks separately, merges lazily |
| Is Clifford + a few T gates | See Clifford+T | Beats dense statevector |
ProductState errors on any entangling gate by design. Auto-dispatch only selects it for
circuits that have none. Choose it explicitly only when you know the circuit is a product
state throughout.
The Backends Deep Dive and the architecture reference cover each backend's internals.
Backends Deep Dive
PRISM-Q does not have one simulation algorithm. It has eight, each optimal for a different class of circuit. This guide is the task-oriented companion to the architecture reference: it focuses on scaling and when to reach for each one. To select a backend in code, see Choosing a Backend. For which CPU and GPU architectures each backend supports, see the Capability and Support Matrix.
Scaling at a glance
| Backend | Memory | Best for | Ceiling |
|---|---|---|---|
| Statevector | Dense general circuits | ~28 qubits (RAM-bound) | |
| Stabilizer | Clifford-only circuits | Thousands of qubits | |
| Factored Stabilizer | per cluster | Clifford with independent blocks | Thousands of qubits |
| Sparse | nonzero | Concentrated support | Large , small |
| MPS | Low entanglement | Large , bounded | |
| Product | No entanglement | Unbounded | |
| Tensor Network | order-dependent | Shallow / structured | prob qubits |
| Factored | worst case | Partially independent | Block-bound |
Statevector
The default for dense circuits. Exact, fully general, and the fastest option whenever the
state fits in RAM. The memory cap is derived from system RAM (overridable with
PRISM_MAX_SV_QUBITS). Above it, auto-dispatch falls back to Sparse or MPS.
Stabilizer
If your circuit uses only Clifford gates (H, S, Sdg, SX, SXdg, CX, CZ, SWAP, measurement), the stabilizer tableau simulates it in and scales to thousands of qubits. Auto-dispatch selects it whenever the circuit is Clifford-only.
Add even a single non-Clifford gate (T, Rz(θ) with arbitrary θ) and the stabilizer
backend no longer applies. For a small number of such gates, see
Clifford+T Simulation.
Sparse, MPS, Product, Tensor Network, Factored
- Sparse wins when the state stays concentrated in a handful of computational-basis states (amplitude pruning keeps the map small).
- MPS trades exactness for polynomial memory in the bond dimension. Ideal for low-entanglement circuits over many qubits.
- Product is the degenerate, entanglement-free case: memory, per 1q gate.
- Tensor Network defers contraction until measurement, useful for shallow or structured circuits.
- Factored detects partial independence and simulates sub-registers separately, merging lazily via a Kronecker product computed on demand.
For the internal kernels behind each of these, read the architecture reference. For raw speed mechanics, see Performance and SIMD.
Capability and Support Matrix
This page records which CPU and GPU architectures each PRISM-Q backend supports, and where distributed execution stands. CPU backends are written in portable Rust and run on every supported architecture; SIMD acceleration (AVX2/FMA/BMI2 on x86-64, NEON on ARM64) is selected at runtime where a kernel exists, otherwise a scalar path is used.
Legend
| Mark | Meaning |
|---|---|
| Yes | Supported |
| SIMD | Supported with a dedicated SIMD-accelerated kernel on this architecture |
| Scalar | Runs, but without a dedicated SIMD kernel (portable fallback) |
| No | Not available for this backend |
| Planned | Not implemented yet; on the roadmap |
Backend support by architecture
| Backend | x86-64 | AVX2/FMA/BMI2 | ARM64 | NEON | CUDA (NVIDIA) | ROCm (AMD) | Distributed |
|---|---|---|---|---|---|---|---|
| Statevector | Yes | SIMD | Yes | SIMD | Yes | Planned | Planned |
| Stabilizer | Yes | SIMD | Yes | SIMD | Yes | Planned | Planned |
| Factored Stabilizer | Yes | SIMD | Yes | SIMD | No | Planned | Planned |
| Sparse | Yes | Scalar | Yes | Scalar | No | Planned | Planned |
| MPS | Yes | SIMD | Yes | SIMD | No | Planned | Planned |
| Product State | Yes | Scalar | Yes | Scalar | No | Planned | Planned |
| Tensor Network | Yes | Scalar | Yes | Scalar | No | Planned | Planned |
| Factored | Yes | SIMD | Yes | SIMD | No | Planned | Planned |
| Stabilizer Rank | Yes | SIMD | Yes | SIMD | No | Planned | Planned |
| Stochastic Pauli | Yes | Scalar | Yes | Scalar | No | Planned | Planned |
| Deterministic Pauli | Yes | Scalar | Yes | Scalar | No | Planned | Planned |
Notes:
- AVX2/FMA/BMI2 is the x86-64 SIMD tier. The active tier is chosen at runtime (AVX2+FMA, then FMA, then SSE2 baseline). See Threading, SIMD, and Memory Layout.
- NEON is the ARM64 SIMD tier. Backends marked
SIMDcarry a NEON kernel that mirrors the x86-64 path; the rest fall back to scalar code on ARM64. - CUDA covers the optional
gpufeature. Only the statevector and stabilizer paths have device kernels; every other backend runs on CPU. See the GPU Backend guide.
Not yet supported
| Target | Status | Notes |
|---|---|---|
| ROCm (AMD GPU) | Planned | No AMD device kernels; the GPU path is CUDA-only |
| Distributed CPU | Planned | No multi-node execution |
| Distributed GPU | Planned | No multi-node GPU execution |
| Multi-GPU | Planned | A GPU context binds a single device |
These targets are listed so the matrix reflects the roadmap rather than hiding the gaps. Distributed execution is not a code path PRISM-Q exposes today.
Performance and SIMD
Performance is the primary product requirement. This guide explains the mechanisms that make PRISM-Q fast and the knobs you can turn. The internals live in the architecture reference under Fusion Pipeline and Threading, SIMD, and Memory Layout.
The three levers
- Fusion collapses many small gate passes into fewer, larger ones before execution, reducing memory traffic over the statevector. It is qubit-count gated and zero-cost when it does not apply.
- Cache-resident tiling keeps batched gates (
MultiFused,Multi2q) operating on L2/L3-sized tiles so repeated passes reuse hot data. - SIMD vectorizes the inner complex-arithmetic loop with AVX2+FMA, FMA, and BMI2, with a scalar fallback on non-x86_64.
Threading
Rayon parallel kernels engage at ≥14 qubits (below that, thread-pool overhead
dominates), with MIN_PAR_ELEMS = 4096 per task. The pool defaults to all logical cores.
Set RAYON_NUM_THREADS to cap parallelism. Hyperthreading helps at 24+ qubits by hiding
memory latency, but on a contended host it adds noise to benchmarks.
Determinism
Same circuit plus same seed yields the same result regardless of thread count. Parallel backends use deterministic work partitioning, so reproducibility never costs correctness.
Tuning environment variables
| Variable | Effect |
|---|---|
PRISM_MAX_SV_QUBITS | Override the statevector memory cap |
RAYON_NUM_THREADS | Cap Rayon thread count |
PRISM_NO_AVX2_2Q | Force the 128-bit FMA 2q kernel (A/B comparison) |
PRISM_NO_REORDER | Disable disjoint Fused2q tier grouping |
PRISM_GPU_MIN_QUBITS | GPU crossover qubit count (with the gpu feature) |
Benchmarking
Always run benchmarks with --features parallel. The baselines were taken with Rayon
enabled; without it, large circuits run single-threaded and are not comparable. Never run
two cargo bench processes at once: competing Rayon pools cause large swings.
cargo bench --bench circuits --features parallel # circuit macrobenchmarks
cargo bench --bench bench_driver --features parallel # gate microbenchmarks
For current wall-clock numbers across the circuit suite, see the Benchmarks page.
OpenQASM Support
PRISM-Q parses a practical subset of OpenQASM 3.0, with backward compatibility for common 2.0 syntax. The parser converts text directly to the Circuit IR with no intermediate AST.
Parsing and running
#![allow(unused)] fn main() { use prism_q::circuit::openqasm; use prism_q::simulate; let circuit = openqasm::parse(qasm_str).expect("parse error"); let result = simulate(&circuit).seed(42).run().unwrap(); }
run_qasm(qasm, seed) parses and simulates in one call.
Declarations and measurement
OPENQASM 3.0;
include "stdgates.inc";
qubit[3] q; // OpenQASM 3.0 register
bit[3] c;
h q[0];
cx q[0], q[1];
c[0] = measure q[0]; // OQ3 measurement
OpenQASM 2.0 syntax also works: qreg q[3]; / creg c[3]; declarations and
measure q[0] -> c[0]; measurement.
Supported gates
- Standard / aliases: x, y, z, h, s, sdg, t, tdg, sx, rx, ry, rz, p/phase, cx/CX/cnot, cy, cz, cp/cphase, crx, cry, crz, ch, swap, ccx/toffoli, cswap/fredkin, cu, u1, u2, u3/u/U.
- Qiskit / exporter: sxdg, cs, csdg, csx, ccz, r, rzz, rxx, ryy, xx_plus_yy, xx_minus_yy, ecr, iswap, dcx, c3x, c4x, mcx, rccx, rc3x/rcccx.
- Hardware-native: gpi, gpi2, ms, syc, sqrt_iswap, sqrt_iswap_inv.
Other supported constructs
- Gate modifiers:
ctrl @,inv @,pow(k) @. - User-defined
gateblocks. - Classical
ifconditionals. - Multi-register broadcast,
barrier, and an expression evaluator with math functions.
for / while loops, subroutines, and classical expressions beyond if. A construct
that parses as valid OpenQASM but is unsupported returns UnsupportedConstruct rather
than panicking; see the Error Model.
Clifford+T Simulation
Circuits that mix Clifford gates with a modest number of T gates sit between the
efficient stabilizer regime and the exponential statevector regime. PRISM-Q offers three
strategies. The right one depends on your T-count, qubit count, and whether you need
exact answers or can tolerate Monte Carlo error.
- Few T gates, exact result needed: stabilizer rank (
run_stabilizer_rank). - Many T gates, marginals only: stochastic Pauli propagation (
run_spp). - Moderate T-count, exact or bounded-error expectation values: deterministic sparse
Pauli dynamics (
run_spd).
These route through the Clifford+T strategies before the standard dispatch tree when the T-count permits.
Stabilizer rank (src/sim/stabilizer_rank.rs)
Exact probability output remains capped because it returns a dense vector with
2^n entries. Shot sampling uses coherent weighted MPS branches instead of a
dense statevector fallback. Clifford gates mutate each branch state, T and
Tdg split branches, and measurement computes outcome probabilities from the
weighted branch ensemble before projecting every branch to the sampled outcome.
This removes the hard qubit-count cap from run_stabilizer_rank_shots;
practical scaling is governed by branch count, MPS bond growth, and measurement
count.
The dense probability path maintains a weighted sum of stabilizer states. Each T
gate doubles the term count via the T = alpha*I + beta*Z decomposition.
Clifford gates are O(n²) per term and weighted amplitudes are accumulated for
exact probabilities.
| Function | Use |
|---|---|
run_stabilizer_rank | Exact probabilities (t ≤ 20, n ≤ 25) |
run_stabilizer_rank_approx | Approximate with Monte Carlo (higher t counts) |
run_stabilizer_rank_shots | Shot-based sampling with no fixed qubit cap |
stabilizer_overlap_sq | Inner product between stabilizer states |
Stochastic Pauli Propagation (src/sim/unified_pauli.rs)
Backward-propagates measurement observables as Pauli strings. Clifford gates conjugate in O(1). T gates branch stochastically into two Pauli paths with appropriate weights. Per-path cost O(d×n/64), independent of T-gate count. Returns marginal probabilities via Monte Carlo estimation.
#![allow(unused)] fn main() { run_spp(circuit, num_samples, seed) // -> SppResult }
Deterministic Sparse Pauli Dynamics (src/sim/unified_pauli.rs)
Backward-propagates as a weighted sum of Pauli strings stored in a HashMap. T gates deterministically branch X/Y terms. Identical strings auto-merge. Optional ε-truncation for approximate mode. Exact for small T-counts, approximate with bounded error for larger ones.
#![allow(unused)] fn main() { run_spd(circuit, epsilon, max_terms) // -> SpdResult }
You can also build Clifford+T test circuits directly with
clifford_t_circuit.
Noise and QEC
PRISM-Q models noise and quantum error correction without falling back to a dense statevector per shot. The machinery is the compiled samplers and the native QEC program IR; this guide shows how they fit together.
Noisy shot sampling
Attach a NoiseModel and sample:
#![allow(unused)] fn main() { use prism_q::{simulate, BackendKind, NoiseModel}; let noise = NoiseModel::uniform_depolarizing(&circuit, 0.001); let result = simulate(&circuit) .backend(BackendKind::Statevector) .noise(noise) .seed(42) .shots(1024) .unwrap(); }
NoiseModel carries per-instruction depolarizing channels (NoiseOp { qubit, px, py, pz })
and supports readout error and amplitude damping. For Clifford circuits, the noisy
compiled sampler propagates noise sensitivity rows and XORs fired channels into each
sample, avoiding per-shot state evolution entirely.
Detector sampling
For repeated syndrome extraction, compile_detector_sampler compiles a Clifford circuit
with measurement and reset reuse into a packed sampler, then derives detector and
observable records as parity rows over the measurement record. Reset reuse becomes fresh
qubit aliases, so there is no per-shot tableau replay.
Native QEC programs
When you need detectors, logical observables, postselection, and Pauli-noise annotations
as first-class constructs, use the native QEC program IR rather than a Circuit:
#![allow(unused)] fn main() { use prism_q::{parse_qec_program, run_qec_program}; let program = parse_qec_program(qec_text).unwrap(); let result = run_qec_program(&program).unwrap(); }
run_qec_program lowers Clifford-compatible programs into the packed compiled sampler.
run_qec_program_reference is the per-shot statevector oracle for validating small
programs.
Clifford gates, basis resets and measurements, MPP Pauli-product measurements,
detectors, observables, postselection, and X_ERROR / Z_ERROR / DEPOLARIZE1 /
DEPOLARIZE2 noise. Non-Clifford gates and EXP_VAL are rejected until their production
strategies land. See the QEC IR reference for the full
grammar and the V1 reset requirement.
Homological sampling
run_shots_homological and ErrorChainComplex model the GF(2) chain complex over noise
locations, identifying undetectable error cycles. noisy_marginals_analytical computes
marginals in closed form from the parity matrix and noise rates, with no Monte Carlo.
GPU Backend
The GPU backend is optional and gated behind the gpu feature. It requires the CUDA
toolkit (12.x or newer) and a CUDA-capable device.
cargo build --release --features "parallel gpu"
cargo test --features "parallel gpu" --test golden_gpu
CUDA acceleration covers statevector execution, stabilizer execution, and compiled BTS sampling. Five entry points are available:
BackendKind::StatevectorGpu { context }. Public dispatch path for statevector GPU execution. It routes throughsimulate(circuit).backend(kind).seed(seed).run(), keeps fusion and subsystem decomposition, and usescrate::gpu::min_qubits()(default 14,PRISM_GPU_MIN_QUBITSoverride) to keep small sub-circuits on CPU.BackendKind::StabilizerGpu { context }. Public dispatch path for stabilizer GPU execution. Gate application uses a device tableau and one batched Clifford kernel (stab_apply_batch). Measurement and reset keep pivot search, row cascade, phase fixup, and deterministic outcomes on the device. The default crossover stays conservative (STABILIZER_MIN_QUBITS_DEFAULT = 100_000,PRISM_STABILIZER_GPU_MIN_QUBITSoverride) until benchmarks justify lowering it. Direct backend benchmarks should useStabilizerBackend::with_gpu(ctx)to exclude diagnostic readbacks fromprobabilities(),export_tableau(), andexport_statevector(). Golden tests cover every kernel path, including 500q GHZ measure-all.StatevectorBackend::new(seed).with_gpu(ctx). Direct statevector GPU opt-in. Every instruction routes to CUDA after the context is attached. No crossover or subsystem decomposition applies.StabilizerBackend::new(seed).with_gpu(ctx). Direct stabilizer GPU opt-in for kernel benchmarks and targeted correctness tests.run_shots_compiled_with_gpu(orCompiledSampler::with_gpu(ctx)). GPU BTS sampling for flat sparse parity. The path launches one kernel per65_536-shot chunk, uses random bits generated on the host, and preserves the CPUsample_bts_meas_majorlayout. The sampler caches sparse parity CSR arrays, packed reference bits, and reusable scratch on the device. It is active only whennum_shots >= BTS_MIN_SHOTS_DEFAULT(131_072by default,PRISM_GPU_BTS_MIN_SHOTSoverride).sample_bulk_packed_devicereturns aDevicePackedShotshandle. Marginals reduce to one counter per measurement row on the device. Exact counts use a bounded device hash reduction for up to 8 packed measurement words when the compact result is cheaper to transfer than the full shot matrix. Otherwise the API uses a host copy for correctness.
When a GPU context is attached, Backend::init allocates state on the device instead
of a host Vec<Complex64> and every instruction routes to a CUDA kernel.
Module layout (src/gpu/)
| File | Role |
|---|---|
mod.rs | GpuContext, GpuState public entry points |
device.rs | GpuDevice: cudarc wrapper, compiles PTX at device construction |
memory.rs | GpuBuffer: device Complex64 storage |
kernels/mod.rs | KERNEL_NAMES, composed kernel_source() concatenating dense + stabilizer |
kernels/dense.rs | PTX source and Rust launcher for every Gate variant |
kernels/stabilizer.rs | PTX source and launchers for tableau init, 11 Clifford gates, rowmul_words |
Kernel coverage
Every variant in the Gate enum has a dedicated kernel. Batched
variants (BatchPhase, BatchRzz, DiagonalBatch, MultiFused { all_diagonal: true })
use LUT kernels that consume the same host table builders as the CPU path.
Non-diagonal MultiFused uses a shared memory tiled kernel (apply_multi_fused_tiled,
TILE_Q = 10, TILE_SIZE = 1024). Sub-gates whose target bit is inside the tile apply in
shared memory. Sub-gates whose target bit is outside the tile fall back to per gate
launches. Multi2q still launches once per sub-gate; rare in practice.
PTX template substitution: the CUDA C source is held as a template string
(KERNEL_SOURCE_TEMPLATE) with placeholders such as {{BP_TABLE_SIZE}} and
{{TILE_Q}}. The kernel_source() function substitutes them at device construction from
the Rust constants in src/backend/statevector/kernels.rs, keeping CPU and GPU in sync.
Correctness
tests/golden_gpu.rs drives 20 cross-checks comparing GPU amplitudes
against the CPU statevector within 1e-10. Covers every gate variant, fusion paths, and
the BackendKind::StatevectorGpu public dispatch path at the crossover boundary.
Current limits
BackendKind::Autodoes not yet dispatch to GPU. TheStatevectorGpuvariant is the explicit opt-in untilAutocan receive a default GPU context.- Several fused launchers (
launch_apply_fused_2q,launch_apply_mcu,launch_apply_mcu_phase,launch_apply_batch_phase,launch_apply_batch_rzz,launch_apply_diagonal_batch,launch_apply_multi_fused_nondiag) upload small metadata tables per dispatch viaclone_htod. This is the next launch overhead target. measure_prob_oneallocates a device partials buffer and reduces on the host every call. Matters for measurement-heavy circuits.- Kernel design and crossover analysis live in the module docstrings on
src/gpu/kernels/dense.rs.
Architecture: Overview and Layered Design
This section is the technical reference for how PRISM-Q is built. See the Glossary for definitions of terms used throughout.
Goals
- Primary: Fastest practical quantum circuit simulation in Rust.
- Correct simulation of supported gate sets across multiple backend strategies.
- Clean backend plugin model. New simulation strategies can be added without touching the core.
Non-goals
- Full OpenQASM 3.0 compliance (supports a practical subset).
- GUI or notebook integration (library-first).
- Hardware backend / QPU connectivity.
Layered design
A circuit flows top to bottom: text is parsed into a backend-agnostic IR, optimized by the fusion pipeline, then dispatched by the simulation engine to one of the backends or a compiled sampler.
flowchart TD
U[User / Application]
API["Public API — run_qasm, simulate (src/lib.rs)"]
P["OpenQASM 3.0 Parser — &str to Circuit IR (src/circuit/openqasm.rs)"]
IR["Circuit IR — gates, measures, barriers, conditionals (src/circuit/mod.rs)"]
F["Fusion Pipeline — cancel, fuse, reorder, batch (src/circuit/fusion.rs)"]
E["Simulation Engine — dispatch, decompose, execute (src/sim/mod.rs)"]
U --> API --> P --> IR --> F --> E
E --> B[Backends]
E --> C["Compiled Samplers — shot-based (src/sim/compiled, noise.rs, homological.rs)"]
B --> SV[Statevector]
B --> TN[Tensor Network]
B --> MPS[MPS]
B --> SP[Sparse]
B --> PR[Product]
B --> ST[Stabilizer]
B --> FA[Factored]
The remaining pages in this section follow that flow: the parser and circuit IR, the fusion pipeline, the simulation engine and dispatch, the individual backends, the compiled samplers, the native QEC program IR, the threading, SIMD, and memory layout, and the error model and public API surface.
Parser and Circuit IR
Parser
Handwritten parser targeting a practical OpenQASM 3.0 subset. It processes input line by line and converts &str directly to Circuit IR with no intermediate AST.
Supported: qubit/bit declarations, OpenQASM standard gates and aliases (x, y, z, h, s, sdg, t, tdg, sx, rx, ry, rz, p/phase, cx/CX/cnot, cy, cz, cp/cphase, crx, cry, crz, ch, swap, ccx/toffoli, cswap/fredkin, cu, u1, u2, u3/u/U), Qiskit and exporter gates (sxdg, cs, csdg, csx, ccz, r, rzz, rxx, ryy, xx_plus_yy, xx_minus_yy, ecr, iswap, dcx, c3x, c4x, mcx, rccx, rc3x/rcccx), hardware-native gates (gpi, gpi2, ms, syc, sqrt_iswap, sqrt_iswap_inv), gate modifiers (ctrl @, inv @, pow(k) @), user-defined gate blocks, classical if conditionals, multi-register broadcast, measure, barrier, expression evaluator with math functions. OpenQASM 2.0 backward compatibility (qreg/creg, measure q -> c syntax).
Unsupported: for/while loops, subroutines, classical expressions beyond if.
See the OpenQASM Support guide for a user-facing walkthrough.
Circuit IR
Circuit holds num_qubits, num_classical_bits, and Vec<Instruction>. Instructions are an enum:
| Variant | Fields | Description |
|---|---|---|
Gate | gate, targets | Gate application |
Measure | qubit, classical_bit | Destructive measurement |
Barrier | qubits | Synchronization barrier |
Conditional | condition, gate, targets | Classical-controlled gate |
Targets use SmallVec<[usize; 4]>, inline storage for up to 4 qubits, no heap allocation for typical gates.
Gate enum
Gate is a Clone enum kept at 16 bytes. Simple variants carry parameters inline. Composite variants use Box to stay within the 16-byte budget for cache-friendly dispatch.
Adding inline data larger than 16 bytes pollutes cache lines and has caused 40-130%
regressions. Always check size_of::<Gate>() after adding a variant, and Box large
payloads.
| Variant | Data | Size |
|---|---|---|
Id, X, Y, Z, H, S, Sdg, T, Tdg, SX, SXdg | None | 16B |
Rx(f64), Ry(f64), Rz(f64), P(f64), Rzz(f64) | Inline f64 | 16B |
Cx, Cz, Swap | None | 16B |
Cu(Box<[[Complex64; 2]; 2]>) | Boxed 2×2 | 16B |
Mcu(Box<McuData>) | Boxed matrix + control count | 16B |
Fused(Box<[[Complex64; 2]; 2]>) | Boxed pre-fused 1q matrix | 16B |
Fused2q(Box<[[Complex64; 4]; 4]>) | Boxed pre-fused 2q matrix | 16B |
MultiFused(Box<MultiFusedData>) | Batched 1q gates for tiled pass | 16B |
Multi2q(Box<Multi2qData>) | Batched 2q gates for tiled pass | 16B |
BatchPhase(Box<BatchPhaseData>) | Batched cphase with shared control | 16B |
BatchRzz(Box<BatchRzzData>) | Batched ZZ rotations | 16B |
DiagonalBatch(Box<DiagonalBatchData>) | Mixed diagonal 1q/2q batch | 16B |
Fusion Pipeline
Gate optimizations before execution, gated by qubit count thresholds. Every pass returns Cow<Circuit>. Borrowed when no optimization applies, so circuits that do not benefit pay zero overhead.
flowchart TD
IN[Input Circuit] --> P0["cancel_self_inverse_pairs (always)"]
P0 --> P0r["fuse_rzz (always): CX·Rz·CX to Rzz"]
P0r --> P0b["fuse_batch_rzz (>=16q): N×Rzz to BatchRzz"]
P0b --> G{"qubits >= MIN_QUBITS_FOR_FUSION (10)?"}
G -- no --> OUT[Output Circuit]
G -- yes --> P1["fuse_single_qubit_gates (>=10q)"]
P1 --> P1r["reorder_1q_gates (>=10q)"]
P1r --> P1c["cancel_self_inverse_pairs (>=10q)"]
P1c --> P1f["fuse_single_qubit_gates re-fuse (>=10q)"]
P1f --> P2q["fuse_2q_gates (>=12q): CX/CZ + adjacent 1q to Fused2q"]
P2q --> P2qb["fuse_same_pair_2q_blocks (>=12q)"]
P2qb --> P2["fuse_multi_1q_gates (>=14q) to MultiFused"]
P2 --> P2qr["reorder_disjoint_fused2q (>=12q)"]
P2qr --> Pm2q["fuse_multi_2q_gates (>=12q) to Multi2q"]
Pm2q --> Pcp["fuse_controlled_phases (>=16q) to BatchPhase"]
Pcp --> Pdb["fuse_diagonal_batch (>=16q) to DiagonalBatch"]
Pdb --> Ppp["batch_post_phase_1q (>=18q)"]
Ppp --> OUT
Threshold constants
| Constant | Value | Rationale |
|---|---|---|
MIN_QUBITS_FOR_FUSION | 10 | Below this, clone cost exceeds simulation savings |
MIN_QUBITS_FOR_MULTI_FUSION | 14 | MultiFused tiling overhead vs benefit |
MIN_QUBITS_FOR_DIAG_BATCH | 16 | Diagonal batch, cphase, and Rzz batching |
MIN_QUBITS_FOR_POST_PHASE_BATCH | 18 | Post-phase 1q re-batching |
MIN_QUBITS_FOR_2Q_FUSION | 12 | Benchmarked QV and random sweeps show memory-pass reduction wins from 12q |
MIN_QUBITS_FOR_MULTI_2Q_FUSION | 12 | Same as 2q fusion |
Fusion is not on the hot path. Worst-case fusion cost is on the order of microseconds against tens of milliseconds of gate application, so these passes are tuned for correctness and clarity, not for their own runtime.
Simulation Engine and Dispatch
Backend trait
#![allow(unused)] fn main() { pub trait Backend { fn name(&self) -> &'static str; fn init(&mut self, num_qubits: usize, num_classical_bits: usize) -> Result<()>; fn apply(&mut self, instruction: &Instruction) -> Result<()>; fn classical_results(&self) -> &[bool]; fn probabilities(&self) -> Result<Vec<f64>>; fn num_qubits(&self) -> usize; // Optional overrides: fn apply_instructions(&mut self, instructions: &[Instruction]) -> Result<()>; // batch apply fn supports_fused_gates(&self) -> bool; // false for symbolic backends (stabilizer) fn export_statevector(&self) -> Result<Vec<Complex64>>; // for backend transitions } }
Contract: init before apply. Instructions arrive in circuit order. Measurement is destructive. Deterministic given same RNG seed.
Entry points
Orchestration layer in src/sim/mod.rs.
| Function | Description |
|---|---|
simulate(circuit).seed(seed).run() | Auto-dispatch, full output |
simulate(circuit).backend(kind).seed(seed).run() | Explicit backend selection |
simulate(circuit).seed(seed).shots(shots) | Multi-shot sampling |
simulate(circuit).backend(kind).seed(seed).shots(shots) | Multi-shot with backend selection |
simulate(circuit).backend(kind).noise(noise).seed(seed).shots(shots) | Noisy multi-shot |
simulate(circuit).seed(seed).sample_counts(shots) | Auto-dispatched frequency histogram |
simulate(circuit).backend(kind).seed(seed).sample_counts(shots) | Frequency histogram with backend selection |
simulate(circuit).seed(seed).marginals() | Auto-dispatched per-qubit marginal probabilities |
simulate(circuit).backend(kind).seed(seed).marginals() | Per-qubit marginal probabilities with backend selection |
run_on(backend, circuit) | Pre-constructed backend |
run_qasm(qasm, seed) | Parse + simulate |
RunOutcome::probabilities is None only when the selected backend cannot
expose a dense probability distribution for the requested circuit, such as
factored stabilizer or decomposed runs above the dense output cap. Other
probability extraction failures propagate as errors. marginals() requires
either a direct Pauli marginal route or backend probability output; it returns
BackendUnsupported instead of fabricating uniform marginals when neither path
is available. Stochastic and deterministic Pauli marginal backends accept only
unitary Clifford+T circuits without measurement, reset, or conditional
instructions.
Auto-dispatch decision tree
flowchart TD
A[BackendKind::Auto] --> E{Entangling gates?}
E -- none --> PS["ProductState (O(n))"]
E -- yes --> CL{All Clifford?}
CL -- yes --> STB["Stabilizer (O(n^2))"]
CL -- no --> MEM{Above memory limit?}
MEM -- "yes, sparse-friendly" --> SPR["Sparse (O(k))"]
MEM -- "yes, otherwise" --> MPS["MPS (bond dim 256)"]
MEM -- no --> IND{Partial independence?}
IND -- yes --> FAC["Factored (split-state)"]
IND -- no --> SV["Statevector (exact)"]
Memory limit is dynamically computed from available system RAM (50% budget, capped at 33 qubits). Overridable via PRISM_MAX_SV_QUBITS environment variable. Falls back to 28 qubits (4 GB) when detection unavailable.
For a user-facing version of this decision, see Choosing a Backend.
Subsystem decomposition
Union-find detects independent qubit groups in O(n·α(n)). Each block runs separately with per-block Auto dispatch. Results merge lazily via Probabilities::Factored, a Kronecker product computed on demand per element in O(K), avoiding the O(2^N) dense materialization unless explicitly requested.
Block-level Rayon parallelism when all blocks are <14 qubits (avoids oversubscription with block-internal parallelism).
Temporal Clifford decomposition
For Clifford+T circuits: Clifford prefix runs on the Stabilizer backend, state is exported to Statevector for the non-Clifford tail. Saves exponential memory for circuits with a long Clifford preamble.
Backend dispatch variants
All BackendKind variants:
| Variant | Backend | Selection |
|---|---|---|
Auto | Decision tree (see above) | Default |
Statevector | Full state-vector | Explicit |
Stabilizer | Aaronson-Gottesman tableau | Explicit or auto (all Clifford) |
FactoredStabilizer | Per-cluster tableaux | Explicit or auto (large independent Clifford blocks) |
Sparse | HashMap state | Explicit or auto (above memory limit, sparse-friendly) |
Mps { max_bond_dim } | Matrix Product State | Explicit or auto (above memory limit) |
ProductState | Per-qubit product | Explicit or auto (no entangling) |
TensorNetwork | Deferred contraction | Explicit |
Factored | Dynamic split-state | Explicit or auto (partial independence) |
StabilizerRank | Weighted stabilizer sum | Explicit |
StochasticPauli { num_samples } | SPP | Explicit |
DeterministicPauli { epsilon, max_terms } | SPD | Explicit |
Backends
PRISM-Q ships seven CPU backends plus an optional CUDA path. The simulation engine picks one automatically, or you can select explicitly. For a task-oriented version of this material, see the Backends Deep Dive guide.
The diagrams below are rendered directly from PRISM-Q's own SVG circuit renderer.
Statevector
Full-state simulation in a flat Vec<Complex64> of 2^n amplitudes. The primary backend for circuits up to ~28 qubits.
Gate kernels use enum dispatch with specialized routines for CX, CZ, SWAP, Cu, MCU, Rzz, BatchRzz, BatchPhase, DiagonalBatch, and MultiFused. Single-qubit gates go through PreparedGate1q with FMA-vectorized SIMD. MultiFused gates use a three-tier tiled kernel (L2 16K / L3 131K / individual passes) for cache locality. MultiFused batches where all gates are diagonal dispatch to a dedicated fast path (1 complex multiply/element vs 4+2 for full 2×2).
Rayon parallelism at ≥14 qubits with par_chunks_mut and MIN_PAR_ELEMS = 4096 per task. BMI2 _pext_u64 accelerates BatchPhase, BatchRzz, and DiagonalBatch LUT indexing.
Deferred measurement normalization: pending_norm accumulates normalization factors without full-state scaling passes. Zero-cost for circuits without measurements.
The Quantum Fourier Transform is a representative statevector workload, dense with controlled-phase gates that the fusion pipeline batches:
Stabilizer
Aaronson-Gottesman bit-packed tableau for Clifford circuits. O(n²) time and space. Scales to thousands of qubits. Gate kernels use wordwise bitwise ops and popcount for phase computation. Supports H, S, Sdg, SX, SXdg, CX, CZ, SWAP, and measurement.
Word-group batching fuses multiple 1q gate flushes into single tableau passes. Type-grouped masks apply all gates of the same Pauli type with one wordwise op instead of per-gate dispatch. Sparse Generator Indexing (SGI) tracks per-qubit active generator lists, enabling targeted row operations instead of full-tableau scans. Lazy destabilizer materialization defers destabilizer rows until probabilities are requested.
Probability extraction uses coset-based enumeration with GF(2) Gaussian elimination. O(2^k) where k is the number of non-diagonal generators, rather than O(2^n).
Factored Stabilizer (FactoredStabilizerBackend): Per-cluster tableaux with dynamic merging. Starts with one qubit per cluster. Cross-cluster 2q gates merge tableaux. Measurement and reset can split independent sub-tableaux again. Independent subsystems avoid full-tableau work when product structure is preserved.
Sparse
HashMap<usize, Complex64> for states with few non-zero amplitudes. O(k) memory. Amplitude pruning (|a|² < 1e-16) after each gate. Best for circuits whose support stays concentrated in computational-basis states at large qubit counts.
MPS (Matrix Product State)
Chain of rank-3 tensors with adaptive bond dimension (default max 256). O(n·χ²) memory. Single-qubit gates absorb via FMA-vectorized SIMD over bond-dimension slices. Two-qubit gates contract adjacent sites, apply the gate, then SVD-truncate back. Non-adjacent gates route through SWAP chains.
Hybrid SVD dispatch: faer (bidiag+D&C) for matrices with m×n ≥ 256, hand-rolled Jacobi for small matrices.
Product State
Per-qubit [Complex64; 2] storage. O(n) memory, O(1) per single-qubit gate. Rejects entangling gates. Selected automatically for circuits with no 2q gates.
Tensor Network
Deferred contraction with a greedy min-size heuristic. Gates append tensors; contraction happens lazily at measurement or probability extraction. MAX_PROB_QUBITS = 25 guards against exponential blowup.
Factored
Dynamic split-state simulation. Starts with n independent 1-qubit states, merges via tensor product only when 2q gates bridge groups. Parallel kernels match statevector patterns for sub-states ≥14 qubits. Selected when subsystem decomposition detects partial independence.
The GPU backend is documented as a user guide.
Compiled Samplers
For multi-shot sampling without materializing the full statevector on every shot.
Noiseless compiled sampler (src/sim/compiled/)
Backward path (compile_measurements): Propagates Pauli Z observables backward through the circuit. Each measurement qubit becomes a row in a GF(2) parity matrix M. Clifford gates conjugate Pauli strings in O(1). The resulting M encodes which input qubits each measurement depends on.
Forward path (compile_forward): Tracks stabilizer generator dependencies forward through the circuit. Produces the same parity matrix via dependency tracking.
Sampling: Random bits for independent generators, then XOR-cascade through the parity matrix. Multiple dispatch tiers:
| Strategy | Condition | Method |
|---|---|---|
FlipLut | Small rank | 256-entry XOR lookup table |
SparseParity | Sparse rows | Only flip non-zero columns |
XorDag | General | Optimal XOR-reduction DAG |
ParityBlocks | Blocked structure | Per-block independent sampling |
ShotAccumulator trait: Pluggable result collection.
| Accumulator | Output | Use case |
|---|---|---|
HistogramAccumulator | Bitstring → count map | Standard shot output |
MarginalsAccumulator | Per-qubit P(1) | Marginal probabilities |
PauliExpectationAccumulator | ⟨P⟩ for Pauli observables | VQE/QAOA |
CorrelatorAccumulator | ⟨Z_i Z_j⟩ correlations | Entanglement analysis |
NullAccumulator | Nothing | Benchmarking raw sampling speed |
PackedShots raw format: PackedShots::RAW_FORMAT_VERSION is the replay
contract for raw_data() and into_data(). Version 1 stores little-endian bit
order within each u64. ShotMajor stores one row per shot with
m_words() = ceil(num_measurements / 64). MeasMajor stores one row per
measurement with s_words() = ceil(num_shots / 64). The checked
try_from_shot_major and try_from_meas_major constructors reject shape
mismatches and non-zero semantic padding. Histograms, marginals, parity rows,
and accumulators mask only semantic padding: measurement-tail bits in
shot-major data and shot-tail bits in measurement-major data.
Detector sampler (compile_detector_sampler): Compiles Clifford circuits
with measurement and reset reuse into the same packed measurement sampler, then
derives detector and observable records as packed parity rows over measurement
record indices. Reset reuse is represented by fresh qubit aliases, so repeated
syndrome extraction avoids per-shot tableau replay. The sampler can return
packed measurements, packed detectors, packed observables, detector counts, or
feed packed detector chunks into any ShotAccumulator.
Noisy compiled sampler (src/sim/noise.rs)
Backward Pauli propagation through circuit + noise sensitivity analysis. Each noise location gets an X-flip and Z-flip sensitivity row. During sampling, Bernoulli coin flips determine which noise channels fire, then XOR the sensitivity rows into the sample.
NoiseModel: Per-instruction depolarizing noise. NoiseOp { qubit, px, py, pz }.
Homological sampler (src/sim/homological.rs)
ErrorChainComplex: GF(2) chain complex over the circuit's noise locations. Computes the kernel (null space) of the boundary map to identify error cycles that are undetectable by syndrome measurements. HomologicalSampler uses this for sampling with topological error correction awareness.
noisy_marginals_analytical: Closed-form marginal computation using the parity matrix and noise rates. Avoids Monte Carlo sampling entirely.
See the Noise and QEC guide for how these fit together in practice.
Native QEC Program IR
QecProgram in src/qec/mod.rs is a measurement-record IR for QEC workloads
that need detectors, logical observables, postselection, expectation metadata,
and Pauli-noise annotations before sampler lowering. It is separate from
Circuit so measurement-record programs do not need to fit final-measurement
OpenQASM semantics.
QecOp stores gates, basis measurements, MPP-style Pauli-product
measurements, resets, detector rows, observable includes, expectation-value
metadata, postselection predicates, noise annotations, and tick separators.
Record references can be absolute indices or rec[-k] style lookbacks.
Construction validates qubit bounds, gate arity, finite coordinates and
coefficients, finite probabilities, and measurement-record scope. Detector,
observable, and postselection rows can be resolved to absolute measurement
indices for later compilation into packed samplers.
Parsing
parse_qec_program and QecProgram::from_text parse the native QEC text
subset used by current benchmark planning: H, S, S_DAG, T, T_DAG,
CX, CZ, R/RX/RY, M/MX/MY, MR variants, MPP, DETECTOR,
OBSERVABLE_INCLUDE, POSTSELECT, EXP_VAL, Pauli-noise instructions, TICK,
QUBIT_COORDS, SHIFT_COORDS, and flattened REPEAT blocks. The parser
resolves rec[-k] references while building the program. Numeric arguments on
basis measurements, such as M(0.001), lower to pre-measurement Pauli flips
that affect the measurement record.
Lowering
compile_qec_program_rows lowers basis measurements and MPP records into the
same packed X/Z Pauli row representation used by the compiled sampler internals.
It also carries detector, observable, and postselection rows forward as
absolute measurement-record indices. Detector, observable, and postselection
projection uses PackedShots::parity_rows, so the QEC layer reuses the existing
packed parity engine instead of maintaining a second one. This is a
sampler-lowering artifact, not an execution engine. Gate, reset, and noise
execution lives in run_qec_program; EXP_VAL remains unsupported in the
packed runner until estimator semantics land.
Execution
run_qec_program is the scalable native QEC execution path. It lowers
Clifford-compatible QEC programs into the existing packed compiled sampler, so
sampling uses packed measurement records instead of dense amplitudes. It
supports Clifford gates, basis resets and measurements, MPP, detectors,
observables, postselection, Pauli-noise annotations, and optional
raw-measurement retention. Noisy Clifford programs compile X_ERROR,
Z_ERROR, DEPOLARIZE1, and correlated DEPOLARIZE2 events into packed
sensitivity rows, then XOR those rows into the packed measurement records.
Non-Clifford gates and EXP_VAL reject clearly until their production
strategies land.
V1 requires a reset before any later gate reuse of a measured qubit, because the
compiled lowering defers measurements to terminal records. QecOptions::chunk_size
bounds compiled-runner shot batches. When raw measurements are omitted, chunking avoids
materializing the full measurement record matrix before detector, observable,
postselection, and logical-error accounting.
run_qec_program_reference is the small-program correctness oracle. It runs one
state-vector simulation per shot and supports Pauli-noise annotations, but it is
not the production performance path.
Threading, SIMD, and Memory Layout
For which SIMD tiers and architectures each backend supports, see the Capability and Support Matrix.
Memory layout
| Backend | State representation | Memory | Access pattern |
|---|---|---|---|
| Statevector | Vec<Complex64> (2^n) | Strided pair iteration | |
| Stabilizer | Bit-packed Vec<u64> tableau | bytes | Sequential row iteration |
| Sparse | HashMap<usize, Complex64> | , = nonzero | Hash-based random access |
| MPS | Chain of rank-3 tensors | Sequential site access | |
| Product | Vec<[Complex64; 2]> | Per-qubit independent | |
| Tensor Network | Network of dense tensors | Contraction-order dependent | |
| Factored | Vec<Option<SubState>> | worst case | Dispatch per substate |
Threading
Gate kernels have _par variants using par_chunks_mut for safe Rayon parallelism (behind the parallel feature flag):
- <14 qubits: Single-threaded. Thread-pool overhead exceeds computation.
- ≥14 qubits: Rayon parallel iterators with
MIN_PAR_ELEMS = 4096(64KB per task).
Thread pool defaults to all logical cores (HT helps at 24q+ by hiding memory latency). Overridable via RAYON_NUM_THREADS.
SIMD
Complex64 maps to 128-bit SIMD naturally. Single-qubit gate kernels use PreparedGate1q with runtime CPU detection and tiered dispatch:
- AVX2+FMA (256-bit): 2 complex pairs per iteration. Gated by
MAX_AVX2_STATEfor full-state passes (Skylake frequency throttling), but used freely within MultiFused L2 tiles where data is cache-resident. - FMA (128-bit): Default for larger states. 3-op complex multiply (permute + mul + fmaddsub).
- BMI2:
_pext_u64for BatchPhase, BatchRzz, and DiagonalBatch LUT indexing. One BMI2 bit extraction replaces loops with repeated shifts and ORs. - Scalar fallback: No intrinsics. All SIMD functions have a
#[cfg(not(target_arch = "x86_64"))]fallback.
Two key SIMD structs hoist matrix broadcast at construction time, avoiding per-element dispatch:
PreparedGate1q: Broadcasts 2×2 matrix into SIMD registers. Methods:apply_full_sequential(full state),apply_tiled(cache-resident tile, no AVX2 throttle guard),apply_slice_pairs(MPS bond-dimension slices),apply_pair_ptr(Cu/Mcu parallel).PreparedGate2q: Broadcasts 4×4 matrix. Methods:apply_full(mask-based iteration),apply_tiled(cache-resident Multi2q tiles, AVX2 paired-group kernel when available),apply_group_ptr(4 scattered indices).
The 2q tiled AVX2 path processes paired k and k + 1 groups when the lower target qubit is above 0, which makes each row load contiguous. It falls back to the 128-bit FMA kernel for lo == 0 and when AVX2+FMA is unavailable. Set PRISM_NO_AVX2_2Q to compare against the 128-bit FMA path, or PRISM_NO_REORDER to disable disjoint Fused2q tier grouping for A/B timing.
Determinism
Same circuit + same seed = same result, regardless of thread count. Parallel backends use deterministic work partitioning.
Error Model and Public API
Error model
All public APIs return Result<T, PrismError>. Error variants:
| Variant | Category | Description |
|---|---|---|
Parse | Parsing | OpenQASM parse error with line number |
UnsupportedConstruct | Parsing | Valid OpenQASM not supported by PRISM-Q |
UndefinedRegister | Parsing | Reference to undeclared register |
InvalidQubit | Validation | Qubit index exceeds register size |
InvalidClassicalBit | Validation | Classical bit index exceeds register |
GateArity | Validation | Wrong number of qubits for gate |
InvalidParameter | Validation | Invalid gate parameter (NaN, etc.) |
BackendUnsupported | Runtime | Backend can't perform requested operation |
IncompatibleBackend | Runtime | Backend incompatible with circuit |
Public API surface
Top-level re-exports from src/lib.rs. The full generated documentation is on
docs.rs.
Simulation:
simulate, run_on, run_qasm, bitstring
Compiled sampling:
compile_measurements, compile_forward, compile_detector_sampler, compile_noisy, run_shots_compiled, run_shots_noisy, run_shots_homological, noisy_marginals_analytical
Native QEC:
parse_qec_program, compile_qec_program_rows, run_qec_program, run_qec_program_reference, QecProgram, QecOp, QecOptions, QecSampleResult, QecBasis, QecPauli, QecRecordRef, QecNoise, QecMeasurementRow, QecCompiledRows
Clifford+T:
run_stabilizer_rank, run_stabilizer_rank_approx, stabilizer_overlap_sq, stabilizer_inner_product, run_spp, run_spp_observable, run_spd, run_spd_observable, run_spd_observable_light_cone
Types:
Circuit, CircuitBuilder, Instruction, ClassicalCondition, Gate, BackendKind, RunOutcome, CountsResult, MarginalsResult, Probabilities, FactoredBlock, ShotsResult, PrismError, Result
Backends:
StatevectorBackend, StabilizerBackend, SparseBackend, MpsBackend, ProductStateBackend, TensorNetworkBackend, FactoredBackend
Accumulators:
ShotAccumulator, HistogramAccumulator, MarginalsAccumulator, PauliExpectationAccumulator, CorrelatorAccumulator, NullAccumulator, PackedShots, ShotLayout
Data types:
CompiledSampler, CompiledDetectorSampler, DetectorSampleBatch, NoisyCompiledSampler, HomologicalSampler, ErrorChainComplex, NoiseModel, NoiseOp, QecProgram, QecOp, QecSampleResult, StabRankResult, SppResult, SpdResult, SparseParity, ParityStats, PauliVec, MultiFusedData, BatchPhaseData, McuData, Multi2qData
Benchmarks
Wall-clock simulation time for PRISM-Q on a fixed circuit suite built from the prism_q::circuits generators, pushed toward this machine's limits. Every number is reproducible with the command at the bottom of this page.
Setup
- Date: 2026-05-29
- CPU: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
- Threads available: 8
- PRISM-Q version: 0.16.0
Methodology
- Metric: median wall-clock over repeated runs after warmup, lower is better.
- Timed region is simulation only. Circuit construction happens once per circuit outside the timer.
- Timings use the full
simulate().run()path, including the fusion pass and probability extraction. autolets backend dispatch choose a specialized backend per circuit. The dense families (QFT, HEA, QV) are bounded by the statevector memory cap; GHZ is Clifford and runs into the thousands of qubits on the stabilizer backend. QV uses square depth, so its gate count grows with the qubit count and it is compute-bound earlier than the fixed-depth families.
GHZ
| Qubits | auto |
|---|---|
| 24 | 37.9 us |
| 28 | 79.2 us |
| 256 | 825.8 us |
| 1024 | 2.73 ms |
| 4096 | 35.97 ms |
QFT
| Qubits | auto |
|---|---|
| 16 | 715.0 us |
| 20 | 25.34 ms |
| 24 | 639.72 ms |
| 26 | 3.786 s |
| 28 | 17.727 s |
HEA
| Qubits | auto |
|---|---|
| 16 | 3.49 ms |
| 20 | 63.81 ms |
| 24 | 1.588 s |
| 26 | 7.330 s |
| 28 | 33.303 s |
QV
| Qubits | auto |
|---|---|
| 16 | 7.73 ms |
| 20 | 190.02 ms |
| 24 | 5.811 s |
Reproducing
cargo run --release --features parallel --example bench_suite
Times PRISM-Q on every circuit in the suite and rewrites this page.
Circuit Builders
Pre-built circuits for benchmarking and testing, in src/circuits.rs. Each returns a
Circuit you can pass straight to simulate(&circuit) or any backend.
| Function | Description |
|---|---|
qft_circuit(n) | Quantum Fourier Transform |
random_circuit(n, depth, seed) | Random gates at given depth |
hardware_efficient_ansatz(n, layers, seed) | HEA with Ry/Rz + CX |
clifford_heavy_circuit(n, depth, seed) | Random Clifford (adjacent CX) |
clifford_random_pairs(n, depth, seed) | Random Clifford (random pair CX) |
ghz_circuit(n) | GHZ state (H + CX chain) |
qaoa_circuit(n, layers, seed) | QAOA MaxCut |
single_qubit_rotation_circuit(n, depth, seed) | 1q rotations only |
clifford_t_circuit(n, depth, t_fraction, seed) | Clifford+T with tunable T ratio |
w_state_circuit(n) | W state preparation |
quantum_volume_circuit(n, depth, seed) | Quantum volume (random SU(4)) |
cz_chain_circuit(n, depth, seed) | CZ chains |
phase_estimation_circuit(n) | Quantum phase estimation |
independent_bell_pairs(n_pairs) | Independent Bell pairs |
independent_random_blocks(blocks, size, depth, seed) | Independent random blocks |
Example
#![allow(unused)] fn main() { use prism_q::circuits::qft_circuit; use prism_q::simulate; let circuit = qft_circuit(10); let result = simulate(&circuit).seed(42).run().unwrap(); }
For hand-built circuits, use the CircuitBuilder
fluent API instead.
The complete generated API documentation lives on docs.rs.
Architecture Glossary
This glossary defines key terms used throughout the PRISM-Q architecture and documentation.
Terms
Backend
A simulation strategy implementing the Backend trait (e.g., statevector, stabilizer, MPS). Backends are swappable and selected automatically or explicitly per circuit.
Clifford A class of quantum gates (H, S, CNOT, etc.) that can be simulated efficiently on stabilizer tableaus in O(n²) time.
Count
A frequency histogram of measured bitstrings across multiple shots, returned by simulate(circuit).seed(seed).sample_counts(shots).
Fusion The pre-simulation optimization pipeline that merges, cancels, and reorders gates to reduce execution cost.
Marginal
Per-qubit probabilities of measuring |1⟩, extracted without computing the full multi-qubit distribution.
MPS (Matrix Product State) A tensor-network backend that stores the quantum state as a chain of rank-3 tensors with bounded bond dimension, trading exactness for polynomial memory.
Noise model A collection of probabilistic error channels (e.g., depolarizing noise) applied during noisy multi-shot simulation.
OpenQASM An assembly-like language for describing quantum circuits. PRISM-Q parses a practical subset of OpenQASM 3.0.
Shot A single execution of a quantum circuit from initialization to measurement, producing one bitstring sample.
State vector A flat vector of 2^n complex amplitudes representing the full quantum state, used by the statevector backend.
Stabilizer A representation of a Clifford state as a bit-packed tableau of Pauli generators, enabling simulation of thousands of qubits.
Tensor network A backend that represents gates as tensors and defers contraction until measurement or probability extraction.