Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

PRISM-Q

A Rust quantum circuit simulator built for speed.

PRISM-Q runs quantum circuits fast by matching each one to the right simulation strategy. It dispatches across eight CPU backends plus an optional CUDA path, optimizes circuits through a multi-pass fusion pipeline, and uses AVX2, FMA, and BMI2 SIMD in the inner loop. Input is OpenQASM 3.0, with backward-compatible 2.0 syntax. The same library handles a two-qubit Bell pair and a thousand-qubit Clifford circuit.

#![allow(unused)]
fn main() {
use prism_q::CircuitBuilder;

let result = CircuitBuilder::new(2).h(0).cx(0, 1).run(42).unwrap();
let probs = result.probabilities.unwrap();
// |00> = 0.5, |11> = 0.5
}

What it does

  • Eight CPU backends selected automatically per circuit: statevector, stabilizer (with factored and filtered variants), sparse, MPS, product state, tensor network, and dynamic factored split-state.
  • Compiled shot samplers that sample without rebuilding the full statevector each shot, including noisy and detector/QEC paths.
  • Clifford+T strategies (stabilizer rank, stochastic and deterministic Pauli propagation) for circuits beyond the reach of a dense statevector.
  • Optional CUDA backend for statevector and stabilizer execution.

Where to go next

Installation

PRISM-Q is a Rust library. Add it to a project with Cargo:

cargo add prism-q                          # Rayon parallelism + faer SVD (default)
cargo add prism-q --no-default-features    # single-threaded, minimal dependencies

Or add it to Cargo.toml directly:

[dependencies]
prism-q = "0.16"

Feature flags

FeatureDefaultEnables
parallelyesRayon parallel kernels (≥14 qubits) and the faer SVD path for MPS
serializationnoserde derives for circuits and results
gpunoOptional CUDA backend (see the GPU guide)

Keep parallel on for performance

The published benchmarks were taken with parallel enabled. Without it, 16+ qubit runs fall back to single-threaded kernels and are not comparable to the baselines. Disable it only when you need a minimal-dependency, single-threaded build.

Building from source

git clone https://github.com/AbeCoull/prism-q
cd prism-q
cargo build --release

Running the test suite

cargo nextest run --all-features                          # unit + integration tests
cargo test --doc --all-features                           # doctests
cargo clippy --all-targets --all-features -- -D warnings  # lint

Use cargo test --all-features if cargo-nextest is unavailable.

Next: build Your First Circuit.

Your First Circuit

There are two ways to build a circuit: the fluent CircuitBuilder API, or by parsing OpenQASM text.

With the builder

CircuitBuilder chains gate calls and runs the result. This builds a Bell pair, the two-qubit entangled state (|00⟩ + |11⟩) / √2:

#![allow(unused)]
fn main() {
use prism_q::CircuitBuilder;

let result = CircuitBuilder::new(2)
    .h(0)
    .cx(0, 1)
    .run(42)                       // seed = 42
    .expect("simulation failed");

let probs = result.probabilities.expect("no probabilities");
for i in 0..probs.len() {
    let p = probs.get(i);
    if p > 1e-10 {
        println!("|{i:02b}> = {p:.4}");
    }
}
// |00> = 0.5000
// |11> = 0.5000
}

A larger structurally similar circuit, the 5-qubit GHZ state, renders like this (diagram generated by PRISM-Q's own SVG renderer):

GHZ state preparation circuit

From OpenQASM

The same Bell pair, written in OpenQASM 3.0 and parsed:

#![allow(unused)]
fn main() {
use prism_q::circuit::openqasm;
use prism_q::simulate;

let qasm = r#"
    OPENQASM 3.0;
    include "stdgates.inc";
    qubit[2] q;
    h q[0];
    cx q[0], q[1];
"#;

let circuit = openqasm::parse(qasm).expect("failed to parse QASM");
let result = simulate(&circuit).seed(42).run().expect("simulation failed");
}

run_qasm(qasm, seed) is a shortcut that parses and simulates in one call. See the OpenQASM Support guide for the supported subset.

Qubit ordering

q[0] is the least significant bit, so x q[0] produces state index 1, not 2. Bitstrings print most-significant qubit first.

Next: sample measurement outcomes in Shots and Sampling.

Shots and Sampling

Probabilities give you the exact distribution. Shots give you sampled measurement outcomes, the way real hardware reports results. PRISM-Q samples deterministically from a fixed seed.

Sampling shots

#![allow(unused)]
fn main() {
use prism_q::circuit::openqasm;
use prism_q::simulate;

let qasm = r#"
    OPENQASM 3.0;
    include "stdgates.inc";
    qubit[2] q;
    bit[2] c;
    h q[0];
    cx q[0], q[1];
    c[0] = measure q[0];
    c[1] = measure q[1];
"#;
let circuit = openqasm::parse(qasm).expect("failed to parse QASM");

let result = simulate(&circuit).seed(42).shots(1024).expect("shots failed");
print!("{result}");   // ShotsResult implements Display
}

The same seed always produces the same samples. Pass rand::random() as the seed for non-deterministic sampling.

Counts and marginals

For large shot counts, you usually want aggregates rather than raw shots:

#![allow(unused)]
fn main() {
// Frequency histogram: bitstring -> count
let counts = simulate(&circuit).seed(42).sample_counts(100_000).unwrap();

// Per-qubit P(measuring |1>), without the full joint distribution
let marginals = simulate(&circuit).seed(42).marginals().unwrap();
}

Sampling scales past the statevector

sample_counts and shots route through PRISM-Q's compiled samplers, which propagate measurements through the circuit instead of materializing the full statevector on every shot. For Clifford circuits this scales to thousands of qubits. See Compiled Samplers.

Noisy sampling

Attach a NoiseModel to sample under depolarizing or readout noise:

#![allow(unused)]
fn main() {
use prism_q::{simulate, BackendKind, NoiseModel};

let noise = NoiseModel::uniform_depolarizing(&circuit, 0.001);
let result = simulate(&circuit)
    .backend(BackendKind::Statevector)
    .noise(noise)
    .seed(42)
    .shots(1024)
    .unwrap();
}

The Noise and QEC guide covers noise models and detector sampling in depth.

Next: learn how PRISM-Q picks a representation in Choosing a Backend.

Choosing a Backend

By default PRISM-Q inspects the circuit and picks a backend for you. You only need to choose explicitly when you know something the auto-dispatcher cannot infer, or when you are benchmarking a specific representation.

Let it choose

#![allow(unused)]
fn main() {
use prism_q::simulate;

let result = simulate(&circuit).seed(42).run().unwrap();   // BackendKind::Auto
}

Auto-dispatch walks this decision tree:

flowchart TD
    A[Auto] --> E{Entangling gates?}
    E -- none --> PS[ProductState]
    E -- yes --> CL{All Clifford?}
    CL -- yes --> STB[Stabilizer]
    CL -- no --> MEM{Above memory limit?}
    MEM -- "yes, sparse-friendly" --> SPR[Sparse]
    MEM -- "yes, otherwise" --> MPS[MPS bond 256]
    MEM -- no --> IND{Partial independence?}
    IND -- yes --> FAC[Factored]
    IND -- no --> SV[Statevector]

Choose explicitly

#![allow(unused)]
fn main() {
use prism_q::{simulate, BackendKind};

let result = simulate(&circuit)
    .backend(BackendKind::Stabilizer)
    .seed(42)
    .run()
    .unwrap();
}

Symptom to backend

If your circuit...UseWhy
Is Clifford-only (H, S, CX, CZ, ...)StabilizerO(n²), scales to thousands of qubits
Has no entangling gatesProductStateO(n), per-qubit state
Is dense and ≤ ~28 qubitsStatevectorExact, fastest for the general case
Stays concentrated in few basis statesSparseO(k) in nonzero amplitudes
Has low entanglement but many qubitsMps { max_bond_dim }Polynomial memory in bond dim
Splits into independent sub-registersFactoredSimulates blocks separately, merges lazily
Is Clifford + a few T gatesSee Clifford+TBeats dense statevector

ProductState rejects entanglement

ProductState errors on any entangling gate by design. Auto-dispatch only selects it for circuits that have none. Choose it explicitly only when you know the circuit is a product state throughout.

The Backends Deep Dive and the architecture reference cover each backend's internals.

Backends Deep Dive

PRISM-Q does not have one simulation algorithm. It has eight, each optimal for a different class of circuit. This guide is the task-oriented companion to the architecture reference: it focuses on scaling and when to reach for each one. To select a backend in code, see Choosing a Backend. For which CPU and GPU architectures each backend supports, see the Capability and Support Matrix.

Scaling at a glance

BackendMemoryBest forCeiling
StatevectorDense general circuits~28 qubits (RAM-bound)
StabilizerClifford-only circuitsThousands of qubits
Factored Stabilizer per clusterClifford with independent blocksThousands of qubits
Sparse nonzeroConcentrated supportLarge , small
MPSLow entanglementLarge , bounded
ProductNo entanglementUnbounded
Tensor Networkorder-dependentShallow / structured prob qubits
Factored worst casePartially independentBlock-bound

Statevector

The default for dense circuits. Exact, fully general, and the fastest option whenever the state fits in RAM. The memory cap is derived from system RAM (overridable with PRISM_MAX_SV_QUBITS). Above it, auto-dispatch falls back to Sparse or MPS.

Stabilizer

If your circuit uses only Clifford gates (H, S, Sdg, SX, SXdg, CX, CZ, SWAP, measurement), the stabilizer tableau simulates it in and scales to thousands of qubits. Auto-dispatch selects it whenever the circuit is Clifford-only.

Tip

Add even a single non-Clifford gate (T, Rz(θ) with arbitrary θ) and the stabilizer backend no longer applies. For a small number of such gates, see Clifford+T Simulation.

Sparse, MPS, Product, Tensor Network, Factored

  • Sparse wins when the state stays concentrated in a handful of computational-basis states (amplitude pruning keeps the map small).
  • MPS trades exactness for polynomial memory in the bond dimension. Ideal for low-entanglement circuits over many qubits.
  • Product is the degenerate, entanglement-free case: memory, per 1q gate.
  • Tensor Network defers contraction until measurement, useful for shallow or structured circuits.
  • Factored detects partial independence and simulates sub-registers separately, merging lazily via a Kronecker product computed on demand.

For the internal kernels behind each of these, read the architecture reference. For raw speed mechanics, see Performance and SIMD.

Capability and Support Matrix

This page records which CPU and GPU architectures each PRISM-Q backend supports, and where distributed execution stands. CPU backends are written in portable Rust and run on every supported architecture; SIMD acceleration (AVX2/FMA/BMI2 on x86-64, NEON on ARM64) is selected at runtime where a kernel exists, otherwise a scalar path is used.

Legend

MarkMeaning
YesSupported
SIMDSupported with a dedicated SIMD-accelerated kernel on this architecture
ScalarRuns, but without a dedicated SIMD kernel (portable fallback)
NoNot available for this backend
PlannedNot implemented yet; on the roadmap

Backend support by architecture

Backendx86-64AVX2/FMA/BMI2ARM64NEONCUDA (NVIDIA)ROCm (AMD)Distributed
StatevectorYesSIMDYesSIMDYesPlannedPlanned
StabilizerYesSIMDYesSIMDYesPlannedPlanned
Factored StabilizerYesSIMDYesSIMDNoPlannedPlanned
SparseYesScalarYesScalarNoPlannedPlanned
MPSYesSIMDYesSIMDNoPlannedPlanned
Product StateYesScalarYesScalarNoPlannedPlanned
Tensor NetworkYesScalarYesScalarNoPlannedPlanned
FactoredYesSIMDYesSIMDNoPlannedPlanned
Stabilizer RankYesSIMDYesSIMDNoPlannedPlanned
Stochastic PauliYesScalarYesScalarNoPlannedPlanned
Deterministic PauliYesScalarYesScalarNoPlannedPlanned

Notes:

  • AVX2/FMA/BMI2 is the x86-64 SIMD tier. The active tier is chosen at runtime (AVX2+FMA, then FMA, then SSE2 baseline). See Threading, SIMD, and Memory Layout.
  • NEON is the ARM64 SIMD tier. Backends marked SIMD carry a NEON kernel that mirrors the x86-64 path; the rest fall back to scalar code on ARM64.
  • CUDA covers the optional gpu feature. Only the statevector and stabilizer paths have device kernels; every other backend runs on CPU. See the GPU Backend guide.

Not yet supported

TargetStatusNotes
ROCm (AMD GPU)PlannedNo AMD device kernels; the GPU path is CUDA-only
Distributed CPUPlannedNo multi-node execution
Distributed GPUPlannedNo multi-node GPU execution
Multi-GPUPlannedA GPU context binds a single device

These targets are listed so the matrix reflects the roadmap rather than hiding the gaps. Distributed execution is not a code path PRISM-Q exposes today.

Performance and SIMD

Performance is the primary product requirement. This guide explains the mechanisms that make PRISM-Q fast and the knobs you can turn. The internals live in the architecture reference under Fusion Pipeline and Threading, SIMD, and Memory Layout.

The three levers

  1. Fusion collapses many small gate passes into fewer, larger ones before execution, reducing memory traffic over the statevector. It is qubit-count gated and zero-cost when it does not apply.
  2. Cache-resident tiling keeps batched gates (MultiFused, Multi2q) operating on L2/L3-sized tiles so repeated passes reuse hot data.
  3. SIMD vectorizes the inner complex-arithmetic loop with AVX2+FMA, FMA, and BMI2, with a scalar fallback on non-x86_64.

Threading

Rayon parallel kernels engage at ≥14 qubits (below that, thread-pool overhead dominates), with MIN_PAR_ELEMS = 4096 per task. The pool defaults to all logical cores.

Control the thread pool

Set RAYON_NUM_THREADS to cap parallelism. Hyperthreading helps at 24+ qubits by hiding memory latency, but on a contended host it adds noise to benchmarks.

Determinism

Same circuit plus same seed yields the same result regardless of thread count. Parallel backends use deterministic work partitioning, so reproducibility never costs correctness.

Tuning environment variables

VariableEffect
PRISM_MAX_SV_QUBITSOverride the statevector memory cap
RAYON_NUM_THREADSCap Rayon thread count
PRISM_NO_AVX2_2QForce the 128-bit FMA 2q kernel (A/B comparison)
PRISM_NO_REORDERDisable disjoint Fused2q tier grouping
PRISM_GPU_MIN_QUBITSGPU crossover qubit count (with the gpu feature)

Benchmarking

Benchmark with the parallel feature

Always run benchmarks with --features parallel. The baselines were taken with Rayon enabled; without it, large circuits run single-threaded and are not comparable. Never run two cargo bench processes at once: competing Rayon pools cause large swings.

cargo bench --bench circuits --features parallel       # circuit macrobenchmarks
cargo bench --bench bench_driver --features parallel   # gate microbenchmarks

For current wall-clock numbers across the circuit suite, see the Benchmarks page.

OpenQASM Support

PRISM-Q parses a practical subset of OpenQASM 3.0, with backward compatibility for common 2.0 syntax. The parser converts text directly to the Circuit IR with no intermediate AST.

Parsing and running

#![allow(unused)]
fn main() {
use prism_q::circuit::openqasm;
use prism_q::simulate;

let circuit = openqasm::parse(qasm_str).expect("parse error");
let result = simulate(&circuit).seed(42).run().unwrap();
}

run_qasm(qasm, seed) parses and simulates in one call.

Declarations and measurement

OPENQASM 3.0;
include "stdgates.inc";
qubit[3] q;          // OpenQASM 3.0 register
bit[3] c;
h q[0];
cx q[0], q[1];
c[0] = measure q[0]; // OQ3 measurement

OpenQASM 2.0 syntax also works: qreg q[3]; / creg c[3]; declarations and measure q[0] -> c[0]; measurement.

Supported gates

  • Standard / aliases: x, y, z, h, s, sdg, t, tdg, sx, rx, ry, rz, p/phase, cx/CX/cnot, cy, cz, cp/cphase, crx, cry, crz, ch, swap, ccx/toffoli, cswap/fredkin, cu, u1, u2, u3/u/U.
  • Qiskit / exporter: sxdg, cs, csdg, csx, ccz, r, rzz, rxx, ryy, xx_plus_yy, xx_minus_yy, ecr, iswap, dcx, c3x, c4x, mcx, rccx, rc3x/rcccx.
  • Hardware-native: gpi, gpi2, ms, syc, sqrt_iswap, sqrt_iswap_inv.

Other supported constructs

  • Gate modifiers: ctrl @, inv @, pow(k) @.
  • User-defined gate blocks.
  • Classical if conditionals.
  • Multi-register broadcast, barrier, and an expression evaluator with math functions.

Not supported

for / while loops, subroutines, and classical expressions beyond if. A construct that parses as valid OpenQASM but is unsupported returns UnsupportedConstruct rather than panicking; see the Error Model.

Qubit ordering

q[0] is the least significant bit, so x q[0] produces state index 1, not 2.

Clifford+T Simulation

Circuits that mix Clifford gates with a modest number of T gates sit between the efficient stabilizer regime and the exponential statevector regime. PRISM-Q offers three strategies. The right one depends on your T-count, qubit count, and whether you need exact answers or can tolerate Monte Carlo error.

Which strategy?

  • Few T gates, exact result needed: stabilizer rank (run_stabilizer_rank).
  • Many T gates, marginals only: stochastic Pauli propagation (run_spp).
  • Moderate T-count, exact or bounded-error expectation values: deterministic sparse Pauli dynamics (run_spd).

These route through the Clifford+T strategies before the standard dispatch tree when the T-count permits.

Stabilizer rank (src/sim/stabilizer_rank.rs)

Exact probability output remains capped because it returns a dense vector with 2^n entries. Shot sampling uses coherent weighted MPS branches instead of a dense statevector fallback. Clifford gates mutate each branch state, T and Tdg split branches, and measurement computes outcome probabilities from the weighted branch ensemble before projecting every branch to the sampled outcome. This removes the hard qubit-count cap from run_stabilizer_rank_shots; practical scaling is governed by branch count, MPS bond growth, and measurement count.

The dense probability path maintains a weighted sum of stabilizer states. Each T gate doubles the term count via the T = alpha*I + beta*Z decomposition. Clifford gates are O(n²) per term and weighted amplitudes are accumulated for exact probabilities.

FunctionUse
run_stabilizer_rankExact probabilities (t ≤ 20, n ≤ 25)
run_stabilizer_rank_approxApproximate with Monte Carlo (higher t counts)
run_stabilizer_rank_shotsShot-based sampling with no fixed qubit cap
stabilizer_overlap_sqInner product between stabilizer states

Stochastic Pauli Propagation (src/sim/unified_pauli.rs)

Backward-propagates measurement observables as Pauli strings. Clifford gates conjugate in O(1). T gates branch stochastically into two Pauli paths with appropriate weights. Per-path cost O(d×n/64), independent of T-gate count. Returns marginal probabilities via Monte Carlo estimation.

#![allow(unused)]
fn main() {
run_spp(circuit, num_samples, seed) // -> SppResult
}

Deterministic Sparse Pauli Dynamics (src/sim/unified_pauli.rs)

Backward-propagates as a weighted sum of Pauli strings stored in a HashMap. T gates deterministically branch X/Y terms. Identical strings auto-merge. Optional ε-truncation for approximate mode. Exact for small T-counts, approximate with bounded error for larger ones.

#![allow(unused)]
fn main() {
run_spd(circuit, epsilon, max_terms) // -> SpdResult
}

You can also build Clifford+T test circuits directly with clifford_t_circuit.

Noise and QEC

PRISM-Q models noise and quantum error correction without falling back to a dense statevector per shot. The machinery is the compiled samplers and the native QEC program IR; this guide shows how they fit together.

Noisy shot sampling

Attach a NoiseModel and sample:

#![allow(unused)]
fn main() {
use prism_q::{simulate, BackendKind, NoiseModel};

let noise = NoiseModel::uniform_depolarizing(&circuit, 0.001);
let result = simulate(&circuit)
    .backend(BackendKind::Statevector)
    .noise(noise)
    .seed(42)
    .shots(1024)
    .unwrap();
}

NoiseModel carries per-instruction depolarizing channels (NoiseOp { qubit, px, py, pz }) and supports readout error and amplitude damping. For Clifford circuits, the noisy compiled sampler propagates noise sensitivity rows and XORs fired channels into each sample, avoiding per-shot state evolution entirely.

Detector sampling

For repeated syndrome extraction, compile_detector_sampler compiles a Clifford circuit with measurement and reset reuse into a packed sampler, then derives detector and observable records as parity rows over the measurement record. Reset reuse becomes fresh qubit aliases, so there is no per-shot tableau replay.

Native QEC programs

When you need detectors, logical observables, postselection, and Pauli-noise annotations as first-class constructs, use the native QEC program IR rather than a Circuit:

#![allow(unused)]
fn main() {
use prism_q::{parse_qec_program, run_qec_program};

let program = parse_qec_program(qec_text).unwrap();
let result = run_qec_program(&program).unwrap();
}

run_qec_program lowers Clifford-compatible programs into the packed compiled sampler. run_qec_program_reference is the per-shot statevector oracle for validating small programs.

What QEC programs support

Clifford gates, basis resets and measurements, MPP Pauli-product measurements, detectors, observables, postselection, and X_ERROR / Z_ERROR / DEPOLARIZE1 / DEPOLARIZE2 noise. Non-Clifford gates and EXP_VAL are rejected until their production strategies land. See the QEC IR reference for the full grammar and the V1 reset requirement.

Homological sampling

run_shots_homological and ErrorChainComplex model the GF(2) chain complex over noise locations, identifying undetectable error cycles. noisy_marginals_analytical computes marginals in closed form from the parity matrix and noise rates, with no Monte Carlo.

GPU Backend

Info

The GPU backend is optional and gated behind the gpu feature. It requires the CUDA toolkit (12.x or newer) and a CUDA-capable device.

cargo build --release --features "parallel gpu"
cargo test --features "parallel gpu" --test golden_gpu

CUDA acceleration covers statevector execution, stabilizer execution, and compiled BTS sampling. Five entry points are available:

  • BackendKind::StatevectorGpu { context }. Public dispatch path for statevector GPU execution. It routes through simulate(circuit).backend(kind).seed(seed).run(), keeps fusion and subsystem decomposition, and uses crate::gpu::min_qubits() (default 14, PRISM_GPU_MIN_QUBITS override) to keep small sub-circuits on CPU.
  • BackendKind::StabilizerGpu { context }. Public dispatch path for stabilizer GPU execution. Gate application uses a device tableau and one batched Clifford kernel (stab_apply_batch). Measurement and reset keep pivot search, row cascade, phase fixup, and deterministic outcomes on the device. The default crossover stays conservative (STABILIZER_MIN_QUBITS_DEFAULT = 100_000, PRISM_STABILIZER_GPU_MIN_QUBITS override) until benchmarks justify lowering it. Direct backend benchmarks should use StabilizerBackend::with_gpu(ctx) to exclude diagnostic readbacks from probabilities(), export_tableau(), and export_statevector(). Golden tests cover every kernel path, including 500q GHZ measure-all.
  • StatevectorBackend::new(seed).with_gpu(ctx). Direct statevector GPU opt-in. Every instruction routes to CUDA after the context is attached. No crossover or subsystem decomposition applies.
  • StabilizerBackend::new(seed).with_gpu(ctx). Direct stabilizer GPU opt-in for kernel benchmarks and targeted correctness tests.
  • run_shots_compiled_with_gpu (or CompiledSampler::with_gpu(ctx)). GPU BTS sampling for flat sparse parity. The path launches one kernel per 65_536-shot chunk, uses random bits generated on the host, and preserves the CPU sample_bts_meas_major layout. The sampler caches sparse parity CSR arrays, packed reference bits, and reusable scratch on the device. It is active only when num_shots >= BTS_MIN_SHOTS_DEFAULT (131_072 by default, PRISM_GPU_BTS_MIN_SHOTS override). sample_bulk_packed_device returns a DevicePackedShots handle. Marginals reduce to one counter per measurement row on the device. Exact counts use a bounded device hash reduction for up to 8 packed measurement words when the compact result is cheaper to transfer than the full shot matrix. Otherwise the API uses a host copy for correctness.

When a GPU context is attached, Backend::init allocates state on the device instead of a host Vec<Complex64> and every instruction routes to a CUDA kernel.

Module layout (src/gpu/)

FileRole
mod.rsGpuContext, GpuState public entry points
device.rsGpuDevice: cudarc wrapper, compiles PTX at device construction
memory.rsGpuBuffer: device Complex64 storage
kernels/mod.rsKERNEL_NAMES, composed kernel_source() concatenating dense + stabilizer
kernels/dense.rsPTX source and Rust launcher for every Gate variant
kernels/stabilizer.rsPTX source and launchers for tableau init, 11 Clifford gates, rowmul_words

Kernel coverage

Every variant in the Gate enum has a dedicated kernel. Batched variants (BatchPhase, BatchRzz, DiagonalBatch, MultiFused { all_diagonal: true }) use LUT kernels that consume the same host table builders as the CPU path. Non-diagonal MultiFused uses a shared memory tiled kernel (apply_multi_fused_tiled, TILE_Q = 10, TILE_SIZE = 1024). Sub-gates whose target bit is inside the tile apply in shared memory. Sub-gates whose target bit is outside the tile fall back to per gate launches. Multi2q still launches once per sub-gate; rare in practice.

PTX template substitution: the CUDA C source is held as a template string (KERNEL_SOURCE_TEMPLATE) with placeholders such as {{BP_TABLE_SIZE}} and {{TILE_Q}}. The kernel_source() function substitutes them at device construction from the Rust constants in src/backend/statevector/kernels.rs, keeping CPU and GPU in sync.

Correctness

tests/golden_gpu.rs drives 20 cross-checks comparing GPU amplitudes against the CPU statevector within 1e-10. Covers every gate variant, fusion paths, and the BackendKind::StatevectorGpu public dispatch path at the crossover boundary.

Current limits

  • BackendKind::Auto does not yet dispatch to GPU. The StatevectorGpu variant is the explicit opt-in until Auto can receive a default GPU context.
  • Several fused launchers (launch_apply_fused_2q, launch_apply_mcu, launch_apply_mcu_phase, launch_apply_batch_phase, launch_apply_batch_rzz, launch_apply_diagonal_batch, launch_apply_multi_fused_nondiag) upload small metadata tables per dispatch via clone_htod. This is the next launch overhead target.
  • measure_prob_one allocates a device partials buffer and reduces on the host every call. Matters for measurement-heavy circuits.
  • Kernel design and crossover analysis live in the module docstrings on src/gpu/kernels/dense.rs.

Architecture: Overview and Layered Design

This section is the technical reference for how PRISM-Q is built. See the Glossary for definitions of terms used throughout.

Goals

  • Primary: Fastest practical quantum circuit simulation in Rust.
  • Correct simulation of supported gate sets across multiple backend strategies.
  • Clean backend plugin model. New simulation strategies can be added without touching the core.

Non-goals

  • Full OpenQASM 3.0 compliance (supports a practical subset).
  • GUI or notebook integration (library-first).
  • Hardware backend / QPU connectivity.

Layered design

A circuit flows top to bottom: text is parsed into a backend-agnostic IR, optimized by the fusion pipeline, then dispatched by the simulation engine to one of the backends or a compiled sampler.

flowchart TD
    U[User / Application]
    API["Public API &mdash; run_qasm, simulate (src/lib.rs)"]
    P["OpenQASM 3.0 Parser &mdash; &amp;str to Circuit IR (src/circuit/openqasm.rs)"]
    IR["Circuit IR &mdash; gates, measures, barriers, conditionals (src/circuit/mod.rs)"]
    F["Fusion Pipeline &mdash; cancel, fuse, reorder, batch (src/circuit/fusion.rs)"]
    E["Simulation Engine &mdash; dispatch, decompose, execute (src/sim/mod.rs)"]
    U --> API --> P --> IR --> F --> E
    E --> B[Backends]
    E --> C["Compiled Samplers &mdash; shot-based (src/sim/compiled, noise.rs, homological.rs)"]
    B --> SV[Statevector]
    B --> TN[Tensor Network]
    B --> MPS[MPS]
    B --> SP[Sparse]
    B --> PR[Product]
    B --> ST[Stabilizer]
    B --> FA[Factored]

The remaining pages in this section follow that flow: the parser and circuit IR, the fusion pipeline, the simulation engine and dispatch, the individual backends, the compiled samplers, the native QEC program IR, the threading, SIMD, and memory layout, and the error model and public API surface.

Parser and Circuit IR

Parser

Handwritten parser targeting a practical OpenQASM 3.0 subset. It processes input line by line and converts &str directly to Circuit IR with no intermediate AST.

Supported: qubit/bit declarations, OpenQASM standard gates and aliases (x, y, z, h, s, sdg, t, tdg, sx, rx, ry, rz, p/phase, cx/CX/cnot, cy, cz, cp/cphase, crx, cry, crz, ch, swap, ccx/toffoli, cswap/fredkin, cu, u1, u2, u3/u/U), Qiskit and exporter gates (sxdg, cs, csdg, csx, ccz, r, rzz, rxx, ryy, xx_plus_yy, xx_minus_yy, ecr, iswap, dcx, c3x, c4x, mcx, rccx, rc3x/rcccx), hardware-native gates (gpi, gpi2, ms, syc, sqrt_iswap, sqrt_iswap_inv), gate modifiers (ctrl @, inv @, pow(k) @), user-defined gate blocks, classical if conditionals, multi-register broadcast, measure, barrier, expression evaluator with math functions. OpenQASM 2.0 backward compatibility (qreg/creg, measure q -> c syntax).

Unsupported: for/while loops, subroutines, classical expressions beyond if.

See the OpenQASM Support guide for a user-facing walkthrough.

Circuit IR

Circuit holds num_qubits, num_classical_bits, and Vec<Instruction>. Instructions are an enum:

VariantFieldsDescription
Gategate, targetsGate application
Measurequbit, classical_bitDestructive measurement
BarrierqubitsSynchronization barrier
Conditionalcondition, gate, targetsClassical-controlled gate

Targets use SmallVec<[usize; 4]>, inline storage for up to 4 qubits, no heap allocation for typical gates.

Gate enum

Gate is a Clone enum kept at 16 bytes. Simple variants carry parameters inline. Composite variants use Box to stay within the 16-byte budget for cache-friendly dispatch.

Keep the enum at 16 bytes

Adding inline data larger than 16 bytes pollutes cache lines and has caused 40-130% regressions. Always check size_of::<Gate>() after adding a variant, and Box large payloads.

VariantDataSize
Id, X, Y, Z, H, S, Sdg, T, Tdg, SX, SXdgNone16B
Rx(f64), Ry(f64), Rz(f64), P(f64), Rzz(f64)Inline f6416B
Cx, Cz, SwapNone16B
Cu(Box<[[Complex64; 2]; 2]>)Boxed 2×216B
Mcu(Box<McuData>)Boxed matrix + control count16B
Fused(Box<[[Complex64; 2]; 2]>)Boxed pre-fused 1q matrix16B
Fused2q(Box<[[Complex64; 4]; 4]>)Boxed pre-fused 2q matrix16B
MultiFused(Box<MultiFusedData>)Batched 1q gates for tiled pass16B
Multi2q(Box<Multi2qData>)Batched 2q gates for tiled pass16B
BatchPhase(Box<BatchPhaseData>)Batched cphase with shared control16B
BatchRzz(Box<BatchRzzData>)Batched ZZ rotations16B
DiagonalBatch(Box<DiagonalBatchData>)Mixed diagonal 1q/2q batch16B

Qubit ordering

q[0] is the least significant bit. Applying x q[0] produces state index 1, not 2.

Fusion Pipeline

Gate optimizations before execution, gated by qubit count thresholds. Every pass returns Cow<Circuit>. Borrowed when no optimization applies, so circuits that do not benefit pay zero overhead.

flowchart TD
    IN[Input Circuit] --> P0["cancel_self_inverse_pairs (always)"]
    P0 --> P0r["fuse_rzz (always): CX&middot;Rz&middot;CX to Rzz"]
    P0r --> P0b["fuse_batch_rzz (>=16q): N&times;Rzz to BatchRzz"]
    P0b --> G{"qubits >= MIN_QUBITS_FOR_FUSION (10)?"}
    G -- no --> OUT[Output Circuit]
    G -- yes --> P1["fuse_single_qubit_gates (>=10q)"]
    P1 --> P1r["reorder_1q_gates (>=10q)"]
    P1r --> P1c["cancel_self_inverse_pairs (>=10q)"]
    P1c --> P1f["fuse_single_qubit_gates re-fuse (>=10q)"]
    P1f --> P2q["fuse_2q_gates (>=12q): CX/CZ + adjacent 1q to Fused2q"]
    P2q --> P2qb["fuse_same_pair_2q_blocks (>=12q)"]
    P2qb --> P2["fuse_multi_1q_gates (>=14q) to MultiFused"]
    P2 --> P2qr["reorder_disjoint_fused2q (>=12q)"]
    P2qr --> Pm2q["fuse_multi_2q_gates (>=12q) to Multi2q"]
    Pm2q --> Pcp["fuse_controlled_phases (>=16q) to BatchPhase"]
    Pcp --> Pdb["fuse_diagonal_batch (>=16q) to DiagonalBatch"]
    Pdb --> Ppp["batch_post_phase_1q (>=18q)"]
    Ppp --> OUT

Threshold constants

ConstantValueRationale
MIN_QUBITS_FOR_FUSION10Below this, clone cost exceeds simulation savings
MIN_QUBITS_FOR_MULTI_FUSION14MultiFused tiling overhead vs benefit
MIN_QUBITS_FOR_DIAG_BATCH16Diagonal batch, cphase, and Rzz batching
MIN_QUBITS_FOR_POST_PHASE_BATCH18Post-phase 1q re-batching
MIN_QUBITS_FOR_2Q_FUSION12Benchmarked QV and random sweeps show memory-pass reduction wins from 12q
MIN_QUBITS_FOR_MULTI_2Q_FUSION12Same as 2q fusion

Tip

Fusion is not on the hot path. Worst-case fusion cost is on the order of microseconds against tens of milliseconds of gate application, so these passes are tuned for correctness and clarity, not for their own runtime.

Simulation Engine and Dispatch

Backend trait

#![allow(unused)]
fn main() {
pub trait Backend {
    fn name(&self) -> &'static str;
    fn init(&mut self, num_qubits: usize, num_classical_bits: usize) -> Result<()>;
    fn apply(&mut self, instruction: &Instruction) -> Result<()>;
    fn classical_results(&self) -> &[bool];
    fn probabilities(&self) -> Result<Vec<f64>>;
    fn num_qubits(&self) -> usize;

    // Optional overrides:
    fn apply_instructions(&mut self, instructions: &[Instruction]) -> Result<()>;  // batch apply
    fn supports_fused_gates(&self) -> bool;   // false for symbolic backends (stabilizer)
    fn export_statevector(&self) -> Result<Vec<Complex64>>;  // for backend transitions
}
}

Contract: init before apply. Instructions arrive in circuit order. Measurement is destructive. Deterministic given same RNG seed.

Entry points

Orchestration layer in src/sim/mod.rs.

FunctionDescription
simulate(circuit).seed(seed).run()Auto-dispatch, full output
simulate(circuit).backend(kind).seed(seed).run()Explicit backend selection
simulate(circuit).seed(seed).shots(shots)Multi-shot sampling
simulate(circuit).backend(kind).seed(seed).shots(shots)Multi-shot with backend selection
simulate(circuit).backend(kind).noise(noise).seed(seed).shots(shots)Noisy multi-shot
simulate(circuit).seed(seed).sample_counts(shots)Auto-dispatched frequency histogram
simulate(circuit).backend(kind).seed(seed).sample_counts(shots)Frequency histogram with backend selection
simulate(circuit).seed(seed).marginals()Auto-dispatched per-qubit marginal probabilities
simulate(circuit).backend(kind).seed(seed).marginals()Per-qubit marginal probabilities with backend selection
run_on(backend, circuit)Pre-constructed backend
run_qasm(qasm, seed)Parse + simulate

RunOutcome::probabilities is None only when the selected backend cannot expose a dense probability distribution for the requested circuit, such as factored stabilizer or decomposed runs above the dense output cap. Other probability extraction failures propagate as errors. marginals() requires either a direct Pauli marginal route or backend probability output; it returns BackendUnsupported instead of fabricating uniform marginals when neither path is available. Stochastic and deterministic Pauli marginal backends accept only unitary Clifford+T circuits without measurement, reset, or conditional instructions.

Auto-dispatch decision tree

flowchart TD
    A[BackendKind::Auto] --> E{Entangling gates?}
    E -- none --> PS["ProductState (O(n))"]
    E -- yes --> CL{All Clifford?}
    CL -- yes --> STB["Stabilizer (O(n^2))"]
    CL -- no --> MEM{Above memory limit?}
    MEM -- "yes, sparse-friendly" --> SPR["Sparse (O(k))"]
    MEM -- "yes, otherwise" --> MPS["MPS (bond dim 256)"]
    MEM -- no --> IND{Partial independence?}
    IND -- yes --> FAC["Factored (split-state)"]
    IND -- no --> SV["Statevector (exact)"]

Memory limit is dynamically computed from available system RAM (50% budget, capped at 33 qubits). Overridable via PRISM_MAX_SV_QUBITS environment variable. Falls back to 28 qubits (4 GB) when detection unavailable.

For a user-facing version of this decision, see Choosing a Backend.

Subsystem decomposition

Union-find detects independent qubit groups in O(n·α(n)). Each block runs separately with per-block Auto dispatch. Results merge lazily via Probabilities::Factored, a Kronecker product computed on demand per element in O(K), avoiding the O(2^N) dense materialization unless explicitly requested.

Block-level Rayon parallelism when all blocks are <14 qubits (avoids oversubscription with block-internal parallelism).

Temporal Clifford decomposition

For Clifford+T circuits: Clifford prefix runs on the Stabilizer backend, state is exported to Statevector for the non-Clifford tail. Saves exponential memory for circuits with a long Clifford preamble.

Backend dispatch variants

All BackendKind variants:

VariantBackendSelection
AutoDecision tree (see above)Default
StatevectorFull state-vectorExplicit
StabilizerAaronson-Gottesman tableauExplicit or auto (all Clifford)
FactoredStabilizerPer-cluster tableauxExplicit or auto (large independent Clifford blocks)
SparseHashMap stateExplicit or auto (above memory limit, sparse-friendly)
Mps { max_bond_dim }Matrix Product StateExplicit or auto (above memory limit)
ProductStatePer-qubit productExplicit or auto (no entangling)
TensorNetworkDeferred contractionExplicit
FactoredDynamic split-stateExplicit or auto (partial independence)
StabilizerRankWeighted stabilizer sumExplicit
StochasticPauli { num_samples }SPPExplicit
DeterministicPauli { epsilon, max_terms }SPDExplicit

Backends

PRISM-Q ships seven CPU backends plus an optional CUDA path. The simulation engine picks one automatically, or you can select explicitly. For a task-oriented version of this material, see the Backends Deep Dive guide.

The diagrams below are rendered directly from PRISM-Q's own SVG circuit renderer.

GHZ state preparation circuit

Statevector

Full-state simulation in a flat Vec<Complex64> of 2^n amplitudes. The primary backend for circuits up to ~28 qubits.

Gate kernels use enum dispatch with specialized routines for CX, CZ, SWAP, Cu, MCU, Rzz, BatchRzz, BatchPhase, DiagonalBatch, and MultiFused. Single-qubit gates go through PreparedGate1q with FMA-vectorized SIMD. MultiFused gates use a three-tier tiled kernel (L2 16K / L3 131K / individual passes) for cache locality. MultiFused batches where all gates are diagonal dispatch to a dedicated fast path (1 complex multiply/element vs 4+2 for full 2×2).

Rayon parallelism at ≥14 qubits with par_chunks_mut and MIN_PAR_ELEMS = 4096 per task. BMI2 _pext_u64 accelerates BatchPhase, BatchRzz, and DiagonalBatch LUT indexing.

Deferred measurement normalization: pending_norm accumulates normalization factors without full-state scaling passes. Zero-cost for circuits without measurements.

The Quantum Fourier Transform is a representative statevector workload, dense with controlled-phase gates that the fusion pipeline batches:

Quantum Fourier Transform circuit

Stabilizer

Aaronson-Gottesman bit-packed tableau for Clifford circuits. O(n²) time and space. Scales to thousands of qubits. Gate kernels use wordwise bitwise ops and popcount for phase computation. Supports H, S, Sdg, SX, SXdg, CX, CZ, SWAP, and measurement.

Word-group batching fuses multiple 1q gate flushes into single tableau passes. Type-grouped masks apply all gates of the same Pauli type with one wordwise op instead of per-gate dispatch. Sparse Generator Indexing (SGI) tracks per-qubit active generator lists, enabling targeted row operations instead of full-tableau scans. Lazy destabilizer materialization defers destabilizer rows until probabilities are requested.

Probability extraction uses coset-based enumeration with GF(2) Gaussian elimination. O(2^k) where k is the number of non-diagonal generators, rather than O(2^n).

Factored Stabilizer (FactoredStabilizerBackend): Per-cluster tableaux with dynamic merging. Starts with one qubit per cluster. Cross-cluster 2q gates merge tableaux. Measurement and reset can split independent sub-tableaux again. Independent subsystems avoid full-tableau work when product structure is preserved.

Sparse

HashMap<usize, Complex64> for states with few non-zero amplitudes. O(k) memory. Amplitude pruning (|a|² < 1e-16) after each gate. Best for circuits whose support stays concentrated in computational-basis states at large qubit counts.

MPS (Matrix Product State)

Chain of rank-3 tensors with adaptive bond dimension (default max 256). O(n·χ²) memory. Single-qubit gates absorb via FMA-vectorized SIMD over bond-dimension slices. Two-qubit gates contract adjacent sites, apply the gate, then SVD-truncate back. Non-adjacent gates route through SWAP chains.

Hybrid SVD dispatch: faer (bidiag+D&C) for matrices with m×n ≥ 256, hand-rolled Jacobi for small matrices.

Product State

Per-qubit [Complex64; 2] storage. O(n) memory, O(1) per single-qubit gate. Rejects entangling gates. Selected automatically for circuits with no 2q gates.

Tensor Network

Deferred contraction with a greedy min-size heuristic. Gates append tensors; contraction happens lazily at measurement or probability extraction. MAX_PROB_QUBITS = 25 guards against exponential blowup.

Factored

Dynamic split-state simulation. Starts with n independent 1-qubit states, merges via tensor product only when 2q gates bridge groups. Parallel kernels match statevector patterns for sub-states ≥14 qubits. Selected when subsystem decomposition detects partial independence.

The GPU backend is documented as a user guide.

Compiled Samplers

For multi-shot sampling without materializing the full statevector on every shot.

Noiseless compiled sampler (src/sim/compiled/)

Backward path (compile_measurements): Propagates Pauli Z observables backward through the circuit. Each measurement qubit becomes a row in a GF(2) parity matrix M. Clifford gates conjugate Pauli strings in O(1). The resulting M encodes which input qubits each measurement depends on.

Forward path (compile_forward): Tracks stabilizer generator dependencies forward through the circuit. Produces the same parity matrix via dependency tracking.

Sampling: Random bits for independent generators, then XOR-cascade through the parity matrix. Multiple dispatch tiers:

StrategyConditionMethod
FlipLutSmall rank256-entry XOR lookup table
SparseParitySparse rowsOnly flip non-zero columns
XorDagGeneralOptimal XOR-reduction DAG
ParityBlocksBlocked structurePer-block independent sampling

ShotAccumulator trait: Pluggable result collection.

AccumulatorOutputUse case
HistogramAccumulatorBitstring → count mapStandard shot output
MarginalsAccumulatorPer-qubit P(1)Marginal probabilities
PauliExpectationAccumulator⟨P⟩ for Pauli observablesVQE/QAOA
CorrelatorAccumulator⟨Z_i Z_j⟩ correlationsEntanglement analysis
NullAccumulatorNothingBenchmarking raw sampling speed

PackedShots raw format: PackedShots::RAW_FORMAT_VERSION is the replay contract for raw_data() and into_data(). Version 1 stores little-endian bit order within each u64. ShotMajor stores one row per shot with m_words() = ceil(num_measurements / 64). MeasMajor stores one row per measurement with s_words() = ceil(num_shots / 64). The checked try_from_shot_major and try_from_meas_major constructors reject shape mismatches and non-zero semantic padding. Histograms, marginals, parity rows, and accumulators mask only semantic padding: measurement-tail bits in shot-major data and shot-tail bits in measurement-major data.

Detector sampler (compile_detector_sampler): Compiles Clifford circuits with measurement and reset reuse into the same packed measurement sampler, then derives detector and observable records as packed parity rows over measurement record indices. Reset reuse is represented by fresh qubit aliases, so repeated syndrome extraction avoids per-shot tableau replay. The sampler can return packed measurements, packed detectors, packed observables, detector counts, or feed packed detector chunks into any ShotAccumulator.

Noisy compiled sampler (src/sim/noise.rs)

Backward Pauli propagation through circuit + noise sensitivity analysis. Each noise location gets an X-flip and Z-flip sensitivity row. During sampling, Bernoulli coin flips determine which noise channels fire, then XOR the sensitivity rows into the sample.

NoiseModel: Per-instruction depolarizing noise. NoiseOp { qubit, px, py, pz }.

Homological sampler (src/sim/homological.rs)

ErrorChainComplex: GF(2) chain complex over the circuit's noise locations. Computes the kernel (null space) of the boundary map to identify error cycles that are undetectable by syndrome measurements. HomologicalSampler uses this for sampling with topological error correction awareness.

noisy_marginals_analytical: Closed-form marginal computation using the parity matrix and noise rates. Avoids Monte Carlo sampling entirely.

See the Noise and QEC guide for how these fit together in practice.

Native QEC Program IR

QecProgram in src/qec/mod.rs is a measurement-record IR for QEC workloads that need detectors, logical observables, postselection, expectation metadata, and Pauli-noise annotations before sampler lowering. It is separate from Circuit so measurement-record programs do not need to fit final-measurement OpenQASM semantics.

QecOp stores gates, basis measurements, MPP-style Pauli-product measurements, resets, detector rows, observable includes, expectation-value metadata, postselection predicates, noise annotations, and tick separators. Record references can be absolute indices or rec[-k] style lookbacks. Construction validates qubit bounds, gate arity, finite coordinates and coefficients, finite probabilities, and measurement-record scope. Detector, observable, and postselection rows can be resolved to absolute measurement indices for later compilation into packed samplers.

Parsing

parse_qec_program and QecProgram::from_text parse the native QEC text subset used by current benchmark planning: H, S, S_DAG, T, T_DAG, CX, CZ, R/RX/RY, M/MX/MY, MR variants, MPP, DETECTOR, OBSERVABLE_INCLUDE, POSTSELECT, EXP_VAL, Pauli-noise instructions, TICK, QUBIT_COORDS, SHIFT_COORDS, and flattened REPEAT blocks. The parser resolves rec[-k] references while building the program. Numeric arguments on basis measurements, such as M(0.001), lower to pre-measurement Pauli flips that affect the measurement record.

Lowering

compile_qec_program_rows lowers basis measurements and MPP records into the same packed X/Z Pauli row representation used by the compiled sampler internals. It also carries detector, observable, and postselection rows forward as absolute measurement-record indices. Detector, observable, and postselection projection uses PackedShots::parity_rows, so the QEC layer reuses the existing packed parity engine instead of maintaining a second one. This is a sampler-lowering artifact, not an execution engine. Gate, reset, and noise execution lives in run_qec_program; EXP_VAL remains unsupported in the packed runner until estimator semantics land.

Execution

run_qec_program is the scalable native QEC execution path. It lowers Clifford-compatible QEC programs into the existing packed compiled sampler, so sampling uses packed measurement records instead of dense amplitudes. It supports Clifford gates, basis resets and measurements, MPP, detectors, observables, postselection, Pauli-noise annotations, and optional raw-measurement retention. Noisy Clifford programs compile X_ERROR, Z_ERROR, DEPOLARIZE1, and correlated DEPOLARIZE2 events into packed sensitivity rows, then XOR those rows into the packed measurement records. Non-Clifford gates and EXP_VAL reject clearly until their production strategies land.

V1 reset requirement

V1 requires a reset before any later gate reuse of a measured qubit, because the compiled lowering defers measurements to terminal records. QecOptions::chunk_size bounds compiled-runner shot batches. When raw measurements are omitted, chunking avoids materializing the full measurement record matrix before detector, observable, postselection, and logical-error accounting.

run_qec_program_reference is the small-program correctness oracle. It runs one state-vector simulation per shot and supports Pauli-noise annotations, but it is not the production performance path.

Threading, SIMD, and Memory Layout

For which SIMD tiers and architectures each backend supports, see the Capability and Support Matrix.

Memory layout

BackendState representationMemoryAccess pattern
StatevectorVec<Complex64> (2^n)Strided pair iteration
StabilizerBit-packed Vec<u64> tableau bytesSequential row iteration
SparseHashMap<usize, Complex64>, = nonzeroHash-based random access
MPSChain of rank-3 tensorsSequential site access
ProductVec<[Complex64; 2]>Per-qubit independent
Tensor NetworkNetwork of dense tensorsContraction-order dependent
FactoredVec<Option<SubState>> worst caseDispatch per substate

Threading

Gate kernels have _par variants using par_chunks_mut for safe Rayon parallelism (behind the parallel feature flag):

  • <14 qubits: Single-threaded. Thread-pool overhead exceeds computation.
  • ≥14 qubits: Rayon parallel iterators with MIN_PAR_ELEMS = 4096 (64KB per task).

Thread pool defaults to all logical cores (HT helps at 24q+ by hiding memory latency). Overridable via RAYON_NUM_THREADS.

SIMD

Complex64 maps to 128-bit SIMD naturally. Single-qubit gate kernels use PreparedGate1q with runtime CPU detection and tiered dispatch:

  1. AVX2+FMA (256-bit): 2 complex pairs per iteration. Gated by MAX_AVX2_STATE for full-state passes (Skylake frequency throttling), but used freely within MultiFused L2 tiles where data is cache-resident.
  2. FMA (128-bit): Default for larger states. 3-op complex multiply (permute + mul + fmaddsub).
  3. BMI2: _pext_u64 for BatchPhase, BatchRzz, and DiagonalBatch LUT indexing. One BMI2 bit extraction replaces loops with repeated shifts and ORs.
  4. Scalar fallback: No intrinsics. All SIMD functions have a #[cfg(not(target_arch = "x86_64"))] fallback.

Two key SIMD structs hoist matrix broadcast at construction time, avoiding per-element dispatch:

  • PreparedGate1q: Broadcasts 2×2 matrix into SIMD registers. Methods: apply_full_sequential (full state), apply_tiled (cache-resident tile, no AVX2 throttle guard), apply_slice_pairs (MPS bond-dimension slices), apply_pair_ptr (Cu/Mcu parallel).
  • PreparedGate2q: Broadcasts 4×4 matrix. Methods: apply_full (mask-based iteration), apply_tiled (cache-resident Multi2q tiles, AVX2 paired-group kernel when available), apply_group_ptr (4 scattered indices).

The 2q tiled AVX2 path processes paired k and k + 1 groups when the lower target qubit is above 0, which makes each row load contiguous. It falls back to the 128-bit FMA kernel for lo == 0 and when AVX2+FMA is unavailable. Set PRISM_NO_AVX2_2Q to compare against the 128-bit FMA path, or PRISM_NO_REORDER to disable disjoint Fused2q tier grouping for A/B timing.

Determinism

Same circuit + same seed = same result, regardless of thread count. Parallel backends use deterministic work partitioning.

Error Model and Public API

Error model

All public APIs return Result<T, PrismError>. Error variants:

VariantCategoryDescription
ParseParsingOpenQASM parse error with line number
UnsupportedConstructParsingValid OpenQASM not supported by PRISM-Q
UndefinedRegisterParsingReference to undeclared register
InvalidQubitValidationQubit index exceeds register size
InvalidClassicalBitValidationClassical bit index exceeds register
GateArityValidationWrong number of qubits for gate
InvalidParameterValidationInvalid gate parameter (NaN, etc.)
BackendUnsupportedRuntimeBackend can't perform requested operation
IncompatibleBackendRuntimeBackend incompatible with circuit

Note

No panics on user input. debug_assert! is used for internal invariants only.

Public API surface

Top-level re-exports from src/lib.rs. The full generated documentation is on docs.rs.

Simulation: simulate, run_on, run_qasm, bitstring

Compiled sampling: compile_measurements, compile_forward, compile_detector_sampler, compile_noisy, run_shots_compiled, run_shots_noisy, run_shots_homological, noisy_marginals_analytical

Native QEC: parse_qec_program, compile_qec_program_rows, run_qec_program, run_qec_program_reference, QecProgram, QecOp, QecOptions, QecSampleResult, QecBasis, QecPauli, QecRecordRef, QecNoise, QecMeasurementRow, QecCompiledRows

Clifford+T: run_stabilizer_rank, run_stabilizer_rank_approx, stabilizer_overlap_sq, stabilizer_inner_product, run_spp, run_spp_observable, run_spd, run_spd_observable, run_spd_observable_light_cone

Types: Circuit, CircuitBuilder, Instruction, ClassicalCondition, Gate, BackendKind, RunOutcome, CountsResult, MarginalsResult, Probabilities, FactoredBlock, ShotsResult, PrismError, Result

Backends: StatevectorBackend, StabilizerBackend, SparseBackend, MpsBackend, ProductStateBackend, TensorNetworkBackend, FactoredBackend

Accumulators: ShotAccumulator, HistogramAccumulator, MarginalsAccumulator, PauliExpectationAccumulator, CorrelatorAccumulator, NullAccumulator, PackedShots, ShotLayout

Data types: CompiledSampler, CompiledDetectorSampler, DetectorSampleBatch, NoisyCompiledSampler, HomologicalSampler, ErrorChainComplex, NoiseModel, NoiseOp, QecProgram, QecOp, QecSampleResult, StabRankResult, SppResult, SpdResult, SparseParity, ParityStats, PauliVec, MultiFusedData, BatchPhaseData, McuData, Multi2qData

Benchmarks

Wall-clock simulation time for PRISM-Q on a fixed circuit suite built from the prism_q::circuits generators, pushed toward this machine's limits. Every number is reproducible with the command at the bottom of this page.

Setup

  • Date: 2026-05-29
  • CPU: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
  • Threads available: 8
  • PRISM-Q version: 0.16.0

Methodology

  • Metric: median wall-clock over repeated runs after warmup, lower is better.
  • Timed region is simulation only. Circuit construction happens once per circuit outside the timer.
  • Timings use the full simulate().run() path, including the fusion pass and probability extraction.
  • auto lets backend dispatch choose a specialized backend per circuit. The dense families (QFT, HEA, QV) are bounded by the statevector memory cap; GHZ is Clifford and runs into the thousands of qubits on the stabilizer backend. QV uses square depth, so its gate count grows with the qubit count and it is compute-bound earlier than the fixed-depth families.

GHZ

Qubitsauto
2437.9 us
2879.2 us
256825.8 us
10242.73 ms
409635.97 ms

QFT

Qubitsauto
16715.0 us
2025.34 ms
24639.72 ms
263.786 s
2817.727 s

HEA

Qubitsauto
163.49 ms
2063.81 ms
241.588 s
267.330 s
2833.303 s

QV

Qubitsauto
167.73 ms
20190.02 ms
245.811 s

Reproducing

cargo run --release --features parallel --example bench_suite

Times PRISM-Q on every circuit in the suite and rewrites this page.

Circuit Builders

Pre-built circuits for benchmarking and testing, in src/circuits.rs. Each returns a Circuit you can pass straight to simulate(&circuit) or any backend.

FunctionDescription
qft_circuit(n)Quantum Fourier Transform
random_circuit(n, depth, seed)Random gates at given depth
hardware_efficient_ansatz(n, layers, seed)HEA with Ry/Rz + CX
clifford_heavy_circuit(n, depth, seed)Random Clifford (adjacent CX)
clifford_random_pairs(n, depth, seed)Random Clifford (random pair CX)
ghz_circuit(n)GHZ state (H + CX chain)
qaoa_circuit(n, layers, seed)QAOA MaxCut
single_qubit_rotation_circuit(n, depth, seed)1q rotations only
clifford_t_circuit(n, depth, t_fraction, seed)Clifford+T with tunable T ratio
w_state_circuit(n)W state preparation
quantum_volume_circuit(n, depth, seed)Quantum volume (random SU(4))
cz_chain_circuit(n, depth, seed)CZ chains
phase_estimation_circuit(n)Quantum phase estimation
independent_bell_pairs(n_pairs)Independent Bell pairs
independent_random_blocks(blocks, size, depth, seed)Independent random blocks

Example

#![allow(unused)]
fn main() {
use prism_q::circuits::qft_circuit;
use prism_q::simulate;

let circuit = qft_circuit(10);
let result = simulate(&circuit).seed(42).run().unwrap();
}

For hand-built circuits, use the CircuitBuilder fluent API instead.

The complete generated API documentation lives on docs.rs.

Architecture Glossary

This glossary defines key terms used throughout the PRISM-Q architecture and documentation.

Terms

Backend A simulation strategy implementing the Backend trait (e.g., statevector, stabilizer, MPS). Backends are swappable and selected automatically or explicitly per circuit.

Clifford A class of quantum gates (H, S, CNOT, etc.) that can be simulated efficiently on stabilizer tableaus in O(n²) time.

Count A frequency histogram of measured bitstrings across multiple shots, returned by simulate(circuit).seed(seed).sample_counts(shots).

Fusion The pre-simulation optimization pipeline that merges, cancels, and reorders gates to reduce execution cost.

Marginal Per-qubit probabilities of measuring |1⟩, extracted without computing the full multi-qubit distribution.

MPS (Matrix Product State) A tensor-network backend that stores the quantum state as a chain of rank-3 tensors with bounded bond dimension, trading exactness for polynomial memory.

Noise model A collection of probabilistic error channels (e.g., depolarizing noise) applied during noisy multi-shot simulation.

OpenQASM An assembly-like language for describing quantum circuits. PRISM-Q parses a practical subset of OpenQASM 3.0.

Shot A single execution of a quantum circuit from initialization to measurement, producing one bitstring sample.

State vector A flat vector of 2^n complex amplitudes representing the full quantum state, used by the statevector backend.

Stabilizer A representation of a Clifford state as a bit-packed tableau of Pauli generators, enabling simulation of thousands of qubits.

Tensor network A backend that represents gates as tensors and defers contraction until measurement or probability extraction.