Performance and SIMD
Performance is the primary product requirement. This guide explains the mechanisms that make PRISM-Q fast and the knobs you can turn. The internals live in the architecture reference under Fusion Pipeline and Threading, SIMD, and Memory Layout.
The three levers
- Fusion collapses many small gate passes into fewer, larger ones before execution, reducing memory traffic over the statevector. It is qubit-count gated and zero-cost when it does not apply.
- Cache-resident tiling keeps batched gates (MultiFused, Multi2q) operating on L2/L3-sized tiles so repeated passes reuse hot data.
- SIMD vectorizes the inner complex-arithmetic loop with AVX2+FMA, FMA, and BMI2, with a scalar fallback on non-x86_64.
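To make the fusion lever concrete, here is a minimal sketch of the idea for two single-qubit gates on the same qubit: multiplying the two 2x2 matrices first means the statevector is traversed once instead of twice. The names `C64`, `Gate1q`, `fuse`, and `apply_1q` are hypothetical and written for illustration only; this is a plain scalar loop, not PRISM-Q's fusion pipeline or its SIMD kernels.

```rust
use std::ops::{Add, Mul};

/// Tiny complex type so the sketch needs no external crates.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct C64 {
    pub re: f64,
    pub im: f64,
}

impl C64 {
    pub const fn new(re: f64, im: f64) -> Self {
        Self { re, im }
    }
}

impl Add for C64 {
    type Output = Self;
    fn add(self, o: Self) -> Self {
        Self::new(self.re + o.re, self.im + o.im)
    }
}

impl Mul for C64 {
    type Output = Self;
    fn mul(self, o: Self) -> Self {
        Self::new(
            self.re * o.re - self.im * o.im,
            self.re * o.im + self.im * o.re,
        )
    }
}

/// Row-major 2x2 single-qubit gate (hypothetical representation).
pub type Gate1q = [[C64; 2]; 2];

/// Applying `first` and then `second` equals applying the product `second * first`.
pub fn fuse(first: &Gate1q, second: &Gate1q) -> Gate1q {
    let mut out = [[C64::new(0.0, 0.0); 2]; 2];
    for r in 0..2 {
        for c in 0..2 {
            out[r][c] = second[r][0] * first[0][c] + second[r][1] * first[1][c];
        }
    }
    out
}

/// One full pass over the statevector, applying `g` to qubit `target`.
/// Assumes `state.len()` is a power of two larger than `1 << target`.
pub fn apply_1q(state: &mut [C64], target: usize, g: &Gate1q) {
    let stride = 1usize << target;
    // Amplitudes pair up as (i, i + stride) inside blocks of 2 * stride.
    for base in (0..state.len()).step_by(2 * stride) {
        for i in base..base + stride {
            let (a0, a1) = (state[i], state[i + stride]);
            state[i] = g[0][0] * a0 + g[0][1] * a1;
            state[i + stride] = g[1][0] * a0 + g[1][1] * a1;
        }
    }
}
```

Calling `apply_1q` once with `fuse(&x, &z)` produces the same state as calling it with `x` and then with `z`, but reads and writes every amplitude only once, which is where the memory-traffic saving comes from.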
Threading
Rayon parallel kernels engage at ≥14 qubits (below that, thread-pool overhead
dominates), with MIN_PAR_ELEMS = 4096 per task. The pool defaults to all logical cores.
Set RAYON_NUM_THREADS to cap parallelism. Hyperthreading helps at 24+ qubits by hiding
memory latency, but on a contended host it adds noise to benchmarks.
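The thresholds above can be sketched as a small planning function. This is an illustration under stated assumptions, not PRISM-Q's code: `PAR_MIN_QUBITS`, `thread_budget`, and `plan_tasks` are hypothetical names, the split into at most one task per thread is a simplification (Rayon's work stealing may split differently), and only the values 14 and 4096 and the `RAYON_NUM_THREADS` variable come from the text.

```rust
/// Qubit count at which parallel kernels engage (value from the text, name assumed).
pub const PAR_MIN_QUBITS: u32 = 14;
/// Minimum number of amplitudes per parallel task (from the text).
pub const MIN_PAR_ELEMS: usize = 4096;

/// Thread budget: a positive integer in RAYON_NUM_THREADS caps the pool,
/// otherwise all logical cores are used. The value is passed in so the
/// function stays pure; a caller might supply
/// `std::env::var("RAYON_NUM_THREADS").ok().as_deref()`.
pub fn thread_budget(rayon_num_threads: Option<&str>, logical_cores: usize) -> usize {
    rayon_num_threads
        .and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(logical_cores)
        .max(1)
}

/// Number of tasks for one pass over a 2^num_qubits statevector.
/// A result of 1 means "stay on the calling thread".
pub fn plan_tasks(num_qubits: u32, threads: usize) -> usize {
    if num_qubits < PAR_MIN_QUBITS || threads <= 1 {
        return 1;
    }
    let len = 1usize << num_qubits;
    // Never hand a task fewer than MIN_PAR_ELEMS amplitudes.
    let max_tasks = len / MIN_PAR_ELEMS;
    max_tasks.min(threads).max(1)
}
```

Under these assumptions a 13-qubit circuit always runs serially, a 14-qubit statevector (16384 amplitudes) splits into at most 4 tasks, and a 20-qubit one can keep 8 threads busy.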
Determinism
Same circuit plus same seed yields the same result regardless of thread count. Parallel backends use deterministic work partitioning, so reproducibility never costs correctness.
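One common way to get thread-count-independent results, shown here as a sketch rather than as PRISM-Q's actual partitioning scheme, is to fix chunk boundaries from the data length alone and combine partial results in chunk order. The function `norm_squared`, the `CHUNK` size, and the `(f64, f64)` amplitude representation are all hypothetical; the example uses only `std::thread::scope` so it does not depend on Rayon.

```rust
use std::thread;

/// Fixed chunk size: boundaries depend only on the data length,
/// never on how many threads happen to be available.
const CHUNK: usize = 1024;

/// Sum of |amplitude|^2, bitwise identical for any `threads` value.
pub fn norm_squared(amps: &[(f64, f64)], threads: usize) -> f64 {
    let threads = threads.max(1);
    let n_chunks = (amps.len() + CHUNK - 1) / CHUNK;
    // One slot per chunk, so each partial sum has a fixed home.
    let mut partials = vec![0.0f64; n_chunks];
    // Each thread owns a contiguous run of slots (at least one).
    let per_thread = ((n_chunks + threads - 1) / threads).max(1);

    thread::scope(|s| {
        for (t, slots) in partials.chunks_mut(per_thread).enumerate() {
            let first_chunk = t * per_thread;
            s.spawn(move || {
                for (k, slot) in slots.iter_mut().enumerate() {
                    let start = (first_chunk + k) * CHUNK;
                    let end = (start + CHUNK).min(amps.len());
                    // Sequential sum inside a chunk: same order every run.
                    *slot = amps[start..end]
                        .iter()
                        .map(|&(re, im)| re * re + im * im)
                        .sum();
                }
            });
        }
    });

    // Combine partials in chunk order on a single thread.
    partials.iter().sum()
}
```

Because floating-point addition is not associative, letting the thread count decide where the sums are split would change the low bits of the result. Here the thread count only decides which thread computes which chunk, so the arithmetic order is unchanged.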
Tuning environment variables
| Variable | Effect |
|---|---|
| PRISM_MAX_SV_QUBITS | Override the statevector memory cap |
| RAYON_NUM_THREADS | Cap Rayon thread count |
| PRISM_NO_AVX2_2Q | Force the 128-bit FMA 2q kernel (A/B comparison) |
| PRISM_NO_REORDER | Disable disjoint Fused2q tier grouping |
| PRISM_GPU_MIN_QUBITS | GPU crossover qubit count (with the gpu feature) |
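
Before raising PRISM_MAX_SV_QUBITS it helps to estimate what a given qubit count costs in memory. The sketch below assumes each amplitude is a pair of `f64` values (16 bytes), which this guide does not confirm, and that the variable holds a plain integer qubit count. The helpers `statevector_bytes` and `effective_max_qubits` are hypothetical and do not mirror how PRISM-Q parses the variable.

```rust
/// Bytes for a dense statevector of `num_qubits` qubits, assuming
/// 16 bytes per complex amplitude (assumption: two f64 values).
/// Returns None if the size does not fit in a u64.
pub fn statevector_bytes(num_qubits: u32) -> Option<u64> {
    1u64.checked_shl(num_qubits)?.checked_mul(16)
}

/// Hypothetical override logic: use the variable's value if it parses
/// as an integer, otherwise fall back to `default_cap`.
/// A caller might pass `std::env::var("PRISM_MAX_SV_QUBITS").ok().as_deref()`.
pub fn effective_max_qubits(override_value: Option<&str>, default_cap: u32) -> u32 {
    override_value
        .and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(default_cap)
}
```

Under the 16-byte assumption every extra qubit doubles the footprint: 30 qubits need 16 GiB and 34 qubits need 256 GiB, so check available RAM before overriding the cap.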
Benchmarking
Always run benchmarks with --features parallel. The baselines were taken with Rayon
enabled; without it, large circuits run single-threaded and are not comparable. Never run
two cargo bench processes at once: competing Rayon pools cause large swings.
cargo bench --bench circuits --features parallel # circuit macrobenchmarks
cargo bench --bench bench_driver --features parallel # gate microbenchmarks
For current wall-clock numbers across the circuit suite, see the Benchmarks page.