Threading, SIMD, and Memory Layout

For which SIMD tiers and architectures each backend supports, see the Capability and Support Matrix.

Memory layout

Backend | State representation | Memory | Access pattern
--- | --- | --- | ---
Statevector | Vec<Complex64> | 2^n amplitudes | Strided pair iteration
Stabilizer | Bit-packed Vec<u64> tableau | bytes | Sequential row iteration
Sparse | HashMap<usize, Complex64> | one entry per nonzero amplitude | Hash-based random access
MPS | Chain of rank-3 tensors | | Sequential site access
Product | Vec<[Complex64; 2]> | | Per-qubit independent
Tensor Network | Network of dense tensors | | Contraction-order dependent
Factored | Vec<Option<SubState>> | worst case | Dispatch per substate
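
To make "strided pair iteration" concrete, here is a minimal sketch of a single-qubit gate applied to a dense state. It is not the library's kernel: the `C64` type is a stand-in for Complex64 so the example needs no external crates, and `apply_1q_strided` is a hypothetical helper name.

```rust
// Minimal stand-in for Complex64 so the sketch has no dependencies.
#[derive(Clone, Copy, Debug, PartialEq)]
struct C64 {
    re: f64,
    im: f64,
}

impl C64 {
    fn mul(self, o: C64) -> C64 {
        C64 {
            re: self.re * o.re - self.im * o.im,
            im: self.re * o.im + self.im * o.re,
        }
    }
    fn add(self, o: C64) -> C64 {
        C64 { re: self.re + o.re, im: self.im + o.im }
    }
}

/// Apply a row-major 2x2 matrix `m` to qubit `target` of a dense state.
/// Hypothetical helper for illustration only.
fn apply_1q_strided(state: &mut [C64], target: usize, m: [C64; 4]) {
    let stride = 1usize << target;
    assert!(state.len().is_power_of_two() && state.len() >= 2 * stride);
    // Each block of 2 * stride amplitudes holds `stride` independent pairs:
    // element i of the lower half pairs with element i of the upper half.
    for block in state.chunks_mut(2 * stride) {
        let (lo, hi) = block.split_at_mut(stride);
        for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
            let (x, y) = (*a, *b);
            *a = m[0].mul(x).add(m[1].mul(y));
            *b = m[2].mul(x).add(m[3].mul(y));
        }
    }
}
```

The two halves of each block are `stride` elements apart, which is why low target qubits touch neighbouring memory and high target qubits touch widely separated addresses.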

Threading

Gate kernels have _par variants using par_chunks_mut for safe Rayon parallelism (behind the parallel feature flag):

  • <14 qubits: Single-threaded. Thread-pool overhead would exceed the computation time.
  • ≥14 qubits: Rayon parallel iterators with MIN_PAR_ELEMS = 4096 (4096 amplitudes × 16 bytes per Complex64 = 64 KB per task).
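
The threshold and chunk size above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses `std::thread::scope` and `chunks_mut` in place of Rayon's `par_chunks_mut` so it compiles without dependencies, it spawns one thread per chunk where Rayon would schedule chunks onto a pool, and `PAR_THRESHOLD_QUBITS`, `apply_z`, and `apply_z_chunk` are hypothetical names. Amplitudes are plain `(re, im)` tuples.

```rust
use std::thread;

const PAR_THRESHOLD_QUBITS: usize = 14; // hypothetical name for the 14-qubit cutoff
const MIN_PAR_ELEMS: usize = 4096; // 4096 amplitudes * 16 bytes = 64 KB per task

/// Z gate on one chunk: negate amplitudes whose `target` bit is set.
/// `base` is the global index of the chunk's first element.
fn apply_z_chunk(chunk: &mut [(f64, f64)], base: usize, target: usize) {
    for (i, amp) in chunk.iter_mut().enumerate() {
        if ((base + i) >> target) & 1 == 1 {
            amp.0 = -amp.0;
            amp.1 = -amp.1;
        }
    }
}

/// Single-threaded below the threshold, chunked across threads at or above it.
fn apply_z(state: &mut [(f64, f64)], target: usize) {
    let n_qubits = state.len().trailing_zeros() as usize;
    if n_qubits < PAR_THRESHOLD_QUBITS {
        apply_z_chunk(state, 0, target);
        return;
    }
    // Disjoint mutable chunks make the parallel writes safe without locks.
    thread::scope(|s| {
        for (c, chunk) in state.chunks_mut(MIN_PAR_ELEMS).enumerate() {
            s.spawn(move || apply_z_chunk(chunk, c * MIN_PAR_ELEMS, target));
        }
    });
}
```

A diagonal gate is used here because each amplitude can be updated in isolation; a non-diagonal gate would additionally need each chunk to contain both halves of every amplitude pair.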

The thread pool defaults to all logical cores; hyper-threading helps at 24 qubits and above by hiding memory latency. Override the thread count with the RAYON_NUM_THREADS environment variable.

SIMD

A Complex64 (two f64 values) fills one 128-bit SIMD register exactly. Single-qubit gate kernels use PreparedGate1q with runtime CPU detection and tiered dispatch:

  1. AVX2+FMA (256-bit): 2 complex pairs per iteration. Gated by MAX_AVX2_STATE for full-state passes (Skylake frequency throttling), but used freely within MultiFused L2 tiles where data is cache-resident.
  2. FMA (128-bit): Default for larger states. 3-op complex multiply (permute + mul + fmaddsub).
  3. BMI2: _pext_u64 for BatchPhase, BatchRzz, and DiagonalBatch LUT indexing. One BMI2 bit extraction replaces loops with repeated shifts and ORs.
  4. Scalar fallback: No intrinsics. All SIMD functions have a #[cfg(not(target_arch = "x86_64"))] fallback.
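
The tier selection described above might look roughly like the sketch below. The `SimdTier` enum, `detect_tier`, `choose_full_state_tier`, and `pext_scalar` are hypothetical names, and `max_avx2_len` is a parameter standing in for MAX_AVX2_STATE, whose actual value is not stated here. The only real API used is the standard library's `is_x86_feature_detected!` macro.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimdTier {
    Avx2Fma,
    Fma128,
    Scalar,
}

/// Runtime CPU detection on x86_64.
#[cfg(target_arch = "x86_64")]
fn detect_tier() -> SimdTier {
    let fma = is_x86_feature_detected!("fma");
    if fma && is_x86_feature_detected!("avx2") {
        SimdTier::Avx2Fma
    } else if fma {
        SimdTier::Fma128
    } else {
        SimdTier::Scalar
    }
}

/// Every other architecture takes the scalar path.
#[cfg(not(target_arch = "x86_64"))]
fn detect_tier() -> SimdTier {
    SimdTier::Scalar
}

/// For full-state passes, step down from AVX2 to 128-bit FMA once the state
/// exceeds the size guard. `max_avx2_len` is an assumed stand-in parameter.
fn choose_full_state_tier(detected: SimdTier, state_len: usize, max_avx2_len: usize) -> SimdTier {
    if detected == SimdTier::Avx2Fma && state_len > max_avx2_len {
        SimdTier::Fma128
    } else {
        detected
    }
}

/// Scalar reference for what one `_pext_u64` does: gather the bits of `value`
/// selected by `mask` into the low bits of the result.
fn pext_scalar(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if value & lowest != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        mask &= mask - 1; // clear that mask bit
    }
    out
}
```

`pext_scalar` is the kind of shift-and-OR loop that a single BMI2 instruction replaces when building lookup-table indices from scattered qubit bits.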

Two key SIMD structs hoist the matrix broadcast to construction time, so the per-element loop does no dispatch or matrix setup:

  • PreparedGate1q: Broadcasts 2×2 matrix into SIMD registers. Methods: apply_full_sequential (full state), apply_tiled (cache-resident tile, no AVX2 throttle guard), apply_slice_pairs (MPS bond-dimension slices), apply_pair_ptr (Cu/Mcu parallel).
  • PreparedGate2q: Broadcasts 4×4 matrix. Methods: apply_full (mask-based iteration), apply_tiled (cache-resident Multi2q tiles, AVX2 paired-group kernel when available), apply_group_ptr (4 scattered indices).
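
The idea of preparing a gate once and reusing it across many pairs can be shown with a scalar sketch. `Prepared1qSketch` is a hypothetical type, not the library's PreparedGate1q: it splits the matrix into real and imaginary arrays at construction instead of broadcasting into SIMD registers, and the method signatures are assumptions.

```rust
/// Scalar stand-in for a prepared single-qubit gate.
struct Prepared1qSketch {
    re: [f64; 4], // real parts of the row-major 2x2 matrix
    im: [f64; 4], // imaginary parts
}

impl Prepared1qSketch {
    /// All matrix unpacking happens once, here.
    fn new(m: [(f64, f64); 4]) -> Self {
        Self { re: m.map(|c| c.0), im: m.map(|c| c.1) }
    }

    /// Hot path: only multiplies and adds, no setup or branching.
    #[inline]
    fn apply_pair(&self, a: &mut (f64, f64), b: &mut (f64, f64)) {
        let (x, y) = (*a, *b);
        a.0 = self.re[0] * x.0 - self.im[0] * x.1 + self.re[1] * y.0 - self.im[1] * y.1;
        a.1 = self.re[0] * x.1 + self.im[0] * x.0 + self.re[1] * y.1 + self.im[1] * y.0;
        b.0 = self.re[2] * x.0 - self.im[2] * x.1 + self.re[3] * y.0 - self.im[3] * y.1;
        b.1 = self.re[2] * x.1 + self.im[2] * x.0 + self.re[3] * y.1 + self.im[3] * y.0;
    }

    /// Apply the gate to matching elements of two slices.
    fn apply_slice_pairs(&self, lo: &mut [(f64, f64)], hi: &mut [(f64, f64)]) {
        for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
            self.apply_pair(a, b);
        }
    }
}
```

Constructing the prepared value once per gate and calling the apply methods over the whole state or tile keeps the inner loop free of per-element decisions.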

The 2q tiled AVX2 path processes paired k and k + 1 groups when the lower target qubit is above 0, which makes each row load contiguous. It falls back to the 128-bit FMA kernel for lo == 0 and when AVX2+FMA is unavailable. Set PRISM_NO_AVX2_2Q to compare against the 128-bit FMA path, or PRISM_NO_REORDER to disable disjoint Fused2q tier grouping for A/B timing.
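
Why pairing groups k and k + 1 gives contiguous loads only when the lower target qubit is above 0 can be checked with a small index sketch. The helper names are hypothetical, and the group numbering (insert a zero bit at each target position) is a common convention assumed here, not confirmed as the library's own.

```rust
/// Insert a zero bit at position `pos`, shifting higher bits up by one.
fn insert_zero_bit(x: usize, pos: usize) -> usize {
    let low = x & ((1usize << pos) - 1);
    let high = x >> pos;
    (high << (pos + 1)) | low
}

/// Index of the k-th group's first amplitude (both target bits clear).
fn group_base(k: usize, lo: usize, hi: usize) -> usize {
    debug_assert!(lo < hi);
    insert_zero_bit(insert_zero_bit(k, lo), hi)
}

/// The four amplitude indices a 2-qubit gate mixes for group k,
/// ordered as target bits 00, 01, 10, 11 (hi bit first).
fn group_indices(k: usize, lo: usize, hi: usize) -> [usize; 4] {
    let b = group_base(k, lo, hi);
    [b, b | (1 << lo), b | (1 << hi), b | (1 << lo) | (1 << hi)]
}
```

With lo = 1 and hi = 2, groups 0 and 1 are [0, 2, 4, 6] and [1, 3, 5, 7], so each row of the pair sits in adjacent memory (0 and 1, 2 and 3, and so on) and fits one 256-bit load. With lo = 0 and hi = 1, groups 0 and 1 are [0, 1, 2, 3] and [4, 5, 6, 7], so matching rows are four elements apart and the pairing no longer helps.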

Determinism

Same circuit + same seed = same result, regardless of thread count. Parallel backends use deterministic work partitioning.
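
One way deterministic work partitioning can be achieved is to fix the chunk boundaries independently of the thread count and combine partial results in index order. The sketch below illustrates that idea for a floating-point reduction; it is an assumption about the general technique, not a description of the library's internals, and `CHUNK` and `norm_sqr_deterministic` are hypothetical names. It uses `std::thread::scope` instead of Rayon to stay dependency-free.

```rust
use std::thread;

const CHUNK: usize = 1024; // fixed partition size, independent of thread count

/// Sum of |amplitude|^2, bitwise identical for any `n_threads`.
fn norm_sqr_deterministic(state: &[(f64, f64)], n_threads: usize) -> f64 {
    let chunks: Vec<&[(f64, f64)]> = state.chunks(CHUNK).collect();
    let mut partials = vec![0.0f64; chunks.len()];
    let workers = n_threads.max(1);
    let per_thread = ((chunks.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Threads only decide who computes which chunk, never how a chunk
        // is summed, so each partial is the same regardless of scheduling.
        for (cs, ps) in chunks.chunks(per_thread).zip(partials.chunks_mut(per_thread)) {
            s.spawn(move || {
                for (c, p) in cs.iter().zip(ps.iter_mut()) {
                    *p = c.iter().map(|a| a.0 * a.0 + a.1 * a.1).sum::<f64>();
                }
            });
        }
    });
    // Final combine happens sequentially, in chunk order.
    partials.iter().sum()
}
```

Because floating-point addition is not associative, letting the thread count change the summation order would change the low bits of the result; fixing the order removes that source of variation.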