GPU Backend
The GPU backend is optional and gated behind the gpu feature. It requires the CUDA
toolkit (12.x or newer) and a CUDA-capable device.
cargo build --release --features "parallel gpu"
cargo test --features "parallel gpu" --test golden_gpu
CUDA acceleration covers statevector execution, stabilizer execution, and compiled BTS sampling. Five entry points are available:
- BackendKind::StatevectorGpu { context }. Public dispatch path for statevector GPU execution. It routes through simulate(circuit).backend(kind).seed(seed).run(), keeps fusion and subsystem decomposition, and uses crate::gpu::min_qubits() (default 14, PRISM_GPU_MIN_QUBITS override) to keep small sub-circuits on CPU.
- BackendKind::StabilizerGpu { context }. Public dispatch path for stabilizer GPU execution. Gate application uses a device tableau and one batched Clifford kernel (stab_apply_batch). Measurement and reset keep pivot search, row cascade, phase fixup, and deterministic outcomes on the device. The default crossover stays conservative (STABILIZER_MIN_QUBITS_DEFAULT = 100_000, PRISM_STABILIZER_GPU_MIN_QUBITS override) until benchmarks justify lowering it. Direct backend benchmarks should use StabilizerBackend::with_gpu(ctx) to exclude diagnostic readbacks from probabilities(), export_tableau(), and export_statevector(). Golden tests cover every kernel path, including 500q GHZ measure-all.
- StatevectorBackend::new(seed).with_gpu(ctx). Direct statevector GPU opt-in. Every instruction routes to CUDA after the context is attached. No crossover or subsystem decomposition applies.
- StabilizerBackend::new(seed).with_gpu(ctx). Direct stabilizer GPU opt-in for kernel benchmarks and targeted correctness tests.
- run_shots_compiled_with_gpu (or CompiledSampler::with_gpu(ctx)). GPU BTS sampling for flat sparse parity. The path launches one kernel per 65_536-shot chunk, uses random bits generated on the host, and preserves the CPU sample_bts_meas_major layout. The sampler caches sparse parity CSR arrays, packed reference bits, and reusable scratch on the device. It is active only when num_shots >= BTS_MIN_SHOTS_DEFAULT (131_072 by default, PRISM_GPU_BTS_MIN_SHOTS override). sample_bulk_packed_device returns a DevicePackedShots handle. Marginals reduce to one counter per measurement row on the device. Exact counts use a bounded device hash reduction for up to 8 packed measurement words when the compact result is cheaper to transfer than the full shot matrix. Otherwise the API uses a host copy for correctness.
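The crossover thresholds and the BTS chunking described above can be sketched as follows. This is a minimal illustration, not the crate's code: the helper names (threshold_from, threshold_from_env, use_gpu_statevector, bts_launches) and the constant names GPU_MIN_QUBITS_DEFAULT and BTS_CHUNK_SHOTS are hypothetical, and the fallback to the default on an unparsable override and the inclusive comparison at the statevector crossover are assumptions.

```rust
use std::env;

/// Defaults quoted in the text. The first and last constant names are hypothetical.
const GPU_MIN_QUBITS_DEFAULT: usize = 14;
const BTS_MIN_SHOTS_DEFAULT: usize = 131_072;
const BTS_CHUNK_SHOTS: usize = 65_536;

/// Parse an optional override string. Falls back to `default` when the value
/// is missing or is not an unsigned integer (assumed policy).
fn threshold_from(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

/// Read a threshold from an environment variable such as PRISM_GPU_MIN_QUBITS.
fn threshold_from_env(var: &str, default: usize) -> usize {
    threshold_from(env::var(var).ok().as_deref(), default)
}

/// Sub-circuits below the crossover stay on the CPU.
/// Treating the boundary as inclusive is an assumption.
fn use_gpu_statevector(num_qubits: usize, min_qubits: usize) -> bool {
    num_qubits >= min_qubits
}

/// Number of kernel launches for `num_shots` (one per chunk, rounded up),
/// or `None` when the shot count is below the activation threshold.
fn bts_launches(num_shots: usize, min_shots: usize) -> Option<usize> {
    if num_shots < min_shots {
        return None;
    }
    Some((num_shots + BTS_CHUNK_SHOTS - 1) / BTS_CHUNK_SHOTS)
}
```

Under these assumptions, threshold_from_env("PRISM_GPU_MIN_QUBITS", GPU_MIN_QUBITS_DEFAULT) yields 14 when the variable is unset, and a 200_000-shot request at the default threshold maps to four chunked launches.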
When a GPU context is attached, Backend::init allocates state on the device instead
of a host Vec<Complex64> and every instruction routes to a CUDA kernel.
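Because the whole state lives on the device once a context is attached, device memory bounds the circuit size. The sketch below estimates the footprint of a dense statevector, using only the fact that a Complex64 is two f64 values (16 bytes). The function names are hypothetical and the estimate ignores any scratch buffers the backend may allocate.

```rust
/// A Complex64 amplitude is two f64 values.
const BYTES_PER_AMPLITUDE: u64 = 16;

/// Bytes needed for a dense statevector of `num_qubits` qubits
/// (2^n amplitudes), or `None` if the size overflows u64.
fn statevector_bytes(num_qubits: u32) -> Option<u64> {
    1u64.checked_shl(num_qubits)?
        .checked_mul(BYTES_PER_AMPLITUDE)
}

/// Hypothetical pre-flight check: does the state alone fit in `free_bytes`?
fn fits_on_device(num_qubits: u32, free_bytes: u64) -> bool {
    matches!(statevector_bytes(num_qubits), Some(b) if b <= free_bytes)
}
```

At the default crossover of 14 qubits the state is 256 KiB; at 30 qubits it is 16 GiB, before counting any temporary buffers.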
Module layout (src/gpu/)
| File | Role |
|---|---|
| mod.rs | GpuContext, GpuState public entry points |
| device.rs | GpuDevice: cudarc wrapper, compiles PTX at device construction |
| memory.rs | GpuBuffer: device Complex64 storage |
| kernels/mod.rs | KERNEL_NAMES, composed kernel_source() concatenating dense + stabilizer |
| kernels/dense.rs | PTX source and Rust launcher for every Gate variant |
| kernels/stabilizer.rs | PTX source and launchers for tableau init, 11 Clifford gates, rowmul_words |
Kernel coverage
Every variant in the Gate enum has a dedicated kernel. Batched
variants (BatchPhase, BatchRzz, DiagonalBatch, MultiFused { all_diagonal: true })
use LUT kernels that consume the same host table builders as the CPU path.
Non-diagonal MultiFused uses a shared-memory tiled kernel (apply_multi_fused_tiled,
TILE_Q = 10, TILE_SIZE = 1024). Sub-gates whose target bit is inside the tile apply in
shared memory. Sub-gates whose target bit is outside the tile fall back to per-gate
launches. Multi2q still launches once per sub-gate, but this variant is rare in practice.
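The in-tile versus fallback split can be illustrated with a small classifier. This is a sketch under one stated assumption: the tile is taken to span the low TILE_Q qubit bits, so a sub-gate is in-tile when its target index is below TILE_Q. The SubGate type and split_by_tile function are hypothetical, and the sketch only classifies sub-gates; it does not model how the two groups are ordered or scheduled.

```rust
/// Tile parameters quoted in the text.
const TILE_Q: usize = 10;
const TILE_SIZE: usize = 1 << TILE_Q; // 1024 amplitudes per tile

/// Hypothetical stand-in for a fused sub-gate; only the target bit matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SubGate {
    target: usize,
}

/// Split sub-gates into (in_tile, fallback), preserving order within each group.
/// Assumes the tile covers qubit bits 0..TILE_Q.
fn split_by_tile(gates: &[SubGate]) -> (Vec<SubGate>, Vec<SubGate>) {
    gates.iter().copied().partition(|g| g.target < TILE_Q)
}
```

With targets 0, 9, 10, 15, 3 the first group holds 0, 9, 3 (shared-memory candidates) and the second holds 10, 15 (per-gate launches).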
PTX template substitution: the CUDA C source is held as a template string
(KERNEL_SOURCE_TEMPLATE) with placeholders such as {{BP_TABLE_SIZE}} and
{{TILE_Q}}. The kernel_source() function substitutes them at device construction from
the Rust constants in src/backend/statevector/kernels.rs, keeping CPU and GPU in sync.
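The substitution step can be sketched with plain string replacement. The template below is a tiny stand-in, not the real KERNEL_SOURCE_TEMPLATE, and the BP_TABLE_SIZE value is a placeholder chosen for illustration because the text does not state it. The has_unresolved_placeholders helper is hypothetical.

```rust
/// Value quoted in the text.
const TILE_Q: usize = 10;
/// Placeholder value for illustration only; the real constant is not given in the text.
const BP_TABLE_SIZE: usize = 256;

/// Tiny stand-in for the kernel template; the real template is not shown here.
const KERNEL_SOURCE_TEMPLATE: &str = "\
#define BP_TABLE_SIZE {{BP_TABLE_SIZE}}
#define TILE_Q {{TILE_Q}}
#define TILE_SIZE (1 << TILE_Q)
";

/// Substitute the Rust constants into the template so the CUDA source
/// and the CPU path share one set of values.
fn kernel_source() -> String {
    KERNEL_SOURCE_TEMPLATE
        .replace("{{BP_TABLE_SIZE}}", &BP_TABLE_SIZE.to_string())
        .replace("{{TILE_Q}}", &TILE_Q.to_string())
}

/// Hypothetical guard: a leftover `{{` means a placeholder was missed.
fn has_unresolved_placeholders(source: &str) -> bool {
    source.contains("{{")
}
```

Checking for leftover placeholders before compilation turns a forgotten constant into an immediate, readable failure instead of a compiler error inside the generated source.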
Correctness
tests/golden_gpu.rs drives 20 cross-checks comparing GPU amplitudes
against the CPU statevector within 1e-10. The suite covers every gate variant, the fusion
paths, and the BackendKind::StatevectorGpu public dispatch path at the crossover boundary.
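A cross-check of this kind can be sketched as a tolerance comparison between two amplitude vectors. The sketch assumes the 1e-10 bound is applied to the absolute difference of each amplitude pair, which the text does not spell out. Amplitudes are modelled as (re, im) tuples to keep the example free of external crates, and the function names are hypothetical.

```rust
/// Tolerance quoted in the text.
const TOLERANCE: f64 = 1e-10;

/// (re, im) pair standing in for Complex64.
type Amp = (f64, f64);

/// Largest |a - b| over all amplitude pairs, or `None` on length mismatch.
/// A NaN difference propagates to the result instead of being dropped.
fn max_abs_diff(cpu: &[Amp], gpu: &[Amp]) -> Option<f64> {
    if cpu.len() != gpu.len() {
        return None;
    }
    let worst = cpu
        .iter()
        .zip(gpu)
        .map(|(a, b)| (a.0 - b.0).hypot(a.1 - b.1))
        .fold(0.0, |m: f64, d: f64| if d.is_nan() || d > m { d } else { m });
    Some(worst)
}

/// True when every GPU amplitude is within TOLERANCE of the CPU reference.
fn amplitudes_match(cpu: &[Amp], gpu: &[Amp]) -> bool {
    matches!(max_abs_diff(cpu, gpu), Some(d) if d <= TOLERANCE)
}
```

Reporting the worst-case difference rather than a bare boolean makes a failing cross-check easier to diagnose.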
Current limits
- BackendKind::Auto does not yet dispatch to GPU. The StatevectorGpu variant is the explicit opt-in until Auto can receive a default GPU context.
- Several fused launchers (launch_apply_fused_2q, launch_apply_mcu, launch_apply_mcu_phase, launch_apply_batch_phase, launch_apply_batch_rzz, launch_apply_diagonal_batch, launch_apply_multi_fused_nondiag) upload small metadata tables per dispatch via clone_htod. This is the next launch overhead target.
- measure_prob_one allocates a device partials buffer and reduces on the host every call. Matters for measurement-heavy circuits.
- Kernel design and crossover analysis live in the module docstrings on src/gpu/kernels/dense.rs.
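The two-stage reduction behind the measure_prob_one limit can be emulated on the CPU to show where the per-call cost comes from. This is a sketch, not the kernel: each block of amplitudes produces one partial sum (standing in for the device partials buffer), and the final sum happens on the host. It assumes bit `qubit` of the amplitude index encodes that qubit's value; the function names and the block size are hypothetical.

```rust
/// (re, im) pair standing in for Complex64.
type Amp = (f64, f64);

/// One partial sum of |amp|^2 per block, counting only indices where
/// the target qubit's bit is set. Emulates the per-block device stage.
fn block_partials(state: &[Amp], qubit: usize, block_size: usize) -> Vec<f64> {
    assert!(block_size > 0, "block_size must be positive");
    let mask = 1usize << qubit;
    state
        .chunks(block_size)
        .enumerate()
        .map(|(b, chunk)| {
            chunk
                .iter()
                .enumerate()
                .filter(|(i, _)| ((b * block_size + *i) & mask) != 0)
                .map(|(_, a)| a.0 * a.0 + a.1 * a.1)
                .sum::<f64>()
        })
        .collect()
}

/// Host stage: a fresh partials buffer is built and summed on every call,
/// which is the cost the limit above refers to.
fn prob_one_host_reduce(state: &[Amp], qubit: usize, block_size: usize) -> f64 {
    block_partials(state, qubit, block_size).iter().sum()
}
```

For a two-qubit Bell state with a block size of 2, the partials buffer has two entries and the reduced probability of measuring one on either qubit is about 0.5.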