Host RSS only counts pages in CPU RAM, so GPU paths
appear flat on the host-memory panels above — the
algorithmic work is in HBM. NVML reports per-device peak memory used, measured
from outside the kernel, so it captures r_cuda's compiled .so,
python_cuda_native (ctypes), and python_cupy uniformly. These figures are only
available for cells run after the NVML wiring landed; older GPU runs show as gaps.
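A minimal sketch of how a peak-used figure can be collected from outside the kernel with NVML (via the pynvml bindings). The polling interval, device index, and class name are placeholders; the benchmark's actual wiring may differ.

```python
# Sketch: sample NVML "used" memory in a background thread and keep the maximum.
# Assumes pynvml is installed and device 0 is the benchmark GPU.
import threading
import time
import pynvml

class PeakHBMSampler:
    def __init__(self, device_index=0, interval_s=0.05):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.interval_s = interval_s
        # Pre-run baseline: CUDA context, cached libraries, anything already resident.
        self.baseline = pynvml.nvmlDeviceGetMemoryInfo(self.handle).used
        self.peak = self.baseline
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        while not self._stop.is_set():
            used = pynvml.nvmlDeviceGetMemoryInfo(self.handle).used
            self.peak = max(self.peak, used)
            time.sleep(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    @property
    def peak_delta_bytes(self):
        # Peak over the run minus the pre-run baseline.
        return self.peak - self.baseline
```

Used as a context manager around a benchmark cell (`with PeakHBMSampler() as s: run_cell()`), the sampler sees the same device-level allocations regardless of whether the kernel came from a compiled .so, ctypes, or CuPy.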
Matrix shape held constant; only the fraction of non-zero entries in Y varies (1% → 100%).
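For reference, one way such a sweep input could be built with SciPy, holding the shape fixed while only the density varies. The shape, seed, and generator below are illustrative placeholders, not the benchmark's actual values.

```python
# Sketch: fixed-shape Y at each density tier; shape and seed are illustrative only.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, m = 2_000, 500                      # placeholder shape, held constant across tiers
tiers = [0.01, 0.05, 0.10, 0.30, 0.50, 1.00]

Ys = {d: sp.random(n, m, density=d, format="csc", random_state=rng) for d in tiers}
for d, Y in Ys.items():
    print(f"density={d:>4.0%}  nnz={Y.nnz:,}")
```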
Multi-thread (.mt, ncores=8) vs single-thread (.st) by family.
Six algorithmically distinct CPU entry points are displayed. The two dispatchers
(SecAct.inference and the deprecated
SecAct.inference.gsl.new) route to Tcol.mt on
Linux and reproduce its curve by construction; numerical equivalence is
verified at the double-precision noise floor (worst residual
3.1×10⁻¹⁵ on the z-score) via
tests/test_ridger_strategies_equivalence.R.
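The check itself lives in tests/test_ridger_strategies_equivalence.R; the Python below is only an illustration of what "worst residual per output" means, with hypothetical result dictionaries standing in for two strategies' outputs.

```python
# Sketch: worst-case absolute residual per output between two strategies.
# `res_a` and `res_b` are hypothetical dicts of NumPy arrays keyed by output name.
import numpy as np

def worst_residuals(res_a, res_b, outputs=("beta", "se", "zscore", "pvalue")):
    return {k: float(np.max(np.abs(res_a[k] - res_b[k]))) for k in outputs}

# A residual around 1e-15 on O(1) values is the double-precision noise floor:
# what "equivalent by construction" looks like once float sums are reordered.
```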
Both Tcol.mt and Yrow.mt use the same
cooperative-per-strip parallelism; with parallelism held equal,
the cheaper permutation operand wins: permuting T (p·n ≈ 80 MB) beats
permuting Y (n·m ≈ 1.6 GB at m=200k), giving
Tcol.mt the 23% edge visible in the panel.
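The quoted operand sizes follow directly from the matrix dimensions at double precision. A back-of-the-envelope check, with dimensions chosen to reproduce the stated sizes (only m=200k is given explicitly in the text):

```python
# Sketch: bytes the permutation step has to rewrite per iteration, at float64.
# Dims are chosen to match the ~80 MB and ~1.6 GB figures quoted above.
p, n, m = 10_000, 1_000, 200_000

mb = lambda elems: elems * 8 / 1e6
print(f"T (p x n): {mb(p * n):,.0f} MB")   # ~80 MB: cheap to permute every iteration
print(f"Y (n x m): {mb(n * m):,.0f} MB")   # ~1,600 MB: expensive to permute every iteration
```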
Four curves at six density tiers (1%, 5%, 10%, 30%, 50%, 100%):
Tcol.gpu (compiled kernel) via the production
python_cuda_native sparse path,
Tcol.gpu (CuPy) via
cupyx.scipy.sparse with cp.take(T, π−1)
per iteration, Yrow.gpu (compiled kernel) via
ridge_cuda_sparse_yrow, and
Yrow.gpu (CuPy) via row-index relabel on the
resident CSC.
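A sketch of what one iteration on a CuPy Tcol-style path could look like. The operand orientation, the meaning of the index vector (inverse permutation vs a re-based copy of the draw), and the exact SpMM call are implementation details not restated in this section, so everything below is illustrative only.

```python
# Sketch: one permutation draw on a CuPy Tcol-style path (illustrative only).
import cupy as cp

def tcol_iteration(T, Y_csc, perm_idx):
    """T: dense (p, n) cupy array on device; Y_csc: cupyx.scipy.sparse (n, m),
    resident on device; perm_idx: (n,) 0-based integer cupy array for this draw."""
    # Cheap per-iteration rebuild: gather T's columns in the permuted order.
    T_perm = cp.take(T, perm_idx, axis=1)        # dense (p, n)
    # Sparse-dense product; with the sparse operand on the left this goes
    # through cusparseSpMM under the hood.
    return Y_csc.T.dot(T_perm.T).T               # dense (p, m) statistic for this draw
```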
Runtime panel: Yrow.gpu shows a flat curve across
density because per-iteration overhead (cusparseXcsrsort
to restore within-column sort after the row-index relabel, plus
cuSPARSE descriptor create/destroy) is independent of nnz and
dominates the density-proportional cusparseSpMM compute
at nrand=8,000. At low density the SpMM itself is <1 ms/iter, so the
~100 ms/iter of upstream overhead, repeated over 8,000 iterations, dominates total runtime. Tcol.gpu
scales with nnz because its inner SpMM is the entire iteration cost
(the T_perm rebuild kernel is <1 ms).
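The shape of the two curves falls out of a simple per-iteration cost model, t_iter = fixed_overhead + c·nnz, summed over nrand iterations. The constants below are illustrative stand-ins, not measurements:

```python
# Sketch: flat vs nnz-proportional runtime under t_iter = overhead + c * nnz.
nrand = 8_000
c_spmm_s_per_nnz = 5e-9                 # placeholder per-nonzero SpMM cost

def total_runtime_s(nnz, overhead_s):
    return nrand * (overhead_s + c_spmm_s_per_nnz * nnz)

for density in (0.01, 0.10, 1.00):
    nnz = int(density * 200_000_000)    # nnz of Y at the placeholder dims used earlier
    yrow_like = total_runtime_s(nnz, overhead_s=0.100)   # ~100 ms csrsort + descriptors
    tcol_like = total_runtime_s(nnz, overhead_s=0.001)   # <1 ms T_perm rebuild
    print(f"{density:>4.0%}: Yrow-like {yrow_like:7.0f} s   Tcol-like {tcol_like:7.0f} s")
```

With these stand-in constants the Yrow-like total barely moves until the SpMM term overtakes the fixed overhead near full density, while the Tcol-like total tracks nnz almost linearly, which is the qualitative behaviour described above.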
The HBM panel reports the NVML peak delta
from a pre-run baseline, so the four curves separate by kernel +
framework choice rather than collapsing into the shared CUDA-context
baseline (~500 MB). The two CuPy paths sit ≈2× higher than their
native-kernel counterparts because of CuPy's separate memory pool
and intermediate allocations from cp.take(out=) +
cusparseSpMM via cupyx.
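CuPy's pool usage can be inspected directly, which is one way to attribute that ≈2× gap. A sketch, assuming the arrays of interest have already been allocated in the current process:

```python
# Sketch: how much the CuPy memory pool is holding, on top of what NVML
# attributes to the CUDA context and native-kernel allocations.
import cupy as cp

pool = cp.get_default_memory_pool()
pinned = cp.get_default_pinned_memory_pool()

print(f"pool used : {pool.used_bytes()  / 1e6:,.1f} MB")   # live allocations in the pool
print(f"pool total: {pool.total_bytes() / 1e6:,.1f} MB")   # includes cached, not-yet-freed blocks
print(f"pinned free blocks: {pinned.n_free_blocks()}")

# Releasing cached blocks narrows the gap against the native-kernel paths,
# at the cost of re-allocating on the next iteration.
pool.free_all_blocks()
```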
pure_r — maximum absolute difference per output (β, SE, z-score, p-value)
python_numpy(sparse) — maximum absolute difference per output
gsl.old — Y-row vs T-column, single vs multi-thread, on the shared GSL MT19937 perm seam
Per-tier CPU specs that matter for interpreting dgemm and permutation-step timing: socket memory bandwidth, SIMD width, and peak FP64 per core. Hostnames in the benchmark cluster map to these tiers by prefix; per-cell provenance is shown in §4.
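Why those three specs rather than clock speed alone: dgemm has high arithmetic intensity and so tracks peak FP64 and SIMD width, while the permutation step is a near-zero-flop gather/scatter that tracks socket memory bandwidth. A rough arithmetic-intensity comparison, using the same placeholder dimensions as earlier:

```python
# Sketch: flops-per-byte of the two hot steps, to show which spec each one tracks.
p, n, m = 10_000, 1_000, 200_000        # placeholder dims, as in the earlier sketch

# dgemm: 2*p*n*m flops over roughly (p*n + n*m + p*m) * 8 bytes of operand traffic.
dgemm_flops = 2 * p * n * m
dgemm_bytes = (p * n + n * m + p * m) * 8
print(f"dgemm: {dgemm_flops / dgemm_bytes:.0f} flops/byte -> FP64/SIMD bound")

# Permutation step: ~0 flops; it reads and writes the permuted operand once.
perm_bytes = 2 * p * n * 8
print(f"perm : ~0 flops over {perm_bytes / 1e6:.0f} MB -> memory-bandwidth bound")
```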
GPU models targeted by the CUDA implementations
(r_cuda, python_cuda_native,
python_cupy). HBM bandwidth,
peak FP64, and SM count are the throughput
determinants for cuBLAS dgemm
and cusparseSpMM.
Each row of the synthetic sweeps records the CPU family it actually ran on. Generation is resolved at run time from the SLURM-assigned hostname and propagated into the result JSON. Filter by impl or sweep to inspect a specific comparison's silicon distribution.
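A sketch of that run-time resolution, assuming a prefix-to-tier map and the standard SLURMD_NODENAME environment variable. The prefixes and JSON field names below are placeholders, not the benchmark's real ones.

```python
# Sketch: resolve the CPU tier from the SLURM-assigned hostname and record it
# alongside each result row. Prefixes and field names are hypothetical.
import json
import os
import socket

TIER_BY_PREFIX = {      # hypothetical hostname-prefix -> generation map
    "hsw": "haswell",
    "skl": "skylake",
    "icx": "icelake",
}

def resolve_cpu_tier():
    host = os.environ.get("SLURMD_NODENAME", socket.gethostname())
    for prefix, tier in TIER_BY_PREFIX.items():
        if host.startswith(prefix):
            return host, tier
    return host, "unknown"

host, tier = resolve_cpu_tier()
record = {"hostname": host, "cpu_family": tier}   # merged into the row's result JSON
print(json.dumps(record))
```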
Source: the master sweep script
(sbatch/bench_sweep.sbatch) and the
tiered retry helpers
(sbatch/retry_haswell_contam_*.sbatch).