ridge-bench — Ridge Regression Backend Benchmarks

CytoSig shape · 5,000 observations × 42 features (small-feature production case)

Run time vs sample count (m) log y

Same data, linear y linear y

SecAct shape · 8,000 observations × 1,248 features (full-signature production case)

Run time vs sample count (m) log y

Same data, linear y linear y

Stress shape · 14,000 observations × 2,000 features (large-on-both-axes stress test)

Run time vs sample count (m) log y

Same data, linear y linear y

Loading…

production Real datasets — run time per (implementation × dataset × format)

production Real datasets — peak memory per (implementation × dataset × format)

production Synthetic scaling — CytoSig shape (5k × 42)

production Synthetic scaling — SecAct shape (8k × 1,248)

production Synthetic scaling — stress shape (14k × 2,000) stress

production Synthetic scaling — peak memory vs sample count (m)

all 9 implementations Real datasets — run time

all 9 implementations Real datasets — peak memory

all 9 implementations Coverage matrix — implementation × (dataset / format), cell = run time

all 9 implementations CytoSig shape (5k × 42) — run time

all 9 implementations CytoSig shape (5k × 42) — peak memory

all 9 implementations SecAct shape (8k × 1,248) — run time

all 9 implementations SecAct shape (8k × 1,248) — peak memory

all 9 implementations Stress shape (14k × 2,000) — run time stress

all 9 implementations Stress shape (14k × 2,000) — peak memory stress

GPU paths only Device memory (HBM) — peak GPU memory across the three scaling shapes via NVML poller in the Python wrapper

CytoSig (5k × 42)

SecAct (8k × 1,248)

Stress (14k × 2,000)

Host RSS only counts pages in CPU RAM, so GPU paths appear flat on the host-memory panels above — the algorithmic work is in HBM. NVML reports per-device peak used from outside the kernel, so this captures r_cuda's compiled .so, python_cuda_native (ctypes), and python_cupy uniformly. Only available on cells run after the NVML wiring landed; older GPU runs show as gaps.

7 sparse-capable implementations Sparsity scaling — fixed shape (8k × 1,248 × m=10k), Y density varies — run time

Matrix shape held constant; only the fraction of non-zero entries in Y varies (1% → 100%).

7 sparse-capable implementations Sparsity scaling — fixed shape (8k × 1,248 × m=10k), Y density varies — peak memory

scaling · run time Run time vs sample count m, by entry point

scaling · peak memory Peak RSS vs sample count m, by entry point

speedup table Multi-thread (.mt, ncores=8) vs single-thread (.st) by family

6 algorithmically distinct CPU entry points displayed. The two dispatchers (SecAct.inference and the deprecated SecAct.inference.gsl.new) route to Tcol.mt on Linux and reproduce its curve by construction — bit-equivalence verified at the double-precision noise floor (worst residual 3.1×10⁻¹⁵ on z-score) via tests/test_ridger_strategies_equivalence.R. Both Tcol.mt and Yrow.mt use the same cooperative-per-strip parallelism; with parallelism held equal, the cheaper permutation operand wins — T_π at p·n ≈ 80 MB beats Y_π at n·m ≈ 1.6 GB at m=200k, giving Tcol.mt the 23 % edge visible in the panel.

sparse strategy comparison Tcol vs Yrow on GPU with sparse Y — runtime + HBM peak vs Y density (SecAct shape n=8k, p=1,248, m=10k)

Run time vs Y density

Peak GPU memory (HBM)

Four curves at six density tiers (1%, 5%, 10%, 30%, 50%, 100%): Tcol.gpu (compiled kernel) via the production python_cuda_native sparse path, Tcol.gpu (CuPy) via cupyx.scipy.sparse with cp.take(T, π⁻¹) per iteration, Yrow.gpu (compiled kernel) via ridge_cuda_sparse_yrow, and Yrow.gpu (CuPy) via row-index relabel on the resident CSC.

Runtime panel: Yrow.gpu shows a flat curve across density because per-iteration overhead (cusparseXcsrsort to restore within-column sort after the row-index relabel, plus cuSPARSE descriptor create/destroy) is independent of nnz and dominates the density-proportional cusparseSpMM compute at nrand=8,000. At low density the SpMM itself is <1 ms/iter so ~100 ms/iter of upstream overhead × 8,000 iters dominates. Tcol.gpu scales with nnz because its inner SpMM is the entire iteration cost (the T_perm rebuild kernel is <1 ms).

HBM panel reports the NVML peak delta from a pre-run baseline so the four curves separate by kernel + framework choice rather than collapsing into the shared CUDA-context baseline (~500 MB). The two CuPy paths sit ≈2× higher than their native-kernel counterparts because of CuPy's separate memory pool and intermediate allocations from cp.take(out=) + cusparseSpMM via cupyx.

cross-implementation · dense All 9 implementations vs pure_r — maximum absolute difference per output (β, SE, z-score, p-value)

cross-implementation · sparse 7 sparse-capable implementations vs python_numpy(sparse) — maximum absolute difference per output

format parity (dense vs sparse) Per implementation: dense input vs sparse input on the same data — maximum absolute difference per output

permutation strategy All 8 RidgeR entry points vs gsl.old — Y-row vs T-column, single vs multi-thread, on the shared GSL MT19937 perm seam

§1 CPU silicon — architectural reference

Per-tier CPU specs that matter for interpreting dgemm and permutation-step timing: socket memory bandwidth, SIMD width, and peak FP64 per core. Hostnames in the benchmark cluster map to these tiers by prefix; per-cell provenance is shown in §4.

§2 GPU silicon — architectural reference

GPU models targeted by the CUDA implementations (r_cuda, python_cuda_native, python_cupy). HBM bandwidth, peak FP64, and SM count are the throughput determinants for cuBLAS dgemm and cusparseSpMM.

§3 SLURM constraint mechanism

§4 Per-cell hardware provenance

Each row of the synthetic sweeps records the CPU family it actually ran on. Generation is resolved at run time from the SLURM-assigned hostname and propagated into the result JSON. Filter by impl or sweep to inspect a specific comparison's silicon distribution.

Impl filter: Sweep:

Resource budgets & scheduling rules

Hardware constraint: the master sweep script asks the scheduler to skip nodes a full generation behind, so every run lands on the modern tiers shown above.
Per-tier memory / wallclock budgets: small (m ≤ 1k) 32 GB / 4 h · medium (m = 10k) 32 GB / 12 h · large (m = 100k) 192 GB / 24 h · xl (m = 200k) 192 GB / 48 h. Sized from worst-case measured runs × ~1.5 so jobs don't get killed for OOM or timeout, without blocking the queue with oversized requests.
Threading policy: CPU impls run in one of two modes — single-threaded BLAS with OpenMP parallelizing the permutation loop, or multi-threaded BLAS over a single big matmul. The sweep harness picks per-impl based on workload shape (RHS bytes & output rows-per-thread) rather than hardcoding one policy.
GPU jobs request a single device (V100 by default; A100 for the largest sample sizes where memory or runtime needs the upgrade).

Source: the master sweep script (sbatch/bench_sweep.sbatch) and the tiered retry helpers (sbatch/retry_haswell_contam_*.sbatch).

Real datasets Synthetic sweeps