# uc bench

Run downstream-task benchmarks on a compressed UltraCompress artifact.
## Synopsis

```
uc bench <path> [--tasks <list>] [--limit <int>] [--batch-size <int>] [--device <str>] [--output-dir <path>]
```
## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `<path>` | yes | Path to a directory produced by `uc pull` |
## Options

| Option | Default | Description |
|--------|---------|-------------|
| `--tasks LIST` | `hellaswag,arc_challenge` | Comma-separated lm-eval-harness task names |
| `--limit INT` | `500` | Samples per task |
| `--batch-size INT` | `8` | Batch size |
| `--device STR` | `cuda:0` | PyTorch device |
| `--output-dir PATH` | `./bench-results` | Where to save per-sample logs and the summary JSON |
## Output

```
UltraCompress v0.1.0 · https://sipsalabs.com
Extreme compression for large language models. Patent pending — USPTO 64/049,511 + 64/049,517

→ Benchmarking ./models/sipsalabs_<model-id> on tasks: hellaswag,arc_challenge
  limit=500 batch_size=8 device=cuda:0

                Benchmark results
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Task            ┃ acc     ┃ acc_norm ┃ stderr   ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ hellaswag       │ 51.20%  │ 67.60%   │ +/-2.23% │
│ arc_challenge   │ 38.40%  │ 41.20%   │ +/-2.18% │
└─────────────────┴─────────┴──────────┴──────────┘
```
## Behavior

- Loads the compressed artifact via the UltraCompress reference loader.
- Runs lm-eval-harness for each named task.
- Writes a per-sample log (`<task>_<timestamp>.jsonl`) and a summary (`summary.json`) to `<output-dir>`.
- Prints a Rich-rendered table to stdout.
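The summary file can be consumed programmatically. A minimal sketch, assuming `summary.json` is an ordinary flat JSON object (the key names in the usage comment are illustrative, not part of any documented schema):

```python
import json
from pathlib import Path


def load_summary(output_dir: str) -> dict:
    """Read the summary.json that uc bench writes into <output-dir>."""
    return json.loads((Path(output_dir) / "summary.json").read_text())


# Hypothetical usage; key names such as "seed" are assumptions:
# summary = load_summary("./bench-results")
# print(summary.get("seed"), summary.get("batch_size"))
```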
## Requirements

`uc bench` requires:

- PyTorch (`pip install "ultracompress[torch]"`)
- A CUDA GPU (default device `cuda:0`; pass `--device cpu` for CPU-only runs, but expect them to be roughly 100× slower)
- ~2-8 GB of GPU memory for a 7B-parameter model at 2.798 bpw
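The ~2-8 GB figure can be sanity-checked with back-of-the-envelope arithmetic. A sketch of the weights-only footprint (activations, KV cache, and framework overhead add to this, which is why the quoted range runs higher):

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Weights-only memory footprint in GiB for a quantized model."""
    return n_params * bits_per_weight / 8 / 2**30


# A 7B-parameter model at 2.798 bpw needs roughly 2.3 GiB for the
# weights alone; runtime overhead pushes the total toward the quoted
# 2-8 GB range.
print(round(weight_footprint_gib(7e9, 2.798), 2))  # → 2.28
```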
## Examples

```shell
# Default: 2 tasks, 500 samples each, batch 8
uc bench ./models/sipsalabs_<model-id>

# Quick smoke check
uc bench ./models/sipsalabs_<model-id> --tasks hellaswag --limit 50

# Multiple tasks with a bigger sample size
uc bench ./models/sipsalabs_<model-id> \
  --tasks hellaswag,arc_challenge,arc_easy,piqa,winogrande \
  --limit 1000 --batch-size 16

# CPU fallback (slow!)
uc bench ./models/sipsalabs_<model-id> --device cpu --limit 50

# Custom output directory
uc bench ./models/sipsalabs_<model-id> -o /tmp/bench-runs/run-001
```
## Reproducibility

- Every run uses a deterministic seed (default 42; configurable via the `UC_BENCH_SEED` env var when supported).
- Results are deterministic up to GPU-arithmetic non-determinism, which is bounded; aggregate metrics do not differ in practice.
- `summary.json` records the seed, sample count, batch size, and the lm-eval-harness version used.
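Pinning a non-default seed is plain shell environment handling. A sketch, assuming your installed version supports `UC_BENCH_SEED` (the actual invocation is left commented out):

```shell
# Pin the benchmark seed so two runs are directly comparable.
UC_BENCH_SEED="${UC_BENCH_SEED:-42}"   # falls back to the documented default
export UC_BENCH_SEED
echo "benchmark seed: $UC_BENCH_SEED"
# uc bench ./models/sipsalabs_<model-id> --limit 500
```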
## Exit codes

| Code | Meaning |
|------|---------|
| 0 | OK |
| 1 | Benchmark failed (PyTorch / CUDA / lm-eval-harness error) |
| 2 | Invalid arguments (Click default) |
## Tips

- Larger batch size means faster runs but more GPU memory. Bump `--batch-size` to 16 or 32 if you have memory headroom.
- Use a smaller `--limit` for quick iteration; a full evaluation usually wants `--limit 1000` or more.
- Device-specific tuning: on H100 we recommend `--batch-size 32`; on consumer 4090/5090, `--batch-size 16`; on T4, `--batch-size 4`.
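When scripting around `uc bench`, the device choice can be made adaptively with PyTorch's standard detection API. A minimal sketch whose fallback mirrors the `--device cpu` escape hatch above:

```python
def pick_device() -> str:
    """Return "cuda:0" when a CUDA GPU is visible, else fall back to "cpu"."""
    try:
        import torch  # installed via: pip install "ultracompress[torch]"
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        # Without PyTorch, uc bench itself would not run; shown only so the
        # sketch degrades gracefully.
        pass
    return "cpu"


# Pass the result as --device when invoking uc bench.
print(pick_device())
```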
## See also