uc compress design spec (v0.2 — target Q3 2026)¶
Status: 🔵 PLANNED for v0.2.
This page describes the planned uc compress command — self-compression of a customer-provided source model. It is NOT in v0.1; this page is the design spec we'll implement against.
uc compress is the most-requested feature in pre-launch design-partner conversations. It's also the feature most tightly gated on the patent prosecution timeline (we hold the implementation closed until the non-provisional applications are filed in April 2027).
This feature is planned for v0.2 (Q3 2026). To register interest, email founder@sipsalabs.com.
Goals¶
- One-command compression of any HF Hub or local-disk transformer model
- Reproducibility: every output artifact ships with a complete provenance manifest
- Sensible defaults: `uc compress <model>` Just Works for typical models without per-model tuning
- Pluggable compression components: the shipping low-bpw codec and the roadmap architectural-compression component, accessible via flags
- Resumable: long-running compression jobs can resume after interruption
Synopsis (planned)¶
uc compress <source-model> [--bpw FLOAT] [--method STR] [--track STR]
[--output-dir PATH] [--device STR]
[--max-batch-size INT] [--seed INT]
[--license STR] [--api-key STR]
Options¶
| Option | Default | Description |
|---|---|---|
| `<source-model>` | required | HF Hub model ID (e.g., Qwen/Qwen3-1.7B) or local-disk path |
| `--bpw FLOAT` | `5.0` | Target bits per weight |
| `--method STR` | `default` | Compression method to apply (component identifiers NDA-gated) |
| `--track STR` | `a` | Which patent-pending component pipeline: a, b, a+b |
| `--output-dir PATH` | `./models/<source-name>-uc<bpw>` | Where to save the compressed artifact |
| `--device STR` | `cuda:0` | PyTorch device for compression |
| `--max-batch-size INT` | `8` | Batch size for the calibration data pass |
| `--seed INT` | `42` | Deterministic seed for reproducibility |
| `--license STR` | `sipsalabs-research-eval-1.0` | License identifier to write into the manifest |
| `--api-key STR` | `(env: UC_COMPRESS_API_KEY)` | API key for the remote compression service |
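The option table above can be mirrored in a small argument parser. The following is only a sketch of how the planned flags and defaults might be wired up with Python's `argparse`; the function names `build_parser` and `resolve_output_dir` are hypothetical, and the way `<bpw>` is rendered in the default output directory (here `5.0`) is an assumption, since the spec does not fix that format.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the planned option table (not the shipped CLI)."""
    p = argparse.ArgumentParser(prog="uc compress")
    p.add_argument("source_model", help="HF Hub model ID or local-disk path")
    p.add_argument("--bpw", type=float, default=5.0, help="target bits per weight")
    p.add_argument("--method", default="default")
    p.add_argument("--track", choices=["a", "b", "a+b"], default="a")
    p.add_argument("--output-dir", default=None)
    p.add_argument("--device", default="cuda:0")
    p.add_argument("--max-batch-size", type=int, default=8)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--license", default="sipsalabs-research-eval-1.0")
    # Falls back to the environment variable named in the table.
    p.add_argument("--api-key", default=os.environ.get("UC_COMPRESS_API_KEY"))
    return p


def resolve_output_dir(args: argparse.Namespace) -> str:
    """Apply the documented default ./models/<source-name>-uc<bpw>."""
    if args.output_dir:
        return args.output_dir
    # Assumption: <source-name> is the last path component of the model ID or path.
    name = args.source_model.rstrip("/").split("/")[-1]
    return f"./models/{name}-uc{args.bpw}"


args = build_parser().parse_args(["Qwen/Qwen3-1.7B"])
print(resolve_output_dir(args))
```

Resolving the output directory in a separate helper keeps the default visible and testable without touching the network or a GPU.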
What it does (high-level flow)¶
1. Parse the source model: download from HF Hub if needed, or load from local disk
2. Validate the model is a transformer-class architecture (raise otherwise)
3. Compute the FP16 baseline metadata: parameter count, layer breakdown, etc.
4. Submit the compression job to Sipsa Labs' compression service
(compute-intensive; not feasible on a customer's local laptop in v0.2)
5. Stream the compressed artifact back as it's produced
6. Write the artifact to disk:
- `model.safetensors` — compressed weights
- `tokenizer/` — pre-loaded tokenizer (copy from source)
- `config.json` — pre-loaded config (copy from source)
- `ultracompress.json` — provenance manifest
- `LICENSE` — per the --license flag
7. Verify SHA-256 of the produced artifact against the manifest
8. Print a summary
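Step 7 of the flow above (hash verification) could look roughly like the sketch below. It assumes the manifest stores the expected digest of `model.safetensors` under a key named `artifact_sha256`; that key name is hypothetical, since the manifest schema is not defined on this page.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB artifacts need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_artifact(output_dir: Path) -> bool:
    """Compare the on-disk weights against the digest recorded in the manifest."""
    manifest = json.loads((output_dir / "ultracompress.json").read_text())
    expected = manifest["artifact_sha256"]  # hypothetical manifest key
    actual = sha256_file(output_dir / "model.safetensors")
    return actual == expected
```

A caller would run `verify_artifact(Path("./models/..."))` after the download finishes and treat a `False` result as a failed job rather than printing the summary.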
Example¶
# Default settings — compress Qwen3-1.7B to 5 bpw via the default UltraCompress codec
uc compress Qwen/Qwen3-1.7B
# Output:
# ./models/<source-model-name>-uc<bpw>/
# ├── model.safetensors (1.04 GB)
# ├── tokenizer/ (2.7 MB, copied from source)
# ├── config.json (4 KB)
# ├── ultracompress.json (provenance manifest)
# └── LICENSE (Sipsa Labs Research and Evaluation License)
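As a rough sanity check on artifact sizes, the weight payload scales as parameter count times bits per weight divided by 8. The sketch below assumes a nominal 1.7e9 parameters (taken from the model name, not from a measured count) and ignores container overhead and any tensors stored at a different precision, so it only lands in the same range as the size listed above rather than reproducing it exactly.

```python
def estimated_weight_bytes(num_params: float, bpw: float) -> float:
    """Back-of-envelope size: parameters x bits per weight / 8 bits per byte."""
    return num_params * bpw / 8


NOMINAL_PARAMS = 1.7e9  # assumption: nominal count implied by the model name

compressed = estimated_weight_bytes(NOMINAL_PARAMS, 5.0)   # default --bpw
baseline = estimated_weight_bytes(NOMINAL_PARAMS, 16.0)    # FP16 baseline

print(f"compressed ~ {compressed / 1e9:.2f} GB")
print(f"FP16 baseline ~ {baseline / 1e9:.2f} GB")
print(f"nominal ratio ~ {baseline / compressed:.1f}x")
```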
Architectural compression (roadmap component)¶
Architectural compression is the most aggressive variant; it produces a model with substantially fewer trainable parameters but requires a calibration pass on representative training data.
uc compress Qwen/Qwen3-1.7B --method architectural --output-dir ./models/qwen3-architectural \
--calibration-data ./calibration.jsonl
The calibration data is a JSONL file of prompt/response pairs representative of the customer's deployment workload. The compression service uses it to validate that architectural compression preserves quality on the customer's distribution.
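Before uploading, a customer may want to check the calibration file locally. The sketch below assumes each JSONL line is an object with non-empty string fields named `prompt` and `response`; the spec only says "prompt/response pairs", so these field names and the helper `validate_calibration_file` are illustrative assumptions.

```python
import json
from pathlib import Path


def validate_calibration_file(path: Path) -> int:
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with path.open("r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            # Assumed schema: both fields present, non-empty strings.
            for key in ("prompt", "response"):
                value = record.get(key)
                if not isinstance(value, str) or not value:
                    raise ValueError(f"line {lineno}: missing or empty {key!r}")
            count += 1
    return count
```

Failing fast on the first malformed line avoids submitting a long-running remote job that would be rejected later.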
Combined codec + architectural-compression pipeline¶
Stacks the roadmap architectural-compression component on top of the shipping codec. Component-by-component breakdown is NDA-gated; reference cohort numbers are at evidence/matrix.md.
Why a remote service vs. local¶
Compression at our quality bar requires:
- ~10-100 GPU-hours per model (depending on size + cohort verification)
- Specialized kernels not yet pip-installable
- Calibration data from a curated cohort to maintain quality
Running this on a customer's laptop or desktop is not practical in v0.2. We'll lift the remote-only requirement as the methods mature, but the v0.2 release uses a remote compression service.
The customer's source model is uploaded to the service over TLS, processed, and the result is returned. Customer data never persists on Sipsa Labs' infrastructure beyond the duration of the compression job (typically minutes).
Authentication¶
- Sign up at sipsalabs.com to get a `UC_COMPRESS_API_KEY`
- Free tier: 1 compression per month on models < 7B parameters
- Paid tiers: per PRICING_CALCULATOR.md
Reproducibility¶
Every compression run is deterministic given the same seed + same source model. The ultracompress.json manifest captures:
- Source model SHA-256
- Compression method version
- Seed
- Calibration data SHA-256 (for the architectural-compression and combined pipelines)
- Compute environment fingerprint
A second uc compress run with the same inputs and the same seed reproduces the first run's output up to GPU-arithmetic non-determinism: byte-identical where the kernels are deterministic, and otherwise within a bound that we publish.
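To make the manifest fields listed above concrete, here is a minimal sketch of assembling them and checking whether two runs were given the same inputs. All key names, the `build_manifest` and `same_inputs` helpers, and the contents of the environment fingerprint are assumptions for illustration; the real `ultracompress.json` schema is not specified here.

```python
import hashlib
import json
import platform
from typing import Optional


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(source_weights: bytes, method_version: str, seed: int,
                   calibration_data: Optional[bytes] = None) -> dict:
    """Collect the provenance fields named in the spec (hypothetical key names)."""
    return {
        "source_model_sha256": sha256_bytes(source_weights),
        "method_version": method_version,
        "seed": seed,
        # Only set for the architectural-compression and combined pipelines.
        "calibration_data_sha256": (
            sha256_bytes(calibration_data) if calibration_data is not None else None
        ),
        # Assumption: a very small stand-in for a compute environment fingerprint.
        "environment": {
            "python": platform.python_version(),
            "machine": platform.machine(),
        },
    }


def same_inputs(a: dict, b: dict) -> bool:
    """True when two manifests describe the same source, method, seed, and calibration."""
    keys = ("source_model_sha256", "method_version", "seed", "calibration_data_sha256")
    return all(a[k] == b[k] for k in keys)


m1 = build_manifest(b"weights", "1.2.0", seed=42)
m2 = build_manifest(b"weights", "1.2.0", seed=42)
print(json.dumps(m1, indent=2, sort_keys=True))
print(same_inputs(m1, m2))
```

Comparing only the input fields, and not the environment block, reflects the idea that two runs on different hardware can share inputs yet differ within the published non-determinism bound.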
Resumability¶
For long-running compression jobs (architectural-compression at 70B+ parameters, expected to take several hours), uc compress supports resume:
# Submit the job
uc compress Qwen/Qwen3-32B --method shared-block --bpw 2.5 --output-dir ./models/qwen3-32b-shared-block-2p5
# Job interrupted (say, network drop)
# Resume with the same command:
uc compress Qwen/Qwen3-32B --method shared-block --bpw 2.5 --output-dir ./models/qwen3-32b-shared-block-2p5
# CLI detects the partial state in the output dir, resumes from the last checkpoint
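One way the CLI could detect partial state is a small state file in the output directory, keyed by a fingerprint of the command's inputs so that a different command does not resume the wrong job. This is a sketch under stated assumptions: the file name `.uc-compress-state.json`, its fields, and the helper names are all hypothetical, and the actual checkpoint format is not described in this spec.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

STATE_FILE = ".uc-compress-state.json"  # hypothetical file name


def job_fingerprint(source_model: str, method: str, bpw: float, seed: int) -> str:
    """Stable digest of the inputs that must match for a resume to be valid."""
    payload = json.dumps(
        {"source_model": source_model, "method": method, "bpw": bpw, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_state(output_dir: Path, fingerprint: str, last_checkpoint: int) -> None:
    """Write the state atomically so an interruption cannot leave a half-written file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    tmp = output_dir / (STATE_FILE + ".tmp")
    tmp.write_text(json.dumps({"fingerprint": fingerprint,
                               "last_checkpoint": last_checkpoint}))
    tmp.replace(output_dir / STATE_FILE)


def find_resume_point(output_dir: Path, fingerprint: str) -> Optional[int]:
    """Return the last checkpoint index, or None if there is nothing valid to resume."""
    path = output_dir / STATE_FILE
    if not path.exists():
        return None
    state = json.loads(path.read_text())
    if state.get("fingerprint") != fingerprint:
        return None  # same directory, different job: start fresh
    return state.get("last_checkpoint")
```

Writing to a temporary file and renaming it means a network drop or crash mid-write leaves the previous checkpoint record intact.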
Privacy + IP¶
- Customer's source model: stays customer's. We do not retain customer source models beyond the compression job duration. The current subprocessor list is available on request: email legal@sipsalabs.com.
- Compressed output artifact: customer owns it, subject to the license written into `LICENSE` by `uc compress` (defaults to Research and Evaluation License; commercial customers use a different license value).
- Compression methods: Sipsa Labs' IP, patent-pending. Use of `uc compress` requires acceptance of a Subscription Agreement provided by legal@sipsalabs.com at onboarding.
- Calibration data: stays customer's. Used only for the compression job; not retained.
What this command will NOT do (v0.2)¶
- Compress arbitrary architectures outside the transformer family (CNN, GNN, etc.)
- Compress encoder-only or encoder-decoder models (e.g., BERT, T5): deferred to v0.3
- Compress models already trained with quantization-aware training (QAT): deferred to v0.3
- Run entirely locally without internet: deferred; a possible path forward is an `--offline` flag with pre-shipped kernels
Pricing for uc compress¶
Per PRICING_CALCULATOR.md:
| Plan | Cost | What you get |
|---|---|---|
| Free | $0 | 1 compression / month, models ≤ 7B params, research license only |
| Solo | $99/mo | 5 compressions / month |
| Team | $499/mo | 50 compressions / month, 5 users |
| Business | $1,999/mo | 200 compressions / month, 15 users, SLA |
| Enterprise | $5K-$50K/mo | Custom volume, custom users, SLA, audit logs, custom calibration cohorts |
For chip vendors and OEMs: separate per-device royalty model in OEM_LICENSING_TERMS.md.
Migration path from v0.1¶
v0.1 doesn't have uc compress. The migration path:
- v0.1 → v0.2: `uc compress` becomes available for customers on a paid tier
- Existing v0.1 reference-model artifacts (downloaded via `huggingface-cli download`) keep working in v0.2
Roadmap¶
| Feature | Target |
|---|---|
| Basic `uc compress` with the shipping codec | v0.2 (Q3 2026) |
| Architectural-compression component support | v0.2 |
| Combined codec + architectural-compression pipeline | v0.2 |
| Resumable jobs | v0.2.1 |
| Custom calibration cohorts (enterprise tier) | v0.3 |
| Encoder-only and encoder-decoder model support (BERT, T5) | v0.3 |
| `--offline` mode (local-only compression) | v1.0+ |
Open questions¶
- API surface for programmatic access: expose a Python entry point such as `compress(model_id, bpw, ...)`, importable from `ultracompress_cli`, or stay CLI-only? Lean toward CLI-only for v0.2; add a Python API in v0.3 if customer demand justifies it.

- Custom calibration cohorts: how do we expose them safely? Tentatively: customers upload calibration JSONL under stricter NDA/contractual terms, similar to enterprise data handling.

- Performance benchmarks per output: should `uc compress` automatically run `uc bench` on the output before declaring success? Yes; small `--bench-on-finish` flag (default on, suppressible).

- Per-customer compression-method versioning: should customers be able to pin to a specific method version (e.g., `--method-version 1.2.0`)? Yes; expose via flag.
These are open. Customer feedback welcome — file an issue at github.com/sipsalabs/ultracompress or email founder@sipsalabs.com.
Design spec for v0.2; revise as implementation progresses.