Integration — vLLM¶
vLLM is the production default for high-throughput LLM inference on commodity GPUs. Native UltraCompress support is on the v0.2 roadmap (target Q3 2026). Until then, the integration options are:
- Use the UltraCompress reference loader to inflate to FP16, then serve via standard vLLM (gives up the memory savings; works today)
- Use the UltraCompress reference loader directly (no vLLM path; works today, but at lower throughput)
- Wait for native vLLM support in v0.2
Current path (v0.1, mid-2026)¶
Option A — inflate then serve via vLLM¶
This loses the memory savings but gives you vLLM's throughput. Useful for evaluation; not the final story.
```python
from ultracompress_cli.loader import load_model  # v0.1.1+
import torch

# Load + inflate to FP16
model = load_model("./models/sipsalabs_<model-id>").to(torch.float16)

# Save the inflated model in HF Transformers format
model.save_pretrained("./models/qwen3-1.7b-fp16-from-uc")

# Now serve with standard vLLM:
#   vllm serve ./models/qwen3-1.7b-fp16-from-uc
```
This is essentially "use UltraCompress to get the model and immediately throw away the compression." Use only as a vLLM evaluation path.
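Before standing up a server, you can sanity-check the inflated checkpoint with vLLM's offline Python API. A minimal sketch, using the directory saved above (the prompt and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams

# Load the FP16 checkpoint produced by save_pretrained() in the snippet above.
llm = LLM(model="./models/qwen3-1.7b-fp16-from-uc")
params = SamplingParams(temperature=0.0, max_tokens=64)

# One quick generation to confirm the inflated weights behave as expected.
outputs = llm.generate(["Explain the difference between FP16 and INT4 weights."], params)
print(outputs[0].outputs[0].text)
```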
Option B — direct loader, no vLLM¶
If you want the memory savings, run the model directly via the reference loader, accepting lower throughput than vLLM:
```python
from ultracompress_cli.loader import load_model

model = load_model("./models/sipsalabs_<model-id>").cuda()
# Single-request inference
```
This is what most pre-launch design partners are using during pilots.
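For completeness, a minimal single-request sketch, assuming the object returned by `load_model` exposes the standard Hugging Face `generate()` interface (the `save_pretrained()` call in Option A suggests it does) and that a tokenizer ships inside the artifact directory:

```python
import torch
from transformers import AutoTokenizer
from ultracompress_cli.loader import load_model

model_dir = "./models/sipsalabs_<model-id>"
model = load_model(model_dir).cuda()
# Assumption: the artifact directory also contains a standard HF tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "Summarize the benefits of weight compression in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```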
v0.2 path (Q3 2026 roadmap)¶
We will ship a vLLM plugin that adds UltraCompress as a recognized quantization format:
```bash
pip install "ultracompress[vllm]"  # bundles the plugin
vllm serve sipsalabs/<model-id> --dtype ultracompress
```
The plugin will:
- Register a custom layer kernel that decompresses UltraCompress weights at inference time (avoiding the FP16 inflation cost)
- Maintain vLLM's throughput characteristics (continuous batching, paged attention, etc.)
- Pass through the `ultracompress.json` provenance via `vllm.model.metadata`
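For planning purposes, the offline Python equivalent is expected to mirror how other quantized checkpoints work in vLLM today. The sketch below is speculative: it assumes the plugin registers an `ultracompress` quantization method, and the final selector may differ from the `--dtype` spelling shown above.

```python
from vllm import LLM, SamplingParams

# Speculative v0.2 usage; the plugin and the "ultracompress" selector do not exist yet.
llm = LLM(model="sipsalabs/<model-id>", quantization="ultracompress")
outputs = llm.generate(
    ["Summarize the UltraCompress format in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```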
Memory footprint with v0.2 plugin¶
| Variant | vLLM RAM at runtime |
|---|---|
| Qwen3-1.7B FP16 (vLLM standard) | ~3.5 GB |
| Qwen3-1.7B AWQ-INT4 (vLLM AWQ path) | ~1.0 GB |
| Qwen3-1.7B UltraCompress 2.798 bpw (v0.2 plugin) | ~0.7 GB |
(Inference speed target: ~2-3× the UltraCompress reference loader, comparable to vLLM's AWQ path on equivalent hardware.)
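As a rough check on the UltraCompress row, the weights alone at 2.798 bits per weight account for most of that footprint; the rest is runtime overhead such as the KV cache. Illustrative arithmetic only:

```python
params = 1.7e9   # approximate parameter count for Qwen3-1.7B
bpw = 2.798      # UltraCompress bits per weight
weight_gb = params * bpw / 8 / 1e9
print(f"~{weight_gb:.2f} GB of weights")  # ~0.59 GB; the table's ~0.7 GB adds runtime overhead
```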
What you can do today (mid-2026)¶
- Evaluation: use Option A above to compare UltraCompress quality against other quantization methods through the same vLLM serving path (see the sketch after this list)
- Reference deployment: use Option B for low-traffic, development, or single-customer evaluation
- Open an issue if you're a vLLM user with a specific deployment scenario; we prioritize the v0.2 plugin work by impact on deployed customers
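One way to run the Option A comparison is to stand up two vLLM servers (one with the inflated-from-UltraCompress checkpoint, one with an AWQ baseline) and send the same prompt to both through vLLM's OpenAI-compatible API. A sketch, with placeholder ports and model paths:

```python
from openai import OpenAI

# Assumes two local vLLM servers, e.g.:
#   vllm serve ./models/qwen3-1.7b-fp16-from-uc --port 8000
#   vllm serve <awq-model-path> --port 8001
endpoints = {
    "ultracompress-inflated": "http://localhost:8000/v1",
    "awq-baseline": "http://localhost:8001/v1",
}
prompt = "Explain paged attention in one sentence."

for name, base_url in endpoints.items():
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    model_id = client.models.list().data[0].id  # vLLM reports the served model name here
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=64,
    )
    print(f"{name}: {resp.choices[0].message.content}")
```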
What you'll be able to do post-Q3 2026¶
- One-line `vllm serve` against UltraCompress artifacts directly
- Continuous-batching throughput comparable to vLLM's AWQ path
- Memory footprint comparable to UltraCompress's native artifact size