Integration — llama.cpp / GGUF¶
Native llama.cpp / GGUF support is on the v0.2 roadmap (target Q3 2026). Until then, the integration story is two paths:
- Use the UltraCompress reference loader in Python (current path)
- Convert UltraCompress artifacts to GGUF format using a converter script we'll publish
This page describes both.
Current path (v0.1, mid-2026)¶
pip install "ultracompress[torch]"
huggingface-cli download SipsaLabs/<repo-id> --local-dir ./<repo-id>
Inference via the Python reference loader:
from ultracompress_cli.loader import load_model # available v0.1.1+
from transformers import AutoTokenizer
local = "./models/sipsalabs_<model-id>"
tokenizer = AutoTokenizer.from_pretrained(local)
model = load_model(local).cuda()
# Standard transformers generate
out = model.generate(**tokenizer("Hello", return_tensors="pt").to("cuda"))
print(tokenizer.decode(out[0]))
This works for evaluation, prototyping, and integration testing. It is not a production-grade inference path on llama.cpp's hardware.
v0.2 path (Q3 2026 roadmap)¶
We will ship uc export --format gguf that converts an UltraCompress artifact to GGUF format compatible with llama.cpp:
uc export ./models/sipsalabs_<model-id> --format gguf -o qwen3-uc.gguf
./llama-cli -m qwen3-uc.gguf -p "Hello"
The exported GGUF file will:
- Inflate the UltraCompress weights into one of llama.cpp's native quantization formats (likely Q3_K_S or a new format we contribute)
- Preserve the
ultracompress.jsonprovenance via a custom GGUF metadata field (general.ultracompress.bpw,general.ultracompress.method,general.ultracompress.patents) - Be a drop-in replacement for any other GGUF model in your llama.cpp pipeline
Why we don't ship llama.cpp natively today¶
The UltraCompress weight format is not a llama.cpp-native quantization scheme. To run efficiently on llama.cpp, we would need to either:
- Contribute new ggml quantization types upstream (slow, ~6-12 months of upstream review)
- Inflate to existing types at export time (faster, but loses some compression efficiency)
We're pursuing both paths. The export-to-existing-types path lands first (v0.2, Q3 2026); upstream contribution lands eventually.
Memory footprint after GGUF export¶
Inflating UltraCompress's 5-bpw representation to llama.cpp's nearest existing type:
| UltraCompress source | llama.cpp target | Memory after inflation | Compression vs FP16 |
|---|---|---|---|
| 5 bpw | Q5_K_M (~5.5 bpw) | ~1.18 GB for 1.7B | 2.9× |
| 5 bpw | Q5_K_S (~5.1 bpw) | ~1.10 GB for 1.7B | 3.1× |
We give up some of the on-disk compression efficiency in exchange for native llama.cpp inference speed. This is the right tradeoff for almost all production deployment scenarios; for storage-bound use cases (e.g., mobile distribution where the artifact is downloaded once), keep the native UltraCompress format.
What you can do today (mid-2026)¶
- Use the Python loader for evaluation + prototyping
- Run your own
lm-eval-harness(or equivalent) against the reconstructed model to compare compressed vs FP16 quality on your tasks - Open a GitHub issue with your specific llama.cpp use case so we prioritize the export path correctly
What you'll be able to do post-Q3 2026¶
uc export --format ggufto convert artifacts- Direct
llama-cli/llama-serveruse on the exported GGUF - All standard llama.cpp tooling:
llama-bench,llama-quantize(for re-quantizing),llama-perplexity