Integration — llama.cpp / GGUF¶
Native llama.cpp / GGUF support is on the v0.2 roadmap (target Q3 2026). Until then, there are two integration paths:
- Use the UltraCompress reference loader in Python (current path)
- Convert UltraCompress artifacts to GGUF format using a converter script we'll publish
This page describes both.
Current path (v0.1, mid-2026)¶
Inference via the Python reference loader:
```python
from ultracompress_cli.loader import load_model  # available v0.1.1+
from transformers import AutoTokenizer

local = "./models/sipsalabs_<model-id>"
tokenizer = AutoTokenizer.from_pretrained(local)
model = load_model(local).cuda()

# Standard transformers generate
out = model.generate(**tokenizer("Hello", return_tensors="pt").to("cuda"))
print(tokenizer.decode(out[0]))
```
This works for evaluation, prototyping, and integration testing. It is not a production-grade inference path, and it does not run on the hardware llama.cpp targets.
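If you want to sanity-check the compressed in-memory footprint on this path, standard PyTorch memory accounting is enough; nothing below is UltraCompress-specific:

```python
import torch

# Bytes currently held by tensors on the active GPU, after load_model(...).cuda()
print(f"allocated: {torch.cuda.memory_allocated() / 1e6:.0f} MB")
```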
v0.2 path (Q3 2026 roadmap)¶
We will ship uc export --format gguf that converts an UltraCompress artifact to GGUF format compatible with llama.cpp:
```bash
uc export ./models/sipsalabs_<model-id> --format gguf -o qwen3-uc.gguf
./llama-cli -m qwen3-uc.gguf -p "Hello"
```
The exported GGUF file will:
- Inflate the UltraCompress weights into one of llama.cpp's native quantization formats (likely Q3_K_S or a new format we contribute)
- Preserve the `ultracompress.json` provenance via custom GGUF metadata fields (`general.ultracompress.bpw`, `general.ultracompress.method`, `general.ultracompress.patents`); see the sketch after this list
- Be a drop-in replacement for any other GGUF model in your llama.cpp pipeline
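As a sketch of reading that provenance back out, the llama.cpp-maintained `gguf` Python package (`pip install gguf`) can inspect the exported file. The `general.ultracompress.*` keys are the planned v0.2 fields, so they will only be present once the exporter ships; the `parts`/`data` access below is the package's low-level API, and fully decoding a value depends on its field type:

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("qwen3-uc.gguf")
for name, field in reader.fields.items():
    if name.startswith("general.ultracompress."):
        # field.parts holds the raw value arrays; field.data indexes into them
        print(name, field.parts[field.data[0]])
```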
Why we don't ship llama.cpp natively today¶
The UltraCompress weight format is not a llama.cpp-native quantization scheme. To run efficiently on llama.cpp, we would need to either:
- Contribute new ggml quantization types upstream (slow, ~6-12 months of upstream review)
- Inflate to existing types at export time (faster, but loses some compression efficiency)
We're pursuing both paths. The export-to-existing-types path lands first (v0.2, Q3 2026); the upstream contribution follows on a longer timeline.
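Until the exporter ships, you can approximate the inflate-at-export path by hand with stock tooling. This is a sketch, assuming the v0.1 loader returns an ordinary transformers model whose weights can be materialized in FP16; the `./inflated-f16` directory and output filenames are arbitrary:

```python
from ultracompress_cli.loader import load_model
from transformers import AutoTokenizer

local = "./models/sipsalabs_<model-id>"
model = load_model(local).half()  # inflate compressed weights to FP16
tokenizer = AutoTokenizer.from_pretrained(local)

# Write a standard HF checkpoint that llama.cpp's converter understands
model.save_pretrained("./inflated-f16")
tokenizer.save_pretrained("./inflated-f16")
```

```bash
# From a llama.cpp checkout: convert to GGUF, then quantize to a 3-bit K-quant
python convert_hf_to_gguf.py ./inflated-f16 --outfile qwen3-f16.gguf --outtype f16
./llama-quantize qwen3-f16.gguf qwen3-q3_k_s.gguf Q3_K_S
```

Note that this round-trips through two quantizers (UltraCompress, then Q3_K_S), so expect somewhat worse quality than the planned uc export path.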
Memory footprint after GGUF export¶
Inflating UltraCompress's 2.798-bpw representation to llama.cpp's nearest existing type:
| UltraCompress source | llama.cpp target | Memory after inflation | Compression vs FP16 |
|---|---|---|---|
| 2.798 bpw | Q3_K_S (~3.4 bpw) | ~720 MB for 1.7B | 4.7× |
| 2.798 bpw | IQ3_XS (~2.9 bpw) | ~620 MB for 1.7B | 5.5× |
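The figures above follow directly from memory ≈ params × bpw / 8 bytes and compression ≈ 16 / bpw:

```python
# Reproduce the table's rows for a 1.7B-parameter model
params = 1.7e9
for target, bpw in [("Q3_K_S", 3.4), ("IQ3_XS", 2.9)]:
    mem_mb = params * bpw / 8 / 1e6
    print(f"{target}: ~{mem_mb:.0f} MB, {16 / bpw:.1f}x vs FP16")
```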
We give up some of the on-disk compression efficiency in exchange for native llama.cpp inference speed. This is the right tradeoff for almost all production deployment scenarios; for storage-bound use cases (e.g., mobile distribution where the artifact is downloaded once), keep the native UltraCompress format.
What you can do today (mid-2026)¶
- Use the Python loader for evaluation + prototyping
- Run `uc bench` to compare compressed vs FP16 quality on your tasks
- Open a GitHub issue with your specific llama.cpp use case so we prioritize the export path correctly
What you'll be able to do post-Q3 2026¶
- `uc export --format gguf` to convert artifacts
- Direct `llama-cli` / `llama-server` use on the exported GGUF
- All standard llama.cpp tooling: `llama-bench`, `llama-quantize` (for re-quantizing), `llama-perplexity` (examples below)
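For concreteness, the standard invocations against the exported file would look like this (post-v0.2; qwen3-uc.gguf follows the export example above, and wiki.test.raw stands in for whatever perplexity corpus you use):

```bash
./llama-bench -m qwen3-uc.gguf                          # throughput benchmark
./llama-perplexity -m qwen3-uc.gguf -f wiki.test.raw    # quality measurement
./llama-quantize qwen3-uc.gguf qwen3-uc-q2_k.gguf Q2_K  # re-quantize further
```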