
Integration — llama.cpp / GGUF

Native llama.cpp / GGUF support is on the v0.2 roadmap (target Q3 2026). Until then, there are two integration paths:

  1. Use the UltraCompress reference loader in Python (available today)
  2. Convert UltraCompress artifacts to GGUF with a converter we will publish in v0.2

This page describes both.

Current path (v0.1, mid-2026)

pip install "ultracompress[torch]"
huggingface-cli download SipsaLabs/<repo-id> --local-dir ./models/sipsalabs_<model-id>

Inference via the Python reference loader:

from transformers import AutoTokenizer
from ultracompress_cli.loader import load_model  # available v0.1.1+

# Local directory that holds the downloaded artifact
local = "./models/sipsalabs_<model-id>"
tokenizer = AutoTokenizer.from_pretrained(local)
model = load_model(local).cuda()

# Standard transformers generate; cap the output length explicitly
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

This works for evaluation, prototyping, and integration testing. It is not a production-grade inference path, and it does not run on the hardware llama.cpp targets.

v0.2 path (Q3 2026 roadmap)

We will ship uc export --format gguf that converts an UltraCompress artifact to GGUF format compatible with llama.cpp:

uc export ./models/sipsalabs_<model-id> --format gguf -o qwen3-uc.gguf
./llama-cli -m qwen3-uc.gguf -p "Hello"

The exported GGUF file will:

  • Inflate the UltraCompress weights into one of llama.cpp's native quantization formats (for 5-bpw artifacts, likely Q5_K_S or Q5_K_M, or a new format we contribute)
  • Preserve the ultracompress.json provenance via custom GGUF metadata keys (general.ultracompress.bpw, general.ultracompress.method, general.ultracompress.patents)
  • Be a drop-in replacement for any other GGUF model in your llama.cpp pipeline
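
Once an exported file exists, the provenance keys listed above could be read back without any llama.cpp tooling. The following is a minimal sketch, not part of UltraCompress: it parses only the GGUF header (version 2 or later, little-endian) using the standard library. The function names are hypothetical, and the value types the exporter will use for each key (for example, a float for bpw and a string for method) are assumptions, since the exporter has not shipped.

```python
import io
import struct

# GGUF scalar value-type IDs mapped to struct format codes (little-endian).
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format code."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, item_type) for _ in range(count)]
    return _read(f, _SCALARS[vtype])


def read_gguf_metadata(f):
    """Return the key/value metadata of a GGUF file opened in binary mode."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit lengths; not handled here")
    _read(f, "Q")  # tensor count, not needed for metadata
    kv_count = _read(f, "Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "I")
        meta[key] = _read_value(f, vtype)
    return meta


def ultracompress_provenance(meta):
    """Keep only general.ultracompress.* keys, with the prefix stripped."""
    prefix = "general.ultracompress."
    return {k[len(prefix):]: v for k, v in meta.items() if k.startswith(prefix)}
```

To use it, open the exported file in binary mode, for example open("qwen3-uc.gguf", "rb"), pass the handle to read_gguf_metadata, and pass the result to ultracompress_provenance. Only the header is read, so the check stays cheap even for multi-gigabyte files.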

Why we don't ship native llama.cpp support today

The UltraCompress weight format is not a llama.cpp-native quantization scheme. To run efficiently on llama.cpp, we would need to either:

  • Contribute new ggml quantization types upstream (slow, ~6-12 months of upstream review)
  • Inflate to existing types at export time (faster, but loses some compression efficiency)

We're pursuing both paths. The export-to-existing-types path lands first (v0.2, Q3 2026); the upstream contribution will follow on a timeline set by upstream review.

Memory footprint after GGUF export

Inflating UltraCompress's 5-bpw representation to llama.cpp's nearest existing type:

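| UltraCompress source | llama.cpp target    | Memory after inflation (1.7B-parameter model) | Compression vs FP16 |
|----------------------|---------------------|-----------------------------------------------|---------------------|
| 5 bpw                | Q5_K_M (~5.5 bpw)   | ~1.18 GB                                      | 2.9×                |
| 5 bpw                | Q5_K_S (~5.1 bpw)   | ~1.10 GB                                      | 3.1×                |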

We give up some of the on-disk compression efficiency in exchange for native llama.cpp inference speed. This is the right tradeoff for almost all production deployment scenarios; for storage-bound use cases (e.g., mobile distribution where the artifact is downloaded once), keep the native UltraCompress format.
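
The figures above follow from simple arithmetic: weight storage is roughly parameters × bits per weight ÷ 8, and the compression ratio is 16 ÷ bits per weight. The sketch below reproduces that estimate so you can plug in your own model size. The function names are hypothetical, the parameter count of exactly 1.7e9 is an assumption, and the estimate ignores file metadata and any tensors kept at higher precision, which is presumably why it comes out slightly below the memory figures quoted above.

```python
FP16_BITS = 16  # bits per weight of the uncompressed FP16 baseline


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def compression_vs_fp16(bits_per_weight: float) -> float:
    """Size ratio of FP16 weights to weights stored at the given bpw."""
    return FP16_BITS / bits_per_weight


# bpw values are the approximate ones quoted on this page.
for name, bpw in [("UltraCompress native", 5.0),
                  ("Q5_K_S", 5.1),
                  ("Q5_K_M", 5.5)]:
    size = weight_footprint_gb(1.7e9, bpw)
    ratio = compression_vs_fp16(bpw)
    print(f"{name:22s} {size:.2f} GB  {ratio:.1f}x vs FP16")
```

The same two functions show what staying in the native 5-bpw format saves for storage-bound cases: the gap between the native row and the Q5_K_M row is the cost of inflation.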

What you can do today (mid-2026)

  • Use the Python loader for evaluation and prototyping
  • Run your own lm-eval-harness (or equivalent) against the reconstructed model to compare compressed vs FP16 quality on your tasks
  • Open a GitHub issue with your specific llama.cpp use case so we prioritize the export path correctly

What you'll be able to do post-Q3 2026

  • uc export --format gguf to convert artifacts
  • Direct llama-cli / llama-server use on the exported GGUF
  • All standard llama.cpp tooling: llama-bench, llama-quantize (for re-quantizing), llama-perplexity

See also