Integration — Hugging Face Transformers

UltraCompress artifacts are designed to work with the existing Hugging Face Transformers ecosystem. As of v0.1, you load them via the ultracompress package's loader; native Transformers support is on the v0.2 roadmap.

v0.1 (current) — using the UltraCompress loader

from ultracompress_cli.loader import load_model  # available in v0.1.1+
from transformers import AutoTokenizer

model_path = "./models/sipsalabs_<model-id>"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = load_model(model_path)  # returns a torch.nn.Module
model.to("cuda")  # ensure the module is on the same device as the inputs (no-op if already on GPU)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The loader inflates the compressed weights to a runnable representation. Inference speed will be similar to FP16 (the compression is for memory, not speed) until we ship the FP4 / NF4 / quantized-runtime kernel path in v0.2.
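
If you want to confirm what the loader produced, the module can be inspected like any other PyTorch model. A minimal sketch, assuming the default inflation target is FP16 (an assumption, consistent with the speed note above):

# Inspect the inflated module; the FP16 dtype below is an assumption
# based on the default inflation described above.
p = next(model.parameters())
print(p.dtype, p.device)  # expected: torch.float16 cuda:0
print(f"{sum(t.numel() for t in model.parameters()) / 1e9:.2f}B parameters")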

v0.2 (Q3 2026 roadmap) — native AutoModel support

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sipsalabs/<model-id>",
    quantization_config="ultracompress",
)
tokenizer = AutoTokenizer.from_pretrained("sipsalabs/<model-id>")

This will require either:

- an ultracompress quantization config registered through Transformers' BNB/HQQ-style plugin pattern, or
- a custom from_pretrained extension shipped in our package (a minimal sketch of this option follows).
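
Until then, a thin wrapper over the v0.1 loader approximates the second option. This sketch is illustrative only: ultracompress_from_pretrained is a hypothetical helper name, and it assumes load_model accepts the downloaded snapshot directory the same way it accepts the local path in the v0.1 example above.

from huggingface_hub import snapshot_download
from ultracompress_cli.loader import load_model  # v0.1.1+

def ultracompress_from_pretrained(repo_id: str):
    """Hypothetical helper: fetch an UltraCompress artifact from the Hub
    and load it with the v0.1 loader. Assumes load_model accepts the
    snapshot directory, as it does for local paths above."""
    local_dir = snapshot_download(repo_id)
    return load_model(local_dir)

model = ultracompress_from_pretrained("sipsalabs/<model-id>")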

We're working with the Hugging Face team on the right integration shape.

Memory savings vs. FP16

| Variant | Memory at runtime |
| --- | --- |
| Qwen3-1.7B FP16 | ~3.5 GB |
| Qwen3-1.7B int8 (bitsandbytes) | ~1.8 GB |
| Qwen3-1.7B NF4 (bitsandbytes) | ~1.0 GB |
| Qwen3-1.7B UltraCompress 2.798 bpw | ~0.7 GB |

(All measured on a CUDA device with the loader's default inflation; "memory at runtime" is torch.cuda.memory_allocated() after model load.)
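
To reproduce the measurement on your own hardware, the same figure can be read back with torch.cuda.memory_allocated(). A sketch (absolute values will vary with driver, allocator state, and model):

import torch
from ultracompress_cli.loader import load_model

torch.cuda.empty_cache()  # start from a clean allocator state
model = load_model("./models/sipsalabs_<model-id>").to("cuda")
# "Memory at runtime" in the table above is this figure, taken right after load.
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")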

Inference speed

UltraCompress v0.1 inflates compressed weights at load time, so runtime inference speed is similar to FP16. The expected 2-3× speedup arrives in v0.2, when the quantized-runtime kernel path ships.
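
You can sanity-check the "similar to FP16" claim with a rough token-throughput measurement. A sketch, reusing the model and tokenizer from the v0.1 example (a real benchmark would average several runs and pin the prompt length):

import time
import torch

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda")
_ = model.generate(**inputs, max_new_tokens=16)  # warm up kernels and allocator
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")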

Compatibility

Tested with:

- transformers >= 4.45.0
- torch >= 2.4.0
- safetensors >= 0.4.5
- huggingface_hub >= 0.24.0

Reach out if you need older-version compatibility.
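
To check an existing environment against these floors, a standard-library sketch:

from importlib.metadata import version

floors = {
    "transformers": "4.45.0",
    "torch": "2.4.0",
    "safetensors": "0.4.5",
    "huggingface_hub": "0.24.0",
}
for pkg, minimum in floors.items():
    # printed for manual comparison; use packaging.version for a strict check
    print(f"{pkg}: installed {version(pkg)} (need >= {minimum})")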

See also