Integration — NVIDIA TensorRT-LLM¶
NVIDIA TensorRT-LLM is the production-grade inference runtime for NVIDIA GPUs, with deep integration into NVIDIA Inference Microservices (NIM), Triton Inference Server, and the broader NVIDIA serving stack.
Native UltraCompress support in TensorRT-LLM is on our v0.2 roadmap (target Q3 2026). Today, the integration story is similar to llama.cpp and vLLM: use the reference loader for evaluation, await the export path for production.
Current path (v0.1, mid-2026)¶
If you're evaluating UltraCompress on NVIDIA hardware, use the UltraCompress reference loader directly. It runs on CUDA and gives you correct (though slower than TensorRT-LLM) inference for evaluation purposes.
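A minimal sketch of an evaluation run, assuming a hypothetical `uc run` subcommand (the subcommand and flags shown here are illustrative, not a committed CLI surface):

```bash
# Hypothetical evaluation run through the v0.1 reference loader.
# "uc run" and its flags are illustrative; check `uc --help` for
# the actual CLI surface of your install.
uc run ./models/sipsalabs_<model-id> \
  --device cuda:0 \
  --prompt "Explain KV-cache reuse in two sentences." \
  --max-new-tokens 128
```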
For production-grade throughput on NVIDIA GPUs today, inflate to FP16 and use TensorRT-LLM's standard FP16 or INT4 path. You give up the compression at runtime; you keep it on disk and at distribution time.
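A sketch of that flow, assuming a hypothetical `uc inflate` subcommand; the `trtllm-build` step is TensorRT-LLM's own tooling, and its exact flags vary by release:

```bash
# 1. Inflate the UltraCompress artifact back to FP16 weights.
#    "uc inflate" is illustrative, not a committed CLI surface.
uc inflate ./models/sipsalabs_<model-id> -o ./fp16-checkpoint

# 2. Build a standard FP16 engine with TensorRT-LLM's tooling.
#    You may need TensorRT-LLM's checkpoint-conversion step first,
#    and flag names differ across releases; consult its docs.
trtllm-build --checkpoint_dir ./fp16-checkpoint \
             --output_dir ./engine-fp16 \
             --gemm_plugin float16
```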
v0.2 path (Q3 2026 roadmap)¶
We will ship `uc export --format trtllm-engine`, which produces a TensorRT-LLM-compatible inference engine:
```bash
uc export ./models/sipsalabs_<model-id> \
  --format trtllm-engine \
  --target-arch sm_90 \
  -o qwen3-uc.engine
```
The resulting engine file will then be loadable directly by TensorRT-LLM's standard runtime.
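As an illustration, assuming TensorRT-LLM's example runner script (entry points and paths vary by TensorRT-LLM release):

```bash
# Run the exported engine through TensorRT-LLM's example runner.
# Script location and flags follow one common repo layout and may
# differ in your release; the engine directory holds qwen3-uc.engine.
python3 TensorRT-LLM/examples/run.py \
  --engine_dir ./engines/qwen3-uc \
  --tokenizer_dir ./models/sipsalabs_<model-id> \
  --input_text "Hello" \
  --max_output_len 64
```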
The export path will:
- Convert UltraCompress weights into TensorRT-LLM's native quantization format (most likely INT4-AWQ-style or W4A8-INT8-style, depending on target arch)
- Preserve UltraCompress provenance in the engine's metadata (`engine.metadata.ultracompress = {bpw, method, patents}`; see the sketch after this list)
- Be optimized for the target SM architecture (Ada / Hopper / Blackwell)
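As an example of that provenance block, here is what a hypothetical `uc inspect` subcommand (illustrative only) could print; the field values are placeholders:

```bash
# Illustrative only: "uc inspect" is a hypothetical subcommand.
uc inspect qwen3-uc.engine --metadata
# Expected shape of the provenance block:
# {
#   "ultracompress": {
#     "bpw": <bits-per-weight>,
#     "method": "<compression-method-id>",
#     "patents": ["64/049,511", "64/049,517"]
#   }
# }
```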
Memory + speed at production scale¶
TensorRT-LLM is the strongest existing production inference path for NVIDIA GPUs. Once we land the v0.2 export, expected numbers:
| Variant | TRT-LLM inference latency (1.7B model, batch=1, A100) | Memory |
|---|---|---|
| FP16 | ~12 ms / token | ~3.5 GB |
| INT4-AWQ (existing) | ~6 ms / token | ~0.9 GB |
| UltraCompress export → INT4-AWQ-on-NVIDIA | ~6 ms / token | ~0.6 GB |
The runtime memory delta on NVIDIA vs INT4-AWQ is modest (~30%: 0.6 GB vs 0.9 GB in the table above); the distribution-time advantage is much larger (the downloadable artifact is ~2.7× smaller).
For chip vendors and OEMs targeting non-NVIDIA inference paths (Snapdragon, Apple Silicon, etc.), the memory and distribution advantages are larger; the TensorRT-LLM path is mainly relevant for cloud customers.
NVIDIA Inference Microservices (NIM)¶
NIM bundles TensorRT-LLM with a standardized serving API. Once we land the v0.2 TensorRT-LLM engine export, customers will be able to:
```bash
# Build a NIM container with our engine
docker build -f Dockerfile.nim -t mycorp/qwen3-uc-nim .

# Serve via NIM's standard chat-completion API
docker run --gpus all -p 8000:8000 mycorp/qwen3-uc-nim
```
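Once the container is up, you can exercise it through NIM's OpenAI-compatible chat-completions endpoint; the model name below is illustrative:

```bash
# Query the served model; NIM exposes an OpenAI-compatible API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-uc",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```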
This is the deployment path we'll recommend for cloud-native enterprise customers.
What you can do today (mid-2026)¶
- Run UltraCompress through the reference loader on CUDA for evaluation
- Inflate to FP16 and use TensorRT-LLM's standard FP16 path for production-quality throughput (loses compression at runtime)
- Open a GitHub issue with your specific TensorRT-LLM target arch + workload so we prioritize the v0.2 export correctly
What you'll be able to do post-Q3 2026¶
- `uc export --format trtllm-engine` for direct TensorRT-LLM integration
- NIM-compatible container distribution
- INT4-AWQ-comparable inference latency with smaller download artifacts
A note on patent-licensing alignment¶
The UltraCompress methods are patent-pending (USPTO 64/049,511 + 64/049,517). NVIDIA TensorRT-LLM is a closed-source NVIDIA product. Our integration works at the export-format level — we produce a TensorRT-LLM-compatible engine — without requiring NVIDIA to integrate or license our methods.
This is an important architectural choice. Customers using TensorRT-LLM directly are unaffected by our patents (they're using NVIDIA's existing W4A16 / INT4 AWQ paths). Customers using our exported artifact through TensorRT-LLM are using our patented methods at the artifact-production stage and need a license; the runtime is NVIDIA's, the artifact is ours.
If you're a chip vendor or hyperscaler who wants to integrate UltraCompress more deeply into your inference stack, email legal@sipsalabs.com.