Skip to content

Integration — NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM is the production-grade NVIDIA-GPU inference runtime, with deep integration into NVIDIA Inference Microservices (NIM), Triton Inference Server, and the broader NVIDIA serving stack.

Native UltraCompress support in TensorRT-LLM is on our v0.2 roadmap (target Q3 2026). Today, the integration story is similar to llama.cpp and vLLM: use the reference loader for evaluation, await the export path for production.

Current path (v0.1, mid-2026)

If you're evaluating UltraCompress on NVIDIA hardware:

pip install "ultracompress[torch]"
uc pull sipsalabs/<model-id>

Then use the UltraCompress reference loader directly. It runs on CUDA and gives you correct (though slower than TensorRT-LLM) inference for evaluation purposes.

For production-grade throughput on NVIDIA GPUs today: inflate to FP16 and use TensorRT-LLM's standard FP16 or INT4 path. You give up the compression at runtime; you keep it on-disk + at distribution time.

v0.2 path (Q3 2026 roadmap)

We will ship uc export --format trtllm-engine that produces a TensorRT-LLM-compatible inference engine:

uc export ./models/sipsalabs_<model-id> \
    --format trtllm-engine \
    --target-arch sm_90 \
    -o qwen3-uc.engine

Engine file is then loadable directly by TensorRT-LLM's standard runtime:

from tensorrt_llm import Runner
runner = Runner.from_engine("qwen3-uc.engine")

The export path will:

  • Convert UltraCompress weights into TensorRT-LLM's native quantization format (mostly likely INT4-AWQ-style or W4A8-INT8-style depending on target arch)
  • Preserve UltraCompress provenance in the engine's metadata (engine.metadata.ultracompress = {bpw, method, patents})
  • Be optimized for the target SM architecture (Ada / Hopper / Blackwell)

Memory + speed at production scale

TensorRT-LLM is the strongest existing production inference path for NVIDIA GPUs. Once we land the v0.2 export, expected numbers:

Variant TRT-LLM inference latency (1.7B model, batch=1, A100) Memory
FP16 ~12 ms / token ~3.5 GB
INT4-AWQ (existing) ~6 ms / token ~0.9 GB
UltraCompress export → INT4-AWQ-on-NVIDIA ~6 ms / token ~0.6 GB

The on-NVIDIA-runtime memory delta vs INT4-AWQ is modest (~30%); the distribution-time advantage is much larger (downloadable artifact is ~2.7× smaller).

For chip vendors and OEMs targeting non-NVIDIA inference paths (Snapdragon, Apple Silicon, etc.), the memory and distribution advantages are larger; the TensorRT-LLM path is mainly relevant for cloud customers.

NVIDIA Inference Microservices (NIM)

NIM bundles TensorRT-LLM with a standardized serving API. Once we land the v0.2 TensorRT-LLM engine export, customers will be able to:

# Build a NIM container with our engine
docker build -f Dockerfile.nim -t mycorp/qwen3-uc-nim .
# Serve via NIM's standard chat-completion API
docker run --gpus all -p 8000:8000 mycorp/qwen3-uc-nim

This is the deployment path we'll recommend for cloud-native enterprise customers.

What you can do today (mid-2026)

  • Run UltraCompress through the reference loader on CUDA for evaluation
  • Inflate to FP16 and use TensorRT-LLM's standard FP16 path for production-quality throughput (loses compression at runtime)
  • Open a GitHub issue with your specific TensorRT-LLM target arch + workload so we prioritize the v0.2 export correctly

What you'll be able to do post-Q3 2026

  • uc export --format trtllm-engine for direct TensorRT-LLM integration
  • NIM-compatible container distribution
  • INT4-AWQ-comparable inference latency with smaller download artifacts

A note on patent-licensing alignment

The UltraCompress methods are patent-pending (USPTO 64/049,511 + 64/049,517). NVIDIA TensorRT-LLM is a closed-source NVIDIA product. Our integration works at the export-format level — we produce a TensorRT-LLM-compatible engine — without requiring NVIDIA to integrate or license our methods.

This is an important architectural choice. Customers using TensorRT-LLM directly are unaffected by our patents (they're using NVIDIA's existing W4A16 / INT4 AWQ paths). Customers using our exported artifact through TensorRT-LLM are using our patented methods at the artifact-production stage and need a license; the runtime is NVIDIA's, the artifact is ours.

If you're a chip vendor or hyperscaler who wants to integrate UltraCompress more deeply into your inference stack, email legal@sipsalabs.com.

See also