Inference runtimes

vLLM vs Triton vs SGLang: an architect's guide to choosing a production inference runtime.

A technical comparison for engineers and architects deciding how to serve open-weights LLMs at scale. Focus: KV-cache management, prefix caching, scheduling, and where each runtime actually wins.

Framing

Three runtimes, three philosophies.

vLLM, NVIDIA Triton, and SGLang are often discussed as alternatives, but they solve overlapping problems from different starting points. Picking the wrong one costs throughput, latency, or operational sanity.

vLLM

Throughput-first LLM engine

Built around PagedAttention and continuous batching. The de-facto open-source baseline for serving Llama, Qwen, Mistral, DeepSeek, and friends on a single model endpoint.
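To see why continuous batching matters, the toy simulation below counts decode steps for the same request mix under static and continuous batching. It is a sketch under simplifying assumptions: every running request produces one token per step, prefill cost is ignored, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens) for six requests.
lengths = [2, 8, 2, 8, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With this mix, static batching leaves a slot idle while the short request waits for the long one; continuous batching keeps both slots busy.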

NVIDIA Triton

General model server

A model-agnostic server with pluggable backends — TensorRT-LLM, vLLM, ONNX, PyTorch, Python. Strongest when you serve many model types and need ensembles, routing, and a single ops surface.

SGLang

Structured-generation runtime

RadixAttention shares KV-cache across requests with common prefixes. A frontend DSL makes constrained, multi-turn, tool-using programs first-class — not retrofitted.
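The idea behind constrained output can be illustrated without any runtime: at each step, only continuations that keep the output inside the allowed set are eligible, and the model's scores choose among them. The sketch below is a toy, not SGLang's DSL or implementation. It uses single characters as stand-in tokens, and `toy_score` is a hypothetical stand-in for model logits.

```python
def allowed_next(prefix, choices):
    """Characters that keep `prefix` a valid prefix of some allowed choice."""
    return sorted({c[len(prefix)] for c in choices
                   if c.startswith(prefix) and len(c) > len(prefix)})


def constrained_greedy(score, choices):
    """Greedy decoding restricted to the allowed continuations.

    Simplification: decoding stops only when no continuation remains, so a
    choice that is a strict prefix of another choice is never selected.
    """
    out = ""
    while True:
        allowed = allowed_next(out, choices)
        if not allowed:
            return out
        out += max(allowed, key=lambda ch: score(out, ch))


def toy_score(prefix, ch):
    """Hypothetical model that 'wants' to emit the unconstrained text 'nope'."""
    wanted = "nope"
    return 1.0 if len(prefix) < len(wanted) and wanted[len(prefix)] == ch else 0.0


result = constrained_greedy(toy_score, ["yes", "no", "maybe"])
print(result)
```

The unconstrained preference would run past the allowed set; the mask forces the output to land on one of the listed choices.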

KV-cache

Where most of the throughput is actually won.

LLM decoding is memory-bandwidth bound long before it is compute bound: every generated token re-reads the model weights and the KV-cache of its sequence. How the runtime lays out, reuses, and evicts KV-cache blocks dominates real-world tokens/sec.
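A back-of-the-envelope sizing helps make the memory pressure concrete. The sketch below computes KV-cache bytes per token for a standard attention layout (keys plus values, per layer, per KV head). The model shape and the 40 GiB budget are assumed example values, not measurements from any runtime.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes one token occupies in the KV-cache across all layers.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """How many tokens fit in the memory left over for KV-cache."""
    return kv_budget_bytes // per_token_bytes


# Assumed shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, 16-bit cache entries.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed budget: 40 GiB of HBM left after weights and activations.
budget = 40 * 1024**3
tokens = max_cached_tokens(budget, per_token)
print(per_token, tokens)
```

Under these assumptions each token costs 128 KiB, so the budget holds roughly 328k tokens in total across all concurrent sequences.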

PagedAttention (vLLM)

KV-cache is split into fixed-size blocks that are allocated on demand. This removes external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence. It also allows batching across very different sequence lengths and enables copy-on-write for beam search and parallel sampling.
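The block-table idea can be sketched in a few lines. This is an illustrative toy, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, blocks are just integer ids, and the copy-on-write step only swaps the block id (a real engine would also copy the block's contents on the GPU).

```python
class BlockAllocator:
    """Hands out fixed-size KV blocks and tracks how many sequences share each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Maps a sequence's logical token positions onto physical blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # Last block is full (or there is none): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: stop sharing the block before writing to it.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self):
        """New sequence (e.g. a second sample) sharing all existing blocks."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


alloc = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(alloc)
for _ in range(6):          # 6 tokens -> one full block and one partial block
    parent.append_token()
child = parent.fork()       # shares both blocks
child.append_token()        # writes into the partial block -> copy-on-write
print(parent.block_table, child.block_table, len(alloc.free))
```

After the fork, the full first block stays shared and only the partially filled block is duplicated, so two 6 to 7 token sequences occupy three blocks rather than four.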

RadixAttention (SGLang)

KV blocks are organised as a radix tree keyed by token prefix. Two requests sharing a system prompt, few-shot block, or retrieved chunk reuse the same physical KV — across requests, not just within one.
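Cross-request reuse can be sketched as a prefix lookup. The example below uses a plain per-token trie as a stand-in for a compressed radix tree, tracks only which prefixes are cached (not the KV tensors themselves), and omits eviction. `PrefixCache` and the token ids are hypothetical.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode


class PrefixCache:
    """Tracks which token prefixes already have KV computed."""

    def __init__(self):
        self.root = PrefixNode()

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def serve(self, tokens):
        """Return (tokens reused from cache, tokens that need prefill)."""
        reused = self.match(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5]           # shared by both requests
first = cache.serve(system_prompt + [10, 11])
second = cache.serve(system_prompt + [20, 21, 22])
print(first, second)
```

The first request pays for the whole prompt; the second only prefills the three tokens that follow the shared system prompt.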

TRT-LLM paged KV (Triton)

In-flight batching with paged KV and explicit prefix reuse. Tight kernel fusion on NVIDIA hardware typically gives the lowest single-request latency, at the cost of a heavier build and engine-compilation step.

At a glance

Capability matrix.

Dimension	vLLM	Triton	SGLang
Primary role	Throughput-first LLM engine	General-purpose model server	Structured-generation runtime
Scheduler	Continuous batching, PagedAttention	Dynamic batcher; LLM backends (TRT-LLM, vLLM) plug in	RadixAttention + continuous batching
KV-cache	Paged blocks, near-zero fragmentation	Inherits backend (TRT-LLM paged, vLLM paged)	Radix tree of shared KV blocks
Prefix caching	Automatic prefix cache (hash-keyed)	Backend-dependent; TRT-LLM has reuse	First-class via RadixAttention; highest reuse
Structured output	Guided decoding (outlines, xgrammar)	Custom pre/post-processing	Native DSL: regex, JSON schema, choices, control flow
Multi-model serving	One engine per model	Strong: ensembles, multi-backend, model repo	One runtime per model; routers compose
Hardware	NVIDIA, AMD ROCm, TPU, Intel	NVIDIA-first, broad backend matrix	NVIDIA + AMD; FlashInfer kernels
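The "hash-keyed" prefix cache in the matrix can be illustrated as follows: each full block is keyed by a hash of its own tokens chained with the hash of the block before it, so a block only hits if everything ahead of it also matches. This is a sketch of the general idea, not vLLM's actual code; the block size, hash function, and helper names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # assumed block size, in tokens


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes for every full block; a trailing partial block is skipped."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        chunk = tuple(tokens[start:start + block_size])
        digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
        hashes.append(digest)
        parent = digest  # the next block's key depends on this one
    return hashes


def cached_prefix_tokens(tokens, cache):
    """Number of leading tokens whose blocks are already in the cache."""
    hits = 0
    for digest in block_hashes(tokens):
        if digest not in cache:
            break
        hits += 1
    return hits * BLOCK_SIZE


cache = set()
cache.update(block_hashes(list(range(10))))         # caches blocks [0..3], [4..7]

same_prefix = list(range(8)) + [99, 100, 101, 102]  # first two blocks identical
changed_start = [7] + list(range(1, 10))            # differs at token 0 only

hit_same = cached_prefix_tokens(same_prefix, cache)
hit_changed = cached_prefix_tokens(changed_start, cache)
print(hit_same, hit_changed)
```

Changing a single token at the start invalidates every later block, even though the second block's own tokens are unchanged.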

Throughput

Where each runtime actually wins.

Public benchmarks shift release to release. The shape of the win is more stable than the numbers — anchor decisions on workload shape, not on last quarter's chart.

Workload	Typical winner	Why
Long shared prefix (agents, RAG)	SGLang	RadixAttention reuses KV across requests, not just within a sequence.
High-concurrency single-model serving	vLLM / SGLang	Both use continuous batching; gap narrows as batch fills.
Latency-sensitive single request	TRT-LLM via Triton	Compiled kernels and in-flight batching on Hopper / Blackwell.
Mixed-model production fleet	Triton	Model repository, ensembles, and metrics designed for it.
Strict JSON / grammar output	SGLang	Constrained decoding is native, not bolted on.

Decision

When to choose what.

Choose vLLM when

You serve one or a few open-weights models, want the simplest path to high throughput, and need broad hardware portability. The default for most production OSS LLM deployments.

Choose Triton when

You operate a heterogeneous fleet — LLMs alongside embedding, vision, ranking, or classical models — and need ensembles, A/B routing, BLS pipelines, and a single ops surface across model types.

Choose SGLang when

Your workload is agentic, RAG-heavy, tool-using, or constrained JSON. RadixAttention turns shared prefixes (system prompts, few-shot, retrieved context) into a measurable throughput win.

Pitfalls

What goes wrong in production.

Most runtime regrets come from benchmarking the wrong thing, not from picking the wrong runtime.

01 Benchmarking with a cold prefix cache: SGLang and vLLM both look worse than they are in steady state.
02 Comparing throughput at the same batch size instead of the same p95 latency budget; the right question is tokens/sec at your SLO.
03 Ignoring KV-cache memory pressure: long-context workloads are bound by HBM, not FLOPs. Paged and radix layouts change the ceiling.
04 Wrapping vLLM inside Triton 'because Triton' when a single-model deployment would be simpler and faster to operate.
05 Treating prefix caching as free: it requires deterministic, shared prefixes. Per-user system prompts defeat it, because reuse stops at the first token that differs.
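The "tokens/sec at your SLO" comparison from the list above can be sketched as a filter followed by a max. The run names, throughput figures, and latency samples below are invented for illustration; the p95 uses the nearest-rank method.

```python
def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def best_at_slo(runs, p95_budget_ms):
    """Highest-throughput run whose p95 latency fits the budget, else None."""
    eligible = [r for r in runs if p95(r["latencies_ms"]) <= p95_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["tokens_per_sec"])


# Hypothetical benchmark results: one entry per (runtime, batch size) run.
runs = [
    {"name": "runtime_a_bs32", "tokens_per_sec": 4000,
     "latencies_ms": [100] * 19 + [180]},
    {"name": "runtime_a_bs128", "tokens_per_sec": 9000,
     "latencies_ms": [300] * 18 + [900, 950]},
    {"name": "runtime_b_bs64", "tokens_per_sec": 6500,
     "latencies_ms": [200] * 19 + [450]},
]

raw_winner = max(runs, key=lambda r: r["tokens_per_sec"])["name"]
slo_winner = best_at_slo(runs, p95_budget_ms=500)["name"]
print(raw_winner, slo_winner)
```

In this made-up data the highest raw throughput comes from a run that misses a 500 ms p95 budget, so the winner changes once the SLO is applied.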

FAQ

Common questions from engineering teams.

Can vLLM run inside Triton?

Yes. Triton has a vLLM backend that wraps the vLLM engine inside Triton's model repository and ensemble framework. This makes sense when you need Triton's multi-model routing, metrics, and ops surface around a vLLM-powered LLM endpoint. For a single-model deployment, running vLLM directly is simpler and removes an abstraction layer.

Want this evaluated against your workload?

We design and review inference platforms for AI infrastructure, GPU cloud, and sovereign-AI operators — including runtime selection, KV-cache sizing, and capacity modelling.
