vLLM vs Triton vs SGLang: an architect's guide to choosing a production inference runtime.
A technical comparison for engineers and architects deciding how to serve open-weights LLMs at scale. Focus: KV-cache management, prefix caching, scheduling, and where each runtime actually wins.
Three runtimes, three philosophies.
vLLM, NVIDIA Triton, and SGLang are often discussed as alternatives, but they solve overlapping problems from different starting points. Picking the wrong one costs throughput, latency, or operational sanity.
vLLM: Throughput-first LLM engine
Built around PagedAttention and continuous batching. The de-facto open-source baseline for serving Llama, Qwen, Mistral, DeepSeek, and friends on a single model endpoint.
Triton: General-purpose model server
A model-agnostic server with pluggable backends — TensorRT-LLM, vLLM, ONNX, PyTorch, Python. Strongest when you serve many model types and need ensembles, routing, and a single ops surface.
SGLang: Structured-generation runtime
RadixAttention shares KV-cache across requests with common prefixes. A frontend DSL makes constrained, multi-turn, tool-using programs first-class — not retrofitted.
Where most of the throughput is actually won.
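LLM decoding is usually memory-bandwidth bound long before it is compute bound: every generated token re-reads the model weights and the KV-cache of each sequence in the batch. How the runtime lays out, reuses, and evicts KV-cache blocks therefore dominates real-world tokens/sec.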
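A rough way to see the bandwidth ceiling is to count the bytes one decode step has to read. The sketch below is back-of-envelope arithmetic, not a benchmark: the model shape, the 16 GB of weights, the 2 TB/s of HBM bandwidth, and the function names are all hypothetical values chosen for illustration, and it ignores compute, kernel overheads, and any overlap.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache for one token: a K and a V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def decode_step_seconds(weight_bytes: float, kv_bytes_read: float,
                        hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step if memory traffic were the only cost."""
    return (weight_bytes + kv_bytes_read) / hbm_bytes_per_s


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, FP16 cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Hypothetical deployment: 16 GB of weights, 2 TB/s HBM,
# 32 concurrent sequences with 4096 tokens of context each.
weight_bytes = 16e9
bandwidth = 2e12
batch, context = 32, 4096
kv_read = batch * context * per_token

step = decode_step_seconds(weight_bytes, kv_read, bandwidth)
tokens_per_s_ceiling = batch / step  # one new token per sequence per step

print(f"KV per token: {per_token / 1024:.0f} KiB")
print(f"KV read per step: {kv_read / 1e9:.1f} GB")
print(f"Ceiling: {tokens_per_s_ceiling:.0f} tokens/s")
```

Under these assumed numbers the KV-cache read per step (about 17 GB) is already larger than the weight read (16 GB), which is why cache layout and reuse matter as much as kernel speed.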
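PagedAttention (vLLM): the KV-cache is split into fixed-size blocks that are allocated on demand. This removes external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence. It also allows batching across very different sequence lengths and enables copy-on-write sharing for beam search and parallel sampling.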
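The block-table idea can be sketched in a few lines. This is a toy model of paged allocation with reference counting and copy-on-write, not vLLM's actual implementation: `BlockPool`, `Sequence`, and the block size of 16 are hypothetical names and values, the tensor data itself is not modelled, and an exhausted pool simply raises `IndexError`.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class BlockPool:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refs = {}  # physical block id -> number of sequences using it

    def alloc(self) -> int:
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refs[block] += 1

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]
            self.free.append(block)


@dataclass
class Sequence:
    pool: BlockPool
    block_table: list = field(default_factory=list)  # logical index -> physical block
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # No block yet, or the last one is full: allocate on demand.
            self.block_table.append(self.pool.alloc())
        elif self.pool.refs[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared,
            # so this sequence takes a private block before writing.
            self.pool.release(self.block_table[-1])
            self.block_table[-1] = self.pool.alloc()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Create a sibling (e.g. a second sample) that shares every block."""
        for block in self.block_table:
            self.pool.share(block)
        return Sequence(self.pool, list(self.block_table), self.num_tokens)


pool = BlockPool(num_blocks=8)
parent = Sequence(pool)
for _ in range(20):
    parent.append_token()      # 20 tokens -> one full block plus one partial block

child = parent.fork()          # shares both blocks, no copy yet
child.append_token()           # writes into the partial block -> copy-on-write
```

After the fork and one extra token, the two sequences still share the full first block and differ only in the last one, so wasted space per sequence stays below one block.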
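RadixAttention (SGLang): KV blocks are organised as a radix tree keyed by token prefix. Two requests that begin with the same system prompt, few-shot block, or retrieved chunk reuse the same physical KV across requests, not just within one. The shared text has to sit at the start of the prompt, because the tree matches prefixes only.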
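Prefix matching of this kind can be illustrated with a plain token-level trie. This is a deliberate simplification and not SGLang's data structure: a real radix tree compresses runs of tokens into single edges, stores references to KV blocks on its nodes, and evicts cold branches. `PrefixCache` and its methods are hypothetical names.

```python
class PrefixNode:
    def __init__(self) -> None:
        self.children = {}  # token id -> PrefixNode


class PrefixCache:
    """Token-level trie standing in for a radix tree of cached KV."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match(self, tokens: list) -> int:
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens: list) -> None:
        """Record that KV for this token sequence is now cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]
request_a = system_prompt + [10, 11]
request_b = system_prompt + [20, 21, 22]

hit_a = cache.match(request_a)  # cold cache: nothing to reuse
cache.insert(request_a)
hit_b = cache.match(request_b)  # reuses the four system-prompt tokens
```

Only the tokens after the matched prefix need prefill, so the saving grows with the length of the shared prefix relative to the whole prompt.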
TensorRT-LLM (typically served through Triton): in-flight batching with paged KV and explicit prefix reuse. Tight kernel fusion on NVIDIA hardware often gives the best single-request latency, at the cost of a heavier build and engine-compilation step.
Capability matrix.
| Dimension | vLLM | Triton | SGLang |
|---|---|---|---|
| Primary role | Throughput-first LLM engine | General-purpose model server | Structured-generation runtime |
| Scheduler | Continuous batching, PagedAttention | Dynamic batcher; LLM backends (TRT-LLM, vLLM) plug in | RadixAttention + continuous batching |
| KV-cache | Paged blocks, near-zero fragmentation | Inherits backend (TRT-LLM paged, vLLM paged) | Radix tree of shared KV blocks |
| Prefix caching | Automatic prefix cache (hash-keyed) | Backend-dependent; TRT-LLM has reuse | First-class via RadixAttention; highest reuse |
| Structured output | Guided decoding (outlines, xgrammar) | Custom pre/post-processing | Native DSL: regex, JSON schema, choices, control flow |
| Multi-model serving | One engine per model | Strong: ensembles, multi-backend, model repo | One runtime per model; routers compose |
| Hardware | NVIDIA, AMD ROCm, TPU, Intel | NVIDIA-first, broad backend matrix | NVIDIA + AMD; FlashInfer kernels |
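The "hash-keyed" entry in the prefix-caching row can be pictured as chained block hashes: each full block is hashed together with the hash of the block before it, so a hash identifies the entire prefix up to that point. The sketch below shows that general idea only. It is not vLLM's code, and the block size of 4, the use of SHA-256, and the function names are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (illustrative value)


def block_hashes(tokens: list, block_size: int = BLOCK_SIZE) -> list:
    """Chain-hash each full block so a hash covers the whole prefix before it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_blocks(tokens: list, cache: set) -> int:
    """Count leading blocks whose chained hash is already in the cache."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits


shared = [1, 2, 3, 4, 5, 6, 7, 8]
request_a = shared + [9, 10, 11, 12]
request_b = shared + [13, 14, 15, 16, 17]
request_c = [0, 2, 3, 4, 5, 6, 7, 8]   # differs in the very first token

cache = set(block_hashes(request_a))
hits_b = cached_prefix_blocks(request_b, cache)  # the two shared blocks
hits_c = cached_prefix_blocks(request_c, cache)  # chain broken at block 0
```

Because the hashes are chained, `request_c` gets no hits even though its second block holds the same tokens as the cached one: reuse is by prefix, not by content alone.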
Where each runtime actually wins.
Public benchmarks shift release to release. The shape of the win is more stable than the numbers — anchor decisions on workload shape, not on last quarter's chart.
| Workload | Typical winner | Why |
|---|---|---|
| Long shared prefix (agents, RAG) | SGLang | RadixAttention reuses KV across requests, not just within a sequence. |
| High-concurrency single-model serving | vLLM / SGLang | Both use continuous batching; gap narrows as batch fills. |
| Latency-sensitive single request | TRT-LLM via Triton | Compiled kernels and in-flight batching on Hopper / Blackwell. |
| Mixed-model production fleet | Triton | Model repository, ensembles, and metrics designed for it. |
| Strict JSON / grammar output | SGLang | Constrained decoding is native, not bolted on. |
When to choose what.
Choose vLLM when
You serve one or a few open-weights models, want the simplest path to high throughput, and need broad hardware portability. The default for most production OSS LLM deployments.
Choose Triton when
You operate a heterogeneous fleet — LLMs alongside embedding, vision, ranking, or classical models — and need ensembles, A/B routing, BLS pipelines, and a single ops surface across model types.
Choose SGLang when
Your workload is agentic, RAG-heavy, tool-using, or constrained JSON. RadixAttention turns shared prefixes (system prompts, few-shot, retrieved context) into a measurable throughput win.
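The three rules above can be written down as a small decision helper. This is only a restatement of the guidance in code form, under assumptions: the `Workload` fields are hypothetical, and the 0.5 threshold for "prefix-heavy" is an arbitrary placeholder that should be replaced by a measurement from your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    model_types: int = 1                  # distinct model kinds served (LLM, embedding, vision, ...)
    shared_prefix_fraction: float = 0.0   # share of prompt tokens common across requests
    needs_constrained_output: bool = False  # strict JSON / grammar output


def suggest_runtime(w: Workload, prefix_threshold: float = 0.5) -> str:
    """First-pass suggestion only; validate with a benchmark at your own SLO."""
    if w.model_types > 1:
        return "Triton"    # heterogeneous fleet, single ops surface
    if w.needs_constrained_output or w.shared_prefix_fraction >= prefix_threshold:
        return "SGLang"    # agentic, RAG-heavy, or constrained output
    return "vLLM"          # simplest path to high throughput


print(suggest_runtime(Workload()))
print(suggest_runtime(Workload(model_types=3)))
print(suggest_runtime(Workload(shared_prefix_fraction=0.8)))
```

The ordering encodes one design choice: fleet heterogeneity is checked first, since it constrains the serving layer regardless of which LLM engine sits behind it.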
What goes wrong in production.
Most runtime regrets come from benchmarking the wrong thing, not from picking the wrong runtime.
- 01. Benchmarking with a cold prefix cache: SGLang and vLLM both look worse than they are in steady state.
- 02. Comparing throughput at the same batch size instead of the same p95 latency budget; the right question is tokens/sec at your SLO.
- 03. Ignoring KV-cache memory pressure: long-context workloads are bound by HBM, not FLOPs. Paged and radix layouts change the ceiling.
- 04. Wrapping vLLM inside Triton 'because Triton' when a single-model deployment would be simpler and faster to operate.
- 05. Treating prefix caching as free: it requires deterministic, shared prefixes. Per-user system prompts defeat it.
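The second pitfall, comparing at equal batch size rather than at an equal latency budget, is easy to avoid once the comparison is written as a filter followed by a maximum. The sketch below assumes each benchmark run has been summarised as a dict with a throughput figure and per-request latencies. The field names, the nearest-rank percentile, and the sample numbers are all made up for illustration.

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile (integer ceiling avoids float rounding)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def best_under_slo(runs: list, p95_budget_ms: float):
    """Highest-throughput run among those that meet the p95 latency budget."""
    eligible = [r for r in runs if p95(r["latencies_ms"]) <= p95_budget_ms]
    return max(eligible, key=lambda r: r["tokens_per_s"], default=None)


# Hypothetical measurements for one runtime at three concurrency settings.
runs = [
    {"name": "batch-8",   "tokens_per_s": 900,  "latencies_ms": [100] * 19 + [150]},
    {"name": "batch-32",  "tokens_per_s": 2400, "latencies_ms": [220] * 19 + [400]},
    {"name": "batch-128", "tokens_per_s": 4100, "latencies_ms": [600] * 20},
]

best = best_under_slo(runs, p95_budget_ms=250)
print(best["name"] if best else "no configuration meets the SLO")
```

With a 250 ms budget the highest raw throughput (batch-128) is disqualified, and the comparison between runtimes should be made on the best eligible configuration of each.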