Inference runtimes

vLLM vs Triton vs SGLang: an architect's guide to choosing a production inference runtime.

A technical comparison for engineers and architects deciding how to serve open-weights LLMs at scale. Focus: KV-cache management, prefix caching, scheduling, and where each runtime actually wins.

Framing

Three runtimes, three philosophies.

vLLM, NVIDIA Triton, and SGLang are often discussed as alternatives, but they solve overlapping problems from different starting points. Picking the wrong one costs throughput, latency, or operational sanity.

vLLM

Throughput-first LLM engine

Built around PagedAttention and continuous batching. The de-facto open-source baseline for serving Llama, Qwen, Mistral, DeepSeek, and friends on a single model endpoint.
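To see why continuous batching matters for throughput, the toy simulation below counts decode steps for the same requests under static and continuous batching. It is a sketch under simplifying assumptions (one token per request per step, no prefill cost, no memory limit). The function names are illustrative and are not part of vLLM.

```python
def static_batching_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Continuous batching: a finished request frees its slot immediately,
    and the next waiting request takes it on the following step."""
    waiting = list(output_lens)[::-1]  # reversed so pop() serves requests in FIFO order
    running = []                       # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop())
        steps += 1
        # Each running request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
output_lens = [2, 8, 2, 8]
static_steps = static_batching_steps(output_lens, batch_size=2)
continuous_steps = continuous_batching_steps(output_lens, batch_size=2)
print(static_steps, continuous_steps)
```

In this toy case the short requests no longer wait for the long ones in their batch, so the same work finishes in fewer decode steps. The size of the gain depends on how much output lengths vary.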

NVIDIA Triton

General model server

A model-agnostic server with pluggable backends — TensorRT-LLM, vLLM, ONNX, PyTorch, Python. Strongest when you serve many model types and need ensembles, routing, and a single ops surface.

SGLang

Structured-generation runtime

RadixAttention shares KV-cache across requests with common prefixes. A frontend DSL makes constrained, multi-turn, tool-using programs first-class — not retrofitted.

KV-cache

Where most of the throughput is actually won.

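LLM decoding is memory-bandwidth bound long before it is compute bound: every generated token re-reads the model weights and the sequence's KV-cache, while prefill is the compute-heavy phase. How the runtime lays out, reuses, and evicts KV-cache blocks therefore dominates real-world tokens/sec.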
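A back-of-envelope sizing helps make the memory pressure concrete. The sketch below computes KV-cache bytes per token as one key and one value vector per layer per KV head. The model shape and the 40 GiB budget are assumed example values, not figures from any specific deployment.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache for one token: key + value, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_hbm_bytes: int, per_token_bytes: int) -> int:
    """How many tokens of KV-cache fit in the memory left after weights."""
    return free_hbm_bytes // per_token_bytes


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed budget: 40 GiB of HBM left for KV-cache.
budget = 40 * 1024**3
print(max_cached_tokens(budget, per_token))  # 81920 tokens across all sequences

# Same shape with grouped-query attention (8 KV heads) needs a quarter of the memory.
per_token_gqa = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(max_cached_tokens(budget, per_token_gqa))  # 327680 tokens
```

The total is shared by every in-flight sequence, which is why block layout and reuse set the practical concurrency ceiling.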

PagedAttention (vLLM)

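KV-cache is split into fixed-size blocks that are allocated on demand. This removes external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence. It also allows batching across very different sequence lengths and enables copy-on-write block sharing for beam search and parallel sampling.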
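The block-table idea can be sketched in a few lines. The toy allocator below tracks only block ids and reference counts (a real engine would also copy the block's tensor data on write); the class and method names are hypothetical and do not mirror vLLM's internals.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks with reference counts."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """A sequence maps logical block positions to physical block ids."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.allocator.block_size == 0:
            # No block yet, or the last block is full: allocate on demand.
            self.block_table.append(self.allocator.alloc())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partially filled block is shared, so this
                # sequence gets a private block (data copy omitted in this toy).
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.alloc()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Create a sibling (e.g. a parallel sample) that shares every block."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):          # 6 tokens -> one full block plus one partial block
    parent.append_token()
child = parent.fork()       # shares both blocks, no new memory used
child.append_token()        # writes into the shared partial block -> private copy
print(parent.block_table, child.block_table, len(allocator.free))
```

After the fork, the full first block stays shared and only the partially filled block is duplicated, which is the saving that makes parallel sampling cheap.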

RadixAttention (SGLang)

KV blocks are organised as a radix tree keyed by token prefix. Two requests sharing a system prompt, few-shot block, or retrieved chunk reuse the same physical KV — across requests, not just within one.
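A simplified picture of prefix matching: the sketch below uses a plain token-level trie rather than a compressed radix tree, stores no KV tensors, and omits eviction. It only shows how a new request discovers how many leading tokens are already cached. All names are hypothetical and are not SGLang's API.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode


class PrefixCache:
    """Token-level trie. A real radix tree would compress single-child chains
    into one edge and attach KV blocks to each node."""

    def __init__(self):
        self.root = PrefixNode()

    def match_len(self, tokens) -> int:
        """Number of leading tokens whose KV would already be cached."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens) -> int:
        """Record a request's tokens; return how many were reusable beforehand."""
        reused = self.match_len(tokens)
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
        return reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]                      # stand-in token ids
first = cache.insert(system_prompt + [10, 11])    # cold cache: nothing reusable
second = cache.insert(system_prompt + [20])       # shares the system prompt
third = cache.insert([9] + system_prompt)         # different first token: no reuse
print(first, second, third)
```

The third request shows why ordering matters: the same tokens placed after a differing token share nothing, because reuse is keyed on the prefix.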

TRT-LLM paged KV (Triton)

In-flight batching with paged KV and explicit prefix reuse. Tight kernel fusion on NVIDIA hardware typically gives the lowest single-request latency, at the cost of a heavier build and engine-compilation step.

At a glance

Capability matrix.

Dimension | vLLM | Triton | SGLang
Primary role | Throughput-first LLM engine | General-purpose model server | Structured-generation runtime
Scheduler | Continuous batching, PagedAttention | Dynamic batcher; LLM backends (TRT-LLM, vLLM) plug in | RadixAttention + continuous batching
KV-cache | Paged blocks, near-zero fragmentation | Inherits backend (TRT-LLM paged, vLLM paged) | Radix tree of shared KV blocks
Prefix caching | Automatic prefix cache (hash-keyed) | Backend-dependent; TRT-LLM has reuse | First-class via RadixAttention; highest reuse
Structured output | Guided decoding (outlines, xgrammar) | Custom pre/post-processing | Native DSL: regex, JSON schema, choices, control flow
Multi-model serving | One engine per model | Strong: ensembles, multi-backend, model repo | One runtime per model; routers compose
Hardware | NVIDIA, AMD ROCm, TPU, Intel | NVIDIA-first, broad backend matrix | NVIDIA + AMD; FlashInfer kernels

Throughput

Where each runtime actually wins.

Public benchmarks shift release to release. The shape of the win is more stable than the numbers — anchor decisions on workload shape, not on last quarter's chart.

Workload | Typical winner | Why
Long shared prefix (agents, RAG) | SGLang | RadixAttention reuses KV across requests, not just within a sequence.
High-concurrency single-model serving | vLLM / SGLang | Both use continuous batching; gap narrows as batch fills.
Latency-sensitive single request | TRT-LLM via Triton | Compiled kernels and in-flight batching on Hopper / Blackwell.
Mixed-model production fleet | Triton | Model repository, ensembles, and metrics designed for it.
Strict JSON / grammar output | SGLang | Constrained decoding is native, not bolted on.

Decision

When to choose what.

guidance

Choose vLLM when

You serve one or a few open-weights models, want the simplest path to high throughput, and need broad hardware portability. The default for most production OSS LLM deployments.

Choose Triton when

You operate a heterogeneous fleet — LLMs alongside embedding, vision, ranking, or classical models — and need ensembles, A/B routing, BLS pipelines, and a single ops surface across model types.

Choose SGLang when

Your workload is agentic, RAG-heavy, tool-using, or constrained JSON. RadixAttention turns shared prefixes (system prompts, few-shot, retrieved context) into a measurable throughput win.
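Before committing to a runtime for its prefix reuse, it is worth estimating how much of your traffic actually shares a prefix. The sketch below models a block-level, hash-keyed prefix cache with unlimited capacity: each full block's hash chains in the hash of everything before it, so a block hits only if the entire prefix matches. This is a simplified model for estimation, not the implementation of any runtime, and the function names and token ids are hypothetical.

```python
import hashlib


def block_hashes(tokens, block_size=16):
    """Chained hashes of full blocks: each hash covers the block and all blocks before it."""
    hashes, prev = [], b""
    full_len = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def reuse_ratio(requests, block_size=16):
    """Fraction of prompt tokens that would hit an unbounded prefix cache."""
    seen = set()
    total = reused = 0
    for tokens in requests:
        total += len(tokens)
        for digest in block_hashes(tokens, block_size):
            if digest in seen:
                reused += block_size
            else:
                seen.add(digest)
    return reused / total if total else 0.0


system_prompt = list(range(32))                      # 32 shared tokens (stand-in ids)
users = [[1000 + i] * 16 for i in range(3)]          # 16 per-user tokens each

shared_first = [system_prompt + u for u in users]    # shared prefix, then user content
user_first = [u + system_prompt for u in users]      # user content first

print(round(reuse_ratio(shared_first), 3))  # 0.444
print(reuse_ratio(user_first))              # 0.0
```

The two layouts contain the same tokens, but only the one that puts the shared content first gets any reuse. Running this over a sample of real tokenized prompts gives a rough upper bound on what prefix caching can save.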

Pitfalls

What goes wrong in production.

Most runtime regrets come from benchmarking the wrong thing, not from picking the wrong runtime.
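One way to benchmark the right thing is to compare runtimes at a fixed latency budget rather than at a fixed batch size. The sketch below picks, for each runtime, the highest-throughput benchmark point that still meets a p95 budget. The `RunPoint` type, the runtime labels, and every number are invented for illustration; they are not measurements of any real runtime.

```python
from dataclasses import dataclass


@dataclass
class RunPoint:
    concurrency: int
    tokens_per_sec: float
    p95_latency_s: float


def best_at_slo(points, p95_budget_s):
    """Highest-throughput point whose p95 latency fits the budget, or None."""
    within_budget = [p for p in points if p.p95_latency_s <= p95_budget_s]
    if not within_budget:
        return None
    return max(within_budget, key=lambda p: p.tokens_per_sec)


# Invented sweep results for two unnamed runtimes.
runtime_a = [RunPoint(8, 900, 1.2), RunPoint(32, 2600, 2.4), RunPoint(64, 3400, 4.8)]
runtime_b = [RunPoint(8, 1000, 1.0), RunPoint(32, 2400, 1.9), RunPoint(64, 3000, 3.5)]

budget = 2.0  # seconds, assumed p95 SLO
best_a = best_at_slo(runtime_a, budget)
best_b = best_at_slo(runtime_b, budget)
print(best_a.tokens_per_sec, best_b.tokens_per_sec)
```

In this made-up data, runtime A has the higher throughput at concurrency 64, yet runtime B delivers far more tokens/sec once the 2-second p95 budget is enforced. The ranking can flip depending on which question the benchmark asks.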

  • 01 Benchmarking with a cold prefix cache: SGLang and vLLM both look worse than they are in steady state.
  • 02 Comparing throughput at the same batch size instead of the same p95 latency budget; the right question is tokens/sec at your SLO.
  • 03 Ignoring KV-cache memory pressure: long-context workloads are bound by HBM capacity and bandwidth, not FLOPs. Paged and radix layouts change the ceiling.
  • 04 Wrapping vLLM inside Triton 'because Triton' when a single-model deployment would be simpler and faster to operate.
  • 05 Treating prefix caching as free: it requires deterministic, shared prefixes. Per-user system prompts defeat reuse across users.

FAQ

Common questions from engineering teams.

Can vLLM run inside Triton?

Yes. Triton has a vLLM backend that wraps the vLLM engine inside Triton's model repository and ensemble framework. This makes sense when you need Triton's multi-model routing, metrics, and ops surface around a vLLM-powered LLM endpoint. For a single-model deployment, running vLLM directly is simpler and removes an abstraction layer.

Next

Want this evaluated against your workload?

We design and review inference platforms for AI infrastructure, GPU cloud, and sovereign-AI operators — including runtime selection, KV-cache sizing, and capacity modelling.