TL;DR: LLM inference is two completely different workloads in a trench coat — compute-bound prefill and memory-bandwidth-bound decode — and every optimization in the modern serving stack exists because of that split. Fleet operators run millions of accelerators, memory bandwidth is the binding constraint, and power is the ultimate ceiling. APIs diverged from REST because the workload is fundamentally different: streaming, token-denominated pricing, and SLOs measured in TTFT and TPOT. MCP standardizes tool integration but introduces a brand-new attack surface.
The defining insight about LLM inference is that it is two completely different workloads wearing one trench coat.
Prefill — processing the prompt — is compute-bound. A single forward pass over thousands of tokens saturates the GPU’s matrix-multiply units. It’s embarrassingly parallel across tokens in the input.
Decode — generating output one token at a time — is memory-bandwidth-bound. Each token requires reading the entire model weights and the growing KV cache from HBM, doing comparatively little arithmetic per byte moved.
This asymmetry drives essentially every optimization in the stack: continuous batching, PagedAttention, speculative decoding, prefill/decode disaggregation, and the hardware roadmaps of every chip vendor. If you understand this split, the rest of the architecture follows naturally.
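A back-of-the-envelope roofline check makes the split concrete. The sketch below is a simplification, not a benchmark: it assumes 16-bit weights, the common rule of thumb of about 2 FLOPs per parameter per token, and hardware figures roughly in line with an H100 SXM (`PEAK_FLOPS` and `HBM_BANDWIDTH` are assumed values). It ignores attention and KV-cache traffic entirely.

```python
# Rough roofline arithmetic; every constant below is an assumption for illustration.
PEAK_FLOPS = 989e12      # assumed dense 16-bit peak, FLOP/s (roughly an H100 SXM)
HBM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2      # 16-bit weights


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Uses ~2 FLOPs per parameter per token and ignores attention and
    KV-cache traffic, so it is only a first-order estimate.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / BYTES_PER_PARAM


# Intensity above which the matrix units, not memory, become the limit.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH


def bound(tokens_per_pass: int) -> str:
    """Classify a forward pass as compute-bound or memory-bound."""
    return "compute" if arithmetic_intensity(tokens_per_pass) >= ridge_point else "memory"


print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print("prefill, 4096-token prompt:", bound(4096))
print("decode, batch of 1:", bound(1))
print("decode, batch of 64:", bound(64))
```

Under these assumptions a long prompt sits well above the ridge point, while decode stays below it even at a batch of 64, which is why batching more decode requests together is the first lever every serving system reaches for.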
The foundational optimization of modern serving, introduced by the Orca paper (OSDI 2022) and popularized by vLLM, is iteration-level scheduling — better known as continuous batching.
Naive (“static”) batching processes a fixed batch start-to-finish, idling the GPU while the longest sequence in the batch finishes. Continuous batching instead admits and retires requests at every decoding iteration, keeping the GPU saturated on every forward pass. The Orca work demonstrated a 36.9x throughput improvement over FasterTransformer at equivalent latency, and reproductions on OPT-13B (A100 40GB) measured up to 23x throughput improvement versus static batching.
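A toy simulation shows where the gain comes from. This is a minimal sketch, not Orca's or vLLM's scheduler: it counts decode iterations only, assumes every iteration costs the same regardless of batch occupancy, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished sequences leave and queued ones join at every iteration."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before the next iteration.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one iteration emits one token per running request
        running = [r - 1 for r in running if r > 1]  # retire finished requests
    return steps


# Hypothetical output lengths: two long requests mixed with short ones.
output_lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(output_lengths, batch_size=4))      # 200
print(continuous_batching_steps(output_lengths, batch_size=4))  # 105
```

With static batching the second long request cannot start until the first batch drains; with iteration-level scheduling it takes a freed slot after five iterations and the two long requests overlap.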
The reason batching is hard is the KV cache. Every autoregressive transformer stores, for each token it has seen, a key and value vector per attention layer. On each subsequent token it reads back this entire cache rather than recomputing it. The cache grows linearly with sequence length and batch size.
Before vLLM, serving systems pre-allocated contiguous memory chunks for each request’s maximum possible length. The PagedAttention authors found that existing systems waste 60–80% of memory due to fragmentation and over-reservation.
PagedAttention (UC Berkeley Sky Computing Lab, SOSP 2023) borrows the operating system’s virtual-memory playbook. The KV cache is divided into fixed-size blocks (16 tokens each), allocated non-contiguously on demand, and mapped through a per-sequence block table — exactly like page tables mapping virtual to physical memory.
The result: near-optimal memory usage with under 4% waste, delivering up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher than TGI by enabling far larger batch sizes. As a bonus, identical prefixes (a shared system prompt) can share physical KV blocks — the basis of prefix caching.
SGLang’s RadixAttention generalizes this into a radix tree of cached prefixes, enabling even more aggressive reuse across requests.
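The block-table idea can be sketched in a few lines. The classes below are hypothetical and only illustrate on-demand, non-contiguous allocation with 16-token blocks; they are not vLLM's actual data structures, and they leave out prefix sharing, eviction, and freeing.

```python
BLOCK_SIZE = 16  # tokens per KV block, as described in the text


class BlockAllocator:
    """Hypothetical free-list allocator over physical KV blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV blocks")
        return self.free.pop()


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self):
        """Unused token slots; always less than one block per sequence."""
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(seq.block_table, seq.wasted_slots())
```

Because a sequence only ever holds one partially filled block, waste is bounded by `BLOCK_SIZE - 1` slots per request instead of the gap between actual and maximum length.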
Because prefill is compute-bound and decode is memory-bandwidth-bound, running both on the same GPU is wasteful. A long prefill stalls the latency-sensitive decode of other requests, and the two phases want different parallelism strategies.
The DistServe work (UCSD Hao AI Lab) made the case for physically splitting them onto separate GPU pools that scale independently. Eighteen months later, essentially every production-grade serving framework — NVIDIA Dynamo, llm-d, SGLang, vLLM, Mooncake, LMCache — supports disaggregation.
The hard part is moving the KV cache from the prefill GPU to the decode GPU at wire speed. NIXL (the NVIDIA Inference Xfer Library, open-sourced at GTC 2025) is the transport layer: a point-to-point library that abstracts RDMA/InfiniBand, NVLink, RoCE, TCP, NVMe-oF, and S3 behind one API, moving KV tensors GPU-to-GPU without round-tripping through the CPU.
Dynamo orchestrates on top with a KV-aware router (routing by cache-overlap score and load), a global prefill queue built on NATS, and a GPU Planner that auto-scales the pools. The transfer is non-blocking — the GPU keeps serving other requests while KV blocks move.
The payoff: ~3–5x cost reduction per token on chat workloads by putting decode on cheaper GPUs than prefill.
Speculative decoding attacks the memory-bandwidth wall of decode directly. A small “draft” model proposes K candidate tokens; the large “target” model verifies all K in a single parallel forward pass, accepting the longest matching prefix and resampling at the first divergence.
Because verification of K tokens costs roughly the same memory traffic as generating one, you get multiple tokens per target forward pass. The math:
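Expected tokens per target forward pass = (1 − α^(K+1)) / (1 − α), where α is the per-token acceptance rate and K is the number of drafted tokens. The count includes the one token the target model itself contributes on every pass.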
In practice, α of 0.6–0.8 yields 2–3x speedups. Predictable domains like code hit 80%+ acceptance, while creative text can fall below 50% where overhead makes it a wash.
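Plugging numbers into the formula above shows how quickly the payoff changes with the acceptance rate. The draft length `k=4` is an assumed value chosen for illustration. The result is tokens per target pass, not wall-clock speedup, since running the draft model also costs time.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Evaluate (1 - alpha**(k + 1)) / (1 - alpha) for 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.6, 0.7, 0.8):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

At α = 0.5 the target model yields just under two tokens per pass, while at α = 0.8 it yields more than three.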
EAGLE and Medusa are the dominant modern variants, with EAGLE reusing the target model’s penultimate-layer features for higher acceptance. Critically, the standard algorithm is lossless: with exact verification, the output distribution is identical to that of running the target model alone.
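The accept-and-resample step is easiest to see in the greedy (temperature-zero) case, where "resampling" reduces to taking the target model's own choice. The function below is a sketch of that special case only; the sampled case uses a rejection-sampling rule that is not shown here. Token ids are plain integers and `verify_draft` is a hypothetical name.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification: keep the longest matching prefix, then one target token.

    draft_tokens:  K tokens proposed by the draft model.
    target_tokens: K + 1 tokens the target model picks at the same positions,
                   all obtained from a single parallel forward pass.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            accepted.append(target)  # first divergence: use the target's token
            return accepted
        accepted.append(draft)
    # Every draft token matched, so the extra target position is a free bonus token.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted


print(verify_draft([1, 2, 3, 4], [1, 2, 9, 7, 8]))  # [1, 2, 9]
print(verify_draft([1, 2], [1, 2, 3]))              # [1, 2, 3]
```

Every call returns at least one token, so a pass is never wasted; the output matches what the target model would have produced token by token.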
A 1M-token request is where the KV cache stops being a sidekick and becomes the monster.
For a 70B model with grouped-query attention, a 128K-token context adds tens of GB of KV cache for a single request. At 1M tokens with full precision, the cache runs into hundreds of GB. The consequences:
Research quantifies this: moving from 4K to 50K context raises prefill latency from ~0.9s to ~14s.
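The cache sizes quoted above follow from simple arithmetic. The sketch assumes a Llama-70B-like shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and 16-bit values; those defaults are assumptions, not figures from the text.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys plus values for every layer and KV head, summed over all tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token


for context in (128_000, 1_000_000):
    print(f"{context:>9} tokens: {kv_cache_bytes(context) / 1e9:.1f} GB")
```

Under these assumptions each token costs about 328 KB, so 128K tokens land around 42 GB and 1M tokens around 328 GB for a single request.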
The mitigations stack:
Research like KVQuant pushes toward 10M-token context on a single node by compressing the cache 8x+ with under 0.1 perplexity degradation.
The GB200 NVL72 packs 72 Blackwell GPUs and 36 Grace CPUs into one liquid-cooled rack:
| Spec | Value |
|---|---|
| FP4 Compute (with sparsity) | 1.44 exaFLOPS |
| HBM3e Memory | ~13.5 TB |
| Aggregate NVLink Bandwidth | 130 TB/s |
| Per-GPU NVLink | 1.8 TB/s (~14x PCIe Gen5) |
| Power Draw | ~120 kW |
| Weight | ~1.36 metric tons |
| Internal Cabling | ~2 miles copper NVLink backplane |
The unified NVLink domain means a trillion-parameter model’s weights become cooperatively addressable across the rack at NVLink latency, making tensor parallelism cheap and interactive trillion-parameter inference viable.
The common thread across all providers: power is the binding constraint. A GB200 NVL72 rack at 120 kW exceeds what many data centers’ 60 kW racks can handle. Next-gen GPUs hit ~1,400W each. Liquid cooling has gone from exotic to mandatory.
This is why FP4, MoE, and caching aren’t just cost optimizations — they’re the only way to fit more intelligence under a fixed power budget.
Four parallelism dimensions compose to spread a model across a fleet:
| Dimension | What it splits | Communication cost | Where it runs |
|---|---|---|---|
| Tensor parallelism | Each layer’s matrices across GPUs | Heavy — needs NVLink | Within a rack |
| Pipeline parallelism | Layers into stages across nodes | Moderate | Between nodes |
| Expert parallelism | MoE experts across GPUs | Moderate | Cross-node OK |
| Data parallelism | Replicate for throughput | Low | Anywhere |
The interconnect determines what’s feasible. Tensor parallelism wants the 1.8 TB/s NVLink inside a rack; pipeline and expert parallelism can tolerate slower InfiniBand between nodes.
DeepSeek’s production setup is the clearest public example — EP32 (4 nodes) for prefill, EP144 (18 nodes) for decode, with redundant experts for load balancing.
If one model instance needs ~800GB (weights for a 400B+ model plus KV cache), how do you serve millions of concurrent users?
The answer is a stack of multiplicative tricks:
An 800GB model is sharded across one or more nodes via tensor + pipeline + expert parallelism (an 8-GPU node offers 640GB with H100s or about 1.1TB with H200s), then replicated hundreds or thousands of times. Fleet schedulers route requests across replicas with KV-cache-aware load balancing.
DeepSeek-V3/R1 has 671B total parameters but activates only 37B per token (8 of 256 experts per layer). You pay the memory cost of holding all experts in HBM, but the compute cost — and the decode bandwidth per token — is that of a 37B model.
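The memory-versus-compute trade can be put in numbers. The parameter counts come from the paragraph above; the 8-bit weight size and the 2-FLOPs-per-parameter rule of thumb are assumptions used only for this estimate.

```python
TOTAL_PARAMS = 671e9   # all experts must stay resident in HBM
ACTIVE_PARAMS = 37e9   # parameters actually used for each token
BYTES_PER_PARAM = 1    # assumption: 8-bit weights

resident_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS  # ~2 FLOPs per active parameter (rule of thumb)

print(f"weights resident in HBM: {resident_gb:.0f} GB")
print(f"fraction of parameters active per token: {active_fraction:.1%}")
print(f"approximate FLOPs per token: {flops_per_token:.2e}")
```

Only about 5.5% of the parameters do work on any given token, which is the whole appeal: dense-model capacity on the memory side, small-model cost on the compute side.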
Prefix caching means a shared system prompt is prefilled once and reused across thousands of requests. DeepSeek reported a 56.3% on-disk KV cache hit rate in production — more than half of all input tokens skipped prefill entirely.
DeepSeek’s February 2025 disclosure is the clearest public data on inference economics:
DeepSeek noted actual revenue is far lower (free tier, discounts), but the figure demonstrates that at high utilization with disaggregation, MoE, FP8, and caching, inference is structurally profitable at current market token prices.
Stanford HAI’s AI Index 2025 reports GPT-3.5-equivalent inference fell from $20/M tokens (Nov 2022) to $0.07/M tokens (Oct 2024) — a 280x decrease in two years. Dubbed “LLMflation,” this deflation curve is steeper than historical declines in PC compute or dotcom-era bandwidth.
The practical caveat: self-hosting beats per-request pricing only above roughly 40–60% sustained utilization. Production duty cycles are often just 30–60%, so the real cost per token runs 2–3x what a spreadsheet that assumes full utilization predicts.
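The utilization effect is a one-line division, sketched here with hypothetical inputs. The $2 hourly GPU price and the 1,000 tokens/s peak throughput are placeholders, not measurements; substitute your own figures.

```python
def cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_sec, utilization):
    """Effective cost when the hardware is busy only `utilization` of the time."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1e6


# Hypothetical inputs: a $2/hour GPU sustaining 1,000 tokens/s at peak.
for utilization in (1.0, 0.6, 0.4, 0.3):
    usd = cost_per_million_tokens(2.0, 1000, utilization)
    print(f"{utilization:.0%} utilization: ${usd:.2f} per million tokens")
```

Cost scales with the inverse of utilization, so a fleet running at 40% pays 2.5x the full-utilization price per token.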
Traditional web APIs are request/response, stateless, sub-second, and measured by p99 latency. LLM APIs break every one of those assumptions.
Because a full response takes seconds to minutes, providers stream tokens over Server-Sent Events (SSE) — a persistent HTTP connection with Content-Type: text/event-stream. No WebSocket upgrade needed. This is why TTFT (time-to-first-token) matters: users see output within the TTFT window even though full generation is far longer.
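On the client side, consuming such a stream mostly means splitting events on blank lines and reading their `data:` fields. The parser below is a minimal sketch that handles only that subset; the full SSE format also defines `event:`, `id:`, and `retry:` fields and comment lines. The JSON payload shape with a `token` key is hypothetical, since each provider defines its own chunk schema.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of decoded lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield "\n".join(buffer)  # a blank line ends the event
            buffer = []


# Hypothetical stream; real providers use their own JSON schema.
stream = ['data: {"token": "Hel"}\n', "\n", 'data: {"token": "lo"}\n', "\n"]
tokens = [json.loads(payload)["token"] for payload in iter_sse_data(stream)]
print("".join(tokens))
```

In a real client, `lines` would be the line iterator of a streaming HTTP response, and each token would be rendered as it arrives rather than collected into a list.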
Instead of a single p99, LLM serving tracks:
MLCommons’ MLPerf 5.1 codifies human-perception thresholds: TTFT ≤ 500ms and TPOT ≤ 30ms (~33 tokens/sec, matching reading speed).
TTFT and TPOT are in fundamental tension: larger batches improve throughput but hurt TTFT via queueing.
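Both metrics can be computed from client-side timestamps. The sketch uses one common definition, assumed here: TTFT is the delay to the first token, and TPOT is the mean gap between subsequent tokens. The function name and the sample timings are made up for illustration.

```python
def ttft_and_tpot(request_time, token_times):
    """Return (TTFT, TPOT) in seconds from a request time and token arrival times."""
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, 0.0
    # Mean inter-token gap, excluding the wait for the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical arrival times in seconds for a five-token response.
ttft, tpot = ttft_and_tpot(0.0, [0.40, 0.425, 0.45, 0.475, 0.50])
print(f"TTFT = {ttft * 1000:.0f} ms, TPOT = {tpot * 1000:.0f} ms")
print("meets 500 ms / 30 ms thresholds:", ttft <= 0.5 and tpot <= 0.03)
```

Tracking the two separately matters because a queueing problem shows up in TTFT while an overloaded decode batch shows up in TPOT.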
Limits are TPM (tokens per minute), not QPS. Output tokens cost ~4–5x more than input because decode is the bottleneck:
| Provider | Input | Output | Ratio |
|---|---|---|---|
| Claude Sonnet 4.5 | $3/MTok | $15/MTok | 5x |
| GPT-4o | $2.50/MTok | $10/MTok | 4x |
| Gemini 2.5 Flash | $0.15/MTok | $0.60/MTok | 4x |
Batch APIs are discounted by 50% across all three providers. Cached input is discounted by 50% or more, depending on the provider.
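Token-denominated pricing turns cost estimation into a small calculation. The sketch uses the list prices from the table above and ignores caching and batch discounts; `request_cost` is a hypothetical helper, and prices should be re-checked against current provider documentation.

```python
PRICES_PER_MTOK = {  # (input, output) in USD per million tokens, from the table above
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def request_cost(model, input_tokens, output_tokens):
    """List-price cost in USD of one request, with no discounts applied."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1e6


# A 10,000-token prompt with a 1,000-token answer.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.4f}")
```

Even with ten times more input than output tokens, output accounts for a third of the Claude Sonnet 4.5 bill in this example, which is the output premium at work.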
Prompt caching exposes the serving layer’s prefix-KV reuse as a cost lever:
| Provider | Mechanism | Read Discount | Write Cost | TTL |
|---|---|---|---|---|
| OpenAI | Automatic (≥1,024 tokens) | 50% | Free | 5–10 min (up to 24h) |
| Anthropic | Explicit cache_control breakpoints | ~90% | +25% | 5 min (1h at 2x) |
| Gemini | Implicit | 75% | Free | Varies |
The universal rule: static content (system prompt, tool defs, documents) must come first and be byte-for-byte identical. Any variation near the start of the prompt, such as a timestamp or a user ID, invalidates the cached prefix from that point on and can drop the hit rate to zero.
Production systems routinely hit 70–95% cache rates. One reported case cut TTFT from 4.3s to 0.6s on warm cache.
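The ordering rule can be demonstrated by measuring how much of two consecutive prompts is shared. This is a simplified sketch: characters stand in for tokens, whereas real caches match on token blocks, and the prompt text and helper names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a support agent for Acme. Follow the policy document below.\n"


def cache_friendly(user_msg, timestamp):
    """Static content first, variable content last."""
    return SYSTEM + f"[time: {timestamp}]\n" + user_msg


def cache_hostile(user_msg, timestamp):
    """Variable content first, which breaks the shared prefix immediately."""
    return f"[time: {timestamp}]\n" + SYSTEM + user_msg


friendly = shared_prefix_len(cache_friendly("Hi", "09:00"), cache_friendly("Hi", "09:01"))
hostile = shared_prefix_len(cache_hostile("Hi", "09:00"), cache_hostile("Hi", "09:01"))
print(friendly, hostile)
```

With the timestamp placed after the system prompt, the entire system prompt remains reusable; placed before it, only the first few characters match and the rest must be prefilled again.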
The Model Context Protocol (MCP), introduced by Anthropic in November 2024, standardizes how external tools and data reach a model — “a USB-C port for AI.”
A host-client-server model: an MCP host (Claude Desktop, VS Code, Cursor, ChatGPT) spins up one MCP client per connected server, each maintaining a dedicated, stateful session.
Communication is JSON-RPC 2.0 over two transports:
Servers expose three primitives: tools (model-callable actions), resources (context data), and prompts (templates).
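On the wire, discovering and invoking a tool are ordinary JSON-RPC 2.0 requests. The sketch below builds the two messages using the `tools/list` and `tools/call` methods; the `get_weather` tool and its arguments are hypothetical, and `jsonrpc_request` is an illustrative helper rather than part of any SDK.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


list_tools = jsonrpc_request("tools/list")
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Berlin"}},  # hypothetical tool
)
print(json.dumps(call_tool, indent=2))
```

In practice an SDK handles framing, the initialization handshake, and the transport; the point here is only that the protocol surface is small and inspectable.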
The loop is the same across all providers and predates MCP:
The model emits a tool_call with arguments; the host executes the tool and feeds the result back to the model.
MCP standardizes the server side so a tool written once works across any compliant host.
By late 2025: OpenAI, Google DeepMind, and Anthropic all support MCP. The ecosystem reached 10,000+ active public servers and 97M+ monthly SDK downloads. Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded with Block and OpenAI.
Tool definitions consume context-window tokens, so agents with hundreds of tools degrade reasoning. But the bigger issue is security:
The MCP spec’s main mitigation, human-in-the-loop approval, is only a “SHOULD”, not a “MUST”. Enterprise deployments increasingly route through a gateway with schema validation, OAuth 2.1 with PKCE, audit logging, and rate limiting.
| Signal | Action |
|---|---|
| GPU utilization below ~40% | Switch from self-hosting to per-request APIs |
| Cache hit rate below ~50% | Audit prompt structure before buying hardware |
| TTFT p99 exceeds SLO (average OK) | Add replicas or disaggregate — more batching will worsen it |
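The signals in the table can be wired into a simple periodic check. The function below is a sketch that encodes the table's approximate thresholds literally; the parameter names are hypothetical, and the cutoffs should be tuned to your own workload.

```python
def recommend(gpu_utilization, cache_hit_rate, ttft_p99_s, ttft_avg_s, ttft_slo_s):
    """Map fleet metrics to the actions listed in the table above."""
    actions = []
    if gpu_utilization < 0.40:
        actions.append("switch from self-hosting to per-request APIs")
    if cache_hit_rate < 0.50:
        actions.append("audit prompt structure before buying hardware")
    # Tail misses the SLO while the average is fine: a queueing problem.
    if ttft_p99_s > ttft_slo_s and ttft_avg_s <= ttft_slo_s:
        actions.append("add replicas or disaggregate prefill and decode")
    return actions


print(recommend(gpu_utilization=0.65, cache_hit_rate=0.35,
                ttft_p99_s=1.2, ttft_avg_s=0.3, ttft_slo_s=0.5))
```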
Fleet numbers are moving targets — many are vendor-announced rather than independently audited. Architecture details for frontier models are not officially confirmed (MoE is widely reported but not disclosed). DeepSeek’s 545% margin is explicitly theoretical. Pricing and cache TTLs change frequently — verify against current provider docs before architecting around them.
The 280x cost deflation, the disaggregation trend, and the power ceiling are structural forces. Whether you’re building on APIs or self-hosting, the decisions you make in the next 12 months will compound dramatically. The engineers who understand the serving stack — not just the models — will build the systems that survive the next repricing.