AI

Serving the Machine: How LLM Inference Runs at Planetary Scale

Harshit Rathod
Flat vector illustration of GPU racks connected by glowing data paths, with tokens flowing through a pipeline of prefill, decode, and streaming stages

TL;DR: LLM inference is two completely different workloads in a trench coat — compute-bound prefill and memory-bandwidth-bound decode — and every optimization in the modern serving stack exists because of that split. Fleet operators run millions of accelerators, memory bandwidth is the binding constraint, and power is the ultimate ceiling. APIs diverged from REST because the workload is fundamentally different: streaming, token-denominated pricing, and SLOs measured in TTFT and TPOT. MCP standardizes tool integration but introduces a brand-new attack surface.


The Central Architectural Fact

The defining insight about LLM inference is that it is two completely different workloads wearing one trench coat.

Prefill — processing the prompt — is compute-bound. A single forward pass over thousands of tokens saturates the GPU’s matrix-multiply units. It’s embarrassingly parallel across tokens in the input.

Decode — generating output one token at a time — is memory-bandwidth-bound. Each token requires reading the entire model weights and the growing KV cache from HBM, doing comparatively little arithmetic per byte moved.

This asymmetry drives essentially every optimization in the stack: continuous batching, PagedAttention, speculative decoding, prefill/decode disaggregation, and the hardware roadmaps of every chip vendor. If you understand this split, the rest of the architecture follows naturally.


Continuous Batching and the KV Cache

The foundational optimization of modern serving, introduced by the Orca paper (OSDI 2022) and popularized by vLLM, is iteration-level scheduling — better known as continuous batching.

Naive (“static”) batching processes a fixed batch start-to-finish, idling the GPU while the longest sequence in the batch finishes. Continuous batching instead admits and retires requests at every decoding iteration, keeping the GPU saturated on every forward pass. The Orca work demonstrated a 36.9x throughput improvement over FasterTransformer at equivalent latency, and reproductions on OPT-13B (A100 40GB) measured up to 23x throughput improvement versus static batching.

The reason batching is hard is the KV cache. Every autoregressive transformer stores, for each token it has seen, a key and value vector per attention layer. On each subsequent token it reads back this entire cache rather than recomputing it. The cache grows linearly with sequence length and batch size.

Before vLLM, serving systems pre-allocated contiguous memory chunks for each request’s maximum possible length. The PagedAttention authors found that existing systems waste 60–80% of memory due to fragmentation and over-reservation.

PagedAttention: Virtual Memory for AI

PagedAttention (UC Berkeley Sky Computing Lab, SOSP 2023) borrows the operating system’s virtual-memory playbook. The KV cache is divided into fixed-size blocks (16 tokens each), allocated non-contiguously on demand, and mapped through a per-sequence block table — exactly like page tables mapping virtual to physical memory.

The result: near-optimal memory usage with under 4% waste, delivering up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher than TGI by enabling far larger batch sizes. As a bonus, identical prefixes (a shared system prompt) can share physical KV blocks — the basis of prefix caching.

SGLang’s RadixAttention generalizes this into a radix tree of cached prefixes, enabling even more aggressive reuse across requests.


Prefill/Decode Disaggregation

Because prefill is compute-bound and decode is memory-bandwidth-bound, running both on the same GPU is wasteful. A long prefill stalls the latency-sensitive decode of other requests, and the two phases want different parallelism strategies.

The DistServe work (UCSD Hao AI Lab) made the case for physically splitting them onto separate GPU pools that scale independently. Eighteen months later, essentially every production-grade serving framework — NVIDIA Dynamo, llm-d, SGLang, vLLM, Mooncake, LMCache — supports disaggregation.

The hard part is moving the KV cache from the prefill GPU to the decode GPU at wire speed. NIXL (NVIDIA’s Inference Transfer Library, open-sourced at GTC 2025) is the transport layer: a point-to-point library that abstracts RDMA/InfiniBand, NVLink, RoCE, TCP, NVMe-oF, and S3 behind one API, moving KV tensors GPU-to-GPU without round-tripping through the CPU.

Dynamo orchestrates on top with a KV-aware router (routing by cache-overlap score and load), a global prefill queue built on NATS, and a GPU Planner that auto-scales the pools. The transfer is non-blocking — the GPU keeps serving other requests while KV blocks move.

The payoff: ~3–5x cost reduction per token on chat workloads by putting decode on cheaper GPUs than prefill.


Speculative Decoding

Speculative decoding attacks the memory-bandwidth wall of decode directly. A small “draft” model proposes K candidate tokens; the large “target” model verifies all K in a single parallel forward pass, accepting the longest matching prefix and resampling at the first divergence.

Because verification of K tokens costs roughly the same memory traffic as generating one, you get multiple tokens per target forward pass. The math:

Expected accepted tokens = (1 − α^(K+1)) / (1 − α), where α is the per-token acceptance rate.

In practice, α of 0.6–0.8 yields 2–3x speedups. Predictable domains like code hit 80%+ acceptance, while creative text can fall below 50% where overhead makes it a wash.

EAGLE and Medusa are the dominant modern variants, with EAGLE reusing the target model’s penultimate-layer features for higher acceptance. Critically, the algorithm is lossless — output is mathematically identical to running the target model alone.


The 1M-Token Problem

A 1M-token request is where the KV cache stops being a sidekick and becomes the monster.

For a 70B model with grouped-query attention, a 128K-token context adds tens of GB of KV cache for a single request. At 1M tokens with full precision, the cache runs into hundreds of GB. The consequences:

  • Prefill latency grows quadratically (attention is O(N²))
  • Decode latency grows linearly — every token must re-read the whole cache
  • Concurrency collapses — each request hogs so much HBM that per-GPU concurrency drops from ~20 requests to 1

Research quantifies this: moving from 4K to 50K context raises prefill latency from ~0.9s to ~14s.

The mitigations stack:

  • Multi-head latent attention (like DeepSeek’s MLA) to shrink the cache
  • KV-cache quantization to FP8 or FP4/INT4
  • Prefix caching to skip prefill on shared context
  • Disaggregation to keep long prefills from blocking decode

Research like KVQuant pushes toward 10M-token context on a single node by compressing the cache 8x+ with under 0.1 perplexity degradation.


Fleet-Scale Hardware

The GB200 NVL72 — Building Block of Modern Scale-Up

The GB200 NVL72 packs 72 Blackwell GPUs and 36 Grace CPUs into one liquid-cooled rack:

SpecValue
FP4 Compute (with sparsity)1.44 exaFLOPS
HBM3e Memory~13.5 TB
NVLink Bisection Bandwidth130 TB/s
Per-GPU NVLink1.8 TB/s (~14x PCIe Gen5)
Power Draw~120 kW
Weight~1.36 metric tons
Internal Cabling~2 miles copper NVLink backplane

The unified NVLink domain means a trillion-parameter model’s weights become cooperatively addressable across the rack at NVLink latency, making tensor parallelism cheap and interactive trillion-parameter inference viable.

The Power Ceiling

The common thread across all providers: power is the binding constraint. A GB200 NVL72 rack at 120 kW exceeds what many data centers’ 60 kW racks can handle. Next-gen GPUs hit ~1,400W each. Liquid cooling has gone from exotic to mandatory.

This is why FP4, MoE, and caching aren’t just cost optimizations — they’re the only way to fit more intelligence under a fixed power budget.


Parallelism: Fitting Frontier Models Across Chips

Four parallelism dimensions compose to spread a model across a fleet:

DimensionWhat it splitsCommunication costWhere it runs
Tensor parallelismEach layer’s matrices across GPUsHeavy — needs NVLinkWithin a rack
Pipeline parallelismLayers into stages across nodesModerateBetween nodes
Expert parallelismMoE experts across GPUsModerateCross-node OK
Data parallelismReplicate for throughputLowAnywhere

The interconnect determines what’s feasible. Tensor parallelism wants the 1.8 TB/s NVLink inside a rack; pipeline and expert parallelism can tolerate slower InfiniBand between nodes.

DeepSeek’s production setup is the clearest public example — EP32 (4 nodes) for prefill, EP144 (18 nodes) for decode, with redundant experts for load balancing.


Scaling to Millions: The ~800GB Model Problem

If one model instance needs ~800GB (weights for a 400B+ model plus KV cache), how do you serve millions of concurrent users?

The answer is a stack of multiplicative tricks:

1. Sharding + Replicas

An 800GB model is sharded across a node (8× H100/H200 = 640GB–1.1TB) via tensor + pipeline + expert parallelism, then replicated hundreds or thousands of times. Fleet schedulers route requests across replicas with KV-cache-aware load balancing.

2. MoE Cuts Active Compute

DeepSeek-V3/R1 has 671B total parameters but activates only 37B per token (8 of 256 experts per layer). You pay the memory cost of holding all experts in HBM, but the compute cost — and the decode bandwidth per token — is that of a 37B model.

3. Quantization Halves or Quarters Everything

  • FP8 (native on Hopper/Blackwell): halves memory, doubles tensor-core throughput vs FP16, lifts decode TPS 1.5–1.8x
  • FP4/INT4: quarters weight memory with only 1–3% perplexity degradation
  • KV cache quantization: attacks the long-context bottleneck directly

4. Caching Reuses Computed Work

Prefix caching means a shared system prompt is prefilled once and reused across thousands of requests. DeepSeek reported a 56.3% on-disk KV cache hit rate in production — more than half of all input tokens skipped prefill entirely.

The Unit Economics

DeepSeek’s February 2025 disclosure is the clearest public data on inference economics:

  • Peak: 278 nodes (2,224 H800 GPUs)
  • Average: 226.75 nodes (~1,814 GPUs)
  • Throughput: ~73.7K input tokens/sec (prefill) or ~14.8K output tokens/sec (decode) per node
  • Volume: 776B tokens/day (608B input, 168B output)
  • Daily cost: $87,072 (at $2/GPU-hour)
  • Theoretical daily revenue: $562,027 at published pricing
  • Theoretical margin: 545%

DeepSeek noted actual revenue is far lower (free tier, discounts), but the figure demonstrates that at high utilization with disaggregation, MoE, FP8, and caching, inference is structurally profitable at current market token prices.

The 280x Deflation Curve

Stanford HAI’s AI Index 2025 reports GPT-3.5-equivalent inference fell from $20/M tokens (Nov 2022) to $0.07/M tokens (Oct 2024) — a 280x decrease in two years. Dubbed “LLMflation,” this deflation curve is steeper than historical declines in PC compute or dotcom-era bandwidth.

The practical caveat: self-hosting beats per-request pricing only above ~40–60% sustained utilization. Below that, production duty cycles of 30–60% mean real cost/token runs 2–3x the spreadsheet.


Why LLM APIs Look Nothing Like REST

Traditional web APIs are request/response, stateless, sub-second, and measured by p99 latency. LLM APIs break every one of those assumptions.

Streaming Is the Default

Because a full response takes seconds to minutes, providers stream tokens over Server-Sent Events (SSE) — a persistent HTTP connection with Content-Type: text/event-stream. No WebSocket upgrade needed. This is why TTFT (time-to-first-token) matters: users see output within the TTFT window even though full generation is far longer.

Different SLO Vocabulary

Instead of a single p99, LLM serving tracks:

  • TTFT — Time to first token (queueing + prefill)
  • TPOT/ITL — Time per output token / inter-token latency (decode speed)
  • Goodput — Fraction of requests meeting all SLOs simultaneously

MLCommons’ MLPerf 5.1 codifies human-perception thresholds: TTFT ≤ 500ms and TPOT ≤ 30ms (~33 tokens/sec, matching reading speed).

TTFT and TPOT are in fundamental tension: larger batches improve throughput but hurt TTFT via queueing.

Token-Denominated Pricing

Limits are TPM (tokens per minute), not QPS. Output tokens cost ~4–5x more than input because decode is the bottleneck:

ProviderInputOutputRatio
Claude Sonnet 4.5$3/MTok$15/MTok5x
GPT-4o$2.50/MTok$10/MTok4x
Gemini 2.5 Flash$0.15/MTok$0.60/MTok4x

50% discounts apply for cached input and batch APIs across all three providers.

Prompt Caching as a Billing Primitive

Prompt caching exposes the serving layer’s prefix-KV reuse as a cost lever:

ProviderMechanismRead DiscountWrite CostTTL
OpenAIAutomatic (≥1,024 tokens)50%Free5–10 min (up to 24h)
AnthropicExplicit cache_control breakpoints~90%+25%5 min (1h at 2x)
GeminiImplicit75%FreeVaries

The universal rule: static content (system prompt, tool defs, documents) must come first and be byte-for-byte identical. Any variation — a timestamp, a user ID — drops the hit rate to zero.

Production systems routinely hit 70–95% cache rates. One reported case cut TTFT from 4.3s to 0.6s on warm cache.


MCP: The Standard — and the New Attack Surface

The Model Context Protocol (MCP), introduced by Anthropic in November 2024, standardizes how external tools and data reach a model — “a USB-C port for AI.”

Architecture

A host-client-server model: an MCP host (Claude Desktop, VS Code, Cursor, ChatGPT) spins up one MCP client per connected server, each maintaining a dedicated, stateful session.

Communication is JSON-RPC 2.0 over two transports:

  • stdio — local subprocess, zero network overhead
  • Streamable HTTP — single endpoint with optional SSE upgrade for streaming (replaced the older two-endpoint HTTP+SSE transport in March 2025)

Servers expose three primitives: tools (model-callable actions), resources (context data), and prompts (templates).

The Function-Calling Loop

The loop is the same across all providers and predates MCP:

  1. App sends the model a prompt plus tool schemas
  2. Model returns a structured tool_call with arguments
  3. App executes the tool and feeds the result back
  4. Model reasons again, possibly calling more tools, until it produces a final answer

MCP standardizes the server side so a tool written once works across any compliant host.

Adoption

By late 2025: OpenAI, Google DeepMind, and Anthropic all support MCP. The ecosystem reached 10,000+ active public servers and 97M+ monthly SDK downloads. Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded with Block and OpenAI.

The Security Problem

Tool definitions consume context-window tokens, so agents with hundreds of tools degrade reasoning. But the bigger issue is security:

  • Prompt injection is OWASP’s #1 LLM vulnerability for 2025
  • Tool poisoning (CVE-2025-54136 “MCPoison,” CVE-2025-54135 “CurXecute”) embeds malicious instructions in tool metadata that the model reads as instructions — a supply-chain attack on the agent’s context
  • A tool poisoned once affects every session that lists it

The MCP spec’s mitigation is “SHOULD” — human-in-the-loop approval. Enterprise deployments increasingly route through a gateway with schema validation, OAuth 2.1 with PKCE, audit logging, and rate limiting.


Practical Recommendations

Building an LLM Application

  • Turn on prompt caching today — highest-leverage cost and latency lever
  • Structure prompts with static content first, byte-for-byte identical
  • Stream over SSE; instrument TTFT and TPOT separately (not a single p99)
  • Budget by tokens; alert at 80% of TPM ceilings
  • Route easy traffic to cheap models — a 70/30 split between $0.10 and $3 models blends to ~$0.97/M

Self-Hosting Inference

  • Start with vLLM or SGLang for PagedAttention + continuous batching before anything exotic
  • Only adopt disaggregated serving once you’ve measured prefill-vs-decode interference
  • Quantize to FP8 first (near-free quality), then evaluate FP4 for long-context
  • Enable speculative decoding only after benchmarking acceptance rate on your workload
  • Treat ~40% utilization as the breakeven line for build-vs-buy

Designing Tool/Agent Systems

  • Adopt MCP for portability, but treat every tool definition as untrusted input
  • Keep a human in the loop for high-consequence actions
  • Run remote servers behind a gateway with schema validation and OAuth 2.1
  • Prune your tool set — hundreds of tools degrade reasoning and bloat context
  • Pin server versions to defend against tool-poisoning

Thresholds That Change the Calculus

SignalAction
GPU utilization below ~40%Switch from self-hosting to per-request APIs
Cache hit rate below ~50%Audit prompt structure before buying hardware
TTFT p99 exceeds SLO (average OK)Add replicas or disaggregate — more batching will worsen it

Caveats

Fleet numbers are moving targets — many are vendor-announced rather than independently audited. Architecture details for frontier models are not officially confirmed (MoE is widely reported but not disclosed). DeepSeek’s 545% margin is explicitly theoretical. Pricing and cache TTLs change frequently — verify against current provider docs before architecting around them.

The 280x cost deflation, the disaggregation trend, and the power ceiling are structural forces. Whether you’re building on APIs or self-hosting, the decisions you make in the next 12 months will compound dramatically. The engineers who understand the serving stack — not just the models — will build the systems that survive the next repricing.

Related Posts