02 MAY 2026 · 16 MIN READ
LLMs · Inference · GPU · vLLM · Infrastructure

Deploy Your Own Models to GPUs

A 2026 operator playbook: when self-hosting wins, which inference engine to pick, the GPU buying matrix, quantization that actually works, and the four metrics you autoscale on.

NIRBHAY AGRAWAL · 02 MAY 2026

The first question to settle, and the one most slide decks skip, is whether you should be self-hosting at all. The math is brutal at both ends. A single H100 SXM5 at $2.40 an hour, running Llama-3.3-70B at FP8 with vLLM under steady load, lands at about $1.67 per million output tokens. GPT-4o-class API usage is somewhere between $3.75 and $10 per million output tokens depending on caching and batch size. So self-hosting is roughly two to three times cheaper at sustained load, and roughly infinitely more expensive when your H100 sits at five percent utilisation because traffic is spiky. The line between self-host and API isn’t cost. It’s throughput consistency.
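
For concreteness, the arithmetic behind those per-token numbers is just the hourly rate divided by millions of tokens delivered per hour. A throwaway Python sketch using the figures quoted above; swap in your own telemetry:

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    # dollars per hour divided by millions of output tokens per hour
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / (tokens_per_hour / 1_000_000)

# H100 SXM5 at $2.40/hr sustaining ~400 tok/s of Llama-3.3-70B FP8 output
print(cost_per_million_tokens(2.40, 400))        # ~$1.67/M

# the same card at 5% utilisation: throughput collapses, cost explodes
print(cost_per_million_tokens(2.40, 400 * 0.05)) # ~$33/M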

Spiky traffic should never self-host. A predictable internal pipeline at enterprise volume usually should. Most of what follows is a working operator’s playbook for the second case, plus the math you need to know before you commit.

Should you self-host at all?

Three gating questions, in order. Is your throughput consistent? If a week of token-per-second telemetry looks like a square wave, you can self-host. If it looks like an EKG, don’t. Do you need a fine-tune, an exotic model, or data sovereignty? If yes, self-host, and skip ahead. If you’re shipping a frontier model someone else trained, the API is almost always the right answer. Do you have the engineering budget? A production vLLM cluster eats roughly half an SRE’s calendar, which is a hundred thousand dollars a year of someone’s time that you’re not paying an API to deal with.

The clean rule I use is that self-host wins when annualised API spend exceeds GPU spend plus 0.5 SRE FTE. In 2026 numbers that’s usually somewhere between five and fifteen thousand dollars a month of API spend. Below that, you’re doing this for fun. Above it, staying on the API is burning money.
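
Written as code rather than a sentence, the rule is a one-liner. The SRE figure is the hundred-thousand-dollar engineer-time estimate from above; adjust it to your own payroll:

def self_hosting_wins(annual_api_spend: float, annual_gpu_spend: float,
                      sre_fte_cost: float = 100_000) -> bool:
    return annual_api_spend > annual_gpu_spend + 0.5 * sre_fte_cost

# e.g. $8K/month of API spend versus two reserved H100s at $2.40/hr year-round
print(self_hosting_wins(8_000 * 12, 2 * 2.40 * 24 * 365))  # True, barely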

Pick your inference engine

In 2026, ninety percent of teams self-hosting LLMs should be running vLLM. The other ten percent know exactly why they aren’t. The V1 engine, default since v0.8.0 in January 2025, is a ground-up rewrite that hits 85 to 92 percent GPU utilisation under high concurrency. The seminal post on continuous batching from Anyscale in 2023 already showed up to 23x throughput over naive static batching on OPT-13B. That number is now unremarkable. It’s the floor every modern serving stack assumes.

The cases where I’d reach for something other than vLLM are specific. SGLang is worth it if your workload has heavy prefix sharing, which describes most RAG and multi-turn agent traffic. Independent benchmarks put SGLang about 29 percent ahead on H100 for 8B models with unique prompts, and up to 6.4x throughput on cache-heavy workloads versus engines without RadixAttention. The gap shrinks to three to five percent at 70B and up on unique prompts, so the win is highly workload-shaped. TensorRT-LLM is the right answer if you’ve already hit vLLM’s ceiling and have a CUDA engineer on the team. NVIDIA reports 12,000 tokens per second on Llama-2-13B and 1.9x H100 on Llama-2-70B with FP8, but the operational complexity is real and the on-call tax is unforgiving. Triton Inference Server is fine when you have non-LLM models alongside (vision, ASR, embeddings) but a poor choice for an LLM-only fleet. Ollama and llama.cpp with GGUF are laptop and dev tools, not production answers. And if you’re still on TGI in 2026, you started in 2023 and forgot to look up.

Pick your GPU

The relevant 2026 grid is shorter than people pretend. The 4090 and L4 are for small models up to about 8B and dev workloads; twenty-four gigabytes of VRAM is the cliff. The L40S at forty-eight gigabytes (no NVLink) is good for 13B to 34B at one to two dollars an hour from boutique clouds, mediocre for 70B. The A100 80GB still pays its way for batched inference if you can find one cheap, but it’s missing FP8 and feels older every quarter. The H100 80GB is the production default: native FP8 (1,979 TFLOPS), HBM3, NVLink-4. AWS p5 settled around $3.90 a GPU-hour after the June 2025 cut of about 44 percent; the boutique floor on RunPod, Lambda, and Vast.ai sits at $1.49 to $2.99. The H200 with 141GB of HBM3e is roughly 1.9x H100 on Llama-2-70B under FP8, and it’s worth the premium when a single-card 70B fit matters to you. The B200 (Blackwell) is the new ceiling: 180GB HBM3e, 8 TB/s, native FP4, NVLink-5 at 1.8 TB/s. NVIDIA reports 3x DGX H200 throughput on DeepSeek-R1, Llama-3.1-405B, and Llama-3.3-70B. Only buy capacity here if you can keep a four-to-six-dollar-an-hour GPU saturated. Most teams can’t.
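
Before committing to a card, I do the back-of-envelope VRAM math. A rough sketch with my own fudge factor for activations and runtime overhead, not vendor numbers (KV-cache sizing gets its own sketch in the quantization section below):

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, dtype: str, overhead: float = 1.2) -> float:
    # one billion params at one byte each is roughly 1 GB, times a fudge factor
    return params_billion * BYTES_PER_PARAM[dtype] * overhead

print(weight_vram_gb(70, "fp8"))   # ~84 GB: two H100s, or a single H200/B200
print(weight_vram_gb(8, "fp16"))   # ~19 GB: squeezes onto a 24 GB card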

NVLink starts to matter at tensor-parallel of four or more. Below that, PCIe is fine, and the obsession with topology is mostly hobbyist. Spot instances are tempting on paper. In practice the eviction rate at peak hours is what kills the model-load economics, and your warm-pool decisions matter more than the per-hour delta. I’ve been bitten by spot enough times to mostly stop using it for serving.

Quantization and the quality/cost curve

FP8 on Hopper is a free lunch. Jin et al.’s comprehensive evaluation of quantized instruction-tuned LLMs puts FP8 within 0.4 MMLU-Pro points of FP16 across 70B-class models, with about fifteen percent throughput uplift and half the KV-cache VRAM. vLLM’s own measurements show FP8 KV-cache delivering +14.9% output throughput and –14.8% median ITL on Llama-3.1-8B versus BF16, with the decode break-even point dropping to about 7K tokens. If you’re running Hopper and you’re not using FP8 KV-cache, you’re paying for GPU you don’t need.
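
The halving is easy to see from the cache itself: one key and one value vector per layer per token. A sketch using Llama-3.3-70B’s published shape (80 layers, 8 KV heads under GQA, head dim 128):

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem  # one K and one V per layer
    return tokens * per_token_bytes / 1024**3

print(kv_cache_gb(128_000, bytes_per_elem=2.0))  # ~39 GB of cache at BF16
print(kv_cache_gb(128_000, bytes_per_elem=1.0))  # ~20 GB at FP8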

4-bit is more dangerous than the marketing suggests. The same paper shows AWQ-4 within 1.6 points and GPTQ-4 within 1.9 points on average, but Llama-3.3-70B took a 7.8-point MMLU drop at 4-bit. Modern frontier models are more fragile to aggressive quantization than the 2023 generation, not less. The honest 2026 rule of thumb: at 8B and below, AWQ-4 or GPTQ-4 are fine, and 4-bit Llama-3.1-8B sits within ±1 point of FP16. At 30B to 70B, FP8 if you’re on Hopper, NVFP4 if you’re on Blackwell, and avoid AWQ-4 unless you’ve evaluated on your actual task. If you’re long-context decode-bound, FP8 KV-cache is non-negotiable; the ITL improvement is the difference between a usable chat UI and one your users complain about. GGUF and llama.cpp are great when you’re mixing CPU and GPU offload, but a poor production serving format on their own.

Deploy: the serving stack

Three patterns cover most of what teams ship. For spiky traffic, agents, or anything with bursty user-driven load, serverless providers like Modal or Replicate are the right answer. Modal’s GPU memory snapshots cut vLLM cold start from forty-five seconds to five for Qwen2.5-0.5B, and similar reductions for ViT (8.5s to 2.25s) and Parakeet ASR (20s to 2s). Cold-start economics are basically the only thing that makes serverless viable for LLMs. Before snapshot APIs, every cold container re-paid the model-load tax from scratch.

For steady production, the mainstream stack is Kubernetes plus vLLM as a Deployment, plus KEDA for autoscaling on a custom metric (queue depth or KV-cache occupancy, not request rate, more on that below). BentoML’s post on 25x faster cold starts on Kubernetes covers the image-streaming and parallel-weight-loading work that turns a multi-minute pod start into something tolerable. If you’re running multi-model pipelines (ASR to LLM to TTS, or embed to rerank to generate), Ray Serve and BentoML earn their keep, because the optimisation target is end-to-end latency across heterogeneous models, not per-model throughput.

The vLLM flags that show up in every production config I’ve written:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --enable-chunked-prefill

The two flags that pay off most often are --enable-prefix-caching and --kv-cache-dtype fp8. Speculative decoding with a small draft model can roughly double throughput at the cost of implementation complexity, so I usually leave it off until everything else is stable.
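
Whatever flags you settle on, the server exposes an OpenAI-compatible API, on port 8000 by default, so the client side is just the standard openai package pointed at your own box. A minimal streaming sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise this incident report in three bullets."}],
    max_tokens=256,
    stream=True,  # streaming is also the easiest way to eyeball TTFT and ITL client-side
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)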

Operate: monitoring and scaling

Four metrics actually matter. TTFT (time to first token) is your prefill latency; page on P95 above three seconds for chat. ITL (inter-token latency, sometimes called TPOT) is decode latency; page on P95 above fifty milliseconds for chat or a hundred for batch. KV-cache occupancy sustained above ninety percent means you’re queueing. Queue time itself is the leading indicator of saturation, and it’s the metric I trust most.

vLLM exposes all of these at /metrics in Prometheus format with names like vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, vllm:e2e_request_latency_seconds, vllm:num_requests_running, vllm:num_requests_waiting, and vllm:gpu_cache_usage_perc. Glukhov’s Prometheus and Grafana writeup is the most practical guide I’ve found for wiring it all together if you don’t want to invent it from scratch.
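
If you want eyes on the two saturation signals before the dashboards exist, a few lines against /metrics will do. A quick sketch, assuming the default port and the metric names above:

import requests

def scrape_saturation(url: str = "http://localhost:8000/metrics") -> dict:
    wanted = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting")
    values = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith(wanted):
            name, value = line.rsplit(" ", 1)   # Prometheus text format: "name{labels} value"
            values[name.split("{")[0]] = float(value)
    return values

print(scrape_saturation())  # e.g. {'vllm:gpu_cache_usage_perc': 0.87, 'vllm:num_requests_waiting': 12.0}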

Autoscale on KV-cache occupancy or queue time, not request rate. Request-rate autoscaling is a 2019 web-services reflex that breaks the moment one user sends a 100K-token prompt, which they will, eventually, on a Tuesday afternoon, while you’re in standup. The same request rate that was fine yesterday is a paged GPU today, because the traffic that landed is twenty times longer per call. I’ve had this conversation with three different teams in the last year. Don’t be the fourth.
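
The scaling decision I actually want, expressed as plain logic rather than a KEDA manifest. Thresholds are illustrative, not recommendations; tune them against your own traffic:

def desired_replicas(current: int, kv_cache_usage: float, requests_waiting: float,
                     max_replicas: int = 8) -> int:
    if kv_cache_usage > 0.90 or requests_waiting > 16:    # saturated: scale out now
        return min(current + 1, max_replicas)
    if kv_cache_usage < 0.40 and requests_waiting == 0:   # idle: scale in slowly
        return max(current - 1, 1)
    return current                                        # otherwise hold steady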

Cost reality

The headline numbers under realistic forty-to-sixty percent utilisation, not nameplate, look roughly like this. Llama-3.3-70B in FP8 on an H100 SXM5 at $2.40 an hour gets you about 400 tokens per second sustained, which works out to $1.67 per million output tokens. Llama-3.1-8B in FP8 on a single L40S at $1.50 an hour gets you about 6,000 tokens per second, or $0.07 per million. Llama-3.1-405B in FP4 on B200 nodes lands around twenty to fifty cents per million at frontier-model quality, if you can stomach Blackwell’s operational complexity. GPT-4o by comparison is around $2.50/M input, $10/M output, blended effective rate $3.75 to $4.38 with caching.

The undiscussed cost is engineer-hours. A self-hosted vLLM cluster consumes roughly half an SRE FTE, paid in postmortems and capacity planning. That’s real money. A hundred thousand dollars a year that doesn’t show up in your AWS invoice but does show up on the recruiter’s spreadsheet. Self-host wins when annualised API spend > GPU spend + 0.5 SRE FTE. Below the line you’re paying yourself to feel sovereign. Above it, you’re saving real money and getting real control. Pick the right side, and the rest of this playbook pays for itself in a quarter.