Production LLM Inference with vLLM

A production postmortem on migrating from naive HuggingFace pipeline() inference to vLLM — and the 3x latency improvement that came with it.

The Problem

Our Llama 3 8B model was serving at 800ms p95 with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40% — most time was spent in memory allocation and batch scheduling.

The Fix

vLLM’s PagedAttention — eliminated KV cache fragmentation, GPU memory utilization jumped to 90%+
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom

Results

Metric	Before	After
p95 Latency	800ms	250ms
Throughput	15 req/s	48 req/s
GPU Utilization	40%	92%
Cost/1K requests	$0.12	$0.04

Takeaway

vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.