Production LLM Inference with vLLM


A production postmortem on migrating from naive Hugging Face pipeline() inference to vLLM, and the roughly 3x p95 latency improvement that came with it.

The Problem

Our Llama 3 8B model was serving at a p95 latency of 800 ms with Hugging Face's default inference. At 500 concurrent users, GPU utilization was only 40%: most of the time went to memory allocation and batch scheduling rather than to compute.

The Fix
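
The exact configuration used in this migration is not reproduced here. As a rough illustration of what moving to vLLM can look like, the sketch below uses vLLM's offline `LLM` and `SamplingParams` classes. The checkpoint name, the `gpu_memory_utilization` and `max_num_seqs` values, the sampling settings, and the helper names `build_engine_kwargs` and `generate` are all assumptions chosen for the example, not the settings from this postmortem.

```python
from typing import Any


def build_engine_kwargs(
    model: str = "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed checkpoint
    gpu_memory_utilization: float = 0.90,  # assumed fraction of GPU memory to use
    max_num_seqs: int = 256,  # assumed cap on sequences scheduled per step
) -> dict[str, Any]:
    """Collect the keyword arguments passed to vllm.LLM (hypothetical helper)."""
    return {
        "model": model,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_num_seqs": max_num_seqs,
    }


def generate(prompts: list[str], max_tokens: int = 128) -> list[str]:
    """Run all prompts through vLLM and return one completion per prompt."""
    # Imported lazily so this module still loads on machines without vLLM or a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    # Sampling values here are illustrative assumptions.
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=max_tokens)
    # The whole list is submitted at once; the engine decides how to batch it.
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Calling `generate(["Explain continuous batching in one sentence."])` on a GPU host with vLLM installed would return a one-element list of generated text. For online serving, vLLM also ships an OpenAI-compatible HTTP server, which is probably the closer fit for a 500-user workload; the offline API is shown only because it is the shortest self-contained example.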

Results

| Metric | Before | After |
| --- | --- | --- |
| p95 Latency | 800 ms | 250 ms |
| Throughput | 15 req/s | 48 req/s |
| GPU Utilization | 40% | 92% |
| Cost/1K requests | $0.12 | $0.04 |
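
To make the "3x" figure concrete, the ratios behind these numbers can be computed directly. The snippet below is only arithmetic on the reported values; the dictionary keys and the `improvement_ratios` helper are names made up for the example.

```python
# Reported measurements, copied from the results above.
before = {"p95_ms": 800, "req_per_s": 15, "gpu_util_pct": 40, "usd_per_1k": 0.12}
after = {"p95_ms": 250, "req_per_s": 48, "gpu_util_pct": 92, "usd_per_1k": 0.04}


def improvement_ratios(before: dict, after: dict) -> dict:
    """Express each metric as a 'times better' factor (hypothetical helper)."""
    return {
        # Lower is better for latency and cost, so divide before by after.
        "latency_speedup": before["p95_ms"] / after["p95_ms"],
        "cost_reduction": before["usd_per_1k"] / after["usd_per_1k"],
        # Higher is better for throughput and utilization, so divide after by before.
        "throughput_gain": after["req_per_s"] / before["req_per_s"],
        "utilization_gain": after["gpu_util_pct"] / before["gpu_util_pct"],
    }


ratios = improvement_ratios(before, after)
for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
```

Latency and throughput both come out at 3.2x, cost per 1K requests at 3.0x, and GPU utilization at 2.3x.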

Takeaway

vLLM is production-ready. The continuous batching alone is worth the migration. If you're still using transformers.pipeline() for serving, you're likely leaving about 3x in latency, throughput, and cost on the table, as we were.
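
Why continuous batching matters can be seen in a toy model. The sketch below is not vLLM's scheduler: it ignores prefill, KV-cache memory, and preemption, and only counts decode steps for a fixed set of requests. It compares static batching, where a batch runs until its longest request finishes, with continuous batching, where a freed slot is refilled at the next step. The function names and the example request lengths are assumptions made for illustration.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle in the batch until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every in-flight request emits one token; finished requests leave.
        running = [r - 1 for r in running if r > 1]
    return steps


# Assumed workload: output lengths in tokens, all positive, mixed long and short.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # prints "200 110"
```

In this toy workload the second long request starts at step 11 instead of step 101, so the same eight requests finish in 110 decode steps rather than 200. Real gains depend on the mix of request lengths and on memory limits that the model above leaves out.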