Built a comprehensive LLM observability platform on top of Weights & Biases to track production model health.
What We Track
- Token economics — cost per request, budget burn rate, model routing efficiency
- Quality metrics — hallucination rate, RAGAS scores, user feedback signals
- Performance — latency p50/p95/p99, throughput, GPU utilization
- Drift detection — embedding drift on input distributions, output length anomalies
Architecture
- W&B Tables for structured experiment comparison
- Custom W&B Weave evaluations for automated quality checks
- Slack alerts on metric threshold breaches
- Weekly auto-generated reports for stakeholder visibility
Impact
Caught a 15% accuracy regression within 2 hours of a model update — previously would have gone unnoticed for days.