Most RAG systems are evaluated by “vibes” — someone reads a few outputs and says “looks good.” This talk presents a systematic evaluation framework used in production.
Key Points
- RAGAS metrics — faithfulness, answer relevancy, context precision, context recall
- Human preference ranking — A/B testing with domain experts using blind evaluation
- Regression testing — golden dataset of 200+ question-answer pairs, run on every deployment
- Failure taxonomy — categorizing RAG failures (retrieval miss, context overflow, hallucination, wrong attribution)
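A failure taxonomy is easiest to track when reviewers assign each failure a fixed label. The sketch below is one hypothetical way to encode the four categories listed above and tally them per review session. The names `RagFailure` and `tally_failures` and the sample labels are illustrative assumptions, not part of the talk.

```python
from collections import Counter
from enum import Enum


class RagFailure(Enum):
    """The four failure categories named in the list above."""
    RETRIEVAL_MISS = "retrieval miss"
    CONTEXT_OVERFLOW = "context overflow"
    HALLUCINATION = "hallucination"
    WRONG_ATTRIBUTION = "wrong attribution"


def tally_failures(labels):
    """Count how often each failure category was assigned by reviewers."""
    counts = Counter(labels)
    # Report every category, including zeros, so weekly tallies stay comparable.
    return {failure: counts.get(failure, 0) for failure in RagFailure}


# Hypothetical labels from one review session (illustrative data only).
labels = [
    RagFailure.RETRIEVAL_MISS,
    RagFailure.HALLUCINATION,
    RagFailure.RETRIEVAL_MISS,
]
summary = tally_failures(labels)
for failure, count in summary.items():
    print(f"{failure.value}: {count}")
```

Keeping the label set closed makes counts comparable across weeks; a new failure mode would be added as a new enum member rather than as free text.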
The Framework
- Build a golden dataset with verified answers and source citations
- Run RAGAS metrics on every PR that touches the RAG pipeline
- Flag any metric that regresses by more than 2% for human review
- Manually evaluate edge cases and new failure modes every week
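The regression check in the steps above can be sketched as a comparison between baseline scores and the scores from the current run. This is a minimal sketch, not the speakers' implementation: it assumes metric scores on a 0 to 1 scale that were already computed elsewhere, and it reads the 2% threshold as an absolute drop of 0.02 (the abstract does not say whether the threshold is absolute or relative). The function name `find_regressions` and all scores are hypothetical.

```python
# The four metric names from the abstract, written as dictionary keys.
METRICS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
)

# Assumption: "2%" is treated as an absolute drop of 0.02 on a 0-1 scale.
THRESHOLD = 0.02


def find_regressions(baseline, candidate, threshold=THRESHOLD):
    """Return the metrics whose score dropped by more than `threshold`."""
    regressions = {}
    for name in METRICS:
        drop = baseline[name] - candidate[name]
        if drop > threshold:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical scores: baseline from the main branch, candidate from a PR.
baseline = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.88,
    "context_precision": 0.80,
    "context_recall": 0.76,
}
candidate = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.84,
    "context_precision": 0.79,
    "context_recall": 0.76,
}

flagged = find_regressions(baseline, candidate)
if flagged:
    print("Needs human review:", flagged)
else:
    print("No regressions above threshold")
```

With these numbers only `answer_relevancy` is flagged: it drops by 0.04, while the 0.01 drop in `context_precision` stays under the threshold. In a CI job the flagged result could fail the check or request a reviewer, depending on how strict the team wants the gate to be.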
Audience
AI Engineers building production RAG systems who want to move beyond “it seems to work” to measurable, tracked quality.