Evaluating RAG Systems — Beyond Vibes

Most RAG systems are evaluated by “vibes” — someone reads a few outputs and says “looks good.” This talk presents a systematic evaluation framework used in production.

Key Points

RAGAS metrics — faithfulness, answer relevancy, context precision, context recall
Human preference ranking — A/B testing with domain experts using blind evaluation
Regression testing — golden dataset of 200+ question-answer pairs, run on every deployment
Failure taxonomy — categorizing RAG failures (retrieval miss, context overflow, hallucination, wrong attribution)
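
One way to make the failure taxonomy trackable is to encode the four categories as an enumeration and count reviewer-assigned labels per category. This is a minimal sketch, not the speaker's implementation: the names `RagFailure` and `tally_failures` and the sample labels are hypothetical, and only the four category names come from the list above.

```python
from collections import Counter
from enum import Enum


class RagFailure(Enum):
    # The four categories named in the talk; the enum itself is illustrative.
    RETRIEVAL_MISS = "retrieval miss"
    CONTEXT_OVERFLOW = "context overflow"
    HALLUCINATION = "hallucination"
    WRONG_ATTRIBUTION = "wrong attribution"


def tally_failures(labels):
    """Count labeled failures per category, keeping categories with zero hits."""
    counts = Counter(labels)
    return {failure: counts.get(failure, 0) for failure in RagFailure}


# Hypothetical labels assigned by a reviewer to three failed answers.
reviewed = [
    RagFailure.RETRIEVAL_MISS,
    RagFailure.HALLUCINATION,
    RagFailure.RETRIEVAL_MISS,
]
summary = tally_failures(reviewed)
for failure, count in summary.items():
    print(f"{failure.value}: {count}")
```

Reporting zero-count categories explicitly keeps week-over-week summaries comparable even when a failure mode temporarily disappears.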

The Framework

Build a golden dataset with verified answers and source citations
Run RAGAS metrics on every PR that touches the RAG pipeline
Flag any metric that regresses by more than 2% for human review
Manually evaluate edge cases and new failure modes every week
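
The regression check in the steps above could be sketched as a comparison between baseline and candidate scores. This is an illustration under stated assumptions, not the framework's actual code: the function `find_regressions`, the score dictionaries, and their values are hypothetical, the metric scores are assumed to be precomputed (for example by RAGAS) on a 0 to 1 scale, and "2%" is read here as an absolute drop of 0.02, which the talk does not specify.

```python
METRICS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
)

# Assumption: "2%" is interpreted as an absolute drop of 0.02 on a 0-1 score.
THRESHOLD = 0.02


def find_regressions(baseline, candidate, threshold=THRESHOLD):
    """Return {metric: drop} for each metric that fell by more than threshold."""
    regressions = {}
    for metric in METRICS:
        drop = baseline[metric] - candidate[metric]
        if drop > threshold:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores: main branch versus the PR under review.
baseline = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.88,
    "context_precision": 0.80,
    "context_recall": 0.76,
}
candidate = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.84,
    "context_precision": 0.79,
    "context_recall": 0.76,
}

flagged = find_regressions(baseline, candidate)
if flagged:
    print("Needs human review:", flagged)
```

In a CI job, a non-empty result would typically mark the PR for review rather than fail the build outright, since the flagged change still needs a human judgment.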

Audience

AI engineers building production RAG systems who want to move beyond “it seems to work” to measurable, tracked quality.