Evaluating RAG Systems — Beyond Vibes

Talk at AI Engineer Summit

Most RAG systems are evaluated by “vibes” — someone reads a few outputs and says “looks good.” This talk presents a systematic evaluation framework used in production.

Key Points

The Framework

  1. Build a golden dataset with verified answers and source citations
  2. Run RAGAS metrics on every PR that touches the RAG pipeline
  3. Flag any metric that regresses by more than 2% for human review
  4. Manually evaluate edge cases and new failure modes every week
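
The regression check in step 3 can be sketched as a small gate that runs after the metrics are computed. This is a minimal illustration, not the speaker's implementation: the function name `find_regressions`, the metric names, and the scores are hypothetical, and it assumes the 2% threshold means a relative drop against the baseline score, which the talk does not specify.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.02,  # assumption: 2% is a relative drop vs. baseline
) -> List[str]:
    """Return the metrics whose score dropped by more than `threshold`."""
    flagged = []
    for name, base_score in baseline.items():
        if name not in current:
            # A metric that disappeared is treated as a regression to review.
            flagged.append(name)
            continue
        if base_score <= 0:
            # A relative drop is undefined for a non-positive baseline; skip.
            continue
        relative_drop = (base_score - current[name]) / base_score
        if relative_drop > threshold:
            flagged.append(name)
    return flagged


# Hypothetical scores from the main branch and from a PR branch.
baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.80}
pr_scores = {"faithfulness": 0.87, "answer_relevancy": 0.79}

for metric in find_regressions(baseline_scores, pr_scores):
    print(f"Needs human review: {metric}")
```

In a CI job, a non-empty result would typically mark the PR for review rather than fail it outright, since step 3 routes regressions to a human instead of blocking automatically.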

Audience

AI engineers building production RAG systems who want to move beyond “it seems to work” to measurable, tracked quality.