RAGEval: Evaluating Retrieval-Augmented Generation with LLMs

Categories: AI, RAG, Evaluation, LLMs, Search

Author: Atila Madai

Published: July 26, 2025

[Figure: RAG System Evaluation Flow]

A modular framework to assess trust, citation quality, and response relevance in RAG pipelines — powered by large language models.

Overview

This project explores a prototype evaluation framework for AI search engines using:

  • Retrieval-Augmented Generation (RAG) for grounded, document-backed responses.
  • A Judge Module that uses large language models to score relevance, citation quality, and fluency of the generated output.

It is inspired by the architectures and product-level rigor seen at companies like Perplexity, Anthropic, and OpenAI — with a strong emphasis on system trustworthiness and response fidelity.

Components

1. RAG Pipeline

  • Supports BM25, Dense Retrieval, and LLM-powered retrievers:
    • LLM-powered retrievers use embeddings or query rewriting informed by models from OpenAI or Mistral, or by local encoders.
  • Hybrid retrieval merges dense and sparse search for higher recall (see the fusion sketch after this list).
  • Retrieved context is injected into prompts passed to an LLM (OpenAI, Mistral, etc.).
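
Hybrid fusion can be implemented several ways; the sketch below uses reciprocal rank fusion (RRF) to merge the sparse and dense rankings, then injects the top passages into the generation prompt. The function names and the k=60 constant are illustrative assumptions, not the repo's actual interfaces.

# Minimal RRF sketch: merge sparse (BM25) and dense rankings, best first.
def reciprocal_rank_fusion(bm25_ranked, dense_ranked, k=60, top_n=5):
    """Merge two ranked lists of doc IDs; k=60 is the conventional RRF constant."""
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def build_prompt(query, passages):
    """Inject retrieved context ahead of the user query."""
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the sources below.\n\n{context}\n\nQuery: {query}"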

2. Judge Module

  • Uses an LLM to evaluate responses for:
    • Relevance to the user query
    • Citation fidelity — whether the output is supported by retrieved passages
    • Fluency & factuality
  • Optionally compares multiple LLM outputs or agent chains (a minimal judge call is sketched below).
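
A judge call might look like the following sketch, assuming the OpenAI Python SDK (>= 1.0). The rubric, the JSON field names, and the judge_response helper are illustrative assumptions, not the project's actual interface.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG answer.
Query: {query}
Retrieved passages: {passages}
Answer: {answer}

Return only JSON with keys: relevance (0-10), citation_match (yes/no),
hallucination_risk (low/medium/high), fluency (short description)."""

def judge_response(query, passages, answer):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, passages="\n---\n".join(passages), answer=answer)}],
        temperature=0,  # deterministic scoring
    )
    # Assumes the model complies with the JSON-only instruction.
    return json.loads(resp.choices[0].message.content)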

Use Cases

  • Measure and improve trustworthiness of AI-generated responses.
  • Prototype evaluation loops for internal LLM deployment or research.
  • Train scoring models on LLM-labeled A/B data (an example record follows).
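
For the A/B use case, one LLM-labeled training record might look like this; the schema is an assumption, not a format the project defines.

ab_record = {
    "query": "Explain RAG in LLMs",
    "answer_a": "...",        # output from pipeline variant A
    "answer_b": "...",        # output from pipeline variant B
    "judge_preference": "a",  # which answer the LLM judge preferred
    "judge_scores": {"a": {"relevance": 9.5}, "b": {"relevance": 7.0}},
}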

Project Structure

.
├── /data/               # Corpus or indexed docs
├── /retrievers/         # BM25, Dense, LLM-powered, Hybrid
├── /generators/         # LLM wrappers (OpenAI, Mistral, Claude)
├── /judges/             # LLM-based and rule-based evaluators
├── /evaluation/         # Metrics, logs, side-by-side tools
├── /dashboards/         # Optional UI (Streamlit, Gradio)
└── main.py              # CLI pipeline runner

Setup & Run

conda create -n rageval python=3.10
conda activate rageval
pip install -r requirements.txt

python main.py --query "Explain RAG in LLMs" --judge gpt-4
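
The command above implies a CLI roughly like the following argparse sketch; any flags beyond --query and --judge are assumptions.

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Run the RAGEval pipeline")
    parser.add_argument("--query", required=True, help="user query to answer")
    parser.add_argument("--judge", default="gpt-4", help="judge model name")
    parser.add_argument("--retriever", default="hybrid",
                        choices=["bm25", "dense", "llm", "hybrid"],
                        help="retriever backend (assumed flag)")
    return parser.parse_args()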

Sample Output

Query: Explain RAG in LLMs
LLM Answer: Retrieval-Augmented Generation combines external knowledge retrieval with generative reasoning...

Judge Output:
- Relevance: 9.5 / 10
- Citation Match: Yes
- Hallucination Risk: Low
- Language Fluency: Native / Clear

Evaluation Metrics

Metric             Description
-----------------  ---------------------------------------------
Relevance Score    Does the output match query intent?
Citation Fidelity  Are facts traceable to retrieved sources?
Fluency Score      Is the language coherent and well-structured?
Latency (ms)       Time to generate and score the response
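
These metrics can be collected into one record per query; the dataclass below is a sketch whose field names mirror the table, with latency measured as wall-clock time around each pipeline stage.

import time
from dataclasses import dataclass

@dataclass
class EvalRecord:
    query: str
    relevance: float          # 0-10, from the judge
    citation_fidelity: bool   # facts traceable to retrieved sources?
    fluency: float            # 0-10, from the judge
    latency_ms: float         # time to generate and score the response

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_ms) for any pipeline stage."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0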

Author: Atila Madai
GitHub | Blog