Your RAG System Works Fine Until It Doesn’t
You’ve built it. Five minutes of tinkering with LangChain or LlamaIndex, a vector database that cost three dollars to spin up, and boom — your chatbot is answering questions about your documentation. It’s perfect. Your local tests look great. You show your boss. Everyone’s impressed.
Then you deploy it to actual users.
Suddenly the system’s confidence is terrifying. It confidently hallucinates when it doesn’t know something. It retrieves technically accurate context but answers the wrong question. It floods the LLM with 47 irrelevant chunks when it only needed two. Your users start losing trust. You’re scrambling to figure out why, armed only with vibes and the vague sense that something’s broken.
This is where most RAG systems fail — not in the concept, but in the evaluation. Demos reward confidence and plausibility. Production punishes them both.
If you’re shipping RAG to anyone — your users, your company, or even just yourself in production — you need metrics that measure what actually matters: Does the answer stay grounded in retrieved context? Did retrieval find the right documents? Does the answer actually answer the question? And did retrieval miss anything critical?
This is where Ragas comes in.
The Problem with “Does It Look Good?”
Here’s the thing about RAG systems: they fail in subtle, invisible ways. A traditional ML model trains on labeled data and validates against a held-out test set, which gives you clear feedback and reproducible metrics. RAG systems? They’re orchestrations of three separate stages: retrieval, ranking (often implicit in how chunks are ordered in the LLM’s context window), and generation.
Each stage can quietly degrade, and you won’t notice until a user points it out.
Faithfulness failures: The LLM generates an answer that sounds plausible but has nothing to do with the retrieved context. The LLM hallucinated because the context was sparse, or it misunderstood, or it just felt like making something up.
Relevancy failures: The system answers a different question than what was asked. You ask “How do I set up WireGuard on Ubuntu?” and it explains how to configure the WireGuard protocol in general. Technically correct, not actually useful.
Retrieval precision failures: The system retrieves 50 chunks when 5 would’ve done it. This wastes tokens, slows inference, and dilutes the signal with noise. The LLM has to hunt through irrelevance to find the useful bit.
Retrieval recall failures: The system misses critical context. You ask about a specific feature, and the retrieval step brings back chunks about the feature’s cousin. You miss the exact answer by one semantic hop.
These aren’t dramatic failures. They’re the slow death of user trust.
The problem is evaluating them. You can’t just eyeball ten examples and call it good. You need metrics, and you need them to run automatically against your entire system.
Enter Ragas.
What Is Ragas?
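Ragas (Retrieval-Augmented Generation Assessment) is an open-source evaluation framework that measures RAG system quality using four core metrics, all computed with an LLM-as-judge approach. No human raters. Faithfulness and answer relevancy need only the question, the retrieved context, and the answer; context precision and context recall also compare against a reference answer, which can come from a synthetic test set rather than hand labeling. The goal is metrics that track real user experience.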
It’s simple to integrate: add a few lines of Python, point it at your RAG pipeline, and get back quantitative scores.
The framework is model-agnostic (works with any LLM you can wrap as a judge), framework-agnostic (LangChain, LlamaIndex, or your custom pipeline), and workable on a home lab budget: you can point Ragas at a cheaper judge model (like Gemini Flash or a local Llama) so eval loops don’t cost more than your actual system.
Let’s talk about the four metrics that matter.
The Four Ragas Metrics (And Why They Matter)
1. Faithfulness: Did the Answer Stay Grounded?
Faithfulness measures whether the generated answer is supported by the retrieved context. The metric answers: “Did the LLM invent things?”
How it works: Ragas feeds the answer and context to a judge LLM with a prompt like: “Given this context, is this answer grounded in what you see, or did the model make things up?” The judge breaks the answer into atomic claims and checks each one against the context.
Scores range from 0 to 1. A score of 0.95 means 95% of the answer’s claims are backed by the retrieved context.
Why it matters: This is where hallucination lives. A high faithfulness score means the answers stick to what was retrieved instead of inventing details. It does not prove the answers are true, only that they match the context, so bad context can still produce faithful wrong answers.
Real example: You ask about Docker volume drivers, and retrieval brings back chunks about volume persistence. If the answer adds “Docker image layers use a copy-on-write strategy” (true in general, but not in the context), faithfulness catches it as an unsupported claim.
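The scoring arithmetic behind faithfulness is a simple ratio. The sketch below assumes the judge has already split an answer into claims and labeled each one as supported or not; the faithfulness_score helper and the example verdicts are illustrative and not part of the Ragas API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context."""
    if not verdicts:
        raise ValueError("need at least one claim to score")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge output: 4 claims, one of them not backed by the context.
claim_verdicts = [True, True, True, False]
print(faithfulness_score(claim_verdicts))  # 0.75
```

A single unsupported claim in a short answer moves the score a lot, which is why per-question scores are worth reading alongside the average.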
2. Answer Relevancy: Did It Answer the Right Question?
Answer relevancy measures whether the generated answer is actually responsive to the user’s query. It’s the difference between a technically correct answer and a useful one.
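How it works: Rather than asking the judge for a yes/no verdict, Ragas has it generate a few questions that the answer would be a good response to, then embeds those questions and measures their cosine similarity to the original question. If the answer drifts off-topic or is incomplete, the reverse-engineered questions stop resembling what the user asked. This is why the metric needs an embedding model as well as a judge LLM.
Scores typically range from 0 to 1. A score of 0.88 means the questions implied by the answer are very close to the one that was actually asked.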
Why it matters: Your RAG system could retrieve perfect context and faithfully answer a completely different question. This metric catches that drift.
Real example: You ask “How do I debug a WireGuard connection?” and the system explains how to install WireGuard. Both answers are about WireGuard, but one doesn’t answer the question.
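As far as the Ragas documentation describes it, answer relevancy is computed by generating questions from the answer and averaging their cosine similarity to the original question in embedding space. Here is a minimal sketch of that final averaging step, with made-up 3-dimensional vectors standing in for real embeddings (the answer_relevancy_score name is hypothetical):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer_relevancy_score(question_vec: list[float],
                           generated_question_vecs: list[list[float]]) -> float:
    """Mean similarity between the asked question and questions implied by the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy vectors (assumed, not real embeddings)
original = [1.0, 0.0, 0.0]
on_topic = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]   # close to the original question
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]]  # questions about something else

print(answer_relevancy_score(original, on_topic))   # about 0.9
print(answer_relevancy_score(original, off_topic))  # 0.0
```

In the WireGuard example, an answer about installation would produce generated questions like “How do I install WireGuard?”, which sit further from the debugging question and drag the average down.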
3. Context Precision: Was the Retrieved Context Useful?
Context precision measures how much of the retrieved context was actually relevant to answering the question. It’s the signal-to-noise ratio of retrieval.
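How it works: The judge LLM scores each retrieved chunk: “Is this chunk helpful for answering the question?” As a first approximation the score is the ratio of useful chunks to total chunks: retrieve 10 chunks with 7 useful and you land near 0.7. The Ragas implementation is also rank-aware, so useful chunks near the top of the list score higher than the same chunks buried under noise.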
Scores range from 0 to 1. Higher is better, but perfect isn’t always realistic.
Why it matters: Bad retrieval floods your token budget with noise. If your system retrieves 100 chunks when 10 would suffice, context precision catches the bloat. You can then adjust your retrieval strategy (shorter chunks, hybrid BM25 plus semantic search, reranking, etc.).
Real example: You ask “What’s the difference between Docker volumes and bind mounts?” Retrieval brings back 20 chunks: 8 about volumes, 5 about bind mounts (both useful), and 7 about Docker networking (noise). As a plain ratio, precision is 13/20 = 0.65.
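A plain useful-over-total ratio ignores ordering, while the Ragas docs describe context precision as rank-aware: it averages precision@k over the positions that hold a relevant chunk. Here is a sketch of both, assuming the judge has already produced a 0/1 relevance verdict per retrieved chunk in rank order (the function names are illustrative):

```python
def plain_ratio(relevance: list[int]) -> float:
    """Useful chunks divided by total chunks, ignoring order."""
    return sum(relevance) / len(relevance)


def rank_aware_precision(relevance: list[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


good_first = [1, 1, 0, 0]   # useful chunks ranked at the top
noise_first = [0, 0, 1, 1]  # same chunks, buried under noise

print(plain_ratio(good_first), plain_ratio(noise_first))  # 0.5 0.5
print(rank_aware_precision(good_first))                   # 1.0
print(rank_aware_precision(noise_first))                  # lower, about 0.42
```

Both lists contain the same chunks, so the gap between the two rank-aware scores is purely an ordering effect. That is the part a reranker can fix without changing what gets retrieved.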
4. Context Recall: Did Retrieval Miss Anything?
Context recall measures whether all relevant context was retrieved. It answers: “Did the system miss critical information?”
How it works: This one’s trickier, because it needs a reference answer (ground truth) for each question. Ragas breaks that reference answer into individual statements and asks the judge whether each one can be attributed to the retrieved context. Every statement that has no support in the context pulls recall down.
Scores range from 0 to 1. A score of 0.9 means 90% of the statements in the reference answer are covered by what was retrieved.
Why it matters: Perfect precision with terrible recall means you retrieved the right things, but not enough things. The user gets a partial answer.
Real example: You ask “Explain Docker’s networking modes.” The reference answer covers four modes, but the system only retrieves chunks that cover three of them. Recall is around 0.75.
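Recall boils down to counting which reference statements found support in the retrieved chunks. The sketch below assumes the judge has already returned one verdict per statement; the statements are placeholders, and both helper names are invented for illustration.

```python
def context_recall_score(attributed: dict[str, bool]) -> float:
    """Share of reference-answer statements supported by the retrieved context."""
    if not attributed:
        raise ValueError("reference answer produced no statements")
    return sum(attributed.values()) / len(attributed)


def missing_statements(attributed: dict[str, bool]) -> list[str]:
    """Statements the retriever failed to cover, useful for debugging recall."""
    return [statement for statement, found in attributed.items() if not found]


# Hypothetical judge verdicts for a four-statement reference answer
verdicts = {
    "mode A is the default": True,
    "mode B shares the host network stack": True,
    "mode C disables networking": True,
    "mode D spans multiple hosts": False,
}
print(context_recall_score(verdicts))  # 0.75
print(missing_statements(verdicts))    # ['mode D spans multiple hosts']
```

The list of missing statements is often more useful than the score itself, because it tells you which documents the retriever should have found.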
Building the Evaluation Loop
Now let’s wire this up. You’ll need:
- A RAG pipeline (your question, retrieval, and generation steps)
- Ragas installed
- A test dataset (Ragas can generate one synthetically)
- A judge LLM (local or API)
- An eval script that runs the metrics and reports results
Step 1: Install and Import
pip install ragas langchain-community langchain-openai chromadb
If you’re using Ollama or local models:
pip install ragas langchain-community
The code in this article follows the Ragas 0.1.x API. Newer releases renamed several imports, so pin the version or check the docs for the release you have installed.
Step 2: Create Your Test Dataset
Ragas has a synthetic dataset generator that creates Q&A pairs from your actual documents. This is the big time-saver: no manual labeling needed to get started.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load your documents (TextLoader avoids the extra "unstructured" dependency)
loader = DirectoryLoader("./docs", glob="*.md", loader_cls=TextLoader)
documents = loader.load()

# The 0.1.x generator expects a "filename" entry in each document's metadata
for document in documents:
    document.metadata["filename"] = document.metadata["source"]

# Initialize the generator (uses OpenAI models to create questions)
generator = TestsetGenerator.with_openai()

# Generate a test set
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,  # 50 Q&A pairs
    distributions={
        simple: 0.5,         # straightforward questions
        multi_context: 0.3,  # questions needing multiple chunks
        reasoning: 0.2,      # questions requiring reasoning
    },
)

# Save the test set as a list of records (question, contexts, ground_truth, ...)
testset.to_pandas().to_json("testset.json", orient="records")
This generates 50 synthetic questions from your documents with no manual effort. Ragas creates diverse questions: simple lookups, multi-context questions (needing multiple chunks), and reasoning questions.
Step 3: Evaluate Your RAG Pipeline
Now the actual eval. You run your pipeline over the test questions, collect what it retrieved and what it answered, and hand that to Ragas:
import json

from datasets import Dataset
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.run_config import RunConfig

# Load your test set (assumed to be a list of records with
# "question" and "ground_truth" fields)
with open("testset.json") as f:
    test_data = json.load(f)

# Define your RAG pipeline (LangChain example).
# Assumes your documents are already indexed in this collection;
# pass persist_directory=... to reopen an index saved on disk.
vector_store = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

template = """Answer based on this context only:
{context}

Question: {question}
Answer:"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
generate = prompt | llm | StrOutputParser()

# Run the pipeline on every test question, keeping both the retrieved
# chunks and the generated answer so Ragas can score them
rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for case in test_data:
    docs = retriever.invoke(case["question"])
    contexts = [doc.page_content for doc in docs]
    answer = generate.invoke(
        {"context": "\n\n".join(contexts), "question": case["question"]}
    )
    rows["question"].append(case["question"])
    rows["answer"].append(answer)
    rows["contexts"].append(contexts)
    rows["ground_truth"].append(case["ground_truth"])

dataset = Dataset.from_dict(rows)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
    llm=ChatOpenAI(model="gpt-3.5-turbo"),  # judge model
    embeddings=OpenAIEmbeddings(),
    run_config=RunConfig(timeout=60),
)

# Print results (to_pandas() gives one row per question)
print(results)
scores = results.to_pandas()
print(f"\nAverage Faithfulness: {scores['faithfulness'].mean():.3f}")
print(f"Average Answer Relevancy: {scores['answer_relevancy'].mean():.3f}")
print(f"Average Context Precision: {scores['context_precision'].mean():.3f}")
print(f"Average Context Recall: {scores['context_recall'].mean():.3f}")
This runs your pipeline on every test question and then scores the outputs on each metric, using the cheaper gpt-3.5-turbo as the judge model. You’ll get back averages and per-question scores.
Using Ollama instead of OpenAI? Swap the LLM initialization:
from langchain_community.llms import Ollama

judge_llm = Ollama(model="llama2")  # or mistral, neural-chat, etc.
# Then pass llm=judge_llm to evaluate()
Reading the Results
A typical evaluation output looks like:
Average Faithfulness: 0.87
Average Answer Relevancy: 0.92
Average Context Precision: 0.68
Average Context Recall: 0.81
Here’s what you do with that:
Faithfulness 0.87: Pretty good. It means 87% of claims in answers are grounded. As a rule of thumb, if this drops below 0.8, your LLM is hallucinating too much. Fix: tighten the prompt (like “answer only from context, and say so when the context doesn’t cover it”), try a model that follows such instructions more reliably, or improve retrieval quality.
Answer Relevancy 0.92: Solid. The system is answering what’s asked. If this drops below 0.85, your prompt might be unclear, or the retriever is bringing back off-topic stuff.
Context Precision 0.68: Here’s the problem. You’re retrieving too much noise: roughly 1 in 3 chunks is irrelevant or outranks something useful. Fix: reduce k (number of retrieved chunks), use a reranker, or improve your embedding quality.
Context Recall 0.81: You’re missing some relevant context. Fix: increase k, improve the embedding model, or add keyword-based retrieval (BM25) alongside semantic search.
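One way to make these rules of thumb enforceable is a small gate that compares the averages against floors and reports what fell short. The faithfulness and answer relevancy floors below are the thresholds mentioned above; the precision and recall floors are placeholder values, not recommendations, so tune them to your own baseline. The failing_metrics helper is hypothetical.

```python
FLOORS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.85,
    "context_precision": 0.70,  # assumed floor, adjust to your baseline
    "context_recall": 0.75,     # assumed floor, adjust to your baseline
}


def failing_metrics(averages: dict[str, float],
                    floors: dict[str, float] = FLOORS) -> dict[str, float]:
    """Return {metric: shortfall} for every metric below its floor."""
    failures = {}
    for name, floor in floors.items():
        value = averages.get(name, 0.0)  # a missing metric counts as failing
        if value < floor:
            failures[name] = round(floor - value, 3)
    return failures


averages = {
    "faithfulness": 0.87,
    "answer_relevancy": 0.92,
    "context_precision": 0.68,
    "context_recall": 0.81,
}
print(failing_metrics(averages))  # {'context_precision': 0.02}
```

Wired into CI, a non-empty result can fail the build, which turns “precision looks a bit low” into something that blocks a merge.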
Common Gotchas and Workarounds
Judge Model Bias
The judge LLM itself has opinions. LLM judges tend to favor text that resembles their own output, so a GPT-4 judge may be more lenient with GPT-4 answers and harsher on Llama outputs. Use a consistent judge model across all evals, or use multiple judges and average their scores.
Gotcha: Scores from different judge models aren’t comparable, and the gap can be large. Pick one and stick with it.
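If you go the multiple-judges route, the averaging itself is easy to sketch. The example below assumes you already have per-question scores for one metric from each judge, aligned by question; the judge names and numbers are invented, and combine_judges is not a Ragas function.

```python
from statistics import mean, pstdev


def combine_judges(scores_by_judge: dict[str, list[float]]) -> list[dict[str, float]]:
    """Per-question mean and spread across judges; lists must align by question."""
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same questions")
    combined = []
    for per_question in zip(*scores_by_judge.values()):
        combined.append({"mean": mean(per_question), "spread": pstdev(per_question)})
    return combined


# Hypothetical faithfulness scores for two questions from two judges
scores = {
    "judge_a": [0.9, 0.5],
    "judge_b": [0.7, 0.5],
}
for row in combine_judges(scores):
    print(row)
```

A large spread on a question is a signal in its own right: the judges disagree, so that example deserves a manual look before you trust either score.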
Expensive Eval Runs
Evaluating 100+ questions with a $0.01/1K token judge model isn’t cheap. A single eval run might cost $5–10. If you’re iterating rapidly, this adds up.
Workaround: Start with a small test set (20 questions), refine your pipeline, then scale to 100+. Use cheaper models for iteration (gpt-3.5-turbo, local mistral), then validate with a better judge before shipping.
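A back-of-the-envelope estimate makes the trade-off concrete. The tokens-per-question figure below is an assumption (four metrics, each making several judge calls), not a measured number; check your provider’s usage dashboard for the real one.

```python
def eval_cost_usd(questions: int, tokens_per_question: int,
                  usd_per_1k_tokens: float) -> float:
    """Rough judge cost for one eval run."""
    return questions * tokens_per_question / 1000 * usd_per_1k_tokens


TOKENS_PER_QUESTION = 7_500  # assumed total judge tokens across all four metrics

print(eval_cost_usd(100, TOKENS_PER_QUESTION, 0.01))  # 7.5
print(eval_cost_usd(20, TOKENS_PER_QUESTION, 0.01))   # 1.5
```

At these assumed numbers, iterating on 20 questions costs a fifth of a full run, which is the whole argument for starting small.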
Synthetic Test Sets Aren’t Perfect
Ragas generates questions from your documents, but the generator has its own biases. It might miss edge cases or generate questions that are trivial for your domain.
Workaround: Start with synthetic, then mix in 10-20% hand-written questions from actual users or domain experts. This catches what the generator missed.
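Mixing the two sources can be done with a few lines of sampling. This sketch assumes each test case is a dict and that you fix the total size up front; the field names and the mix_test_sets helper are made up for the example.

```python
import random


def mix_test_sets(synthetic: list[dict], handwritten: list[dict],
                  total: int = 50, handwritten_share: float = 0.2,
                  seed: int = 0) -> list[dict]:
    """Sample a test set where roughly `handwritten_share` of cases are hand-written."""
    if not 0 <= handwritten_share <= 1:
        raise ValueError("handwritten_share must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the eval set stable across runs
    n_hand = min(len(handwritten), round(total * handwritten_share))
    n_syn = min(len(synthetic), total - n_hand)
    mixed = rng.sample(handwritten, n_hand) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed


synthetic = [{"question": f"synthetic question {i}", "source": "synthetic"} for i in range(50)]
handwritten = [{"question": f"user question {i}", "source": "user"} for i in range(12)]

mixed = mix_test_sets(synthetic, handwritten, total=50, handwritten_share=0.2)
print(len(mixed), sum(case["source"] == "user" for case in mixed))  # 50 10
```

Keeping the seed fixed matters more than it looks: if the test set changes between runs, you can’t tell a pipeline improvement from a lucky sample.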
Low Recall, High Precision
You’re retrieving the perfect chunks, but not enough of them. Precision is 0.95, but recall is 0.6.
Fix: Increase k (number of chunks) in your retriever. Go from 5 to 10 and re-eval. If recall improves without hurting precision, you’ve found your sweet spot.
High Recall, Low Precision
You’re retrieving everything, including noise. Recall is 0.95, but precision is 0.4.
Fix: Add a reranker (like Cohere’s reranking API or a local cross-encoder). Retrieve 20 chunks, rerank to top 5. This kills noise while keeping the good stuff.
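The retrieve-wide-then-rerank pattern looks like this in miniature. The scoring function is pluggable; the keyword_overlap scorer here is a toy stand-in chosen so the example runs without any model, and in practice you would swap in a cross-encoder or a reranking API. All names are illustrative.

```python
from typing import Callable


def rerank(question: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Sort retrieved chunks by relevance score and keep the best top_n."""
    ranked = sorted(chunks, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:top_n]


def keyword_overlap(question: str, chunk: str) -> float:
    """Toy scorer: share of question words that also appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


retrieved = [
    "docker networking overview",
    "bind mounts map a host path into the container",
    "docker volumes and bind mounts compared",
    "kubernetes ingress basics",
]
for chunk in rerank("docker bind mounts", retrieved, keyword_overlap, top_n=2):
    print(chunk)
# docker volumes and bind mounts compared
# bind mounts map a host path into the container
```

Because the reranker only reorders and truncates, recall at the retrieval stage is untouched: the wide first pass protects recall and the narrow second pass buys back precision.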
Beyond Ragas: What It Doesn’t Measure
Ragas is powerful, but it’s not the whole story. Here’s what it doesn’t catch:
Latency: Ragas measures quality, not speed. Your system could be perfectly faithful but take 30 seconds to answer. Measure retrieval latency and generation latency separately.
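Timing the two stages separately takes only the standard library. The retrieve and generate callables below are stubs that sleep to simulate work; replace them with your own retriever and LLM calls. The helper names are hypothetical.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Call fn(*args) and return its result plus elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def answer_with_timings(question: str, retrieve: Callable, generate: Callable):
    """Run retrieval and generation, timing each stage on its own."""
    chunks, retrieval_s = timed(retrieve, question)
    answer, generation_s = timed(generate, question, chunks)
    return answer, {"retrieval_s": retrieval_s, "generation_s": generation_s}


# Stub stages standing in for a real retriever and LLM
def fake_retrieve(question: str) -> list[str]:
    time.sleep(0.01)
    return ["chunk one", "chunk two"]


def fake_generate(question: str, chunks: list[str]) -> str:
    time.sleep(0.02)
    return f"answer built from {len(chunks)} chunks"


answer, timings = answer_with_timings("How do volumes work?", fake_retrieve, fake_generate)
print(answer)
print(timings)
```

Logging these two numbers next to the Ragas scores shows whether a quality gain (say, from a reranker or a larger k) was paid for in latency.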
Cost: Ragas doesn’t know about token costs. You might have perfect metrics using GPT-4 but blow your budget.
User satisfaction: Metrics correlate with real satisfaction, but they’re not identical. A 0.9 faithfulness score doesn’t guarantee users will trust your system. Run a side-by-side user test to validate that your metrics actually predict behavior.
Domain-specific quality: Faithfulness and relevancy are general. But in specialized domains (medicine, law), you might need custom metrics. A legally accurate answer still fails the user if it doesn’t cite the specific statute they asked about. For cases like that, build custom rubrics on top of Ragas or look at a framework such as DeepEval.
The Honest Truth About RAG Evaluation
Here’s what most RAG projects get wrong: they ship without baseline metrics. They tweak the system based on feelings. Then something breaks in production, and they have no way to know whether the fix actually helped.
Ragas solves this. It gives you numbers you can track. Before you change retrieval strategy, you have a baseline. After you add a reranker, you know if precision actually improved. You can A/B test configurations (10 chunks vs. 5, gpt-3.5 vs. gpt-4, reranker on or off) and see which combination wins.
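Comparing a baseline run against a candidate run can be as simple as a per-metric delta with a noise margin. The 0.02 margin below is an assumption to absorb judge-to-judge jitter, not a Ragas default, and the candidate scores are invented to show the shape of the output.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 min_delta: float = 0.02) -> dict[str, tuple[float, str]]:
    """Label each metric as improved, regressed, or unchanged versus the baseline."""
    report = {}
    for metric, base_score in baseline.items():
        delta = round(candidate[metric] - base_score, 3)
        if delta >= min_delta:
            verdict = "improved"
        elif delta <= -min_delta:
            verdict = "regressed"
        else:
            verdict = "unchanged"  # inside the assumed noise margin
        report[metric] = (delta, verdict)
    return report


baseline = {"context_precision": 0.68, "context_recall": 0.81}
with_reranker = {"context_precision": 0.80, "context_recall": 0.80}  # made-up scores

for metric, (delta, verdict) in compare_runs(baseline, with_reranker).items():
    print(f"{metric}: {delta:+.2f} ({verdict})")
# context_precision: +0.12 (improved)
# context_recall: -0.01 (unchanged)
```

Run both configurations against the same test set and the same judge; otherwise the deltas measure the judge or the sample rather than your change.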
The metrics aren’t perfect. They’re biased by the judge model. They don’t measure user trust directly. But they’re infinitely better than “it feels right.”
Ship a RAG system with a Ragas eval loop, and you’ve got the feedback you need to keep improving it. Skip it, and you’re flying blind.
Your users will notice the difference.
Next Steps
- Install Ragas: pip install ragas
- Load your documents: Use DirectoryLoader or your own pipeline
- Generate a test set: Let the synthetic generator do the work
- Run baseline evals: Measure your current system
- Iterate: Tweak retrieval, test precision/recall, measure the impact
- Monitor in production: Keep running evals on real user questions
The whole loop from docs to eval results is maybe 50 lines of code. No magic. Just the discipline to measure before you ship.