RAGAS: Evaluating RAG Without Vibes

By SumGuy 9 min read

Your RAG Is Flying Blind

You built a RAG pipeline. You threw some docs at a vector store, wired up LangChain or LlamaIndex, and asked it a few questions. It answered correctly — at least the five questions you happened to test. You shipped it.

Three weeks later someone files a bug: “the bot confidently told me something that wasn’t in any of our docs.” Classic. You’ve been doing RAG evaluation by vibes, and vibes don’t catch regressions.

RAGAS (Retrieval Augmented Generation Assessment) is an open-source eval framework that replaces gut-feel testing with reproducible, comparable metrics. Run it locally against Ollama, gate your CI pipeline on quality scores, and stop shipping retrieval failures to production.

This article assumes you’ve already got a RAG pipeline. If not, check out LangChain vs LlamaIndex first, or the RAG on a budget guide if you’re doing this on commodity hardware.


What RAGAS Actually Measures

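RAGAS is built around four core metrics, each catching a different failure mode, plus two more specialized ones (context entity recall and noise sensitivity) that close out this list.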

Faithfulness — Does the answer contain only claims that are supported by the retrieved context? A hallucination detector. Score of 1.0 means every statement in the answer is traceable to a retrieved chunk. Score of 0.4 means your LLM is making stuff up between the lines.

Answer Relevance — Is the answer actually addressing the question asked? Catches the “technically correct but useless” failure: the model regurgitates tangentially related content instead of answering directly.

Context Precision — Of the chunks you retrieved, how many were actually useful for answering the question? A low score means your retriever is pulling noise. You’re paying for tokens you don’t need and potentially confusing the generator.

Context Recall — Did you retrieve all the chunks you actually needed to answer correctly? This requires a ground-truth answer to compare against. Low recall means relevant docs exist in your store but your retriever is missing them.

Context Entity Recall — A stricter version of recall that checks whether named entities in the ground truth show up in your retrieved context. Great for knowledge-intensive domains where names, dates, and specific terms matter.

Noise Sensitivity — How badly does irrelevant context hurt answer quality? Inject some garbage into the retrieved chunks and measure the score drop.
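To make the scoring concrete, here is a rough sketch of the arithmetic behind three of these metrics. It assumes the judge has already produced a true/false verdict per claim or per chunk. The function names and boolean inputs are illustrative and not part of the RAGAS API, and the rank-weighted form of context precision is one common formulation rather than a guaranteed match for every RAGAS version.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores zero
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant):
    """Rank-aware precision: mean of precision@k taken at each relevant chunk."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def context_recall_score(truth_claim_found):
    """Fraction of ground-truth claims that appear in the retrieved context."""
    if not truth_claim_found:
        return 0.0
    return sum(truth_claim_found) / len(truth_claim_found)


# Five claims in the answer, three of them backed by the context.
print(faithfulness_score([True, True, False, True, False]))  # 0.6

# Same two useful chunks, but ranked first vs. ranked last.
print(round(context_precision_score([True, False, True]), 3))  # 0.833
print(round(context_precision_score([False, True, True]), 3))  # 0.583
```

The last two lines show why retrieval order matters: the same chunks score lower when the noise is ranked first.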


The Dataset Format

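RAGAS needs a specific input shape. Three fields are always required (question, answer, contexts); ground_truth is needed only by the metrics that compare against a reference answer: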

dataset_structure.py
from datasets import Dataset

# Minimum viable RAGAS dataset
data = {
    "question": [
        "What is the default timeout for requests in our API?",
        "How do I enable debug logging?",
    ],
    "answer": [
        "The default timeout is 30 seconds.",
        "Set LOG_LEVEL=DEBUG in your environment variables.",
    ],
    "contexts": [
        # List of retrieved chunks per question (must be a list of strings)
        ["API configuration: timeout defaults to 30s. ...", "Request handling docs..."],
        ["Logging config: LOG_LEVEL controls verbosity. DEBUG enables all output."],
    ],
    "ground_truth": [
        # What the correct answer should be (used for recall metrics)
        "The default request timeout is 30 seconds.",
        "Set the LOG_LEVEL environment variable to DEBUG.",
    ],
}

dataset = Dataset.from_dict(data)

The contexts field is a list of lists — one list of chunks per question. Whatever your retriever returned for that question, dump it here raw. Don’t summarize or concatenate; RAGAS needs the individual chunks.

ground_truth is only required for context recall and a few other metrics. If you’re running faithfulness and answer relevance only, you can omit it. But honestly, writing ground truth answers upfront is the best investment you’ll make in your RAG project.
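Shape mistakes (a bare string where a list of chunks belongs, columns of different lengths) are easy to make and annoying to debug mid-run. Below is a minimal sketch of a pre-flight check on the column dict before handing it to Dataset.from_dict. The helper name validate_ragas_columns and its need_ground_truth flag are hypothetical, not something RAGAS ships.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts")


def validate_ragas_columns(data, need_ground_truth=False):
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    required = list(REQUIRED_COLUMNS)
    if need_ground_truth:
        required.append("ground_truth")

    problems = [f"missing column: {name}" for name in required if name not in data]
    if problems:
        return problems

    # Every column must have one entry per question.
    n_rows = len(data["question"])
    for name in required:
        if len(data[name]) != n_rows:
            problems.append(f"{name} has {len(data[name])} rows, expected {n_rows}")

    # contexts is a list of lists: one list of chunk strings per question.
    for i, chunks in enumerate(data["contexts"]):
        if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
            problems.append(f"contexts[{i}] must be a list of strings")

    return problems


good = {
    "question": ["How do I enable debug logging?"],
    "answer": ["Set LOG_LEVEL=DEBUG in your environment variables."],
    "contexts": [["Logging config: LOG_LEVEL controls verbosity."]],
}
bad = dict(good, contexts=["Logging config: LOG_LEVEL controls verbosity."])

print(validate_ragas_columns(good))  # []
print(validate_ragas_columns(bad))   # ['contexts[0] must be a list of strings']
```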


Running RAGAS Locally with Ollama

You don’t need OpenAI for any of this. RAGAS uses an LLM as the judge, and that judge can be a local model.

Terminal window
pip install ragas langchain-community ollama
eval_local.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from datasets import Dataset

# Use a capable local model as the judge
judge_llm = ChatOllama(model="qwen2.5:14b", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Your dataset (built from your pipeline's outputs)
data = {
    "question": ["What is the retention policy for audit logs?"],
    "answer": ["Audit logs are retained for 90 days."],
    "contexts": [["Audit logs: retention period is 90 days per compliance policy."]],
    "ground_truth": ["Audit logs are retained for 90 days."],
}
dataset = Dataset.from_dict(data)

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge_llm,
    embeddings=embeddings,
)
print(result)
# {'faithfulness': 1.0, 'answer_relevancy': 0.92, 'context_precision': 1.0, 'context_recall': 1.0}

Qwen 2.5 14B and Llama 3.1 8B both work reasonably well as judges for most English-language evaluation. Smaller models (3B, 7B) tend to produce noisy scores — the judge needs enough reasoning capacity to decompose claims and check them against context.

Temperature zero on the judge is non-negotiable. You want deterministic verdicts, not creative ones.


Plugging In Your Actual Pipeline

You probably already have a LangChain or LlamaIndex pipeline. The trick is capturing what the retriever actually returned alongside what the LLM answered.

LangChain

langchain_eval.py
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOllama
from langchain_community.vectorstores import Chroma

llm = ChatOllama(model="llama3.1:8b")
# your_embeddings is a placeholder: pass the same embedding function you used to build the index
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=your_embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # Critical: RAGAS needs the retrieved docs
)

questions = ["What ports does the service expose?", "How do I reset admin credentials?"]
answers, contexts = [], []

for q in questions:
    result = qa_chain.invoke({"query": q})
    answers.append(result["result"])
    # Extract page_content from each returned Document
    contexts.append([doc.page_content for doc in result["source_documents"]])

# Now build the RAGAS dataset
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": [
        "The service exposes ports 8080 and 443.",
        "Run reset-admin.sh from the container.",
    ],
}

LlamaIndex

llamaindex_eval.py
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)

retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

questions = ["What is the max file upload size?"]
answers, contexts = [], []

for q in questions:
    response = query_engine.query(q)
    answers.append(str(response))
    contexts.append([node.get_content() for node in response.source_nodes])

Same pattern: capture the retrieved nodes alongside the answer. RAGAS doesn’t care how you built your pipeline — it just needs the inputs and outputs.
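Since the capture loop is identical in both frameworks, it can be pulled into one framework-agnostic helper. The sketch below assumes you wrap your pipeline in a callable that takes a question and returns (answer, list_of_chunk_strings). Both collect_ragas_columns and the fake_pipeline stand-in are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def collect_ragas_columns(
    questions: Sequence[str],
    ground_truths: Sequence[str],
    run_pipeline: Callable[[str], Tuple[str, List[str]]],
) -> Dict[str, list]:
    """Run each question through the pipeline and return RAGAS-shaped columns."""
    if len(questions) != len(ground_truths):
        raise ValueError("need exactly one ground truth per question")

    answers, contexts = [], []
    for question in questions:
        answer, chunks = run_pipeline(question)
        answers.append(answer)
        contexts.append(list(chunks))  # keep the individual chunks, unmerged

    return {
        "question": list(questions),
        "answer": answers,
        "contexts": contexts,
        "ground_truth": list(ground_truths),
    }


# Stand-in for a real LangChain or LlamaIndex wrapper, so the sketch runs on its own.
def fake_pipeline(question):
    return f"stub answer to: {question}", ["chunk one", "chunk two"]


columns = collect_ragas_columns(
    ["What is the max file upload size?"],
    ["<human-written ground truth goes here>"],
    fake_pipeline,
)
print(columns["contexts"])  # [['chunk one', 'chunk two']]
```

For LangChain, the wrapper would return result["result"] and the page_content list; for LlamaIndex, str(response) and the node contents.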


CI Gating on Quality Metrics

This is where it actually pays off. Set a minimum score threshold and fail the build if you drop below it.

ci_eval.py
import json
import sys

from datasets import Dataset
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Same judge setup as eval_local.py
judge_llm = ChatOllama(model="qwen2.5:14b", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Load your golden test dataset (checked into the repo)
with open("tests/ragas_golden_set.json") as f:
    data = json.load(f)
dataset = Dataset.from_dict(data)

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=judge_llm,
    embeddings=embeddings,
)

THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}

failed = []
for metric, threshold in THRESHOLDS.items():
    score = result[metric]
    status = "PASS" if score >= threshold else "FAIL"
    print(f"{metric}: {score:.2f} (threshold {threshold}) [{status}]")
    if score < threshold:
        failed.append(metric)

if failed:
    print(f"\nFailed metrics: {', '.join(failed)}")
    sys.exit(1)

print("\nAll metrics passed.")
sys.exit(0)
Terminal window
# In your CI pipeline
python ci_eval.py
# faithfulness: 0.91 (threshold 0.85) [PASS]
# answer_relevancy: 0.78 (threshold 0.80) [FAIL]
# context_precision: 0.82 (threshold 0.75) [PASS]
#
# Failed metrics: answer_relevancy

Keep your golden test set small — 20 to 50 questions is enough to catch regressions. More than that and you’ll be waiting 20 minutes for eval runs and people will start skipping them.


How RAGAS Compares to DeepEval and TruLens

You’ve got options here. They’re not all the same.

DeepEval overlaps heavily with RAGAS on RAG metrics but adds LLM output quality checks like hallucination, toxicity, and G-Eval (a generalized LLM-as-judge framework). It has a cleaner pytest integration if your team is already pytest-native. The tradeoff: it’s more opinionated and the free tier has some limits on the hosted dashboard.

TruLens is built around the “RAG Triad” — groundedness (essentially faithfulness), context relevance, and answer relevance. It has first-class tracing integration with LangChain and LlamaIndex and a nice local dashboard for interactive debugging. Better for exploratory analysis; RAGAS is better for automated CI gates.

RAGAS wins on raw metric coverage and local-first design. The Ollama integration is mature, the dataset format is portable, and there’s no phone-home requirement. If you’re running self-hosted and don’t want your eval traffic leaving the box, RAGAS is the move.


Caveats That Will Bite You

Judge model bias is real. The LLM you use as the judge has opinions. A judge model fine-tuned on instruction-following will score “helpful but unfaithful” answers higher than a reasoning model will. Run your eval pipeline with two different judge models occasionally and compare. If the scores diverge by more than 0.1 on faithfulness, your judge is editorializing.
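The two-judge comparison is easy to automate once you have the aggregate scores from each run as plain dicts. This is a minimal sketch; judge_divergence is a hypothetical helper, and the scores below are made-up numbers for illustration, not measurements.

```python
def judge_divergence(scores_a, scores_b, tolerance=0.1):
    """Return {metric: absolute gap} for metrics where two judges differ by more than tolerance."""
    shared = scores_a.keys() & scores_b.keys()
    return {
        metric: round(abs(scores_a[metric] - scores_b[metric]), 4)
        for metric in sorted(shared)
        if abs(scores_a[metric] - scores_b[metric]) > tolerance
    }


# Illustrative scores from two runs over the same dataset with different judges.
judge_a = {"faithfulness": 0.91, "answer_relevancy": 0.84}
judge_b = {"faithfulness": 0.76, "answer_relevancy": 0.81}

print(judge_divergence(judge_a, judge_b))  # {'faithfulness': 0.15}
```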

Ground truth annotation is expensive. Context recall and answer correctness metrics require human-written ground truth answers. For a 50-question test set, that’s probably a few hours of work for someone who actually knows your domain. Don’t skip this step by auto-generating ground truth with the same model you’re evaluating — that’s circular reasoning dressed up as testing.

Scores are relative, not absolute. A faithfulness score of 0.85 doesn’t mean “85% of answers are good.” It means “on this dataset, with this judge, 85% of generated claims were verifiable from context.” Switch the judge model, the dataset, or the chunk size, and the number moves. Track trends over time on the same config — don’t compare raw scores across different setups.
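One way to enforce "same config only" is to store the config next to the scores and refuse to compare runs whose configs differ. The sketch below assumes a simple record shape ({"config": ..., "scores": ...}) and a max_drop of 0.05; both are illustrative choices, as are the score values.

```python
def find_regressions(baseline, current, max_drop=0.05):
    """Return {metric: (old, new)} for metrics that fell by more than max_drop."""
    if baseline["config"] != current["config"]:
        raise ValueError("configs differ; scores are not comparable")

    return {
        metric: (baseline["scores"][metric], score)
        for metric, score in current["scores"].items()
        if metric in baseline["scores"]
        and baseline["scores"][metric] - score > max_drop
    }


config = {"judge": "qwen2.5:14b", "dataset": "tests/ragas_golden_set.json", "chunk_size": 256}

# Illustrative numbers only.
baseline = {"config": config, "scores": {"faithfulness": 0.91, "context_precision": 0.82}}
current = {"config": config, "scores": {"faithfulness": 0.84, "context_precision": 0.80}}

print(find_regressions(baseline, current))  # {'faithfulness': (0.91, 0.84)}
```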

Small datasets amplify noise. With 10 questions, a single bad answer can swing your average faithfulness by as much as 0.10. Run at least 20 questions per evaluation, and use diverse question types (factual, comparison, how-to, edge cases).
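The arithmetic behind that swing is worth seeing once. The sketch below takes the worst case, where one answer drops from a perfect 1.0 to 0.0 and every other answer stays perfect, and shows how the effect on the mean shrinks as the test set grows. The helper name is illustrative.

```python
from statistics import mean


def swing_from_one_failure(n_questions):
    """Drop in the mean score when one of n perfect answers falls to 0.0."""
    all_good = [1.0] * n_questions
    one_bad = [0.0] + [1.0] * (n_questions - 1)
    return round(mean(all_good) - mean(one_bad), 3)


for n in (10, 20, 50):
    print(n, swing_from_one_failure(n))
# 10 0.1
# 20 0.05
# 50 0.02
```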


The Actual Payoff

Here’s the before/after when you add RAGAS to a real project: you stop arguing about whether the RAG “feels worse” after a chunk-size change and start looking at a number. Context precision went from 0.71 to 0.88 when we switched from 512-token chunks to 256-token chunks with 10% overlap on a documentation corpus. That’s a real signal.

Your 2 AM self debugging a production hallucination will appreciate having a test suite that would’ve caught it. Vibes don’t ship to prod — metrics do.

