Agentic RAG

Beyond simple retrieve-then-generate: agents that intelligently decide when, what, and how to retrieve, then critique and correct their own retrieval.

The RAG Evolution

RAG Architecture Evolution
BASIC RAG              AGENTIC RAG           SELF-RAG              CORRECTIVE RAG
──────────────         ──────────────        ──────────────        ──────────────

Query                  Query                 Query                 Query
  │                      │                     │                     │
  ▼                      ▼                     ▼                     ▼
┌─────────┐          ┌─────────┐          ┌─────────┐          ┌─────────┐
│ ALWAYS  │          │ DECIDE  │          │ DECIDE  │          │ ALWAYS  │
│RETRIEVE │          │IF NEEDED│          │IF NEEDED│          │RETRIEVE │
└────┬────┘          └────┬────┘          └────┬────┘          └────┬────┘
     │                    │                    │                    │
     ▼                    ▼                    ▼                    ▼
┌─────────┐          ┌─────────┐          ┌─────────┐          ┌─────────┐
│ Vector  │          │ Multiple│          │ Retrieve│          │ GRADE   │
│ Search  │          │ Tools   │          │ + Grade │          │ EACH    │
└────┬────┘          └────┬────┘          │Relevance│          │ DOCUMENT│
     │                    │               └────┬────┘          └────┬────┘
     │                    │                    │                    │
     │                    │                    ▼               ┌────┴────┐
     │                    │               ┌─────────┐          │ CORRECT │
     │                    │               │ Generate│          │ AMBIG.  │
     │                    │               │+ Self-  │          │ INCORR. │
     │                    │               │ Critique│          └────┬────┘
     ▼                    ▼               └────┬────┘               │
┌─────────┐          ┌─────────┐               │                    ▼
│GENERATE │          │GENERATE │               ▼               ┌─────────┐
└─────────┘          └─────────┘          ┌─────────┐          │GENERATE │
                                          │ Revise  │          └─────────┘
                                          │if Needed│
                                          └─────────┘


GRAPH RAG
──────────────

Query
  │
  ├──────────────┐
  ▼              ▼
┌─────────┐  ┌─────────┐
│ Extract │  │ Vector  │
│Entities │  │ Search  │
└────┬────┘  └────┬────┘
     │            │
     ▼            │
┌─────────┐       │
│ Graph   │       │
│Traversal│       │
└────┬────┘       │
     │            │
     └─────┬──────┘
           ▼
      ┌─────────┐
      │ COMBINE │
      │ Context │
      └────┬────┘
           ▼
      ┌─────────┐
      │GENERATE │
      └─────────┘
Approach         When to Retrieve   Quality Control         Best For
Basic RAG        Always             None                    Simple Q&A
Agentic RAG      Agent decides      Tool selection          Varied queries
Self-RAG         Agent decides      Self-critique           Accuracy critical
Corrective RAG   Always             Grade + correct         Noisy retrieval
Graph RAG        Always (dual)      Structured + semantic   Entity-rich domains

RAG approach comparison

1. Basic RAG (Baseline)

The simplest RAG architecture: always retrieve, then generate. No intelligence about whether retrieval is needed or if retrieved documents are relevant.

Basic RAG Implementation
# Basic RAG: Always retrieve, then generate

function basicRAG(query, vectorStore, llm):
    # Step 1: Embed the query
    queryEmbedding = embedModel.encode(query)

    # Step 2: Retrieve relevant documents
    documents = vectorStore.search(
        embedding: queryEmbedding,
        topK: 5
    )

    # Step 3: Build context from retrieved docs
    context = formatDocuments(documents)

    # Step 4: Generate response with context
    prompt = """
    Use the following context to answer the question.
    If the context doesn't contain the answer, say "I don't know."

    Context:
    {context}

    Question: {query}
    """

    response = llm.generate(prompt)
    return response

# Limitations:
# - Always retrieves, even for simple questions
# - No quality check on retrieved documents
# - Single retrieval pass (may miss info)
# - No reasoning about what to retrieve
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

def create_basic_rag(documents: list[str], collection_name: str = "docs"):
    """Create a basic RAG pipeline."""

    # Initialize embeddings
    embeddings = OpenAIEmbeddings()

    # Create vector store
    vectorstore = Chroma.from_texts(
        texts=documents,
        embedding=embeddings,
        collection_name=collection_name
    )

    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}
    )

    # Create QA chain
    llm = ChatOpenAI(model="gpt-4", temperature=0)

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Simple: stuff all docs into context
        retriever=retriever,
        return_source_documents=True
    )

    return qa_chain

# Usage
qa = create_basic_rag(my_documents)
result = qa.invoke({"query": "What is the refund policy?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")]
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;
using OpenAI;

public class BasicRagPipeline
{
    private readonly IVectorStore _vectorStore;
    private readonly IChatClient _chatClient;
    private const string CollectionName = "documents";

    public BasicRagPipeline(string apiKey)
    {
        var openAI = new OpenAIClient(apiKey);

        // Initialize chat client
        _chatClient = openAI
            .GetChatClient("gpt-4o")
            .AsIChatClient();

        // Initialize embedding generator (in a complete pipeline this is
        // wired into the vector store so records are embedded on upsert/search)
        var embeddingClient = openAI
            .GetEmbeddingClient("text-embedding-3-small")
            .AsIEmbeddingGenerator<string, Embedding<float>>();

        _vectorStore = new InMemoryVectorStore();
    }

    public async Task IndexDocumentsAsync(
        IEnumerable<(string Id, string Text)> documents,
        CancellationToken ct = default)
    {
        var collection = _vectorStore.GetCollection<string, DocumentRecord>(CollectionName);
        await collection.CreateCollectionIfNotExistsAsync(ct);

        foreach (var (id, text) in documents)
        {
            var record = new DocumentRecord { Id = id, Text = text };
            await collection.UpsertAsync(record, ct);
        }
    }

    public async Task<string> QueryAsync(
        string query,
        int topK = 5,
        CancellationToken ct = default)
    {
        // Step 1: Retrieve relevant documents
        var collection = _vectorStore.GetCollection<string, DocumentRecord>(CollectionName);
        var results = await collection.VectorizedSearchAsync(query, topK, ct);

        // Step 2: Build context
        var context = string.Join("\n\n", results.Select(r => r.Record.Text));

        // Step 3: Generate response
        var prompt = $@"Use the following context to answer the question.
If the context doesn't contain the answer, say ""I don't know.""

Context:
{context}

Question: {query}";

        var response = await _chatClient.GetResponseAsync(prompt, ct);
        return response.Text;
    }
}

public class DocumentRecord
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = "";

    [VectorStoreRecordData]
    public string Text { get; set; } = "";

    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}

Limitations

Basic RAG retrieves for every query (even "hello"), uses whatever is retrieved regardless of quality, and only does one retrieval pass.

2. Agentic RAG

An agent with retrieval tools that decides when retrieval is needed, which tool to use, and what query to formulate:

Agentic RAG Implementation
# Agentic RAG: Agent decides when and what to retrieve

class AgenticRAG:
    tools: [
        searchDocuments(query, filters),  # Vector search
        lookupEntity(entityName),          # Knowledge base lookup
        webSearch(query),                  # External search
        noRetrieval()                      # Answer from knowledge
    ]

    function answer(question):
        # Agent reasons about retrieval strategy
        while not hasAnswer:
            thought = llm.reason(
                question: question,
                previousSteps: history,
                availableTools: tools
            )

            if thought.needsRetrieval:
                # Agent formulates retrieval query (may differ from question)
                retrievalQuery = thought.formulatedQuery
                results = executeRetrieval(thought.selectedTool, retrievalQuery)

                # Agent evaluates results
                evaluation = llm.evaluate(
                    question: question,
                    retrievedInfo: results
                )

                if evaluation.sufficient:
                    hasAnswer = true
                elif evaluation.needsMoreInfo:
                    # Refine and retrieve again
                    history.append(results)
                else:
                    # Try different retrieval strategy
                    continue

            else:
                # Agent can answer without retrieval
                hasAnswer = true

        return llm.generateAnswer(question, history)

# Key differences from basic RAG:
# 1. Agent DECIDES whether to retrieve
# 2. Agent FORMULATES the retrieval query
# 3. Agent EVALUATES retrieval results
# 4. Agent can do MULTIPLE retrieval rounds
# 5. Agent can use DIFFERENT retrieval tools
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from typing import TypedDict, Annotated
import operator

# Define state
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    retrieved_docs: list
    needs_retrieval: bool

# Define tools
@tool
def search_documents(query: str, max_results: int = 5) -> str:
    """Search the document store for relevant information."""
    results = vector_store.similarity_search(query, k=max_results)
    return "

".join([doc.page_content for doc in results])

@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    # Implementation with web search API
    return web_search_api.search(query)

@tool
def lookup_entity(entity_name: str) -> str:
    """Look up specific entity in knowledge base."""
    return knowledge_base.get(entity_name, "Entity not found")

# Create the agent
llm = ChatOpenAI(model="gpt-4").bind_tools([
    search_documents, search_web, lookup_entity
])

def should_retrieve(state: AgentState) -> str:
    """Decide if we need to retrieve or can answer."""
    last_message = state["messages"][-1]

    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "retrieve"
    return "answer"

def call_model(state: AgentState) -> dict:
    """Have the agent reason about what to do."""
    messages = state["messages"]

    # Add system prompt for RAG behavior
    system = """You are a helpful assistant with access to retrieval tools.

    IMPORTANT: Before answering, consider:
    1. Can you answer this from your knowledge? If yes, just respond.
    2. Does this need current/specific information? Use search_web.
    3. Does this need document lookup? Use search_documents.
    4. Is this about a specific entity? Use lookup_entity.

    Be strategic about retrieval - don't retrieve if unnecessary."""

    response = llm.invoke([{"role": "system", "content": system}] + messages)
    return {"messages": [response]}

def generate_answer(state: AgentState) -> dict:
    """Generate final answer based on retrieved context."""
    messages = state["messages"]
    docs = state.get("retrieved_docs", [])

    if docs:
        context = "\n\n".join(docs)
        messages = messages + [
            {"role": "system", "content": f"Context from retrieval:\n{context}"}
        ]

    response = llm.invoke(messages)
    return {"messages": [response]}

# Build the graph
workflow = StateGraph(AgentState)

workflow.add_node("agent", call_model)
workflow.add_node("retrieve", ToolNode([search_documents, search_web, lookup_entity]))
workflow.add_node("answer", generate_answer)

workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_retrieve, {
    "retrieve": "retrieve",
    "answer": "answer"
})
workflow.add_edge("retrieve", "agent")  # Loop back after retrieval
workflow.add_edge("answer", END)

app = workflow.compile()

# Usage
result = app.invoke({
    "messages": [{"role": "user", "content": "What's our Q3 revenue?"}],
    "retrieved_docs": [],
    "needs_retrieval": False
})
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using System.ComponentModel;
using OpenAI;

public class AgenticRagPipeline
{
    private readonly AIAgent _agent;
    private readonly IVectorStore _vectorStore;

    // Hypothetical application services (not part of the framework);
    // inject your own web search and knowledge-base implementations
    private readonly IWebSearchService _webSearchService;
    private readonly IKnowledgeBase _knowledgeBase;

    public AgenticRagPipeline(string apiKey)
    {
        var chatClient = new OpenAIClient(apiKey)
            .GetChatClient("gpt-4o")
            .AsIChatClient();

        // Create agent with retrieval tools
        _agent = chatClient.CreateAIAgent(
            name: "RAGAgent",
            instructions: @"You are a helpful assistant with retrieval tools.

Before answering, consider:
1. Can you answer from knowledge? Just respond.
2. Need current info? Use SearchWeb.
3. Need document lookup? Use SearchDocuments.
4. About specific entity? Use LookupEntity.

Be strategic - don't retrieve unnecessarily.",
            tools: [
                AIFunctionFactory.Create(SearchDocuments),
                AIFunctionFactory.Create(SearchWeb),
                AIFunctionFactory.Create(LookupEntity)
            ]
        );
    }

    [Description("Search documents for relevant information")]
    private async Task<string> SearchDocuments(
        [Description("Search query")] string query,
        [Description("Maximum results")] int maxResults = 5)
    {
        var collection = _vectorStore.GetCollection<string, DocumentRecord>("documents");
        var results = await collection.VectorizedSearchAsync(query, maxResults);
        return string.Join("\n\n", results.Select(r => r.Record.Text));
    }

    [Description("Search the web for current information")]
    private async Task<string> SearchWeb(
        [Description("Search query")] string query)
    {
        return await _webSearchService.SearchAsync(query);
    }

    [Description("Look up a specific entity in the knowledge base")]
    private async Task<string> LookupEntity(
        [Description("Entity name")] string entityName)
    {
        return await _knowledgeBase.GetAsync(entityName) ?? "Entity not found";
    }

    public async Task<string> QueryAsync(
        string question,
        CancellationToken ct = default)
    {
        var thread = _agent.GetNewThread();

        // Agent will automatically use tools as needed
        var response = await _agent.RunAsync(question, thread, ct);

        return response.Text;
    }
}

Key Capabilities

  • Retrieval Decision - Agent decides IF retrieval is needed
  • Query Formulation - Agent rewrites the query for better retrieval (sketch below)
  • Tool Selection - Agent chooses the right retrieval tool
  • Iterative Retrieval - Agent can retrieve multiple times
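
Query formulation deserves special attention: the retrieval query often should not be the user's literal question. A minimal sketch of a rewrite step, assuming the same OpenAI-style client used in the examples above (the prompt wording and JSON shape are illustrative):

import json

def reformulate_query(client, question: str, model: str = "gpt-4") -> list[str]:
    """Rewrite a user question into retrieval-friendly search queries."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"""Rewrite this question as 1-3 search queries for a
vector store: expand acronyms, drop filler words, and split multi-part
questions into separate queries.

Question: {question}

Return JSON: {{"queries": ["...", "..."]}}"""
        }],
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("queries", [question])  # fall back to the raw question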

3. Self-RAG

Self-RAG (Asai et al., 2023) adds self-reflection: the model critiques its own retrieval decisions and generation quality:

Self-RAG Implementation
# Self-RAG: Model critiques its own retrieval and generation

class SelfRAG:
    function answer(question):
        # Step 1: Decide if retrieval is needed
        retrievalDecision = llm.generate(
            prompt: "Given this question, do I need to retrieve information? [Yes/No]",
            question: question
        )

        if retrievalDecision == "No":
            # Generate without retrieval
            response = llm.generate(question)
            return selfCritique(question, response, [])

        # Step 2: Retrieve documents
        documents = retrieve(question)

        # Step 3: For each document, assess relevance
        relevantDocs = []
        for doc in documents:
            isRelevant = llm.generate(
                prompt: "Is this document relevant to the question? [Relevant/Irrelevant]",
                question: question,
                document: doc
            )
            if isRelevant == "Relevant":
                relevantDocs.append(doc)

        # Step 4: Generate response with relevant docs
        response = llm.generate(
            prompt: question,
            context: relevantDocs
        )

        # Step 5: Self-critique the response
        return selfCritique(question, response, relevantDocs)

    function selfCritique(question, response, sources):
        # Check if response is supported by sources
        supportScore = llm.generate(
            prompt: "Is this response fully supported by the sources? [Fully/Partially/No]",
            response: response,
            sources: sources
        )

        # Check if response is useful
        usefulnessScore = llm.generate(
            prompt: "How useful is this response? [5/4/3/2/1]",
            question: question,
            response: response
        )

        if supportScore == "No" or usefulnessScore < 3:
            # Regenerate with feedback
            return regenerateWithCritique(question, response, supportScore, usefulnessScore)

        return {
            response: response,
            supported: supportScore,
            usefulness: usefulnessScore,
            sources: sources
        }
from dataclasses import dataclass
from enum import Enum

class RetrievalDecision(Enum):
    YES = "yes"
    NO = "no"

class RelevanceScore(Enum):
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"

class SupportScore(Enum):
    FULLY_SUPPORTED = "fully_supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    NOT_SUPPORTED = "not_supported"

@dataclass
class SelfRAGResponse:
    answer: str
    sources: list[str]
    support_score: SupportScore
    usefulness_score: int
    retrieval_used: bool

class SelfRAG:
    def __init__(self, client, retriever, model: str = "gpt-4"):
        self.client = client
        self.retriever = retriever
        self.model = model

    def query(self, question: str) -> SelfRAGResponse:
        # Step 1: Decide if retrieval is needed
        retrieval_decision = self._decide_retrieval(question)

        if retrieval_decision == RetrievalDecision.NO:
            answer = self._generate_without_context(question)
            return self._self_critique(question, answer, [], False)

        # Step 2: Retrieve documents
        documents = self.retriever.search(question, k=5)

        # Step 3: Filter by relevance
        relevant_docs = self._filter_relevant(question, documents)

        if not relevant_docs:
            # Fall back to generation without context
            answer = self._generate_without_context(question)
            return self._self_critique(question, answer, [], True)

        # Step 4: Generate with relevant context
        answer = self._generate_with_context(question, relevant_docs)

        # Step 5: Self-critique
        return self._self_critique(question, answer, relevant_docs, True)

    def _decide_retrieval(self, question: str) -> RetrievalDecision:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Given this question, do you need to retrieve external information to answer accurately?

Question: {question}

Consider:
- Is this about specific facts, data, or recent events? -> Retrieve
- Is this about general knowledge or reasoning? -> No retrieval
- Is this about personal opinions or hypotheticals? -> No retrieval

Answer with just: YES or NO"""
            }]
        )
        answer = response.choices[0].message.content.strip().upper()
        return RetrievalDecision.YES if "YES" in answer else RetrievalDecision.NO

    def _filter_relevant(self, question: str, documents: list[str]) -> list[str]:
        relevant = []
        for doc in documents:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{
                    "role": "user",
                    "content": f"""Is this document relevant to answering the question?

Question: {question}

Document: {doc[:500]}...

Answer with just: RELEVANT or IRRELEVANT"""
                }]
            )
            if "RELEVANT" in response.choices[0].message.content.upper():
                relevant.append(doc)
        return relevant

    def _self_critique(
        self,
        question: str,
        answer: str,
        sources: list[str],
        retrieval_used: bool
    ) -> SelfRAGResponse:
        # Check support
        support_score = self._check_support(answer, sources)

        # Check usefulness
        usefulness_score = self._check_usefulness(question, answer)

        # Regenerate if quality is low
        if support_score == SupportScore.NOT_SUPPORTED and sources:
            answer = self._regenerate_with_feedback(
                question, answer, sources, "not supported by sources"
            )
            support_score = self._check_support(answer, sources)

        if usefulness_score < 3:
            answer = self._regenerate_with_feedback(
                question, answer, sources, "not useful enough"
            )
            usefulness_score = self._check_usefulness(question, answer)

        return SelfRAGResponse(
            answer=answer,
            sources=sources,
            support_score=support_score,
            usefulness_score=usefulness_score,
            retrieval_used=retrieval_used
        )

    def _check_support(self, answer: str, sources: list[str]) -> SupportScore:
        if not sources:
            return SupportScore.FULLY_SUPPORTED  # No sources to contradict

        sources_text = "
---
".join(sources)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Is this answer supported by the source documents?

Answer: {answer}

Sources:
{sources_text}

Respond with:
- FULLY_SUPPORTED: All claims in the answer are backed by sources
- PARTIALLY_SUPPORTED: Some claims are backed, others are not
- NOT_SUPPORTED: The answer contradicts or goes beyond the sources"""
            }]
        )
        content = response.choices[0].message.content.upper()
        if "FULLY" in content:
            return SupportScore.FULLY_SUPPORTED
        elif "PARTIALLY" in content:
            return SupportScore.PARTIALLY_SUPPORTED
        return SupportScore.NOT_SUPPORTED

    def _check_usefulness(self, question: str, answer: str) -> int:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Rate how useful this answer is for the question (1-5):

Question: {question}
Answer: {answer}

5 = Perfectly answers the question
4 = Good answer with minor gaps
3 = Adequate but could be better
2 = Partially helpful
1 = Not helpful

Respond with just the number."""
            }]
        )
        try:
            return int(response.choices[0].message.content.strip()[0])
        except (ValueError, IndexError):
            return 3  # default to "adequate" if the rating cannot be parsed

    def _generate_without_context(self, question: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": question}]
        )
        return response.choices[0].message.content

    def _generate_with_context(self, question: str, docs: list[str]) -> str:
        context = "\n---\n".join(docs)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"Context:\n{context}"},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content

    def _regenerate_with_feedback(
        self, question: str, answer: str, sources: list[str], problem: str
    ) -> str:
        context = "\n---\n".join(sources) if sources else "(none)"
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Your previous answer was {problem}. Rewrite it.

Question: {question}
Previous answer: {answer}
Sources:
{context}"""
            }]
        )
        return response.choices[0].message.content

Self-RAG Reflection Tokens

  • [Retrieve] - Should I retrieve? (Yes/No)
  • [IsRel] - Is this document relevant? (Relevant/Irrelevant)
  • [IsSup] - Is response supported? (Fully/Partially/No)
  • [IsUse] - Is response useful? (5/4/3/2/1)
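
The original paper fine-tunes the model to emit these tokens inline; with an off-the-shelf LLM you can approximate the idea by prompting for markers and parsing them out. A minimal sketch (the [Name=Value] marker format below is an assumption, not the paper's exact vocabulary):

import re

def parse_reflection_tokens(text: str) -> dict:
    """Extract Self-RAG-style reflection tokens from a model response.

    Expects inline markers such as [Retrieve=Yes], [IsRel=Relevant],
    [IsSup=Fully], [IsUse=4] somewhere in the generated text.
    """
    tokens = {}
    for name, value in re.findall(r"\[(Retrieve|IsRel|IsSup|IsUse)=(\w+)\]", text):
        tokens[name] = value
    # Strip the markers so only the answer text remains
    answer = re.sub(r"\[(?:Retrieve|IsRel|IsSup|IsUse)=\w+\]", "", text).strip()
    return {"answer": answer, "tokens": tokens}

# Example:
# parse_reflection_tokens("[Retrieve=Yes][IsSup=Fully] The policy allows refunds...")
# -> {"answer": "The policy allows refunds...",
#     "tokens": {"Retrieve": "Yes", "IsSup": "Fully"}}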

4. Corrective RAG (CRAG)

CRAG (Yan et al., 2024) focuses on evaluating and correcting retrieval quality before generation:

Corrective RAG Implementation
# Corrective RAG (CRAG): Evaluate and correct retrieval quality

class CorrectiveRAG:
    function answer(question):
        # Step 1: Initial retrieval
        documents = retrieve(question)

        # Step 2: Evaluate each document's relevance
        evaluations = []
        for doc in documents:
            score = evaluateRelevance(question, doc)
            evaluations.append({ doc: doc, score: score })

        # Step 3: Determine action based on evaluation
        relevantDocs = filter(evaluations, score == "Correct")
        ambiguousDocs = filter(evaluations, score == "Ambiguous")
        irrelevantDocs = filter(evaluations, score == "Incorrect")

        if allRelevant(evaluations):
            # All documents are relevant - use them directly
            action = "CORRECT"
            context = relevantDocs

        elif allIrrelevant(evaluations):
            # All documents are irrelevant - use web search
            action = "INCORRECT"
            webResults = webSearch(question)
            context = webResults

        else:
            # Mixed relevance - combine strategies
            action = "AMBIGUOUS"
            webResults = webSearch(question)
            context = relevantDocs + refineDocuments(ambiguousDocs) + webResults

        # Step 4: Generate with corrected context
        return generate(question, context)

    function evaluateRelevance(question, document):
        # Three-way classification
        prompt = """
        Evaluate if this document is relevant to the question.

        Question: {question}
        Document: {document}

        - CORRECT: Document directly helps answer the question
        - INCORRECT: Document is not relevant at all
        - AMBIGUOUS: Document is partially relevant or tangential

        Respond with: CORRECT, INCORRECT, or AMBIGUOUS
        """
        return llm.generate(prompt)

    function refineDocuments(documents):
        # Extract only the relevant portions
        refined = []
        for doc in documents:
            relevantParts = llm.extract(
                prompt: "Extract only the parts relevant to the question",
                document: doc
            )
            refined.append(relevantParts)
        return refined
import json
from dataclasses import dataclass
from enum import Enum

class RelevanceGrade(Enum):
    CORRECT = "correct"      # Directly relevant
    INCORRECT = "incorrect"  # Not relevant
    AMBIGUOUS = "ambiguous"  # Partially relevant

class RetrievalAction(Enum):
    USE_RETRIEVED = "use_retrieved"
    USE_WEB = "use_web"
    COMBINE = "combine"

@dataclass
class GradedDocument:
    content: str
    grade: RelevanceGrade
    confidence: float

class CorrectiveRAG:
    def __init__(self, client, retriever, web_search, model: str = "gpt-4"):
        self.client = client
        self.retriever = retriever
        self.web_search = web_search
        self.model = model

    def query(self, question: str) -> str:
        # Step 1: Initial retrieval
        documents = self.retriever.search(question, k=5)

        # Step 2: Grade each document
        graded_docs = [
            self._grade_document(question, doc)
            for doc in documents
        ]

        # Step 3: Determine corrective action
        action = self._determine_action(graded_docs)

        # Step 4: Build context based on action
        if action == RetrievalAction.USE_RETRIEVED:
            context = self._build_context_from_docs(graded_docs)

        elif action == RetrievalAction.USE_WEB:
            web_results = self.web_search.search(question)
            context = self._format_web_results(web_results)

        else:  # COMBINE
            # Use correct docs + refined ambiguous + web
            correct_docs = [d for d in graded_docs if d.grade == RelevanceGrade.CORRECT]
            ambiguous_docs = [d for d in graded_docs if d.grade == RelevanceGrade.AMBIGUOUS]

            refined = self._refine_ambiguous(question, ambiguous_docs)
            web_results = self.web_search.search(question)

            context = (
                self._build_context_from_docs(correct_docs) +
                "\n\n" + refined +
                "\n\n" + self._format_web_results(web_results)
            )

        # Step 5: Generate answer
        return self._generate(question, context)

    def _grade_document(self, question: str, document: str) -> GradedDocument:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Grade this document's relevance to the question.

Question: {question}

Document:
{document[:1000]}

Grades:
- CORRECT: Directly helps answer the question
- INCORRECT: Not relevant to the question
- AMBIGUOUS: Partially relevant or tangential

Respond with JSON: {{"grade": "...", "confidence": 0.0-1.0, "reason": "..."}}"""
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)
        return GradedDocument(
            content=document,
            grade=RelevanceGrade(data["grade"].lower()),
            confidence=data.get("confidence", 0.5)
        )

    def _determine_action(self, graded_docs: list[GradedDocument]) -> RetrievalAction:
        correct = sum(1 for d in graded_docs if d.grade == RelevanceGrade.CORRECT)
        incorrect = sum(1 for d in graded_docs if d.grade == RelevanceGrade.INCORRECT)
        ambiguous = sum(1 for d in graded_docs if d.grade == RelevanceGrade.AMBIGUOUS)

        total = len(graded_docs)
        if total == 0:
            return RetrievalAction.USE_WEB  # nothing retrieved; fall back to web

        if correct / total >= 0.6:
            return RetrievalAction.USE_RETRIEVED
        elif incorrect / total >= 0.8:
            return RetrievalAction.USE_WEB
        else:
            return RetrievalAction.COMBINE

    def _refine_ambiguous(
        self,
        question: str,
        ambiguous_docs: list[GradedDocument]
    ) -> str:
        if not ambiguous_docs:
            return ""

        docs_text = "
---
".join([d.content for d in ambiguous_docs])

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Extract only the parts of these documents that are relevant to the question.

Question: {question}

Documents:
{docs_text}

Return only the relevant excerpts, removing irrelevant content."""
            }]
        )
        return response.choices[0].message.content

    def _build_context_from_docs(self, docs: list[GradedDocument]) -> str:
        return "\n\n".join(d.content for d in docs)

    def _format_web_results(self, results: list[str]) -> str:
        return "\n\n".join(str(r) for r in results)

    def _generate(self, question: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"Context:\n{context}"},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content

CRAG Decision Flow
Retrieved Documents
        │
        ▼
┌───────────────────┐
│   GRADE EACH      │
│   DOCUMENT        │
│                   │
│  Correct?         │
│  Incorrect?       │
│  Ambiguous?       │
└─────────┬─────────┘
          │
          ├────────────────┬────────────────┐
          ▼                ▼                ▼
     All Correct      All Incorrect      Mixed
          │                │                │
          ▼                ▼                ▼
      ┌───────┐        ┌───────┐     ┌───────────┐
      │ USE   │        │ WEB   │     │ COMBINE   │
      │ DOCS  │        │SEARCH │     │ Correct + │
      └───────┘        └───────┘     │ Refined + │
                                     │ Web       │
                                     └───────────┘

Key Innovation

CRAG's three-way grading (Correct/Incorrect/Ambiguous) enables nuanced handling: keeping good docs, discarding bad ones, and refining ambiguous ones.

5. Graph RAG

Graph RAG combines vector search with knowledge graph traversal for structured + semantic retrieval:

Graph RAG Implementation
# Graph RAG: Combine knowledge graphs with vector retrieval

class GraphRAG:
    vectorStore: VectorDB      # For semantic search
    knowledgeGraph: Neo4j      # For structured relationships

    function answer(question):
        # Step 1: Extract entities from question
        entities = extractEntities(question)

        # Step 2: Retrieve from both sources
        # Vector retrieval for semantic similarity
        vectorResults = vectorStore.search(question, topK = 5)

        # Graph traversal for related entities
        graphResults = []
        for entity in entities:
            # Find entity in graph
            node = knowledgeGraph.findNode(entity)
            if node:
                # Get related nodes (neighbors, paths)
                related = knowledgeGraph.traverse(
                    startNode: node,
                    maxDepth: 2,
                    relationTypes: ["related_to", "part_of", "caused_by"]
                )
                graphResults.append(related)

        # Step 3: Combine and deduplicate
        combinedContext = merge(vectorResults, graphResults)

        # Step 4: Build structured context
        context = formatContext(
            semanticDocs: vectorResults,
            entityRelations: graphResults,
            entities: entities
        )

        # Step 5: Generate with structured knowledge
        return llm.generate(question, context)

    function extractEntities(question):
        # Use NER or LLM to extract entities
        return llm.generate(
            prompt: "Extract named entities (people, places, concepts) from: " + question
        )

    function formatContext(semanticDocs, entityRelations, entities):
        context = "## Relevant Documents\n"
        for doc in semanticDocs:
            context += "- " + doc.summary + "\n"

        context += "\n## Entity Relationships\n"
        for entity in entities:
            relations = entityRelations.get(entity, [])
            context += f"### {entity}\n"
            for rel in relations:
                context += f"- {rel.type}: {rel.target}\n"

        return context
import json
from dataclasses import dataclass

from neo4j import GraphDatabase

@dataclass
class EntityRelation:
    source: str
    relation: str
    target: str
    properties: dict

class GraphRAG:
    def __init__(
        self,
        client,
        vector_store,
        neo4j_uri: str,
        neo4j_auth: tuple,
        model: str = "gpt-4"
    ):
        self.client = client
        self.vector_store = vector_store
        self.graph = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
        self.model = model

    def query(self, question: str) -> str:
        # Step 1: Extract entities
        entities = self._extract_entities(question)

        # Step 2: Vector retrieval
        vector_results = self.vector_store.similarity_search(question, k=5)

        # Step 3: Graph retrieval
        graph_results = self._graph_retrieval(entities)

        # Step 4: Build combined context
        context = self._build_context(
            question, entities, vector_results, graph_results
        )

        # Step 5: Generate answer
        return self._generate(question, context)

    def _extract_entities(self, question: str) -> list[str]:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Extract named entities from this question.
Include: people, organizations, products, concepts, locations.

Question: {question}

Return as JSON: {{"entities": ["entity1", "entity2", ...]}}"""
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)
        return data.get("entities", [])

    def _graph_retrieval(self, entities: list[str]) -> dict[str, list[EntityRelation]]:
        results = {}

        with self.graph.session() as session:
            for entity in entities:
                # Find entity and its relationships
                query = """
                MATCH (e)-[r]-(related)
                WHERE e.name =~ $pattern OR e.label =~ $pattern
                RETURN e.name as source,
                       type(r) as relation,
                       related.name as target,
                       properties(r) as props
                LIMIT 20
                """
                pattern = f"(?i).*{entity}.*"

                records = session.run(query, pattern=pattern)

                relations = [
                    EntityRelation(
                        source=record["source"],
                        relation=record["relation"],
                        target=record["target"],
                        properties=record["props"] or {}
                    )
                    for record in records
                ]

                if relations:
                    results[entity] = relations

        return results

    def _build_context(
        self,
        question: str,
        entities: list[str],
        vector_results: list,
        graph_results: dict[str, list[EntityRelation]]
    ) -> str:
        parts = []

        # Semantic documents
        if vector_results:
            parts.append("## Relevant Documents")
            for i, doc in enumerate(vector_results, 1):
                parts.append(f"{i}. {doc.page_content[:300]}...")

        # Entity relationships from graph
        if graph_results:
            parts.append("\n## Entity Knowledge Graph")
            for entity, relations in graph_results.items():
                parts.append(f"\n### {entity}")
                for rel in relations[:10]:  # Limit relations per entity
                    parts.append(f"- {rel.relation} -> {rel.target}")
                    if rel.properties:
                        props = ", ".join(f"{k}={v}" for k, v in rel.properties.items())
                        parts.append(f"  ({props})")

        # Extracted entities for reference
        parts.append(f"\n## Detected Entities: {', '.join(entities)}")

        return "\n".join(parts)

    def _generate(self, question: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"""You have access to both document search results and a knowledge graph.
Use both sources to provide a comprehensive answer.

{context}"""},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content

    def close(self):
        self.graph.close()

When to Use Graph RAG

Use Case                              Why Graph RAG Helps
Multi-hop questions                   Graph traversal connects related entities
Entity-rich domains                   Structured relationships improve precision
Reasoning about relationships         "Who reports to whom?" needs graph structure
Combining structured + unstructured   Graph for facts, vectors for context

Graph Construction

You can build knowledge graphs from documents using LLM-based entity extraction, or use existing structured data (databases, ontologies).
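
A minimal sketch of the LLM-based route, reusing the OpenAI-style client and neo4j driver from the Graph RAG example above; the prompt, the Entity label, and the RELATES edge type are illustrative choices, not a fixed schema:

import json

def extract_triples(client, text: str, model: str = "gpt-4") -> list[dict]:
    """Ask the LLM to extract (source, relation, target) triples from text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"""Extract factual relationships from this text as triples.

Text: {text}

Return JSON: {{"triples": [{{"source": "...", "relation": "...", "target": "..."}}]}}"""
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content).get("triples", [])

def upsert_triples(driver, triples: list[dict]) -> None:
    """Merge triples into Neo4j as Entity nodes joined by RELATES edges.

    Cypher cannot parameterize relationship types, so the relation name
    is stored as a property on a generic RELATES edge.
    """
    with driver.session() as session:
        for t in triples:
            session.run(
                """
                MERGE (a:Entity {name: $source})
                MERGE (b:Entity {name: $target})
                MERGE (a)-[r:RELATES {type: $relation}]->(b)
                """,
                source=t["source"], relation=t["relation"], target=t["target"]
            )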

Evaluation Metrics

Metric                 What It Measures                            How to Calculate
Answer Accuracy        Is the answer correct?                      Human eval or exact match
Faithfulness           Is the answer grounded in retrieved docs?   NLI or LLM-as-judge
Relevance              Are retrieved docs relevant?                Precision@K, NDCG
Retrieval Efficiency   How often is retrieval needed?              % of queries requiring retrieval
Latency                Time to answer                              Wall-clock time
Hallucination Rate     Unsupported claims in the answer            Manual or NLI checking

Key metrics for RAG evaluation
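
Two of these are easy to sketch. Precision@K needs only labeled relevance judgments, and faithfulness can be approximated with an LLM-as-judge, assuming the OpenAI-style client used throughout:

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def judge_faithfulness(client, answer: str, context: str, model: str = "gpt-4") -> bool:
    """LLM-as-judge: is every claim in the answer supported by the context?"""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"""Context:
{context}

Answer:
{answer}

Is every factual claim in the answer supported by the context?
Reply with just YES or NO."""
        }]
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")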

Choosing an Approach

Use Basic RAG when:

  • All queries need document lookup
  • Simple, single-turn Q&A
  • Latency is critical

Use Agentic RAG when:

  • Queries vary (some need retrieval, some don't)
  • Multiple retrieval sources available
  • Complex multi-step reasoning needed

Use Self-RAG when:

  • Accuracy is paramount
  • You need to minimize hallucinations
  • Quality > latency

Use Corrective RAG when:

  • Retrieval quality varies
  • Mixed document quality in corpus
  • Web fallback is acceptable

Use Graph RAG when:

  • Data has clear entity relationships
  • Multi-hop reasoning required
  • You have or can build a knowledge graph
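
If you operate several pipelines side by side, these rules can be encoded as a simple router. A sketch only; the criteria flags and their priority order are assumptions to tune for your system:

from enum import Enum

class RagStrategy(Enum):
    BASIC = "basic"
    AGENTIC = "agentic"
    SELF = "self_rag"
    CORRECTIVE = "corrective"
    GRAPH = "graph"

def choose_strategy(
    query_mix_varies: bool,
    accuracy_critical: bool,
    retrieval_noisy: bool,
    has_knowledge_graph: bool,
    latency_critical: bool,
) -> RagStrategy:
    """Map the decision criteria above onto a pipeline choice."""
    if has_knowledge_graph:
        return RagStrategy.GRAPH
    if accuracy_critical and not latency_critical:
        return RagStrategy.SELF
    if retrieval_noisy:
        return RagStrategy.CORRECTIVE
    if query_mix_varies:
        return RagStrategy.AGENTIC
    return RagStrategy.BASIC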

Common Pitfalls

Over-retrieval

Retrieving for every query wastes tokens and can confuse the model with irrelevant context.

Chunk Size Mismatch

Too small: loses context. Too large: dilutes relevance. Tune chunk size to your use case.
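
A minimal fixed-size chunker with overlap to experiment with; the default sizes are starting points, not recommendations:

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    Overlap preserves context across chunk boundaries; tune both
    parameters against your retrieval metrics rather than guessing.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks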

Ignoring Retrieval Quality

Just because you retrieved 5 documents doesn't mean they're all useful. Grade and filter.

Single-Pass Retrieval

Complex questions often need multiple retrieval rounds with refined queries.
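
A minimal sketch of such a loop, assuming the OpenAI-style client from earlier examples and a retriever with a search(query, k) method returning text snippets:

def iterative_retrieve(client, retriever, question: str, max_rounds: int = 3) -> list[str]:
    """Retrieve in rounds, letting the model refine the query each time."""
    collected: list[str] = []
    query = question
    for _ in range(max_rounds):
        collected.extend(retriever.search(query, k=5))
        docs_text = "\n".join(collected)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"""Question: {question}

Retrieved so far:
{docs_text}

If this is enough to answer the question, reply DONE.
Otherwise reply with ONE follow-up search query for what is missing."""
            }]
        )
        followup = response.choices[0].message.content.strip()
        if followup.upper().startswith("DONE"):
            break
        query = followup  # refined query for the next round
    return collected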

Related Topics