RAG for Agents


What is RAG?

Why RAG Matters for Agents

The Problem: LLMs have a knowledge cutoff and cannot access private data. They hallucinate when asked about information outside their training data.

The Solution: Retrieval-Augmented Generation (RAG) lets agents search a knowledge base, retrieve relevant documents, and use them as context for generating accurate, grounded answers.

Real Impact: RAG reduces hallucination by up to 70% and enables agents to work with proprietary, real-time, and domain-specific information.

Real-World Analogy

Think of RAG like a librarian helping with research:

  • Query = Your research question
  • Embedding = Understanding the meaning of your question
  • Search = Librarian finding relevant books and sections
  • Context = The relevant passages placed on your desk
  • Generate = Writing your answer using those passages

RAG Components

Document Chunking

Split documents into meaningful chunks -- by paragraph, semantic boundaries, or fixed token counts.
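
A minimal sketch of the fixed-token-count approach with overlap, assuming tiktoken for tokenization (the function name and defaults are illustrative):

import tiktoken

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    # Split text into fixed-size token windows; overlap keeps boundary context
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks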

Embedding

Convert text chunks into vector representations that capture semantic meaning for similarity search.
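
A small sketch using the OpenAI embeddings endpoint; text-embedding-3-small is one reasonable choice (it returns 1536-dimensional vectors), and the helper name is illustrative:

from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Returns one 1536-dimensional vector per input text
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]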

Vector Search

Find the most relevant chunks by comparing the query embedding against the stored document embeddings.
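
Under the hood this is usually cosine similarity (or dot product) over normalized vectors. A toy sketch with NumPy, though in practice a vector database handles this at scale:

import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    # Cosine similarity = dot product of L2-normalized vectors
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in idx]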

Context Injection

Insert retrieved chunks into the LLM prompt as context, enabling grounded and accurate generation.

Embedding & Indexing

RAG Pipeline
Query (user question) → Embed (vectorize) → Search (find similar) → Context (top chunks) → Generate (LLM answer)
rag_pipeline.py
from openai import OpenAI
import chromadb

client = OpenAI()
# Chroma embeds documents and queries with its built-in default embedding
# model unless an embedding_function is supplied explicitly
db = chromadb.Client()
collection = db.get_or_create_collection("docs")

# Index documents
def index_documents(documents):
    for i, doc in enumerate(documents):
        collection.add(
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}],
            ids=[f"doc_{i}"]
        )

# Query with RAG
def rag_query(question, n_results=3):
    # Retrieve relevant docs
    results = collection.query(
        query_texts=[question], n_results=n_results
    )
    context = "\n".join(results["documents"][0])

    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
Output
Embedding dimensions: 1536
Chunks indexed: 42
Query: "How does photosynthesis work?"
Top 3 results (cosine similarity):
  0.92 - "Photosynthesis converts CO2 and water into glucose..."
  0.87 - "The light-dependent reactions occur in the thylakoid..."
  0.83 - "Chlorophyll absorbs light energy primarily in..."
Key Takeaway: RAG transforms documents into vector embeddings, stores them in a vector database, and retrieves the most semantically similar chunks at query time. Chunk size (typically 256-512 tokens) and overlap (10-20%) significantly impact retrieval quality.

Retrieval Strategies

Strategy | How It Works | Best For
Dense Retrieval | Semantic search with embeddings | Meaning-based queries
Sparse Retrieval | Keyword matching (BM25) | Exact term matching
Hybrid Search | Combine dense + sparse scores | Best of both approaches
Reranking | Score top results with cross-encoder | Precision-critical applications
Multi-query | Generate multiple query variants | Ambiguous user questions
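
As one example from the table, reranking rescores the top retrieved chunks with a cross-encoder that reads the query and chunk together. A minimal sketch, assuming the sentence-transformers library and a public MS MARCO cross-encoder checkpoint:

from sentence_transformers import CrossEncoder

# Cross-encoder scores each (query, chunk) pair jointly, which is slower
# than vector search but more precise
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

A common setup retrieves a larger candidate set cheaply (say 20 chunks) and reranks it down to the handful that go into the prompt.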

RAG as an Agent Tool

RAG as a Tool

  • Search Tool: Agent decides when to search the knowledge base (see the sketch after this list)
  • Multi-source: Agent can search different collections for different topics
  • Iterative Retrieval: Agent can refine its search based on initial results
  • Verification: Agent can cross-reference retrieved info with other tools
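
A rough sketch of wiring the rag_pipeline.py retrieval from earlier into the OpenAI function-calling interface, so the model decides when to search. The search_docs name and tool schema are illustrative choices, not a fixed convention:

import json

def search_docs(query):
    # Reuses the collection from rag_pipeline.py above
    results = collection.query(query_texts=[query], n_results=3)
    return "\n".join(results["documents"][0])

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal knowledge base for relevant passages",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's our refund policy for enterprise customers?"}],
    tools=tools
)

message = response.choices[0].message
if message.tool_calls:
    # The model chose to search: run the tool and feed the observation back
    args = json.loads(message.tool_calls[0].function.arguments)
    observation = search_docs(args["query"])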

Common Mistake

Wrong: Using RAG with no chunk overlap: text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)

Why it fails: Important context that spans chunk boundaries gets split across chunks. The retriever may find one half but miss the other, giving the LLM incomplete information.

Instead: chunk_overlap=50 (10-20% of chunk size) ensures boundary context is preserved in adjacent chunks.

Deep Dive: Hybrid Search for Better Retrieval

Pure vector similarity search can miss keyword-exact matches (e.g., product IDs, error codes). Hybrid search combines dense vector retrieval with sparse keyword search (BM25). The results are merged using Reciprocal Rank Fusion (RRF): score = sum(1 / (k + rank_i)) across both retrieval methods, where rank_i is the document's rank in each result list and k is a smoothing constant (k = 60 is a common default). Tools like Pinecone, Weaviate, and Qdrant support hybrid search natively.
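
A compact sketch of RRF over two ranked lists of document IDs (the example IDs are made up; k = 60 matches the common default mentioned above):

def reciprocal_rank_fusion(ranked_lists, k=60):
    # score(doc) = sum over lists of 1 / (k + rank), ranks starting at 1
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_3", "doc_7", "doc_1"]   # from vector search
sparse_results = ["doc_7", "doc_2", "doc_3"]  # from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])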

Advanced RAG Patterns

Pattern | Description | Improvement
Parent-Child | Index small chunks, retrieve parent docs | Better context coherence
HyDE | Generate hypothetical doc, then search | Better for abstract queries
Contextual Compression | Extract only relevant parts of chunks | Reduce noise in context
Agentic RAG | Agent controls retrieval strategy | Adaptive to query complexity
Graph RAG | Use knowledge graphs + vectors | Better for relational queries
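
As a sketch of one pattern from the table, HyDE (Hypothetical Document Embeddings) asks the LLM to draft a hypothetical answer and searches with that draft instead of the raw question. It reuses client and collection from rag_pipeline.py, and the prompt wording here is an assumption:

def hyde_query(question, n_results=3):
    # Draft a hypothetical answer; its embedding tends to sit closer to
    # real answer passages than the question's embedding does
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}]
    ).choices[0].message.content

    # Search with the draft instead of the original question
    return collection.query(query_texts=[draft], n_results=n_results)
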
Key Takeaway: RAG as an agent tool gives the LLM access to private, up-to-date knowledge. The agent decides when to search, what query to use, and how to synthesize results -- making it more flexible than static RAG pipelines.
Output (Agent with RAG Tool)
User: "What's our refund policy for enterprise customers?"
Thought: I need to check the company documentation.
Action: search_docs("enterprise refund policy")
Observation: "Enterprise customers are eligible for full refunds within 30 days..."
Final Answer: Enterprise customers can receive full refunds within 30 days of purchase.

Quick Reference

Component | Options | Recommendation
Embedding Model | OpenAI, Cohere, BGE, E5 | text-embedding-3-small for cost
Vector DB | Pinecone, Chroma, Weaviate, Qdrant | Chroma for local dev
Chunk Size | 256-2048 tokens | 512 tokens as starting point
Overlap | 0-25% of chunk size | 10-20% to maintain context
Top-K | 1-20 results | 3-5 for most use cases