What is RAG?
Why RAG Matters for Agents
The Problem: LLMs have a knowledge cutoff and cannot access private data. They hallucinate when asked about information outside their training data.
The Solution: Retrieval-Augmented Generation (RAG) lets agents search a knowledge base, retrieve relevant documents, and use them as context for generating accurate, grounded answers.
Real Impact: RAG can substantially reduce hallucination by grounding answers in retrieved sources, and it enables agents to work with proprietary, real-time, and domain-specific information.
Real-World Analogy
Think of RAG like a librarian helping with research:
- Query = Your research question
- Embedding = Understanding the meaning of your question
- Search = Librarian finding relevant books and sections
- Context = The relevant passages placed on your desk
- Generate = Writing your answer using those passages
RAG Components
Document Chunking
Split documents into meaningful chunks -- by paragraph, semantic boundaries, or fixed token counts.
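As a minimal sketch of paragraph-based chunking (the helper name and character budget are illustrative, not from any particular library), paragraphs can be packed greedily into chunks up to a size limit:

```python
def chunk_by_paragraph(text, max_chars=500):
    """Split text on blank lines, packing paragraphs into chunks up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph about RAG.\n\nSecond paragraph about embeddings.\n\nThird paragraph about search."
print(chunk_by_paragraph(doc, max_chars=60))
```

Semantic-boundary and token-count splitters follow the same pattern, just with a different boundary test.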
Embedding
Convert text chunks into vector representations that capture semantic meaning for similarity search.
Vector Search
Find the most relevant chunks by comparing query embedding against stored document embeddings.
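Under the hood, vector search is a nearest-neighbor ranking by similarity. A toy sketch with cosine similarity over plain Python lists (real systems use optimized ANN indexes, but the ranking logic is the same):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=3):
    """Rank stored document vectors by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))
```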
Context Injection
Insert retrieved chunks into the LLM prompt as context, enabling grounded and accurate generation.
Embedding & Indexing
```python
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("docs")

# Index documents
def index_documents(documents):
    for i, doc in enumerate(documents):
        collection.add(
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}],
            ids=[f"doc_{i}"],
        )

# Query with RAG
def rag_query(question, n_results=3):
    # Retrieve relevant docs
    results = collection.query(
        query_texts=[question], n_results=n_results
    )
    context = "\n".join(results["documents"][0])
    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
Embedding dimensions: 1536
Chunks indexed: 42
Query: "How does photosynthesis work?"
Top 3 results (cosine similarity):
  0.92 - "Photosynthesis converts CO2 and water into glucose..."
  0.87 - "The light-dependent reactions occur in the thylakoid..."
  0.83 - "Chlorophyll absorbs light energy primarily in..."
Retrieval Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Dense Retrieval | Semantic search with embeddings | Meaning-based queries |
| Sparse Retrieval | Keyword matching (BM25) | Exact term matching |
| Hybrid Search | Combine dense + sparse scores | Best of both approaches |
| Reranking | Score top results with cross-encoder | Precision-critical applications |
| Multi-query | Generate multiple query variants | Ambiguous user questions |
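To make sparse retrieval concrete, here is a minimal from-scratch BM25 scorer (a sketch of the standard formula; production systems use a search engine or a library such as rank-bm25 rather than hand-rolled code):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score tokenized docs against query terms with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "error code E404 page not found".split(),
    "semantic search with embeddings".split(),
]
print(bm25_scores(["E404"], docs))
```

Note how the exact token "E404" scores only the document that contains it, which is exactly the case where pure embedding search can stumble.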
RAG as an Agent Tool
- Search Tool: Agent decides when to search the knowledge base
- Multi-source: Agent can search different collections for different topics
- Iterative Retrieval: Agent can refine its search based on initial results
- Verification: Agent can cross-reference retrieved info with other tools
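The iterative-retrieval idea above can be sketched as a small loop. Both `search_docs` and the query-refinement step here are hypothetical placeholders (a real agent would call a vector store and use the LLM to rewrite the query):

```python
def search_docs(query, kb):
    """Hypothetical stand-in for a vector-store search tool."""
    return [text for text in kb if any(w in text.lower() for w in query.lower().split())]

def agentic_retrieve(question, kb, max_rounds=2):
    """Sketch of iterative retrieval: broaden the query if nothing is found."""
    query = question
    for _ in range(max_rounds):
        hits = search_docs(query, kb)
        if hits:
            return hits
        # No results: drop the last word and retry (toy stand-in for query rewriting)
        query = " ".join(query.split()[:-1]) or question
    return []

kb = ["Enterprise customers get full refunds within 30 days."]
print(agentic_retrieve("refunds policy", kb))
```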
Common Mistake
Wrong: Using RAG with no chunk overlap: `text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)`
Why it fails: Important context that spans chunk boundaries gets split across chunks. The retriever may find one half but miss the other, giving the LLM incomplete information.
Instead: `chunk_overlap=50` (10-20% of chunk size) ensures boundary context is preserved in adjacent chunks.
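A minimal sliding-window splitter (illustrative, not the LangChain implementation) shows why overlap preserves boundary context: adjacent chunks repeat their boundary words, so a sentence crossing the cut survives intact in at least one chunk:

```python
def sliding_chunks(words, chunk_size=8, overlap=2):
    """Fixed-size word chunks; each chunk repeats the last `overlap` words of the previous one."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step) if words[i:i + chunk_size]]

words = [f"w{i}" for i in range(20)]
chunks = sliding_chunks(words, chunk_size=8, overlap=2)
# The last two words of chunk 0 reappear as the first two words of chunk 1
print(chunks[0][-2:], chunks[1][:2])
```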
Deep Dive: Hybrid Search for Better Retrieval
Pure vector similarity search can miss keyword-exact matches (e.g., product IDs, error codes). Hybrid search combines dense vector retrieval with sparse keyword search (BM25). The results are merged using Reciprocal Rank Fusion (RRF): score(d) = sum over retrieval methods of 1 / (k + rank_i(d)), where rank_i(d) is the document's rank in method i. The constant k = 60, from the original RRF paper, is a widely used default. Tools like Pinecone, Weaviate, and Qdrant support hybrid search natively.
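The RRF merge described above fits in a few lines (a self-contained sketch; real hybrid search gets the two rankings from a vector index and a BM25 index):

```python
def rrf_merge(dense_ranking, sparse_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]   # ranked by vector similarity
sparse = ["d1", "d4", "d2"]  # ranked by BM25
print(rrf_merge(dense, sparse))
```

A document ranked well by both methods (here "d1") outscores one that tops only a single list, which is the whole point of the fusion.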
Advanced RAG Patterns
| Pattern | Description | Improvement |
|---|---|---|
| Parent-Child | Index small chunks, retrieve parent docs | Better context coherence |
| HyDE | Generate hypothetical doc, then search | Better for abstract queries |
| Contextual Compression | Extract only relevant parts of chunks | Reduce noise in context |
| Agentic RAG | Agent controls retrieval strategy | Adaptive to query complexity |
| Graph RAG | Use knowledge graphs + vectors | Better for relational queries |
Example: an agent using RAG as a tool
User: "What's our refund policy for enterprise customers?"
Thought: I need to check the company documentation.
Action: search_docs("enterprise refund policy")
Observation: "Enterprise customers are eligible for full refunds within 30 days..."
Final Answer: Enterprise customers can receive full refunds within 30 days of purchase.
Quick Reference
| Component | Options | Recommendation |
|---|---|---|
| Embedding Model | OpenAI, Cohere, BGE, E5 | text-embedding-3-small for cost |
| Vector DB | Pinecone, Chroma, Weaviate, Qdrant | Chroma for local dev |
| Chunk Size | 256-2048 tokens | 512 tokens as starting point |
| Overlap | 0-25% of chunk size | 10-20% to maintain context |
| Top-K | 1-20 results | 3-5 for most use cases |