RAG Patterns

Part of Module 3: AI Applications

Retrieval-Augmented Generation (RAG) is a pattern that combines large language models with external knowledge retrieval. By grounding generation in retrieved documents, RAG applications can draw on up-to-date, domain-specific information beyond the model's training data, making their answers more accurate, reliable, and contextually aware.

Understanding RAG Architecture

RAG systems augment language models by retrieving relevant information from external sources during generation, combining the reasoning capabilities of LLMs with the precision of information retrieval.

RAG Pipeline Flow

User Query → Embedding → Vector Search → Context Retrieval → LLM Generation → Response

Core Components

  1. Document Store: Repository of source documents (PDFs, websites, databases)
  2. Chunking Strategy: Breaking documents into retrievable segments
  3. Embedding Model: Converting text to vector representations
  4. Vector Database: Storing and searching embeddings efficiently
  5. Retrieval System: Finding relevant chunks for queries
  6. Context Assembly: Combining retrieved information
  7. LLM Integration: Generating responses with retrieved context
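
To make the division of labor concrete, here is a minimal sketch of components 3 through 7 behind plain Python interfaces. The class and method names (RAGPipeline, embed, search, generate) are illustrative stand-ins, not the API of any particular library.

# Illustrative RAG pipeline skeleton; all interfaces are hypothetical
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str

class RAGPipeline:
    def __init__(self, embedder, vector_store, llm):
        self.embedder = embedder          # text -> vector
        self.vector_store = vector_store  # vector -> nearest chunks
        self.llm = llm                    # prompt -> completion

    def answer(self, query: str, k: int = 5) -> str:
        # Retrieval: embed the query and fetch the k nearest chunks
        query_vector = self.embedder.embed(query)
        chunks = self.vector_store.search(query_vector, k=k)
        # Context assembly: concatenate the retrieved chunk texts
        context = "\n\n".join(c.text for c in chunks)
        # Generation: condition the model on the retrieved context
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.llm.generate(prompt)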

RAG Implementation Patterns

1. Naive RAG

The simplest implementation with direct retrieval and generation.

Python Implementation
# Basic RAG pipeline using the classic LangChain API
# (import paths differ across LangChain versions)
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

def naive_rag(query, documents):
    # 1. Embed the documents and build an in-memory vector index
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_texts(documents, embeddings)

    # 2. Retrieve the most relevant chunks
    relevant_docs = vector_store.similarity_search(query, k=5)

    # 3. Assemble context and generate a response
    context = "\n".join(doc.page_content for doc in relevant_docs)
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    llm = OpenAI()
    return llm(prompt)

Use Cases: Simple Q&A systems, basic document search, prototype development

2. Advanced RAG

Enhanced with pre-retrieval and post-retrieval optimizations.

Pre-Retrieval Optimizations:

  • Query Expansion: Reformulating queries for better retrieval
  • Query Routing: Directing queries to specific data sources
  • Hypothetical Document Embeddings (HyDE): Generating a hypothetical answer and retrieving with its embedding (see the sketch after this list)
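
As a concrete example of the pre-retrieval side, here is a minimal HyDE sketch: the model first drafts a hypothetical answer, and the draft, rather than the raw query, is used for similarity search. The llm and vector_store objects are assumed to expose the same generic interfaces as the pipeline skeleton above.

# HyDE: retrieve with a hypothetical answer instead of the raw query
def hyde_retrieve(query, llm, vector_store, k=5):
    # 1. Draft a plausible (possibly imperfect) answer to the question
    hypothetical = llm.generate(
        f"Write a short passage that answers the question: {query}"
    )
    # 2. Search with the draft; it usually lies closer to real answer
    #    passages in embedding space than the terse query does
    return vector_store.search(hypothetical, k=k)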

Post-Retrieval Optimizations:

  • Reranking: Scoring and reordering retrieved documents
  • Compression: Removing irrelevant information from chunks
  • Fusion: Combining results from multiple retrieval methods

Advanced RAG Example
# Advanced RAG with query expansion and cross-encoder reranking
from sentence_transformers import CrossEncoder

class AdvancedRAG:
    def __init__(self, vector_store, query_expander, llm):
        # A cross-encoder scores (query, document) pairs jointly:
        # slower than bi-encoder similarity, but more accurate
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        self.vector_store = vector_store
        self.query_expander = query_expander  # e.g. an LLM prompted to paraphrase
        self.llm = llm

    def retrieve_and_generate(self, query):
        # Pre-retrieval: expand the query into several reformulations
        expanded_queries = self.query_expander.expand(query)

        # Multi-query retrieval: gather candidates for each reformulation
        all_docs = []
        for q in expanded_queries:
            all_docs.extend(self.vector_store.search(q, k=10))

        # Post-retrieval: rerank all candidates against the original query
        scores = self.reranker.predict([(query, doc.text) for doc in all_docs])
        ranked = sorted(zip(scores, all_docs), key=lambda pair: pair[0], reverse=True)
        top_docs = [doc for _, doc in ranked[:5]]

        # Generate with the reranked context
        context = "\n".join(doc.text for doc in top_docs)
        return self.llm.generate(f"Context: {context}\n\nQuestion: {query}\nAnswer:")

3. Modular RAG

Flexible architecture with interchangeable components and routing.

  • Module Types: Search, Memory, Routing, Prediction, Task Adaptors
  • Orchestration: Dynamic pipeline construction based on query type (a routing sketch follows this list)
  • Feedback Loops: Iterative refinement of retrieval and generation
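
A minimal sketch of that orchestration idea, with hypothetical module and function names: a classifier maps each query to one of several interchangeable retrieval modules, with a fallback for unrecognized query types.

# Hypothetical modular router: choose a retrieval module per query type
class ModularRAG:
    def __init__(self, modules, classify_query):
        self.modules = modules                # e.g. {"default": ..., "code": ...}
        self.classify_query = classify_query  # query -> module name

    def retrieve(self, query, k=5):
        route = self.classify_query(query)
        # Unknown query types fall back to the default module
        module = self.modules.get(route, self.modules["default"])
        return module.search(query, k=k)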

4. Graph RAG

Leveraging knowledge graphs for structured information retrieval.

  • Entity Extraction: Identifying entities and relationships
  • Graph Construction: Building knowledge graphs from documents
  • Graph Traversal: Multi-hop reasoning across relationships (sketched after this list)
  • Hybrid Retrieval: Combining vector and graph search
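
A small sketch of the traversal step using the open-source networkx library. Entity extraction and graph construction are assumed to have already produced the graph; the query entities are passed in.

# Multi-hop neighborhood expansion over a knowledge graph
import networkx as nx

def graph_context(graph: nx.Graph, entities, hops=2):
    # Collect every node within `hops` edges of a query entity; these
    # neighbors provide the multi-hop context handed to the LLM
    related = set()
    for entity in entities:
        if entity in graph:
            reachable = nx.single_source_shortest_path_length(
                graph, entity, cutoff=hops
            )
            related.update(reachable)
    return related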

5. Agentic RAG

RAG systems with autonomous decision-making capabilities.

  • Self-Reflection: Evaluating retrieval quality (see the loop sketched after this list)
  • Adaptive Retrieval: Dynamically adjusting retrieval strategies
  • Multi-Step Reasoning: Breaking complex queries into sub-tasks
  • Tool Integration: Calling external APIs and functions
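
A sketch of the self-reflection loop under the same generic interfaces as above: the model grades its own retrieval and, when the grade is poor, rewrites the query and retries.

# Self-reflective retrieval: grade results, rewrite the query, retry
def reflective_retrieve(query, llm, vector_store, max_rounds=3, k=5):
    current = query
    for _ in range(max_rounds):
        docs = vector_store.search(current, k=k)
        context = "\n".join(d.text for d in docs)
        # Self-reflection: ask the model to judge its own retrieval
        verdict = llm.generate(
            f"Question: {query}\nRetrieved context: {context}\n"
            "Does this context answer the question? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            return docs
        # Adaptive retrieval: ask the model for a better search query
        current = llm.generate(
            f"Rewrite this search query to find better documents: {current}"
        )
    return docs  # best effort after max_rounds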

Chunking Strategies

  • Fixed Size: split by character or token count. Pros: simple, predictable. Cons: may break context mid-thought.
  • Sentence-Based: split at sentence boundaries. Pros: preserves meaning. Cons: variable chunk sizes.
  • Semantic: split where meaning shifts, measured by embedding similarity. Pros: coherent chunks. Cons: computationally expensive.
  • Document Structure: split on headings and paragraphs. Pros: preserves hierarchy. Cons: requires structured documents.
  • Sliding Window: overlapping fixed-size chunks (sketched below). Pros: better context coverage. Cons: storage overhead.
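
As a worked example, here is a sliding-window chunker with configurable overlap, approximating tokens with whitespace-separated words:

# Sliding-window chunking: fixed window, configurable overlap
def sliding_window_chunks(text, window=200, overlap=50):
    words = text.split()
    step = window - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + window])
        if chunk:
            chunks.append(chunk)
        if start + window >= len(words):
            break  # the final window already covers the tail
    return chunks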

Retrieval Methods

1. Dense Retrieval

Using neural embeddings for semantic similarity search.

  • Models: BERT, Sentence-BERT, OpenAI Embeddings
  • Advantages: Semantic understanding, cross-lingual capability
  • Challenges: Computational cost, domain adaptation
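
A minimal dense-retrieval example using the open-source sentence-transformers library (the model name is one common default):

# Dense retrieval: embed corpus and query, rank by cosine similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "RAG combines retrieval with generation.",
    "BM25 is a sparse ranking function.",
    "Vector databases store embeddings.",
]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(
    "How does retrieval-augmented generation work?", convert_to_tensor=True
)

# Cosine similarity between the query and every corpus passage
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))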

2. Sparse Retrieval

Traditional keyword-based search methods.

  • Methods: BM25, TF-IDF (as implemented in engines like Elasticsearch or Lucene)
  • Advantages: Fast, interpretable, exact matching
  • Challenges: No semantic understanding, vocabulary mismatch
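
The equivalent sparse lookup with BM25, here via the rank_bm25 package, which works on pre-tokenized text:

# Sparse retrieval with BM25: exact keyword matching, no embeddings
from rank_bm25 import BM25Okapi

corpus = [
    "RAG combines retrieval with generation.",
    "BM25 is a sparse ranking function.",
    "Vector databases store embeddings.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query_tokens = "sparse ranking bm25".split()

# Higher score = better keyword match; no semantic generalization
scores = bm25.get_scores(query_tokens)
print(bm25.get_top_n(query_tokens, corpus, n=1))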

3. Hybrid Retrieval

Combining dense and sparse methods for optimal results.

Hybrid Retrieval Implementation
def hybrid_search(query, vector_store, bm25_index, alpha=0.5):
    # alpha weights dense vs. sparse scores; both score ranges are
    # assumed normalized to [0, 1] so the linear blend is meaningful
    dense_results = vector_store.similarity_search(query, k=20)
    sparse_results = bm25_index.search(query, k=20)

    # Weighted score fusion, keyed by document id
    combined_scores = {}
    for doc in dense_results:
        combined_scores[doc.id] = alpha * doc.score
    for doc in sparse_results:
        combined_scores[doc.id] = combined_scores.get(doc.id, 0.0) + (1 - alpha) * doc.score

    # Return the ten highest-scoring document ids
    return sorted(combined_scores.items(), key=lambda item: item[1], reverse=True)[:10]

Evaluation Metrics

  • Faithfulness: the answer is grounded in the retrieved context
  • Relevance: the retrieved documents actually match the query
  • Correctness: the answer is factually accurate
  • Coverage: the answer uses the available information completely

Evaluation Frameworks

  • RAGAS: Retrieval Augmented Generation Assessment (usage sketched after this list)
  • TruLens: LLM app evaluation and tracking
  • LangSmith: End-to-end RAG testing and monitoring
  • Phoenix: ML observability for RAG pipelines
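
As an illustration of how these frameworks are driven, here is a hedged sketch of a RAGAS run. The RAGAS API has changed across releases, so treat the imports and column names as approximate rather than definitive.

# Sketch of a RAGAS evaluation (API details vary by version)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is RAG?"],
    "answer": ["RAG augments an LLM with retrieved context."],
    "contexts": [["RAG retrieves documents and feeds them to an LLM."]],
})

# Each metric scores the (question, contexts, answer) triple
results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(results)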

Common RAG Challenges & Solutions

1. Lost in the Middle Problem

LLMs attend most reliably to information at the beginning and end of the context window and often miss details buried in the middle.

Solutions:
  • Reorder chunks so the most relevant sit at the start and end of the prompt (sketched below)
  • Retrieve fewer, higher-quality chunks instead of padding the context
  • Rerank and filter candidates before assembling the context
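
A minimal reordering sketch, similar in spirit to LangChain's LongContextReorder: the highest-ranked chunks are placed at the edges of the context, where models attend most reliably, and the weakest chunks end up in the middle.

# Reorder relevance-ranked docs so the best sit at the prompt's edges
def edge_reorder(docs_by_relevance):
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: rank 1 -> front, rank 2 -> back, rank 3 -> front, ...
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so relevance rises again toward the end
    return front + back[::-1]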

2. Context Window Limitations

Limited token capacity for retrieved documents.

Solutions:
  • Implement context compression (budget-aware assembly sketched below)
  • Use hierarchical retrieval
  • Apply extractive summarization
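
A minimal sketch of budget-aware context assembly; count_tokens is a stand-in for a real tokenizer such as tiktoken's encoder:

# Greedy context assembly under a fixed token budget
def assemble_context(ranked_chunks, count_tokens, budget=3000):
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed ordered most relevant first
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # skip chunks that would overflow the window
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)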

3. Hallucination in RAG

Model generates information not present in retrieved context.

Solutions:
  • Implement citation mechanisms
  • Use constrained generation
  • Add verification layers

4. Retrieval Quality Issues

Irrelevant or incomplete document retrieval.

Solutions:
  • Fine-tune embedding models
  • Implement query understanding
  • Use feedback loops for improvement

Production RAG Best Practices

✅ Best Practices

  • Version control your embeddings
  • Implement incremental indexing
  • Monitor retrieval quality metrics
  • Use caching for frequent queries (see the sketch after this list)
  • Implement fallback strategies
  • Maintain a regular reindexing schedule
  • A/B test retrieval strategies
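
For the caching item, a small sketch using functools.lru_cache; embed_fn is a stand-in for your embedding client:

# Cache embeddings so repeated queries skip the embedding call
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=1024):
    @lru_cache(maxsize=maxsize)
    def cached(query: str):
        # Tuples are hashable, so results can live in the cache
        return tuple(embed_fn(query))
    return cached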

⚠️ Common Pitfalls

  • Ignoring document updates
  • No evaluation metrics
  • Over-relying on single retrieval method
  • Inadequate error handling
  • Not considering latency requirements
  • Insufficient chunk overlap
  • Missing metadata filtering

RAG Stack Technologies

Vector Databases

  • Pinecone: Managed vector database with filtering
  • Weaviate: Open-source with hybrid search
  • Qdrant: High-performance with payload filtering
  • ChromaDB: Lightweight, developer-friendly
  • Milvus: Scalable, production-ready

Embedding Models

  • OpenAI text-embedding-ada-002: General-purpose, high quality
  • Cohere Embed: Multilingual support
  • Sentence Transformers: Open-source, customizable
  • Instructor: Task-specific embeddings

Orchestration Frameworks

  • LangChain: Comprehensive RAG toolkit
  • LlamaIndex: Data framework for LLMs
  • Haystack: End-to-end NLP framework
  • DSPy: Declarative language model programming

Industry-Specific RAG Applications

  • Healthcare: clinical decision support. Key requirements: HIPAA compliance, medical accuracy.
  • Legal: contract analysis and case research. Key requirements: citation tracking, precedent linking.
  • Finance: risk assessment and compliance. Key requirements: real-time data, regulatory updates.
  • Education: personalized tutoring. Key requirements: curriculum alignment, progress tracking.
  • Customer Support: knowledge base Q&A. Key requirements: multi-channel delivery, response accuracy.

Future of RAG

The evolution of RAG systems continues with emerging trends:

  • Multi-Modal RAG: Retrieving and processing images, videos, and audio
  • Long-Context Models: Reducing dependency on retrieval with larger context windows
  • Active Learning RAG: Systems that improve through user feedback
  • Federated RAG: Distributed retrieval across private data sources
  • Neural Databases: Learned indices replacing traditional search
  • RAG-as-a-Service: Managed platforms for enterprise RAG deployment