Design a RAG System

Hard · 30 min read

Problem Statement & Requirements

Why RAG Systems Matter

Large Language Models hallucinate. RAG (Retrieval-Augmented Generation) grounds LLM responses in real, verifiable data, substantially reducing hallucinations. Most enterprise LLM deployments (Microsoft Copilot, Notion AI, customer support bots) use some form of RAG.

Think of RAG like an open-book exam. Instead of relying solely on memorized knowledge (the LLM's training data), the system first looks up relevant information from a knowledge base, then formulates an answer using both its general knowledge and the retrieved context.

Functional Requirements

  • Answer natural-language questions grounded in a private document corpus, with citations to the source chunks
  • Ingest and index documents from multiple sources (PDF, HTML) with configurable chunking
  • Support metadata filtering, multi-turn conversations, and streaming responses
  • Collect user feedback on answer quality

Non-Functional Requirements

  • Low latency: retrieval within ~200 ms, full answers within a few seconds
  • Scale to ~1M documents (~50M chunks) at 100-1,000 queries per second
  • High faithfulness: answers must be supported by retrieved context, with a graceful "I don't know" fallback
  • Keep the index fresh as documents are added or updated

Back-of-Envelope Estimation

Parameter | Estimate
Document corpus | 1M documents (~50M chunks)
Avg chunk size | 500 tokens (~2 KB text)
Embedding dimension | 1536 (OpenAI) or 768 (open-source)
Vector storage | 50M × 1536 × 4B = ~300 GB
Query QPS | 100-1,000
Retrieval latency budget | <200ms
LLM generation budget | <2.5 seconds
Embedding throughput | ~1,000 chunks/sec (batched)
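
The vector storage row is easy to sanity-check: 50M chunks, each a 1536-dimensional float32 vector (4 bytes per component).

```python
# Back-of-envelope check for the vector storage estimate (float32 embeddings).
chunks = 50_000_000
dims = 1536
bytes_per_float = 4  # float32

total_gb = chunks * dims * bytes_per_float / 1e9
print(f"{total_gb:.0f} GB")  # prints "307 GB", matching the ~300 GB estimate
```

Note this excludes index overhead (IVF lists, HNSW graphs) and metadata, which typically add 20-50% on top.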

System API Design

RAG APIs
# Query the RAG system
POST /api/v1/query
{
  "question": "What is our refund policy for enterprise customers?",
  "conversation_id": "conv_123",
  "filters": { "source": "policy_docs" },
  "top_k": 5,
  "stream": true
}
# Response
{
  "answer": "Enterprise customers can request a full refund...",
  "citations": [
    { "chunk_id": "c_456", "doc": "refund-policy-v3.pdf", "page": 4, "score": 0.92 }
  ],
  "confidence": 0.88
}

# Ingest documents
POST /api/v1/documents
{
  "source": "s3://docs/policy/",
  "collection": "policy_docs",
  "chunking": { "strategy": "semantic", "max_tokens": 512 }
}

# Submit feedback
POST /api/v1/feedback
{ "query_id": "q_789", "rating": "helpful", "comment": "Accurate and well-cited" }

Data Model

Core Schema
CREATE TABLE documents (
    doc_id        VARCHAR PRIMARY KEY,
    source_url    TEXT,
    title         TEXT,
    collection    VARCHAR,
    ingested_at   TIMESTAMP,
    metadata      JSONB
);
CREATE TABLE chunks (
    chunk_id      VARCHAR PRIMARY KEY,
    doc_id        VARCHAR REFERENCES documents,
    content       TEXT,
    embedding     vector(1536),  -- pgvector type
    chunk_index   INT,
    token_count   INT,
    metadata      JSONB
);
CREATE INDEX ON chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
CREATE TABLE conversations (
    conversation_id VARCHAR PRIMARY KEY,
    user_id         VARCHAR,
    messages        JSONB[],
    created_at      TIMESTAMP
);

High-Level Architecture

RAG has two pipelines: ingestion (offline) and query (online).

Ingestion Pipeline

Document → Parser (extract text from PDF/HTML) → Chunker (split into passages) → Embedder (vectorize each chunk) → Vector DB (index for search). Runs asynchronously when new documents arrive.
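
A minimal sketch of that pipeline, with `parse`, `chunk`, and `embed` as injected stand-ins for real parser, chunker, and embedding components (these names are assumptions for illustration, not a specific library's API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    index: int
    content: str
    embedding: list  # vector written to the vector DB

def ingest(doc_id, raw_text, parse, chunk, embed, index):
    """Parser -> chunker -> embedder -> vector index, as described above."""
    text = parse(raw_text)       # e.g. PDF/HTML text extraction
    pieces = chunk(text)         # e.g. semantic or fixed-size chunking
    vectors = embed(pieces)      # one batched embedding call per document
    for i, (piece, vec) in enumerate(zip(pieces, vectors)):
        index.append(Chunk(doc_id, i, piece, vec))
```

In production this runs as an async worker consuming an ingestion queue, so uploads never block on embedding throughput.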

Query Pipeline

User Question → Query Embedding → Vector Search (retrieve top-K chunks) → Re-Ranker (cross-encoder scoring) → Context Assembly → LLM Generation → Citation Extraction → Response.
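
The online path can be sketched the same way, with each stage injected as a callable (all stand-ins, not a specific framework's API):

```python
def answer(question, embed, search, rerank, generate, top_k=5, n_candidates=50):
    """One pass through the query pipeline above."""
    q_vec = embed(question)
    candidates = search(q_vec, n_candidates)      # vector DB top-K retrieval
    best = rerank(question, candidates)[:top_k]   # cross-encoder re-scoring
    context = "\n\n".join(c["content"] for c in best)
    return generate(question, context), best      # answer + chunks for citations
```

Returning the re-ranked chunks alongside the answer is what makes citation extraction possible in the response payload.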

Why Re-Ranking Matters

Bi-encoder retrieval (embedding similarity) is fast but approximate. A cross-encoder re-ranker jointly attends to the query and each candidate chunk, improving relevance by 10-20%. It runs on the top 20-50 candidates (not the full corpus).
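
The re-ranking step itself is just "score each (query, chunk) pair jointly, then sort". Here is a sketch where `score` stands in for a real cross-encoder model; the lexical-overlap scorer below is a toy placeholder, not a substitute for one:

```python
def rerank(query, candidates, score, keep=50):
    """Re-score only the top `keep` retrieved candidates with a
    cross-encoder-style score(query, passage) function."""
    scored = [(score(query, c["content"]), c) for c in candidates[:keep]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

def overlap(query, passage):
    """Toy stand-in scorer: shared-word count between query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))
```

The `keep` cap is what keeps this affordable: a cross-encoder forward pass per candidate is far too slow for the full corpus but cheap for 20-50 chunks.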

Deep Dive: Core Components

Chunking Strategies

Strategy | How It Works | Best For
Fixed-size | Split every N tokens with overlap | Simple, predictable token usage
Sentence-based | Split on sentence boundaries | Preserving complete thoughts
Semantic | Split when embedding similarity drops | Topic-coherent chunks
Recursive | Try headings → paragraphs → sentences | Structured documents
Document-aware | Use document structure (sections, pages) | PDFs, legal docs, manuals
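
The fixed-size strategy from the first row fits in a few lines; the window and overlap sizes below are illustrative:

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Windows of `size` tokens, each overlapping the previous by `overlap`."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```
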
Semantic Chunking
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text, model, threshold=0.75):
    """Split text where semantic similarity between adjacent sentences drops."""
    sentences = [s for s in text.split(". ") if s]  # naive sentence split
    if not sentences:
        return []
    # Normalized embeddings make the dot product equal cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i], embeddings[i - 1])
        if sim < threshold:  # topic shift: start a new chunk
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks

Vector Databases

Database | Type | Strengths
Pinecone | Managed SaaS | Easiest to operate, auto-scaling
Weaviate | Open-source | Hybrid search, multi-modal
Qdrant | Open-source | Filtering, payload storage
pgvector | PostgreSQL extension | Use existing Postgres, ACID
Milvus | Open-source | Billion-scale, GPU-accelerated

Retrieval Strategies

  • Dense retrieval: embedding similarity search, as in the query pipeline above
  • Sparse retrieval: keyword matching (e.g. BM25), strong on exact terms and rare entities
  • Hybrid retrieval: combine dense and sparse rankings (e.g. reciprocal rank fusion) for the best recall
  • Metadata filtering: restrict search to a collection or source before scoring, as the query API's filters field allows

Guardrails & Hallucination Mitigation

Critical: Preventing Hallucinations

Even with RAG, LLMs can hallucinate. Mitigate with:

  • Constrain the prompt: "Answer ONLY based on the provided context. If the context doesn't contain the answer, say so."
  • Citation verification: check that every claim maps to a retrieved chunk.
  • Confidence scoring: if retrieval scores are low, return "I don't have enough information" instead of guessing.
  • Fact-checking chain: use a second LLM call to verify the answer against the retrieved context.
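
Mitigations (1) and (3) combine naturally into a single gate before generation. A minimal sketch, where the 0.5 threshold is an illustrative value to tune against your retriever's score distribution:

```python
PROMPT = (
    "Answer ONLY based on the provided context. "
    "If the context doesn't contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def build_prompt(question, chunks, min_score=0.5):
    """Return a constrained prompt, or None when retrieval confidence is
    too low (caller should answer 'I don't have enough information')."""
    if not chunks or max(c["score"] for c in chunks) < min_score:
        return None
    context = "\n\n".join(c["content"] for c in chunks)
    return PROMPT.format(context=context, question=question)
```
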

Scaling & Optimization

Embedding Caching

Cache query embeddings (many queries are repeated or similar). Use a two-level cache: L1 in-memory (last 10K queries) and L2 in Redis (last 1M queries). Cache hit rates of 30-50% are common.
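
A sketch of the two-level lookup, with the L2 store injected as any dict-like object (a Redis client wrapper in production; a plain dict stands in here):

```python
from collections import OrderedDict

class TwoLevelCache:
    """L1: in-process LRU of recent query embeddings. L2: shared store."""

    def __init__(self, l2, l1_size=10_000):
        self.l1 = OrderedDict()
        self.l1_size = l1_size
        self.l2 = l2

    def get(self, query):
        if query in self.l1:
            self.l1.move_to_end(query)      # refresh LRU position
            return self.l1[query]
        vec = self.l2.get(query)
        if vec is not None:
            self._put_l1(query, vec)        # promote L2 hit into L1
        return vec

    def put(self, query, vec):
        self._put_l1(query, vec)
        self.l2[query] = vec

    def _put_l1(self, query, vec):
        self.l1[query] = vec
        self.l1.move_to_end(query)
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)     # evict least-recently-used
```

Keying on the raw query string catches exact repeats; catching near-duplicates ("similar" queries) needs normalization or a semantic cache on top.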

Hierarchical Retrieval

For large corpora, use two-stage retrieval: (1) Retrieve relevant documents first using document-level summaries, then (2) Retrieve specific chunks within those documents. Reduces search space 10-100x.
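
The two stages can be sketched with an injected similarity function (`sim`, e.g. cosine); the document-summary and chunk structures below are assumptions for illustration:

```python
def hierarchical_search(q_vec, doc_summaries, chunks_by_doc, sim,
                        n_docs=10, k=5):
    """Stage 1: rank documents by summary-embedding similarity.
    Stage 2: rank chunks only within the winning documents."""
    top_docs = sorted(doc_summaries,
                      key=lambda d: sim(q_vec, d["embedding"]),
                      reverse=True)[:n_docs]
    candidates = [c for d in top_docs for c in chunks_by_doc[d["doc_id"]]]
    return sorted(candidates,
                  key=lambda c: sim(q_vec, c["embedding"]),
                  reverse=True)[:k]
```

With 1M documents and ~50 chunks each, scoring ~1M summaries plus ~500 chunks beats scoring 50M chunks directly.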

Evaluation Framework

Metric | Measures | Target
Faithfulness | Is the answer supported by retrieved context? | >0.90
Answer Relevancy | Does the answer address the question? | >0.85
Context Precision | Are retrieved chunks relevant? | >0.80
Context Recall | Are all needed chunks retrieved? | >0.75
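
The two retrieval metrics have simple set-based forms when relevance labels exist (frameworks like RAGAS approximate the labels with LLM judges):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that made it into the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)
```

Tracking both matters: raising top_k tends to lift recall while dragging precision down, and the targets above bound that trade-off.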

Practice Problems

Practice 1: Multi-Language RAG

Your knowledge base contains documents in English, Spanish, and Japanese. A user asks a question in French. Design a retrieval strategy that works across languages.

Practice 2: Table & Image Retrieval

Your documents contain tables and charts that are critical for answering questions. Standard text chunking loses this information. Design an ingestion pipeline that handles multimodal content.

Practice 3: Real-Time Knowledge Updates

Your company wiki is updated 500 times per day. Design an ingestion pipeline that keeps the RAG index fresh within 5 minutes of any edit, without re-embedding the entire corpus.

Quick Reference

Component | Technology | Purpose
Vector Database | Pinecone / pgvector / Weaviate | Store and search embeddings
Embedding Model | OpenAI / Cohere / E5 | Convert text to vectors
Re-Ranker | Cohere Rerank / ColBERT | Improve retrieval precision
LLM | GPT-4 / Claude / Llama | Answer generation
Orchestration | LangChain / LlamaIndex | Pipeline orchestration
Evaluation | RAGAS / DeepEval | Measure RAG quality

Key Takeaways

  • Chunking quality is the #1 factor in RAG performance — invest heavily here
  • Use hybrid retrieval (dense + sparse) for best recall
  • Always add a re-ranking step — cheap and highly effective
  • Build guardrails from day one: constrained prompts, citation verification, confidence thresholds
  • Measure with RAGAS metrics: faithfulness, relevancy, precision, recall