Design a RAG System

Hard · 30 min read

Problem Statement & Requirements

Why RAG Systems Matter

Large Language Models hallucinate. RAG (Retrieval-Augmented Generation) grounds LLM responses in real, verifiable data, substantially reducing hallucinations. Most enterprise LLM deployments (Microsoft Copilot, Notion AI, customer support bots) use some form of RAG.

Think of RAG like an open-book exam. Instead of relying solely on memorized knowledge (the LLM's training data), the system first looks up relevant information from a knowledge base, then formulates an answer using both its general knowledge and the retrieved context.

Functional Requirements

  • Answer natural-language questions grounded in a private document corpus, with citations to the source chunks
  • Ingest and index documents from multiple sources (PDF, HTML) with configurable chunking
  • Support metadata filtering, multi-turn conversations, and streaming responses
  • Collect user feedback on answer quality

Non-Functional Requirements

  • Low latency: retrieval within ~200 ms, full answers within a few seconds
  • Scale to ~1M documents (~50M chunks) at 100-1,000 queries per second
  • High faithfulness: answers must be supported by retrieved context, with a graceful "I don't know" fallback
  • Keep the index fresh as documents are added or updated

Back-of-Envelope Estimation

Parameter | Estimate
Document corpus | 1M documents (~50M chunks)
Avg chunk size | 500 tokens (~2 KB text)
Embedding dimension | 1536 (OpenAI) or 768 (open-source)
Vector storage | 50M × 1536 × 4B = ~300 GB
Query QPS | 100-1,000
Retrieval latency budget | <200ms
LLM generation budget | <2.5 seconds
Embedding throughput | ~1,000 chunks/sec (batched)
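
The vector storage row is easy to sanity-check: 50M chunks, each a 1536-dimensional float32 vector (4 bytes per component).

```python
# Back-of-envelope check for the vector storage estimate (float32 embeddings).
chunks = 50_000_000
dims = 1536
bytes_per_float = 4  # float32

total_gb = chunks * dims * bytes_per_float / 1e9
print(f"{total_gb:.0f} GB")  # prints "307 GB", matching the ~300 GB estimate
```

Note this excludes index overhead (IVF lists, HNSW graphs) and metadata, which typically add 20-50% on top.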

System API Design

RAG APIs
# Query the RAG system
POST /api/v1/query
{
  "question": "What is our refund policy for enterprise customers?",
  "conversation_id": "conv_123",
  "filters": { "source": "policy_docs" },
  "top_k": 5,
  "stream": true
}
# Response
{
  "answer": "Enterprise customers can request a full refund...",
  "citations": [
    { "chunk_id": "c_456", "doc": "refund-policy-v3.pdf", "page": 4, "score": 0.92 }
  ],
  "confidence": 0.88
}

# Ingest documents
POST /api/v1/documents
{
  "source": "s3://docs/policy/",
  "collection": "policy_docs",
  "chunking": { "strategy": "semantic", "max_tokens": 512 }
}

# Submit feedback
POST /api/v1/feedback
{ "query_id": "q_789", "rating": "helpful", "comment": "Accurate and well-cited" }

Data Model

Core Schema
CREATE TABLE documents (
    doc_id        VARCHAR PRIMARY KEY,
    source_url    TEXT,
    title         TEXT,
    collection    VARCHAR,
    ingested_at   TIMESTAMP,
    metadata      JSONB
);
CREATE TABLE chunks (
    chunk_id      VARCHAR PRIMARY KEY,
    doc_id        VARCHAR REFERENCES documents,
    content       TEXT,
    embedding     vector(1536),  -- pgvector type
    chunk_index   INT,
    token_count   INT,
    metadata      JSONB
);
CREATE INDEX ON chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
CREATE TABLE conversations (
    conversation_id VARCHAR PRIMARY KEY,
    user_id         VARCHAR,
    messages        JSONB[],
    created_at      TIMESTAMP
);

High-Level Architecture

RAG has two pipelines: ingestion (offline) and query (online).

Ingestion Pipeline

Document → Parser (extract text from PDF/HTML) → Chunker (split into passages) → Embedder (vectorize each chunk) → Vector DB (index for search). Runs asynchronously when new documents arrive.
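
A minimal sketch of that pipeline, with `parse`, `chunk`, and `embed` as injected stand-ins for real parser, chunker, and embedding components (these names are assumptions for illustration, not a specific library's API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    index: int
    content: str
    embedding: list  # vector written to the vector DB

def ingest(doc_id, raw_text, parse, chunk, embed, index):
    """Parser -> chunker -> embedder -> vector index, as described above."""
    text = parse(raw_text)       # e.g. PDF/HTML text extraction
    pieces = chunk(text)         # e.g. semantic or fixed-size chunking
    vectors = embed(pieces)      # one batched embedding call per document
    for i, (piece, vec) in enumerate(zip(pieces, vectors)):
        index.append(Chunk(doc_id, i, piece, vec))
```

In production this runs as an async worker consuming an ingestion queue, so uploads never block on embedding throughput.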

Query Pipeline

User Question → Query Embedding → Vector Search (retrieve top-K chunks) → Re-Ranker (cross-encoder scoring) → Context Assembly → LLM Generation → Citation Extraction → Response.
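
The online path can be sketched the same way, with each stage injected as a callable (all stand-ins, not a specific framework's API):

```python
def answer(question, embed, search, rerank, generate, top_k=5, n_candidates=50):
    """One pass through the query pipeline above."""
    q_vec = embed(question)
    candidates = search(q_vec, n_candidates)      # vector DB top-K retrieval
    best = rerank(question, candidates)[:top_k]   # cross-encoder re-scoring
    context = "\n\n".join(c["content"] for c in best)
    return generate(question, context), best      # answer + chunks for citations
```

Returning the re-ranked chunks alongside the answer is what makes citation extraction possible in the response payload.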

Why Re-Ranking Matters

Bi-encoder retrieval (embedding similarity) is fast but approximate. A cross-encoder re-ranker jointly attends to the query and each candidate chunk, improving relevance by 10-20%. It runs on the top 20-50 candidates (not the full corpus).
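
The re-ranking step itself is just "score each (query, chunk) pair jointly, then sort". Here is a sketch where `score` stands in for a real cross-encoder model; the lexical-overlap scorer below is a toy placeholder, not a substitute for one:

```python
def rerank(query, candidates, score, keep=50):
    """Re-score only the top `keep` retrieved candidates with a
    cross-encoder-style score(query, passage) function."""
    scored = [(score(query, c["content"]), c) for c in candidates[:keep]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

def overlap(query, passage):
    """Toy stand-in scorer: shared-word count between query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))
```

The `keep` cap is what keeps this affordable: a cross-encoder forward pass per candidate is far too slow for the full corpus but cheap for 20-50 chunks.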

Deep Dive: Core Components

Chunking Strategies

Strategy | How It Works | Best For
Fixed-size | Split every N tokens with overlap | Simple, predictable token usage
Sentence-based | Split on sentence boundaries | Preserving complete thoughts
Semantic | Split when embedding similarity drops | Topic-coherent chunks
Recursive | Try headings → paragraphs → sentences | Structured documents
Document-aware | Use document structure (sections, pages) | PDFs, legal docs, manuals
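
The fixed-size strategy from the first row fits in a few lines; the window and overlap sizes below are illustrative:

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Windows of `size` tokens, each overlapping the previous by `overlap`."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```
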
Semantic Chunking
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text, model, threshold=0.75):
    """Split text where semantic similarity between adjacent sentences drops."""
    sentences = [s for s in text.split(". ") if s]  # naive sentence split
    if not sentences:
        return []
    # Normalized embeddings make the dot product equal cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i], embeddings[i - 1])
        if sim < threshold:  # topic shift: start a new chunk
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks

Vector Databases

Database | Type | Strengths
Pinecone | Managed SaaS | Easiest to operate, auto-scaling
Weaviate | Open-source | Hybrid search, multi-modal
Qdrant | Open-source | Filtering, payload storage
pgvector | PostgreSQL extension | Use existing Postgres, ACID
Milvus | Open-source | Billion-scale, GPU-accelerated

Retrieval Strategies

  • Dense retrieval: embedding similarity search, as in the query pipeline above
  • Sparse retrieval: keyword matching (e.g. BM25), strong on exact terms and rare entities
  • Hybrid retrieval: combine dense and sparse rankings (e.g. reciprocal rank fusion) for the best recall
  • Metadata filtering: restrict search to a collection or source before scoring, as the query API's filters field allows

Guardrails & Hallucination Mitigation

Critical: Preventing Hallucinations

Even with RAG, LLMs can hallucinate. Mitigate with:

  • Constrain the prompt: "Answer ONLY based on the provided context. If the context doesn't contain the answer, say so."
  • Citation verification: check that every claim maps to a retrieved chunk.
  • Confidence scoring: if retrieval scores are low, return "I don't have enough information" instead of guessing.
  • Fact-checking chain: use a second LLM call to verify the answer against the retrieved context.
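
Mitigations (1) and (3) combine naturally into a single gate before generation. A minimal sketch, where the 0.5 threshold is an illustrative value to tune against your retriever's score distribution:

```python
PROMPT = (
    "Answer ONLY based on the provided context. "
    "If the context doesn't contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def build_prompt(question, chunks, min_score=0.5):
    """Return a constrained prompt, or None when retrieval confidence is
    too low (caller should answer 'I don't have enough information')."""
    if not chunks or max(c["score"] for c in chunks) < min_score:
        return None
    context = "\n\n".join(c["content"] for c in chunks)
    return PROMPT.format(context=context, question=question)
```
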

Scaling & Optimization

Embedding Caching

Cache query embeddings (many queries are repeated or similar). Use a two-level cache: L1 in-memory (last 10K queries) and L2 in Redis (last 1M queries). Cache hit rates of 30-50% are common.
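
A sketch of the two-level lookup, with the L2 store injected as any dict-like object (a Redis client wrapper in production; a plain dict stands in here):

```python
from collections import OrderedDict

class TwoLevelCache:
    """L1: in-process LRU of recent query embeddings. L2: shared store."""

    def __init__(self, l2, l1_size=10_000):
        self.l1 = OrderedDict()
        self.l1_size = l1_size
        self.l2 = l2

    def get(self, query):
        if query in self.l1:
            self.l1.move_to_end(query)      # refresh LRU position
            return self.l1[query]
        vec = self.l2.get(query)
        if vec is not None:
            self._put_l1(query, vec)        # promote L2 hit into L1
        return vec

    def put(self, query, vec):
        self._put_l1(query, vec)
        self.l2[query] = vec

    def _put_l1(self, query, vec):
        self.l1[query] = vec
        self.l1.move_to_end(query)
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)     # evict least-recently-used
```

Keying on the raw query string catches exact repeats; catching near-duplicates ("similar" queries) needs normalization or a semantic cache on top.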

Hierarchical Retrieval

For large corpora, use two-stage retrieval: (1) Retrieve relevant documents first using document-level summaries, then (2) Retrieve specific chunks within those documents. Reduces search space 10-100x.
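
The two stages can be sketched with an injected similarity function (`sim`, e.g. cosine); the document-summary and chunk structures below are assumptions for illustration:

```python
def hierarchical_search(q_vec, doc_summaries, chunks_by_doc, sim,
                        n_docs=10, k=5):
    """Stage 1: rank documents by summary-embedding similarity.
    Stage 2: rank chunks only within the winning documents."""
    top_docs = sorted(doc_summaries,
                      key=lambda d: sim(q_vec, d["embedding"]),
                      reverse=True)[:n_docs]
    candidates = [c for d in top_docs for c in chunks_by_doc[d["doc_id"]]]
    return sorted(candidates,
                  key=lambda c: sim(q_vec, c["embedding"]),
                  reverse=True)[:k]
```

With 1M documents and ~50 chunks each, scoring ~1M summaries plus ~500 chunks beats scoring 50M chunks directly.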

Evaluation Framework

Metric | Measures | Target
Faithfulness | Is the answer supported by retrieved context? | >0.90
Answer Relevancy | Does the answer address the question? | >0.85
Context Precision | Are retrieved chunks relevant? | >0.80
Context Recall | Are all needed chunks retrieved? | >0.75
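
The two retrieval metrics have simple set-based forms when relevance labels exist (frameworks like RAGAS approximate the labels with LLM judges):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that made it into the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)
```

Tracking both matters: raising top_k tends to lift recall while dragging precision down, and the targets above bound that trade-off.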

Practice Problems

Practice 1: Multi-Language RAG

Your knowledge base contains documents in English, Spanish, and Japanese. A user asks a question in French. Design a retrieval strategy that works across languages.

Practice 2: Table & Image Retrieval

Your documents contain tables and charts that are critical for answering questions. Standard text chunking loses this information. Design an ingestion pipeline that handles multimodal content.

Practice 3: Real-Time Knowledge Updates

Your company wiki is updated 500 times per day. Design an ingestion pipeline that keeps the RAG index fresh within 5 minutes of any edit, without re-embedding the entire corpus.

Quick Reference

Component | Technology | Purpose
Vector Database | Pinecone / pgvector / Weaviate | Store and search embeddings
Embedding Model | OpenAI / Cohere / E5 | Convert text to vectors
Re-Ranker | Cohere Rerank / ColBERT | Improve retrieval precision
LLM | GPT-4 / Claude / Llama | Answer generation
Orchestration | LangChain / LlamaIndex | Pipeline orchestration
Evaluation | RAGAS / DeepEval | Measure RAG quality

Key Takeaways

  • Chunking quality is the #1 factor in RAG performance — invest heavily here
  • Use hybrid retrieval (dense + sparse) for best recall
  • Always add a re-ranking step — cheap and highly effective
  • Build guardrails from day one: constrained prompts, citation verification, confidence thresholds
  • Measure with RAGAS metrics: faithfulness, relevancy, precision, recall