Problem Statement & Requirements
Why RAG Systems Matter
Large Language Models hallucinate. RAG (Retrieval-Augmented Generation) grounds LLM responses in real, verifiable data, which substantially reduces hallucinations. Most enterprise LLM deployments (Microsoft Copilot, Notion AI, customer support bots) use some form of RAG.
Think of RAG like an open-book exam. Instead of relying solely on memorized knowledge (the LLM's training data), the system first looks up relevant information from a knowledge base, then formulates an answer using both its general knowledge and the retrieved context.
Functional Requirements
- Document ingestion — Ingest PDFs, HTML, Markdown, databases, and APIs
- Chunking & embedding — Split documents into retrievable chunks with vector representations
- Semantic search — Find relevant chunks for a user query
- LLM generation — Generate answers grounded in retrieved context
- Citation & provenance — Link every claim to its source document
- Feedback loop — Users can rate answers; improve retrieval over time
Non-Functional Requirements
- End-to-end latency — <3 seconds (retrieval + generation)
- Relevance — Top-5 retrieved chunks contain the answer 90%+ of the time
- Freshness — New documents searchable within 15 minutes of ingestion
- Scale — 10M+ document chunks, 1000+ concurrent queries
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Document corpus | 1M documents (~50M chunks) |
| Avg chunk size | 500 tokens (~2 KB text) |
| Embedding dimension | 1536 (OpenAI) or 768 (open-source) |
| Vector storage | 50M × 1536 × 4B = ~300 GB |
| Query QPS | 100-1,000 |
| Retrieval latency budget | <200ms |
| LLM generation budget | <2.5 seconds |
| Embedding throughput | ~1,000 chunks/sec (batched) |
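The vector-storage row above is easy to sanity-check: 50M chunks, each stored as a 1536-dimensional float32 vector.

```python
# Back-of-envelope check for the vector storage estimate above.
chunks = 50_000_000          # ~50M chunks across the corpus
dim = 1536                   # OpenAI embedding dimension
bytes_per_float = 4          # float32

raw_bytes = chunks * dim * bytes_per_float
print(f"{raw_bytes / 1e9:.0f} GB")  # ~307 GB raw vectors
```

Note this is the raw vector payload only; the ANN index (IVF lists, HNSW graph links) and chunk metadata add overhead on top.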
System API Design
# Query the RAG system
POST /api/v1/query
{
  "question": "What is our refund policy for enterprise customers?",
  "conversation_id": "conv_123",
  "filters": { "source": "policy_docs" },
  "top_k": 5,
  "stream": true
}

# Response
{
  "answer": "Enterprise customers can request a full refund...",
  "citations": [
    { "chunk_id": "c_456", "doc": "refund-policy-v3.pdf", "page": 4, "score": 0.92 }
  ],
  "confidence": 0.88
}

# Ingest documents
POST /api/v1/documents
{
  "source": "s3://docs/policy/",
  "collection": "policy_docs",
  "chunking": { "strategy": "semantic", "max_tokens": 512 }
}

# Submit feedback
POST /api/v1/feedback
{ "query_id": "q_789", "rating": "helpful", "comment": "Accurate and well-cited" }
Data Model
CREATE TABLE documents (
    doc_id       VARCHAR PRIMARY KEY,
    source_url   TEXT,
    title        TEXT,
    collection   VARCHAR,
    ingested_at  TIMESTAMP,
    metadata     JSONB
);

CREATE TABLE chunks (
    chunk_id     VARCHAR PRIMARY KEY,
    doc_id       VARCHAR REFERENCES documents(doc_id),
    content      TEXT,
    embedding    vector(1536), -- pgvector type
    chunk_index  INT,
    token_count  INT,
    metadata     JSONB
);

CREATE INDEX ON chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);

CREATE TABLE conversations (
    conversation_id  VARCHAR PRIMARY KEY,
    user_id          VARCHAR,
    messages         JSONB[],
    created_at       TIMESTAMP
);
High-Level Architecture
RAG has two pipelines: ingestion (offline) and query (online).
Ingestion Pipeline
Document → Parser (extract text from PDF/HTML) → Chunker (split into passages) → Embedder (vectorize each chunk) → Vector DB (index for search). Runs asynchronously when new documents arrive.
Query Pipeline
User Question → Query Embedding → Vector Search (retrieve top-K chunks) → Re-Ranker (cross-encoder scoring) → Context Assembly → LLM Generation → Citation Extraction → Response.
Why Re-Ranking Matters
Bi-encoder retrieval (embedding similarity) is fast but approximate. A cross-encoder re-ranker jointly attends to the query and each candidate chunk, which typically yields a meaningful relevance gain (improvements of 10-20% are commonly reported on retrieval benchmarks). It runs on the top 20-50 candidates (not the full corpus).
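The two-stage flow can be sketched as below. The `score` function is a stand-in for a real cross-encoder (e.g. sentence-transformers' `CrossEncoder`); the toy word-overlap scorer is purely illustrative.

```python
def rerank(query, candidates, score, keep=5):
    """Re-score top retrieval candidates with a (query, chunk) scorer.

    `score(query, chunk) -> float` stands in for a cross-encoder that
    jointly attends to both texts; `candidates` are the top 20-50 chunks
    returned by the bi-encoder retrieval stage.
    """
    scored = [(score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]

# Toy scorer: fraction of query words present in the chunk. A real
# system would call a cross-encoder model here instead.
def overlap_score(query, chunk):
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)

top = rerank("refund policy",
             ["shipping times", "our refund policy states..."],
             overlap_score, keep=1)
```

Because the re-ranker only sees a few dozen candidates, even a relatively slow cross-encoder fits inside the 200 ms retrieval budget.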
Deep Dive: Core Components
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, predictable token usage |
| Sentence-based | Split on sentence boundaries | Preserving complete thoughts |
| Semantic | Split when embedding similarity drops | Topic-coherent chunks |
| Recursive | Try headings → paragraphs → sentences | Structured documents |
| Document-aware | Use document structure (sections, pages) | PDFs, legal docs, manuals |
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text, model, threshold=0.75):
    """Split text where semantic similarity between adjacent sentences drops."""
    sentences = [s for s in text.split(". ") if s]
    if not sentences:
        return []
    # Normalize so the dot product below is cosine similarity in [-1, 1].
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i], embeddings[i - 1])
        if sim < threshold:  # topic shift: close the current chunk
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks
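For comparison, the fixed-size strategy from the table is only a few lines. This sketch approximates tokens with whitespace-separated words; a real pipeline would count with the embedding model's tokenizer.

```python
def fixed_size_chunks(text, max_tokens=512, overlap=64):
    """Split text into windows of `max_tokens` words, with `overlap`
    words shared between consecutive chunks so a sentence cut at a
    boundary still appears whole in one of the two neighbors."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

The overlap wastes some storage and tokens, which is the price of the strategy's simplicity and predictable chunk sizes.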
Vector Databases
| Database | Type | Strengths |
|---|---|---|
| Pinecone | Managed SaaS | Easiest to operate, auto-scaling |
| Weaviate | Open-source | Hybrid search, multi-modal |
| Qdrant | Open-source | Filtering, payload storage |
| pgvector | PostgreSQL extension | Use existing Postgres, ACID |
| Milvus | Open-source | Billion-scale, GPU-accelerated |
Retrieval Strategies
- Dense retrieval: Embed query, find nearest vectors (semantic match)
- Sparse retrieval (BM25): Keyword-based (exact term match)
- Hybrid: Combine dense + sparse with reciprocal rank fusion (RRF). Best of both worlds — catches semantic similarity AND exact keywords
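Reciprocal rank fusion is simple to implement: each ranking contributes 1/(k + rank) per chunk, and the constant k (commonly 60) damps the dominance of top ranks. A sketch, assuming each input is a ranked list of chunk IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings (e.g. dense and BM25) into one.

    Each list contributes 1 / (k + rank) for every chunk it contains;
    chunks ranking well in multiple lists float to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # semantic nearest neighbors
sparse = ["c1", "c9", "c3"]  # BM25 keyword matches
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of calibrating cosine similarities against BM25 scores.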
Guardrails & Hallucination Mitigation
Critical: Preventing Hallucinations
Even with RAG, LLMs can hallucinate. Mitigate with:
- Constrain the prompt — "Answer ONLY based on the provided context. If the context doesn't contain the answer, say so."
- Citation verification — Check that every claim maps to a retrieved chunk.
- Confidence scoring — If retrieval scores are low, return "I don't have enough information" instead of guessing.
- Fact-checking chain — Use a second LLM call to verify the answer against the retrieved context.
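Constrained prompting and confidence scoring combine naturally in the answer path. A minimal sketch; the 0.5 cutoff is an assumed value to be tuned against your own retrieval score distribution.

```python
REFUSAL = "I don't have enough information to answer that."

def build_prompt(question, chunks, min_score=0.5):
    """Gate generation on retrieval confidence, then constrain the prompt.

    `chunks` is a list of (score, text) pairs from retrieval. If the best
    score is below `min_score`, refuse rather than let the LLM guess.
    """
    if not chunks or max(score for score, _ in chunks) < min_score:
        return None  # caller returns REFUSAL without an LLM call
    context = "\n\n".join(text for _, text in chunks)
    return (
        "Answer ONLY based on the provided context. If the context "
        "doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("What is the refund window?", [(0.31, "...")])
```

Refusing before generation also saves an LLM call, which helps both the latency budget and cost.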
Scaling & Optimization
Embedding Caching
Cache query embeddings (many queries are repeated or similar). Use a two-level cache: L1 in-memory (last 10K queries) and L2 in Redis (last 1M queries). Cache hit rates of 30-50% are common.
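A sketch of the two-level lookup. The L2 store here is anything with `get`/`set` (a dict-backed stub below; a real deployment would substitute a Redis client, which is an assumption of this sketch).

```python
from collections import OrderedDict

class TwoLevelEmbeddingCache:
    """L1: in-process LRU (fast, per replica).
    L2: shared store such as Redis (any object with .get/.set)."""

    def __init__(self, l2, l1_size=10_000):
        self.l1 = OrderedDict()
        self.l1_size = l1_size
        self.l2 = l2

    def get_or_compute(self, query, embed):
        if query in self.l1:                 # L1 hit
            self.l1.move_to_end(query)       # mark as recently used
            return self.l1[query]
        vec = self.l2.get(query)             # L2 hit, shared across replicas
        if vec is None:
            vec = embed(query)               # miss: call the embedding model
            self.l2.set(query, vec)
        self.l1[query] = vec
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)      # evict least-recently used
        return vec

class DictStore:  # stand-in for a Redis client in this sketch
    def __init__(self): self.data = {}
    def get(self, key): return self.data.get(key)
    def set(self, key, value): self.data[key] = value
```

With 30-50% hit rates, the cache removes a whole embedding round-trip from a large share of queries.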
Hierarchical Retrieval
For large corpora, use two-stage retrieval: (1) Retrieve relevant documents first using document-level summaries, then (2) Retrieve specific chunks within those documents. Reduces search space 10-100x.
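The two stages can be sketched with plain dot products over unit-normalized vectors. Here `doc_vecs` holds one summary embedding per document and `chunk_vecs` maps each document to its chunk embeddings; all values are toy data.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_retrieve(query_vec, doc_vecs, chunk_vecs,
                          top_docs=2, top_chunks=3):
    """Stage 1: rank documents by their summary embedding.
    Stage 2: search chunks only within the winning documents."""
    docs = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]),
                  reverse=True)
    candidates = [
        (dot(query_vec, vec), doc_id, i)
        for doc_id in docs[:top_docs]
        for i, vec in enumerate(chunk_vecs[doc_id])
    ]
    candidates.sort(reverse=True)
    return [(doc_id, i) for _, doc_id, i in candidates[:top_chunks]]

query = [1.0, 0.0]
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.8, 0.2]}
chunk_vecs = {"d1": [[1.0, 0.0], [0.0, 1.0]],
              "d2": [[0.9, 0.1]],
              "d3": [[0.5, 0.5]]}
results = hierarchical_retrieve(query, doc_vecs, chunk_vecs)
```

Only documents d1 and d3 survive stage 1, so d2's chunks are never scored; at corpus scale that pruning is where the 10-100x reduction comes from.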
Evaluation Framework
| Metric | Measures | Target |
|---|---|---|
| Faithfulness | Is the answer supported by retrieved context? | >0.90 |
| Answer Relevancy | Does the answer address the question? | >0.85 |
| Context Precision | Are retrieved chunks relevant? | >0.80 |
| Context Recall | Are all needed chunks retrieved? | >0.75 |
Practice Problems
Practice 1: Multi-Language RAG
Your knowledge base contains documents in English, Spanish, and Japanese. A user asks a question in French. Design a retrieval strategy that works across languages.
Practice 2: Table & Image Retrieval
Your documents contain tables and charts that are critical for answering questions. Standard text chunking loses this information. Design an ingestion pipeline that handles multimodal content.
Practice 3: Real-Time Knowledge Updates
Your company wiki is updated 500 times per day. Design an ingestion pipeline that keeps the RAG index fresh within 5 minutes of any edit, without re-embedding the entire corpus.
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Vector Database | Pinecone / pgvector / Weaviate | Store and search embeddings |
| Embedding Model | OpenAI / Cohere / E5 | Convert text to vectors |
| Re-Ranker | Cohere Rerank / ColBERT | Improve retrieval precision |
| LLM | GPT-4 / Claude / Llama | Answer generation |
| Orchestration | LangChain / LlamaIndex | Pipeline orchestration |
| Evaluation | RAGAS / DeepEval | Measure RAG quality |
Key Takeaways
- Chunking quality is the #1 factor in RAG performance — invest heavily here
- Use hybrid retrieval (dense + sparse) for best recall
- Always add a re-ranking step — cheap and highly effective
- Build guardrails from day one: constrained prompts, citation verification, confidence thresholds
- Measure with RAGAS metrics: faithfulness, relevancy, precision, recall