Problem Statement & Requirements
Why Search Engines Matter
Google processes 8.5 billion queries per day. Search is the gateway to the internet — and increasingly to enterprise knowledge. Understanding search engine design teaches core distributed systems concepts: crawling, indexing, ranking, caching, and serving at massive scale.
Think of a search engine like a vast library with a brilliant librarian. The crawler is the librarian who reads every book and creates an index card for each. The index is the card catalog. The ranking algorithm is the librarian's judgment about which books best answer your question. And autocomplete is the librarian anticipating your question before you finish asking.
Functional Requirements
- Web crawling — Discover and download billions of web pages
- Indexing — Build searchable index from crawled content
- Query processing — Parse, expand, and execute search queries
- Ranking — Order results by relevance using signals like PageRank, BM25, user behavior
- Autocomplete — Suggest queries as the user types
- Spell correction — Handle typos and misspellings
Non-Functional Requirements
- Query latency — <200ms for web search, <50ms for autocomplete
- Freshness — Breaking news indexed within minutes, general content within hours
- Scale — Billions of documents, 100K+ QPS
- Relevance — Top-3 results contain the answer >80% of the time
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Pages indexed | 50 billion |
| Avg page size | 50 KB (text content extracted) |
| Total raw content | 50B × 50 KB = 2.5 PB |
| Inverted index size | ~500 TB (compressed) |
| Search QPS | 100,000 (peak: 300,000) |
| Crawl rate | ~1 billion pages/day |
| Autocomplete QPS | 500,000 (higher than search, since a request fires on each keystroke) |
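The totals above can be sanity-checked in a few lines (constants taken from the table; the 50-day figure is simply corpus size divided by crawl rate):

```python
# Back-of-envelope check for the estimation table.
PAGES = 50e9            # pages indexed
AVG_PAGE_BYTES = 50e3   # 50 KB of extracted text per page

raw_bytes = PAGES * AVG_PAGE_BYTES
print(f"Raw content: {raw_bytes / 1e15:.1f} PB")          # -> 2.5 PB

# At 1B pages/day, a full re-crawl of the corpus takes:
pages_per_day = 1e9
print(f"Full refresh: {PAGES / pages_per_day:.0f} days")  # -> 50 days
```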
System API Design
# Web search
GET /api/v1/search?q=distributed+consensus+algorithms
    &page=1&count=10&lang=en&safe=true

# Response
{
  "results": [
    {
      "url": "https://example.com/raft-consensus",
      "title": "Understanding Raft Consensus",
      "snippet": "Raft is a consensus algorithm designed...",
      "score": 0.87
    }
  ],
  "total_results": 1250000,
  "query_time_ms": 45,
  "spell_suggestion": null
}

# Autocomplete
GET /api/v1/suggest?q=distrib&limit=5
# ["distributed systems", "distributed consensus", ...]

# Index management (internal)
POST /api/v1/index/documents
{ "urls": ["https://example.com/page1"], "priority": "high" }
Data Model
-- Forward index (document store)
CREATE TABLE documents (
    doc_id      BIGINT PRIMARY KEY,
    url         TEXT,
    title       TEXT,
    content     TEXT,
    pagerank    FLOAT,
    crawled_at  TIMESTAMP,
    language    VARCHAR(5)
);
-- Inverted index (term -> document list)
-- Stored as: term -> [(doc_id, tf, positions), ...]
-- Compressed with variable-byte or PForDelta encoding
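As a concrete illustration of the variable-byte encoding mentioned above, here is a minimal sketch that compresses gap-encoded doc IDs: 7 data bits per byte, with the high bit marking the final byte of each number (conventions vary across implementations):

```python
def vbyte_encode(nums):
    """Variable-byte encode a list of non-negative ints."""
    out = bytearray()
    for n in nums:
        chunk = []
        while True:
            chunk.append(n & 0x7F)  # low 7 bits
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80           # high bit marks the last byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                # terminal byte: emit the number
            nums.append(n)
            n = 0
    return nums

# Posting lists store gaps between sorted doc IDs, so values stay small:
doc_ids = [5, 1000, 1005, 100000]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)
```

Gap encoding is what makes the scheme effective: dense posting lists produce tiny gaps that fit in a single byte each.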
-- URL frontier (crawl queue)
CREATE TABLE url_frontier (
    url_hash    BIGINT PRIMARY KEY,
    url         TEXT,
    priority    FLOAT,
    last_crawl  TIMESTAMP,
    next_crawl  TIMESTAMP,
    domain      VARCHAR
);
-- Query logs (for autocomplete and ranking improvement)
CREATE TABLE query_logs (
    query         TEXT,
    count         BIGINT,
    clicked_urls  TEXT[],
    timestamp     TIMESTAMP
);
High-Level Architecture
Web Crawler
Distributed crawlers fetch pages while respecting robots.txt and politeness policies (1-2 requests/sec per domain). The URL frontier prioritizes fresh, high-PageRank pages. DNS resolutions are cached to avoid a lookup per fetch. Duplicate and near-duplicate content is detected via SimHash.
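The per-domain politeness policy can be sketched as a simple rate limiter. This is a toy in-process version (`PolitenessLimiter` and its defaults are illustrative; a real crawler would also honor any crawl-delay directives a site publishes):

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """At most one request per `delay` seconds to any single domain."""

    def __init__(self, delay=0.5):   # 0.5 s ~= 2 requests/sec per domain
        self.delay = delay
        self.next_allowed = {}       # domain -> earliest next fetch time

    def wait_time(self, url, now=None):
        """Seconds the caller must wait before fetching this URL."""
        domain = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        wait = max(0.0, self.next_allowed.get(domain, 0.0) - now)
        self.next_allowed[domain] = now + wait + self.delay
        return wait
```

In a distributed crawler the same idea is usually enforced by routing all URLs of one domain to the same worker, so the limiter state stays local.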
Document Processor
Extracts text, title, headings, and links from HTML. Strips boilerplate (navigation, ads). Detects language. Computes document features (word count, freshness, spam score).
Indexer
Builds the inverted index: maps each term to its posting list (list of documents containing the term, with positions and frequencies). Index is sharded across machines by document ID range.
Query Processor & Ranker
Parses the query, applies spell correction, and expands synonyms. Looks up terms in the inverted index, computes BM25 scores, and combines them with PageRank and ML-learned signals. Returns the top-K results.
Deep Dive: Core Components
Inverted Index
import math
import re
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> [(doc_id, term_freq, [positions])]
        self.index = defaultdict(list)
        self.doc_lengths = {}

    def _tokenize(self, text):
        return re.findall(r"\w+", text.lower())

    def add_document(self, doc_id, text):
        tokens = self._tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        term_positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)
        for term, positions in term_positions.items():
            self.index[term].append(
                (doc_id, len(positions), positions)
            )

    def _bm25(self, tf, doc_len, term, k1=1.2, b=0.75):
        n, df = len(self.doc_lengths), len(self.index[term])
        avgdl = sum(self.doc_lengths.values()) / n
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

    def search(self, query, k=10):
        tokens = self._tokenize(query)
        scores = defaultdict(float)
        for token in tokens:
            for doc_id, tf, _ in self.index.get(token, []):
                scores[doc_id] += self._bm25(
                    tf, self.doc_lengths[doc_id], token
                )
        return sorted(scores.items(),
                      key=lambda x: -x[1])[:k]
BM25 Scoring
BM25 Formula
BM25 is the standard lexical relevance scoring function:

score(D, Q) = Σ_{q∈Q} IDF(q) × tf(q, D) × (k1 + 1) / (tf(q, D) + k1 × (1 − b + b × |D| / avgdl))

where tf(q, D) is the frequency of term q in document D, IDF(q) is the inverse document frequency of q, |D| is the document length in tokens, and avgdl is the average document length across the corpus. Typical parameters: k1 = 1.2, b = 0.75.
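The formula translates directly into code. This sketch uses the IDF variant with +1 smoothing (as in Lucene's default similarity), which keeps scores non-negative:

```python
import math

def bm25(tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document.

    tf: term frequency in the document
    df: number of documents containing the term
    N:  total number of documents in the corpus
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

Two properties worth checking: the same term frequency scores higher in a shorter document (length normalization via b), and a rare term outscores a common one (IDF).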
PageRank
Computes page authority from the link graph. Pages linked by many high-authority pages get higher scores. Computed offline via iterative MapReduce over the entire web graph. Combined with BM25 at query time.
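A toy power-iteration version of PageRank over an in-memory graph shows the core idea (production systems compute this as a distributed batch job, as noted above):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank. links: {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                 # dangling node: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                        # split rank across outgoing links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Ranks always sum to 1, and "c" ends up highest here because it collects rank from both "a" and "b".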
Learning to Rank (LTR)
Modern search engines use ML models to combine 100+ ranking signals:
- Query-document features: BM25, title match, URL match
- Document features: PageRank, freshness, spam score, content quality
- User features: Click-through rate, dwell time, bounce rate
- Context: Location, device, time of day, search history
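At its simplest, signal combination is a weighted sum over the feature groups above; the weights below are made-up placeholders for what an LTR model such as LambdaMART would actually learn from click data:

```python
def rank_score(features, weights):
    """Linear combination of ranking signals (a stand-in for a real
    learned model; feature names and weights here are illustrative)."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"bm25": 0.5, "pagerank": 0.3, "ctr": 0.15, "freshness": 0.05}
doc = {"bm25": 0.87, "pagerank": 0.6, "ctr": 0.4, "freshness": 0.9}
score = rank_score(doc, weights)
```

The practical difference with a real LTR model is that tree ensembles capture feature interactions (e.g. "title match matters more for short queries") that no fixed weight vector can.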
Autocomplete
Built on a trie data structure populated from query logs. Each node stores the top-K completions by frequency. Updated hourly from aggregated query logs. Served from memory with <10ms latency.
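A sketch of the trie described above: each node caches its own top-K completions, so a lookup costs O(len(prefix)) with no subtree traversal. It assumes each query is inserted once with its aggregate count from the logs:

```python
class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}
        self.top_k = []   # [(count, query)] best completions through here

class SuggestTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def add(self, query, count):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((count, query))
            node.top_k.sort(reverse=True)   # keep only the K best
            del node.top_k[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [q for _, q in node.top_k]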
Scaling & Optimization
Index Sharding
Two strategies: document-based sharding (each shard holds all terms for a subset of documents) or term-based sharding (each shard holds posting lists for a subset of terms). Document-based is simpler and more common — query is broadcast to all shards, results are merged.
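The scatter-gather step for document-based sharding might look like this (the `FakeShard` stub stands in for an RPC to a real index shard; each shard returns its local top-k sorted by score):

```python
import heapq

class FakeShard:
    """In-memory stand-in for a remote index shard."""
    def __init__(self, results):
        self.results = sorted(results, reverse=True)  # [(score, doc_id)]

    def search(self, query, k):
        return self.results[:k]

def scatter_gather(shards, query, k=10):
    """Broadcast the query to every shard, merge per-shard top-k lists
    (each sorted descending) into a global top-k."""
    partials = [shard.search(query, k) for shard in shards]  # fan-out
    merged = heapq.merge(*partials, reverse=True)            # fan-in
    return list(merged)[:k]
```

Because each shard already returns a sorted list, the merge is a cheap k-way heap merge rather than a full re-sort.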
Query Optimization
- Early termination: Stop scoring once enough high-quality results are found
- Tiered index: Check high-quality documents first; only search lower tiers if needed
- Result caching: Cache top results for popular queries (40-60% cache hit rate)
- Skip lists: Efficiently intersect posting lists for multi-term queries
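Skip pointers let an intersection jump over runs of non-matching doc IDs; for in-memory lists, binary-searching the longer list from the shorter one gives the same effect, as in this sketch:

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted posting lists of doc IDs. Iterates the
    shorter list and binary-searches the longer one, narrowing the
    search window as it goes (analogous to following skip pointers)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    result, lo = [], 0
    for doc_id in shorter:
        lo = bisect_left(longer, doc_id, lo)
        if lo == len(longer):
            break                      # longer list exhausted
        if longer[lo] == doc_id:
            result.append(doc_id)
    return result
```

For a rare term ANDed with a common one, this is far faster than a linear two-pointer scan, which is exactly the case skip pointers target on disk.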
Practice Problems
Practice 1: Real-Time News Index
Breaking news must be searchable within 30 seconds of publication. Your batch indexing pipeline runs hourly. Design a real-time indexing layer that handles 10,000 new articles/minute.
Practice 2: Personalized Search
Two users search for "python." One is a developer, the other is a pet owner. Design a personalization layer that re-ranks results based on user context without sacrificing latency.
Practice 3: Multi-Modal Search
Users want to search by uploading an image (reverse image search) or combining text + image queries. Design the indexing and retrieval pipeline for multi-modal search.
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Crawler | Scrapy / Custom (Go) | Distributed web crawling |
| Index | Lucene / Elasticsearch | Inverted index + search |
| Ranking | BM25 + LambdaMART | Relevance scoring |
| PageRank | Spark GraphX | Link analysis |
| Autocomplete | Trie + Redis | Query suggestions |
| Spell Check | SymSpell / Norvig | Typo correction |
| Cache | CDN + Redis | Popular query results |
Key Takeaways
- The inverted index is the fundamental data structure — maps terms to documents
- BM25 remains the baseline for relevance; combine with PageRank and LTR models
- Shard by document ID, query all shards in parallel, merge results
- Caching popular queries gives 40-60% hit rate and huge latency savings
- Separate real-time index (small, fast) from main index (large, batch-built)