Design a Search Engine

Hard · 35 min read

Problem Statement & Requirements

Why Search Engines Matter

Google processes 8.5 billion queries per day. Search is the gateway to the internet — and increasingly to enterprise knowledge. Understanding search engine design teaches core distributed systems concepts: crawling, indexing, ranking, caching, and serving at massive scale.

Think of a search engine like a vast library with a brilliant librarian. The crawler is the librarian who reads every book and creates an index card for each. The index is the card catalog. The ranking algorithm is the librarian's judgment about which books best answer your question. And autocomplete is the librarian anticipating your question before you finish asking.

Functional Requirements

  • Keyword search over web pages with ranked, paginated results
  • Autocomplete suggestions as the user types
  • Spell correction ("did you mean") plus language and safe-search filters
  • Continuous crawling so the index reflects new and updated pages

Non-Functional Requirements

  • Low search latency: tens of milliseconds server-side, as in the sample response below
  • Scale: roughly 100K search QPS and 500K autocomplete QPS (see estimation)
  • High availability; serving slightly stale results beats serving none
  • Freshness: high-priority pages recrawled far more often than the long tail

Back-of-Envelope Estimation

| Parameter | Estimate |
| --- | --- |
| Pages indexed | 50 billion |
| Avg page size | 50 KB (text content extracted) |
| Total raw content | 50B × 50 KB = 2.5 PB |
| Inverted index size | ~500 TB (compressed) |
| Search QPS | 100,000 (peak: 300,000) |
| Crawl rate | ~1 billion pages/day |
| Autocomplete QPS | 500,000 (fires per keystroke, so higher than search) |
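These figures can be sanity-checked in a few lines. The 20% index-to-raw compression ratio below is an assumption, chosen only because it matches the ~500 TB row in the table:

```python
# Back-of-envelope check for the estimation table.
PAGES = 50e9        # pages indexed
PAGE_SIZE = 50e3    # bytes of extracted text per page

raw_bytes = PAGES * PAGE_SIZE      # total raw content
assert raw_bytes == 2.5e15         # 2.5 PB, matching the table

# Inverted index at ~20% of raw text (assumed compression ratio)
index_bytes = raw_bytes * 0.2
print(f"raw: {raw_bytes / 1e15:.1f} PB, index: {index_bytes / 1e12:.0f} TB")
```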

System API Design

Search APIs
# Web search
GET /api/v1/search?q=distributed+consensus+algorithms
    &page=1&count=10&lang=en&safe=true

# Response
{
  "results": [
    {
      "url": "https://example.com/raft-consensus",
      "title": "Understanding Raft Consensus",
      "snippet": "Raft is a consensus algorithm designed...",
      "score": 0.87
    }
  ],
  "total_results": 1250000,
  "query_time_ms": 45,
  "spell_suggestion": null
}

# Autocomplete
GET /api/v1/suggest?q=distrib&limit=5
# ["distributed systems", "distributed consensus", ...]

# Index management (internal)
POST /api/v1/index/documents
{ "urls": ["https://example.com/page1"], "priority": "high" }

Data Model

Index Structures
-- Forward index (document store)
CREATE TABLE documents (
    doc_id      BIGINT PRIMARY KEY,
    url         TEXT,
    title       TEXT,
    content     TEXT,
    pagerank    FLOAT,
    crawled_at  TIMESTAMP,
    language    VARCHAR(5)
);

-- Inverted index (term -> document list)
-- Stored as: term -> [(doc_id, tf, positions), ...]
-- Compressed with variable-byte or PForDelta encoding

-- URL frontier (crawl queue)
CREATE TABLE url_frontier (
    url_hash    BIGINT PRIMARY KEY,
    url         TEXT,
    priority    FLOAT,
    last_crawl  TIMESTAMP,
    next_crawl  TIMESTAMP,
    domain      VARCHAR
);

-- Query logs (for autocomplete and ranking improvement)
CREATE TABLE query_logs (
    query       TEXT,
    count       BIGINT,
    clicked_urls TEXT[],
    timestamp   TIMESTAMP
);
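The variable-byte compression mentioned in the inverted-index comment above can be sketched directly. Posting lists store gaps between sorted doc IDs rather than the IDs themselves, so most values are small and fit in one or two bytes:

```python
def vb_encode(numbers):
    """Variable-byte encode non-negative integers: 7 payload bits per
    byte, with the high bit marking the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.insert(0, n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[-1] |= 0x80          # terminator bit on the final byte
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:            # last byte of this number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

# Doc IDs [824, 829, 215406] stored as first ID plus deltas
gaps = [824, 5, 214577]
assert vb_decode(vb_encode(gaps)) == gaps
```

PForDelta follows the same idea but packs whole blocks of gaps at a fixed bit width, handling outliers as exceptions.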

High-Level Architecture

Web Crawler

Distributed crawlers fetch pages while respecting robots.txt and politeness policies (1-2 requests/sec per domain). The URL frontier prioritizes fresh, high-PageRank pages. DNS resolutions are cached to avoid a lookup per fetch, and duplicate content is detected via SimHash fingerprints.
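SimHash near-duplicate detection hashes each token into a single fingerprint whose Hamming distance tracks content similarity. A minimal sketch (MD5 is used here only as a stable token hash; production systems use cheaper hash functions):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: similar texts produce fingerprints
    with a small Hamming distance."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    return bin(a ^ b).count("1")

page1 = "raft is a consensus algorithm designed for understandability"
page2 = "raft is a consensus algorithm designed for clarity"
page3 = "chocolate chip cookie recipe with brown butter"
# Near-duplicates land much closer than unrelated pages
assert hamming(simhash(page1), simhash(page2)) < hamming(simhash(page1), simhash(page3))
```

In practice, fingerprints within a small Hamming distance (e.g. ≤ 3 bits) are treated as duplicates and found efficiently by indexing fingerprint segments.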

Document Processor

Extracts text, title, headings, and links from HTML. Strips boilerplate (navigation, ads). Detects language. Computes document features (word count, freshness, spam score).

Indexer

Builds the inverted index: maps each term to its posting list (list of documents containing the term, with positions and frequencies). Index is sharded across machines by document ID range.

Query Processor & Ranker

Parses query, applies spell correction, expands synonyms. Looks up terms in inverted index. Computes BM25 scores, combines with PageRank and ML-learned signals. Returns top-K results.

Deep Dive: Core Components

Inverted Index

Building an Inverted Index
import math
from collections import defaultdict

class InvertedIndex:
    def __init__(self, k1=1.2, b=0.75):
        # term -> [(doc_id, term_freq, [positions])]
        self.index = defaultdict(list)
        self.doc_lengths = {}
        self.k1, self.b = k1, b

    def _tokenize(self, text):
        return text.lower().split()

    def add_document(self, doc_id, text):
        tokens = self._tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        term_positions = defaultdict(list)

        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)

        for term, positions in term_positions.items():
            self.index[term].append(
                (doc_id, len(positions), positions)
            )

    def _bm25(self, tf, doc_len, term):
        n_docs = len(self.doc_lengths)
        df = len(self.index[term])          # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        avgdl = sum(self.doc_lengths.values()) / n_docs
        norm = 1 - self.b + self.b * doc_len / avgdl
        return idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)

    def search(self, query, k=10):
        tokens = self._tokenize(query)
        scores = defaultdict(float)
        for token in tokens:
            for doc_id, tf, _ in self.index.get(token, []):
                scores[doc_id] += self._bm25(
                    tf, self.doc_lengths[doc_id], token
                )
        return sorted(scores.items(),
                      key=lambda x: -x[1])[:k]

BM25 Scoring

BM25 Formula

BM25 is the standard relevance scoring function:

    score(D, Q) = Σ_{q ∈ Q} IDF(q) × (tf(q, D) × (k1 + 1)) / (tf(q, D) + k1 × (1 − b + b × |D| / avgdl))

where tf(q, D) = frequency of term q in document D, IDF = inverse document frequency, |D| = document length in tokens, and avgdl = average document length across the corpus. Typical parameters: k1 = 1.2, b = 0.75.
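Writing the per-term formula as a function makes the roles of k1 and b visible: k1 controls term-frequency saturation, and b controls how strongly long documents are penalized (IDF is taken as given here):

```python
import math

def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Score contribution of a single query term, per the formula above."""
    norm = 1 - b + b * doc_len / avgdl        # length normalization
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

idf = math.log(1000 / 10)  # e.g. term appears in 10 of 1000 docs

# Term frequency saturates: each extra occurrence adds less
s1 = bm25_term(1, 100, 100, idf)
s2 = bm25_term(2, 100, 100, idf)
s3 = bm25_term(3, 100, 100, idf)
assert s2 - s1 > s3 - s2 > 0

# At equal tf, longer documents score lower
assert bm25_term(2, 500, 100, idf) < bm25_term(2, 100, 100, idf)
```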

PageRank

Computes page authority from the link graph. Pages linked by many high-authority pages get higher scores. Computed offline via iterative MapReduce over the entire web graph. Combined with BM25 at query time.
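The iterative computation can be sketched in memory with power iteration; real systems distribute the same update over the full web graph:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration over an adjacency dict {page: [outlinks]}.
    Each iteration redistributes rank along links, with a (1 - damping)
    teleport share spread uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling node: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# 'hub' is linked by both other pages, so it earns the highest rank
graph = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(graph)
assert ranks["hub"] > ranks["a"] > ranks["b"]
```

Ranks always sum to 1, so scores are comparable across the whole graph when blended with BM25.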

Learning to Rank (LTR)

Modern search engines use ML models (e.g. LambdaMART, as in the reference table below) to combine 100+ ranking signals, including:

  • Text relevance (BM25 over title, body, and anchor text)
  • Link authority (PageRank)
  • User behavior (click-through rates from query logs)
  • Document features computed at index time (freshness, spam score)

Autocomplete

Built on a trie data structure populated from query logs. Each node stores the top-K completions by frequency. Updated hourly from aggregated query logs. Served from memory with <10ms latency.
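A sketch of such a trie, caching the top-K completions at every node so a lookup costs O(prefix length) with no subtree traversal (counts here are illustrative; in production they come from aggregated query logs):

```python
class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children = {}
        self.top = []    # cached top-K (count, query) pairs for this prefix

class Autocomplete:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def add(self, query, count):
        """Insert a logged query; update the cached top-K along its path."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((count, query))
            node.top.sort(reverse=True)    # K is small, so sorting is cheap
            del node.top[self.k:]

    def suggest(self, prefix):
        """Walk the prefix, then return the precomputed completions."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top]

ac = Autocomplete(k=2)
for q, c in [("distributed systems", 900),
             ("distributed consensus", 400),
             ("distillation", 100)]:
    ac.add(q, c)
assert ac.suggest("distrib") == ["distributed systems", "distributed consensus"]
```

Because the hot path is a pure dictionary walk, the whole structure serves comfortably from memory; the hourly rebuild from query logs happens offline and is swapped in atomically.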

Scaling & Optimization

Index Sharding

Two strategies: document-based sharding (each shard holds all terms for a subset of documents) or term-based sharding (each shard holds posting lists for a subset of terms). Document-based is simpler and more common — query is broadcast to all shards, results are merged.
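The document-based scatter-gather can be sketched as follows. `search_shard` stands in for an RPC to one shard; each shard returns its local top-k and the coordinator merges:

```python
import heapq

def search_shard(shard, query, k):
    """Stand-in for an RPC: a shard returns its local top-k (score, url) hits."""
    return heapq.nlargest(k, shard.get(query, []))

def scatter_gather(shards, query, k=10):
    """Broadcast the query to every document-sharded index, merge partials.
    Asking each shard for k results guarantees a correct global top-k."""
    partials = [search_shard(s, query, k) for s in shards]  # parallel in practice
    return heapq.nlargest(k, (hit for part in partials for hit in part))

# Each shard holds complete postings for its own subset of documents
shards = [
    {"raft": [(0.91, "a.com"), (0.40, "b.com")]},
    {"raft": [(0.87, "c.com")], "paxos": [(0.95, "d.com")]},
]
assert scatter_gather(shards, "raft", k=2) == [(0.91, "a.com"), (0.87, "c.com")]
```

The trade-off versus term-based sharding: every query touches every shard, but adding documents never requires moving posting lists between machines.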

Query Optimization

  • Cache results for popular queries; head queries account for a 40-60% hit rate
  • Process posting lists shortest-first and terminate early once the top-K is stable
  • Keep hot posting lists in memory; leave cold ones compressed on disk

Practice Problems

Practice 1: Real-Time News Index

Breaking news must be searchable within 30 seconds of publication. Your batch indexing pipeline runs hourly. Design a real-time indexing layer that handles 10,000 new articles/minute.

Practice 2: Personalized Search

Two users search for "python." One is a developer, the other is a pet owner. Design a personalization layer that re-ranks results based on user context without sacrificing latency.

Practice 3: Multi-Modal Search

Users want to search by uploading an image (reverse image search) or combining text + image queries. Design the indexing and retrieval pipeline for multi-modal search.

Quick Reference

| Component | Technology | Purpose |
| --- | --- | --- |
| Crawler | Scrapy / Custom (Go) | Distributed web crawling |
| Index | Lucene / Elasticsearch | Inverted index + search |
| Ranking | BM25 + LambdaMART | Relevance scoring |
| PageRank | Spark GraphX | Link analysis |
| Autocomplete | Trie + Redis | Query suggestions |
| Spell Check | SymSpell / Norvig | Typo correction |
| Cache | CDN + Redis | Popular query results |

Key Takeaways

  • The inverted index is the fundamental data structure — maps terms to documents
  • BM25 remains the baseline for relevance; combine with PageRank and LTR models
  • Shard by document ID, query all shards in parallel, merge results
  • Caching popular queries gives 40-60% hit rate and huge latency savings
  • Separate real-time index (small, fast) from main index (large, batch-built)