Problem Statement & Requirements
Why Search Engines Matter
Google processes 8.5 billion queries per day. Search is the gateway to the internet — and increasingly to enterprise knowledge. Understanding search engine design teaches core distributed systems concepts: crawling, indexing, ranking, caching, and serving at massive scale.
Think of a search engine like a vast library with a brilliant librarian. The crawler is the librarian who reads every book and creates an index card for each. The index is the card catalog. The ranking algorithm is the librarian's judgment about which books best answer your question. And autocomplete is the librarian anticipating your question before you finish asking.
Functional Requirements
- Web crawling — Discover and download billions of web pages
- Indexing — Build searchable index from crawled content
- Query processing — Parse, expand, and execute search queries
- Ranking — Order results by relevance using signals like PageRank, BM25, user behavior
- Autocomplete — Suggest queries as the user types
- Spell correction — Handle typos and misspellings
Non-Functional Requirements
- Query latency — <200ms for web search, <50ms for autocomplete
- Freshness — Breaking news indexed within minutes, general content within hours
- Scale — Billions of documents, 100K+ QPS
- Relevance — Top-3 results contain the answer >80% of the time
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Pages indexed | 50 billion |
| Avg page size | 50 KB (text content extracted) |
| Total raw content | 50B × 50 KB = 2.5 PB |
| Inverted index size | ~500 TB (compressed) |
| Search QPS | 100,000 (peak: 300,000) |
| Crawl rate | ~1 billion pages/day |
| Autocomplete QPS | 500,000 (higher than search, since a request fires on each keystroke) |
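The totals above can be sanity-checked in a few lines (constants taken from the table; the 50-day figure is simply corpus size divided by crawl rate):

```python
# Back-of-envelope check for the estimation table.
PAGES = 50e9            # pages indexed
AVG_PAGE_BYTES = 50e3   # 50 KB of extracted text per page

raw_bytes = PAGES * AVG_PAGE_BYTES
print(f"Raw content: {raw_bytes / 1e15:.1f} PB")          # -> 2.5 PB

# At 1B pages/day, a full re-crawl of the corpus takes:
pages_per_day = 1e9
print(f"Full refresh: {PAGES / pages_per_day:.0f} days")  # -> 50 days
```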
System API Design
# Web search
GET /api/v1/search?q=distributed+consensus+algorithms
    &page=1&count=10&lang=en&safe=true

# Response
{
  "results": [
    {
      "url": "https://example.com/raft-consensus",
      "title": "Understanding Raft Consensus",
      "snippet": "Raft is a consensus algorithm designed...",
      "score": 0.87
    }
  ],
  "total_results": 1250000,
  "query_time_ms": 45,
  "spell_suggestion": null
}

# Autocomplete
GET /api/v1/suggest?q=distrib&limit=5
# ["distributed systems", "distributed consensus", ...]

# Index management (internal)
POST /api/v1/index/documents
{ "urls": ["https://example.com/page1"], "priority": "high" }
Data Model
-- Forward index (document store)
CREATE TABLE documents (
    doc_id      BIGINT PRIMARY KEY,
    url         TEXT,
    title       TEXT,
    content     TEXT,
    pagerank    FLOAT,
    crawled_at  TIMESTAMP,
    language    VARCHAR(5)
);
-- Inverted index (term -> document list)
-- Stored as: term -> [(doc_id, tf, positions), ...]
-- Compressed with variable-byte or PForDelta encoding
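As a concrete illustration of the variable-byte encoding mentioned above, here is a minimal sketch that compresses gap-encoded doc IDs: 7 data bits per byte, with the high bit marking the final byte of each number (conventions vary across implementations):

```python
def vbyte_encode(nums):
    """Variable-byte encode a list of non-negative ints."""
    out = bytearray()
    for n in nums:
        chunk = []
        while True:
            chunk.append(n & 0x7F)  # low 7 bits
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80           # high bit marks the last byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                # terminal byte: emit the number
            nums.append(n)
            n = 0
    return nums

# Posting lists store gaps between sorted doc IDs, so values stay small:
doc_ids = [5, 1000, 1005, 100000]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)
```

Gap encoding is what makes the scheme effective: dense posting lists produce tiny gaps that fit in a single byte each.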
-- URL frontier (crawl queue)
CREATE TABLE url_frontier (
    url_hash    BIGINT PRIMARY KEY,
    url         TEXT,
    priority    FLOAT,
    last_crawl  TIMESTAMP,
    next_crawl  TIMESTAMP,
    domain      VARCHAR
);
-- Query logs (for autocomplete and ranking improvement)
CREATE TABLE query_logs (
    query         TEXT,
    count         BIGINT,
    clicked_urls  TEXT[],
    timestamp     TIMESTAMP
);
High-Level Architecture
Web Crawler
Distributed crawlers fetch pages while respecting robots.txt and politeness policies (1-2 requests/sec per domain). The URL frontier prioritizes fresh, high-PageRank pages. DNS resolutions are cached to avoid a lookup per fetch. Duplicate and near-duplicate content is detected via SimHash.
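The per-domain politeness policy can be sketched as a simple rate limiter. This is a toy in-process version (`PolitenessLimiter` and its defaults are illustrative; a real crawler would also honor any crawl-delay directives a site publishes):

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """At most one request per `delay` seconds to any single domain."""

    def __init__(self, delay=0.5):   # 0.5 s ~= 2 requests/sec per domain
        self.delay = delay
        self.next_allowed = {}       # domain -> earliest next fetch time

    def wait_time(self, url, now=None):
        """Seconds the caller must wait before fetching this URL."""
        domain = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        wait = max(0.0, self.next_allowed.get(domain, 0.0) - now)
        self.next_allowed[domain] = now + wait + self.delay
        return wait
```

In a distributed crawler the same idea is usually enforced by routing all URLs of one domain to the same worker, so the limiter state stays local.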
Document Processor
Extracts text, title, headings, and links from HTML. Strips boilerplate (navigation, ads). Detects language. Computes document features (word count, freshness, spam score).
Indexer
Builds the inverted index: maps each term to its posting list (list of documents containing the term, with positions and frequencies). Index is sharded across machines by document ID range.
Query Processor & Ranker
Parses the query, applies spell correction, and expands synonyms. Looks up terms in the inverted index, computes BM25 scores, and combines them with PageRank and ML-learned signals. Returns the top-K results.
Deep Dive: Core Components
Inverted Index
import math
import re
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> [(doc_id, term_freq, [positions])]
        self.index = defaultdict(list)
        self.doc_lengths = {}

    def _tokenize(self, text):
        return re.findall(r"\w+", text.lower())

    def add_document(self, doc_id, text):
        tokens = self._tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        term_positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)
        for term, positions in term_positions.items():
            self.index[term].append(
                (doc_id, len(positions), positions)
            )

    def _bm25(self, tf, doc_len, term, k1=1.2, b=0.75):
        n, df = len(self.doc_lengths), len(self.index[term])
        avgdl = sum(self.doc_lengths.values()) / n
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

    def search(self, query, k=10):
        tokens = self._tokenize(query)
        scores = defaultdict(float)
        for token in tokens:
            for doc_id, tf, _ in self.index.get(token, []):
                scores[doc_id] += self._bm25(
                    tf, self.doc_lengths[doc_id], token
                )
        return sorted(scores.items(),
                      key=lambda x: -x[1])[:k]
BM25 Scoring
BM25 Formula
BM25 is the standard lexical relevance scoring function:

score(D, Q) = Σ_{q∈Q} IDF(q) × tf(q, D) × (k1 + 1) / (tf(q, D) + k1 × (1 − b + b × |D| / avgdl))

where tf(q, D) is the frequency of term q in document D, IDF(q) is the inverse document frequency of q, |D| is the document length in tokens, and avgdl is the average document length across the corpus. Typical parameters: k1 = 1.2, b = 0.75.
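The formula translates directly into code. This sketch uses the IDF variant with +1 smoothing (as in Lucene's default similarity), which keeps scores non-negative:

```python
import math

def bm25(tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document.

    tf: term frequency in the document
    df: number of documents containing the term
    N:  total number of documents in the corpus
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

Two properties worth checking: the same term frequency scores higher in a shorter document (length normalization via b), and a rare term outscores a common one (IDF).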
PageRank
Computes page authority from the link graph. Pages linked by many high-authority pages get higher scores. Computed offline via iterative MapReduce over the entire web graph. Combined with BM25 at query time.
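A toy power-iteration version of PageRank over an in-memory graph shows the core idea (production systems compute this as a distributed batch job, as noted above):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank. links: {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                 # dangling node: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                        # split rank across outgoing links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Ranks always sum to 1, and "c" ends up highest here because it collects rank from both "a" and "b".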
Learning to Rank (LTR)
Modern search engines use ML models to combine 100+ ranking signals:
- Query-document features: BM25, title match, URL match
- Document features: PageRank, freshness, spam score, content quality
- User features: Click-through rate, dwell time, bounce rate
- Context: Location, device, time of day, search history
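At its simplest, signal combination is a weighted sum over the feature groups above; the weights below are made-up placeholders for what an LTR model such as LambdaMART would actually learn from click data:

```python
def rank_score(features, weights):
    """Linear combination of ranking signals (a stand-in for a real
    learned model; feature names and weights here are illustrative)."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"bm25": 0.5, "pagerank": 0.3, "ctr": 0.15, "freshness": 0.05}
doc = {"bm25": 0.87, "pagerank": 0.6, "ctr": 0.4, "freshness": 0.9}
score = rank_score(doc, weights)
```

The practical difference with a real LTR model is that tree ensembles capture feature interactions (e.g. "title match matters more for short queries") that no fixed weight vector can.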
Autocomplete
Built on a trie data structure populated from query logs. Each node stores the top-K completions by frequency. Updated hourly from aggregated query logs. Served from memory with <10ms latency.
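A sketch of the trie described above: each node caches its own top-K completions, so a lookup costs O(len(prefix)) with no subtree traversal. It assumes each query is inserted once with its aggregate count from the logs:

```python
class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}
        self.top_k = []   # [(count, query)] best completions through here

class SuggestTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def add(self, query, count):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((count, query))
            node.top_k.sort(reverse=True)   # keep only the K best
            del node.top_k[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [q for _, q in node.top_k]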
Scaling & Optimization
Index Sharding
Two strategies: document-based sharding (each shard holds all terms for a subset of documents) or term-based sharding (each shard holds posting lists for a subset of terms). Document-based is simpler and more common — query is broadcast to all shards, results are merged.
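The scatter-gather step for document-based sharding might look like this (the `FakeShard` stub stands in for an RPC to a real index shard; each shard returns its local top-k sorted by score):

```python
import heapq

class FakeShard:
    """In-memory stand-in for a remote index shard."""
    def __init__(self, results):
        self.results = sorted(results, reverse=True)  # [(score, doc_id)]

    def search(self, query, k):
        return self.results[:k]

def scatter_gather(shards, query, k=10):
    """Broadcast the query to every shard, merge per-shard top-k lists
    (each sorted descending) into a global top-k."""
    partials = [shard.search(query, k) for shard in shards]  # fan-out
    merged = heapq.merge(*partials, reverse=True)            # fan-in
    return list(merged)[:k]
```

Because each shard already returns a sorted list, the merge is a cheap k-way heap merge rather than a full re-sort.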
Query Optimization
- Early termination: Stop scoring once enough high-quality results are found
- Tiered index: Check high-quality documents first; only search lower tiers if needed
- Result caching: Cache top results for popular queries (40-60% cache hit rate)
- Skip lists: Efficiently intersect posting lists for multi-term queries
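Skip pointers let an intersection jump over runs of non-matching doc IDs; for in-memory lists, binary-searching the longer list from the shorter one gives the same effect, as in this sketch:

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted posting lists of doc IDs. Iterates the
    shorter list and binary-searches the longer one, narrowing the
    search window as it goes (analogous to following skip pointers)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    result, lo = [], 0
    for doc_id in shorter:
        lo = bisect_left(longer, doc_id, lo)
        if lo == len(longer):
            break                      # longer list exhausted
        if longer[lo] == doc_id:
            result.append(doc_id)
    return result
```

For a rare term ANDed with a common one, this is far faster than a linear two-pointer scan, which is exactly the case skip pointers target on disk.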
Practice Problems
Practice 1: Real-Time News Index
Breaking news must be searchable within 30 seconds of publication. Your batch indexing pipeline runs hourly. Design a real-time indexing layer that handles 10,000 new articles/minute.
Practice 2: Personalized Search
Two users search for "python." One is a developer, the other is a pet owner. Design a personalization layer that re-ranks results based on user context without sacrificing latency.
Practice 3: Multi-Modal Search
Users want to search by uploading an image (reverse image search) or combining text + image queries. Design the indexing and retrieval pipeline for multi-modal search.
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Crawler | Scrapy / Custom (Go) | Distributed web crawling |
| Index | Lucene / Elasticsearch | Inverted index + search |
| Ranking | BM25 + LambdaMART | Relevance scoring |
| PageRank | Spark GraphX | Link analysis |
| Autocomplete | Trie + Redis | Query suggestions |
| Spell Check | SymSpell / Norvig | Typo correction |
| Cache | CDN + Redis | Popular query results |
Key Takeaways
- The inverted index is the fundamental data structure — maps terms to documents
- BM25 remains the baseline for relevance; combine with PageRank and LTR models
- Shard by document ID, query all shards in parallel, merge results
- Caching popular queries gives 40-60% hit rate and huge latency savings
- Separate real-time index (small, fast) from main index (large, batch-built)