🔍 Pattern Matching Algorithms

Master the art of finding patterns in text! From simple search to advanced algorithms like KMP and Boyer-Moore, learn how search engines, DNA analyzers, and text editors find what they're looking for.

⏱️ 45-60 mins 📊 Intermediate Level 🎮 6 Interactive Demos 💡 Real-world Applications

🤔 Why Should You Care About Pattern Matching?

Pattern matching is everywhere in modern technology! Let's see where you encounter it daily:

📝

Text Editors & IDEs

When you press Ctrl+F to find text, or when your IDE highlights syntax errors, that's pattern matching in action! VS Code uses Boyer-Moore for fast text search.

🧬

DNA Sequencing

Scientists use pattern matching to find specific gene sequences in DNA strands. A single human genome has 3 billion base pairs - imagine finding a specific sequence without efficient algorithms!

🔎

Search Engines

Google processes billions of searches daily. They use advanced pattern matching to find your query in trillions of web pages. Without efficient algorithms, searches would take forever!

🛡️

Security & Antivirus

Antivirus software scans files for known virus signatures using pattern matching. They need to check millions of patterns quickly without slowing down your computer.

📚

Plagiarism Detection

Tools like Turnitin use sophisticated pattern matching to compare submitted work against billions of documents to detect potential plagiarism.

🌟 The Basics - Let's Start Simple!

What is Pattern Matching?

Imagine you're reading a book and looking for every occurrence of the word "magic". Pattern matching is teaching a computer to do exactly that - find a specific pattern (like "magic") in a larger text (like your book).

Try It: Naive Pattern Matching

See how the simplest algorithm checks every position:

Your First Pattern Matching Algorithm 🎯
# The Naive Approach - Simple but it works! def find_pattern_simple(text, pattern): # Think of it like sliding a ruler along a page matches = [] # Try every starting position in the text for start in range(len(text) - len(pattern) + 1): # Check if pattern matches at this position found = True for i in range(len(pattern)): if text[start + i] != pattern[i]: found = False break # No match, try next position if found: matches.append(start) # Found it! 🎉 return matches # Let's try it! text = "The cat in the hat sat on the mat" pattern = "at" positions = find_pattern_simple(text, pattern) print(f"Found '{pattern}' at positions: {positions}") # Output: Found 'at' at positions: [5, 16, 24, 34]

How It Works - Step by Step

1
Start at the beginning: Place the pattern at position 0 of the text
2
Compare characters: Check each character of the pattern with the text
3
Match or Move: If all match, record position. Either way, slide pattern one position right
4
Repeat: Continue until the pattern can't fit in remaining text

⚠️ Common Mistake: Off-by-One Error

# ❌ WRONG - This will miss the last possible position! for start in range(len(text) - len(pattern)): # Missing + 1 # ... rest of code

Why it's wrong: If text="ABC" and pattern="BC", we need to check position 1, but range(3-2) only gives us position 0!

✅ The Right Way

# ✅ CORRECT - Includes all valid positions for start in range(len(text) - len(pattern) + 1): # Now we check all positions where pattern can fit

🎨 Common Pattern Matching Algorithms

Intermediate

KMP Algorithm - The Smart Skipper

KMP (Knuth-Morris-Pratt) is like having a smart bookmark that remembers partial matches!

🎮 Interactive KMP Visualizer

See how KMP builds its "failure function" to skip unnecessary comparisons:

KMP Algorithm - Skip the Redundant Checks! 🏃‍♂️
# KMP - It's like having a smart assistant that remembers! def build_lps_array(pattern): """LPS = Longest Proper Prefix which is also Suffix It tells us: 'If we fail at position i, where should we restart?'""" lps = [0] * len(pattern) length = 0 # Length of previous longest prefix suffix i = 1 while i < len(pattern): if pattern[i] == pattern[length]: # We found a match! Extend the prefix length += 1 lps[i] = length i += 1 else: if length != 0: # Try shorter prefix length = lps[length - 1] else: # No prefix matches lps[i] = 0 i += 1 return lps def kmp_search(text, pattern): """KMP - Smart pattern matching that never goes backward!""" lps = build_lps_array(pattern) matches = [] text_idx = 0 pattern_idx = 0 while text_idx < len(text): if text[text_idx] == pattern[pattern_idx]: # Characters match! Move both forward text_idx += 1 pattern_idx += 1 if pattern_idx == len(pattern): # Found complete match! 🎉 matches.append(text_idx - pattern_idx) pattern_idx = lps[pattern_idx - 1] elif text_idx < len(text) and text[text_idx] != pattern[pattern_idx]: # Mismatch! Use LPS to skip smartly if pattern_idx != 0: pattern_idx = lps[pattern_idx - 1] else: text_idx += 1 return matches # Example: Finding "ABAB" in text text = "ABABDABACDABABCABAB" pattern = "ABAB" print(f"KMP found matches at: {kmp_search(text, pattern)}")
Intermediate

Rabin-Karp - The Hash Detective 🔍

Instead of comparing characters, Rabin-Karp uses math! It calculates a "fingerprint" (hash) for the pattern and checks if any part of the text has the same fingerprint.

🎮 Rolling Hash Calculator

See how the hash "rolls" as we slide the window:

Rabin-Karp - Pattern Matching with Math! 🔢
# Rabin-Karp - Like comparing fingerprints instead of faces! def rabin_karp(text, pattern, prime=101): """Uses rolling hash for fast pattern matching""" m = len(pattern) n = len(text) d = 256 # Number of characters in alphabet pattern_hash = 0 text_hash = 0 h = 1 matches = [] # Calculate h = d^(m-1) % prime for i in range(m - 1): h = (h * d) % prime # Calculate initial hash values for i in range(m): pattern_hash = (d * pattern_hash + ord(pattern[i])) % prime text_hash = (d * text_hash + ord(text[i])) % prime # Slide the pattern over text for i in range(n - m + 1): # If hash values match, double-check character by character if pattern_hash == text_hash: if text[i:i+m] == pattern: matches.append(i) # Found it! 🎯 # Roll the hash to next window if i < n - m: # Remove leading character, add trailing character text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % prime # Handle negative hash value if text_hash < 0: text_hash += prime return matches # Pro tip: Great for finding multiple patterns! text = "The quick brown fox jumps over the lazy dog" pattern = "the" print(f"Found '{pattern}' at: {rabin_karp(text.lower(), pattern)}")
Advanced

Boyer-Moore - The Backwards Genius 🔄

Boyer-Moore reads the pattern backwards! When it finds a mismatch, it can skip many positions at once.

Boyer-Moore - Skip Like a Pro! ⚡
# Boyer-Moore - Why check every character when you can skip? def boyer_moore_simplified(text, pattern): """Simplified Boyer-Moore with bad character rule""" # Build bad character table # It tells us: "If we see character X, how far can we skip?" bad_char = {} for i in range(len(pattern)): bad_char[pattern[i]] = i matches = [] shift = 0 while shift <= len(text) - len(pattern): j = len(pattern) - 1 # Start from the end! # Compare backwards while j >= 0 and pattern[j] == text[shift + j]: j -= 1 if j < 0: # Found a match! 🎉 matches.append(shift) shift += len(pattern) else: # Mismatch - skip smartly! bad_char_shift = j - bad_char.get(text[shift + j], -1) shift += max(1, bad_char_shift) return matches # Boyer-Moore shines with long patterns! text = "HERE IS A SIMPLE EXAMPLE" pattern = "EXAMPLE" print(f"Boyer-Moore found: {boyer_moore_simplified(text, pattern)}")

💪 Practice Problems

Beginner

Problem 1: Count Pattern Occurrences

Write a function that counts how many times a pattern appears in a text (including overlapping occurrences).

Think About It First! 🤔

If text = "AAAA" and pattern = "AA", how many occurrences are there?

Hint: Don't skip ahead after finding a match!

Show Solution
def count_overlapping(text, pattern): count = 0 for i in range(len(text) - len(pattern) + 1): if text[i:i+len(pattern)] == pattern: count += 1 return count # Test: "AAAA" has 3 overlapping "AA"s print(count_overlapping("AAAA", "AA")) # Output: 3
Intermediate

Problem 2: Find All Anagrams

Find all starting positions in text where an anagram of the pattern exists.

Think About It First! 🤔

An anagram has the same characters but in any order. How can you check this efficiently?

Show Solution
from collections import Counter def find_anagrams(text, pattern): result = [] pattern_count = Counter(pattern) window_count = Counter(text[:len(pattern)]) # Check first window if window_count == pattern_count: result.append(0) # Slide the window for i in range(len(pattern), len(text)): # Add new character window_count[text[i]] += 1 # Remove old character old_char = text[i - len(pattern)] window_count[old_char] -= 1 if window_count[old_char] == 0: del window_count[old_char] # Check if current window is anagram if window_count == pattern_count: result.append(i - len(pattern) + 1) return result # Test: Find anagrams of "abc" print(find_anagrams("cbaebabacd", "abc")) # Output: [0, 6]
Advanced

Problem 3: Longest Repeating Substring

Find the longest substring that appears at least twice in the text.

Think About It First! 🤔

You'll need to check all possible substrings. How can you optimize this?

Show Solution
def longest_repeating_substring(text): n = len(text) longest = "" # Try all possible lengths, starting from longest for length in range(n - 1, 0, -1): seen = set() for i in range(n - length + 1): substring = text[i:i + length] if substring in seen: return substring # Found repeating substring! seen.add(substring) return longest # Test print(longest_repeating_substring("banana")) # Output: "ana" print(longest_repeating_substring("abcdefg")) # Output: ""

🚀 Advanced Algorithms

Advanced

Z-Algorithm - The Prefix Master

The Z-algorithm creates an array where Z[i] tells you the length of the longest substring starting from position i that matches a prefix of the string.

Z-Algorithm - Linear Time Magic! ✨
def z_algorithm(s): """Build Z array in linear time""" n = len(s) z = [0] * n l, r = 0, 0 # Z-box boundaries for i in range(1, n): if i <= r: # We're inside a Z-box, use previous values z[i] = min(r - i + 1, z[i - l]) # Try to extend the match while i + z[i] < n and s[z[i]] == s[i + z[i]]: z[i] += 1 # Update Z-box if we found a longer match if i + z[i] - 1 > r: l, r = i, i + z[i] - 1 return z def z_pattern_search(text, pattern): """Use Z-algorithm for pattern matching""" # Concatenate pattern + separator + text combined = pattern + "$" + text z = z_algorithm(combined) matches = [] pattern_len = len(pattern) # Check where Z value equals pattern length for i in range(pattern_len + 1, len(combined)): if z[i] == pattern_len: matches.append(i - pattern_len - 1) return matches
Advanced

Aho-Corasick - Multiple Pattern Search Champion

When you need to find multiple patterns at once (like virus signatures or banned words), Aho-Corasick builds a trie of all patterns and searches them simultaneously!

Aho-Corasick - Find All Patterns at Once! 🎯
class TrieNode: def __init__(self): self.children = {} self.is_end = False self.patterns = [] # Patterns ending at this node def find_multiple_patterns(text, patterns): """Simple version of Aho-Corasick for multiple patterns""" # Build trie from patterns root = TrieNode() for pattern in patterns: node = root for char in pattern: if char not in node.children: node.children[char] = TrieNode() node = node.children[char] node.is_end = True node.patterns.append(pattern) # Search all patterns simultaneously results = {pattern: [] for pattern in patterns} for start in range(len(text)): node = root for i in range(start, len(text)): if text[i] not in node.children: break node = node.children[text[i]] if node.is_end: for pattern in node.patterns: results[pattern].append(start) return results # Find multiple patterns in one pass! text = "ushers" patterns = ["he", "she", "hers"] matches = find_multiple_patterns(text, patterns) print(f"Found: {matches}") # Output: {'he': [2], 'she': [1], 'hers': [2]}

📖 Quick Reference

Algorithm Comparison Chart

Algorithm Time Complexity When to Use Plain English Speed
Naive O(n × m) Short patterns, simple cases 🐢 Slow but simple
KMP O(n + m) Need guaranteed linear time ⚡ Fast and consistent
Rabin-Karp O(n + m) average Multiple pattern search 🎲 Usually fast
Boyer-Moore O(n/m) best case Long patterns, large alphabet 🚀 Super fast for long patterns
Z-Algorithm O(n + m) Need prefix matches ⚡ Linear and elegant
Aho-Corasick O(n + m + z) Multiple patterns at once 💫 Find everything in one pass

Use Naive When...

  • Pattern is very short (< 10 chars)
  • Text is small
  • Simplicity matters most
  • One-time search

Use KMP When...

  • Pattern has repetitive parts
  • Need guaranteed O(n) time
  • Streaming data
  • Can't go backwards

Use Boyer-Moore When...

  • Pattern is long
  • Large alphabet (like English)
  • Speed is critical
  • Text is very long

Use Rabin-Karp When...

  • Searching multiple patterns
  • Pattern matching in 2D
  • Plagiarism detection
  • Rolling hash useful

Common Pitfalls to Avoid

❌ Pitfall #1: Hash Collisions in Rabin-Karp

Always double-check matches character by character when hashes match!

❌ Pitfall #2: Off-by-One in Loop Bounds

Remember: range(len(text) - len(pattern) + 1) not just - len(pattern)

❌ Pitfall #3: Not Handling Empty Patterns

Always check if pattern is empty before searching!