Design a Real-Time Fraud Detection System

Problem Statement & Requirements

Why Fraud Detection Matters

Global payment fraud exceeds $30 billion per year. Every major payment processor (Stripe, PayPal, Visa) runs real-time fraud detection on every transaction. The system must make a block/allow decision in milliseconds while maintaining an extremely low false positive rate — blocking legitimate transactions costs revenue and customer trust.

Think of fraud detection like airport security with multiple screening layers. The first layer is a quick metal detector (rules engine). The second is an X-ray machine (ML model). Suspicious items get additional manual inspection (human review). Each layer catches different threats, and together they provide defense in depth.

Functional Requirements

Real-time scoring — Score every transaction before authorization
Rule engine — Configurable rules (velocity checks, blocklists, thresholds)
ML model inference — Run trained fraud models on transaction features
Alerting — Flag high-risk transactions for human review
Feedback loop — Incorporate confirmed fraud/legitimate labels for retraining
Case management — Investigation workflow for flagged transactions

Non-Functional Requirements

Decision latency — <100ms for block/allow decision
False positive rate — <0.1% (1 in 1,000 legitimate transactions incorrectly blocked)
Fraud detection rate — >95% of fraudulent transactions caught
Availability — 99.99% (downtime = all transactions auto-approved)

Back-of-Envelope Estimation

Parameter	Estimate
Transactions per second	10,000 (peak: 50,000)
Fraud rate	0.1% (1 in 1,000 transactions)
Features per transaction	200-500
Feature computation budget	<30ms
ML inference budget	<20ms
Total latency budget	<100ms
Historical transactions stored	2 years (~600B transactions)
Model retraining frequency	Daily (full) + hourly (incremental)
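
The estimates above can be sanity-checked with a few lines of arithmetic. The sketch below assumes a sustained average of 10,000 TPS and treats the non-functional targets (0.1% false positive rate, 95% detection rate) as exact values rather than bounds; the function name and the returned keys are illustrative only.

```python
SECONDS_PER_DAY = 86_400

def estimate(avg_tps=10_000, fraud_rate=0.001, fpr=0.001, recall=0.95, years=2):
    """Rough volume and block-precision figures from the stated targets."""
    per_day = avg_tps * SECONDS_PER_DAY
    stored = per_day * 365 * years
    fraud_per_day = per_day * fraud_rate
    legit_per_day = per_day - fraud_per_day
    true_blocks = fraud_per_day * recall      # fraud correctly blocked
    false_blocks = legit_per_day * fpr        # legitimate wrongly blocked
    return {
        "transactions_per_day": per_day,
        "stored_transactions": stored,
        "fraud_per_day": fraud_per_day,
        "false_blocks_per_day": false_blocks,
        "precision_at_targets": true_blocks / (true_blocks + false_blocks),
    }

if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.4f}")
```

Under these assumptions the two-year history comes to roughly 630B transactions, in line with the ~600B figure in the table. Because fraud is so rare, even a 0.1% false positive rate means that about half of all blocked transactions are legitimate, which is why the review queue and per-merchant thresholds matter.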

System API Design

Fraud Detection APIs

# Score a transaction (called by payment gateway)
POST /api/v1/transactions/score
{
  "transaction_id": "txn_abc123",
  "amount": 499.99,
  "currency": "USD",
  "merchant_id": "merch_456",
  "user_id": "user_789",
  "card_hash": "sha256_xxx",
  "ip_address": "203.0.113.42",
  "device_fingerprint": "fp_xyz",
  "timestamp": "2024-01-15T10:30:00Z"
}
# Response (must return in <100ms)
{
  "decision": "allow",  // allow, block, review
  "risk_score": 0.12,
  "triggered_rules": [],
  "model_version": "v4.2"
}

# Submit fraud/legitimate label (feedback loop)
POST /api/v1/transactions/label
{
  "transaction_id": "txn_abc123",
  "label": "fraud",
  "source": "chargeback"
}

# Manage rules
POST /api/v1/rules
{
  "name": "high_velocity_check",
  "condition": "txn_count_1h > 10 AND amount > 500",
  "action": "block"
}
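
A caller of the scoring endpoint will typically validate the response before acting on it. The following is a minimal sketch of such a check, assuming the response body is plain JSON with the fields shown above; `ScoreResponse` and `parse_score_response` are hypothetical names, not part of any real SDK.

```python
import json
from dataclasses import dataclass, field

VALID_DECISIONS = {"allow", "block", "review"}

@dataclass(frozen=True)
class ScoreResponse:
    decision: str
    risk_score: float
    triggered_rules: list = field(default_factory=list)
    model_version: str = ""

def parse_score_response(body: str) -> ScoreResponse:
    """Parse and validate the JSON body returned by the scoring endpoint."""
    data = json.loads(body)
    decision = data["decision"]
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    score = float(data["risk_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"risk_score out of range: {score}")
    return ScoreResponse(
        decision=decision,
        risk_score=score,
        triggered_rules=list(data.get("triggered_rules", [])),
        model_version=data.get("model_version", ""),
    )
```

Rejecting unknown decision values keeps a malformed or truncated response from being silently treated as an approval.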

Data Model

Core Schema

CREATE TABLE transactions (
    txn_id        VARCHAR,
    user_id       VARCHAR,
    merchant_id   VARCHAR,
    amount        DECIMAL(12,2),
    currency      VARCHAR(3),
    risk_score    FLOAT,
    decision      VARCHAR,
    label         VARCHAR,  -- fraud, legitimate, null (unknown)
    features      JSONB,
    timestamp     TIMESTAMP,
    -- PostgreSQL requires the partition key to be part of the primary key
    PRIMARY KEY (txn_id, timestamp)
) PARTITION BY RANGE (timestamp);

CREATE TABLE rules (
    rule_id       VARCHAR PRIMARY KEY,
    name          TEXT,
    condition     TEXT,   -- expression DSL
    action        VARCHAR, -- block, review, score_boost
    enabled       BOOLEAN,
    priority      INT
);

CREATE TABLE alerts (
    alert_id      VARCHAR PRIMARY KEY,
    txn_id        VARCHAR,
    status        VARCHAR,  -- open, investigating, resolved
    assigned_to   VARCHAR,
    resolution    VARCHAR,  -- confirmed_fraud, false_positive
    created_at    TIMESTAMP
);

High-Level Architecture

The system has two paths: a real-time scoring path (<100ms) and an offline learning path (hours to days).

Event Ingestion

Transaction events arrive via Kafka. Each event triggers the scoring pipeline. Events are also stored for offline analysis and model retraining.

Feature Engine

Computes 200-500 features in real-time: user velocity (transactions in last 1h), merchant risk, device reputation, geo-anomaly, amount deviation. Pulls pre-computed features from the feature store and computes session-level features on the fly.

Rule Engine

Evaluates deterministic rules first (blocklists, velocity limits, impossible travel). Fast and interpretable. Rules can be updated instantly without model retraining.
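
One way to picture the rule engine is as an ordered list of predicates over the feature dictionary. The sketch below is an assumption-laden illustration: it uses Python callables as a stand-in for the expression DSL, the `Rule` class and `evaluate_rules` function are hypothetical, and the "block outranks review" policy is one reasonable choice rather than something the design prescribes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # stand-in for the expression DSL
    action: str                        # "block", "review", or "score_boost"
    priority: int = 0
    enabled: bool = True

# Higher number = more severe; actions not listed here never set the outcome.
SEVERITY = {"block": 2, "review": 1}

def evaluate_rules(rules, features):
    """Return (action, triggered_rule_names); action is None if nothing decisive fired."""
    triggered = [
        r for r in sorted(rules, key=lambda r: r.priority)
        if r.enabled and r.condition(features)
    ]
    action = max(
        (r.action for r in triggered),
        key=lambda a: SEVERITY.get(a, 0),
        default=None,
    )
    if SEVERITY.get(action, 0) == 0:
        action = None
    return action, [r.name for r in triggered]

# The high_velocity_check rule from the API section, written as a callable.
high_velocity = Rule(
    name="high_velocity_check",
    condition=lambda f: f["txn_count_1h"] > 10 and f["amount"] > 500,
    action="block",
    priority=1,
)
```

Because rules are plain data plus a predicate, enabling, disabling, or reordering them does not require redeploying the model.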

ML Scoring

Runs the fraud model on the feature vector and outputs a risk score between 0 and 1. The model is an ensemble of gradient boosted trees (fast) and a neural network (accurate). The ensemble score is then combined with the rule engine result to produce the final decision.
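
The text does not say how the two models are combined. A weighted average is one simple option, sketched here under that assumption; the function name and the default weight are illustrative and would normally be tuned on a validation set.

```python
def ensemble_score(gbt_score: float, nn_score: float, gbt_weight: float = 0.6) -> float:
    """Weighted average of two model scores, each expected to lie in [0, 1]."""
    if not 0.0 <= gbt_weight <= 1.0:
        raise ValueError("gbt_weight must be in [0, 1]")
    for name, score in (("gbt_score", gbt_score), ("nn_score", nn_score)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return gbt_weight * gbt_score + (1.0 - gbt_weight) * nn_score
```

Since both inputs are in [0, 1] and the weights sum to 1, the combined score stays in [0, 1] and can be compared against the same thresholds as a single-model score.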

Decision Engine

Combines rule and ML scores. Applies business logic: score <0.3 = allow, 0.3-0.7 = review, >0.7 = block. Thresholds tuned per merchant category and risk appetite.
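
The threshold logic above can be sketched as a small pure function. The default thresholds come from the text; the `digital_goods` category and its values are made up for illustration, and letting a rule outcome override a low ML score is an assumption about how the two signals are merged.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Thresholds:
    review: float = 0.3   # scores at or above this go to review
    block: float = 0.7    # scores strictly above this are blocked

DEFAULT_THRESHOLDS = Thresholds()

# Hypothetical per-category overrides for a stricter risk appetite.
CATEGORY_THRESHOLDS = {
    "digital_goods": Thresholds(review=0.2, block=0.6),
}

def decide(ml_score: float, rule_action: Optional[str] = None,
           category: Optional[str] = None) -> str:
    """Map an ML score and an optional rule outcome to allow/review/block."""
    t = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLDS)
    if rule_action == "block" or ml_score > t.block:
        return "block"
    if rule_action == "review" or ml_score >= t.review:
        return "review"
    return "allow"
```

For example, `decide(0.12)` returns `"allow"`, while the same score with `rule_action="block"` returns `"block"` because a deterministic rule has already fired.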

Deep Dive: Core Components

Real-Time Feature Engineering

Streaming Feature Computation

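class FraudFeatureEngine:
    def __init__(self, redis_client):
        # redis_client is assumed to be a thin wrapper that exposes
        # get_sliding_window(); the standard Redis client has no such method.
        self.redis = redis_client

    def compute_features(self, txn, feature_store):
        user_id = txn["user_id"]
        # Pre-computed features from feature store (<5ms)
        stored = feature_store.get_online(
            entity="user", id=user_id,
            features=["avg_txn_30d", "account_age",
                      "device_count", "country_count_7d",
                      "known_devices", "home_country"]
        )
        # Real-time windowed aggregations (<10ms)
        velocity = self.redis.get_sliding_window(
            f"velocity:{user_id}", window="1h"
        )
        # New users may have no 30-day average yet
        avg_amount = stored.get("avg_txn_30d") or 0.0
        # Country is not part of the scoring request; assume it is resolved
        # upstream (for example from ip_address) and may be missing.
        country = txn.get("country")
        # Derived features
        features = {
            "amount_deviation": (
                txn["amount"] - avg_amount
            ) / max(avg_amount, 1),
            "txn_count_1h": velocity["count"],
            "txn_sum_1h": velocity["sum"],
            "is_new_device": txn["device_fingerprint"]
                not in stored.get("known_devices", []),
            "is_new_country": country is not None
                and country != stored.get("home_country"),
            **stored  # Include all pre-computed features
        }
        return features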
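
The `get_sliding_window` call above hides the windowing logic. As a rough illustration of what such a helper computes, here is an in-memory sliding window that tracks count and sum over the last hour. It is a single-process sketch that assumes events arrive in timestamp order; a production version would keep this state in Redis or Flink, and the class name is hypothetical.

```python
from collections import deque

class SlidingWindow:
    """Count and sum of events within the last `window_seconds`."""

    def __init__(self, window_seconds: int = 3600):
        self.window = window_seconds
        self.events = deque()   # (timestamp_seconds, amount), oldest first
        self.total = 0.0

    def add(self, ts: float, amount: float) -> None:
        self.events.append((ts, amount))
        self.total += amount

    def stats(self, now: float) -> dict:
        # Evict events that have fallen out of the window.
        cutoff = now - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, amount = self.events.popleft()
            self.total -= amount
        return {"count": len(self.events), "sum": self.total}
```

Eviction happens lazily on read, so each event is appended once and removed once, which keeps the amortized cost per transaction constant.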

Rule Engine + ML Hybrid

Why Both Rules AND ML?

Rules are fast, interpretable, and instantly updatable. Use them for known fraud patterns (stolen card lists, impossible travel, velocity limits). ML models catch novel patterns that rules miss. The hybrid approach provides defense in depth: rules for known threats, ML for unknown threats.

Handling Class Imbalance

Only 0.1% of transactions are fraudulent. A model trained on the raw data can reach 99.9% accuracy by always predicting "legitimate" while catching no fraud at all, so accuracy is a misleading metric here. Solutions:

Oversampling: SMOTE generates synthetic fraud examples
Undersampling: Randomly reduce legitimate examples to 10:1 ratio
Cost-sensitive learning: Weight fraud examples 100x higher in loss function
Anomaly detection: Train on legitimate transactions only, flag outliers
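
Two of the options above, undersampling and cost-sensitive weighting, are easy to sketch with the standard library. The code assumes each training row is a dict with a binary `label` key (1 = fraud); the function names are illustrative. The per-example weights it produces are the kind of input that gradient boosting libraries generally accept as sample weights at training time.

```python
import random

def undersample(rows, ratio=10, seed=0):
    """Keep every fraud row and at most `ratio` legitimate rows per fraud row."""
    fraud = [r for r in rows if r["label"] == 1]
    legit = [r for r in rows if r["label"] == 0]
    keep = min(len(legit), ratio * len(fraud))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return fraud + rng.sample(legit, keep)

def sample_weights(labels, fraud_weight=100.0):
    """Per-example weights: fraud examples count `fraud_weight` times as much."""
    return [fraud_weight if y == 1 else 1.0 for y in labels]
```

Undersampling changes the class prior the model sees, so scores from a model trained this way usually need recalibration before they are compared against fixed thresholds.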

Graph-Based Fraud Detection

Fraud rings involve coordinated accounts. Build a transaction graph: nodes are users, merchants, devices, IPs. Edges connect related entities. Use graph algorithms (community detection, PageRank) to identify suspicious clusters sharing devices or addresses.
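
As a simplified stand-in for full community detection, the sketch below groups accounts into connected components through shared devices or IPs using union-find. The input format, a list of `(user_id, shared_entity_id)` pairs, and the `min_size` cutoff are assumptions made for illustration.

```python
from collections import defaultdict

def find_clusters(links, min_size=3):
    """Group users that are connected through shared entities (devices, IPs).

    links: iterable of (user_id, entity_id) pairs.
    Returns a list of user-id sets with at least `min_size` members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag nodes by kind so a user id can never collide with an entity id.
    for user, entity in links:
        union(("user", user), ("entity", entity))

    groups = defaultdict(set)
    for node in list(parent):
        kind, ident = node
        if kind == "user":
            groups[find(node)].add(ident)
    return [members for members in groups.values() if len(members) >= min_size]
```

Connected components will over-merge when an entity is shared legitimately (a corporate NAT IP, for instance), which is one reason production systems move to weighted edges and proper community detection.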

Concept Drift

Fraudsters Adapt

Fraud patterns change constantly. A model trained on last month's data may miss this month's attack vectors. Monitor for concept drift by tracking: (1) feature distribution shifts, (2) model score distribution changes, (3) rising false negative rate. Retrain daily and deploy new models via canary rollout.
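
One common way to quantify a shift in the score distribution is the Population Stability Index (PSI). The text does not prescribe a specific metric, so treat this as one option among several. The sketch assumes scores lie in [0, 1] and uses equal-width bins; the `eps` floor avoids taking the logarithm of zero for empty bins.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a recent score sample."""

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the alerting threshold should be validated against the system's own history.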

Scaling & Optimization

Stream Processing Architecture

Use Kafka for event ingestion and Flink for real-time feature computation. Flink maintains sliding window state for velocity features. Back-pressure handling prevents queue buildup during traffic spikes.

Low-Latency Model Serving

Pre-compile models: Convert to ONNX (or TensorRT when serving on NVIDIA GPUs) for roughly 2-5x faster inference, depending on model and hardware
CPU-optimized models: Use XGBoost/LightGBM for <5ms inference (no GPU needed)
Model caching: Keep hot models in memory, cold models on disk
Parallel scoring: Run rule engine and ML model concurrently, combine results
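
The parallel scoring idea can be sketched with a thread pool and a shared deadline. This is an illustration under several assumptions: `run_rules` and `run_model` are caller-supplied callables, the 50 ms budget is arbitrary, and falling back to a neutral score when the model is late is a policy choice, not something the design mandates. Threads only help here if the model call releases the GIL (native inference code or a network call).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=8)

def score_concurrently(features, run_rules, run_model,
                       timeout_s=0.05, fallback_score=0.5):
    """Run rules and model in parallel; fall back if the model misses the deadline."""
    start = time.monotonic()
    rules_future = _POOL.submit(run_rules, features)
    model_future = _POOL.submit(run_model, features)

    # Rules are expected to be fast; a timeout here propagates to the caller.
    rule_action = rules_future.result(timeout=timeout_s)

    # Give the model only what is left of the shared budget.
    remaining = max(0.0, timeout_s - (time.monotonic() - start))
    try:
        ml_score = model_future.result(timeout=remaining)
    except FutureTimeout:
        ml_score = fallback_score
    return rule_action, ml_score
```

With this shape the end-to-end latency is bounded by the slower of the two calls or the deadline, instead of their sum.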

Feedback Loop Latency

Label Source	Delay	Volume
Manual review	Minutes-hours	~1% of transactions
Chargebacks	30-90 days	~0.1% of transactions
User reports	Hours-days	~0.05%
Auto-confirmed legitimate	7 days (no dispute)	~99%
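
Because labels arrive from several sources at different delays, the training pipeline needs a rule for picking the current label of a transaction. The sketch below assumes a precedence order (chargeback over manual review over user report) and the 7-day auto-confirmation window from the table; the precedence and all names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Higher number wins when sources disagree (assumed ordering).
SOURCE_PRIORITY = {"chargeback": 3, "manual_review": 2, "user_report": 1}
AUTO_CONFIRM_AFTER = timedelta(days=7)

def resolve_label(txn_time, labels, now):
    """Return (label, source) for a transaction, or (None, None) if still unknown.

    labels: list of (source, label) tuples received so far.
    """
    if labels:
        source, label = max(labels, key=lambda sl: SOURCE_PRIORITY.get(sl[0], 0))
        return label, source
    if now - txn_time >= AUTO_CONFIRM_AFTER:
        return "legitimate", "auto_confirmed"
    return None, None
```

Since chargebacks can arrive 30-90 days after the transaction, an auto-confirmed "legitimate" label may later flip to "fraud", so retraining jobs should re-resolve labels rather than cache them.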

Practice Problems

Practice 1: New Merchant Onboarding

A new merchant joins your platform with zero transaction history. Your ML model has no merchant-level features. Design a cold-start strategy that provides fraud protection without excessive false positives.

Practice 2: Coordinated Attack

You detect 500 small transactions ($1-5) from different accounts hitting the same merchant within 10 minutes — a card testing attack. Design a detection mechanism that catches this pattern in real-time.

Practice 3: Regional Compliance

EU regulations (PSD2/SCA) require different fraud thresholds than US markets. Design a system that applies region-specific rules and models while sharing global fraud signals.

Quick Reference

Component	Technology	Purpose
Event Streaming	Kafka	Transaction ingestion
Stream Processing	Flink / Kafka Streams	Real-time feature computation
Feature Store	Redis / Feast	Low-latency feature serving
Rule Engine	Drools / Custom DSL	Deterministic fraud rules
ML Model	XGBoost / LightGBM	Fraud scoring (<5ms inference)
Graph Analysis	Neo4j / TigerGraph	Fraud ring detection
Case Management	Custom / SaaS	Human review workflow

Key Takeaways

Use a layered approach: rules for known patterns, ML for novel fraud
Real-time features (velocity, device, geo) are among the strongest fraud signals
Handle class imbalance with cost-sensitive learning or oversampling
Monitor for concept drift and retrain models daily
Design for <100ms latency — use CPU-optimized models, not GPU
Graph analysis catches coordinated fraud that individual scoring misses