Design a Feature Store

Hard 25 min read

Problem Statement & Requirements

Why Feature Stores Matter

A leading cause of ML model failures in production is training-serving skew — when features computed during training differ from those computed at serving time. Uber built Michelangelo's feature store after discovering that 60% of its ML bugs were feature-related. A feature store eliminates this class of bugs by computing each feature once and serving the same values to both training and inference.

Think of a feature store like a prep kitchen in a restaurant. Instead of each chef (data scientist) individually chopping vegetables (computing features) from scratch, the prep kitchen prepares standardized ingredients that any chef can use. This ensures consistency, saves time, and prevents mistakes.

Functional Requirements

  • Register features with schema, description, ownership, and lineage metadata
  • Serve feature vectors at low latency for online inference
  • Generate point-in-time-correct training datasets from historical values
  • Compute features from batch, streaming, and on-demand sources
  • Support feature discovery and reuse across teams

Non-Functional Requirements

  • Latency: single-digit-millisecond online reads at 500K-2M QPS
  • Consistency: identical feature values between training and serving
  • Freshness: seconds for streaming features, hours for batch features
  • Scale: tens of billions of feature values across 100M+ entities

Back-of-Envelope Estimation

Parameter                  Estimate
---------                  --------
Total features defined     5,000-10,000
Entity types               Users, Items, Sessions, Merchants
Online store values        100M users × 200 features = 20B values
Online serving QPS         500K-2M
Feature vector size        ~2 KB per entity (200 features × 10 B avg)
Online store size          100M × 2 KB = ~200 GB (fits in a Redis cluster)
Offline store size         ~50 TB (years of historical data)
Materialization cadence    Hourly for slow features, real-time for streaming
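The sizing numbers above follow from simple arithmetic; a quick sanity check:

```python
users = 100_000_000          # 100M users
features_per_user = 200
bytes_per_feature = 10       # ~10 B average value size

total_values = users * features_per_user              # online store entries
vector_bytes = features_per_user * bytes_per_feature  # per-entity feature vector
online_store_bytes = users * vector_bytes             # total online footprint

print(total_values)        # 20 billion feature values
print(vector_bytes)        # ~2 KB per entity
print(online_store_bytes)  # ~200 GB total
```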

System API Design

Feature Store APIs
# Register a feature
POST /api/v1/features
{
  "name": "purchase_count_30d",
  "entity": "user",
  "dtype": "int64",
  "description": "Number of purchases in last 30 days",
  "source": { "type": "batch", "query": "SELECT user_id, COUNT(*) ..." },
  "owner": "fraud-team"
}

# Online: Get features for an entity (low-latency)
GET /api/v1/features/online?entity=user&id=user_123
    &features=purchase_count_30d,avg_order_value,account_age_days

# Offline: Generate training dataset with point-in-time join
POST /api/v1/features/offline
{
  "entity": "user",
  "features": ["purchase_count_30d", "avg_order_value"],
  "entity_df": "s3://data/training_entities.parquet",
  "timestamp_column": "event_timestamp"
}

Data Model

Core Schema
CREATE TABLE feature_definitions (
    feature_name  VARCHAR PRIMARY KEY,
    entity_type   VARCHAR,
    dtype         VARCHAR,
    description   TEXT,
    source_type   VARCHAR,  -- batch, streaming, on-demand
    transform_sql TEXT,
    owner         VARCHAR,
    tags          TEXT[],
    created_at    TIMESTAMP
);
-- Offline store (Parquet/Delta Lake, partitioned by date)
-- Schema: entity_id, feature_name, value, event_timestamp

-- Online store (Redis/DynamoDB)
-- Key: {entity_type}:{entity_id}
-- Value: {feature_name: value, ...} (hash map)

CREATE TABLE materialization_jobs (
    job_id        VARCHAR PRIMARY KEY,
    feature_name  VARCHAR,
    status        VARCHAR,
    started_at    TIMESTAMP,
    records       BIGINT,
    latency_ms    FLOAT
);

High-Level Architecture

Feature Registry

Central catalog of all features with metadata: schema, description, owner, lineage, freshness SLA. Enables discovery and prevents duplicate work across teams.

Transformation Engine

Executes feature computations. Batch transforms run on Spark/SQL (hourly/daily). Streaming transforms run on Flink/Spark Streaming for real-time features. Same transformation code is used for both training and serving.
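One way to make "same transformation code" concrete is a pure function that both the batch job and the stream consumer call. The function below is an illustrative sketch, not any specific framework's API:

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_times, as_of):
    """Count purchases in the 30 days ending at `as_of`.

    The batch job calls this per user over a warehouse extract;
    the stream consumer calls it for one user when a new event arrives.
    Sharing one function keeps training and serving values identical.
    """
    cutoff = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff < t <= as_of)

history = [datetime(2024, 1, 1), datetime(2024, 2, 1)]
print(purchase_count_30d(history, datetime(2024, 2, 15)))  # 1
```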

Online Store

Low-latency key-value store (Redis, DynamoDB) for serving features during inference. Materialized from batch/streaming pipelines. Optimized for single-entity lookups.
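A minimal sketch of the key/value layout, with a plain dict standing in for Redis. In production these would be HSET/HMGET calls against a Redis hash; put_features and get_features are illustrative names, not a real client API:

```python
online_store = {}  # stand-in for a Redis cluster

def put_features(entity_type, entity_id, features):
    # Key: "{entity_type}:{entity_id}", value: hash of feature name -> value
    online_store[f"{entity_type}:{entity_id}"] = dict(features)

def get_features(entity_type, entity_id, names):
    # Single-entity lookup; missing features come back as None
    row = online_store.get(f"{entity_type}:{entity_id}", {})
    return {name: row.get(name) for name in names}

put_features("user", "user_123",
             {"purchase_count_30d": 7, "avg_order_value": 42.5})
print(get_features("user", "user_123", ["purchase_count_30d"]))
```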

Offline Store

Columnar storage (Parquet/Delta Lake) for historical feature values with timestamps. Used for training dataset generation with point-in-time joins. Optimized for bulk reads.

Deep Dive: Core Components

Point-in-Time Correctness

Critical: Preventing Data Leakage

When building training datasets, you must use feature values as they were at the time of each training example. If a user made a purchase at 10:00 AM and you are predicting fraud for a transaction at 9:00 AM, you must NOT include the 10:00 AM purchase in the feature. Using future data is called "data leakage" and produces models that look great in training but fail in production.

Point-in-Time Join
import pandas as pd

def point_in_time_join(entity_df, feature_df):
    """Join features as-of each entity's timestamp."""
    # merge_asof requires both frames sorted by the join key
    feature_df = feature_df.sort_values("event_timestamp")

    # For each entity row, take the latest feature value
    # at or before the entity's timestamp
    result = pd.merge_asof(
        entity_df.sort_values("event_timestamp"),
        feature_df,
        on="event_timestamp",
        by="entity_id",        # match rows of the same entity
        direction="backward",  # never look at future values
    )
    return result
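A toy example of the as-of join: the entity row at 9:00 AM must pick up the 8:00 AM feature value, never the 10:00 AM one. Column names and values here are illustrative:

```python
import pandas as pd

entity_df = pd.DataFrame({
    "entity_id": ["u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01 09:00"]),
})
feature_df = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01 08:00",
                                       "2024-01-01 10:00"]),
    "purchase_count_30d": [5, 6],
})

result = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="entity_id",
    direction="backward",
)
print(result["purchase_count_30d"].iloc[0])  # 5: the future 10:00 value never leaks
```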

Online vs. Offline Store Tradeoffs

Aspect           Online Store           Offline Store
------           ------------           -------------
Latency          <10 ms                 Seconds-minutes
Access pattern   Single-entity lookup   Bulk scan/join
Storage          Redis / DynamoDB       Parquet / Delta Lake
Data             Latest values only     Full history with timestamps
Cost             High (in-memory)       Low (object storage)

Feature Freshness: Batch vs. Streaming

Not all features need real-time updates. Categorize by freshness requirement:

  • Real-time (seconds): fraud signals and session activity, computed by streaming pipelines (Flink / Spark Streaming)
  • Near-real-time (minutes to hours): engagement counters, materialized by hourly micro-batch jobs
  • Slow-moving (daily or longer): demographics and lifetime aggregates, computed by daily Spark/SQL jobs

Scaling & Optimization

Online Store at Scale

For millions of QPS: shard Redis by entity_id (consistent hashing), use read replicas, compress feature values (delta encoding for time-series), and batch multi-feature lookups into single MGET calls.
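A sketch of the routing logic: hash each entity to a shard, then group a multi-entity request so each shard receives one batched (MGET-style) call. Plain modulo hashing is used here for brevity; Redis Cluster actually assigns keys to 16,384 hash slots via CRC16, and consistent hashing avoids mass key remapping when shards are added or removed:

```python
import hashlib

NUM_SHARDS = 16

def shard_for(entity_id: str) -> int:
    # Stable hash so the same entity always routes to the same shard
    digest = hashlib.md5(entity_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def group_by_shard(entity_ids):
    """Group a multi-entity lookup into one batched call per shard."""
    groups = {}
    for eid in entity_ids:
        groups.setdefault(shard_for(eid), []).append(eid)
    return groups

batches = group_by_shard(["user_123", "user_456", "user_789"])
# each value list becomes a single MGET/HMGET to that shard
```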

Efficient Materialization

Materialize incrementally: recompute only entities whose source data changed since the last run, write only the changed values to the online store, and partition the offline store by date so each job reads a single partition. Record status and row counts in the materialization_jobs table so stalled or failed jobs can be detected and retried.
Practice Problems

Practice 1: Cross-Entity Features

You need a feature "average spending of the user's city in the last 7 days." This requires joining user entities with city-level aggregations. Design the computation and serving architecture.

Practice 2: Feature Versioning

The fraud team changes the definition of "high_risk_score" from a simple threshold to an ML-derived score. Old models still depend on the original definition. Design a versioning strategy.

Practice 3: Streaming Feature Consistency

Your streaming feature "transaction_count_1h" uses a sliding window. Due to late-arriving events, the online value sometimes differs from the offline value by 5-10%. Design a reconciliation mechanism.

Quick Reference

Component        Technology                   Purpose
---------        ----------                   -------
Online Store     Redis / DynamoDB             Low-latency feature serving
Offline Store    Delta Lake / Parquet on S3   Historical features for training
Batch Compute    Spark / dbt                  Hourly/daily feature transforms
Stream Compute   Flink / Spark Streaming      Real-time feature updates
Registry         Feast / Tecton / Hopsworks   Feature catalog and discovery
Orchestration    Airflow / Dagster            Materialization scheduling

Key Takeaways

  • Point-in-time correctness is the most critical property — prevents data leakage
  • Separate online (low-latency, latest values) from offline (historical, bulk access) stores
  • Use the same transformation code for training and serving to prevent skew
  • Categorize features by freshness needs to optimize cost vs. latency
  • A feature registry enables discovery and prevents duplicate work across teams