Design a Feature Store

Hard 25 min read

Problem Statement & Requirements

Why Feature Stores Matter

A leading cause of ML model failures in production is training-serving skew — when features computed during training differ from those computed at serving time. Uber built Michelangelo's feature store after discovering that 60% of its ML bugs were feature-related. A feature store eliminates this class of bugs by computing each feature once and serving the same values to both training and inference.

Think of a feature store like a prep kitchen in a restaurant. Instead of each chef (data scientist) individually chopping vegetables (computing features) from scratch, the prep kitchen prepares standardized ingredients that any chef can use. This ensures consistency, saves time, and prevents mistakes.

Functional Requirements

  • Register features with schema, description, ownership, and lineage metadata
  • Serve feature vectors at low latency for online inference
  • Generate point-in-time-correct training datasets from historical values
  • Compute features from batch, streaming, and on-demand sources
  • Support feature discovery and reuse across teams

Non-Functional Requirements

  • Latency: single-digit-millisecond online reads at 500K-2M QPS
  • Consistency: identical feature values between training and serving
  • Freshness: seconds for streaming features, hours for batch features
  • Scale: tens of billions of feature values across 100M+ entities

Back-of-Envelope Estimation

Parameter                  Estimate
---------                  --------
Total features defined     5,000-10,000
Entity types               Users, Items, Sessions, Merchants
Online store values        100M users × 200 features = 20B values
Online serving QPS         500K-2M
Feature vector size        ~2 KB per entity (200 features × 10 B avg)
Online store size          100M × 2 KB = ~200 GB (fits in a Redis cluster)
Offline store size         ~50 TB (years of historical data)
Materialization cadence    Hourly for slow features, real-time for streaming
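The sizing numbers above follow from simple arithmetic; a quick sanity check:

```python
users = 100_000_000          # 100M users
features_per_user = 200
bytes_per_feature = 10       # ~10 B average value size

total_values = users * features_per_user              # online store entries
vector_bytes = features_per_user * bytes_per_feature  # per-entity feature vector
online_store_bytes = users * vector_bytes             # total online footprint

print(total_values)        # 20 billion feature values
print(vector_bytes)        # ~2 KB per entity
print(online_store_bytes)  # ~200 GB total
```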

System API Design

Feature Store APIs
# Register a feature
POST /api/v1/features
{
  "name": "purchase_count_30d",
  "entity": "user",
  "dtype": "int64",
  "description": "Number of purchases in last 30 days",
  "source": { "type": "batch", "query": "SELECT user_id, COUNT(*) ..." },
  "owner": "fraud-team"
}

# Online: Get features for an entity (low-latency)
GET /api/v1/features/online?entity=user&id=user_123
    &features=purchase_count_30d,avg_order_value,account_age_days

# Offline: Generate training dataset with point-in-time join
POST /api/v1/features/offline
{
  "entity": "user",
  "features": ["purchase_count_30d", "avg_order_value"],
  "entity_df": "s3://data/training_entities.parquet",
  "timestamp_column": "event_timestamp"
}

Data Model

Core Schema
CREATE TABLE feature_definitions (
    feature_name  VARCHAR PRIMARY KEY,
    entity_type   VARCHAR,
    dtype         VARCHAR,
    description   TEXT,
    source_type   VARCHAR,  -- batch, streaming, on-demand
    transform_sql TEXT,
    owner         VARCHAR,
    tags          TEXT[],
    created_at    TIMESTAMP
);
-- Offline store (Parquet/Delta Lake, partitioned by date)
-- Schema: entity_id, feature_name, value, event_timestamp

-- Online store (Redis/DynamoDB)
-- Key: {entity_type}:{entity_id}
-- Value: {feature_name: value, ...} (hash map)

CREATE TABLE materialization_jobs (
    job_id        VARCHAR PRIMARY KEY,
    feature_name  VARCHAR,
    status        VARCHAR,
    started_at    TIMESTAMP,
    records       BIGINT,
    latency_ms    FLOAT
);

High-Level Architecture

Feature Registry

Central catalog of all features with metadata: schema, description, owner, lineage, freshness SLA. Enables discovery and prevents duplicate work across teams.

Transformation Engine

Executes feature computations. Batch transforms run on Spark/SQL (hourly/daily). Streaming transforms run on Flink/Spark Streaming for real-time features. Same transformation code is used for both training and serving.
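One way to make "same transformation code" concrete is a pure function that both the batch job and the stream consumer call. The function below is an illustrative sketch, not any specific framework's API:

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_times, as_of):
    """Count purchases in the 30 days ending at `as_of`.

    The batch job calls this per user over a warehouse extract;
    the stream consumer calls it for one user when a new event arrives.
    Sharing one function keeps training and serving values identical.
    """
    cutoff = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff < t <= as_of)

history = [datetime(2024, 1, 1), datetime(2024, 2, 1)]
print(purchase_count_30d(history, datetime(2024, 2, 15)))  # 1
```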

Online Store

Low-latency key-value store (Redis, DynamoDB) for serving features during inference. Materialized from batch/streaming pipelines. Optimized for single-entity lookups.
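A minimal sketch of the key/value layout, with a plain dict standing in for Redis. In production these would be HSET/HMGET calls against a Redis hash; put_features and get_features are illustrative names, not a real client API:

```python
online_store = {}  # stand-in for a Redis cluster

def put_features(entity_type, entity_id, features):
    # Key: "{entity_type}:{entity_id}", value: hash of feature name -> value
    online_store[f"{entity_type}:{entity_id}"] = dict(features)

def get_features(entity_type, entity_id, names):
    # Single-entity lookup; missing features come back as None
    row = online_store.get(f"{entity_type}:{entity_id}", {})
    return {name: row.get(name) for name in names}

put_features("user", "user_123",
             {"purchase_count_30d": 7, "avg_order_value": 42.5})
print(get_features("user", "user_123", ["purchase_count_30d"]))
```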

Offline Store

Columnar storage (Parquet/Delta Lake) for historical feature values with timestamps. Used for training dataset generation with point-in-time joins. Optimized for bulk reads.

Deep Dive: Core Components

Point-in-Time Correctness

Critical: Preventing Data Leakage

When building training datasets, you must use feature values as they were at the time of each training example. If a user made a purchase at 10:00 AM and you are predicting fraud for a transaction at 9:00 AM, you must NOT include the 10:00 AM purchase in the feature. Using future data is called "data leakage" and produces models that look great in training but fail in production.

Point-in-Time Join
import pandas as pd

def point_in_time_join(entity_df, feature_df):
    """Join features as-of each entity's timestamp."""
    # merge_asof requires both frames sorted by the join key
    feature_df = feature_df.sort_values("event_timestamp")

    # For each entity row, take the latest feature value
    # at or before the entity's timestamp
    result = pd.merge_asof(
        entity_df.sort_values("event_timestamp"),
        feature_df,
        on="event_timestamp",
        by="entity_id",        # match rows of the same entity
        direction="backward",  # never look at future values
    )
    return result
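A toy example of the as-of join: the entity row at 9:00 AM must pick up the 8:00 AM feature value, never the 10:00 AM one. Column names and values here are illustrative:

```python
import pandas as pd

entity_df = pd.DataFrame({
    "entity_id": ["u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01 09:00"]),
})
feature_df = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01 08:00",
                                       "2024-01-01 10:00"]),
    "purchase_count_30d": [5, 6],
})

result = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="entity_id",
    direction="backward",
)
print(result["purchase_count_30d"].iloc[0])  # 5: the future 10:00 value never leaks
```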

Online vs. Offline Store Tradeoffs

Aspect           Online Store           Offline Store
------           ------------           -------------
Latency          <10 ms                 Seconds-minutes
Access pattern   Single-entity lookup   Bulk scan/join
Storage          Redis / DynamoDB       Parquet / Delta Lake
Data             Latest values only     Full history with timestamps
Cost             High (in-memory)       Low (object storage)

Feature Freshness: Batch vs. Streaming

Not all features need real-time updates. Categorize by freshness requirement:

  • Real-time (seconds): fraud signals and session activity, computed by streaming pipelines (Flink / Spark Streaming)
  • Near-real-time (minutes to hours): engagement counters, materialized by hourly micro-batch jobs
  • Slow-moving (daily or longer): demographics and lifetime aggregates, computed by daily Spark/SQL jobs

Scaling & Optimization

Online Store at Scale

For millions of QPS: shard Redis by entity_id (consistent hashing), use read replicas, compress feature values (delta encoding for time-series), and batch multi-feature lookups into single MGET calls.
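A sketch of the routing logic: hash each entity to a shard, then group a multi-entity request so each shard receives one batched (MGET-style) call. Plain modulo hashing is used here for brevity; Redis Cluster actually assigns keys to 16,384 hash slots via CRC16, and consistent hashing avoids mass key remapping when shards are added or removed:

```python
import hashlib

NUM_SHARDS = 16

def shard_for(entity_id: str) -> int:
    # Stable hash so the same entity always routes to the same shard
    digest = hashlib.md5(entity_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def group_by_shard(entity_ids):
    """Group a multi-entity lookup into one batched call per shard."""
    groups = {}
    for eid in entity_ids:
        groups.setdefault(shard_for(eid), []).append(eid)
    return groups

batches = group_by_shard(["user_123", "user_456", "user_789"])
# each value list becomes a single MGET/HMGET to that shard
```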

Efficient Materialization

Materialize incrementally: recompute only entities whose source data changed since the last run, write only the changed values to the online store, and partition the offline store by date so each job reads a single partition. Record status and row counts in the materialization_jobs table so stalled or failed jobs can be detected and retried.
Practice Problems

Practice 1: Cross-Entity Features

You need a feature "average spending of the user's city in the last 7 days." This requires joining user entities with city-level aggregations. Design the computation and serving architecture.

Practice 2: Feature Versioning

The fraud team changes the definition of "high_risk_score" from a simple threshold to an ML-derived score. Old models still depend on the original definition. Design a versioning strategy.

Practice 3: Streaming Feature Consistency

Your streaming feature "transaction_count_1h" uses a sliding window. Due to late-arriving events, the online value sometimes differs from the offline value by 5-10%. Design a reconciliation mechanism.

Quick Reference

Component        Technology                   Purpose
---------        ----------                   -------
Online Store     Redis / DynamoDB             Low-latency feature serving
Offline Store    Delta Lake / Parquet on S3   Historical features for training
Batch Compute    Spark / dbt                  Hourly/daily feature transforms
Stream Compute   Flink / Spark Streaming      Real-time feature updates
Registry         Feast / Tecton / Hopsworks   Feature catalog and discovery
Orchestration    Airflow / Dagster            Materialization scheduling

Key Takeaways

  • Point-in-time correctness is the most critical property — prevents data leakage
  • Separate online (low-latency, latest values) from offline (historical, bulk access) stores
  • Use the same transformation code for training and serving to prevent skew
  • Categorize features by freshness needs to optimize cost vs. latency
  • A feature registry enables discovery and prevents duplicate work across teams