Problem Statement & Requirements
Why Feature Stores Matter
Training-serving skew — when features computed during training differ from those computed at serving time — is one of the most common causes of ML model failures in production. Uber built Michelangelo's feature store after reportedly finding that roughly 60% of its ML bugs were feature-related. A feature store attacks this class of bugs by computing each feature once and serving the same values to both training and inference.
Think of a feature store like a prep kitchen in a restaurant. Instead of each chef (data scientist) individually chopping vegetables (computing features) from scratch, the prep kitchen prepares standardized ingredients that any chef can use. This ensures consistency, saves time, and prevents mistakes.
Functional Requirements
- Feature registration — Define features with schemas, descriptions, and ownership
- Online serving — Low-latency lookup of feature values for real-time inference
- Offline serving — Batch retrieval of historical features for training datasets
- Point-in-time joins — Retrieve features as they were at a specific timestamp
- Feature transformation — Define and execute batch + streaming transformations
- Discovery & reuse — Search and browse available features across teams
Non-Functional Requirements
- Online latency — <10ms p99 for feature vector retrieval
- Consistency — Training and serving features must be identical
- Scale — Millions of entities, thousands of features, millions QPS
- Freshness — Streaming features updated within seconds
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Total features defined | 5,000-10,000 |
| Entity types | Users, Items, Sessions, Merchants |
| Online store entities | 100M users × 200 features = 20B values |
| Online serving QPS | 500K-2M |
| Feature vector size | ~2 KB per entity (200 features × ~10 bytes each) |
| Online store size | 100M × 2 KB = ~200 GB (fits in Redis cluster) |
| Offline store size | ~50 TB (years of historical data) |
| Batch materialization | Hourly for slow features, real-time for streaming |
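The table's sizing arithmetic can be checked directly (a quick sanity check; the ~10-byte average value size is the table's own assumption):

```python
users = 100_000_000          # 100M online entities
features_per_user = 200
bytes_per_value = 10         # assumed average encoded feature value size

vector_bytes = features_per_user * bytes_per_value   # 2,000 B ≈ 2 KB per entity
total_values = users * features_per_user             # 20 billion stored values
store_bytes = users * vector_bytes                   # 2e11 B = 200 GB

print(vector_bytes, total_values, store_bytes)
```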
System API Design
# Register a feature
POST /api/v1/features
{
  "name": "user_purchase_count_30d",
  "entity": "user",
  "dtype": "int64",
  "description": "Number of purchases in last 30 days",
  "source": { "type": "batch", "query": "SELECT user_id, COUNT(*) ..." },
  "owner": "fraud-team"
}
# Online: Get features for an entity (low-latency)
GET /api/v1/features/online?entity=user&id=user_123
&features=purchase_count_30d,avg_order_value,account_age_days
# Offline: Generate training dataset with point-in-time join
POST /api/v1/features/offline
{
  "entity": "user",
  "features": ["purchase_count_30d", "avg_order_value"],
  "entity_df": "s3://data/training_entities.parquet",
  "timestamp_column": "event_timestamp"
}
Data Model
CREATE TABLE feature_definitions (
    feature_name  VARCHAR PRIMARY KEY,
    entity_type   VARCHAR,
    dtype         VARCHAR,
    description   TEXT,
    source_type   VARCHAR,  -- batch, streaming, on-demand
    transform_sql TEXT,
    owner         VARCHAR,
    tags          TEXT[],
    created_at    TIMESTAMP
);
-- Offline store (Parquet/Delta Lake, partitioned by date)
-- Schema: entity_id, feature_name, value, event_timestamp
-- Online store (Redis/DynamoDB)
-- Key: {entity_type}:{entity_id}
-- Value: {feature_name: value, ...} (hash map)
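The online key/hash layout above can be sketched in memory (a plain dict stands in for Redis, and `get_online_features` is a hypothetical helper, not a real client API):

```python
# Online store emulated as {entity_key: {feature_name: value}}
online_store = {
    "user:user_123": {
        "purchase_count_30d": 14,
        "avg_order_value": 52.3,
        "account_age_days": 910,
    }
}

def get_online_features(entity_type, entity_id, features):
    """Single-key lookup, then project the requested fields (like Redis HMGET)."""
    row = online_store.get(f"{entity_type}:{entity_id}", {})
    return {f: row.get(f) for f in features}

print(get_online_features("user", "user_123", ["purchase_count_30d", "avg_order_value"]))
```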
CREATE TABLE materialization_jobs (
    job_id       VARCHAR PRIMARY KEY,
    feature_name VARCHAR,
    status       VARCHAR,
    started_at   TIMESTAMP,
    records      BIGINT,
    latency_ms   FLOAT
);
High-Level Architecture
Feature Registry
Central catalog of all features with metadata: schema, description, owner, lineage, freshness SLA. Enables discovery and prevents duplicate work across teams.
Transformation Engine
Executes feature computations. Batch transforms run on Spark/SQL (hourly/daily). Streaming transforms run on Flink/Spark Streaming for real-time features. Same transformation code is used for both training and serving.
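One way to get the "same code for training and serving" property is to express the transform as a pure function and call it from both paths — a sketch in plain Python (a real system would invoke this inside Spark for batch and Flink for streaming):

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_timestamps, as_of):
    """Pure transform: count purchases in the 30 days before `as_of`."""
    cutoff = as_of - timedelta(days=30)
    return sum(1 for ts in purchase_timestamps if cutoff <= ts < as_of)

history = [datetime(2024, 1, d) for d in (2, 10, 25)]

# Batch path: compute the feature for a historical training example
train_value = purchase_count_30d(history, as_of=datetime(2024, 1, 20))  # sees Jan 2, Jan 10

# Serving path: same function, current timestamp
serve_value = purchase_count_30d(history, as_of=datetime(2024, 2, 20))  # sees only Jan 25
```

Because both paths share one definition, a logic change cannot silently diverge between training and inference.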
Online Store
Low-latency key-value store (Redis, DynamoDB) for serving features during inference. Materialized from batch/streaming pipelines. Optimized for single-entity lookups.
Offline Store
Columnar storage (Parquet/Delta Lake) for historical feature values with timestamps. Used for training dataset generation with point-in-time joins. Optimized for bulk reads.
Deep Dive: Core Components
Point-in-Time Correctness
Critical: Preventing Data Leakage
When building training datasets, you must use feature values as they were at the time of each training example. If a user made a purchase at 10:00 AM and you are predicting fraud for a transaction at 9:00 AM, you must NOT include the 10:00 AM purchase in the feature. Using future data is called "data leakage" and produces models that look great in training but fail in production.
import pandas as pd

def point_in_time_join(entity_df, feature_df):
    """Join each entity row to the latest feature value at or before its timestamp."""
    # merge_asof requires both frames sorted by the "on" key
    feature_df = feature_df.sort_values("event_timestamp")
    # For each entity row, find the latest feature value
    # BEFORE (or at) the entity's timestamp
    result = pd.merge_asof(
        entity_df.sort_values("event_timestamp"),
        feature_df,
        on="event_timestamp",
        by="entity_id",          # match within the same entity
        direction="backward",    # only look at past values
    )
    return result
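A tiny demonstration of the backward as-of join on hypothetical data, with the scored transaction at 09:00 and a feature update landing at 10:00:

```python
import pandas as pd

entity_df = pd.DataFrame({
    "entity_id": ["u1"],
    # The transaction we want to score happened at 09:00
    "event_timestamp": pd.to_datetime(["2024-01-01 09:00"]),
})
feature_df = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 10:00"]),
    "purchase_count_30d": [3, 4],
})

result = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="entity_id",
    direction="backward",
)
# The 10:00 update is invisible at 09:00 — only the 08:00 value joins
print(result["purchase_count_30d"].iloc[0])
```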
Online vs. Offline Store Tradeoffs
| Aspect | Online Store | Offline Store |
|---|---|---|
| Latency | <10ms | Seconds-minutes |
| Access pattern | Single entity lookup | Bulk scan/join |
| Storage | Redis / DynamoDB | Parquet / Delta Lake |
| Data | Latest values only | Full history with timestamps |
| Cost | High (in-memory) | Low (object storage) |
Feature Freshness: Batch vs. Streaming
Not all features need real-time updates. Categorize by freshness requirements:
- Static features (account_age, country): Update daily or on change
- Slow-moving (30-day purchase count): Batch update hourly
- Fast-moving (session click count): Stream with seconds-level freshness
- On-demand (real-time price): Compute at request time, don't store
Scaling & Optimization
Online Store at Scale
For millions of QPS: shard Redis by entity_id (consistent hashing), use read replicas, compress feature values (delta encoding for time-series), and batch multi-feature lookups into single MGET calls.
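The shard-then-batch read path can be sketched as follows (hash-mod placement stands in for consistent hashing, and dicts stand in for Redis nodes):

```python
import hashlib

NUM_SHARDS = 4
# Each shard emulated as a dict; in production these are Redis nodes
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(entity_key):
    # Simplified: hash-mod placement. A real cluster uses consistent hashing
    # so that adding a node remaps only ~1/N of the keys.
    digest = hashlib.sha256(entity_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def write(entity_key, features):
    shards[shard_for(entity_key)][entity_key] = features

def batch_read(entity_keys):
    # Group keys by shard so each node receives one batched request (like MGET)
    by_shard = {}
    for key in entity_keys:
        by_shard.setdefault(shard_for(key), []).append(key)
    out = {}
    for shard_id, keys in by_shard.items():
        for key in keys:
            out[key] = shards[shard_id].get(key)
    return out

write("user:u1", {"purchase_count_30d": 3})
write("user:u2", {"purchase_count_30d": 7})
print(batch_read(["user:u1", "user:u2"]))
```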
Efficient Materialization
- Incremental updates: Only recompute features for entities that changed
- Backfill: When adding a new feature, compute historical values in parallel
- Deduplication: Skip materialization if feature values haven't changed
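The deduplication step above can be sketched as a write-skipping filter (`writes_needed` is a hypothetical helper that compares freshly computed values against what the online store already holds):

```python
def writes_needed(computed, online_store):
    """Return only rows whose value actually changed (skip no-op writes)."""
    return {
        key: value
        for key, value in computed.items()
        if online_store.get(key) != value
    }

online = {"user:u1": 3, "user:u2": 7}
computed = {"user:u1": 3, "user:u2": 8, "user:u3": 1}

print(writes_needed(computed, online))  # only u2 (changed) and u3 (new)
```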
Practice Problems
Practice 1: Cross-Entity Features
You need a feature "average spending of the user's city in the last 7 days." This requires joining user entities with city-level aggregations. Design the computation and serving architecture.
Practice 2: Feature Versioning
The fraud team changes the definition of "high_risk_score" from a simple threshold to an ML-derived score. Old models still depend on the original definition. Design a versioning strategy.
Practice 3: Streaming Feature Consistency
Your streaming feature "transaction_count_1h" uses a sliding window. Due to late-arriving events, the online value sometimes differs from the offline value by 5-10%. Design a reconciliation mechanism.
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Online Store | Redis / DynamoDB | Low-latency feature serving |
| Offline Store | Delta Lake / Parquet on S3 | Historical features for training |
| Batch Compute | Spark / dbt | Hourly/daily feature transforms |
| Stream Compute | Flink / Spark Streaming | Real-time feature updates |
| Registry | Feast / Tecton / Hopsworks | Feature catalog and discovery |
| Orchestration | Airflow / Dagster | Materialization scheduling |
Key Takeaways
- Point-in-time correctness is the most critical property — prevents data leakage
- Separate online (low-latency, latest values) from offline (historical, bulk access) stores
- Use the same transformation code for training and serving to prevent skew
- Categorize features by freshness needs to optimize cost vs. latency
- A feature registry enables discovery and prevents duplicate work across teams