Built the team’s first feature store to solve train-serve skew, ensuring ML models see the same feature values in training and in production.
Architecture
- Offline store — Parquet files on S3, computed via Airflow batch jobs
- Online store — Redis cluster with sub-10ms reads for real-time serving
- Feature registry — centralized catalog with lineage, ownership, and freshness SLAs
- SDK — Python client for consistent feature retrieval in notebooks, training, and serving
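As a rough illustration of how one client can serve notebooks, training, and serving from the same feature definitions, here is a minimal sketch. It is not the actual SDK: `FeatureClient`, `InMemoryBackend`, `get_features`, and the feature names are hypothetical, and the in-memory backend stands in for both the Parquet-on-S3 offline store and the Redis online store.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class FeatureBackend(Protocol):
    """Anything that can return named feature values for one entity."""

    def read(self, entity_id: str, feature_names: List[str]) -> Dict[str, float]: ...


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for a real store (Parquet on S3 or Redis)."""

    rows: Dict[str, Dict[str, float]]

    def read(self, entity_id: str, feature_names: List[str]) -> Dict[str, float]:
        row = self.rows.get(entity_id, {})
        missing = [name for name in feature_names if name not in row]
        if missing:
            # Fail loudly rather than silently serving partial feature vectors.
            raise KeyError(f"missing features for {entity_id}: {missing}")
        return {name: row[name] for name in feature_names}


class FeatureClient:
    """Single entry point, so training and serving request features the same way."""

    def __init__(self, offline: FeatureBackend, online: FeatureBackend) -> None:
        self._backends = {"offline": offline, "online": online}

    def get_features(
        self, entity_id: str, feature_names: List[str], mode: str
    ) -> Dict[str, float]:
        if mode not in self._backends:
            raise ValueError(f"unknown mode: {mode!r}")
        return self._backends[mode].read(entity_id, feature_names)


if __name__ == "__main__":
    # Illustrative data only; both stores are loaded from the same computed rows.
    rows = {"user_1": {"avg_order_value": 42.5, "orders_7d": 3.0}}
    client = FeatureClient(InMemoryBackend(rows), InMemoryBackend(rows))
    names = ["avg_order_value", "orders_7d"]
    training = client.get_features("user_1", names, mode="offline")
    serving = client.get_features("user_1", names, mode="online")
    print(training == serving)
```

Because both code paths go through the same call with the same feature names, a mismatch between training and serving values can only come from the stores themselves, which narrows where skew can hide.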
Key Design Decisions
- Chose Redis over DynamoDB for online store — 3x lower p99 latency at our scale
- Parquet over Delta Lake for offline — simpler, team already familiar, good enough for batch
- Built a custom registry instead of adopting Feast, which did not fit our schema requirements
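To make the registry idea concrete, the sketch below shows one way a catalog entry could carry ownership, lineage, and a freshness SLA. The real registry schema is not described here, so every name in this example (`FeatureEntry`, `FeatureRegistry`, `is_stale`, the field names, and the sample values) is an assumption for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple


@dataclass(frozen=True)
class FeatureEntry:
    """Hypothetical catalog record for one feature."""

    name: str
    owner: str                  # team accountable for the feature
    upstream: Tuple[str, ...]   # lineage: sources the feature is derived from
    freshness_sla: timedelta    # maximum allowed age of the latest computation


class FeatureRegistry:
    """In-memory catalog; a real registry would persist entries somewhere."""

    def __init__(self) -> None:
        self._entries: Dict[str, FeatureEntry] = {}

    def register(self, entry: FeatureEntry) -> None:
        if entry.name in self._entries:
            # Duplicate names are rejected so teams reuse instead of redefining.
            raise ValueError(f"feature already registered: {entry.name}")
        self._entries[entry.name] = entry

    def get(self, name: str) -> FeatureEntry:
        return self._entries[name]

    def is_stale(self, name: str, last_computed: datetime, now: datetime) -> bool:
        """True when the feature is older than its freshness SLA allows."""
        return now - last_computed > self._entries[name].freshness_sla


if __name__ == "__main__":
    registry = FeatureRegistry()
    registry.register(
        FeatureEntry(
            name="orders_7d",
            owner="growth-team",
            upstream=("raw.orders",),
            freshness_sla=timedelta(hours=24),
        )
    )
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    last_run = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    print(registry.is_stale("orders_7d", last_run, now))
```

Rejecting duplicate names at registration time is one simple way a catalog can nudge teams toward reusing an existing feature rather than recomputing it.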
Impact
Eliminated train-serve skew for all production models. Feature reuse across teams rose from 0% to 60%, cutting the cost of duplicate feature computation by ~$2K/month.