Designed and deployed a centralized ML experiment tracking platform that brought reproducibility and visibility to the team’s ML workflows.
Architecture
- MLflow Tracking Server — ECS-hosted with PostgreSQL metadata store
- Artifact Store — S3 with lifecycle policies for cost management
- Model Registry — versioned model artifacts with stage transitions (Staging → Production)
- Auth — OIDC integration with company SSO
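As an illustration of the artifact store's lifecycle policies, here is a minimal sketch of a rule that tiers older artifacts to cheaper storage and eventually expires them. The prefix, rule ID, and day counts are assumptions chosen for the example, not values from the actual deployment.

```python
import json


def build_artifact_lifecycle(prefix: str = "mlflow-artifacts/",
                             ia_after_days: int = 30,
                             expire_after_days: int = 365) -> dict:
    """Return an S3 lifecycle configuration as a plain dict.

    All defaults are hypothetical; tune them to the team's retention needs.
    """
    # S3 does not allow a transition to Standard-IA before 30 days.
    if ia_after_days < 30:
        raise ValueError("ia_after_days must be at least 30")
    if expire_after_days <= ia_after_days:
        raise ValueError("expire_after_days must exceed ia_after_days")
    return {
        "Rules": [
            {
                "ID": "mlflow-artifact-tiering",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move artifacts to infrequent-access storage once they age.
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
                ],
                # Delete artifacts after the retention window.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_artifact_lifecycle(), indent=2))
```

A dict of this shape can be passed as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`, or translated into the equivalent infrastructure-as-code resource.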
Key Features
- Auto-logging for PyTorch, scikit-learn, and Hugging Face training runs
- Custom MLflow plugins for GPU utilization and cost tracking
- Slack notifications on model promotion events
- Grafana dashboards for experiment trends and compute usage
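The Slack notification on model promotion could look roughly like the sketch below, which assumes a Slack incoming webhook and uses only the standard library. The function names, message wording, and the way promotion events are detected are assumptions; the source does not describe the actual mechanism.

```python
import json
import urllib.request


def promotion_message(model_name: str, version: int,
                      from_stage: str, to_stage: str) -> dict:
    """Build a Slack incoming-webhook payload for a stage transition."""
    return {
        "text": (
            f"Model {model_name} v{version} was promoted "
            f"from {from_stage} to {to_stage}."
        )
    }


def notify_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,  # supplied by the caller, e.g. from a secret store
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # "churn-classifier" is a hypothetical model name used for illustration.
    print(promotion_message("churn-classifier", 7, "Staging", "Production"))
```

Keeping message construction separate from the HTTP call lets the payload be unit-tested without network access.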