Reinforcement Learning from Human Feedback
Studying the full alignment pipeline: reward modeling from pairwise preferences, PPO vs DPO tradeoffs, constitutional AI, and open problems around reward hacking and scalable oversight. DPO is simpler and more stable in most cases because it skips the separate reward model and the RL loop, while PPO gives finer control over the optimization. In practice, the bottleneck is usually preference data quality.
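To make the reward modeling and DPO objectives concrete, here is a minimal sketch of both losses for a single preference pair, using only the standard library. The function names, the scalar log-probability inputs, and the default `beta` are illustrative assumptions, not taken from any particular library; real training code would work on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one pair, given sequence log-probabilities.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)),
    so no separate reward model is needed.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference model the margin is zero and the DPO loss is log 2; it falls as the policy raises the chosen response's log-probability relative to the rejected one.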
Mixture of Experts Architectures
How sparse activation scales model capacity without proportional compute cost. Working through Switch Transformers, ST-MoE, Mixtral, and DeepSeek-MoE. Key insight: expert load balancing is the critical implementation challenge, and sparse models need fundamentally different serving infrastructure.
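As a rough illustration of why load balancing matters, the sketch below implements top-k routing for one token and an auxiliary balancing term in the style of the Switch Transformer loss (number of experts times the sum over experts of dispatch fraction times mean router probability). It is plain Python for readability; the function names and the list-of-lists input format are assumptions made for this example, and the scaling coefficient usually applied to the auxiliary term is left out.

```python
import math


def softmax(logits):
    """Softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the router probabilities renormalized over the chosen experts.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def load_balance_loss(batch_logits):
    """Auxiliary term: n_experts * sum_i f_i * P_i.

    f_i is the fraction of tokens whose top-1 expert is i, and P_i is the
    mean router probability assigned to expert i across the batch.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        probs = softmax(logits)
        top1 = max(range(n_experts), key=lambda i: probs[i])
        counts[top1] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    return n_experts * sum(
        (counts[i] / n_tokens) * (prob_sums[i] / n_tokens)
        for i in range(n_experts)
    )
```

With tokens spread evenly across experts the term is 1; it grows as routing collapses onto a few experts, which is the failure mode the auxiliary loss penalizes.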
Rust for High-Performance Inference
Python is ML’s lingua franca, but inference serving is a systems problem. Building lightweight servers with the Tokio async runtime, ONNX Runtime Rust bindings, and zero-copy tensor handling, targeting sub-millisecond serving overhead in the cases where Python’s GIL is the bottleneck.