Deep Dive: RLHF & Alignment Techniques

Writing

I'm documenting my learning journey through RLHF and alignment techniques, covering the full pipeline from preference data collection through reward model training to policy optimization.

Topics Covered

Key Takeaways So Far

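DPO is simpler and more stable than PPO for most use cases, because it trains directly on preference pairs with a plain classification-style loss and needs no separate reward model or on-policy sampling loop. PPO gives more fine-grained control when you need it, for example over the reward signal and the KL penalty. Either way, the real bottleneck is almost always preference data quality: noisy or inconsistent labels limit what either method can learn.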
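
To make the "simpler" part concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model. The function and parameter names (dpo_loss, policy_chosen_logp, and so on) and the default beta of 0.1 are illustrative assumptions, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss shrinks as the policy favors the chosen response more
    # strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that there is no reward model and no sampling step here, only a loss over logged preference pairs. That is also why data quality matters so much: a mislabeled pair pushes the margin in exactly the wrong direction.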