Deep Dive: RLHF & Alignment Techniques

Documenting my learning journey through RLHF and alignment techniques. Covering the full pipeline from preference data collection to reward model training to policy optimization.

Topics Covered

Reward modeling from pairwise human preferences
PPO vs DPO — tradeoffs in practice
Constitutional AI and self-supervised alignment
Open questions: reward hacking, goodharting, scalable oversight

Key Takeaways So Far

DPO is simpler and more stable than PPO for most use cases, but PPO gives more fine-grained control when you need it. The real bottleneck is always preference data quality.