Week 9: Fine-tuning and Alignment
RLHF, instruction tuning, and making models useful
Time Estimate: 3-4 hours
Topics Covered
- Supervised fine-tuning (SFT) vs RLHF
- Reward modeling and preference learning
- Constitutional AI and safety alignment
- InstructGPT and ChatGPT training process
- LoRA and parameter-efficient fine-tuning
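On the last topic: LoRA freezes the pretrained weight matrix and learns only two low-rank factors, so the trainable-parameter savings follow from simple arithmetic. A quick sketch (the 768-dimensional layer and rank 8 are illustrative choices, not values prescribed by the lab):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters when a (d_out x d_in) weight is frozen and
    updated only through low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)

# Example: one 768x768 projection (GPT-2-small sized) at rank r=8.
full = 768 * 768                           # 589,824 params if fully fine-tuned
lora = lora_trainable_params(768, 768, 8)  # 12,288 params with LoRA
print(f"LoRA trains {lora / full:.1%} of the full update")  # ~2.1%
```

Because the base weights never change, multiple task-specific LoRA adapters can be swapped in and out of the same frozen model.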
Featured Speaker
Sam Altman
CEO, OpenAI
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Fine-tune with RLHF & DPO
Difficulty: Intermediate · Time: 3 hours
Objective
Fine-tune a small language model using supervised fine-tuning and Direct Preference Optimization.
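DPO learns directly from paired comparisons rather than a separate reward model, so the core data unit is a prompt with one preferred and one dispreferred completion. A hypothetical record (the field names prompt/chosen/rejected are the common convention; check the starter repo's dataset for its exact schema):

```python
# One hypothetical preference record of the kind DPO training expects:
# a prompt plus a preferred ("chosen") and dispreferred ("rejected") answer.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": (
        "Overfitting is when a model memorizes its training data "
        "and fails to generalize to new examples."
    ),
    "rejected": "Overfitting is good because the model gets a low loss.",
}
assert set(example) == {"prompt", "chosen", "rejected"}
```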
Prerequisites
- PyTorch and Hugging Face Transformers
- Understanding of RLHF concepts
- Google Colab with GPU
- Hugging Face account
Setup Instructions
- Install the TRL library: pip install trl transformers datasets
- Log into Hugging Face: huggingface-cli login
- Clone the starter repo: git clone https://github.com/stanford-cs153/rlhf-lab
Tasks
- Supervised fine-tune GPT-2 on instruction dataset
- Prepare preference dataset (chosen/rejected pairs)
- Implement DPO training loop with TRL
- Compare aligned vs unaligned model outputs
- Measure helpfulness improvement qualitatively
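TRL's DPOTrainer handles the training loop in the tasks above, but the objective it optimizes is simple enough to sketch by hand. A minimal per-pair version, assuming you already have summed log-probabilities of each response under the policy and the frozen reference model (the numeric log-probs below are made up for illustration):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where each argument is a total log-prob of a response."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # == -log sigmoid(margin)

# Untrained policy (identical to the reference): margin 0, loss = log 2.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931
# A policy that has shifted toward the chosen response gets a lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0) < math.log(2))  # True
```

Note how the loss depends only on how much the policy moves relative to the reference, with beta controlling how strongly deviations from the reference are rewarded; this is what lets DPO skip explicit reward modeling.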