Week 9: Fine-tuning and Alignment
RLHF, instruction tuning, and making models useful
Time Estimate: 3-4 hours
Topics Covered
- Supervised fine-tuning (SFT) vs RLHF
- Reward modeling and preference learning
- Constitutional AI and safety alignment
- InstructGPT and ChatGPT training process
- LoRA and parameter-efficient fine-tuning
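On the last topic: LoRA freezes the pretrained weight matrix and learns only two low-rank factors, so the trainable-parameter savings follow from simple arithmetic. A quick sketch (the 768-dimensional layer and rank 8 are illustrative choices, not values prescribed by the lab):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters when a (d_out x d_in) weight is frozen and
    updated only through low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)

# Example: one 768x768 projection (GPT-2-small sized) at rank r=8.
full = 768 * 768                           # 589,824 params if fully fine-tuned
lora = lora_trainable_params(768, 768, 8)  # 12,288 params with LoRA
print(f"LoRA trains {lora / full:.1%} of the full update")  # ~2.1%
```

Because the base weights never change, multiple task-specific LoRA adapters can be swapped in and out of the same frozen model.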
Featured Speaker
Sam Altman
CEO, OpenAI
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Fine-tune with RLHF & DPO
Difficulty: Intermediate · Time: 3 hours
Objective
Fine-tune a small language model using supervised fine-tuning and Direct Preference Optimization.
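DPO learns directly from paired comparisons rather than a separate reward model, so the core data unit is a prompt with one preferred and one dispreferred completion. A hypothetical record (the field names prompt/chosen/rejected are the common convention; check the starter repo's dataset for its exact schema):

```python
# One hypothetical preference record of the kind DPO training expects:
# a prompt plus a preferred ("chosen") and dispreferred ("rejected") answer.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": (
        "Overfitting is when a model memorizes its training data "
        "and fails to generalize to new examples."
    ),
    "rejected": "Overfitting is good because the model gets a low loss.",
}
assert set(example) == {"prompt", "chosen", "rejected"}
```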
Prerequisites
- PyTorch and Hugging Face Transformers
- Understanding of RLHF concepts
- Google Colab with GPU
- Hugging Face account
Setup Instructions
- Install the TRL library: pip install trl transformers datasets
- Log into Hugging Face: huggingface-cli login
- Clone the starter repo: git clone https://github.com/stanford-cs153/rlhf-lab
Tasks
- Supervised fine-tune GPT-2 on instruction dataset
- Prepare preference dataset (chosen/rejected pairs)
- Implement DPO training loop with TRL
- Compare aligned vs unaligned model outputs
- Measure helpfulness improvement qualitatively
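TRL's DPOTrainer handles the training loop in the tasks above, but the objective it optimizes is simple enough to sketch by hand. A minimal per-pair version, assuming you already have summed log-probabilities of each response under the policy and the frozen reference model (the numeric log-probs below are made up for illustration):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where each argument is a total log-prob of a response."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # == -log sigmoid(margin)

# Untrained policy (identical to the reference): margin 0, loss = log 2.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931
# A policy that has shifted toward the chosen response gets a lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0) < math.log(2))  # True
```

Note how the loss depends only on how much the policy moves relative to the reference, with beta controlling how strongly deviations from the reference are rewarded; this is what lets DPO skip explicit reward modeling.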