Week 3: Distributed Systems Fundamentals
The software infrastructure that coordinates thousands of GPUs
Time Estimate: 3-4 hours
Topics Covered
- Data parallelism vs model parallelism vs pipeline parallelism
- Cluster management systems (Kubernetes, Slurm)
- High-speed networking requirements for distributed training
- Google Borg and cluster orchestration
- Fault tolerance and recovery strategies
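The first topic above is easiest to see in code. Below is a minimal pure-Python sketch of data parallelism (no PyTorch, simulated workers, hypothetical helper names): each "worker" computes a gradient on its own shard of the batch, the gradients are averaged (the job NCCL's all-reduce does on real clusters), and every worker applies the identical update.

```python
# Toy model: scalar linear regression y ≈ w*x with mean-squared-error loss.
# All names here (local_gradient, all_reduce_mean, data_parallel_step) are
# illustrative, not from any library.

def local_gradient(w, shard):
    """Gradient of MSE loss computed on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Average gradients across workers (what all-reduce + divide does)."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr):
    grads = [local_gradient(w, s) for s in shards]  # each worker, own shard
    g = all_reduce_mean(grads)                      # synchronize gradients
    return w - lr * g                               # same update everywhere

# Batch of (x, y) pairs with true w = 2.0, split across 4 "workers".
data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 2))  # → 2.0
```

Model parallelism would instead split the *parameters* across workers, and pipeline parallelism splits the *layers*, passing activations between stages; the synchronization pattern is what differs.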
Featured Speaker
Satya Nadella
CEO, Microsoft
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
📹 Video content coming soon.
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
📚 Reading list coming soon.
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Cloud GPU Benchmarking
Difficulty: Intermediate · Duration: 3 hours
Objective
Provision GPU instances across cloud providers, benchmark performance, and understand NCCL for distributed communication.
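When you run the nccl-tests all-reduce benchmark later in this lab, it reports two numbers: algorithm bandwidth (bytes moved divided by time) and bus bandwidth. The sketch below shows how to compute both from a raw timing, assuming the ring all-reduce rescaling factor of 2·(n−1)/n described in the nccl-tests performance notes; the function name is our own, not part of any library.

```python
def allreduce_bandwidths(bytes_per_rank, n_ranks, seconds):
    """Algorithm and bus bandwidth (GB/s) for an all-reduce timing.

    busbw rescales algbw by 2*(n-1)/n — the data volume a ring
    all-reduce actually moves per link — so results stay comparable
    across different rank counts.
    """
    algbw = bytes_per_rank / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Example: a 1 GiB all-reduce across 8 GPUs finishing in 25 ms.
algbw, busbw = allreduce_bandwidths(2**30, 8, 0.025)
print(f"algbw: {algbw:.1f} GB/s, busbw: {busbw:.1f} GB/s")
```

Comparing busbw against the link's advertised bandwidth (e.g. NVLink or the node's NIC) tells you how efficiently the collective is using the interconnect.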
Prerequisites
- Python 3.8+
- PyTorch basics
- Cloud provider account (AWS/GCP free tier OK)
- SSH and terminal familiarity
Setup Instructions
- Create AWS and GCP accounts (use free tier credits)
- Install PyTorch: `pip install torch torchvision`
- Install NCCL tests: `git clone https://github.com/NVIDIA/nccl-tests`
- Configure cloud CLI tools (aws-cli, gcloud)
Tasks
- Provision T4 GPU instances on AWS and GCP
- Run matrix multiplication benchmarks (1024x1024, 4096x4096)
- Measure inter-node latency with NCCL all-reduce
- Compare cost per TFLOP across providers
- Document findings in a performance report
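For tasks 2 and 4, a sketch of the benchmarking methodology helps before you touch a GPU. The snippet below times matrix multiplications and converts the result to GFLOP/s (an n×n matmul costs 2·n³ FLOPs), then derives a cost-per-TFLOP-second figure. It uses NumPy so it runs anywhere; on your cloud instances, swap in `torch` tensors on `cuda` and remember to synchronize before stopping the timer. The price and peak-TFLOPS numbers at the bottom are placeholders — substitute the figures you actually measure and your provider's current pricing.

```python
import time
import numpy as np

def matmul_gflops(n, repeats=3):
    """Best-of-`repeats` GFLOP/s for an n x n float32 matmul."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up run (BLAS thread pool spin-up, caches)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2 * n**3 / best / 1e9

def cost_per_tflop_second(hourly_price_usd, tflops):
    """USD per TFLOP-second of sustained compute at the measured rate."""
    return hourly_price_usd / 3600 / tflops

# Scale sizes up to 4096x4096 on the actual instances.
for n in (1024, 2048):
    print(f"{n}x{n}: {matmul_gflops(n):.1f} GFLOP/s")

# Placeholder inputs: a hypothetical $0.50/hr instance sustaining 8 TFLOPS.
print(f"${cost_per_tflop_second(0.50, 8.0):.8f} per TFLOP-second")
```

For the report, record the matrix size, dtype, measured GFLOP/s, and instance price together — cost per TFLOP is only meaningful alongside the workload and precision it was measured at.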