Week 3: Distributed Systems Fundamentals
The software infrastructure that coordinates thousands of GPUs
Time Estimate: 3-4 hours
Topics Covered
- Data parallelism vs model parallelism vs pipeline parallelism
- Cluster management systems (Kubernetes, Slurm)
- High-speed networking requirements for distributed training
- Google Borg and cluster orchestration
- Fault tolerance and recovery strategies
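The first topic above is easiest to see in code. Below is a minimal pure-Python sketch of data parallelism (no PyTorch, simulated workers, hypothetical helper names): each "worker" computes a gradient on its own shard of the batch, the gradients are averaged (the job NCCL's all-reduce does on real clusters), and every worker applies the identical update.

```python
# Toy model: scalar linear regression y ≈ w*x with mean-squared-error loss.
# All names here (local_gradient, all_reduce_mean, data_parallel_step) are
# illustrative, not from any library.

def local_gradient(w, shard):
    """Gradient of MSE loss computed on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Average gradients across workers (what all-reduce + divide does)."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr):
    grads = [local_gradient(w, s) for s in shards]  # each worker, own shard
    g = all_reduce_mean(grads)                      # synchronize gradients
    return w - lr * g                               # same update everywhere

# Batch of (x, y) pairs with true w = 2.0, split across 4 "workers".
data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 2))  # → 2.0
```

Model parallelism would instead split the *parameters* across workers, and pipeline parallelism splits the *layers*, passing activations between stages; the synchronization pattern is what differs.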
Featured Speaker
Satya Nadella
CEO, Microsoft
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
📹 Video content coming soon.
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
📚 Reading list coming soon.
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Cloud GPU Benchmarking
Difficulty: Intermediate · Duration: 3 hours
Objective
Provision GPU instances across cloud providers, benchmark performance, and understand NCCL for distributed communication.
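When you run the nccl-tests all-reduce benchmark later in this lab, it reports two numbers: algorithm bandwidth (bytes moved divided by time) and bus bandwidth. The sketch below shows how to compute both from a raw timing, assuming the ring all-reduce rescaling factor of 2·(n−1)/n described in the nccl-tests performance notes; the function name is our own, not part of any library.

```python
def allreduce_bandwidths(bytes_per_rank, n_ranks, seconds):
    """Algorithm and bus bandwidth (GB/s) for an all-reduce timing.

    busbw rescales algbw by 2*(n-1)/n — the data volume a ring
    all-reduce actually moves per link — so results stay comparable
    across different rank counts.
    """
    algbw = bytes_per_rank / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Example: a 1 GiB all-reduce across 8 GPUs finishing in 25 ms.
algbw, busbw = allreduce_bandwidths(2**30, 8, 0.025)
print(f"algbw: {algbw:.1f} GB/s, busbw: {busbw:.1f} GB/s")
```

Comparing busbw against the link's advertised bandwidth (e.g. NVLink or the node's NIC) tells you how efficiently the collective is using the interconnect.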
Prerequisites
- Python 3.8+
- PyTorch basics
- Cloud provider account (AWS/GCP free tier OK)
- SSH and terminal familiarity
Setup Instructions
- Create AWS and GCP accounts (use free tier credits)
- Install PyTorch: `pip install torch torchvision`
- Install NCCL tests: `git clone https://github.com/NVIDIA/nccl-tests`
- Configure cloud CLI tools (aws-cli, gcloud)
Tasks
- Provision T4 GPU instances on AWS and GCP
- Run matrix multiplication benchmarks (1024x1024, 4096x4096)
- Measure inter-node latency with NCCL all-reduce
- Compare cost per TFLOP across providers
- Document findings in a performance report
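For tasks 2 and 4, a sketch of the benchmarking methodology helps before you touch a GPU. The snippet below times matrix multiplications and converts the result to GFLOP/s (an n×n matmul costs 2·n³ FLOPs), then derives a cost-per-TFLOP-second figure. It uses NumPy so it runs anywhere; on your cloud instances, swap in `torch` tensors on `cuda` and remember to synchronize before stopping the timer. The price and peak-TFLOPS numbers at the bottom are placeholders — substitute the figures you actually measure and your provider's current pricing.

```python
import time
import numpy as np

def matmul_gflops(n, repeats=3):
    """Best-of-`repeats` GFLOP/s for an n x n float32 matmul."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up run (BLAS thread pool spin-up, caches)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2 * n**3 / best / 1e9

def cost_per_tflop_second(hourly_price_usd, tflops):
    """USD per TFLOP-second of sustained compute at the measured rate."""
    return hourly_price_usd / 3600 / tflops

# Scale sizes up to 4096x4096 on the actual instances.
for n in (1024, 2048):
    print(f"{n}x{n}: {matmul_gflops(n):.1f} GFLOP/s")

# Placeholder inputs: a hypothetical $0.50/hr instance sustaining 8 TFLOPS.
print(f"${cost_per_tflop_second(0.50, 8.0):.8f} per TFLOP-second")
```

For the report, record the matrix size, dtype, measured GFLOP/s, and instance price together — cost per TFLOP is only meaningful alongside the workload and precision it was measured at.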