Week 3: Distributed Systems Fundamentals

The software infrastructure that coordinates thousands of GPUs

Time Estimate: 3-4 hours

Topics Covered

Featured Speaker

Satya Nadella

CEO, Microsoft

Learn from industry leaders who are building the future of AI infrastructure and applications.

Video Resources

📹 Video content to be added.

Videos include keynotes, technical talks, and tutorials from industry leaders.

Reading Materials

📚 Reading list to be added.

Research papers, blog posts, and technical documentation.

🛠️ Hands-On Lab

Cloud GPU Benchmarking

Level: Intermediate, Time Estimate: 3 hours

Objective

Provision GPU instances across cloud providers, benchmark performance, and understand NCCL for distributed communication.
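
To make "all-reduce" concrete before touching real hardware, the sketch below simulates a sum all-reduce over a ring of ranks in plain Python. Ring all-reduce is one common way to implement this collective; the code is only an illustration of the data movement, not NCCL's implementation, and the function name `ring_all_reduce` and the requirement that the vector length divide evenly by the number of ranks are assumptions made for this example.

```python
from typing import List


def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce across len(buffers) ranks arranged in a ring.

    Each inner list is one rank's local vector. For simplicity this sketch
    assumes equal-length vectors whose length is divisible by the rank count.
    """
    world = len(buffers)
    n = len(buffers[0])
    if any(len(b) != n for b in buffers) or n % world != 0:
        raise ValueError("vectors must be equal length, divisible by world size")
    chunk = n // world
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated

    def span(c: int) -> range:
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod world
    # to its right neighbour, which adds it into its own copy of that chunk.
    for step in range(world - 1):
        sends = [((r - step) % world, None) for r in range(world)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for k, i in enumerate(span(c)):
                data[dst][i] += payload[k]

    # Now rank r holds the fully reduced chunk (r + 1) mod world.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) mod world
    # to its right neighbour, which overwrites its stale copy.
    for step in range(world - 1):
        sends = []
        for r in range(world):
            c = (r + 1 - step) % world
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for k, i in enumerate(span(c)):
                data[dst][i] = payload[k]

    return data
```

Each rank sends and receives 2 × (world − 1) chunks in total, which is why the amount of data moved per rank stays roughly constant as more ranks are added.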

Prerequisites

  • Python 3.8+
  • PyTorch basics
  • Cloud provider account (AWS/GCP; GPU instances are generally not covered by free-tier allowances, so budget for some paid usage)
  • SSH and terminal familiarity

Setup Instructions

  1. Create AWS and GCP accounts (apply any available credits; new accounts usually need to request GPU quota before launching GPU instances)
  2. Install PyTorch: pip install torch torchvision
  3. Get the NCCL tests: git clone https://github.com/NVIDIA/nccl-tests, then build them with make (requires CUDA and NCCL on the machine)
  4. Configure cloud CLI tools (aws-cli, gcloud)
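
Once the setup steps are done, a quick check that the expected command-line tools are on the PATH can save time later. The sketch below is one way to do that with the standard library only. The executable names in `REQUIRED_TOOLS` are assumptions for this lab (`aws` for aws-cli, `gcloud` for the Google Cloud CLI, `nvidia-smi` for the NVIDIA driver utility, which is expected only on the GPU instance itself), and `check_tools` is a hypothetical helper name.

```python
import shutil
from typing import Dict, Iterable

# Executable names assumed for this lab; adjust to match your installation.
REQUIRED_TOOLS = ("git", "aws", "gcloud", "nvidia-smi")


def check_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, bool]:
    """Return a mapping of tool name to whether it was found on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_tools().items():
        print(f"{tool:12s} {'found' if found else 'MISSING'}")
```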

Tasks

  1. Provision T4 GPU instances on AWS and GCP
  2. Run matrix multiplication benchmarks (1024x1024, 4096x4096)
  3. Measure inter-node latency with NCCL all-reduce
  4. Compare cost per TFLOPS across providers (hourly instance price divided by measured TFLOPS)
  5. Document findings in a performance report
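
The matrix multiplication and cost comparison tasks can be sketched as a small timing harness. This is a minimal sketch rather than a reference benchmark: the helper names (`matmul_flops`, `measure_tflops`, `cost_per_tflops_hour`, `make_torch_matmul`) are hypothetical, the FLOP count uses the usual approximation of 2·n³ operations for an n×n by n×n multiply, and the iteration counts are arbitrary defaults. PyTorch is imported lazily so that the arithmetic helpers work even where it is not installed.

```python
import time
from typing import Callable


def matmul_flops(n: int) -> int:
    """Approximate floating-point operations for one n x n by n x n multiply."""
    return 2 * n ** 3


def measure_tflops(run_once: Callable[[], None], n: int,
                   iters: int = 10, warmup: int = 3) -> float:
    """Time `iters` calls of run_once and convert to TFLOPS for size n."""
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return matmul_flops(n) * iters / elapsed / 1e12


def cost_per_tflops_hour(hourly_price_usd: float, tflops: float) -> float:
    """Hourly instance price divided by measured throughput in TFLOPS."""
    return hourly_price_usd / tflops


def make_torch_matmul(n: int) -> Callable[[], None]:
    """Build a callable that performs one n x n matmul with PyTorch."""
    import torch  # lazy import: only needed when actually benchmarking

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    def run_once() -> None:
        torch.matmul(a, b)
        if device == "cuda":
            # GPU kernels launch asynchronously; wait so the timing is real.
            torch.cuda.synchronize()

    return run_once
```

On a provisioned instance, `measure_tflops(make_torch_matmul(4096), 4096)` gives a throughput figure, and passing it to `cost_per_tflops_hour` together with the hourly price you look up for that instance yields the number to compare across providers.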

Resources