Week 4: Training Infrastructure at Scale

How to train models with thousands of GPUs across multiple data centers

Time Estimate: 3-4 hours

Featured Speaker

Ashok Elluswamy

VP Autopilot Software, Tesla

Learn from industry leaders who are building the future of AI infrastructure and applications.

Video Resources

Videos include keynotes, technical talks, and tutorials from industry leaders.

Reading Materials

Research papers, blog posts, and technical documentation.

🛠️ Hands-On Lab

CUDA Kernels & GPU Profiling

Level: Advanced | Time: 4 hours

Objective

Write CUDA kernels from scratch, understand GPU memory hierarchy, and profile neural network operations.

Prerequisites

  • C/C++ programming
  • CUDA toolkit installed (or use Colab)
  • PyTorch knowledge
  • Linux/Unix terminal skills

Setup Instructions

  1. Use Google Colab with GPU runtime (free T4 GPU)
  2. Install the CUDA toolkit if nvcc is missing: !apt-get install -y nvidia-cuda-toolkit (recent Colab GPU runtimes ship with nvcc preinstalled)
  3. Clone starter repo: git clone https://github.com/stanford-cs153/cuda-lab
  4. Verify the profilers: nvprof ships with the CUDA toolkit; Nsight Compute (ncu) is its modern replacement

Tasks

  1. Implement naive matrix multiplication CUDA kernel
  2. Optimize with shared memory tiling
  3. Compare your kernel vs cuBLAS performance
  4. Profile a PyTorch ResNet forward pass with nvprof
  5. Identify memory-bound vs compute-bound operations
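Tasks 1 and 2 can be sketched as the pair of kernels below. This is a minimal illustration, not the lab's reference solution: the kernel names and the TILE size are illustrative, and the tiled version assumes square N×N matrices with N a multiple of TILE to keep the bounds logic short.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width; assumes N % TILE == 0

// Task 1: naive kernel. One thread computes one output element, and every
// operand is loaded from global memory, so each row of A and column of B
// is re-read N times per block.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Task 2: shared-memory tiling. Each block stages a TILE x TILE sub-tile of
// A and B into on-chip shared memory, cutting global-memory traffic by
// roughly a factor of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    C[row * N + col] = acc;
}
```

A typical launch uses one thread per output element, e.g. `dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE); matmul_tiled<<<grid, block>>>(dA, dB, dC, N);`. For task 3, time the same problem size through cuBLAS (`cublasSgemm`) and compare; for tasks 4-5, the profiler's memory-throughput vs. compute-utilization counters show which kernels are memory-bound.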

Resources