Week 4: Training Infrastructure at Scale

How to train models with thousands of GPUs across multiple data centers

Time Estimate: 3-4 hours

Featured Speaker

Ashok Elluswamy

VP Autopilot Software, Tesla

Learn from industry leaders who are building the future of AI infrastructure and applications.

Video Resources

Videos include keynotes, technical talks, and tutorials from industry leaders.

Reading Materials

Research papers, blog posts, and technical documentation.

🛠️ Hands-On Lab

CUDA Kernels & GPU Profiling

Level: Advanced | Time: 4 hours

Objective

Write CUDA kernels from scratch, understand GPU memory hierarchy, and profile neural network operations.

Prerequisites

  • C/C++ programming
  • CUDA toolkit installed (or use Colab)
  • PyTorch knowledge
  • Linux/Unix terminal skills

Setup Instructions

  1. Use Google Colab with GPU runtime (free T4 GPU)
  2. Install the CUDA toolkit if nvcc is missing: !apt-get install -y nvidia-cuda-toolkit (recent Colab GPU runtimes ship with nvcc preinstalled)
  3. Clone starter repo: git clone https://github.com/stanford-cs153/cuda-lab
  4. Verify the profilers: nvprof ships with the CUDA toolkit; Nsight Compute (ncu) is its modern replacement

Tasks

  1. Implement naive matrix multiplication CUDA kernel
  2. Optimize with shared memory tiling
  3. Compare your kernel vs cuBLAS performance
  4. Profile a PyTorch ResNet forward pass with nvprof
  5. Identify memory-bound vs compute-bound operations
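Tasks 1 and 2 can be sketched as the pair of kernels below. This is a minimal illustration, not the lab's reference solution: the kernel names and the TILE size are illustrative, and the tiled version assumes square N×N matrices with N a multiple of TILE to keep the bounds logic short.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width; assumes N % TILE == 0

// Task 1: naive kernel. One thread computes one output element, and every
// operand is loaded from global memory, so each row of A and column of B
// is re-read N times per block.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Task 2: shared-memory tiling. Each block stages a TILE x TILE sub-tile of
// A and B into on-chip shared memory, cutting global-memory traffic by
// roughly a factor of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    C[row * N + col] = acc;
}
```

A typical launch uses one thread per output element, e.g. `dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE); matmul_tiled<<<grid, block>>>(dA, dB, dC, N);`. For task 3, time the same problem size through cuBLAS (`cublasSgemm`) and compare; for tasks 4-5, the profiler's memory-throughput vs. compute-utilization counters show which kernels are memory-bound.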

Resources