Week 4: Training Infrastructure at Scale
How to train models with thousands of GPUs across multiple data centers
Time Estimate: 3-4 hours
Topics Covered
- Gradient synchronization strategies
- Mixed precision training and loss scaling
- Checkpointing and fault tolerance
- Tesla's data engine and training pipeline
- End-to-end neural network training
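The loss-scaling idea from the mixed-precision topic can be sketched in plain Python. This is a toy illustration, not course code: it uses the standard library's `struct` half-precision format to mimic fp16 storage, and the scale factor 2**14 is an illustrative static choice (real trainers typically adjust the scale dynamically).

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision ('e' format) to mimic
    # storing a value in an fp16 tensor.
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                  # a small fp32 gradient
print(to_fp16(grad))         # underflows to 0.0 in fp16

# Loss scaling: multiply the loss (and hence every gradient) by a large
# constant before the backward pass, then divide it back out in fp32.
scale = 2.0 ** 14                    # illustrative static scale
scaled_grad = to_fp16(grad * scale)  # now representable in fp16
recovered = scaled_grad / scale      # unscale in fp32
print(recovered)                     # close to the original 1e-8
```

The same two steps (scale before backward, unscale before the optimizer step) are what frameworks automate for you in mixed-precision training.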
Featured Speaker
Ashok Elluswamy
VP Autopilot Software, Tesla
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
CUDA Kernels & GPU Profiling
Difficulty: Advanced | Duration: 4 hours
Objective
Write CUDA kernels from scratch, understand GPU memory hierarchy, and profile neural network operations.
Prerequisites
- C/C++ programming
- CUDA toolkit installed (or use Colab)
- PyTorch knowledge
- Linux/Unix terminal skills
Setup Instructions
- Use Google Colab with GPU runtime (free T4 GPU)
- Install CUDA toolkit: !apt-get install nvidia-cuda-toolkit
- Clone starter repo: git clone https://github.com/stanford-cs153/cuda-lab
- Install nvprof and Nsight Compute
Tasks
- Implement naive matrix multiplication CUDA kernel
- Optimize with shared memory tiling
- Compare your kernel's performance against cuBLAS
- Profile a PyTorch ResNet forward pass with nvprof (or Nsight Systems/Compute, which replace the deprecated nvprof on recent CUDA versions)
- Identify memory-bound vs compute-bound operations
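The tiling idea behind the first two tasks can be previewed in pure Python before writing any CUDA. This is a CPU sketch, not the lab's kernel: the blocked loop nest is the analogue of staging T x T tiles of A and B in CUDA shared memory so each element is loaded once per tile rather than once per output element. Function names and the tile size are illustrative.

```python
import random

def matmul_naive(A, B):
    # Straightforward triple loop: each output element re-reads a full
    # row of A and column of B from "global memory".
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

def matmul_tiled(A, B, T=4):
    # Blocked version: process T x T tiles, the CPU analogue of a
    # shared-memory-tiled CUDA kernel. Edge tiles are clipped with min().
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, m, T):
            for pp in range(0, k, T):          # loop over tiles of the k dim
                for i in range(ii, min(ii + T, n)):
                    for p in range(pp, min(pp + T, k)):
                        a = A[i][p]            # reused across the whole j tile
                        for j in range(jj, min(jj + T, m)):
                            C[i][j] += a * B[p][j]
    return C

# Sanity check on deliberately non-square, non-tile-aligned shapes.
random.seed(0)
A = [[random.random() for _ in range(6)] for _ in range(5)]
B = [[random.random() for _ in range(7)] for _ in range(6)]
C1, C2 = matmul_naive(A, B), matmul_tiled(A, B)
print(all(abs(x - y) < 1e-9
          for r1, r2 in zip(C1, C2) for x, y in zip(r1, r2)))
```

In the CUDA version the two inner loops become a thread block, and the tile reuse that the `a = A[i][p]` hoist hints at is what shared memory provides on the GPU.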