Week 10: Deployment and Inference Optimization
Serving models in production at scale
Time Estimate: 3-4 hours
Topics Covered
- KV caching and PagedAttention (vLLM)
- Quantization techniques (INT8, INT4, GPTQ, AWQ)
- Batching strategies and request scheduling
- TensorRT-LLM and inference optimization
- Monitoring and debugging production systems
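To build intuition for the first topic, here is a toy sketch (plain Python, no framework) of why a KV cache makes incremental decoding cheap: at each step only the new token's key/value is appended to the cache, and past keys/values are never recomputed. Systems like vLLM add paging on top of this idea; the function and variable names below are illustrative, not vLLM's API.

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]

# Incremental decoding: append this step's key/value to the cache, then
# attend with only the new query. Per-step cost grows linearly with the
# sequence length instead of recomputing all pairs every step.
kv_cache = {"k": [], "v": []}
for step in range(4):
    k = [float(step), 1.0]               # stand-ins for projected key/value
    v = [float(step) * 2.0, -1.0]
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    q = [1.0, 0.0]
    out = attend(q, kv_cache["k"], kv_cache["v"])
```

With a single cached key, the attention weight is 1, so the output equals that key's value vector; with more entries, the output is a softmax-weighted mix.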
Featured Speaker
MP

Matthew Prince
CEO, Cloudflare
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
📹 Video content coming soon.
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
📚 Reading list coming soon.
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Deploy a Model with vLLM
Intermediate · 3 hours
Objective
Quantize a 7B model and deploy with vLLM, measuring throughput and latency improvements.
Prerequisites
- PyTorch and Transformers library
- Understanding of quantization concepts
- Linux server or Colab with A100 (recommended)
- FastAPI basics helpful
Setup Instructions
- Install vLLM: `pip install vllm`
- Install quantization tools: `pip install auto-gptq autoawq`
- Download Llama-2-7B or Mistral-7B model
- Set up inference benchmarking scripts
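Before running GPTQ or AWQ, it helps to see what 4-bit quantization does at the numeric level. The sketch below is a deliberately simplified symmetric per-tensor scheme (real GPTQ/AWQ quantize group-wise and calibrate on data); it shows the round-trip and the resulting error bound.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0   # one scale for the whole tensor
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [qi * scale for qi in q]

w = [0.12, -0.53, 0.9, -0.07, 0.31]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Per-group scales (e.g. one per 128 weights) shrink this error substantially, which is why GPTQ and AWQ use them; your quality comparison in the tasks below measures the end-to-end effect.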
Tasks
- Quantize a 7B model to 4-bit with GPTQ
- Quantize the same model with AWQ and compare
- Deploy quantized model with vLLM server
- Benchmark throughput (tokens/sec) and latency
- Compare FP16 vs 4-bit quality on test prompts
- Write up findings on quality-performance tradeoffs
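For the benchmarking tasks, you will need consistent metrics across runs. A minimal sketch of the two numbers worth reporting, assuming you have recorded per-request latencies and a total token count (the function names here are ours, not part of vLLM):

```python
import statistics

def throughput_tokens_per_sec(total_tokens, wall_seconds):
    """Aggregate throughput across the whole benchmark run."""
    return total_tokens / wall_seconds

def latency_percentiles(latencies_ms):
    """p50/p95/mean from per-request latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    def pct(p):
        # nearest-rank index into the sorted list
        idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
        return ordered[idx]
    return {"p50": pct(50), "p95": pct(95), "mean": statistics.mean(ordered)}

# Hypothetical measurements from one benchmark run:
lat = [110, 95, 130, 480, 120, 105, 140, 125, 100, 115]
stats = latency_percentiles(lat)
tps = throughput_tokens_per_sec(total_tokens=12_800, wall_seconds=64.0)
```

Report p95 alongside the mean: one slow request (the 480 ms outlier above) barely moves the mean but dominates tail latency, and batching strategies trade exactly this tail against throughput.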