Week 10: Deployment and Inference Optimization
Serving models in production at scale
Time Estimate: 3-4 hours
Topics Covered
- KV caching and PagedAttention (vLLM)
- Quantization techniques (INT8, INT4, GPTQ, AWQ)
- Batching strategies and request scheduling
- TensorRT-LLM and inference optimization
- Monitoring and debugging production systems
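To build intuition for the first topic, here is a toy sketch (plain Python, no framework) of why a KV cache makes incremental decoding cheap: at each step only the new token's key/value is appended to the cache, and past keys/values are never recomputed. Systems like vLLM add paging on top of this idea; the function and variable names below are illustrative, not vLLM's API.

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]

# Incremental decoding: append this step's key/value to the cache, then
# attend with only the new query. Per-step cost grows linearly with the
# sequence length instead of recomputing all pairs every step.
kv_cache = {"k": [], "v": []}
for step in range(4):
    k = [float(step), 1.0]               # stand-ins for projected key/value
    v = [float(step) * 2.0, -1.0]
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    q = [1.0, 0.0]
    out = attend(q, kv_cache["k"], kv_cache["v"])
```

With a single cached key, the attention weight is 1, so the output equals that key's value vector; with more entries, the output is a softmax-weighted mix.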
Featured Speaker
MP

Matthew Prince
CEO, Cloudflare
Learn from industry leaders who are building the future of AI infrastructure and applications.
Video Resources
📹 Video content coming soon.
Videos include keynotes, technical talks, and tutorials from industry leaders.
Reading Materials
📚 Reading list coming soon.
Research papers, blog posts, and technical documentation.
🛠️ Hands-On Lab
Deploy a Model with vLLM
Intermediate · 3 hours
Objective
Quantize a 7B model and deploy with vLLM, measuring throughput and latency improvements.
Prerequisites
- PyTorch and Transformers library
- Understanding of quantization concepts
- Linux server or Colab with A100 (recommended)
- FastAPI basics helpful
Setup Instructions
- Install vLLM: `pip install vllm`
- Install quantization tools: `pip install auto-gptq autoawq`
- Download Llama-2-7B or Mistral-7B model
- Set up inference benchmarking scripts
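Before running GPTQ or AWQ, it helps to see what 4-bit quantization does at the numeric level. The sketch below is a deliberately simplified symmetric per-tensor scheme (real GPTQ/AWQ quantize group-wise and calibrate on data); it shows the round-trip and the resulting error bound.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0   # one scale for the whole tensor
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [qi * scale for qi in q]

w = [0.12, -0.53, 0.9, -0.07, 0.31]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Per-group scales (e.g. one per 128 weights) shrink this error substantially, which is why GPTQ and AWQ use them; your quality comparison in the tasks below measures the end-to-end effect.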
Tasks
- Quantize a 7B model to 4-bit with GPTQ
- Quantize the same model with AWQ and compare
- Deploy quantized model with vLLM server
- Benchmark throughput (tokens/sec) and latency
- Compare FP16 vs 4-bit quality on test prompts
- Write up findings on quality-performance tradeoffs
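For the benchmarking tasks, you will need consistent metrics across runs. A minimal sketch of the two numbers worth reporting, assuming you have recorded per-request latencies and a total token count (the function names here are ours, not part of vLLM):

```python
import statistics

def throughput_tokens_per_sec(total_tokens, wall_seconds):
    """Aggregate throughput across the whole benchmark run."""
    return total_tokens / wall_seconds

def latency_percentiles(latencies_ms):
    """p50/p95/mean from per-request latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    def pct(p):
        # nearest-rank index into the sorted list
        idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
        return ordered[idx]
    return {"p50": pct(50), "p95": pct(95), "mean": statistics.mean(ordered)}

# Hypothetical measurements from one benchmark run:
lat = [110, 95, 130, 480, 120, 105, 140, 125, 100, 115]
stats = latency_percentiles(lat)
tps = throughput_tokens_per_sec(total_tokens=12_800, wall_seconds=64.0)
```

Report p95 alongside the mean: one slow request (the 480 ms outlier above) barely moves the mean but dominates tail latency, and batching strategies trade exactly this tail against throughput.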