Week 10: Deployment and Inference Optimization

Serving models in production at scale

Time Estimate: 3-4 hours

Topics Covered

Featured Speaker

Matthew Prince

CEO, Cloudflare

Learn from industry leaders who are building the future of AI infrastructure and applications.

Video Resources

📹 Video content will be added here.

Videos include keynotes, technical talks, and tutorials from industry leaders.

Reading Materials

📚 Reading list will be added here.

Research papers, blog posts, and technical documentation.

🛠️ Hands-On Lab

Deploy a Model with vLLM

Difficulty: Intermediate · Time: 3 hours

Objective

Quantize a 7B model and deploy with vLLM, measuring throughput and latency improvements.
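Some back-of-the-envelope memory math shows why 4-bit quantization matters for a 7B model. A minimal sketch (weights only; real GPTQ/AWQ checkpoints carry a little extra for scales and zero-points, and serving also needs KV-cache memory on top):

```python
def model_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: params x bits, converted to bytes then GB."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = model_memory_gb(7e9, 16)  # FP16: 2 bytes per parameter
int4 = model_memory_gb(7e9, 4)   # 4-bit: half a byte per parameter
print(f"FP16: {fp16:.1f} GB, 4-bit: {int4:.1f} GB, savings: {fp16 / int4:.0f}x")
```

The 4x reduction is what lets a 7B model that barely fits on a 16 GB GPU in FP16 run comfortably, with room left for the KV cache.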

Prerequisites

  • Familiarity with PyTorch and the Hugging Face Transformers library
  • Understanding of quantization concepts
  • Linux server or Colab with an A100 GPU (recommended)
  • FastAPI basics (helpful)

Setup Instructions

  1. Install vLLM: pip install vllm
  2. Install quantization tools: pip install auto-gptq autoawq
  3. Download Llama-2-7B or Mistral-7B weights from Hugging Face (Llama-2 is gated and requires accepting Meta's license)
  4. Set up inference benchmarking scripts
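Once the vLLM server is running (e.g. via `python -m vllm.entrypoints.openai.api_server --model <path>`), it exposes an OpenAI-compatible REST API. A stdlib-only sketch of sending a completion request — the model name, port, and helper names below are assumptions to adapt to your deployment:

```python
import json
from urllib import request

def build_completion_payload(prompt: str, model: str, max_tokens: int = 128) -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,          # must match the model name the server was launched with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,      # greedy decoding, so quality comparisons are reproducible
    }

def query_vllm(prompt: str, model: str, base_url: str = "http://localhost:8000") -> str:
    """POST a completion request to a running vLLM server and return the generated text."""
    payload = build_completion_payload(prompt, model)
    req = request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Using greedy decoding (temperature 0) keeps FP16 and 4-bit outputs directly comparable on the same prompts, which matters for task 5 below.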

Tasks

  1. Quantize a 7B model to 4-bit with GPTQ
  2. Quantize the same model with AWQ and compare
  3. Deploy the quantized model with the vLLM server
  4. Benchmark throughput (tokens/sec) and latency
  5. Compare FP16 vs. 4-bit output quality on a fixed set of test prompts
  6. Write up your findings on the quality/performance trade-off
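For the benchmarking tasks, the raw per-request timings need to be reduced to a few comparable numbers. A minimal sketch of that aggregation — the `RequestSample` shape and field names are my own for illustration, not from any benchmarking library:

```python
import math
from dataclasses import dataclass

@dataclass
class RequestSample:
    latency_s: float       # wall-clock time from request start to last token
    output_tokens: int     # number of tokens generated for this request

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; values must be non-empty and 0 < pct <= 100."""
    ordered = sorted(values)
    return ordered[math.ceil(pct / 100 * len(ordered)) - 1]

def summarize(samples: list[RequestSample], wall_time_s: float) -> dict:
    """Reduce one benchmark run to throughput and latency metrics."""
    latencies = [s.latency_s for s in samples]
    total_tokens = sum(s.output_tokens for s in samples)
    return {
        "throughput_tok_per_s": total_tokens / wall_time_s,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
    }
```

Run the same prompt set once against the FP16 deployment and once against each quantized variant, then compare the resulting summaries; reporting p95 alongside the median catches tail-latency regressions that an average would hide.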

Resources