Deploying LLMs with vLLM and Docker: A Production LLMOps Guide
Serving LLMs in production isn't just loading a model in Flask. Learn vLLM for high-throughput inference, Docker for reproducible deploys, and the ops layer.
The LLMOps Stack
Serving an LLM in production requires three layers:
- Serving engine — vLLM (high-throughput inference with PagedAttention)
- Containerization — Docker (reproducible, portable deploys)
- Ops layer — health checks, logging, autoscaling, model swapping
1. Why vLLM?
vLLM can deliver several times the throughput of plain Hugging Face Transformers (the vLLM project reports up to 24x in its own benchmarks) thanks to:
- PagedAttention — KV cache managed like virtual memory (no fragmentation)
- Continuous batching — new requests join active batches without waiting
- Tensor parallelism — shard model across multiple GPUs
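To make the PagedAttention bullet concrete, here is a toy block allocator. It is a minimal sketch of the idea only, not vLLM's implementation: the class name, the method names, and the block size of 16 tokens are all assumptions made for illustration. Each sequence owns a list of fixed-size blocks and takes a new one only when its last block is full, so memory is never reserved for tokens that have not been generated yet.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


@dataclass
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of physical block ids."""
    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        # Every physical block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Make room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # sequence is new, or its last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block
blocks_used = cache.num_blocks - len(cache.free_blocks)
print(blocks_used)

cache.free("a")  # blocks go straight back to the pool for other sequences
print(len(cache.free_blocks))
```

Because blocks need not be contiguous, the blocks freed by one sequence can be reused by any other, which is what removes fragmentation.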
```text
# Throughput comparison (Llama-2-7B, A100 40GB)
HuggingFace Transformers: ~50 tokens/sec
vLLM: ~2,000 tokens/sec
```

2. Serving with vLLM
```python
# serve.py: minimal vLLM server
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,       # GPUs to shard across
    gpu_memory_utilization=0.90,  # fraction of GPU memory to use
    max_model_len=4096,           # max sequence length
    dtype="float16",              # or "bfloat16" on Ampere+
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference
prompts = ["Explain RAG in one sentence.", "What is PagedAttention?"]
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
```

For an OpenAI-compatible API server:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90
```

Now any OpenAI SDK client works:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain PagedAttention."}],
)
print(response.choices[0].message.content)
```

3. Dockerizing the LLM Server
```dockerfile
# Dockerfile
FROM vllm/vllm-openai:latest

# Copy custom model if not downloading from HF
# COPY ./models/llama-7b /models/llama-7b

EXPOSE 8000

# The base image's entrypoint starts the API server; CMD supplies its arguments.
# Exec-form CMD must be valid JSON: no trailing comma, and each line break
# needs a backslash continuation.
CMD ["--model", "meta-llama/Llama-2-7b-chat-hf", \
     "--port", "8000", \
     "--tensor-parallel-size", "1", \
     "--gpu-memory-utilization", "0.90", \
     "--max-model-len", "4096", \
     "--dtype", "float16"]
```

docker-compose.yml:
```yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --port=8000
      - --gpu-memory-utilization=0.90
      - --max-model-len=4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s # model loading takes time

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      vllm:
        condition: service_healthy

volumes:
  hf-cache:
```

4. Nginx Reverse Proxy with Rate Limiting
```nginx
# nginx.conf
events { worker_connections 1024; }

http {
    upstream vllm_backend {
        server vllm:8000;
    }

    # Rate limit: 10 requests/sec per IP
    limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=10r/s;

    server {
        listen 80;

        location /v1/ {
            limit_req zone=llm_limit burst=20 nodelay;
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s; # LLM generation can take time
        }

        location /health {
            proxy_pass http://vllm_backend/health;
            access_log off;
        }
    }
}
```

5. Monitoring and Logging
```python
# monitor.py: health-check probe that records Prometheus metrics for vLLM
import time

import requests
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["status"])
REQUEST_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency")


def health_check(vllm_url):
    """Probe the /health endpoint and record latency and status code."""
    try:
        start = time.time()
        resp = requests.get(f"{vllm_url}/health", timeout=5)
        latency = time.time() - start
        REQUEST_LATENCY.observe(latency)
        # Label values are strings, so convert the numeric status code explicitly.
        REQUEST_COUNT.labels(status=str(resp.status_code)).inc()
        return resp.status_code == 200
    except requests.RequestException:
        REQUEST_COUNT.labels(status="error").inc()
        return False
```

6. Autoscaling Strategy
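LLM workloads don't scale like web apps. The bottleneck is GPU memory, not CPU. Because vLLM reserves its configured share of GPU memory (gpu_memory_utilization) at startup, raw device memory stays near that level regardless of load, so the memory thresholds below are most meaningful when measured as KV cache usage.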
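A rough calculation shows why GPU memory runs out first. The sketch below estimates the KV cache footprint for Llama-2-7B (32 layers, 32 KV heads, head dimension 128) in float16. The memory budget is an assumption for illustration: a 40 GiB card, 90% usable, and roughly 13.5 GiB of weights, with activations and runtime overhead ignored, so treat the result as an optimistic upper bound.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-2-7B in float16 (2 bytes per value)
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32,
                               head_dim=128, dtype_bytes=2)
per_sequence_gib = per_token * 4096 / 2**30  # one full 4096-token sequence

# Assumed budget (illustrative): 40 GiB GPU, 90% usable, ~13.5 GiB of weights.
gpu_gib, utilization, weights_gib = 40, 0.90, 13.5
kv_budget_gib = gpu_gib * utilization - weights_gib
max_full_sequences = int(kv_budget_gib // per_sequence_gib)

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_sequence_gib:.1f} GiB per 4096-token sequence")
print(f"about {max_full_sequences} full-length sequences fit")
```

Under these assumptions only about eleven full-length sequences fit at once, whatever the CPU load, which is why memory and queue metrics matter more for scaling than CPU utilization.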
| Metric | Action |
|---|---|
| GPU memory > 95% | Scale up (add replicas) |
| Queue depth > 10 | Scale up |
| GPU memory < 50% | Scale down |
| Request latency > 5s | Scale up |
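The table above can be read as a small decision function. The following is a minimal sketch, assuming the three metrics are already collected per replica; the `ReplicaMetrics` type, the `scaling_decision` name, and the rule that any scale-up signal takes precedence over scale-down are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaMetrics:
    gpu_memory_pct: float  # percent of GPU memory in use
    queue_depth: int       # requests waiting to be scheduled
    latency_s: float       # recent request latency in seconds


def scaling_decision(m: ReplicaMetrics) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' using the table's thresholds."""
    # Any overload signal wins, so a long queue is never answered by scaling down.
    if m.gpu_memory_pct > 95 or m.queue_depth > 10 or m.latency_s > 5:
        return "scale_up"
    if m.gpu_memory_pct < 50:
        return "scale_down"
    return "hold"


print(scaling_decision(ReplicaMetrics(gpu_memory_pct=97, queue_depth=2, latency_s=1.2)))
print(scaling_decision(ReplicaMetrics(gpu_memory_pct=40, queue_depth=0, latency_s=0.8)))
print(scaling_decision(ReplicaMetrics(gpu_memory_pct=70, queue_depth=3, latency_s=2.0)))
```

In practice a cooldown between decisions helps avoid flapping, since loading a model onto a new replica can take minutes.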
Kubernetes HPA example:
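```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Resource metrics only cover cpu and memory, so GPU memory has to be
    # served as a custom per-pod metric (for example through a Prometheus
    # adapter). The metric name is a placeholder that depends on your setup,
    # and the target assumes the metric reports a percentage.
    - type: Pods
      pods:
        metric:
          name: nvidia_com_gpu_memory
        target:
          type: AverageValue
          averageValue: "85"
```

Summary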
| Layer | Tool | Purpose |
|---|---|---|
| Serving | vLLM | High-throughput inference with PagedAttention |
| Container | Docker | Reproducible, portable deployments |
| Proxy | Nginx | Rate limiting, load balancing, timeouts |
| Monitoring | Prometheus | Request count, latency, GPU utilization |
| Scaling | K8s HPA | Auto-scale based on GPU memory utilization |
The key insight: LLM serving is a systems problem, not just a model problem. vLLM handles the inference, but the ops layer around it — rate limiting, health checks, autoscaling — is what makes it production-ready.
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.