Deploying LLMs with vLLM and Docker: A Production LLMOps Guide

Serving LLMs in production isn't just loading a model in Flask. Learn vLLM for high-throughput inference, Docker for reproducible deploys, and the ops layer.

Rajath Kumar, Edge AI Engineer & Founder, Analog Data
2026-06-27 · 13 min read

The LLMOps Stack

Serving an LLM in production requires three layers:

  1. Serving engine — vLLM (high-throughput inference with PagedAttention)
  2. Containerization — Docker (reproducible, portable deploys)
  3. Ops layer — health checks, logging, autoscaling, model swapping

1. Why vLLM?

The vLLM project's own benchmarks report up to 24x higher throughput than plain HuggingFace Transformers, thanks to:

  • PagedAttention: the KV cache is stored in fixed-size blocks, like pages in virtual memory (minimal fragmentation)
  • Continuous batching: new requests join active batches without waiting for the batch to finish
  • Tensor parallelism: shard the model across multiple GPUs
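
To make the PagedAttention idea concrete, here is a toy block allocator. It is only a sketch of the bookkeeping, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 tokens are assumptions chosen for illustration. Each sequence gets a block table mapping its logical positions to physical blocks, and a new block is claimed only when the previous one is full.

```python
class PagedKVCache:
    """Toy model of paged KV-cache bookkeeping (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token of the given sequence."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the sequence has no block yet
        # or its last block is completely full.
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are claimed one at a time, a sequence wastes at most one partially filled block, instead of a contiguous region sized for the maximum sequence length.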
text
# Approximate throughput (Llama-2-7B, A100 40GB); varies with batch size and sequence length
HuggingFace Transformers:  ~50 tokens/sec
vLLM:                      ~2,000 tokens/sec
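
Much of that gap comes from continuous batching. The following toy simulation counts decode steps for static batching (a batch runs until its longest request finishes) against continuous batching (a freed slot is refilled on the next step). The function names and the one-token-per-step model are simplifying assumptions, not a description of vLLM's scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot; waiting requests join next step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the finished ones.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

With output lengths of 2, 8, 2 and 8 tokens and two slots, the static scheduler needs 16 steps while the continuous one needs 12, because short requests no longer wait for long ones.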

2. Serving with vLLM

python
# serve.py: minimal offline batch inference with vLLM (no HTTP server)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,       # GPUs to shard across
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may reserve
    max_model_len=4096,           # max sequence length (prompt + output)
    dtype="float16",              # or "bfloat16" on Ampere+
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference: all prompts are scheduled together
prompts = ["Explain RAG in one sentence.", "What is PagedAttention?"]
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
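
The `gpu_memory_utilization` and `max_model_len` settings interact through the KV cache. As a rough back-of-the-envelope sketch, the arithmetic below estimates how many tokens of KV cache fit next to the weights. The Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128), the parameter count of about 6.74 billion, and the 40 GiB card size are assumptions of this example; the estimate ignores activations and other runtime overhead, so treat it as an upper bound.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(gpu_bytes, gpu_memory_utilization, weight_bytes, per_token_bytes):
    """Tokens of KV cache that fit after the weights are loaded (upper bound)."""
    usable = gpu_bytes * gpu_memory_utilization - weight_bytes
    return max(0, int(usable // per_token_bytes))


# Assumed Llama-2-7B shape in float16 (2 bytes per value).
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)
WEIGHTS = int(6.74e9) * 2  # approximate parameter count x 2 bytes
BUDGET = kv_token_budget(40 * 2**30, 0.90, WEIGHTS, PER_TOKEN)

print(f"KV cache per token: {PER_TOKEN / 2**20:.2f} MiB")
print(f"Token budget: {BUDGET}, full 4096-token sequences: {BUDGET // 4096}")
```

Under these assumptions each token costs 0.5 MiB of cache, so only around a dozen full-length 4096-token sequences fit at once. This is why lowering `max_model_len` or raising `gpu_memory_utilization` directly changes how many requests can run concurrently.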

For an OpenAI-compatible API server:

bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90

Now any OpenAI SDK client works:

python
from openai import OpenAI

# The SDK requires an api_key; vLLM only checks it if started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain PagedAttention."}],
)
print(response.choices[0].message.content)
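
If you would rather not depend on the SDK, the same call is a plain HTTP POST. The sketch below builds that request with the standard library only; the helper names `build_chat_request` and `send` are made up for this example, and the response parsing assumes the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="dummy"):
    """Build the POST request an OpenAI-style client would send."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request("http://localhost:8000/v1", "meta-llama/Llama-2-7b-chat-hf", "Explain PagedAttention."))` against a running server should return the same text as the SDK call.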

3. Dockerizing the LLM Server

dockerfile
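# Dockerfile
# Pin a specific image tag in production; "latest" is not reproducible.
FROM vllm/vllm-openai:latest

# Copy custom model if not downloading from HF
# COPY ./models/llama-7b /models/llama-7b

EXPOSE 8000

# The base image's entrypoint starts the OpenAI-compatible server, so CMD
# only supplies its arguments. Exec-form CMD must be valid JSON (no trailing
# comma); backslashes continue the instruction across lines.
CMD ["--model", "meta-llama/Llama-2-7b-chat-hf", \
     "--port", "8000", \
     "--tensor-parallel-size", "1", \
     "--gpu-memory-utilization", "0.90", \
     "--max-model-len", "4096", \
     "--dtype", "float16"]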

docker-compose.yml:

yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --port=8000
      - --gpu-memory-utilization=0.90
      - --max-model-len=4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # model loading takes time

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      vllm:
        condition: service_healthy

volumes:
  hf-cache:

4. Nginx Reverse Proxy with Rate Limiting

nginx
# nginx.conf
events { worker_connections 1024; }

http {
    upstream vllm_backend {
        server vllm:8000;
    }

    # Rate limit: 10 requests/sec per IP
    limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=10r/s;

    server {
        listen 80;

        location /v1/ {
            limit_req zone=llm_limit burst=20 nodelay;
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;  # LLM generation can take time
        }

        location /health {
            proxy_pass http://vllm_backend/health;
            access_log off;
        }
    }
}
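
The interaction between `rate=10r/s` and `burst=20 nodelay` is easy to misread, so here is a small simulation of it. This is a simplified model of nginx's leaky-bucket accounting for a single client IP, written for illustration; the class name `LeakyBucket` and the use of floating-point seconds are assumptions of the sketch, not nginx internals.

```python
class LeakyBucket:
    """Simplified model of limit_req with nodelay for one client key."""

    def __init__(self, rate, burst):
        self.rate = rate      # requests drained per second
        self.burst = burst    # extra requests tolerated above the rate
        self.excess = 0.0     # requests currently "in the bucket"
        self.last = None      # time of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` is accepted."""
        if self.last is None:
            # First request from this client: always accepted.
            self.last = now
            self.excess = 0.0
            return True
        elapsed = max(0.0, now - self.last)
        # Drain at the configured rate, then add the incoming request.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the state
        self.excess = excess
        self.last = now
        return True
```

In this model a client that sends 25 requests at the same instant gets 21 through (one within the rate plus a burst of 20) and the rest rejected. One second later, ten more slots have drained and are available again.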

5. Monitoring and Logging

python
# monitor.py: health probe for vLLM, instrumented with Prometheus metrics
import time

import requests
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["status"])
REQUEST_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency")

def health_check(vllm_url):
    """Probe /health once, recording latency and the status code."""
    start = time.perf_counter()  # monotonic clock, safe for durations
    try:
        resp = requests.get(f"{vllm_url}/health", timeout=5)
    except requests.RequestException:
        REQUEST_COUNT.labels(status="error").inc()
        return False
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    REQUEST_COUNT.labels(status=str(resp.status_code)).inc()
    return resp.status_code == 200
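
vLLM's API server also publishes its own Prometheus metrics at `/metrics`, which is usually a better source for queue depth than an external probe. The sketch below parses the Prometheus text format with the standard library. The metric name `vllm:num_requests_waiting` and the sample payload are assumptions for illustration (metric names can differ between vLLM versions), and the parser ignores optional timestamps.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {name: [(labels, value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        head, _, value = line.rpartition(" ")
        name, _, labels = head.partition("{")
        samples.setdefault(name, []).append((labels.rstrip("}"), float(value)))
    return samples


def queue_depth(samples, metric="vllm:num_requests_waiting"):
    """Sum the waiting-request gauge across all label sets."""
    return sum(value for _, value in samples.get(metric, []))


# Hypothetical excerpt of a /metrics response.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
vllm:num_requests_running{model_name="meta-llama/Llama-2-7b-chat-hf"} 8.0
"""
```

In practice you would fetch the text with `requests.get(f"{vllm_url}/metrics").text` and pass it to `parse_prometheus_text`, or let Prometheus scrape the endpoint directly.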

6. Autoscaling Strategy

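LLM workloads don't scale like web apps. The bottleneck is GPU memory, not CPU. Note that vLLM reserves its gpu_memory_utilization share of GPU memory at startup, so raw GPU memory usage stays close to that fraction regardless of load; read the "GPU memory" thresholds below as KV-cache usage within that reservation.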

Metric                  Action
GPU memory > 95%        Scale up (add replicas)
Queue depth > 10        Scale up
GPU memory < 50%        Scale down
Request latency > 5s    Scale up
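
The rules in the table can be written as a small decision function. This is a sketch under stated assumptions: the `Snapshot` fields and the function name are hypothetical, scale-up rules take priority over the scale-down rule, and the replica bounds of 1 and 4 are placeholder defaults.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One observation of the serving fleet (hypothetical field names)."""
    gpu_memory_frac: float  # 0.0 to 1.0
    queue_depth: int        # requests waiting to be scheduled
    latency_s: float        # request latency in seconds


def scaling_decision(snap, replicas, min_replicas=1, max_replicas=4):
    """Return the desired replica count for the given snapshot."""
    scale_up = (
        snap.gpu_memory_frac > 0.95
        or snap.queue_depth > 10
        or snap.latency_s > 5.0
    )
    if scale_up:
        return min(replicas + 1, max_replicas)
    if snap.gpu_memory_frac < 0.50:
        return max(replicas - 1, min_replicas)
    return replicas
```

A real autoscaler would also add a cooldown between decisions, since a new replica needs minutes to load the model before it can absorb traffic.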

Kubernetes HPA example:

yaml
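apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # "Resource" metrics only cover cpu and memory. GPU and vLLM metrics must
    # be served as custom metrics (for example via Prometheus Adapter).
    # The metric name below is a placeholder; it depends on your adapter rules.
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "10"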

Summary

Layer       Tool        Purpose
Serving     vLLM        High-throughput inference with PagedAttention
Container   Docker      Reproducible, portable deployments
Proxy       Nginx       Rate limiting, load balancing, timeouts
Monitoring  Prometheus  Request count, latency, GPU utilization
Scaling     K8s HPA     Auto-scale on GPU memory and queue-depth metrics

The key insight: LLM serving is a systems problem, not just a model problem. vLLM handles the inference, but the ops layer around it — rate limiting, health checks, autoscaling — is what makes it production-ready.
