LoRA Fine-Tuning for Edge Deployment: Shrink, Quantize, Ship
Full fine-tuning is expensive and wasteful when you only need domain adaptation. LoRA trains well under 1% of the parameters; the result can then be merged, quantized to INT8, and run on edge hardware.
Why LoRA Instead of Full Fine-Tuning
Full fine-tuning a 7B model updates 7 billion parameters. LoRA (Low-Rank Adaptation) updates ~10-20 million — 0.1-0.3% of the model — and achieves comparable results for domain adaptation.
| Approach | Trainable Params | VRAM (7B) | Disk Size | Quality |
|---|---|---|---|---|
| Full fine-tuning | 7B | 80GB+ | 14GB | Best |
| LoRA (r=16) | ~20M | 16GB | 50MB adapter | Near-identical |
| QLoRA (4-bit + LoRA) | ~20M | 6GB | 50MB adapter | Slightly below LoRA |
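The VRAM column can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes mixed-precision Adam, which costs about 16 bytes per trainable parameter (2 for FP16 weights, 2 for gradients, 12 for FP32 master weights and the two Adam moments). It ignores activations and framework overhead, so real usage is higher. The function name and byte counts are illustrative assumptions, not measurements.

```python
GB = 1e9  # decimal gigabytes, matching the table

def training_memory_gb(total_params, trainable_params, frozen_bytes_per_param):
    """Rough weight + optimizer memory, ignoring activations and overhead."""
    frozen = (total_params - trainable_params) * frozen_bytes_per_param
    # Assumption: ~16 bytes per trainable parameter under mixed-precision Adam.
    trainable = trainable_params * 16
    return (frozen + trainable) / GB

N = 7e9          # nominal 7B model
ADAPTER = 20e6   # ~20M LoRA parameters, as in the table

full_ft = training_memory_gb(N, N, 0)        # everything is trainable
lora = training_memory_gb(N, ADAPTER, 2)     # frozen base kept in FP16
qlora = training_memory_gb(N, ADAPTER, 0.5)  # frozen base kept in 4-bit

print(f"Full fine-tuning: ~{full_ft:.0f} GB")
print(f"LoRA:             ~{lora:.1f} GB")
print(f"QLoRA:            ~{qlora:.1f} GB")
```

Under these assumptions the estimate comes to about 112 GB, 14.3 GB, and 3.8 GB, which leaves the table's 16GB and 6GB figures a few gigabytes of headroom for activations.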
The math: instead of updating the full weight matrix $W$, LoRA learns two small matrices $A$ and $B$ such that $W' = W + A \times B$. If $W$ is $d \times d$ and rank $r=16$, then $A$ is $d \times 16$ and $B$ is $16 \times d$ — reducing parameters from $d^2$ to $2 \times d \times 16$.
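A toy example can make the update concrete. The sketch below uses plain Python lists and tiny, made-up dimensions (d = 6, r = 2) to check that running the adapter as a side path gives the same output as using the merged matrix $W' = W + A \times B$, and then counts parameters for d = 4096, r = 16. It leaves out two details of real LoRA implementations: the update is scaled by alpha / r, and one of the two matrices starts at zero so that training begins from $W' = W$. All helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, r = 6, 2                      # toy sizes; the article uses r = 16

W = rand_matrix(d, d, rng)       # frozen pretrained weight, d x d
A = rand_matrix(d, r, rng)       # trainable, d x r
B = rand_matrix(r, d, rng)       # trainable, r x d

W_merged = add(W, matmul(A, B))  # W' = W + A x B

# The adapter applied as a side path matches the merged weight.
x = rand_matrix(1, d, rng)       # one input row vector
side_path = add(matmul(x, W), matmul(matmul(x, A), B))
merged = matmul(x, W_merged)
max_diff = max(abs(a - b) for a, b in zip(side_path[0], merged[0]))

def lora_param_count(d, r):
    """Parameters in A (d x r) plus B (r x d)."""
    return 2 * d * r

full_4096 = 4096 * 4096
lora_4096 = lora_param_count(4096, 16)
print(f"max difference: {max_diff:.2e}")
print(f"d=4096, r=16: {full_4096:,} -> {lora_4096:,} ({full_4096 // lora_4096}x fewer)")
```

For a single 4096 × 4096 matrix this is 16,777,216 parameters against 131,072, a 128x reduction.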
1. Setting Up QLoRA Fine-Tuning
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset

# Load model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,             # rank: start with 16
    lora_alpha=32,    # scaling factor (typically 2x rank)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config, expect about 16.8M trainable params:
# 4 modules x 32 layers x 2 x 4096 x 16 = 16,777,216.
# The reported total is roughly half of 7B because 4-bit weights are stored packed.

Which modules to target?
- Minimal: ["q_proj", "v_proj"] (fastest, good for simple tasks)
- Standard: ["q_proj", "v_proj", "k_proj", "o_proj"] (balanced)
- Aggressive: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] (best quality, more params)
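To see what "more params" means for each preset, the adapter size can be counted directly: every adapted linear layer adds r × (in_features + out_features) parameters. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers) and r = 16. The preset names mirror the list above, and the helper is hypothetical.

```python
# Assumed Llama-2-7B shapes: hidden size 4096, MLP size 11008, 32 layers.
HIDDEN, MLP, LAYERS = 4096, 11008, 32

# (in_features, out_features) of each projection in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

PRESETS = {
    "minimal": ["q_proj", "v_proj"],
    "standard": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "aggressive": ["q_proj", "v_proj", "k_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
}

def lora_trainable_params(target_modules, r=16):
    """Each adapted layer adds r * (in_features + out_features) parameters."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_layer * LAYERS

counts = {name: lora_trainable_params(mods) for name, mods in PRESETS.items()}
for name, n in counts.items():
    print(f"{name:<10} {n:>12,}")
```

With these shapes the presets come to roughly 8.4M, 16.8M, and 40.0M trainable parameters, so the aggressive preset costs nearly five times the minimal one.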
2. Preparing the Dataset
# Format: instruction → response
data = [
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "response": "Temperature exceeds safe operating range (80C). Recommend thermal throttling..."},
    # ... more examples
]

dataset = Dataset.from_list([
    {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
    for d in data
])

Dataset size guidelines:
- Style/format adaptation: 500-1,000 examples
- Domain knowledge injection: 2,000-10,000 examples
- New task capability: 10,000-50,000 examples
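Whatever the dataset size, it usually helps to hold out a slice for evaluation before training. The sketch below reuses the instruction/response prompt format and adds a deterministic train/eval split in plain Python. The 10% hold-out, the seed, the helper names, and the placeholder rows are all assumptions for illustration; the resulting rows can be passed to Dataset.from_list as before.

```python
import random

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(example):
    """Render one instruction/response pair in the article's prompt format."""
    return {"text": PROMPT_TEMPLATE.format(**example)}

def train_eval_split(examples, eval_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder data standing in for a real instruction set.
raw = [{"instruction": f"Question {i}", "response": f"Answer {i}"} for i in range(500)]
train_rows, eval_rows = train_eval_split([format_example(ex) for ex in raw])
print(len(train_rows), len(eval_rows))
```

Fixing the seed keeps the split reproducible, so scores from different training runs are measured on the same held-out examples.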
3. Training
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

training_config = SFTConfig(
    output_dir="./lora-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,              # LoRA needs higher LR than full FT
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                       # use bf16 on Ampere+, fp16 on older
    optim="paged_adamw_8bit",        # 8-bit optimizer saves VRAM
    max_seq_length=2048,
    dataset_text_field="text",
)

# `model` is already wrapped by get_peft_model, so peft_config is not passed again.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases call this argument processing_class
    args=training_config,
)

trainer.train()

# Save only the adapter (tiny, ~50MB)
trainer.save_model("./lora-adapter")

Key hyperparameters explained:
- learning_rate=2e-4: LoRA typically uses a learning rate roughly 10x higher than full fine-tuning (which commonly runs at 1e-5 to 2e-5), because the small, freshly initialized adapter matrices are the only weights being updated.
- lora_alpha=32: Scales the adapter's contribution; the update is multiplied by alpha / rank (here 32 / 16 = 2). Common heuristic: alpha = 2 * rank.
- paged_adamw_8bit: Stores optimizer states in 8-bit and pages them to CPU memory when GPU memory spikes, which reduces the risk of OOM on long sequences.
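The interaction between warmup_ratio, the cosine schedule, and the effective batch size is easier to see with numbers. The sketch below reimplements a linear-warmup, cosine-decay schedule in plain Python for a hypothetical run of 2,000 examples on a single GPU. It is an approximation written for illustration; the Trainer's own step accounting may round differently, and the function name is an assumption.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 2,000 examples with the batch settings from the config.
examples, epochs = 2000, 3
effective_batch = 4 * 4                        # per-device batch x accumulation
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs

lora_scale = 32 / 16                           # lora_alpha / r

print(f"total optimizer steps: {total_steps}")
for step in (0, 6, 12, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at_step(step, total_steps):.2e}")
print(f"adapter scale alpha/r: {lora_scale}")
```

With these numbers the run takes 375 optimizer steps, of which the first 12 are warmup. On small datasets the warmup is therefore only a handful of steps, which is worth keeping in mind when lowering the example count.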
4. Merging and Quantizing for Edge Deployment
For edge deployment, merge the LoRA adapter into the base model, then quantize to INT8:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model in half precision (FP16)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()  # folds A×B into W; base_model's weights change too

# Save merged model together with its tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

Quantize to INT8 for edge:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # activations above this magnitude count as outliers and stay in FP16
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "./merged-model",
    quantization_config=quant_config,
    device_map="auto",
)
quantized_model.save_pretrained("./edge-model-int8")

Size comparison:
Base model (FP16):   14.0 GB
LoRA adapter:        50 MB
Merged model (FP16): 14.0 GB
Quantized (INT8):    7.0 GB  ← deploy this

5. Serving the Edge Model
# Edge inference with the quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "./edge-model-int8",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./edge-model-int8")

def generate(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

6. Evaluating LoRA vs Full Fine-Tuning
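def evaluate_model(model, tokenizer, eval_dataset, max_new_tokens=256):
    """Fraction of examples whose generated answer contains the expected keyword."""
    correct = 0
    total = len(eval_dataset)

    for example in eval_dataset:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Greedy decoding keeps the comparison deterministic
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Score only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)
        if example["expected_keyword"].lower() in response.lower():
            correct += 1

    return correct / total

# Compare. merge_and_unload() changed base_model's weights, so reload an
# untouched copy of the base model for the baseline.
original_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto"
)
# eval_set: a list of dicts with "instruction" and "expected_keyword" keys
base_score = evaluate_model(original_model, tokenizer, eval_set)
lora_score = evaluate_model(merged_model, tokenizer, eval_set)

print(f"Base model: {base_score:.2%}")
print(f"LoRA tuned: {lora_score:.2%}")
print(f"Improvement: {(lora_score - base_score) * 100:+.1f}pp")

Summary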
| Step | What Happens | Output Size |
|---|---|---|
| QLoRA training | Train 0.3% of params in 4-bit base | 50MB adapter |
| Merge | Fold adapter into base weights | 14GB (FP16) |
| Quantize | Compress to INT8 for edge | 7GB (INT8) |
| Deploy | Serve on an RTX 3060 (12GB) or similar; a 4GB Jetson Nano cannot hold the INT8 model | 7GB model |
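The output sizes in the table follow from simple bytes-per-parameter arithmetic, which also gives a quick way to check whether a model fits a given device. The sketch below uses a nominal 7B parameter count and ignores quantization metadata. The 80% headroom factor (leaving room for activations and the KV cache) and the memory budgets are assumptions for illustration, not figures from any specific device.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def model_size_gb(num_params, dtype):
    """Approximate weight size in decimal GB (ignores metadata and scales)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def fits(num_params, dtype, memory_gb, headroom=0.8):
    """True if the weights use at most `headroom` of the available memory."""
    return model_size_gb(num_params, dtype) <= memory_gb * headroom

N = 7e9  # nominal parameter count for a 7B model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>5}: {model_size_gb(N, dtype):.1f} GB")

# Hypothetical memory budgets, not tied to specific products.
for budget in (4, 8, 12):
    print(f"INT8 fits in {budget} GB: {fits(N, 'int8', budget)}")
```

Under these assumptions the INT8 model needs a device with comfortably more than 8 GB of memory; smaller targets would need 4-bit weights or a smaller base model.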
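For domain adaptation, LoRA is not much of a compromise: it comes close to full fine-tuning while training about 0.3% of the parameters and, with QLoRA, fitting in roughly 6GB of VRAM instead of 80GB+. The adapter is tiny, can be merged into the base weights, and, while unmerged, can be swapped at runtime. That is why it has become a common default for production LLM customization.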
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.