LoRA Fine-Tuning for Edge Deployment: Shrink, Quantize, Ship

Full fine-tuning is expensive and wasteful when you only need domain adaptation. LoRA trains well under 1% of the parameters; the merged model is then quantized to INT8 and shipped to edge hardware.

Rajath Kumar, Edge AI Engineer & Founder, Analog Data
2026-06-27 · 13 min read · Advanced

Why LoRA Instead of Full Fine-Tuning

Full fine-tuning a 7B model updates 7 billion parameters. LoRA (Low-Rank Adaptation) updates ~10-20 million — 0.1-0.3% of the model — and achieves comparable results for domain adaptation.

| Approach | Trainable Params | VRAM (7B) | Disk Size | Quality |
|---|---|---|---|---|
| Full fine-tuning | 7B | 80GB+ | 14GB | Best |
| LoRA (r=16) | ~20M | 16GB | 50MB adapter | Near-identical |
| QLoRA (4-bit + LoRA) | ~20M | 6GB | 50MB adapter | Slightly below LoRA |

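The math: instead of updating the full weight matrix $W$, LoRA freezes $W$ and learns two small matrices $A$ and $B$ such that $W' = W + A \times B$ (in practice the product is also scaled by $\alpha / r$, the `lora_alpha` setting divided by the rank). If $W$ is $d \times d$ and the rank is $r = 16$, then $A$ is $d \times 16$ and $B$ is $16 \times d$, which reduces the trainable parameters from $d^2$ to $2 \times d \times 16$.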
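
As a worked example of that reduction, assume $d = 4096$, the hidden size of the attention projections in Llama-2-7B (a value brought in here for illustration, not stated above):

$$
d^2 = 4096^2 = 16{,}777{,}216, \qquad 2 \times d \times 16 = 2 \times 4096 \times 16 = 131{,}072, \qquad \frac{131{,}072}{16{,}777{,}216} = \frac{1}{128} \approx 0.78\%
$$

So for a single square projection, the adapter holds 128 times fewer parameters than the matrix it adapts. The model-wide percentage is lower still, because most weight matrices get no adapter at all.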

1. Setting Up QLoRA Fine-Tuning

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset

# Load the base model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type QLoRA uses
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches bf16=True in training; use float16 on pre-Ampere GPUs
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Freezes the base weights and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                              # rank: start with 16
    lora_alpha=32,                     # scaling factor (typically 2x rank)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. For this config the trainable count is
# 16,777,216 (32 layers x 4 modules x 2 x 4096 x 16); the reported total depends on
# how the installed PEFT version counts packed 4-bit weights.
```

Which modules to target?

  • Minimal: ["q_proj", "v_proj"] — fastest, good for simple tasks
  • Standard: ["q_proj", "v_proj", "k_proj", "o_proj"] — balanced
  • Aggressive: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] — best quality, more params
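
To make the "more params" trade-off concrete, the sketch below counts LoRA parameters for each of the three presets at r=16. The layer shapes are assumptions taken from the published Llama-2-7B architecture (hidden size 4096, MLP intermediate size 11008, 32 layers), and `lora_trainable_params` and `PRESETS` are hypothetical names used only for this illustration.

```python
# Assumed Llama-2-7B shapes: (in_features, out_features) per projection.
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules, r=16):
    """Each adapted layer adds r * (in_features + out_features) weights."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS


PRESETS = {
    "minimal": ["q_proj", "v_proj"],
    "standard": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "aggressive": ["q_proj", "v_proj", "k_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
}

for name, modules in PRESETS.items():
    print(f"{name:<10} {lora_trainable_params(modules):>12,}")
# minimal       8,388,608
# standard     16,777,216
# aggressive   39,976,960
```

Under these assumed shapes, adding the MLP projections more than doubles the adapter compared with the attention-only "standard" preset, because the MLP matrices are wider than the attention ones.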

2. Preparing the Dataset

```python
from datasets import Dataset

# Format: instruction → response
data = [
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "response": "Temperature exceeds safe operating range (80C). Recommend thermal throttling..."},
    # ... more examples
]

# Render each pair into the single "text" field that the trainer reads
dataset = Dataset.from_list([
    {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
    for d in data
])
```

Dataset size guidelines:

  • Style/format adaptation: 500-1,000 examples
  • Domain knowledge injection: 2,000-10,000 examples
  • New task capability: 10,000-50,000 examples
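
Whatever the target size, it usually pays to clean the raw records and hold some out before building the `Dataset`. The following is a minimal standard-library sketch of that step, not part of the pipeline above: `format_prompt`, `format_example`, and `clean_and_split` are hypothetical helpers, and the second sensor record is invented sample data. It also keeps the inference prompt as an exact prefix of the training text, so the template cannot drift between the two.

```python
import random

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def format_prompt(instruction):
    """Inference-time prompt: the training text up to where the response starts."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def format_example(record):
    """Training text: the prompt followed by the target response."""
    return format_prompt(record["instruction"]) + record["response"].strip()


def clean_and_split(records, eval_fraction=0.1, seed=42):
    """Drop empty or duplicate records, then hold out a fraction for evaluation."""
    seen, cleaned = set(), []
    for rec in records:
        key = (rec.get("instruction", "").strip(), rec.get("response", "").strip())
        if all(key) and key not in seen:
            seen.add(key)
            cleaned.append({"instruction": key[0], "response": key[1]})
    random.Random(seed).shuffle(cleaned)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(cleaned) * eval_fraction))
    return cleaned[n_eval:], cleaned[:n_eval]


records = [
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "response": "Temperature exceeds safe operating range (80C)."},
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "response": "Temperature exceeds safe operating range (80C)."},   # exact duplicate
    {"instruction": "Analyze this sensor reading: temp=40C, pressure=1.0bar",
     "response": "Both readings are within the normal operating range."},  # invented sample
    {"instruction": "", "response": "orphan response"},                # empty instruction
]

train, held_out = clean_and_split(records, eval_fraction=0.5)
print(len(train), len(held_out))  # 1 1
```

The `train` list can then be mapped through `format_example` to produce the `text` field, while `held_out` stays untouched for evaluation.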

3. Training

```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

training_config = SFTConfig(
    output_dir="./lora-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size = 16
    learning_rate=2e-4,                 # LoRA needs higher LR than full FT
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                          # use bf16 on Ampere+, fp16 on older
    optim="paged_adamw_8bit",           # 8-bit optimizer saves VRAM
    max_seq_length=2048,                # named max_length in newer TRL releases
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,                        # already wrapped by get_peft_model, so peft_config is not passed again
    train_dataset=dataset,
    tokenizer=tokenizer,                # named processing_class in newer TRL releases
    args=training_config,
)

trainer.train()

# Save only the adapter (tiny, ~50MB)
trainer.save_model("./lora-adapter")
```

Key hyperparameters explained:

  • learning_rate=2e-4: LoRA typically tolerates a 10-100x higher LR than full fine-tuning, because only the small adapter weights are updated while the base model stays frozen
  • lora_alpha=32: Scales the adapter's contribution by alpha / rank. Common heuristic: alpha = 2 * rank
  • paged_adamw_8bit: Keeps optimizer states in 8-bit and pages them to CPU memory during GPU memory spikes, which reduces the risk of OOM on long sequences
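
The numbers in the training config imply a concrete schedule. The sketch below works through that arithmetic as a rough estimate; `training_schedule` and `lora_scaling` are hypothetical helpers, the 5,000-example dataset size is an assumed figure, and the step count ignores details such as sequence packing or how a given trainer version rounds the last partial batch.

```python
import math


def training_schedule(num_examples, epochs=3, per_device_batch=4,
                      grad_accum=4, warmup_ratio=0.03, num_devices=1):
    """Estimate optimizer steps and warmup length from the config values."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lora_scaling(lora_alpha=32, r=16):
    """Multiplier applied to the A×B update in standard LoRA: alpha / r."""
    return lora_alpha / r


print(training_schedule(5000))
# {'effective_batch': 16, 'steps_per_epoch': 313, 'total_steps': 939, 'warmup_steps': 29}
print(lora_scaling())  # 2.0
```

With a 3% warmup, only about 29 of roughly 939 steps ramp the learning rate, so on a very small dataset the warmup can shrink to one or two steps.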

4. Merging and Quantizing for Edge Deployment

For edge deployment, merge the LoRA adapter into the base model, then quantize to INT8:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model unquantized, in FP16 (merge into real weights, not the 4-bit copy)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()  # folds the scaled A×B update into W

# Save the merged model together with its tokenizer
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.save_pretrained("./merged-model")
```
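
Why merging is safe can be checked by hand on a toy example. The sketch below is a plain-Python illustration of the idea, not PEFT's implementation: it uses the article's shapes ($A$ is $d \times r$, $B$ is $r \times d$), and `matmul` and `merge_lora` are hypothetical helpers. It shows that the merged matrix gives the same output as running the base layer and the adapter side by side.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W' = W + (alpha / r) * A×B."""
    scale = lora_alpha / r
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy sizes: d = 3, r = 1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]      # d × r
B = [[0.5, 0.0, -0.5]]         # r × d
x = [[1.0], [2.0], [3.0]]      # input as a column vector

# Merged path: one matrix multiply
y_merged = matmul(merge_lora(W, A, B, lora_alpha=2, r=1), x)

# Unmerged path: base output plus the scaled adapter output
scale = 2 / 1
base_out = matmul(W, x)
adapter_out = matmul(A, matmul(B, x))
y_unmerged = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged)    # [[-1.0], [-2.0], [-3.0]]
print(y_unmerged)  # [[-1.0], [-2.0], [-3.0]]
```

Because the two paths agree, the merged model needs no adapter code at inference time and pays no extra latency for the fine-tune.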

Quantize to INT8 for edge:

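```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold: activation features above it are computed in FP16
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "./merged-model",
    quantization_config=quant_config,
    device_map="auto",
)
quantized_model.save_pretrained("./edge-model-int8")
# Copy the tokenizer next to the weights so the edge device loads both from one directory
AutoTokenizer.from_pretrained("./merged-model").save_pretrained("./edge-model-int8")
```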
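
For intuition about what INT8 storage does to the values, here is a deliberately simplified absmax quantizer in plain Python. It is a sketch of the general idea only and not how bitsandbytes implements `load_in_8bit`; `quantize_absmax_int8` and `dequantize` are hypothetical helpers and the sample values are made up. It also shows why outliers get special treatment: one large value stretches the scale and coarsens everything else.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


def max_roundtrip_error(values):
    """Largest absolute error after quantizing and dequantizing."""
    q, scale = quantize_absmax_int8(values)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


weights = [0.02, -0.15, 0.4, -0.31, 0.07]   # made-up sample values
q, scale = quantize_absmax_int8(weights)
max_error = max_roundtrip_error(weights)

# The same values plus a single outlier share a much coarser scale
error_with_outlier = max_roundtrip_error(weights + [8.0])

print(q, f"scale={scale:.5f}")
print(f"max error without outlier: {max_error:.5f}")
print(f"max error with outlier:    {error_with_outlier:.5f}")
```

The round-trip error is bounded by half the scale, so keeping the scale small for the bulk of the values is what preserves accuracy.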

Size comparison:

```text
Base model (FP16):      14.0 GB
LoRA adapter:             50 MB
Merged model (FP16):    14.0 GB
Quantized (INT8):        7.0 GB  ← deploy this
```
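
These figures follow from parameter count times bytes per parameter. The sketch below reproduces them under simplifying assumptions: a round 7 billion parameters, decimal gigabytes, and no allowance for quantization metadata or file overhead. `model_size_gb` is a hypothetical helper, and the adapter line uses the ~20M parameter figure from the comparison table.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_7B = 7e9
ADAPTER_PARAMS = 20e6  # ~20M trainable parameters, from the comparison table

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B model ({label}): {model_size_gb(PARAMS_7B, bits):5.1f} GB")
# 7B model (FP16):  14.0 GB
# 7B model (INT8):   7.0 GB
# 7B model (4-bit):  3.5 GB

for label, bits in [("FP16", 16), ("FP32", 32)]:
    print(f"Adapter ({label}): {model_size_gb(ADAPTER_PARAMS, bits) * 1000:.0f} MB")
# Adapter (FP16): 40 MB
# Adapter (FP32): 80 MB
```

The adapter lands in the tens of megabytes regardless of storage precision, which is consistent with the ~50MB figure quoted above.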

5. Serving the Edge Model

```python
# Edge inference with the quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "./edge-model-int8",   # quantization settings are read from the saved config
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./edge-model-int8")

def generate(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

6. Evaluating the LoRA Model Against the Base Model

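```python
import torch
from transformers import AutoModelForCausalLM

def generate_with(model, tokenizer, prompt, max_new_tokens=256):
    """Greedy decoding with the model passed in, so scores are repeatable."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the completion so the keyword cannot match the prompt itself
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def evaluate_model(model, tokenizer, eval_dataset):
    """Fraction of examples whose completion contains the expected keyword."""
    correct = 0
    for example in eval_dataset:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
        if example["expected_keyword"].lower() in generate_with(model, tokenizer, prompt).lower():
            correct += 1
    return correct / len(eval_dataset)

# merge_and_unload() modifies base_model in place, so reload an untouched base for the baseline
fresh_base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto"
)
# eval_set: held-out examples with "instruction" and "expected_keyword" fields
base_score = evaluate_model(fresh_base, tokenizer, eval_set)
lora_score = evaluate_model(merged_model, tokenizer, eval_set)

print(f"Base model:  {base_score:.2%}")
print(f"LoRA tuned:  {lora_score:.2%}")
print(f"Improvement: {(lora_score - base_score) * 100:+.1f}pp")
```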
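
The scoring logic can be exercised without a GPU by swapping the model for a stand-in function. The sketch below is a hypothetical dry run: `keyword_accuracy`, `stub_base`, and `stub_tuned` are invented for illustration, and the two evaluation records (including their expected keywords) are made-up examples of the shape an evaluation set could take.

```python
def keyword_accuracy(generate_fn, eval_set):
    """Fraction of examples whose generated text contains the expected keyword."""
    hits = sum(
        ex["expected_keyword"].lower() in generate_fn(ex["instruction"]).lower()
        for ex in eval_set
    )
    return hits / len(eval_set)


# Invented evaluation records: one instruction and one keyword each
eval_set = [
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "expected_keyword": "throttling"},
    {"instruction": "Analyze this sensor reading: temp=40C, pressure=1.0bar",
     "expected_keyword": "normal"},
]


def stub_base(instruction):
    """Stand-in for an untuned model that gives a generic answer."""
    return "I am not sure how to interpret this reading."


def stub_tuned(instruction):
    """Stand-in for a tuned model that handles only the overheating case."""
    if "85C" in instruction:
        return "Recommend thermal throttling."
    return "Unclear reading."


base = keyword_accuracy(stub_base, eval_set)
tuned = keyword_accuracy(stub_tuned, eval_set)
print(f"Base model:  {base:.2%}")                       # Base model:  0.00%
print(f"LoRA tuned:  {tuned:.2%}")                      # LoRA tuned:  50.00%
print(f"Improvement: {(tuned - base) * 100:+.1f}pp")    # Improvement: +50.0pp
```

Keyword matching is a coarse metric: it rewards any mention of the keyword, so it works best as a quick regression check alongside manual review of sample outputs.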

Summary

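| Step | What Happens | Output Size |
|---|---|---|
| QLoRA training | Train ~0.3% of params on a 4-bit base | 50MB adapter |
| Merge | Fold adapter into base weights | 14GB (FP16) |
| Quantize | Compress to INT8 for edge | 7GB (INT8) |
| Deploy | Serve on an RTX 3060 (12GB) or a Jetson-class module with enough memory (the 4GB Jetson Nano cannot hold 7GB) | 7GB model |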

LoRA isn't a compromise: for domain adaptation it gets close to full fine-tuning while training under 1% of the parameters and using a fraction of the VRAM. The adapter is tiny, mergeable, and swappable at runtime, which is why it has become a standard approach for production LLM customization.
