What rank (r) should I choose for LoRA?

Start with r=16 for most tasks. Use r=8 for simple domain adaptation (style transfer, format changes) and r=64 for complex reasoning tasks (math, code). Higher rank = more capacity but more parameters to train and merge.

Can I merge multiple LoRA adapters at runtime?

Yes, as long as the adapters stay unmerged. vLLM and Hugging Face PEFT can load several LoRA adapters on top of one base model and select which one is active for a given request. This lets you serve one base model with multiple domain-specific personalities.

Does LoRA work for non-LLM models?

Yes. LoRA applies to any model built from linear and attention layers, including Stable Diffusion, Whisper, and ViT. The principle is the same: freeze the base weights and train low-rank adapters, most commonly on the attention projections.

How much VRAM do I need for LoRA fine-tuning a 7B model?

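With 4-bit quantization (QLoRA), you can fine-tune a 7B model on a single 16GB GPU such as a T4; 24GB cards like the RTX 3090/4090 leave more headroom for batch size and sequence length. Without quantization, the FP16 weights alone take about 14GB, so plan for 24GB or more once optimizer states and activations are added.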

LoRA Fine-Tuning for Edge Deployment: Shrink, Quantize, Ship

Full fine-tuning is expensive and wasteful when you only need domain adaptation. LoRA trains well under 1% of the parameters; the merged model is then quantized to INT8 so it can run on edge hardware.

Advanced

Rajath Kumar, Edge AI Engineer & Founder, Analog Data

2026-06-27 · 13 min read

Why LoRA Instead of Full Fine-Tuning

Full fine-tuning of a 7B model updates all 7 billion parameters. LoRA (Low-Rank Adaptation) updates roughly 10-20 million, about 0.1-0.3% of the model, and achieves comparable results for domain adaptation.

| Approach | Trainable Params | VRAM (7B) | Disk Size | Quality |
| --- | --- | --- | --- | --- |
| Full fine-tuning | 7B | 80GB+ | 14GB | Best |
| LoRA (r=16) | ~20M | 16GB | 50MB adapter | Near-identical |
| QLoRA (4-bit + LoRA) | ~20M | 6GB | 50MB adapter | Slightly below LoRA |

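The math: instead of updating the full weight matrix $W$, LoRA learns two small matrices $A$ and $B$ such that $W' = W + AB$. If $W$ is $d \times d$ and the rank is $r = 16$, then $A$ is $d \times 16$ and $B$ is $16 \times d$, which reduces the trainable parameters from $d^2$ to $2 \times d \times 16$. In practice the product $AB$ is also multiplied by a scaling factor $\alpha / r$ (the lora_alpha setting in the configuration below).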
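To make the parameter arithmetic concrete, here is a minimal sketch that counts trainable parameters for a single square weight matrix. The hidden size d = 4096 is an assumption (it matches the attention projections of Llama-2-7B), and the function name is illustrative only.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x d weight matrix."""
    full = d * d      # every entry of W is trainable
    lora = 2 * d * r  # A is d x r, B is r x d
    return full, lora


# Assumed hidden size of 4096 with the rank used in this post
full, lora = lora_param_counts(d=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

For these assumed values the adapter has 131,072 parameters per matrix instead of 16,777,216, a 128x reduction.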

1. Setting Up QLoRA Fine-Tuning

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset

# Load the base model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the 4-bit data type QLoRA uses
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches bf16=True in training; use float16 on pre-Ampere GPUs
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Freeze the quantized base weights and prepare the model for adapter training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                              # rank (start with 16)
    lora_alpha=32,                     # scaling factor (typically 2x rank)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params for this config: 32 layers x 4 modules x 2 x 4096 x 16 = 16,777,216.
# The reported total and percentage depend on how your PEFT version counts packed 4-bit weights.

Which modules to target?

Minimal: ["q_proj", "v_proj"] — fastest, good for simple tasks
Standard: ["q_proj", "v_proj", "k_proj", "o_proj"] — balanced
Aggressive: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] — best quality, more params
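To get a feel for what each preset costs, the sketch below counts the LoRA parameters it adds at r=16. It assumes Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers, no grouped-query attention); other models will give different totals, and the names `SHAPES`, `PRESETS`, and `trainable_params` are made up for this example.

```python
# Assumed Llama-2-7B shapes: hidden size 4096, MLP size 11008, 32 decoder layers.
HIDDEN, MLP, LAYERS = 4096, 11008, 32

# (in_features, out_features) of each projection a LoRA adapter can wrap
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

PRESETS = {
    "minimal": ["q_proj", "v_proj"],
    "standard": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "aggressive": ["q_proj", "v_proj", "k_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
}


def trainable_params(modules, r=16):
    """LoRA adds r * (in + out) parameters per wrapped projection, per layer."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_layer * LAYERS


for name, modules in PRESETS.items():
    print(f"{name:<10} {trainable_params(modules):>12,}")
```

Under these assumptions the presets come to roughly 8.4M, 16.8M, and 40.0M trainable parameters, so the aggressive preset trains almost five times as many weights as the minimal one.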

2. Preparing the Dataset

python

from datasets import Dataset

# Format: instruction → response
data = [
    {"instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
     "response": "Temperature exceeds safe operating range (80C). Recommend thermal throttling..."},
    # ... more examples
]

# Flatten each pair into a single "text" field using the prompt template
dataset = Dataset.from_list([
    {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
    for d in data
])

Dataset size guidelines:

Style/format adaptation: 500-1,000 examples
Domain knowledge injection: 2,000-10,000 examples
New task capability: 10,000-50,000 examples
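Whatever the dataset size, the prompt used at inference has to match the template used in training. One way to keep the two in sync is to build both from a single template, as in this minimal sketch. The helper names `build_prompt` and `build_training_text` are hypothetical, not part of any library.

```python
# Single source of truth for the template used in training and inference
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(instruction: str) -> str:
    """Prompt sent to the model at inference time (ends where the response begins)."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def build_training_text(instruction: str, response: str) -> str:
    """Full training example: the inference prompt followed by the target response."""
    return build_prompt(instruction) + response


example = {
    "instruction": "Analyze this sensor reading: temp=85C, pressure=1.2bar",
    "response": "Temperature exceeds safe operating range (80C).",
}
text = build_training_text(example["instruction"], example["response"])
print(text)
```

Because the training text is built from the inference prompt, a change to the template cannot drift between the two code paths.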

3. Training

python

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

training_config = SFTConfig(
    output_dir="./lora-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size = 16
    learning_rate=2e-4,                 # LoRA needs higher LR than full FT
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                          # use bf16 on Ampere+, fp16 on older
    optim="paged_adamw_8bit",           # 8-bit optimizer saves VRAM
    max_seq_length=2048,                # recent TRL releases rename this to max_length
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,                        # already wrapped by get_peft_model, so no peft_config here
    train_dataset=dataset,
    tokenizer=tokenizer,                # recent TRL releases rename this to processing_class
    args=training_config,
)

trainer.train()

# Save only the adapter (tiny, about 50MB)
trainer.save_model("./lora-adapter")

Key hyperparameters explained:

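learning_rate=2e-4: LoRA is typically trained with a learning rate about 10x or more above what full fine-tuning uses, because only the small, freshly initialized adapter weights are updated
lora_alpha=32: Scales the adapter's contribution. The update is multiplied by alpha / rank, so the common heuristic alpha = 2 * rank gives a scale of 2
paged_adamw_8bit: Stores optimizer states in 8-bit and pages them to CPU memory during memory spikes, which reduces out-of-memory errors on long sequences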
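The settings above interact, so it can help to work out the resulting schedule before launching a run. The sketch below is a rough single-GPU estimate. The dataset size of 2,000 examples is an assumed figure, the function names are illustrative, and the Trainer's own rounding may differ slightly.

```python
import math


def training_schedule(num_examples, per_device_batch=4, grad_accum=4,
                      epochs=3, warmup_ratio=0.03):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update: lora_alpha / r."""
    return lora_alpha / r


# Assumed dataset of 2,000 examples with the hyperparameters from this section
print(training_schedule(num_examples=2000))
print(lora_scale(lora_alpha=32, r=16))
```

With these assumptions the run takes 375 optimizer steps, 12 of them warmup, and the adapter update is scaled by 2.0.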

4. Merging and Quantizing for Edge Deployment

For edge deployment, merge the LoRA adapter into the base model, then quantize to INT8:

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model unquantized in FP16 so the adapter can be merged into it
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()  # folds the scaled A×B update into W

# Save the merged model together with its tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

Quantize to INT8 for edge:

python

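from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # activations above this magnitude are treated as outliers and kept in FP16
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "./merged-model",
    quantization_config=quant_config,
    device_map="auto",
)
quantized_model.save_pretrained("./edge-model-int8")

# Copy the tokenizer too, so the edge directory is self-contained
AutoTokenizer.from_pretrained("./merged-model").save_pretrained("./edge-model-int8")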

Size comparison:

text

Base model (FP16):      14.0 GB
LoRA adapter:             50 MB
Merged model (FP16):    14.0 GB
Quantized (INT8):        7.0 GB  ← deploy this
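These figures follow directly from the parameter count times the bytes per parameter. Here is a back-of-the-envelope sketch, assuming a round 7 billion parameters and ignoring file metadata and any quantization overhead; the function name is illustrative.

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata and overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # nominal "7B"; an assumed round figure

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:<6} {model_size_gb(NUM_PARAMS, bits):.1f} GB")
```

The same arithmetic explains why the 4-bit base used during QLoRA training fits on much smaller GPUs than the FP16 model: about 3.5 GB of weights instead of 14 GB.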

5. Serving the Edge Model

python

# Edge inference with the quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "./edge-model-int8",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./edge-model-int8")

def generate(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
        )
    # Return only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

6. Evaluating the LoRA-Tuned Model Against the Base Model

python

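import torch
from transformers import AutoModelForCausalLM

def evaluate_model(model, tokenizer, eval_dataset):
    """Return the fraction of examples whose response contains the expected keyword."""
    correct = 0
    for example in eval_dataset:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Greedy decoding keeps the scores reproducible
            outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        # Score only the generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)
        if example["expected_keyword"].lower() in response.lower():
            correct += 1
    return correct / len(eval_dataset)

# eval_set: held-out examples with "instruction" and "expected_keyword" fields (not shown).
# merge_and_unload() folds the adapter into base_model in place, so reload the
# original checkpoint to get a true baseline.
baseline_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto"
)
base_score = evaluate_model(baseline_model, tokenizer, eval_set)
lora_score = evaluate_model(merged_model, tokenizer, eval_set)

print(f"Base model:  {base_score:.2%}")
print(f"LoRA tuned:  {lora_score:.2%}")
print(f"Improvement: {(lora_score - base_score) * 100:+.1f}pp")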
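The keyword check itself can be separated from generation so that it can be tested without a GPU. The sketch below is one way to do that. The function name `keyword_accuracy` and the canned responses are hypothetical and only illustrate the metric.

```python
def keyword_accuracy(responses, expected_keywords):
    """Fraction of responses containing their expected keyword (case-insensitive)."""
    if len(responses) != len(expected_keywords):
        raise ValueError("responses and expected_keywords must have the same length")
    if not responses:
        return 0.0
    hits = sum(
        keyword.lower() in response.lower()
        for response, keyword in zip(responses, expected_keywords)
    )
    return hits / len(responses)


# Canned, made-up responses standing in for real model output
responses = [
    "Temperature exceeds the safe range; recommend thermal throttling.",
    "All readings nominal.",
]
keywords = ["Throttling", "overpressure"]
print(f"{keyword_accuracy(responses, keywords):.0%}")
```

Keyword matching is a coarse proxy: it rewards any response that mentions the term, so it is best read alongside a manual review of a sample of outputs.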

Summary

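| Step | What Happens | Output Size |
| --- | --- | --- |
| QLoRA training | Train ~0.3% of params on a 4-bit base | 50MB adapter |
| Merge | Fold adapter into base weights | 14GB (FP16) |
| Quantize | Compress to INT8 for edge | 7GB (INT8) |
| Deploy | Serve on a device with more than 8GB of memory (e.g. Jetson Orin NX 16GB, RTX 3060 12GB) | 7GB model |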

LoRA isn't much of a compromise: for domain adaptation it comes close to full fine-tuning at a small fraction of the training cost. The adapter is tiny, mergeable, and, while left unmerged, swappable at runtime. That's why it has become a standard approach for production LLM customization.
