PEFT Methods Explained: LoRA, QLoRA, and Adapter-Based Fine-Tuning

Introduction

Imagine you hire an expert consultant who already knows everything about language, writing, and general knowledge. You want them to focus specifically on your company's products and tone of voice. You do not retrain them from scratch; that would take years and cost a fortune. Instead, you give them a short briefing and some examples, and they quickly adapt.

Fine-tuning a large language model works the same way. Fine-tuning means taking a pre-trained model, one that already understands language, and training it further on your specific data so it behaves the way you want. The problem is that doing this the traditional way, called full fine-tuning, is extremely expensive.

For a 7B parameter model, full fine-tuning can require 80GB or more of GPU memory and days of compute time. For a 70B model, you need multiple high-end GPUs running in parallel. This put meaningful customisation out of reach for most teams.

Parameter-Efficient Fine-Tuning (PEFT) changed this completely. Instead of updating every parameter in the model, PEFT methods update only a tiny fraction, or add small trainable modules on top of the frozen base model. The result: you can fine-tune a large LLM on a single consumer GPU while achieving performance close to full fine-tuning.

Problem Statement

Full fine-tuning updates every weight in the model. For modern LLMs, this creates four serious problems.

First, memory requirements scale directly with model size. Storing the weights of Llama 2 7B alone requires approximately 28 GB of VRAM in FP32 (single precision), or approximately 14 GB in the more common FP16 (half precision). Training also needs gradients and optimizer states; with Adam, these can triple or quadruple the total, so you need 80 GB or more just to begin training.
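
The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate only, under the assumption that gradients and both Adam moment buffers are stored at the same precision as the weights and that activation memory is ignored; the function names are illustrative, not from any library.

```python
# Rough memory estimate for full fine-tuning with Adam.
# Assumption: activations and framework overhead are ignored, so real
# usage will be higher than these figures.

GB = 1e9  # decimal gigabytes, matching the rounded figures in the text


def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param / GB


def full_finetune_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Weights + gradients + two Adam moment buffers at one precision."""
    weights = weights_gb(n_params, bytes_per_param)
    gradients = weights          # one gradient value per weight
    adam_states = 2 * weights    # first and second moment estimates
    return weights + gradients + adam_states


n = 7e9  # a 7B-parameter model
print(f"FP32 weights: {weights_gb(n, 4):.0f} GB")
print(f"FP16 weights: {weights_gb(n, 2):.0f} GB")
print(f"FP32 full fine-tuning, no activations: {full_finetune_gb(n):.0f} GB")
```

Under these assumptions the training total is four times the weight memory, which is where the "triple or quadruple" multiplier comes from.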

Second, training time is significant. Full fine-tuning involves backpropagation through billions of parameters on every training step, making experimentation slow and iteration expensive.

Third, catastrophic forgetting is a real risk. The model can lose its general language abilities while specialising on your data, the way an expert who studies too narrowly may start forgetting common sense reasoning.

Fourth, you must store a separate, full copy of the model for every fine-tuned variant. If you need 10 task-specific models, you store 10 full copies, each as large as the original.

PEFT methods address all four of these problems by training a tiny fraction of the total parameters while keeping everything else frozen.

Core Concepts and Terminology

Term	Definition	Relevance
Full Fine-Tuning	Training all parameters of a pre-trained model on a new dataset	Baseline approach; high quality but prohibitively expensive in GPU memory and time
PEFT (Parameter-Efficient Fine-Tuning)	A family of techniques that update only a small fraction of parameters or add lightweight trainable modules	Enables fine-tuning on consumer hardware with minimal quality loss
LoRA (Low-Rank Adaptation)	A PEFT method that approximates weight updates using two small low-rank matrices instead of updating the full weight matrix	Most widely used PEFT technique; can reduce trainable parameters by 100 to 400 times
Rank (r)	The dimensionality of the low-rank decomposition in LoRA; controls the capacity of the adapter	Lower rank saves more memory; higher rank improves task expressiveness. Typical values are 4, 8, 16, or 32
QLoRA (Quantized LoRA)	LoRA applied on top of a 4-bit quantised base model, dramatically reducing memory usage	Makes fine-tuning of 65B+ models feasible on a single GPU
Quantisation	Representing model weights in a lower-precision numeric format to reduce memory consumption	4-bit quantisation cuts base model memory by about 4x relative to 16-bit, with negligible quality loss when combined with LoRA
Adapter	A small bottleneck neural network module inserted inside transformer layers; only the adapter is trained	An earlier PEFT approach that adds inference overhead unless modules are fused, now largely superseded by LoRA
Catastrophic Forgetting	The tendency of a neural network to lose previously learned knowledge when trained on new data	Full fine-tuning risks this; PEFT mitigates it by freezing most of the model
Adapter Merging	Combining trained LoRA adapter weights back into the base model weights after fine-tuning	Produces a single deployable model with zero additional inference overhead

How It Works

Figure: The full Transformer architecture, showing the encoder (left) and decoder (right) with their stacked layers of Multi-Headed Self-Attention and Feed-Forward Networks. LoRA targets the weight matrices inside the attention projection layers (Q, K, V), freezing the original weights and adding small trainable low-rank matrices alongside them. Source: dvgodoy / Wikimedia Commons (CC BY 4.0)

To understand LoRA, start with a concrete analogy. Imagine a very large painting. You want to modify it slightly, change the lighting, adjust the colour palette, but you do not want to repaint every square inch. Instead, you place a thin transparent overlay on top of the original canvas and paint only the changes you want onto the overlay. The original painting stays intact. Only the overlay is new.

LoRA works exactly this way with weight matrices inside a transformer model.

Identify the target layers. LoRA is typically applied to the attention projection layers inside each transformer block, the query, key, value, and output projections. These are the parts of the model that learn what to pay attention to, and they are where most task-specific adaptation happens.
Freeze the original weights. The existing weight matrices in those layers are frozen. No gradients are computed for them and the optimizer never updates them, so they remain exactly as they were after pre-training.
Add the low-rank decomposition. For each target weight matrix of dimension D by K, LoRA introduces two much smaller matrices: one of size D by R and one of size R by K. The value R is the rank, a small number like 8 or 16. The product of these two small matrices is added to the frozen original weights during the forward pass.
Train only the small matrices. Backpropagation updates only the two low-rank matrices. Because R is much smaller than D and K, the number of trainable parameters drops dramatically. For a matrix of size 4096 by 4096 with rank 8, the parameter count falls from over 16 million (16,777,216) to 65,536, a 256-fold reduction.
Merge or keep separate. After training, you have a choice. You can merge the learned low-rank update back into the base model's original weights, producing a single model with no inference overhead. Or you can keep the adapter separate and load it on top of the base model at inference time, which is useful when serving multiple task-specific adapters from the same base model.
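
The steps above can be sketched with toy matrices in plain Python. This is a minimal illustration, not a training implementation: the helper names are made up for the example, the matrices are random stand-ins for trained values, and the alpha / r scaling factor follows the convention in the LoRA paper rather than anything stated in the steps themselves.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, r = 6, 5, 2          # toy sizes; the text's example uses d = k = 4096
alpha = 4                  # assumed scaling hyperparameter
scale = alpha / r

W = rand_matrix(d, k, rng)  # frozen pre-trained weight, D by K
B = rand_matrix(d, r, rng)  # trainable, D by R (random here to mimic a trained adapter)
A = rand_matrix(r, k, rng)  # trainable, R by K
x = rand_matrix(k, 1, rng)  # one input column vector

# Adapter kept separate: frozen path plus scaled low-rank path.
y_separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged: fold the update into the weights once, then a single matmul.
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_separate, y_merged))
print("largest difference between separate and merged outputs:", max_diff)


def lora_param_counts(d, k, r):
    """(full matrix parameters, LoRA parameters) for a D by K layer at rank R."""
    return d * k, d * r + r * k


full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

The two outputs agree up to floating-point rounding, which is why merging adds no inference cost: the merged layer is an ordinary weight matrix of the original shape.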

QLoRA follows the same steps but adds one more transformation: before attaching the LoRA adapters, it quantises the base model weights from 16-bit floating point to 4-bit precision. This cuts the memory required to store the frozen base model by roughly a factor of four, making it possible to fit a 65B parameter model into the memory of a single 48GB GPU. The LoRA adapter matrices themselves remain in 16-bit precision: they are doing the actual learning, so precision matters there.
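
A quick back-of-the-envelope check of the memory claim, as a sketch only. It assumes exactly 4 bits per weight and ignores the small per-block quantisation constants, the adapter itself, and activations; the function name is illustrative.

```python
def base_model_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the frozen base weights, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n in [("7B", 7e9), ("65B", 65e9)]:
    fp16 = base_model_gb(n, 16)
    int4 = base_model_gb(n, 4)
    print(f"{name}: 16-bit {fp16:.1f} GB -> 4-bit {int4:.1f} GB")
```

At 4 bits the 65B weights come to roughly 32.5 GB, which leaves headroom on a 48GB card for the adapter, optimizer states, and activations.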

Practical Example

Consider a company that wants to build a customer support assistant using an open-source 7B parameter model. Their goal is to make the model respond in their specific tone of voice, use their internal product terminology correctly, and follow their escalation procedures when asked about billing issues.

They collect 5,000 examples of ideal customer support conversations and prepare them for training. Full fine-tuning is ruled out immediately: the team's cloud budget cannot justify renting 80GB GPU instances for days at a time.

Instead, they use QLoRA. They load the base model in 4-bit precision, attach LoRA adapters to the attention projection layers with rank 16, and train for three epochs on their dataset. The total training time on a single 24GB consumer GPU is about four hours. The resulting adapter weighs only tens of megabytes, compared to the 14GB base model.
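
A setup like this one might look as follows with the Hugging Face transformers and peft libraries. Treat it as a sketch under assumptions rather than the team's actual code: the model identifier is left as a parameter, the module names assume a Llama-style architecture, and the dropout, NF4 quantisation type, and bfloat16 compute dtype are assumed values that the example does not specify.

```python
# Rank and target layers follow the example above; other values are assumptions.
LORA_HYPERPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,  # assumption, not stated in the text
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def build_qlora_model(model_id: str):
    """Load a 4-bit base model and attach LoRA adapters to it."""
    # Heavy dependencies are imported lazily so this module can be
    # imported on a machine without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumed quantisation type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**LORA_HYPERPARAMS))
    model.print_trainable_parameters()  # reports trainable vs total parameters
    return model
```

The returned model can then be passed to a standard training loop or trainer; only the adapter parameters will receive updates.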

During evaluation, they find the adapted model correctly uses their product names, matches their tone in 89% of test cases, and applies the escalation logic faithfully. Compared to a prompt-engineering-only approach that simply described the style in the system prompt, the fine-tuned version is significantly more consistent.

Before deployment, they merge the adapter back into the base model weights. The deployed model behaves essentially the same as the adapted version but has no additional inference overhead. To a user, it looks and feels like a vanilla 7B model, just one that happens to know exactly how to talk about their products.
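
The merge step could be sketched as below, again using the peft library. The paths and the function name are hypothetical, and the sketch assumes the base model is reloaded in 16-bit for merging rather than merged in its 4-bit form, which is a common approach but means outputs may differ very slightly from those seen during QLoRA training.

```python
def merge_adapter(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Fold a trained LoRA adapter into a 16-bit copy of the base model."""
    # Lazy imports keep this module importable without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Assumption: reload the base in 16-bit instead of 4-bit before merging.
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # plain model with the update folded in
    merged.save_pretrained(output_dir)
```

The saved directory then loads like any ordinary checkpoint, with no peft dependency at inference time.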

Advantages

Dramatic memory reduction enables fine-tuning on accessible hardware. LoRA reduces trainable parameter count by 100 to 400 times. QLoRA additionally cuts base model memory by 4x, making large-model fine-tuning possible on a single consumer GPU. Teams that could not afford full fine-tuning can now iterate rapidly.
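Fast iteration speeds up experimentation. Because gradients and optimizer states are kept only for the small adapter, each training step is cheaper and checkpoints are tiny, so PEFT runs typically finish faster than full fine-tuning on the same hardware. The forward and backward passes still traverse the whole model, so the speed-up is smaller than the reduction in trainable parameters. Teams can still run multiple experiments in the time full fine-tuning would take for a single run.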
Multiple adapters from one base model reduce storage costs. Each task-specific LoRA adapter is a small file, typically tens to hundreds of megabytes, compared to the gigabytes required for a full model copy. You can maintain adapters for dozens of use cases without needing to store full model copies for each one.
Merging eliminates all inference overhead. Once a LoRA adapter is merged into the base model weights, the resulting model runs at identical speed to the unmodified base model. There is no latency penalty for having fine-tuned the model.
Reduced catastrophic forgetting risk. Because the base model weights remain frozen throughout PEFT training, the model retains its general language capabilities. Only the small adapter parameters are updated, which limits how much the model can drift from its original behaviour.

Limitations and Trade-offs

Low rank limits adapter expressiveness. If the task requires very large changes to the model's behaviour, for instance, adapting a general-purpose model for a highly specialised scientific domain with unusual reasoning patterns, a low-rank adapter may not have enough capacity. Higher ranks help but reduce the memory savings.
PEFT cannot deeply embed new factual knowledge. LoRA adapters are good at changing how the model behaves but not at injecting large amounts of new factual information. For knowledge-heavy tasks, RAG is more appropriate. Trying to fine-tune factual knowledge into the weights often results in memorisation without generalisation.
Overfitting is still possible on small datasets. Despite training far fewer parameters, PEFT models can overfit if the training dataset is too small or not diverse enough. Validation monitoring and early stopping are still necessary.
QLoRA's 4-bit quantisation introduces a small quality ceiling. The base model stays quantised for the whole of QLoRA training, so the adapter learns on top of 4-bit weights. For most tasks the quality difference versus 16-bit fine-tuning is negligible, but for tasks requiring maximum precision, such as complex mathematical reasoning, the quantisation may impose a small but measurable performance gap.
Adapter management adds operational overhead. If you maintain many adapters for different tasks or user groups, you need a system to track, version, and load the right adapter at inference time. This is manageable but requires infrastructure investment.

Common Mistakes

Setting rank too low and then attributing failure to PEFT itself. If the model is not learning the task, the first thing to check is the rank. A rank of 4 may not have enough capacity for complex tasks. Try 16 or 32 before concluding that fine-tuning is insufficient.
Fine-tuning when RAG would be the better choice. Teams sometimes fine-tune models to learn specific facts: product catalog data, recent news, frequently changing policies. This is the wrong tool. Facts that change frequently should be retrieved at runtime using RAG, not baked into model weights that require retraining every time the data updates.
Not monitoring general benchmarks during training. Focusing only on task-specific metrics during training can mask catastrophic forgetting. If the model scores well on your evaluation set but starts giving bizarre answers to simple questions, the adapter is eroding general capabilities. Track a general benchmark throughout training.
Skipping adapter merging before deployment. Some teams deploy the adapter and base model separately, loading the adapter at inference time. This adds a small overhead per request. For latency-sensitive production systems that serve a single fine-tuned variant, merge the adapter before deployment; keep adapters separate only when you need to serve several of them from one base model.
Using a learning rate that is too high. PEFT adapters are small and sensitive to the learning rate. LoRA usually tolerates, and often needs, a higher learning rate than full fine-tuning, but pushing well beyond the usual range can cause training instability. Start around 2e-4 and adjust from there.

Best Practices

Start with QLoRA for any model larger than 7B. The memory savings are substantial and quality is nearly identical to 16-bit LoRA for most tasks.
Begin with rank 16 and alpha 32 as default LoRA hyperparameters. Adjust based on observed training dynamics rather than guessing upfront.
Apply LoRA adapters to at minimum the query, key, value, and output projection layers. For difficult tasks, extend to the feed-forward layers and observe whether quality improves.
Monitor both your task-specific metric and a general benchmark like MMLU throughout training to catch quality regression on general capabilities.
Use early stopping and a held-out validation set. Even with fewer trainable parameters, overfitting on small datasets is common.
After training is complete, merge the adapter into the base model before deployment to eliminate all inference overhead.
Use the Hugging Face PEFT library for adapter management. It handles the boilerplate of adapter loading, saving, and merging reliably and is well-maintained.
If you need to serve multiple task-specific adapters simultaneously from one base model, evaluate multi-LoRA serving frameworks before building a custom solution.
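
The early-stopping advice above can be sketched as a small helper that watches validation loss. The class name, the patience value, and the loss values are made up for illustration; most training frameworks ship an equivalent callback.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1       # no improvement this evaluation
        return self.bad_evals >= self.patience


# Made-up validation losses: improvement, then a slow drift upwards.
losses = [1.20, 0.95, 0.90, 0.91, 0.92, 0.93, 0.94]
stopper = EarlyStopping(patience=3)
stopped_at = None
for step, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break
print(f"stopped at evaluation {stopped_at}, best validation loss {stopper.best}")
```

The same pattern extends to the general-benchmark check: track a second metric alongside the task loss and stop, or roll back, when it regresses.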

Comparison: PEFT Methods vs Alternatives

Method	Parameters Trained	Memory Usage	Inference Overhead	Quality vs Full Fine-Tuning	Best Use Case
Full Fine-Tuning	100%	Very High	None	Baseline	Maximum quality needed, large compute budget, significant domain shift
LoRA	0.1 to 1%	Low	None if merged	Near-identical	General-purpose fine-tuning, fast iteration, limited GPU memory
QLoRA	0.1 to 1%	Very Low	None if merged	Near-identical	Large models on single GPU, consumer hardware, budget-constrained projects
Adapter Layers	0.5 to 2%	Low	5 to 10% latency increase	Near-identical	Multi-task scenarios where adapters are swapped at runtime
Prefix Tuning	Under 0.1%	Very Low	Small (the learned prefix is processed with every request)	Slightly lower on complex tasks	Extremely constrained resources, simple behavioural adjustments
RAG (not fine-tuning)	None	N/A	Retrieval latency per request	Does not change model behaviour	Dynamic knowledge, frequently changing data, citation requirements

Frequently Asked Questions

How do I choose the right LoRA rank?

Start with rank 16 for most tasks. If the model is not converging or the quality plateau is lower than expected, increase to 32. If memory is very constrained, try rank 8. The rank controls how much capacity the adapter has to capture task-specific patterns: higher is more expressive but uses more memory. In practice, the difference between rank 8 and rank 32 is often small, so choose based on your memory constraints first.
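
To make the memory side of this trade-off concrete, here is a rough count of trainable parameters at each rank. It is a sketch that assumes a Llama 2 7B-like shape (hidden size 4096, 32 layers, four square attention projections adapted per layer) and 16-bit adapter storage; other architectures will give different numbers.

```python
def lora_trainable_params(rank, hidden=4096, layers=32, matrices_per_layer=4):
    """Trainable LoRA parameters when adapting square hidden x hidden projections."""
    per_matrix = 2 * hidden * rank  # D*R + R*K with D = K = hidden
    return per_matrix * matrices_per_layer * layers


for rank in (8, 16, 32):
    n = lora_trainable_params(rank)
    size_mb = n * 2 / 1e6  # 2 bytes per parameter at 16-bit
    print(f"rank {rank:>2}: {n / 1e6:.1f}M trainable parameters, ~{size_mb:.0f} MB")
```

Adapter size grows linearly with rank, but even at rank 32 it remains a tiny fraction of the base model under these assumptions.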

Can I use PEFT with any base model?

LoRA and QLoRA work with any transformer-based model. The Hugging Face PEFT library supports the most common architectures including Llama, Mistral, Falcon, GPT-2, and many others. For models not natively supported, the library provides a configuration system to specify which layers to target manually.

How does PEFT compare to prompt engineering?

Prompt engineering modifies the input to guide the model's behaviour without changing its weights. It is fast and requires no training, but it is limited in how much it can change the model's output style, domain vocabulary, or reasoning patterns. PEFT modifies the model itself, which produces more consistent and deeply embedded behavioural changes. For production applications with consistent requirements, fine-tuning typically outperforms prompt engineering alone.

Is PEFT appropriate for instilling safety guardrails?

PEFT can help teach a model to refuse certain request types or follow specific response protocols, but it should not be the only safety mechanism. Models fine-tuned with PEFT can have their guardrails bypassed through adversarial prompts. Safety-critical applications need defence in depth: fine-tuning, input filtering, output validation, and human review working together.

What happens to the base model after PEFT training?

Nothing. The base model weights are frozen throughout PEFT training and remain unchanged. When you merge a LoRA adapter, the merge is a mathematical combination of the frozen base weights and the learned low-rank updates, producing a new set of weights. The original base model is untouched and can be reused for other adapters.

References

Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
Houlsby, N., et al. (2019). Parameter-Efficient Transfer Learning for NLP. ICML 2019.
Lester, B., et al. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.
Hugging Face PEFT Library Documentation

Key Takeaways

Full fine-tuning is memory-prohibitive for most teams. PEFT methods like LoRA and QLoRA achieve comparable quality while training fewer than 1% of the model's parameters.
LoRA works by decomposing weight updates into two small low-rank matrices, reducing trainable parameters by 100 to 400 times with minimal quality loss.
QLoRA combines LoRA with 4-bit base model quantisation, making fine-tuning of 65B parameter models feasible on a single 48GB GPU and of smaller models feasible on consumer GPUs.
LoRA adapters can be merged back into the base model after training, eliminating all inference overhead. Merge before deploying to production unless you need to serve several adapters from one base model.
PEFT changes how the model behaves; RAG provides dynamic knowledge. Most production systems benefit from using both together.
The Hugging Face PEFT library is the standard tooling for LoRA and QLoRA; use it rather than implementing these techniques from scratch.