Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning of a 70B model needs hundreds of gigabytes of GPU memory and produces a 70B-sized artifact per task. PEFT methods fine-tune <1% of parameters, get within a few points of full fine-tuning quality, and produce checkpoints small enough to store thousands of them.

Prompt tuning

The Power of Scale for Parameter-Efficient Prompt Tuning (Lester et al., 2021) freezes the entire model and learns a small set of continuous prompt embeddings prepended to the input:

$$[\underbrace{p_1, \dots, p_k}_{\text{learned}},\; x_1, \dots, x_n]$$

Only the $k \times d$ prompt parameters are trained. At small scale this loses to full fine-tuning, but the gap closes as the base model grows — at 10B+ parameters, prompt tuning matches full FT on most benchmarks.
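
To make this concrete, here is a minimal PyTorch sketch of the idea: the base model is frozen and only a $k \times d$ block of prompt embeddings is trained. The `base_model` interface (a model that accepts `inputs_embeds`) and the initialisation scale are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class PromptTuning(nn.Module):
    """Prepend k learned prompt embeddings to the input; train only those."""

    def __init__(self, base_model: nn.Module, embed_dim: int, k: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False                      # freeze the whole model
        # The only trainable parameters: a k x d block of prompt embeddings.
        self.prompt = nn.Parameter(torch.randn(k, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor):
        # input_embeds: (batch, n, d) embeddings of the real tokens x_1..x_n
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        # The model sees [p_1..p_k, x_1..x_n]; gradients flow only into self.prompt.
        return self.base_model(inputs_embeds=torch.cat([prefix, input_embeds], dim=1))
```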

Adapters

Parameter-Efficient Transfer Learning for NLP (Houlsby et al., 2019) — the original adapters paper — inserts a small bottleneck MLP after each Transformer sub-layer:

$$\mathrm{Adapter}(h) = h + W_{\text{up}}\,\sigma(W_{\text{down}}\, h)$$

with $W_{\text{down}} \in \mathbb{R}^{r \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and $r \ll d$. Only the adapter weights train; the rest of the model is frozen. Inference cost rises slightly because adapters add layers; LoRA fixes that.
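
A minimal sketch of one such bottleneck module in PyTorch; the choice of GELU and the bottleneck size $r$ are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Houlsby-style bottleneck: Adapter(h) = h + W_up * sigma(W_down * h)."""

    def __init__(self, d: int, r: int = 16):
        super().__init__()
        self.down = nn.Linear(d, r)            # W_down: d -> r
        self.up = nn.Linear(r, d)              # W_up:   r -> d
        nn.init.zeros_(self.up.weight)         # start as a near-identity residual
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen sub-layer output intact at init.
        return h + self.up(F.gelu(self.down(h)))
```

One such module goes after the attention and feed-forward sub-layer of every block; the extra sequential matmuls are the small inference overhead mentioned above.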

LoRA

LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) — the workhorse. Instead of inserting new modules, decompose the update to each weight matrix as a low-rank product:

$$W' = W + \Delta W, \qquad \Delta W = BA, \qquad A \in \mathbb{R}^{r \times d},\ B \in \mathbb{R}^{d \times r}.$$

Train only $A, B$; freeze $W$. At inference time, $\Delta W$ can be merged back into $W$ — zero added latency. With $r=8$ on a 7B model, you fine-tune ~0.1% of parameters and get within ~1 point of full FT.
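
A self-contained PyTorch sketch of a LoRA-wrapped linear layer, including the merge step. The zero-initialised $B$, the $\alpha/r$ scaling, and the initialisation scale of $A$ follow common practice and are not prescribed by any single implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update Delta W = B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B in R^{d x r}, zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B A x  (Delta W is never materialised during training)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        """Fold Delta W back into W so inference has zero added latency."""
        self.base.weight += self.scale * (self.B @ self.A)
```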

LoRA is now the default for instruction tuning, RLHF/DPO post-training, and per-user personalisation. Variants:

  • QLoRA — load the base model in 4-bit (NF4) and train LoRA adapters in 16-bit precision; fits a 65B model on a single 48 GB GPU (see the sketch after this list).
  • DoRA — decomposes each weight into magnitude and direction; the magnitude is trained directly and only the directional update is low-rank.
  • rsLoRA — re-scales the update by α/√r instead of α/r, so performance keeps improving at higher ranks instead of plateauing.
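
For the QLoRA recipe specifically, the usual route is the Hugging Face `transformers` + `peft` + `bitsandbytes` stack. A hedged sketch, assuming those libraries are installed; the model id and `target_modules` are illustrative and should be matched to your architecture.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: 4-bit NF4 base weights, 16-bit compute, trainable LoRA adapters on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # illustrative model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% trainable
```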

Text-to-LoRA

Text-to-LoRA: Instant Transformer Adaption (Charakorn et al., 2025) trains a hypernetwork that, given a natural-language task description, generates a LoRA adapter in a single forward pass, with no per-task training data or gradient steps needed. The hypernetwork is trained on a library of pre-trained task-specific LoRAs; at inference, the user describes their task and the generated adapter is applied zero-shot. A glimpse of "skill libraries" for LLMs.
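
To make the hypernetwork idea concrete, here is a deliberately simplified PyTorch sketch that maps a task-description embedding to a single (A, B) pair. This is not the Text-to-LoRA architecture — the real system generates adapters per layer and is trained against a library of existing LoRAs — so treat every dimension and layer choice as a placeholder.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Toy hypernetwork: task-description embedding -> one LoRA (A, B) pair."""

    def __init__(self, text_dim: int, d: int, r: int = 8, hidden: int = 512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.to_A = nn.Linear(hidden, r * d)   # emits A in R^{r x d}
        self.to_B = nn.Linear(hidden, d * r)   # emits B in R^{d x r}
        self.d, self.r = d, r

    def forward(self, task_embedding: torch.Tensor):
        # task_embedding: (text_dim,) vector from any frozen text encoder
        h = self.trunk(task_embedding)
        A = self.to_A(h).view(self.r, self.d)
        B = self.to_B(h).view(self.d, self.r)
        return A, B  # plug into a LoRALinear-style layer; no gradient steps needed
```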

When to use what

  • Prompt tuning — best when the base model must stay frozen and your serving stack only lets you attach soft prompts (or prefix embeddings) to the input.
  • Adapters — historical interest; superseded by LoRA.
  • LoRA / QLoRA — default choice for any local fine-tune.
  • Text-to-LoRA — emerging; promising for personalisation at scale.

Reading list

  • The Power of Scale for Parameter-Efficient Prompt Tuning — Lester, Al-Rfou, Constant, EMNLP 2021.
  • Parameter-Efficient Transfer Learning for NLP — Houlsby et al., ICML 2019.
  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., ICLR 2022.
  • Text-to-LoRA: Instant Transformer Adaption — Charakorn et al., 2025.

Released under the MIT License. Content imported and adapted from NoteNextra.