CSE · Seminar 08 · Fine-tune billion-parameter models on one GPU

Parameter-Efficient Fine-Tuning (QLoRA)

QLoRA combines 4-bit quantisation with low-rank adapters, letting a large language model be fine-tuned on a single consumer GPU while updating under 1% of its parameters.

PEFT · LoRA · QLoRA · quantisation · NF4 · fine-tuning

Full fine-tuning of a large model updates every weight and needs many high-memory GPUs. Parameter-efficient fine-tuning (PEFT) freezes the pretrained weights and trains only a tiny set of new parameters. QLoRA pushes this further by also quantising the frozen model to 4 bits, so a 65B-parameter model can be adapted on a single 48 GB GPU.
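A rough weight-only estimate shows why 4-bit storage matters for the 65B case above. This is a minimal sketch: the helper `weight_memory_gb` is a hypothetical name, and the figures deliberately ignore activations, adapter weights, optimiser state and quantisation constants, so real usage will be higher.

```python
# Weight-only memory estimate. Ignores activations, adapter weights,
# optimiser state and quantisation constants (assumptions of this sketch).
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 65e9
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base model
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit base model

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone exceed a 48 GB card several times over, while the 4-bit weights leave headroom for adapters and activations.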

Working principle

LoRA builds on the hypothesis that the weight change during fine-tuning has low intrinsic rank, so it represents the update ΔW as the product B·A of two small matrices whose shared inner dimension r is much smaller than the dimensions of W. Only A and B are trained; the original W stays frozen. QLoRA adds three ideas: store W in a new 4-bit NormalFloat (NF4) format, apply double quantisation to compress the quantisation constants, and use paged optimisers to avoid memory spikes. During training, the 4-bit weights are dequantised to 16-bit for each forward and backward pass, and gradients flow through the frozen base into the 16-bit adapters.
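The idea of storing W in 4 bits and rebuilding it for computation can be sketched with blockwise absmax quantisation. This is an illustration under stated assumptions, not the NF4 implementation: the code book `LEVELS` here is a uniform 16-level grid chosen for simplicity, whereas NF4 derives its 16 levels from normal-distribution quantiles, and the function names are hypothetical. The block size of 64 is also just a default for this sketch.

```python
import numpy as np

# Hypothetical 16-level code book: uniform levels in [-1, 1].
# NF4 instead uses levels derived from normal-distribution quantiles.
LEVELS = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

def quantise_blockwise(w: np.ndarray, block_size: int = 64):
    """Map each weight to a 4-bit index plus one absmax scale per block."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block
    scales = np.where(scales == 0, 1.0, scales)         # avoid division by zero
    normalised = blocks / scales                        # values now lie in [-1, 1]
    # Nearest code-book entry for every weight; indices fit in 4 bits (0..15).
    idx = np.abs(normalised[..., None] - LEVELS).argmin(axis=-1).astype(np.uint8)
    return idx, scales.astype(np.float32)

def dequantise_blockwise(idx, scales, shape):
    """Rebuild approximate weights for use in the forward pass."""
    return (LEVELS[idx] * scales).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)).astype(np.float32)
idx, scales = quantise_blockwise(W)
W_hat = dequantise_blockwise(idx, scales, W.shape)
max_err = float(np.abs(W - W_hat).max())
print("indices fit in 4 bits:", bool(idx.max() <= 15))
print("largest reconstruction error:", max_err)
```

The per-block scales in `scales` are the quantisation constants; double quantisation compresses those constants in turn, which this sketch does not attempt.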

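[Figure 1 diagram: input x feeds two parallel paths, the frozen W (4-bit NF4) giving Wx and the trainable 16-bit B·A adapter giving ΔWx; the two are added (⊕) to give output h. Title: "QLoRA: frozen 4-bit base + trainable low-rank adapter".]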
Figure 1. Forward path h = Wx + (B·A)x. Only the small adapter matrices A, B are updated; the base weights stay frozen in 4-bit.
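The forward path in Figure 1 can be written out in a few lines of NumPy. This is a minimal sketch with illustrative sizes: `W` is an ordinary float matrix standing in for the dequantised 4-bit base weight, the function name `lora_forward` is hypothetical, and the scaling factor that LoRA implementations usually apply to the adapter output is omitted. The zero initialisation of B follows the LoRA paper, so the adapter contributes nothing before training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 4   # illustrative sizes; r is the adapter rank

W = rng.standard_normal((d_out, d_in))      # frozen base weight (stand-in for dequantised W)
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so B @ A = 0 at the start

def lora_forward(x, W, A, B):
    """h = W x + B (A x); the full d_out x d_in product B @ A is never formed."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
h0 = lora_forward(x, W, A, B)
assert np.allclose(h0, W @ x)               # untrained adapter leaves the base output unchanged

B = rng.standard_normal((d_out, r)) * 0.01  # pretend training has updated B
h1 = lora_forward(x, W, A, B)
assert np.allclose(h1, (W + B @ A) @ x)     # same result as the merged weight W + B A
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra cost proportional to r, which is the point of the low-rank factorisation.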
Table 1. Fine-tuning strategies compared
Method          Trainable params     Memory             Quality
Full fine-tune  100%                 Very high          Reference
LoRA            < 1%                 Moderate           ≈ full
QLoRA           < 1% + 4-bit base    Low (single GPU)   ≈ full (16-bit)
Key result: 4-bit QLoRA fine-tuning matched the quality of 16-bit full fine-tuning on instruction tasks, bringing LLM customisation within reach of commodity hardware.
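The "< 1%" trainable-parameter figure in Table 1 can be sanity-checked for a single weight matrix. The layer size (4096 × 4096) and rank (16) below are hypothetical values picked for illustration, as is the helper name; the fraction for a whole model also depends on which layers receive adapters.

```python
def lora_param_fraction(d_out: int, d_in: int, r: int) -> float:
    """Adapter parameters (A: r x d_in, B: d_out x r) over full-matrix parameters."""
    return r * (d_in + d_out) / (d_out * d_in)

# Hypothetical layer size and rank, chosen only for illustration.
frac = lora_param_fraction(4096, 4096, 16)
print(f"adapter / full matrix = {frac:.4f}")
```

Because the adapter grows linearly in r while the full matrix grows with d_out × d_in, the fraction shrinks further as layers get wider.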

Applications

  • Domain adaptation (legal, medical, code) of open-weight LLMs
  • On-device and private fine-tuning where data cannot leave the organisation
  • Rapid, cheap experimentation with many task-specific adapters

References & further reading

  1. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022.
  2. Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023.
  3. Houlsby et al., “Parameter-Efficient Transfer Learning for NLP,” ICML 2019.