HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning (NeurIPS 2024)
Tian, Chunlin, et al. "HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning." NeurIPS (2024)
( https://arxiv.org/pdf/2404.19245 )
Contents
- (1) Abstract
- (2) Limitation of LoRA
- (3) HydraLoRA
- LoRA
- HydraLoRA
- (4) Workflow of HydraLoRA
- Fine-tuning
- Inference
1. Abstract
(1) LoRA: Widely used Parameter-Efficient Fine-Tuning (PEFT) technique
(2) Limitation of LoRA: Often underperforms compared to full fine-tuning
- ( especially in complex datasets )
(3) Proposal: HydraLoRA
- LoRA framework with an asymmetric structure that eliminates the need for domain expertise
2. Limitation of LoRA
LoRA often underperforms full fine-tuning, especially on complex, heterogeneous datasets: a single $(A, B)$ pair must absorb all sub-task knowledge, which causes interference between samples from different domains. HydraLoRA addresses this by splitting $B$ into multiple heads (Section 3).
3. HydraLoRA
(1) LoRA
$$y' = y + \Delta y = W_0 x + B A x$$

- $y \in \mathbb{R}^d$: output
- $x \in \mathbb{R}^k$: input
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$
- $B$ is initialized with zeros
- $A$ is initialized with Kaiming Uniform [14]
  - → to force $\Delta y = 0$ at the beginning
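As a concrete reference, here is a minimal PyTorch sketch of a single LoRA-adapted linear layer following the definitions above; the class and variable names are illustrative, not taken from the paper or any library.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """y' = W0 x + B A x, with W0 frozen and only A, B trained."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)
        self.W0.weight.requires_grad = False              # pretrained weight stays frozen
        self.A = nn.Parameter(torch.empty(r, k))          # A ∈ R^{r×k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B ∈ R^{d×r}, zero-init ⇒ Δy = 0 at step 0
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # Kaiming Uniform init for A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k) → (batch, d)
        return self.W0(x) + x @ self.A.T @ self.B.T       # W0 x + B A x
```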
(2) HydraLoRA
$$W = W_0 + \Delta W = W_0 + \sum_{i=1}^{N} \omega_i \cdot B_i A$$

- $B_i \in \mathbb{R}^{d \times r}$ → $N$ matrices (one per head)
- $A \in \mathbb{R}^{r \times k}$ → single shared matrix
- $\omega_i$: modulates the contribution weight of head $B_i$
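For intuition, a small sketch of this weight-space view: given fixed gating weights $\omega$ (how the router produces them is covered in Section 4), the effective weight is $W_0$ plus a weighted sum of the $B_i A$ products. The function name and the toy sizes below are illustrative assumptions.

```python
import torch


def hydra_merged_weight(W0, A, B_heads, omega):
    """Weight-space view: W = W0 + Σ_i ω_i · B_i A.

    W0:      (d, k) frozen pretrained weight
    A:       (r, k) single shared low-rank matrix
    B_heads: list of N matrices, each (d, r)
    omega:   (N,) contribution weights for the B heads
    """
    delta_W = sum(w * (B @ A) for w, B in zip(omega, B_heads))  # Σ_i ω_i · B_i A, shape (d, k)
    return W0 + delta_W


# Toy usage with hypothetical sizes; all B_i start at zero, so W == W0 initially.
d, k, r, N = 16, 32, 4, 3
W0 = torch.randn(d, k)
A = torch.randn(r, k)
B_heads = [torch.zeros(d, r) for _ in range(N)]
omega = torch.softmax(torch.randn(N), dim=0)
W = hydra_merged_weight(W0, A, B_heads, omega)
```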
4. Workflow of HydraLoRA
(1) Fine-tuning
MoE (Mixture-of-Experts): experts are selectively activated by a gating mechanism (router)
a) Set of experts
To treat the multiple $B$ matrices in a unified way…
→ define a set of experts $(E_1, \dots, E_N)$, one expert per head $B_i$
Interpretation
- (1) Shared matrix A : inherently captures collaborative knowledge to augment intra-gains
- (2) Different matrices B : foster knowledge modularity to mitigate fine-tuning inter-offsets
b) Router
$$\omega_i = \mathrm{softmax}(W_g^{\top} x)$$

- trainable weights (transformation matrix) $W_g \in \mathbb{R}^{r \times N}$
  → yields the gating scores $(\omega_1, \dots, \omega_N)$
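A hedged sketch of the router as a single trainable linear map followed by a softmax. The input dimension of $W_g$ is left generic here, since this note only states its shape ($r \times N$) and not exactly which activation is fed to the router; the class name is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Router(nn.Module):
    """Gating ω = softmax(W_g^T x): one score per B head."""

    def __init__(self, in_dim: int, num_heads: int):
        super().__init__()
        # Zero init ⇒ uniform gating (ω_i = 1/N) before any training.
        self.W_g = nn.Parameter(torch.zeros(in_dim, num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_dim) → gating scores (ω_1, …, ω_N) of shape (..., num_heads)
        return F.softmax(x @ self.W_g, dim=-1)
```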
c) HydraLoRA
$$y = W_0 x + \sum_{i=1}^{N} \omega_i E_i A x \qquad \text{(MoE)}$$

- where $N$ denotes the number of experts, i.e., the number of $B$ matrices
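Putting the pieces together, a self-contained sketch of one HydraLoRA-style linear layer: frozen $W_0$, one shared $A$, $N$ zero-initialized $B$ heads, and a softmax router producing the per-head weights $\omega_i$. This is an illustrative reconstruction from the formulas above, not the authors' released implementation; in particular, routing directly on the layer input $x$ is an assumption of this sketch.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class HydraLoRALinear(nn.Module):
    """y = W0 x + Σ_{i=1..N} ω_i · B_i A x, with ω = softmax(W_g^T x)."""

    def __init__(self, d: int, k: int, r: int, num_heads: int):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)
        self.W0.weight.requires_grad = False               # frozen pretrained weight
        self.A = nn.Parameter(torch.empty(r, k))           # single shared A
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B_heads = nn.ParameterList(
            [nn.Parameter(torch.zeros(d, r)) for _ in range(num_heads)]  # N zero-init B_i (experts E_i)
        )
        # Router weights; feeding the raw layer input x to the router is an assumption of this sketch.
        self.W_g = nn.Parameter(torch.zeros(k, num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k)
        omega = F.softmax(x @ self.W_g, dim=-1)            # (batch, N) gating scores
        Ax = x @ self.A.T                                  # (batch, r), shared projection
        delta = sum(
            omega[:, i : i + 1] * (Ax @ B.T)               # ω_i · B_i A x, shape (batch, d)
            for i, B in enumerate(self.B_heads)
        )
        return self.W0(x) + delta
```

In this sketch only $A$, the $B$ heads, and $W_g$ receive gradients; $W_0$ stays frozen, which is what keeps the fine-tuning parameter-efficient.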
(2) Inference
Adapters are merged dynamically at inference: the router computes the gating weights from the input, so the combination of $B$ heads adapts to each input!
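A brief usage sketch (it reuses the illustrative `HydraLoRALinear` class from the previous block, so it is not standalone): at inference the forward pass is unchanged, and the trained router merges the $B$ heads with input-dependent weights.

```python
import torch

# Reuses the illustrative HydraLoRALinear sketch from the previous section.
layer = HydraLoRALinear(d=16, k=32, r=4, num_heads=3)
layer.eval()

with torch.no_grad():
    x = torch.randn(2, 32)                 # two inputs of dimension k = 32
    y = layer(x)                           # router derives ω from each input and merges the B heads
```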