HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning (NeurIPS 2024)

Tian, Chunlin, et al. "HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning." NeurIPS 2024.

( https://arxiv.org/pdf/2404.19245 )


Contents

  • (1) Abstract
  • (2) Limitation of LoRA
  • (3) HydraLoRA
    • LoRA
    • HydraLoRA
  • (4) Workflow of HydraLoRA
    • Fine-tuning
    • Inference


1. Abstract

(1) LoRA: A widely used Parameter-Efficient Fine-Tuning (PEFT) technique

(2) Limitation of LoRA: Often underperforms compared to full fine-tuning

  • (especially on complex datasets)

(3) Proposal: HydraLoRA

  • LoRA framework with an asymmetric structure that eliminates the need for domain expertise


2. Limitation of LoRA

LoRA underperforms full fine-tuning, especially on heterogeneous datasets

(Figure 2 of the paper)


3. HydraLoRA

(Figure 2 of the paper)


(1) LoRA

\(y' = y + \Delta y = W_0 x + B A x\)

  • \(y \in \mathbb{R}^{d}\): output

  • \(x \in \mathbb{R}^{k}\): input

  • \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) with \(r \ll \min (d, k)\)

    • \(B\) is initialized with zeroes

    • \(A\) is initialized with Kaiming Uniform [14]

      \(\rightarrow\) to force \(\Delta y=0\) at the beginning
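
A minimal PyTorch sketch of such a LoRA layer (class and argument names are illustrative assumptions, not from the paper's code):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch implementing y' = W_0 x + B A x."""

    def __init__(self, d_in: int, d_out: int, r: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)    # pretrained W_0
        self.base.weight.requires_grad_(False)            # W_0 stays frozen
        self.A = nn.Parameter(torch.empty(r, d_in))       # Kaiming Uniform init
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zeros => Δy = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_in)
        return self.base(x) + x @ self.A.T @ self.B.T    # W_0 x + B A x
```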


(2) HydraLoRA

\(W = W_0 + \Delta W = W_0 + \sum_{i=1}^{N} \omega_i \cdot B_i A\)

  • \(B_i \in \mathbb{R}^{d \times r}\) \(\rightarrow\) \(N\) matrices
  • \(A \in \mathbb{R}^{r \times k}\) \(\rightarrow\) single shared matrix
  • \(\omega_i\): contribution weight that modulates head \(B_i\)
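
The asymmetry shows up directly in the trainable-parameter count. As a hypothetical example (sizes chosen for illustration, not from the paper): with \(d = k = 4096\), \(r = 8\), \(N = 3\), training \(N\) separate LoRA adapters costs \(N(dr + rk) = 196{,}608\) parameters, whereas HydraLoRA costs \(N \cdot dr + rk = 131{,}072\), because the single \(A \in \mathbb{R}^{r \times k}\) is shared across all heads.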


4. Workflow of HydraLoRA

(Figure 2 of the paper)


(1) Fine-tuning

MoE (Mixture-of-Experts) = Experts are selectively activated by a gating mechanism (router)


a) Set of experts

To handle the multiple \(B\) matrices in a unified way…

\(\rightarrow\) Define a set of experts: \(\left(E_1, \ldots, E_N\right)\)


Interpretation

  • (1) Shared matrix \(A\) : inherently captures collaborative knowledge to augment intra-gains
  • (2) Different matrices \(B\) : foster knowledge modularity to mitigate fine-tuning inter-offsets


b) Router

\(\omega_i=\operatorname{softmax}\left(W_g^T x\right)\).

  • trainable weights (transformation matrix) \(W_g \in \mathbb{R}^{r \times N}\)

\(\rightarrow\) yields the gating scores \(\left(\omega_1, \ldots, \omega_N\right)\)
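
A minimal sketch of the gate. Since the paper gives \(W_g \in \mathbb{R}^{r \times N}\), this sketch assumes the router acts on the shared rank-\(r\) projection \(z = A x\); sizes are illustrative:

```python
import torch
import torch.nn.functional as F

r, N = 8, 3                                    # illustrative sizes
W_g = torch.nn.Parameter(torch.zeros(r, N))    # trainable transformation matrix

def gating_scores(z: torch.Tensor) -> torch.Tensor:
    """z: (batch, r), the shared projection A x; returns (batch, N) scores."""
    return F.softmax(z @ W_g, dim=-1)          # (ω_1, ..., ω_N), each row sums to 1
```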


c) HydraLoRA

\(y = W_0 x + \sum_{i=1}^N \omega_i E_i A x \quad \text{(MoE)}\)

  • where \(N\) denotes the number of experts, i.e., \(B\) matrices.
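
Putting the pieces together, a hedged PyTorch sketch of one HydraLoRA layer (class and argument names are hypothetical, not from the paper's code; the routing-on-\(Ax\) choice follows the \(W_g \in \mathbb{R}^{r \times N}\) dimension above):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraLoRALinear(nn.Module):
    """Sketch of a HydraLoRA layer: frozen W_0, one shared A,
    N B-heads (experts), and a soft router over the heads."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, n_heads: int = 3):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)    # pretrained W_0
        self.base.weight.requires_grad_(False)            # stays frozen
        self.A = nn.Parameter(torch.empty(r, d_in))       # shared down-projection
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # Kaiming Uniform init
        self.B = nn.Parameter(torch.zeros(n_heads, d_out, r))  # N heads, zero-init
        self.W_g = nn.Parameter(torch.zeros(r, n_heads))  # router; zero-init => uniform gating at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.A.T                                  # (batch, r): shared A x
        w = F.softmax(z @ self.W_g, dim=-1)               # (batch, N): ω_i per input
        heads = torch.einsum("ndr,br->bnd", self.B, z)    # B_i (A x) for each head
        delta = (w.unsqueeze(-1) * heads).sum(dim=1)      # Σ_i ω_i B_i A x
        return self.base(x) + delta                       # y = W_0 x + Σ ω_i B_i A x
```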


(2) Inference

Adapters are merged dynamically: the router computes the gating scores \(\omega_i\) from each input, so the contribution of each \(B_i\) head adapts per example.
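
Illustrative usage of the `HydraLoRALinear` sketch above (sizes hypothetical): no separate merge step is needed at inference, because the gating scores are recomputed for every input.

```python
layer = HydraLoRALinear(d_in=4096, d_out=4096, r=8, n_heads=3)
layer.eval()
with torch.no_grad():
    y = layer(torch.randn(2, 4096))   # each example gets its own ω_1..ω_N
```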
