Mixture of nested experts: Adaptive processing of visual tokens

Jain, Gagan, et al. "Mixture of nested experts: Adaptive processing of visual tokens." arXiv preprint arXiv:2407.19985 (2024).

References:

  • https://aipapersacademy.com/mixture-of-nested-experts/
  • https://arxiv.org/pdf/2407.19985


Contents

  1. Motivation
  2. MoNE: Mixture of Nested Experts
    1. Nested Experts
    2. Router
    3. MoNE Layer Output
    4. MoNE Layer Details
  3. Experiments


1. Motivation

Is Standard MoE Enough?

Mixture-of-Experts (MoE)

  • Helps to increase model size (w/o a proportional increase in computational cost)
  • Limitation = **large memory footprint**
    • \(\because\) Need to load all of the experts


Information redundancy in CV

(figure)

The patch in the upper right

  • Mostly contains background pixels
  • Nonetheless, ViT ( + MoE ) allocates the same compute to all tokens!


2. MoNE: Mixture of Nested Experts

Limitations of MoE

  • (1) Large memory footprint
  • (2) Information redundancy

\(\rightarrow\) Solution: Mixture of Nested Experts (MoNE)


(1) Nested Experts

(figure)

Example) 3 nested experts

  • Shown in the figure with 3 different colors ( see the sketch after this list )
  • Size
    • Expert 1 (L)
    • Expert 2 (M)
    • Expert 3 (S)
  • Each expert has a different capacity (for handling tokens)
    • Expert 1 > Expert 2 > Expert 3
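
A minimal sketch of the nesting idea, assuming each nested expert is simply the leading slice of the full layer's weight matrices (the dimensions 256/128/64 and the `nested_linear` helper are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

D_FULL = 256                     # Expert 1 (L) = the full layer width (illustrative)
NESTED_DIMS = [256, 128, 64]     # Expert 1 (L), Expert 2 (M), Expert 3 (S)

full_proj = nn.Linear(D_FULL, D_FULL)    # one weight matrix of the full layer

def nested_linear(x: torch.Tensor, expert_dim: int) -> torch.Tensor:
    """Apply only the leading expert_dim x expert_dim block of the shared weights."""
    W = full_proj.weight[:expert_dim, :expert_dim]   # nested slice, no extra parameters
    b = full_proj.bias[:expert_dim]
    return x @ W.T + b

for d in NESTED_DIMS:
    token = torch.randn(1, d)                 # token already reduced to the expert's width
    print(d, nested_linear(token, d).shape)   # 256/128/64-dim outputs from the same weights
```

The point of nesting is that the experts share parameters: the smaller experts live inside the full layer, so (unlike standard MoE) adding experts does not add memory.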


(2) Router

( Expert Preferred Router )

The router assigns probabilities to the input tokens ( see the sketch below )

  • First expert = allocated the most important input tokens
  • Second expert = allocated the most important of the still-unallocated tokens
  • Third expert = likewise, allocated from the tokens still unallocated
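
A rough sketch of this expert-preferred allocation, under the assumption that each expert has a fixed token capacity and experts pick greedily, in order, from the not-yet-allocated tokens (the linear router head and the capacities below are illustrative):

```python
import torch
import torch.nn as nn

num_tokens, d_model, num_experts = 8, 256, 3
capacities = [2, 2, 4]                      # tokens each expert may take (illustrative)

router = nn.Linear(d_model, num_experts)    # simple router head (assumption)
tokens = torch.randn(num_tokens, d_model)
probs = router(tokens).softmax(dim=-1)      # [num_tokens, num_experts]

assignment = torch.full((num_tokens,), -1)  # -1 = not yet allocated
for e in range(num_experts):                # Expert 1 (largest) chooses first
    free = (assignment == -1).nonzero(as_tuple=True)[0]
    order = probs[free, e].argsort(descending=True)   # most "important" tokens first
    chosen = free[order[:capacities[e]]]
    assignment[chosen] = e

print(assignment)   # nested-expert index chosen for every token
```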


(figure)


(3) MoNE Layer Output

figure2

Three nested experts

\(\rightarrow\) The outputs from all nested experts are combined together ( see the sketch below )
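
A hedged sketch of the combination step, in my own notation rather than the paper's exact equation: for a token \(z\) routed to nested expert \(k\) with router probability \(p_k\),

\[
z' = z + p_k \cdot \mathrm{pad}\big( E_k( z_{[:D_k]} ) \big),
\]

where \(z_{[:D_k]}\) keeps the first \(D_k\) feature dimensions, \(E_k\) is the nested expert, and \(\mathrm{pad}(\cdot)\) brings the expert output back to the full model width so that all tokens live in one sequence again. Scaling by \(p_k\) also lets gradients flow back to the router.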


(4) MoNE Layer Details

Two things to note!

  • This is not a whole model, but a single layer!
  • Tokens that are routed to nested experts smaller than the full layer are downsized to the dimension of that nested expert ( a short sketch follows the example below )


Example

(figure)

  • Two tokens to be processed

  • (Left) Assigned to the 3rd expert

    \(\rightarrow\) dimension = 64

  • (Right) Assigned to the 1st expert

    \(\rightarrow\) dimension = 256
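
A tiny sketch of what "downsized" could mean concretely, assuming it amounts to keeping only the leading feature dimensions of the token (consistent with the nested-slicing picture above):

```python
import torch

token = torch.randn(256)          # full-width token representation
token_for_expert3 = token[:64]    # left token: reduced to the dim-64 nested expert
token_for_expert1 = token         # right token: the dim-256 expert uses it as-is
```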


a) Self-attention

  • (Left) Smaller nested expert

    • Only a subset of the weights of the attention module is used to extract Q, K, V
  • (Right) Larger nested expert

    • The whole weight matrices are used.
  • Tokens still interact with each other in the self-attention module

    • By padding the values received from smaller nested models to the full model dimension ( see the sketch below )
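
A hedged sketch of how both tokens can sit in the same self-attention, assuming Q, K, V of the small-expert token are computed with the sliced weights and then zero-padded to the full width (the slicing/padding details are my reading of the bullets above, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_FULL, D_SMALL = 256, 64
Wq = nn.Linear(D_FULL, D_FULL, bias=False)    # shared full-width projections
Wk = nn.Linear(D_FULL, D_FULL, bias=False)
Wv = nn.Linear(D_FULL, D_FULL, bias=False)

def qkv(token: torch.Tensor, d: int):
    """Project with only the leading d x d block of each matrix, then zero-pad to D_FULL."""
    outs = []
    for W in (Wq, Wk, Wv):
        y = token[:d] @ W.weight[:d, :d].T
        outs.append(F.pad(y, (0, D_FULL - d)))
    return outs

q1, k1, v1 = qkv(torch.randn(D_FULL), D_SMALL)   # token routed to the small expert
q2, k2, v2 = qkv(torch.randn(D_FULL), D_FULL)    # token routed to the full expert

# Both tokens join one attention computation despite different expert sizes
Q, K, V = (torch.stack(t) for t in ((q1, q2), (k1, k2), (v1, v2)))
attn = (Q @ K.T / D_FULL ** 0.5).softmax(dim=-1)
print((attn @ V).shape)   # torch.Size([2, 256])
```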

b) MLP

Only a subset of the MLP weights is used

\(\rightarrow\) Tokens routed to smaller nested models require less compute! ( see the sketch below )
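
A matching sketch for the MLP, again assuming the nested expert uses only the leading slices of the two shared weight matrices (dimensions are illustrative):

```python
import torch
import torch.nn as nn

D_FULL, D_HIDDEN = 256, 1024
fc1, fc2 = nn.Linear(D_FULL, D_HIDDEN), nn.Linear(D_HIDDEN, D_FULL)

def nested_mlp(token: torch.Tensor, d: int, h: int) -> torch.Tensor:
    """Two-layer MLP using only the leading d / h slices of the full weights."""
    x = torch.relu(token[:d] @ fc1.weight[:h, :d].T + fc1.bias[:h])
    return x @ fc2.weight[:d, :h].T + fc2.bias[:d]

tok = torch.randn(D_FULL)
print(nested_mlp(tok, 64, 256).shape)     # small expert: ~1/16 of the full multiply-adds
print(nested_mlp(tok, 256, 1024).shape)   # full expert
```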


3. Experiments

(1) MoNE's Understanding of Token Importance

(figure)


(2) Image Classification

(figure)
