A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data (HBM)

Contents

  0. Abstract
  1. Introduction
  2. Related Work
  3. The Hierarchical BERT Model
    3-1. Token-level Roberta Encoder
    3-2. Sentence-level BERT Encoder
    3-3. Prediction Layer
    3-4. Training HBM


0. Abstract

  • Training deep learning models with LIMITED labeled data!

  • BERT : good performance even with little labeled data

    BUT, not so good at LONG document classification…

  • Therefore, this paper proposes the HBM (Hierarchical BERT Model)


HBM

  • 1) learns sentence-level features
  • 2) works well even with limited labeled data
  • In particular, when documents are "long", it outperforms existing algorithms


1. Introduction

(A) BERT

  • extract extensive language features from large corpora
  • good when labeled data is limited
  • pre-training & fine-tuning

  • when the text is very long, even BERT is not perfect!


(B) HAN (Hierarchical Attention Network)

  • extracts sentence-level information from documents


(A) + (B) = HBM

  • improves the performance of BERT-based models / in document classification / in a low-labeled-data context / by considering sentence structure information


BERT vs HBM

  • BERT : captures connections between WORDS
  • HBM : captures connections between SENTENCES
    • sentences are not all equally important
    • sentences that receive higher attention scores contribute more to the representation of the document


2. Related Work

BERT

  • Deep learning is generally thought to require about 5,000 examples per class

  • With 100–1,000 examples per class, BERT is the strongest performer in document classification

  • However…

    • (1) need to check whether it is still better when compared against modified versions of BERT itself

      ( not just wins over outdated baselines )

    • (2) BERT is ill-suited to handling long texts

\(\rightarrow\) use a sentence-level language model to alleviate the LONG-text problem suffered by BERT


Others

  • HAN : Bi-GRU with attention
  • HIBERT (Hierarchical Bidirectional Encoder Representations from Transformers) : for document summarization
  • these mostly assume that labeled data is abundant ( not limited )


Contribution

  • 1) propose HBM
    • capture sentence structure
    • with small amounts of labeled data
  • 2) salient sentences identified by HBM are useful as explanations for document classification


3. The Hierarchical BERT Model

Composed of 3 components :

  • (1) token-level Roberta encoder
  • (2) sentence-level BERT encoder
  • (3) prediction layer


Figure 2


Brief overview :

  • step 1) the raw text goes into the token-level Roberta Encoder, which extracts text features

    ( = forms a vector representation for each sentence )

  • step 2) the sentence vectors go into the sentence-level BERT encoder as input, which generates the intermediate representation

  • step 3) the intermediate representation ( = document representation ) goes into the prediction layer, which predicts the document class


3-1) Token-level Roberta Encoder

  • Roberta = BERT pre-trained on larger datasets ( with an improved training procedure )

  • use a pre-trained Roberta to extract raw text features ( at the word level ); see the sketch after this list

    \[\mathcal{D}=\left(E_{1}, E_{2}, \ldots, E_{m}\right)\]
    • \(E_i\) : the \(i\)-th sentence vector of document \(\mathcal{D} \in \mathbb{R}^{m \times d_{e}}\)

      ( by averaging token embeddings contained in each sentence )

    • (at most) \(m\) sentences form one document

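A minimal sketch of this step, assuming the Hugging Face transformers library and a roberta-base checkpoint; the function name `sentence_vectors` and the padding to a fixed `max_sentences` are illustrative choices, not the authors' exact implementation.

```python
# Sketch: one vector per sentence, by averaging RoBERTa token embeddings.
# Assumptions: Hugging Face `transformers`, `roberta-base`, illustrative names.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()   # used frozen

@torch.no_grad()
def sentence_vectors(sentences, max_sentences=50, d_e=768):
    """Build D = (E_1, ..., E_m): an (m x d_e) matrix of sentence vectors."""
    vecs = []
    for sent in sentences[:max_sentences]:
        enc = tokenizer(sent, return_tensors="pt", truncation=True, max_length=512)
        hidden = roberta(**enc).last_hidden_state      # (1, n_tokens, d_e)
        vecs.append(hidden.mean(dim=1).squeeze(0))     # average token embeddings
    while len(vecs) < max_sentences:                   # pad short documents
        vecs.append(torch.zeros(d_e))
    return torch.stack(vecs)                           # (max_sentences, d_e)
```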

3-2) Sentence-level BERT Encoder

  • extracts sentence structure information to generate the intermediate document representation \(\mathcal{S}\)
  • composed of several identical BERT layers ( see the sketch at the end of this subsection )

Figure 2


Output of the BertAtt layer :

\(\operatorname{BertAtt}(\mathcal{D})=\operatorname{LayerNorm}(\mathcal{D}+\operatorname{MultiHead}(\mathcal{D}))\).

  • \(\operatorname{MultiHead}(\mathcal{D})=\operatorname{Concat}\left(\operatorname{head}_{1}, \operatorname{head}_{2}, \ldots, \operatorname{head}_{h}\right) \times W^{O}\).
    • \(\operatorname{head}_{i}=\operatorname{Attention}\left(Q_{i}, K_{i}, V_{i}\right)=\operatorname{Softmax}\left(Q_{i} \times K_{i}^{\top} / \sqrt{d_{e}}\right) \times V_{i}\).
      • \(Q_{i}=\mathcal{D} \times W_{i}^{Q}\).
      • \(K_{i}=\mathcal{D} \times W_{i}^{K}\).
      • \(V_{i}=\mathcal{D} \times W_{i}^{V}\).


The output of BertAtt is then passed through a feed-forward network (FFNN) :

\(\mathcal{D}^{\prime}=\operatorname{LayerNorm}\left(\operatorname{BertAtt}(\mathcal{D})+\operatorname{Relu}\left(\operatorname{BertAtt}(\mathcal{D}) \times W^{r}\right) \times W^{S}\right)\).

  • intermediate document representation \(\mathcal{S}\) :

    \(\mathcal{S}=\operatorname{Tanh}\left(\operatorname{Avg}(\mathcal{Z}) \times W^{t}\right)\).

    ( \(\mathcal{Z}\) : the output of the last sentence-level BERT layer )

  • this \(\mathcal{S}\) is passed to the final prediction layer

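A minimal PyTorch sketch of this encoder, written directly from the equations above (not the authors' code): `nn.MultiheadAttention` stands in for the per-head projections \(W_i^Q, W_i^K, W_i^V\) and \(W^O\), and the layer / head counts are illustrative assumptions.

```python
# Sketch of one sentence-level BERT layer and the pooling to S.
# Assumption: nn.MultiheadAttention covers W_i^Q, W_i^K, W_i^V and W^O.
import torch
import torch.nn as nn

class SentenceLevelBertLayer(nn.Module):
    def __init__(self, d_e=768, n_heads=8, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_e)
        self.ffnn = nn.Sequential(                      # W^r, Relu, W^S
            nn.Linear(d_e, d_ff), nn.ReLU(), nn.Linear(d_ff, d_e))
        self.norm2 = nn.LayerNorm(d_e)

    def forward(self, D):                               # D: (batch, m, d_e)
        attn_out, _ = self.attn(D, D, D)                # MultiHead(D)
        bert_att = self.norm1(D + attn_out)             # BertAtt(D)
        return self.norm2(bert_att + self.ffnn(bert_att))   # D'

class SentenceLevelBertEncoder(nn.Module):
    def __init__(self, d_e=768, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [SentenceLevelBertLayer(d_e) for _ in range(n_layers)])
        self.W_t = nn.Linear(d_e, d_e, bias=False)      # W^t

    def forward(self, D):
        Z = D
        for layer in self.layers:
            Z = layer(Z)                                # Z: output of last layer
        return torch.tanh(self.W_t(Z.mean(dim=1)))      # S = Tanh(Avg(Z) x W^t)
```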

3-3) Prediction Layer

  • compute raw score for each class

  • \(\left[t_{0}, t_{1}, \ldots, t_{y}\right]=\mathcal{S} \times W\).

    • \(W \in \mathbb{R}^{d_{e} \times y}\).
    • \(t_{0}, t_{1}, \ldots, t_{y}\) : raw scores of the document for the \(0^{\text{th}}\) to \(y^{\text{th}}\) classes

    • these \(t\) are then fed into a softmax ( see the sketch below )

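A corresponding sketch of the prediction layer, assuming the \(\mathcal{S}\) produced above; the class count is an illustrative parameter.

```python
# Sketch: raw class scores [t_0, ..., t_y] = S x W, then softmax.
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    def __init__(self, d_e=768, num_classes=2):          # num_classes is illustrative
        super().__init__()
        self.W = nn.Linear(d_e, num_classes, bias=False)  # S x W

    def forward(self, S):                                 # S: (batch, d_e)
        t = self.W(S)                                     # raw scores per class
        return torch.softmax(t, dim=-1)                   # class probabilities
```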

3-4) Training HBM

( pre-training + fine-tuning structure )

[ pre-training ]

  • several self-supervised methods ( MLM, NSP )

[ fine-tuning ]

  • a task-specific head is added


When fine-tuning this model ( see the sketch below ) :

  • (frozen) Roberta
  • (not frozen / trained) sentence-level BERT encoder & prediction layer
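A hedged sketch of this fine-tuning setup, reusing the illustrative `SentenceLevelBertEncoder` and `PredictionLayer` classes from the sketches above; the optimizer and learning rate are assumptions, not the paper's reported settings.

```python
# Sketch: freeze the token-level Roberta, train only the sentence-level
# encoder and the prediction layer. Optimizer / lr are illustrative assumptions.
import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained("roberta-base")
encoder = SentenceLevelBertEncoder()     # from the sketch in section 3-2
head = PredictionLayer(num_classes=2)    # from the sketch in section 3-3

for p in roberta.parameters():
    p.requires_grad = False              # frozen: Roberta

optimizer = torch.optim.Adam(            # trained: encoder & prediction layer
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5)
```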