UNIT: Unifying Image and Text Recognition in One Vision Encoder
https://arxiv.org/pdf/2409.04095
Contents
- Abstract
- Language & Vision Decoder
- “Text Recognition” Ability Enhancement
- “Image Recognition” Ability Preservation
- Two Stages of Training
- Intra-scale pretraining
- Inter-scale fine-tuning
1. Abstract
Vision Transformers (ViTs)
- Excel at image recognition tasks
\(\rightarrow\) But they cannot simultaneously support text recognition!
Solution: UNIT
- Novel training framework aimed at “UNifying Image and Text recognition” within a “single model”
Details
- Start with a vision encoder pre-trained on image recognition tasks
- Introduce two components
- [1] Lightweight “language” decoder: For predicting text outputs
- [2] Lightweight “vision” decoder: To prevent catastrophic forgetting of the original image encoding capabilities.
- Two stages of training
- (1) Intra-scale pretraining
- (2) Inter-scale fine-tuning
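To make the layout concrete, here is a minimal PyTorch sketch of a shared encoder feeding the two lightweight decoders. All module names and sizes (`UNIT`, `language_decoder`, `vision_decoder`, the 2-layer decoder depth) are illustrative assumptions, not the authors' implementation; the encoder is assumed to return a sequence of patch features:

```python
import torch
import torch.nn as nn

class UNIT(nn.Module):
    """Shared ViT encoder + two lightweight decoders (illustrative sketch)."""

    def __init__(self, vit_encoder: nn.Module, dim: int = 768, vocab_size: int = 32000):
        super().__init__()
        self.encoder = vit_encoder  # pre-trained image-recognition ViT
        # [1] lightweight language decoder: autoregressive text prediction
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.language_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)
        # [2] lightweight vision decoder: reconstructs the original encoder's
        # features to counter catastrophic forgetting
        self.vision_decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        z = self.encoder(images)  # assumed shape: (B, N, dim) patch features
        T = text_tokens.size(1)
        # causal mask: token t attends only to tokens 1..t-1
        mask = torch.triu(torch.full((T, T), float("-inf"), device=text_tokens.device), diagonal=1)
        h = self.language_decoder(self.token_embed(text_tokens), z, tgt_mask=mask)
        return self.lm_head(h), self.vision_decoder(z)  # text logits, feature recon
```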
2. Language & Vision Decoder
(a) “Text Recognition” Ability Enhancement
Predict the language sequence in the form of text tokens, trained with an autoregressive next-token loss:
\(\mathcal{L}_{\mathrm{lan}}(\mathbf{y}, \hat{\mathbf{y}})=-\sum_{t=1}^T \log P\left(\hat{y}_t=y_t \mid \mathbf{y}_{1: t-1}, \mathbf{z}_t^L\right)\).
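A hedged sketch of how this loss could be computed with teacher forcing; the shift convention and tensor shapes are assumptions:

```python
import torch.nn.functional as F

def language_loss(text_logits, text_tokens):
    # text_logits: (B, T, vocab), prediction for y_t given y_{1:t-1} and the
    # encoder features; text_tokens: (B, T) ground-truth token ids y
    logits = text_logits[:, :-1].reshape(-1, text_logits.size(-1))  # predict next token
    targets = text_tokens[:, 1:].reshape(-1)                        # shifted targets
    # cross_entropy returns the mean of -log P(y_t | y_{1:t-1}, z),
    # i.e. the sum in L_lan divided by the number of tokens
    return F.cross_entropy(logits, targets)
```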
(b) “Image Recognition” Ability Preservation
A reconstruction task on natural image datasets,
\(\mathcal{L}_{\mathrm{vis}}(\mathbf{X}, \hat{\mathbf{X}})=\sum_{i \in \mathcal{C}} L_{\mathrm{cos}}\left(f_\pi\left(\mathbf{x}_i\right), \hat{\mathbf{x}}_i\right)+\mu L_{\ell_1}\left(f_\pi\left(\mathbf{x}_i\right), \hat{\mathbf{x}}_i\right)\),
where \(f_\pi(\mathbf{x}_i)\) are the target features from the original pre-trained encoder, \(\hat{\mathbf{x}}_i\) the vision decoder's reconstructions, and \(\mu\) weights the \(\ell_1\) term.
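A possible implementation, assuming \(f_\pi\) is the frozen original encoder's feature map and using per-patch means instead of the patch sum (which only rescales the loss):

```python
import torch.nn.functional as F

def vision_loss(teacher_feats, recon_feats, mu: float = 1.0):
    # teacher_feats = f_pi(x_i), recon_feats = x_hat_i, both (B, N, dim)
    l_cos = (1.0 - F.cosine_similarity(teacher_feats, recon_feats, dim=-1)).mean()
    l_one = F.l1_loss(teacher_feats, recon_feats)  # the l1 term, weighted by mu
    return l_cos + mu * l_one
```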
3. Two Stages of Training
(a) Intra-scale pretraining
- Learns unified representations from multi-scale inputs
- How? Images and documents are fed at their commonly used resolutions (natural images at the base \(\times 1\) scale, documents at the larger \(\times 4\) scale)
- Why? To enable fundamental recognition capability.
(b) Inter-scale fine-tuning
- Introduces “scale-exchanged” data
- How? Images and documents at resolutions swapped from the commonly used ones (images at \(\times 4\), documents at \(\times 1\)); see the sketch after this list
- Why? To enhance the model's robustness across scales
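Using the dataset notation introduced in (c) below, the two stages' data mixes could be assembled as in this sketch; the dataset handles and the take-the-first-half strategy for the \(\frac{1}{2}\) subsets are assumptions:

```python
def stage_data(imgs_x1, imgs_x4, docs_x1, docs_x4, stage: str):
    """Assemble the training mix for one stage (lists of samples assumed)."""
    if stage == "intra":
        # common resolutions: images at x1, documents at x4
        return imgs_x1 + docs_x4
    # "inter": scale-exchanged data, plus half-sized original-scale subsets
    half = lambda data: data[: len(data) // 2]  # simplistic 1/2 subset
    return imgs_x4 + docs_x1 + half(imgs_x1) + half(docs_x4)
```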
(c) Loss functions
Intra-scale pretraining
- \(\theta=\arg \min _\theta \mathcal{L}_{\mathrm{lan}}\left(\left\{\mathcal{D}_{\times 1}^I \cup \mathcal{D}_{\times 4}^T\right\}\right)+\lambda \mathcal{L}_{\mathrm{vis}}\left(\left\{\mathcal{D}_{\times 1}^I\right\}\right)\).
Inter-scale fine-tuning
- \(\theta=\arg \min _\theta \mathcal{L}_{\mathrm{lan}}\left(\left\{\mathcal{D}_{\times 4}^I \cup \mathcal{D}_{\times 1}^T \cup \frac{1}{2} \mathcal{D}_{\times 1}^I \cup \frac{1}{2} \mathcal{D}_{\times 4}^T\right\}\right)+\lambda \mathcal{L}_{\mathrm{vis}}\left(\left\{\frac{1}{2} \mathcal{D}_{\times 1}^I \cup \mathcal{D}_{\times 4}^I\right\}\right)\).

Here \(\mathcal{D}^I\) and \(\mathcal{D}^T\) denote natural-image and document data, \(\times 1\) / \(\times 4\) the two input scales, \(\frac{1}{2}\) a half-sized subset, and \(\lambda\) balances the two losses.
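Putting the pieces together, one optimization step for either stage might look like the following; only the batch construction (via something like `stage_data` above) differs between stages, and all names, including the paired text targets for natural images, are assumptions:

```python
import torch

def training_step(model, teacher, image_batch, doc_batch, lam: float = 1.0):
    """One step for either stage; reuses language_loss / vision_loss above."""
    img_px, img_tok = image_batch  # natural images + paired text targets (assumed)
    doc_px, doc_tok = doc_batch    # documents + their target text tokens

    # L_lan on both domains, at the scales prescribed by the current stage
    img_logits, img_recon = model(img_px, img_tok)
    doc_logits, _ = model(doc_px, doc_tok)
    l_lan = language_loss(img_logits, img_tok) + language_loss(doc_logits, doc_tok)

    # L_vis on natural images only, against the frozen original encoder f_pi
    with torch.no_grad():
        teacher_feats = teacher(img_px)
    l_vis = vision_loss(teacher_feats, img_recon)

    return l_lan + lam * l_vis
```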