LITA: Language Instructed Temporal-Localization Assistant
Contents
- Abstract
- Introduction
- Related Work
- Methodology - Language Instructed Temporal Localization
- Architecture
- Time Tokens
- SlowFast Visual Tokens
- Training Tasks
- Reasoning Temporal Localization (RTL)
- Experiments
- Conclusion
0. Abstract
Existing Video LLMs cannot answer “When?” questions
\(\rightarrow\) Due to lack of temporal localization
Key shortcomings: Time representation, Architecture, Data.
Proposal: LITA
- Relative time tokens: To represent timestamps.
- SlowFast tokens: To capture temporal resolution efficiently
- Temporal localization–focused data + New task/dataset:
- Reasoning Temporal Localization (RTL) with ActivityNet-RTL.
Experiments
- Nearly doubles baseline temporal mIoU
- Improves video-based text generation
- (e.g., +36% relative improvement in Temporal Understanding).
1. Introduction
Background
- LLMs are strong instruction followers
- Multimodal LLMs extend them to video.
- Videos require temporal localization, absent in current Video LLMs (e.g., Video-LLaMA).
Challenges:
- Time as Text: Ambiguous without frame rate
- Architectures with insufficient temporal resolution
- Lack of timestamp-labeled training data
Solution: LITA addresses these with
- (1) Time tokens
- (2) SlowFast architecture
- (3) RTL data/task
2. Related Work
- Multimodal LLMs: Flamingo, LLaVA, etc.
  → Adapt vision to LLMs via cross-attention, projection, adapters
- Video LLMs: Video-LLaMA, VideoChat, etc.
  → Effective for content questions, but poor at temporal localization
- Temporal localization research: Action detection, event grounding, dense captioning
  → But not integrated with instruction-following LLMs.
Related works: VTimeLLM, TimeChat, Momentor.
LITA adds reasoning aspect to temporal localization.
3. Methodology – Language Instructed Temporal Localization (Detailed)
(1) Architecture
- Base: LLaVA-like Video LLM
- Input:
  - Uniformly select \(T\) frames
  - Each frame encoded into \(M\) visual tokens
- Apply “SlowFast pooling”
  → Produces \(T + M\) tokens instead of \(T \times M\), avoiding token explosion
  - Fast tokens:
    - High temporal resolution
    - Low spatial detail (one per frame)
  - Slow tokens:
    - Low temporal resolution
    - High spatial detail (from subsampled frames)
- Concatenate with “language tokens” (including time tokens).
- Output:
  - Localized temporal answer in relative time tokens (<1> …)
  - Easily converted back to real timestamps
- Advantage: scalable, efficient, and directly answers “When?” with timestamp tokens (see the sketch below).
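To make the flow concrete, here is a minimal sketch of the inference path (not the authors’ code; `frame_encoder`, `slowfast_pool`, `tokenizer`, and `llm` are hypothetical placeholders supplied by the caller):

```python
def localize_when(frames, question, frame_encoder, slowfast_pool, tokenizer, llm,
                  video_length_sec, T=100):
    """Sketch: T frames + question -> answer text containing relative time tokens."""
    per_frame = frame_encoder(frames)            # (T, M, D): M visual tokens per frame
    visual = slowfast_pool(per_frame)            # (T + M, D): SlowFast-pooled tokens
    prompt = tokenizer.encode(question)          # language tokens, incl. <1> ... <T>
    answer_ids = llm.generate(visual, prompt)    # e.g. decodes to "<25> <30> ..."
    answer = tokenizer.decode(answer_ids)
    # A relative time token <t> maps back to seconds via
    # tau = video_length_sec * (t - 1) / (T - 1).
    return answer
```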
(2) Time Tokens
- Replace ambiguous absolute timestamps (e.g., “01:22”) with relative time tokens <1> … <T>
- Mapping formula (sketched in code below):
  \(t = \text{round}\left(\frac{\tau (T-1)}{L}\right)+1, \quad \tau = \frac{L (t-1)}{T-1}\)
  where \(\tau\) is a continuous timestamp and \(L\) = video length.
- Benefit: model learns temporal reasoning independent of frame rate.
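A minimal pair of helpers implementing this mapping (a sketch, assuming \(T\) relative time tokens and video length \(L\) in seconds):

```python
def timestamp_to_token(tau: float, L: float, T: int = 100) -> int:
    """Continuous timestamp tau (seconds) -> relative time token index t in [1, T]."""
    return round(tau * (T - 1) / L) + 1

def token_to_timestamp(t: int, L: float, T: int = 100) -> float:
    """Relative time token index t -> approximate timestamp in seconds."""
    return L * (t - 1) / (T - 1)

# Example: in a 200 s video, 50 s maps to token <26>, which maps back to ~50.5 s.
assert timestamp_to_token(50.0, L=200.0) == 26
assert abs(token_to_timestamp(26, L=200.0) - 50.5) < 0.01
```

A side effect of the mapping: recovered timestamps are quantized to a granularity of roughly \(L/(T-1)\) seconds, which is the spacing between adjacent time tokens.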
(3) SlowFast Visual Tokens
- Fast pathway:
  - \(T\) fast tokens (one per frame × \(T\) frames)
  - Each averages all \(M\) tokens from one frame → Temporal detail
- Slow pathway:
  - Sample \(s^2\) frames (with \(s=2\)), apply \(s \times s\) spatial pooling → \(\frac{M}{s^2}\) tokens per frame → Spatial richness
- Final representation = \(T\) fast + \(M\) slow tokens (instead of \(T \times M\))
- Example: \(T=100\), \(M=256\) (CLIP-L/14)
  → Reduce from 25,600 tokens to 356 tokens (see the pooling sketch below).
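A sketch of how this pooling could be implemented (assumptions: PyTorch, the \(M\) tokens per frame form a square spatial grid, and \(s = 2\)); with \(T = 100\) and \(M = 256\) it reproduces the 356 tokens above:

```python
import torch
import torch.nn.functional as F

def slowfast_pool(visual: torch.Tensor, s: int = 2) -> torch.Tensor:
    """visual: (T, M, D) tokens from T frames -> (T + M, D) fast + slow tokens."""
    T, M, D = visual.shape
    side = int(M ** 0.5)                         # assume an M = side x side token grid

    # Fast pathway: average the M spatial tokens of every frame -> T tokens.
    fast = visual.mean(dim=1)                    # (T, D)

    # Slow pathway: uniformly pick s*s frames, spatially pool each by s x s.
    idx = torch.linspace(0, T - 1, steps=s * s).long()
    grid = visual[idx].reshape(s * s, side, side, D).permute(0, 3, 1, 2)
    slow = F.avg_pool2d(grid, kernel_size=s)     # (s*s, D, side/s, side/s)
    slow = slow.flatten(2).transpose(1, 2).reshape(-1, D)   # (s*s * M/s^2, D) = (M, D)

    return torch.cat([fast, slow], dim=0)        # (T + M, D)

tokens = slowfast_pool(torch.randn(100, 256, 64))   # toy embedding dim of 64
print(tokens.shape)                                  # torch.Size([356, 64])
```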
(4) Training Tasks
- Train on five tasks to build temporal + reasoning skills:
  1. Dense video captioning (events with start/end timestamps).
  2. Event localization (find [start, end] for a described event).
  3. Video QA (short answers).
  4. Natural language VQA (instruction-tuned, conversational).
  5. Reasoning Temporal Localization (RTL) (timestamps + explanation).
- Tasks 1, 2, and 5 explicitly enforce timestamp learning.
- RTL answers = [start, end] + explanation
  → Requires reasoning beyond direct description!
a) Dense Video Captioning
- Definition: Describe a video with multiple sentences, each tied to a start and end timestamp
- Format: <start time> <end time> Sentence
- Purpose: Teach the model to produce temporally grounded descriptions, aligning language with video segments.
- Example:
- Input: “Provide a detailed description of the given video. Each sentence should begin with start and end timestamps.”
- Output:
- <1><5> A woman is standing.
- <6><15> The woman starts dancing.
- <16><20> She lies on the floor.
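As a sketch, the training targets for this task could be assembled from (start, end, sentence) annotations using the relative-time mapping from Section 3 (the helpers and the example clip below are hypothetical):

```python
def to_time_token(tau: float, L: float, T: int = 100) -> str:
    """Timestamp in seconds -> relative time token string, per the mapping above."""
    return f"<{round(tau * (T - 1) / L) + 1}>"

def dense_caption_target(events, video_length: float) -> str:
    """events: list of (start_sec, end_sec, sentence) -> '<t1> <t2> Sentence' lines."""
    return "\n".join(
        f"{to_time_token(s, video_length)} {to_time_token(e, video_length)} {sent}"
        for s, e, sent in events
    )

# Hypothetical 60 s clip with two annotated events:
print(dense_caption_target(
    [(0.0, 3.0, "A woman is standing."), (3.0, 9.0, "The woman starts dancing.")],
    video_length=60.0,
))
# <1> <6> A woman is standing.
# <6> <16> The woman starts dancing.
```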
b) Event Localization
- Definition: Localize the temporal boundaries of a specific event described in natural language.
- Format: <start time> <end time>
- Purpose: Teach precise timestamp prediction for queried events.
- Example:
- Question: “When does the man jump into the pool?”
- Answer: <25><30>
c) Video Question Answering (VQA)
- Definition: Answer natural language questions about video content.
- Characteristic: Datasets often contain short answers (single words/phrases).
- Prompting Trick: Add “Answer the question using a single word or phrase” to avoid LLM generating long sentences.
- Purpose: Strengthen fine-grained video comprehension and retrieval.
- Example:
- Q: “What is the man holding?”
- A: “A guitar.”
d) Natural Language VQA (NLVQA)
- Definition: Visual instruction tuning datasets where answers are full natural language sentences.
- Purpose: Encourage conversational, human-like answers beyond short phrases.
- Effect: Without this, models trained only on standard VQA tend to output terse answers.
- Example:
- Q: “What happens after the cat jumps onto the table?”
- A: “The cat knocks over a glass of water and then walks away.”
e) Reasoning Temporal Localization (RTL)
- Definition: The most distinctive new task. The question asks “When?” but the event is not explicitly mentioned.
- Answer Structure: [start end] Explanation
- Purpose: Combine temporal localization with reasoning & world knowledge.
- Dataset: ActivityNet-RTL, generated with GPT-4 from ActivityNet Captions.
- Example
- Q: “When is the woman the least active in the video?”
- A: [32s 36s] The woman is sleeping during this time, which is less active compared to standing or dancing.
- Impact: Forces LITA to use both temporal understanding and reasoning ability, unique compared to prior Video LLMs.
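Since an RTL answer mixes timestamps with free-form text, evaluation must split the two; a small parsing sketch (the answer format is taken from above, the regex itself is an assumption):

```python
import re

def parse_rtl_answer(answer: str):
    """Split an RTL answer like '[32s 36s] The woman is sleeping ...' into
    (start_sec, end_sec, explanation)."""
    m = re.match(r"\[\s*([\d.]+)\s*s?\s+([\d.]+)\s*s?\s*\]\s*(.*)", answer, re.S)
    if m is None:
        return None, None, answer.strip()        # no timestamps found
    return float(m.group(1)), float(m.group(2)), m.group(3).strip()

print(parse_rtl_answer("[32s 36s] The woman is sleeping during this time."))
# (32.0, 36.0, 'The woman is sleeping during this time.')
```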
Summary of Roles
- Dense Video Captioning → general temporal alignment.
- Event Localization → precise timestamp extraction.
- VQA → content recognition, short factual answers.
- NLVQA → conversational fluency and explanation skills.
- RTL → joint reasoning + temporal localization (LITA’s signature).
4. Reasoning Temporal Localization (RTL)
- Problem: “When?” question
- Answer = timestamps + explanation.
- Dataset: ActivityNet-RTL
- Built from ActivityNet Captions (10k+ videos).
- GPT-4 generates reasoning-based questions/answers.
- Training set: 33,557 Q-A pairs; Evaluation: 229 Q-A pairs (manually curated for reasoning).
- Metrics: mIoU, Precision@0.5, GPT-4 Relative Score (for explanations).
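The localization metrics are standard interval measures; a minimal sketch of mIoU and Precision@0.5 over predicted vs. ground-truth [start, end] pairs (the GPT-4 relative score for explanations is not reproduced here):

```python
def interval_iou(pred, gt):
    """Temporal IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_precision(preds, gts, thresh=0.5):
    """preds, gts: aligned lists of [start, end]. Returns (mean IoU, Precision@thresh)."""
    ious = [interval_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious), sum(i >= thresh for i in ious) / len(ious)

print(miou_and_precision([[30, 36]], [[32, 36]]))   # (0.666..., 1.0)
```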
5. Experiments
- Implementation:
- Visual encoder = CLIP ViT-L/14
- LLM = Vicuna.
- \(T=100\) frames, \(100\) time tokens
- Model sizes: 7B & 13B
- Datasets: ActivityNet-Captions, YouCook2, NExT-QA, LLaVA-150K, ActivityNet-RTL.
- Results (Table 1, p.6): LITA nearly doubles baseline mIoU/Precision@0.5, improves GPT explanation score.
- Qualitative (Fig. 4, p.7): LITA localizes events with detail (e.g., “roasting marshmallows”), vs. generic outputs of Video-LLaMA-v2.
- Video-based text generation benchmark (Table 2, p.7): LITA surpasses Video-ChatGPT (+22% correctness, +36% temporal understanding).
- Ablations (Table 3, p.8):
- RTL-only training insufficient.
- Adding video tasks improves timestamp accuracy.
- Adding NLVQA improves reasoning & natural language fluency.
6. Conclusion
- LITA introduces time tokens + SlowFast tokens + RTL data/task.
- Achieves strong temporal localization + reasoning.
- Improves general video QA and generation as well.
- Future direction: more temporal datasets, extending to broader video understanding tasks.