LITA: Language Instructed Temporal-Localization Assistant
Contents
- Abstract
- Introduction
- Related Work
- Methodology - Language Instructed Temporal Localization
- Architecture
- Time Tokens
- SlowFast Visual Tokens
- Training Tasks
- Reasoning Temporal Localization (RTL)
- Experiments
- Conclusion
0. Abstract
Existing Video LLMs cannot answer “When?” questions
\(\rightarrow\) Due to lack of temporal localization
Key shortcomings: Time representation, Architecture, Data.
Proposal: LITA
- Relative time tokens: To represent timestamps.
- SlowFast tokens: To capture temporal resolution efficiently
- Temporal localization–focused data + New task/dataset:
- Reasoning Temporal Localization (RTL) with ActivityNet-RTL.
Experiments
- Nearly doubles baseline temporal mIoU
- Improves video-based text generation
- (e.g., +36% relative improvement in Temporal Understanding).
1. Introduction
Background
- LLMs are strong instruction followers
- Multimodal LLMs extend them to video.
- Videos require temporal localization, absent in current Video LLMs (e.g., Video-LLaMA).
Challenges:
- Time as Text: Ambiguous without frame rate
- Architectures with insufficient temporal resolution
- Lack of timestamp-labeled training data
Solution: LITA addresses these with
- (1) Time tokens
- (2) SlowFast architecture
- (3) RTL data/task
2. Related Work
- Multimodal LLMs: Flamingo, LLaVA, etc.
  → Adapt vision to LLMs via cross-attention, projection, adapters
- Video LLMs: Video-LLaMA, VideoChat, etc.
  → Effective for content questions, but poor at temporal localization
- Temporal localization research: Action detection, event grounding, dense captioning
  → But not integrated with instruction-following LLMs.
Related works: VTimeLLM, TimeChat, Momentor.
LITA adds reasoning aspect to temporal localization.
3. Methodology – Language Instructed Temporal Localization (Detailed)
(1) Architecture
- Base: LLaVA-like Video LLM
- Input:
  - Uniformly select \(T\) frames
  - Each frame encoded into \(M\) visual tokens
- Apply “SlowFast pooling”
  → Produces \(T + M\) tokens instead of \(T \times M\), avoiding token explosion
  - Fast tokens:
    - High temporal resolution
    - Low spatial detail (one per frame)
  - Slow tokens:
    - Low temporal resolution
    - High spatial detail (from subsampled frames)
- Concatenate with “language tokens” (including time tokens).
- Output:
  - Localized temporal answer in relative time tokens (<1> …)
  - Easily converted back to real timestamps
- Advantage: scalable, efficient, and directly answers “When?” with timestamp tokens (see the sketch below).
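To make the flow concrete, here is a minimal sketch of the inference path (not the authors’ code; `frame_encoder`, `slowfast_pool`, `tokenizer`, and `llm` are hypothetical placeholders supplied by the caller):

```python
def localize_when(frames, question, frame_encoder, slowfast_pool, tokenizer, llm,
                  video_length_sec, T=100):
    """Sketch: T frames + question -> answer text containing relative time tokens."""
    per_frame = frame_encoder(frames)            # (T, M, D): M visual tokens per frame
    visual = slowfast_pool(per_frame)            # (T + M, D): SlowFast-pooled tokens
    prompt = tokenizer.encode(question)          # language tokens, incl. <1> ... <T>
    answer_ids = llm.generate(visual, prompt)    # e.g. decodes to "<25> <30> ..."
    answer = tokenizer.decode(answer_ids)
    # A relative time token <t> maps back to seconds via
    # tau = video_length_sec * (t - 1) / (T - 1).
    return answer
```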
(2) Time Tokens
- Replace ambiguous absolute timestamps (e.g., “01:22”) with relative time tokens <1> … <T>
- Mapping formula (sketched in code below):
  \(t = \text{round}\left(\frac{\tau (T-1)}{L}\right)+1, \quad \tau = \frac{L (t-1)}{T-1}\)
  where \(\tau\) is a continuous timestamp and \(L\) = video length.
- Benefit: model learns temporal reasoning independent of frame rate.
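A minimal pair of helpers implementing this mapping (a sketch, assuming \(T\) relative time tokens and video length \(L\) in seconds):

```python
def timestamp_to_token(tau: float, L: float, T: int = 100) -> int:
    """Continuous timestamp tau (seconds) -> relative time token index t in [1, T]."""
    return round(tau * (T - 1) / L) + 1

def token_to_timestamp(t: int, L: float, T: int = 100) -> float:
    """Relative time token index t -> approximate timestamp in seconds."""
    return L * (t - 1) / (T - 1)

# Example: in a 200 s video, 50 s maps to token <26>, which maps back to ~50.5 s.
assert timestamp_to_token(50.0, L=200.0) == 26
assert abs(token_to_timestamp(26, L=200.0) - 50.5) < 0.01
```

A side effect of the mapping: recovered timestamps are quantized to a granularity of roughly \(L/(T-1)\) seconds, which is the spacing between adjacent time tokens.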
(3) SlowFast Visual Tokens
- Fast pathway:
  - \(T\) fast tokens (one per frame × \(T\) frames)
  - Each averages all \(M\) tokens from one frame → Temporal detail
- Slow pathway:
  - Sample \(s^2\) frames (with \(s=2\)), apply \(s \times s\) spatial pooling → \(\frac{M}{s^2}\) tokens per frame → Spatial richness
- Final representation = \(T\) fast + \(M\) slow tokens (instead of \(T \times M\))
- Example: \(T=100\), \(M=256\) (CLIP-L/14)
  → Reduce from 25,600 tokens to 356 tokens (see the pooling sketch below).
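A sketch of how this pooling could be implemented (assumptions: PyTorch, the \(M\) tokens per frame form a square spatial grid, and \(s = 2\)); with \(T = 100\) and \(M = 256\) it reproduces the 356 tokens above:

```python
import torch
import torch.nn.functional as F

def slowfast_pool(visual: torch.Tensor, s: int = 2) -> torch.Tensor:
    """visual: (T, M, D) tokens from T frames -> (T + M, D) fast + slow tokens."""
    T, M, D = visual.shape
    side = int(M ** 0.5)                         # assume an M = side x side token grid

    # Fast pathway: average the M spatial tokens of every frame -> T tokens.
    fast = visual.mean(dim=1)                    # (T, D)

    # Slow pathway: uniformly pick s*s frames, spatially pool each by s x s.
    idx = torch.linspace(0, T - 1, steps=s * s).long()
    grid = visual[idx].reshape(s * s, side, side, D).permute(0, 3, 1, 2)
    slow = F.avg_pool2d(grid, kernel_size=s)     # (s*s, D, side/s, side/s)
    slow = slow.flatten(2).transpose(1, 2).reshape(-1, D)   # (s*s * M/s^2, D) = (M, D)

    return torch.cat([fast, slow], dim=0)        # (T + M, D)

tokens = slowfast_pool(torch.randn(100, 256, 64))   # toy embedding dim of 64
print(tokens.shape)                                  # torch.Size([356, 64])
```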
(4) Training Tasks
- Train on five tasks to build temporal + reasoning skills:
  1. Dense video captioning (events with start/end timestamps).
  2. Event localization (find [start, end] for a described event).
  3. Video QA (short answers).
  4. Natural language VQA (instruction-tuned, conversational).
  5. Reasoning Temporal Localization (RTL) (timestamps + explanation).
- Tasks 1, 2, and 5 explicitly enforce timestamp learning.
- RTL answers = [start, end] + explanation
  → Requires reasoning beyond direct description!
a) Dense Video Captioning
- Definition: Describe a video with multiple sentences, each tied to a start and end timestamp
- Format: <start time> <end time> Sentence
- Purpose: Teach the model to produce temporally grounded descriptions, aligning language with video segments.
- Example:
- Input: “Provide a detailed description of the given video. Each sentence should begin with start and end timestamps.”
- Output:
- <1><5> A woman is standing.
- <6><15> The woman starts dancing.
- <16><20> She lies on the floor.
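As a sketch, the training targets for this task could be assembled from (start, end, sentence) annotations using the relative-time mapping from Section 3 (the helpers and the example clip below are hypothetical):

```python
def to_time_token(tau: float, L: float, T: int = 100) -> str:
    """Timestamp in seconds -> relative time token string, per the mapping above."""
    return f"<{round(tau * (T - 1) / L) + 1}>"

def dense_caption_target(events, video_length: float) -> str:
    """events: list of (start_sec, end_sec, sentence) -> '<t1> <t2> Sentence' lines."""
    return "\n".join(
        f"{to_time_token(s, video_length)} {to_time_token(e, video_length)} {sent}"
        for s, e, sent in events
    )

# Hypothetical 60 s clip with two annotated events:
print(dense_caption_target(
    [(0.0, 3.0, "A woman is standing."), (3.0, 9.0, "The woman starts dancing.")],
    video_length=60.0,
))
# <1> <6> A woman is standing.
# <6> <16> The woman starts dancing.
```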
b) Event Localization
- Definition: Localize the temporal boundaries of a specific event described in natural language.
- Format: <start time> <end time>
- Purpose: Teach precise timestamp prediction for queried events.
- Example:
- Question: “When does the man jump into the pool?”
- Answer: <25><30>
c) Video Question Answering (VQA)
- Definition: Answer natural language questions about video content.
- Characteristic: Datasets often contain short answers (single words/phrases).
- Prompting Trick: Add “Answer the question using a single word or phrase” to avoid LLM generating long sentences.
- Purpose: Strengthen fine-grained video comprehension and retrieval.
- Example:
- Q: “What is the man holding?”
- A: “A guitar.”
d) Natural Language VQA (NLVQA)
- Definition: Visual instruction tuning datasets where answers are full natural language sentences.
- Purpose: Encourage conversational, human-like answers beyond short phrases.
- Effect: Without this, models trained only on standard VQA tend to output terse answers.
- Example:
- Q: “What happens after the cat jumps onto the table?”
- A: “The cat knocks over a glass of water and then walks away.”
e) Reasoning Temporal Localization (RTL)
- Definition: The most distinctive new task. The question asks “When?” but the event is not explicitly mentioned.
- Answer Structure: [start end] Explanation
- Purpose: Combine temporal localization with reasoning & world knowledge.
- Dataset: ActivityNet-RTL, generated with GPT-4 from ActivityNet Captions.
- Example
- Q: “When is the woman the least active in the video?”
- A: [32s 36s] The woman is sleeping during this time, which is less active compared to standing or dancing.
- Impact: Forces LITA to use both temporal understanding and reasoning ability, unique compared to prior Video LLMs.
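Since an RTL answer mixes timestamps with free-form text, evaluation must split the two; a small parsing sketch (the answer format is taken from above, the regex itself is an assumption):

```python
import re

def parse_rtl_answer(answer: str):
    """Split an RTL answer like '[32s 36s] The woman is sleeping ...' into
    (start_sec, end_sec, explanation)."""
    m = re.match(r"\[\s*([\d.]+)\s*s?\s+([\d.]+)\s*s?\s*\]\s*(.*)", answer, re.S)
    if m is None:
        return None, None, answer.strip()        # no timestamps found
    return float(m.group(1)), float(m.group(2)), m.group(3).strip()

print(parse_rtl_answer("[32s 36s] The woman is sleeping during this time."))
# (32.0, 36.0, 'The woman is sleeping during this time.')
```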
Summary of Roles
- Dense Video Captioning → general temporal alignment.
- Event Localization → precise timestamp extraction.
- VQA → content recognition, short factual answers.
- NLVQA → conversational fluency and explanation skills.
- RTL → joint reasoning + temporal localization (LITA’s signature).
4. Reasoning Temporal Localization (RTL)
- Problem: “When?” question
- Answer = timestamps + explanation.
- Dataset: ActivityNet-RTL
- Built from ActivityNet Captions (10k+ videos).
- GPT-4 generates reasoning-based questions/answers.
- Training set: 33,557 Q-A pairs; Evaluation: 229 Q-A pairs (manually curated for reasoning).
- Metrics: mIoU, Precision@0.5, GPT-4 Relative Score (for explanations).
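The localization metrics are standard interval measures; a minimal sketch of mIoU and Precision@0.5 over predicted vs. ground-truth [start, end] pairs (the GPT-4 relative score for explanations is not reproduced here):

```python
def interval_iou(pred, gt):
    """Temporal IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_precision(preds, gts, thresh=0.5):
    """preds, gts: aligned lists of [start, end]. Returns (mean IoU, Precision@thresh)."""
    ious = [interval_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious), sum(i >= thresh for i in ious) / len(ious)

print(miou_and_precision([[30, 36]], [[32, 36]]))   # (0.666..., 1.0)
```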
5. Experiments
- Implementation:
- Visual encoder = CLIP ViT-L/14
- LLM = Vicuna.
- \(T=100\) frames, \(100\) time tokens
- Model sizes: 7B & 13B
- Datasets: ActivityNet-Captions, YouCook2, NExT-QA, LLaVA-150K, ActivityNet-RTL.
- Results (Table 1, p.6): LITA nearly doubles baseline mIoU/Precision@0.5, improves GPT explanation score.
- Qualitative (Fig. 4, p.7): LITA localizes events with detail (e.g., “roasting marshmallows”), vs. generic outputs of Video-LLaMA-v2.
- Video-based text generation benchmark (Table 2, p.7): LITA surpasses Video-ChatGPT (+22% correctness, +36% temporal understanding).
- Ablations (Table 3, p.8):
- RTL-only training insufficient.
- Adding video tasks improves timestamp accuracy.
- Adding NLVQA improves reasoning & natural language fluency.
6. Conclusion
- LITA introduces time tokens + SlowFast tokens + RTL data/task.
- Achieves strong temporal localization + reasoning.
- Improves general video QA and generation as well.
- Future direction: more temporal datasets, extending to broader video understanding tasks.