LITA: Language Instructed Temporal-Localization Assistant


Contents

  0. Abstract
  1. Introduction
  2. Related Work
  3. Methodology - Language Instructed Temporal Localization
    1. Architecture
    2. Time Tokens
    3. SlowFast Visual Tokens
    4. Training Tasks
  4. Reasoning Temporal Localization (RTL)
  5. Experiments
  6. Conclusion


0. Abstract

Existing Video LLMs cannot answer “When?” questions

\(\rightarrow\) Due to lack of temporal localization


Key shortcomings: Time representation, Architecture, Data.


Proposal: LITA

  1. Relative time tokens: To represent timestamps.
  2. SlowFast tokens: To capture temporal resolution efficiently
  3. Temporal localization–focused data + New task/dataset:
    • Reasoning Temporal Localization (RTL) with ActivityNet-RTL.


Experiments

  • Nearly doubles baseline temporal mIoU
  • Improves video-based text generation
    • (e.g., +36% relative improvement in Temporal Understanding).


1. Introduction

Background

  • LLMs are strong instruction followers
  • Multimodal LLMs extend them to video.
  • Videos require temporal localization, absent in current Video LLMs (e.g., Video-LLaMA).


Challenges:

  • Time as Text: Ambiguous without frame rate
  • Architectures with insufficient temporal resolution
  • Lack of timestamp-labeled training data

Solution: LITA addresses these with

  • (1) Time tokens
  • (2) SlowFast architecture
  • (3) RTL data/task


2. Related Work

  • Multimodal LLMs: Flamingo, LLaVA, etc.

    → Adapt vision to LLMs via cross-attention, projection, adapters

  • Video LLMs: Video-LLaMA, VideoChat, etc.

    → Effective for content questions, but poor at temporal localization

  • Temporal localization research: Action detection, event grounding, dense captioning

    \(\rightarrow\) But not integrated with instruction-following LLMs.


Related works: VTimeLLM, TimeChat, Momentor.

LITA adds a reasoning aspect to temporal localization.


3. Methodology – Language Instructed Temporal Localization (Detailed)

[Figure 2: LITA architecture overview]


(1) Architecture

  • Base: LLaVA-like Video LLM

  • Input:

    • Uniformly select \(T\) frames
    • Each frame is encoded into \(M\) visual tokens
  • Apply “SlowFast pooling”

    → Produces \(T + M\) tokens instead of \(T \times M\), avoiding token explosion

    • Fast tokens:
      • High temporal resolution
      • Low spatial detail (one per frame).
    • Slow tokens:
      • Low temporal resolution
      • High spatial detail (from subsampled frames)
  • Concatenate with “language tokens” (including time tokens).

  • Output:

    • Localized temporal answer in relative time tokens (<1> … ),
    • Easily converted back to real timestamps
  • Advantage: scalable, efficient, and directly answers “When?” with timestamp tokens.


(2) Time Tokens

  • Replace (A) with (B)

    • (A) Ambiguous absolute timestamps (e.g., “01:22”)
    • (B) Relative time tokens <1> … <T>, each indexing one of \(T\) equal-length chunks of the video
  • Mapping formula:

    • \(t = \text{round}\left(\frac{\tau (T-1)}{L}\right)+1, \quad \tau = \frac{L (t-1)}{T-1}\).

    where \(\tau\) is a continuous timestamp (in seconds), \(L\) is the video length, and \(T\) is the number of time tokens.

  • Benefit: the model learns temporal reasoning independent of frame rate (a round-trip conversion sketch follows this list).
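
A minimal round-trip sketch of this mapping (function names are illustrative, not LITA's API; \(T=100\) matches the paper's setting):

```python
def timestamp_to_token(tau: float, video_len: float, T: int = 100) -> int:
    """Continuous timestamp tau (seconds) -> relative time token index t in [1, T]."""
    return round(tau * (T - 1) / video_len) + 1

def token_to_timestamp(t: int, video_len: float, T: int = 100) -> float:
    """Relative time token index t -> continuous timestamp (seconds)."""
    return video_len * (t - 1) / (T - 1)

# Worked example: a 200 s video, tau = 66 s
t = timestamp_to_token(66.0, 200.0)    # round(66 * 99 / 200) + 1 = 34
tau = token_to_timestamp(t, 200.0)     # 200 * 33 / 99 ≈ 66.7 s
print(t, tau)
```

With \(T = 100\), the quantization error is at most \(\frac{L}{2(T-1)}\), i.e., about one second for a 200 s video.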


(3) SlowFast Visual Tokens

  • Fast pathway:

    • \(T\) fast tokens (1 per frame × \(T\) frames)
    • Each is the average of all \(M\) tokens of one frame → Temporal detail
  • Slow pathway:

    • Sample \(s^2\) frames (with \(s=2\)), apply spatial pooling → \(\frac{M}{s^2}\) tokens per frame → Spatial richness
  • Final representation = \(T\) fast + \(M\) slow tokens (instead of \(T \times M\))

  • Example: \(T=100\), \(M=256\) (CLIP ViT-L/14)

    → Reduces from 25,600 tokens to 356 tokens (a pooling sketch follows this list).
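
A minimal numpy sketch of this pooling, assuming the \(M\) tokens of a frame form a square spatial grid (16×16 for CLIP ViT-L/14 at 224×224) whose side is divisible by \(s\); names are illustrative:

```python
import numpy as np

def slowfast_pool(frame_tokens: np.ndarray, s: int = 2) -> np.ndarray:
    """frame_tokens: (T, M, D) per-frame visual tokens -> (T + M, D) SlowFast tokens."""
    T, M, D = frame_tokens.shape
    side = int(np.sqrt(M))                 # assume the M tokens form a side x side grid

    # Fast pathway: average the M tokens of every frame -> one token per frame.
    fast = frame_tokens.mean(axis=1)       # (T, D)

    # Slow pathway: uniformly pick s^2 frames, then s x s spatial average pooling
    # -> M / s^2 tokens per selected frame, i.e., s^2 * (M / s^2) = M slow tokens total.
    idx = np.linspace(0, T - 1, s * s, dtype=int)
    grid = frame_tokens[idx].reshape(s * s, side, side, D)
    pooled = grid.reshape(s * s, side // s, s, side // s, s, D).mean(axis=(2, 4))
    slow = pooled.reshape(M, D)

    return np.concatenate([fast, slow], axis=0)   # (T + M, D)

tokens = np.random.randn(100, 256, 1024)          # T=100, M=256 (16x16 grid)
print(slowfast_pool(tokens).shape)                # (356, 1024)
```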


(4) Training Tasks

  • Train on five tasks to build temporal + reasoning skills:

    1. Dense video captioning (events with start/end timestamps).
    2. Event localization (find [start, end] for a described event).
    3. Video QA (short answers).
    4. Natural language VQA (instruction-tuned, conversational).
    5. Reasoning Temporal Localization (RTL) (timestamps + explanation).
  • Tasks 1, 2, 5: Explicitly enforce timestamp learning.

  • RTL answers = [start, end] + explanation

    \(\rightarrow\) Requiring reasoning beyond direct description!


a) Dense Video Captioning

  • Definition: Describe a video with multiple sentences, each tied to a start and end timestamp
  • Format: <start time> <end time> Sentence
  • Purpose: Teach the model to produce temporally grounded descriptions, aligning language with video segments (a decoding sketch follows this example).
  • Example:
    • Input: “Provide a detailed description of the given video. Each sentence should begin with start and end timestamps.”
    • Output:
      • <1><5> A woman is standing.
      • <6><15> The woman starts dancing.
      • <16><20> She lies on the floor.
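
A small sketch of decoding such output back into seconds, reusing the time-token mapping from section (2); the regex assumes the <start><end> Sentence pattern shown above:

```python
import re

def token_to_timestamp(t: int, video_len: float, T: int = 100) -> float:
    return video_len * (t - 1) / (T - 1)

def decode_dense_captions(text: str, video_len: float, T: int = 100):
    """Parse '<start><end> Sentence' lines into (start_s, end_s, sentence) tuples."""
    events = []
    for line in text.strip().splitlines():
        m = re.match(r"<(\d+)>\s*<(\d+)>\s*(.*)", line.strip())
        if m:
            start = token_to_timestamp(int(m.group(1)), video_len, T)
            end = token_to_timestamp(int(m.group(2)), video_len, T)
            events.append((start, end, m.group(3)))
    return events

output = """<1><5> A woman is standing.
<6><15> The woman starts dancing.
<16><20> She lies on the floor."""
print(decode_dense_captions(output, video_len=99.0))
# [(0.0, 4.0, 'A woman is standing.'), (5.0, 14.0, ...), (15.0, 19.0, ...)]
```

The same decoding applies to the event localization answers (e.g., <25><30>) in the next subsection.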


b) Event Localization

  • Definition: Localize the temporal boundaries of a specific event described in natural language.
  • Format: <start time> <end time>
  • Purpose: Teach precise timestamp prediction for queried events.
  • Example:
    • Question: “When does the man jump into the pool?”
    • Answer: <25><30>


c) Video Question Answering (VQA)

  • Definition: Answer natural language questions about video content.
  • Characteristic: Datasets often contain short answers (single words/phrases).
  • Prompting Trick: Add “Answer the question using a single word or phrase” to prevent the LLM from generating long sentences.
  • Purpose: Strengthen fine-grained video comprehension and retrieval.
  • Example:
    • Q: “What is the man holding?”
    • A: “A guitar.”


d) Natural Language VQA (NLVQA)

  • Definition: Visual instruction tuning datasets where answers are full natural language sentences.
  • Purpose: Encourage conversational, human-like answers beyond short phrases.
  • Effect: Without this, models trained only on standard VQA tend to output terse answers.
  • Example:
    • Q: “What happens after the cat jumps onto the table?”
    • A: “The cat knocks over a glass of water and then walks away.”


e) Reasoning Temporal Localization (RTL)

  • Definition: The most distinctive new task. The question asks “When?” but the event is not explicitly mentioned.
  • Answer Structure: [start end] Explanation (a parsing sketch follows this list).
  • Purpose: Combine temporal localization with reasoning & world knowledge.
  • Dataset: ActivityNet-RTL, generated with GPT-4 from ActivityNet Captions.
  • Example
    • Q: “When is the woman the least active in the video?”
    • A: [32s 36s] The woman is sleeping during this time, which is less active compared to standing or dancing.
  • Impact: Forces LITA to use both temporal understanding and reasoning ability, unique compared to prior Video LLMs.
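
A minimal sketch of splitting an RTL answer into its two parts, assuming the “[start end] Explanation” string format of the example above (the parser name is illustrative):

```python
import re
from typing import Optional, Tuple

def parse_rtl_answer(answer: str) -> Optional[Tuple[float, float, str]]:
    """'[32s 36s] Explanation' -> (start_s, end_s, explanation), or None if unmatched."""
    m = re.match(r"\[\s*([\d.]+)s?\s+([\d.]+)s?\s*\]\s*(.*)", answer.strip(), re.S)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2)), m.group(3).strip()

start, end, why = parse_rtl_answer(
    "[32s 36s] The woman is sleeping during this time, which is less active."
)
print(start, end)   # 32.0 36.0
```

The timestamps feed the localization metrics in the next section, while the explanation is scored separately.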


Summary of Roles

  • Dense Video Captioning → general temporal alignment.
  • Event Localization → precise timestamp extraction.
  • VQA → content recognition, short factual answers.
  • NLVQA → conversational fluency and explanation skills.
  • RTL → joint reasoning + temporal localization (LITA’s signature).


4. Reasoning Temporal Localization (RTL)

  • Problem: “When?” question

  • Answer = timestamps + explanation.

  • Dataset: ActivityNet-RTL
    • Built from ActivityNet Captions (10k+ videos).
    • GPT-4 generates reasoning-based questions/answers.
    • Training set: 33,557 Q-A pairs; Evaluation: 229 Q-A pairs (manually curated for reasoning).
  • Metrics: mIoU, Precision@0.5, GPT-4 Relative Score (for explanations); a sketch of the localization metrics follows below.
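
A sketch of the two localization metrics, assuming predictions and ground truths are [start, end] intervals in seconds (the GPT-4 relative score for explanations is not reproduced here):

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

def precision_at(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth is at least thresh."""
    return sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts)) / len(preds)

preds = [(30.0, 37.0), (10.0, 20.0)]
gts   = [(32.0, 36.0), (50.0, 60.0)]
print(mean_iou(preds, gts), precision_at(preds, gts))   # ~0.286 and 0.5
```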



5. Experiments

  • Implementation:
    • Visual encoder = CLIP ViT-L/14
    • LLM = Vicuna.
    • \(T=100\) frames, \(100\) time tokens
    • Model sizes: 7B & 13B
  • Datasets: ActivityNet-Captions, YouCook2, NExT-QA, LLaVA-150K, ActivityNet-RTL.

  • Results (Table 1, p.6): LITA nearly doubles baseline mIoU/Precision@0.5, improves GPT explanation score.

  • Qualitative (Fig. 4, p.7): LITA localizes events with detail (e.g., “roasting marshmallows”), vs. generic outputs of Video-LLaMA-v2.

  • Video-based text generation benchmark (Table 2, p.7): LITA surpasses Video-ChatGPT (+22% correctness, +36% temporal understanding).

  • Ablations (Table 3, p.8):
    • RTL-only training insufficient.
    • Adding video tasks improves timestamp accuracy.
    • Adding NLVQA improves reasoning & natural language fluency.


6. Conclusion

  • LITA introduces time tokens + SlowFast tokens + RTL data/task.
  • Achieves strong temporal localization + reasoning.
  • Improves general video QA and generation as well.
  • Future direction: more temporal datasets, extending to broader video understanding tasks.
