Multimodal Video LLM Study Notes
1. Background
Multimodal video large language models (video LLMs) integrate visual, linguistic, and audio information.
Compared to image models, video tasks are significantly harder. The main pain points are:
- Fine-grained perception: Requires precise localization of objects, actions, and relationships within a video.
- Long-video understanding: Requires spanning long time horizons, capturing key events, and maintaining global coherence.
2. Fine-Grained Perception
2.1 Observed Issues
- Action recognition is too coarse — e.g., "pick up" vs. "lift" are easily confused.
- Model answers lack timestamps or keyframes as evidence.
- Different modalities (subtitles vs. video frames) are not synchronized.
2.2 Challenges
- Data: Fine-grained annotations are expensive; long-tail action samples are scarce.
- Compute: High-resolution video causes the token count to explode (see the arithmetic sketch after this list).
- Representation: Tension between local details and global semantics.
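To make the compute point concrete, here is a back-of-envelope sketch (assuming a ViT-style encoder with 14-pixel patches; the numbers are illustrative, not from any specific model):

```python
# Tokens per frame for a ViT-style patch encoder: (H / patch) * (W / patch).
# Doubling the resolution quadruples the per-frame token count.
def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    return (height // patch) * (width // patch)

print(tokens_per_frame(224, 224))  # 256 tokens per frame
print(tokens_per_frame(448, 448))  # 1024 tokens per frame: 4x for 2x resolution
```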
2.3 Methods
- Hierarchical action/event modeling: Global → local → atomic actions.
- Token pruning / compression: Retain keyframes or key patches to cut redundancy (a selection sketch follows this list).
- Multi-task training: Joint learning of classification + localization + QA.
- Evidence visualization: Output frame indices or bounding boxes.
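As referenced above, a minimal keyframe-selection sketch (my own illustration, not a specific published method): keep a frame only when its embedding differs enough from the last kept frame, so near-duplicate frames never become tokens.

```python
import numpy as np

def select_keyframes(features: np.ndarray, threshold: float = 0.2) -> list[int]:
    """features: (num_frames, dim) per-frame embeddings; returns kept indices."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        a, b = features[kept[-1]], features[i]
        cos_sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if 1.0 - cos_sim > threshold:  # far enough from the last keyframe
            kept.append(i)
    return kept

# Toy usage: 100 slowly drifting frame embeddings, so consecutive frames
# are redundant and only a few keyframes survive.
rng = np.random.default_rng(0)
frames = np.cumsum(0.05 * rng.standard_normal((100, 64)), axis=0) + 1.0
print(select_keyframes(frames, threshold=0.05))
```

Real systems score patches as well as whole frames, but the keep-if-novel loop is the core idea.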
3. Long-Video Understanding
3.1 Observed Issues
- Models can only process a few seconds of video at a time; anything beyond a few minutes is "forgotten".
- QA answers lack global context and are often off-topic.
- Summaries or retrieval results are biased toward "local" content, ignoring the main thread.
3.2 Challenges
- Token budget: Hour-scale videos contain hundreds of thousands of raw frames; they cannot be fed in directly (see the budget sketch after this list).
- Structural complexity: Films, lectures, and meetings often contain multiple scenes, characters, and events.
- Cross-modal asynchrony: Subtitles, audio, and actions do not necessarily occur simultaneously.
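A quick budget check (hypothetical numbers: 1 fps sampling, 196 tokens per frame, a 128k-token context) shows why raw frames cannot simply be concatenated:

```python
# Rough token budget for a sampled video versus an assumed context window.
def video_tokens(minutes: float, sample_fps: float, tokens_per_frame: int) -> int:
    return int(minutes * 60 * sample_fps) * tokens_per_frame

CONTEXT_WINDOW = 128_000  # assumed LLM context limit
for minutes in (1, 10, 60):
    total = video_tokens(minutes, sample_fps=1.0, tokens_per_frame=196)
    status = "fits" if total <= CONTEXT_WINDOW else "over budget"
    print(f"{minutes:>3} min -> {total:>9,} tokens ({status})")
```

Even at 1 fps, an hour of video is several times over a 128k window, which is what forces the hierarchical and retrieval-based methods below.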
3.3 Methods
- Hierarchical modeling: Shot → Scene → Event → Video.
- Retrieval-augmented processing: First locate key segments via indexing, then feed them into the large model for reasoning (a retrieval sketch follows this list).
- Memory mechanisms: Sliding window + cache, external memory banks, chain-of-summary.
- Multimodal collaboration: Combine subtitles (ASR), OCR, and audio to assist the visual stream.
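A minimal retrieval-augmented sketch (embed_segment, embed_text, and video_llm are hypothetical placeholders for whatever encoder and LLM the pipeline uses): index per-segment embeddings once, then per question retrieve the top-k segments and pass only those to the model.

```python
import numpy as np

def top_k_segments(seg_embs: np.ndarray, query_emb: np.ndarray, k: int = 3) -> list[int]:
    """seg_embs: (num_segments, dim). Returns indices of the k best segments."""
    seg_norm = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    q_norm = query_emb / np.linalg.norm(query_emb)
    scores = seg_norm @ q_norm                     # cosine similarity per segment
    return np.argsort(scores)[::-1][:k].tolist()   # highest-scoring first

# Usage sketch (placeholders, not a real API):
# seg_embs = np.stack([embed_segment(s) for s in segments])
# hits = top_k_segments(seg_embs, embed_text(question))
# answer = video_llm(question, [segments[i] for i in hits])
```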
4. Common Evaluation Metrics
- Fine-grained perception:
  - Action localization mAP@tIoU (a tIoU sketch follows these lists)
  - QA accuracy + timestamp consistency
  - Interpretability (match between evidence frames and answers)
- Long-video understanding:
  - QA accuracy (long context)
  - Retrieval Recall@K / mAP
  - Summary quality (ROUGE, human evaluation)
  - Computational efficiency (fps, GPU memory usage)
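For reference, the tIoU behind mAP@tIoU is just interval overlap over interval union; a minimal sketch with (start, end) timestamps in seconds:

```python
# Temporal IoU between a predicted and a ground-truth action segment.
# A prediction counts as a true positive at threshold t when tIoU >= t.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 7.0), (3.0, 8.0)))  # 4 / 6 ≈ 0.667
```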
5. Additional Thoughts
- Fine-grained perception feels more like a joint task at the intersection of computer vision and NLP (detection/segmentation/action recognition plus QA).
- Long-video understanding feels more like systems engineering: it needs data preprocessing (segmentation/indexing), hierarchical models, and retrieval-plus-planning at inference time.
- A likely future direction: multimodal collaborative memory + interpretable reasoning.
6. References
- Feather the Throttle: Revisiting Visual Token Pruning — arXiv:2412.13180
- Token Activation Map to Visually Explain Multimodal LLMs — arXiv:2506.23270
- GLM-4.5V / 4.1V-Thinking — arXiv:2507.01006
- Thinking with Images for Multimodal Reasoning — arXiv:2506.23918
- Vision as a Dialect — arXiv:2506.18898