Divid: Disentangled Spatial-Temporal Modeling within LLMs for Temporally Grounded Video Understanding

Yepeng Tang1,2, Weining Wang3,4, Longteng Guo3,4, Tongtian Yue3,4, Wenxuan Wang3,4, Chunjie Zhang1,2✉, Jing Liu3,4✉
1Institute of Information Science, School of Computer Science and Technology, Beijing Jiaotong University
2Visual Intelligence + X International Cooperation Joint Laboratory of MOE, School of Computer Science and Technology, Beijing Jiaotong University
3Institute of Automation, Chinese Academy of Sciences
4University of Chinese Academy of Sciences

Corresponding authors

Abstract

Recent advances in Video LLMs have improved video understanding performance, but temporally grounded understanding in long-form videos remains challenging. Most models encode video frames into a flat sequence of visual tokens, which are then processed together with textual input by the LLM. While effective for short videos, this approach becomes inefficient for long-form videos due to lengthy token sequences that exceed context limits and incur high computational costs. Slow-Fast architectures partially address this by separating temporal and spatial features during encoding, but these features are still processed jointly within the LLM, lacking true spatio-temporal disentanglement. Moreover, spatial features are typically sampled in a query-agnostic manner, risking the loss of task-relevant content. To address these limitations, we propose Divid, a novel dual-branch framework that explicitly disentangles spatial and temporal modeling within the LLM decoder. Specifically, the temporal branch processes densely sampled, low-resolution frames to effectively capture long-range motion dynamics, while the spatial branch selects a sparse set of high-resolution keyframes guided by temporal attention. To unify the two branches, we design a lightweight spatio-temporal soft-router that adaptively fuses temporal and spatial cues at the token level, conditioned on the input query. This disentangled architecture not only improves temporal alignment accuracy but also reduces computation by minimizing redundant visual processing. Furthermore, we introduce TempGCap, a large-scale dataset of 559K timestamp-grounded video-text pairs, providing rich temporal supervision. Extensive experiments on temporal grounding and grounded VideoQA benchmarks demonstrate the superior performance and efficiency of Divid.
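The abstract describes a token-level soft-router that fuses temporal and spatial cues conditioned on the query. As a rough illustration of that idea only, the following minimal NumPy sketch gates each temporal/spatial token pair with a query-conditioned sigmoid; the function name, the concatenation scheme, and the single learned projection `w_gate` are our own simplifications, not the paper's actual router.

```python
import numpy as np

def soft_route(temporal_tokens, spatial_tokens, query_emb, w_gate):
    """Illustrative token-level soft routing (not the paper's exact design).

    temporal_tokens: (n, d) tokens from the dense low-resolution branch
    spatial_tokens:  (n, d) tokens from the sparse high-resolution branch
    query_emb:       (d,)   embedding of the input query
    w_gate:          (3d, 1) hypothetical learned gating projection
    """
    n, d = temporal_tokens.shape
    # Condition the gate on the query by broadcasting it to every token pair.
    q = np.broadcast_to(query_emb, (n, d))
    feats = np.concatenate([temporal_tokens, spatial_tokens, q], axis=-1)  # (n, 3d)
    # Sigmoid gate in (0, 1): per-token mixing weight between the two branches.
    gate = 1.0 / (1.0 + np.exp(-feats @ w_gate))  # (n, 1)
    # Convex combination of temporal and spatial cues per token.
    return gate * temporal_tokens + (1.0 - gate) * spatial_tokens
```

Because the gate lies strictly in (0, 1), every fused token is an elementwise convex combination of its temporal and spatial counterparts, so neither branch's signal is ever fully discarded.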

Poster

Paper Poster

BibTeX

@inproceedings{tang2026divid,
    title={Divid: Disentangled Spatial-Temporal Modeling within LLMs for Temporally Grounded Video Understanding},
    author={Tang, Yepeng and Wang, Weining and Guo, Longteng and Yue, Tongtian and Wang, Wenxuan and Zhang, Chunjie and Liu, Jing},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026}
}