AVA: Towards Agentic Video Analytics with Vision Language Models

Yuxuan Yan; Shiqi Jiang; Ting Cao; Yifan Yang; Qianqian Yang; Yuanchao Shu; Yuqing Yang; Lili Qiu

Yuxuan Yan, Zhejiang University; Shiqi Jiang, Microsoft Research; Ting Cao, Tsinghua University; Yifan Yang, Microsoft Research; Qianqian Yang and Yuanchao Shu, Zhejiang University; Yuqing Yang and Lili Qiu, Microsoft Research

AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.

The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.

BibTeX

@inproceedings {316072,
author = {Yuxuan Yan and Shiqi Jiang and Ting Cao and Yifan Yang and Qianqian Yang and Yuanchao Shu and Yuqing Yang and Lili Qiu},
title = {{AVA}: Towards Agentic Video Analytics with Vision Language Models},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1939--1957},
url = {https://www.usenix.org/conference/nsdi26/presentation/yan},
publisher = {USENIX Association},
month = may
}
