Towards Comprehensive Reasoning in Vision-Language Models
ICCV 2025 Tutorial
10/19/2025 8:30-12:00 (GMT-10, Honolulu Time Zone)
Room 318A, Hawai'i Convention Center
Introduction
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.

Some statistics: Our page received 931 views on the day of the tutorial, and 218 people bookmarked the event in the agenda builder. During the tutorial, 158 participants attended via Zoom, and onsite attendance exceeded the room's seating capacity for most of the session, with at least 150 participants joining in person. We sincerely thank the audience for their great interest and enthusiastic participation. See onsite pictures at the bottom of this page.
Vision-Language Reasoning Tutorial Schedule
| Time | Session | Speaker |
|---|---|---|
| 8:30 - 8:35 | Opening Remark: Motivation and Overview [Abstract][Slides] — Abstract: Welcome to our comprehensive tutorial on reasoning in vision-language models. This opening session will set the stage by discussing the motivation behind developing reasoning capabilities in VLMs and provide an overview of the day's agenda, highlighting key challenges and opportunities in the field. | |
| 8:35 - 9:10 | Invited Talk: Native Multimodal Models: Architecture, Post-Training, and Evaluation [Abstract][Slides] — Abstract: As we strive toward more capable and general-purpose AI systems, the integration of vision and language in multimodal models (VLMs) has become a pivotal area of research. This talk explores three complementary advances that collectively push the boundaries of what native LMMs can achieve, spanning their architecture, post-training, and evaluation. We first introduce NEO, a family of native LMMs that seamlessly integrate visual and linguistic features within a shared semantic space, outperforming traditional modular models with only 390M image-text pairs. Next, we present Visual Jigsaw, a self-supervised post-training framework that enhances the visual reasoning capabilities of LMMs through a reinforcement learning approach, improving fine-grained perception, temporal reasoning, and 3D spatial understanding. Finally, we introduce RealUnify, a novel benchmark designed to assess the bidirectional synergy between understanding and generation in unified models, revealing that current architectures struggle to fully leverage this integration. Together, these works offer a comprehensive perspective on the challenges and opportunities in advancing unified multimodal models. | |
| 9:10 - 9:35 | Video-TT Challenge: Towards Advanced Video Reasoning and Understanding [Abstract][Slides] — Abstract: Join us for the exciting conclusion of the Video-TT Challenge, where we will announce results and recognize outstanding achievements in video reasoning and understanding. This session will showcase innovative approaches from participating teams and highlight breakthrough methodologies that advance the state of the art in video analysis and temporal reasoning. See more details at: https://sites.google.com/view/video-tt-challenge | Yuhao Dong, Yuanhan Zhang, Ziwei Liu, and Representative Teams |
| 9:35 - 10:10 | Invited Talk: Reasoning in Multimodal GUI Agents: An Exploration-Driven Perspective [Abstract][Slides] — Abstract: This talk explores how multimodal GUI agents can develop sophisticated reasoning capabilities through exploration-driven learning. We will discuss novel approaches to understanding user interfaces, spatial relationships, and interaction patterns, highlighting how agents can learn to navigate complex digital environments through systematic exploration and reasoning. | |
| 10:10 - 10:45 | Invited Talk: Mathematical Reasoning in Visual Contexts [Abstract][Slides] — Abstract: Mathematical reasoning presents unique challenges when combined with visual information. This talk will explore how vision-language models can be enhanced to solve mathematical problems that require understanding geometric relationships, interpreting charts and graphs, and reasoning about spatial configurations in mathematical contexts. | |
| 10:45 - 11:20 | Invited Talk: Chain-of-Look Visual Reasoning [Abstract][Slides] — Abstract: While multi-modal foundation models excel at describing images and answering simple questions, they still struggle at tasks requiring deliberate, step-by-step visual reasoning—including a capability as basic as accurately counting objects in a visual scene. This limitation stems from a fundamental reliance on one-shot, end-to-end inference, which bypasses the structured, iterative process inherent to human visual cognition. We introduce Chain-of-Look, a new visual reasoning paradigm that addresses this weakness by modeling sequential visual understanding. Rather than relying on direct regression, our method guides the model through a structured chain of visual attention, mirroring the progressive nature of human analysis. This approach leads to more accurate, robust, and explainable reasoning in challenging scenes. In initial benchmarks, Chain-of-Look outperforms both specialized models and general-purpose multi-modal systems at visual counting of dense surgical instruments. We further demonstrate its potential to benefit higher-level tasks such as human-object interaction modeling, establishing a new path toward more interpretable and reliable visual AI. | |
| 11:20 - 11:55 | Invited Talk: Grounding Anything in Images and Videos for Comprehensive Reasoning [Abstract][Slides] — Abstract: Understanding and reasoning about the visual world requires more than recognizing objects—it demands grounding language, actions, and abstract concepts in images and videos in a unified and interpretable way. This talk explores the emerging paradigm of comprehensive grounding, which bridges perception and reasoning by linking every visual element—objects, attributes, spatial relations, temporal dynamics, and textual cues—to structured representations that large models can reason over. I will present recent advances in grounding visual-language models that can “ground anything,” from static objects in images to dynamic events in videos, enabling consistent interpretation across modalities and tasks. The talk will also discuss how such grounding facilitates high-level reasoning, including question answering, action understanding, and causal inference, moving toward general-purpose vision-language agents that integrate perception, knowledge, and reasoning in a cohesive framework. | |
| 11:55 - 12:00 | Closing Remark [Abstract][Slides] — Abstract: Our tutorial concludes with a synthesis of key insights, future research directions, and practical takeaways. This closing session will summarize the day's discussions, highlight emerging trends in vision-language reasoning, and provide guidance for researchers and practitioners looking to advance this exciting field. | |
Organizers
Yujun Cai
Lecturer at University of Queensland
Dr. Yujun Cai is a Lecturer (Assistant Professor) at the University of Queensland, Australia. Previously, she was a Research Scientist at Meta Reality Labs in Seattle. She obtained her Ph.D. degree from Nanyang Technological University in 2021. Her research interests lie in multi-modal understanding and trustworthy large models.
Jun Liu
Professor at Lancaster University
Dr. Jun Liu is Professor and Chair in Digital Health at the School of Computing and Communications, Lancaster University. He earned his PhD from Nanyang Technological University in 2019, subsequently serving as faculty at Singapore University of Technology and Design from 2019 to 2024. Prior to his academic career, he worked at Tencent from 2014 to 2015.
Yiwei Wang
Assistant Professor at University of California, Merced
Dr. Yiwei Wang was an Applied Scientist at Amazon (Seattle) in 2023 and a Postdoc in the UCLA NLP Group in 2024. He obtained his Ph.D. degree from the National University of Singapore in 2023. Currently, he leads the UC Merced NLP Lab, where his team explores cutting-edge approaches to diffusion LLMs, multi-modal reasoning LLMs, and their applications in medicine, advertising, risk detection, signal processing, etc.
Invited Speakers
Kai-Wei Chang
Associate Professor at University of California, Los Angeles
Dr. Kai-Wei Chang is an Associate Professor at UCLA specializing in natural language processing, machine learning, and multimodal reasoning. His research particularly focuses on mathematical reasoning, structured prediction, and developing robust AI systems that can handle complex reasoning tasks.
Junsong Yuan
Professor at University at Buffalo, SUNY
Dr. Junsong Yuan is a Professor at University at Buffalo, specializing in computer vision, pattern recognition, and multimedia analysis. His research encompasses video understanding, human activity recognition, and multimodal learning with applications in surveillance, healthcare, and autonomous systems.
Ziwei Liu
Associate Professor at Nanyang Technological University
Dr. Ziwei Liu is an Associate Professor at NTU and leads the LMMs-Lab initiative. His research focuses on large multimodal models, computer vision, and machine learning. He has made significant contributions to visual understanding, generative models, and multimodal intelligence systems.
Chi Zhang
Assistant Professor at Westlake University
Dr. Chi Zhang is an Assistant Professor at Westlake University, focusing on multimodal AI and embodied intelligence. His research spans GUI automation, visual reasoning, and human-computer interaction, with particular expertise in developing intelligent agents that can understand and interact with digital interfaces.
Yuanhan Zhang
Ph.D. Student at Nanyang Technological University
Yuanhan Zhang's research interests lie in computer vision and deep learning. His work focuses on adapting foundation models—from vision to multimodal—for real-world applications, including benchmarking model performance and adapting models through parameter-efficient tuning, in-context learning, and instruction tuning.