Towards Comprehensive Reasoning in Vision-Language Models
ICCV 2025 Tutorial
10/19/2025 8:30-12:00 (GMT-10, Honolulu Time Zone)
Room 318A, Hawai'i Convention Center
Introduction
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.

Some statistics: Our page received 931 views on the day of the tutorial, and 218 people bookmarked the event in the agenda builder. During the tutorial, 158 participants attended via Zoom, and onsite attendance exceeded the room's seating capacity for most of the session, with at least 150 participants joining in person. We sincerely thank the audience for their great interest and enthusiastic participation. See onsite pictures at the bottom of this page.
Vision-Language Reasoning Tutorial Schedule
| Time | Session | Speaker |
|---|---|---|
| 8:30 - 8:35 | Opening Remark: Motivation and Overview [Abstract][Slides] — Abstract: Welcome to our comprehensive tutorial on reasoning in vision-language models. This opening session will set the stage by discussing the motivation behind developing reasoning capabilities in VLMs and provide an overview of the day's agenda, highlighting key challenges and opportunities in the field. | |
| 8:35 - 9:10 | Invited Talk: Native Multimodal Models: Architecture, Post-Training, and Evaluation [Abstract][Slides] — Abstract: As we strive toward more capable and general-purpose AI systems, the integration of vision and language in multimodal models (VLMs) has become a pivotal area of research. This talk explores three complementary advances that collectively push the boundaries of what native LMMs can achieve, spanning their architecture, post-training, and evaluation. We first introduce NEO, a family of native LMMs that seamlessly integrate visual and linguistic features within a shared semantic space, outperforming traditional modular models with only 390M image-text pairs. Next, we present Visual Jigsaw, a self-supervised post-training framework that enhances the visual reasoning capabilities of LMMs through a reinforcement learning approach, improving fine-grained perception, temporal reasoning, and 3D spatial understanding. Finally, we introduce RealUnify, a novel benchmark designed to assess the bidirectional synergy between understanding and generation in unified models, revealing that current architectures struggle to fully leverage this integration. Together, these works offer a comprehensive perspective on the challenges and opportunities in advancing unified multimodal models. | |
| 9:10 - 9:35 | Video-TT Challenge: Towards Advanced Video Reasoning and Understanding [Abstract][Slides] — Abstract: Join us for the exciting conclusion of the Video-TT Challenge, where we will announce results and recognize outstanding achievements in video reasoning and understanding. This session will showcase innovative approaches from participating teams and highlight breakthrough methodologies that advance the state of the art in video analysis and temporal reasoning. See more details at: https://sites.google.com/view/video-tt-challenge | Yuhao Dong, Yuanhan Zhang, Ziwei Liu, and Representative Teams |
| 9:35 - 10:10 | Invited Talk: Reasoning in Multimodal GUI Agents: An Exploration-Driven Perspective [Abstract][Slides] — Abstract: This talk explores how multimodal GUI agents can develop sophisticated reasoning capabilities through exploration-driven learning. We will discuss novel approaches to understanding user interfaces, spatial relationships, and interaction patterns, highlighting how agents can learn to navigate complex digital environments through systematic exploration and reasoning. | |
| 10:10 - 10:45 | Invited Talk: Mathematical Reasoning in Visual Contexts [Abstract][Slides] — Abstract: Mathematical reasoning presents unique challenges when combined with visual information. This talk will explore how vision-language models can be enhanced to solve mathematical problems that require understanding geometric relationships, interpreting charts and graphs, and reasoning about spatial configurations in mathematical contexts. | |
| 10:45 - 11:20 | Invited Talk: Chain-of-Look Visual Reasoning [Abstract][Slides] — Abstract: While multi-modal foundation models excel at describing images and answering simple questions, they still struggle at tasks requiring deliberate, step-by-step visual reasoning—including a capability as basic as accurately counting objects in a visual scene. This limitation stems from a fundamental reliance on one-shot, end-to-end inference, which bypasses the structured, iterative process inherent to human visual cognition. We introduce Chain-of-Look, a new visual reasoning paradigm that addresses this weakness by modeling sequential visual understanding. Rather than relying on direct regression, our method guides the model through a structured chain of visual attention, mirroring the progressive nature of human analysis. This approach leads to more accurate, robust, and explainable reasoning in challenging scenes. In initial benchmarks, Chain-of-Look outperforms both specialized models and general-purpose multi-modal systems at visual counting of dense surgical instruments. We further demonstrate its potential to benefit higher-level tasks such as human-object interaction modeling, establishing a new path toward more interpretable and reliable visual AI. | |
| 11:20 - 11:55 | Invited Talk: Grounding Anything in Images and Videos for Comprehensive Reasoning [Abstract][Slides] — Abstract: Understanding and reasoning about the visual world requires more than recognizing objects—it demands grounding language, actions, and abstract concepts in images and videos in a unified and interpretable way. This talk explores the emerging paradigm of comprehensive grounding, which bridges perception and reasoning by linking every visual element—objects, attributes, spatial relations, temporal dynamics, and textual cues—to structured representations that large models can reason over. I will present recent advances in grounding visual-language models that can “ground anything,” from static objects in images to dynamic events in videos, enabling consistent interpretation across modalities and tasks. The talk will also discuss how such grounding facilitates high-level reasoning, including question answering, action understanding, and causal inference, moving toward general-purpose vision-language agents that integrate perception, knowledge, and reasoning in a cohesive framework. | |
| 11:55 - 12:00 | Closing Remark [Abstract][Slides] — Abstract: Our tutorial concludes with a synthesis of key insights, future research directions, and practical takeaways. This closing session will summarize the day's discussions, highlight emerging trends in vision-language reasoning, and provide guidance for researchers and practitioners looking to advance this exciting field. | |
Organizers
Yujun Cai
Lecturer at University of Queensland
Dr. Yujun Cai is a Lecturer (Assistant Professor) at the University of Queensland, Australia. Previously, she was a Research Scientist at Meta Reality Labs in Seattle. She obtained her Ph.D. degree from Nanyang Technological University in 2021. Her research interests lie in multi-modal understanding and trustworthy large models.
Jun Liu
Professor at Lancaster University
Dr. Jun Liu is Professor and Chair in Digital Health at the School of Computing and Communications, Lancaster University. He earned his PhD from Nanyang Technological University in 2019, subsequently serving as faculty at Singapore University of Technology and Design from 2019 to 2024. Prior to his academic career, he worked at Tencent from 2014 to 2015.
Yiwei Wang
Assistant Professor at University of California, Merced
Dr. Yiwei Wang was an Applied Scientist at Amazon (Seattle) in 2023 and a Postdoc in the UCLA NLP Group in 2024. He obtained his Ph.D. degree from the National University of Singapore in 2023. Currently, he leads the UC Merced NLP Lab, where his team explores cutting-edge approaches to diffusion LLMs, multi-modal reasoning LLMs, and their applications in medicine, advertising, risk detection, signal processing, etc.
Invited Speakers
Kai-Wei Chang
Associate Professor at University of California, Los Angeles
Dr. Kai-Wei Chang is an Associate Professor at UCLA specializing in natural language processing, machine learning, and multimodal reasoning. His research particularly focuses on mathematical reasoning, structured prediction, and developing robust AI systems that can handle complex reasoning tasks.
Junsong Yuan
Professor at University at Buffalo, SUNY
Dr. Junsong Yuan is a Professor at University at Buffalo, specializing in computer vision, pattern recognition, and multimedia analysis. His research encompasses video understanding, human activity recognition, and multimodal learning with applications in surveillance, healthcare, and autonomous systems.
Ziwei Liu
Associate Professor at Nanyang Technological University
Dr. Ziwei Liu is an Associate Professor at NTU and leads the LMMs-Lab initiative. His research focuses on large multimodal models, computer vision, and machine learning. He has made significant contributions to visual understanding, generative models, and multimodal intelligence systems.
Chi Zhang
Assistant Professor at Westlake University
Dr. Chi Zhang is an Assistant Professor at Westlake University, focusing on multimodal AI and embodied intelligence. His research spans GUI automation, visual reasoning, and human-computer interaction, with particular expertise in developing intelligent agents that can understand and interact with digital interfaces.
Yuanhan Zhang
Ph.D. Student at Nanyang Technological University
Yuanhan Zhang's research interests lie in computer vision and deep learning. His work focuses on adapting foundation models—from vision to multimodal—for real-world applications, including benchmarking model performance and adapting models through parameter-efficient tuning, in-context learning, and instruction tuning.