Towards Comprehensive Reasoning in Vision-Language Models
ICCV 2025 Tutorial
10/30/2025 9:00-17:00 (GMT-5)
Room 204
Introduction
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.
Vision-Language Reasoning Tutorial Schedule
| Time | Programme |
|---|---|
| 9:00 - 9:30 | **Introduction to Vision-Language Reasoning.** Comprehensive introduction covering definitions, a taxonomy of reasoning types, key differences between perception and reasoning in Vision-Language Models (VLMs), and connections to reasoning-focused LLMs like O1 and DeepSeek-R1. This session establishes the foundational concepts and scope of vision-language reasoning. |
| 9:30 - 10:15 | **Enhancing Reasoning Capabilities in VLMs.** Deep dive into advanced techniques for improving reasoning in vision-language models, including chain-of-thought prompting, Visual Chain-of-Thought (CoT), reasoning engines, Visual Program Distillation (VPD), and interactive reasoning approaches. Practical examples and implementation strategies will be discussed. |
| 10:15 - 10:45 | Coffee Break & Demonstrations |
| 10:45 - 11:30 | **Dataset Design & Model Architectures.** Exploration of reasoning-focused benchmarks (MMBench, MMMU) and their design principles. Overview of key multimodal architectures including Flamingo, BLIP-2, and LLaVA, with a focus on object-level grounding and spatial feature mixing techniques that enable better reasoning capabilities. |
| 11:30 - 12:15 | **Future Directions and Open Challenges.** Discussion of emerging challenges and opportunities in vision-language reasoning, including compositional generalization, explainability requirements, multimodal reasoning across video/3D/audio modalities, integration of external knowledge bases, and approaches to improve model efficiency while maintaining reasoning capabilities. |
| 12:15 - 12:30 | Q&A and Closing Remarks |
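As a taste of the chain-of-thought prompting techniques covered in the 9:30 session, the sketch below shows one common pattern: wrapping a visual question in a template that asks the model to enumerate objects and spatial relations before answering. This is a minimal, model-agnostic illustration; `build_visual_cot_prompt` is a hypothetical helper written for this tutorial, not part of any VLM library, and the template wording is only one of many reasonable choices.

```python
def build_visual_cot_prompt(question: str) -> str:
    """Wrap a visual question in a simple chain-of-thought template.

    The template instructs the model to ground its answer in explicit
    perception steps (objects, attributes, spatial relations) before
    committing to a final answer.
    """
    steps = [
        "1. List the objects and attributes visible in the image.",
        "2. Describe the spatial relations between the relevant objects.",
        "3. Combine these observations to answer the question.",
    ]
    return (
        "You are given an image.\n"
        f"Question: {question}\n"
        "Reason step by step before answering:\n"
        + "\n".join(steps)
        + "\nFinal answer:"
    )


# Example: the resulting string is passed, together with the image,
# to whatever VLM chat interface you are using.
prompt = build_visual_cot_prompt("What color is the mug left of the laptop?")
print(prompt)
```

Hands-on variants of this pattern, including multi-turn and program-distillation versions, will be shown during the demonstrations.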