From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning
CVPR 2026 Tutorial
TBD (GMT-7, Denver Time Zone)
TBD, Denver, Colorado
Introduction
World models are rapidly reshaping artificial intelligence, evolving from systems that passively perceive the world into engines capable of simulating, reasoning, and planning within it. This tutorial examines how recent advances in generative modeling, self-supervised learning, and multimodal architectures are enabling machines to move beyond recognition and prediction toward mental simulation, counterfactual reasoning, and decision making. We will explore the foundations of world models, approaches for learning dynamics from visual and multimodal data, and the integration of planning and reasoning. The tutorial highlights connections between video generation, diffusion models, discrete representations, and embodied AI, while addressing key challenges such as grounding, causality, physical consistency, and evaluation. Designed for researchers, practitioners, and students, this session provides both conceptual insights and practical perspectives on building AI systems that reason about environments rather than merely interpreting them.
World Model Tutorial Schedule
| Time | Session | Speaker |
|---|---|---|
| 8:30 - 8:35 | Opening Remarks: Motivation and Overview (Abstract: TBD) | |
| 8:35 - 9:10 | Invited Talk: TBD (Abstract: TBD) | |
| 9:10 - 9:35 | TBD (Abstract: TBD) | |
| 9:35 - 10:10 | Invited Talk: TBD (Abstract: TBD) | |
| 10:10 - 10:45 | Invited Talk: TBD (Abstract: TBD) | |
| 10:45 - 11:20 | Invited Talk: TBD (Abstract: TBD) | |
| 11:20 - 11:55 | Invited Talk: TBD (Abstract: TBD) | |
| 11:55 - 12:00 | Closing Remarks (Abstract: TBD) | |
Organizers
Yujun Cai
Staff Research Scientist at Ant Group USA
Yujun Cai is a Lecturer at the University of Queensland in Australia and a Staff Research Scientist at Ant Group USA. Before that, she was a Senior Research Scientist at Meta Reality Labs. Her research focuses on multi-modal human perception, vision-language models, and natural language processing. She obtained her Ph.D. degree from Nanyang Technological University in Singapore.
Jianfei Cai
Professor at Monash University
Jianfei Cai is a Professor at the Faculty of IT, Monash University, where he was the inaugural Head of the Data Science & AI Department. Before that, he was Head of the Visual and Interactive Computing Division and Head of the Computer Communications Division at Nanyang Technological University (NTU). His major research interests include visual computing, computer vision, and multimedia. He is a co-recipient of paper awards at ACCV, ICCM, IEEE ICIP, and MMSP, and a winner of Monash FIT’s Dean’s Researcher of the Year Award and Dean’s Award for Excellence in Graduate Research Supervision. He is currently on the editorial boards of TPAMI and IJCV. He has served as an Associate Editor for IEEE T-IP, T-MM, and T-CSVT, and as Senior/Area Chair for CVPR, ICCV, ECCV, ACM Multimedia, IJCAI, ICME, ICIP, and ISCAS. He was the Chair of IEEE CAS VSPC-TC during 2016-2018. He also served as the lead TPC Chair for IEEE ICME 2012, the best paper award committee chair/co-chair for IEEE T-MM 2020/2019, and the lead General Chair for ACM Multimedia 2024. He is a Fellow of IEEE.
Yiwei Wang
Assistant Professor at University of California, Merced
Dr. Yiwei Wang was an Applied Scientist at Amazon (Seattle) in 2023 and a Postdoc in the UCLA NLP Group in 2024. He obtained his Ph.D. degree from the National University of Singapore in 2023. He currently leads the UC Merced NLP Lab, where his team works on diffusion LLMs, reasoning multi-modal LLMs, and their applications in medicine, advertising, risk detection, signal processing, and other areas.