The Dawn of Video Generation: Preliminary Explorations with SORA-like Models (Talk)
High-quality video generation—encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation—plays a pivotal role in content creation and world simulation. While several DiT-based models have advanced rapidly in the past year, a thorough exploration of their capabilities, limitations, and alignment with human preferences remains incomplete. In this talk, I will present recent advancements in SORA-like T2V, I2V, and V2V models and products, bridging the gap between academic research and industry applications. Through live demonstrations and comparative analyses, I will highlight key insights across four core dimensions: i) Impact on vertical-domain applications, such as human-centric animation and robotics; ii) Core capabilities, including text alignment, motion diversity, composition, and stability; iii) Performance across ten real-world scenarios, showcasing practical utility; iv) Future potential, including usage scenarios, challenges, and directions for further research. Additionally, I will discuss recent advancements in automatic evaluation methods for generated videos, leveraging multimodal large language models to better adapt to the rapid development of generative and understanding models.
Biography: Ailing is a technical staff member at Anuttacon, exploring AI games that evolve alongside players. Previously, she spent three wonderful years at Tencent AI Lab and the International Digital Economy Academy (IDEA), leading the human-centric perception and generation research team. She obtained her Ph.D. from the Department of Computer Science and Engineering at the Chinese University of Hong Kong, supervised by Prof. Qiang Xu, and was a visiting scholar at the Robotics Institute, Carnegie Mellon University.