Scene Understanding through Space and Time: Novel Priors for 3D Reconstruction and Physical Dynamics (Talk)
This talk explores novel approaches to understanding and reconstructing scenes across both spatial and temporal dimensions. Extrapolating a scene from limited observations requires generative priors to synthesize 3D content in unobserved regions. The existing 3D generative literature relies on 3D-aware image or video diffusion models, which require pretraining on million-scale real and synthetic 3D datasets. To address this challenge, we present low-cost generative techniques built on 2D diffusion priors that require only small-scale fine-tuning on multiview data. These fine-tuned priors can rectify novel-view renders and depth maps by inpainting missing details and removing artifacts arising from 3D representations fitted to sparse inputs. Through autoregressive fusion of multiple novel views, we build multiview-consistent 3D representations that perform competitively with state-of-the-art methods on complex 360° scenes from the MipNeRF360 dataset. Building on this foundation of static scene understanding, we extend our investigation to dynamic scenes, where physical laws govern object interactions. While current video diffusion models such as OpenAI's Sora can generate visually compelling sequences, they often fail to capture underlying physical constraints due to their purely data-driven training objectives; as a result, the generated videos often lack physical plausibility. To address this limitation, we introduce a 4D dataset with per-frame force annotations that makes explicit the physical interactions driving object motion in scenes. Our physical simulator can both animate objects in static 3D scenes and record particle-level forces at each timestep. This dataset aims to enable the development of physics-informed video diffusion priors, marking a step toward more physically accurate world simulators.
Biography: Soumava Paul is currently a research intern with the Astra Vision Group at the Inria Paris Center, working on physics-based interpretation of 4D scene representations. He also collaborates remotely with the CCVL group at Johns Hopkins University on pose-free scene reconstruction from single or few images. He is broadly interested in 3D and 4D generative vision. He completed his MSc in Visual Computing at Saarland University, Germany, in May 2024, writing his Master's thesis at D2, MPI-INF, on sparse-view 3D reconstruction with diffusion priors under the supervision of Prof. Bernt Schiele. Previously, he obtained a Bachelor's degree in Electrical Engineering and Computer Science from the Indian Institute of Technology, Kharagpur, in 2020. His earlier research spanned topics in zero-shot learning, image retrieval, domain generalization, and music information retrieval. More info is available at https://mvp18.github.io.
Scene Understanding
3D Reconstruction
Physical Dynamics