Perceiving Systems Conference Paper 2024

Generating Human Interaction Motions in Scenes with Text Control

TeSMo

We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions.
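
The two-stage recipe described in the abstract (scene-agnostic pre-training, then fine-tuning of an added scene-aware component) can be illustrated with a minimal PyTorch sketch. All module names, feature dimensions, and the simple MLP denoiser below are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two-stage training scheme from the abstract:
# (1) pre-train a scene-agnostic text-to-motion denoiser,
# (2) fine-tune with an added scene-aware branch on scene-augmented data.
import torch
import torch.nn as nn

class TextToMotionDenoiser(nn.Module):
    """Predicts clean motion from noisy motion, a diffusion timestep,
    and a text embedding; optionally conditioned on scene features."""
    def __init__(self, motion_dim=135, text_dim=512, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.text_proj = nn.Linear(text_dim, hidden)
        self.backbone = nn.Sequential(
            nn.Linear(motion_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim))
        # Scene-aware branch added for stage 2; zero-initialized so
        # fine-tuning starts from the pre-trained scene-agnostic behavior.
        self.scene_proj = nn.Linear(64, hidden)  # 64 = assumed scene-feature dim
        nn.init.zeros_(self.scene_proj.weight)
        nn.init.zeros_(self.scene_proj.bias)

    def forward(self, x_t, t, text_emb, scene_emb=None):
        cond = self.time_embed(t[:, None].float()) + self.text_proj(text_emb)
        if scene_emb is not None:                 # stage 2: inject scene info
            cond = cond + self.scene_proj(scene_emb)
        return self.backbone(torch.cat([x_t, cond], dim=-1))

def diffusion_loss(model, x0, text_emb, scene_emb=None, n_steps=1000):
    """Standard denoising objective: corrupt x0, predict it back."""
    t = torch.randint(0, n_steps, (x0.shape[0],))
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / n_steps)[:, None] ** 2
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    return ((model(x_t, t, text_emb, scene_emb) - x0) ** 2).mean()

# Stage 1: scene-agnostic pre-training on large-scale mocap + text data.
model = TextToMotionDenoiser()
x0 = torch.randn(8, 135)      # dummy motion batch (per-frame pose features)
text = torch.randn(8, 512)    # dummy text embeddings (e.g., from a text encoder)
diffusion_loss(model, x0, text).backward()

# Stage 2: fine-tune with scene features (floor map, object shape) on
# scene-augmented data; the zero-initialized branch learns the scene term.
scene = torch.randn(8, 64)    # dummy scene encoding
diffusion_loss(model, x0, text, scene_emb=scene)

The zero-initialized scene branch is one common way to add conditioning without disturbing a pre-trained model at the start of fine-tuning; whether TeSMo uses this exact mechanism is an assumption here.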

Author(s): Yi, Hongwei and Thies, Justus and Black, Michael J. and Peng, Xue Bin and Rempe, Davis
Book Title: European Conference on Computer Vision (ECCV 2024)
Pages: 246--263
Year: 2024
Month: September
Series: LNCS
Publisher: Springer Cham
BibTeX Type: Conference Paper (inproceedings)
DOI: https://doi.org/10.1007/978-3-031-73235-5_14
Event Place: Milan, Italy
State: Published
URL: https://research.nvidia.com/labs/toronto-ai/tesmo/

BibTeX

@inproceedings{tesmo:2024,
  title = {Generating Human Interaction Motions in Scenes with Text Control},
  booktitle = {European Conference on Computer Vision (ECCV 2024)},
  abstract = {We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions.},
  pages = {246--263},
  series = {LNCS},
  publisher = {Springer Cham},
  month = sep,
  year = {2024},
  author = {Yi, Hongwei and Thies, Justus and Black, Michael J. and Peng, Xue Bin and Rempe, Davis},
  url = {https://research.nvidia.com/labs/toronto-ai/tesmo/},
  doi = {10.1007/978-3-031-73235-5_14}
}