
SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation


Our goal is to synthesize 3D human motions given textual inputs describing multiple simultaneous actions, for example ‘waving hand’ while ‘walking’ at the same time. We refer to generating such simultaneous movements as performing ‘spatial compositions’. In contrast to ‘temporal compositions’, which seek to transition from one action to another in a sequence, spatial composition requires understanding which body parts are involved in which action. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as “what parts of the body are moving when someone is doing the action <action name>?”. Given this action-part mapping, we automatically create new training data by artificially combining body parts from multiple text-motion pairs. We extend previous work on text-to-motion synthesis to train on spatial compositions, and introduce SINC (“SImultaneous actioN Compositions for 3D human motions”). We experimentally validate that our additional GPT-guided data helps to better learn compositionality compared to training only on existing real data of simultaneous actions, which is limited in quantity.
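
The abstract describes two programmable steps: querying a language model for the body parts an action involves, and splicing those parts across text-motion pairs to manufacture synthetic compositional training data. The sketch below is a minimal illustration of that idea, not the authors' code: query_body_parts stands in for the GPT-3 call with canned answers, PART_TO_JOINTS is an assumed coarse partition over SMPL-style joint indices (the paper's exact part list and indices may differ), and the random arrays are placeholders for real motion clips.

  # Minimal sketch (assumptions flagged inline) of the pipeline the abstract
  # describes: (1) ask a language model which body parts an action involves,
  # (2) splice those parts from one motion into another to create a synthetic
  # 'spatial composition' training pair.
  import numpy as np

  # Hypothetical coarse body partition over SMPL-style joint indices;
  # the part list and indices used in SINC may differ.
  PART_TO_JOINTS = {
      "left arm":  [13, 16, 18, 20],
      "right arm": [14, 17, 19, 21],
      "left leg":  [1, 4, 7, 10],
      "right leg": [2, 5, 8, 11],
      "torso":     [0, 3, 6, 9],
      "head":      [12, 15],
  }

  def query_body_parts(action: str) -> list[str]:
      """Stand-in for the GPT-3 prompt from the paper, e.g. 'what parts of
      the body are moving when someone is doing the action <action>?'.
      Canned answers here; the real system parses the model's reply."""
      canned = {"waving hand": ["right arm"],
                "walking": ["left leg", "right leg", "torso"]}
      return canned.get(action, list(PART_TO_JOINTS))

  def compose(motion_a, motion_b, parts_b):
      """Copy the joints of parts_b from motion_b into motion_a.
      Both motions are (frames, joints, dof) arrays of equal length."""
      out = motion_a.copy()
      for part in parts_b:
          out[:, PART_TO_JOINTS[part]] = motion_b[:, PART_TO_JOINTS[part]]
      return out

  # Toy usage: 60 frames, 22 joints, 3-DoF rotations per joint.
  walk = np.random.randn(60, 22, 3)   # placeholder for a real 'walking' clip
  wave = np.random.randn(60, 22, 3)   # placeholder for a 'waving hand' clip
  combined = compose(walk, wave, query_body_parts("waving hand"))
  label = "walk while waving hand"    # paired text for the synthetic sample

The appeal of this construction is that the combinatorics are free: any two clips whose GPT-derived part sets do not conflict can be spliced, which is how the synthetic data outgrows the limited real data of simultaneous actions.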

Author(s): Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, Gül
Book Title: Proc. International Conference on Computer Vision (ICCV)
Pages: 9984--9995
Year: 2023
Month: October
Bibtex Type: Conference Paper (inproceedings)
Event Name: International Conference on Computer Vision 2023
Event Place: Paris, France
State: Published

BibTeX

@inproceedings{SINC:ICCV:2023,
  title = {{SINC}: Spatial Composition of {3D} Human Motions for Simultaneous Action Generation},
  booktitle = {Proc. International Conference on Computer Vision (ICCV)},
  abstract = {Our goal is to synthesize 3D human motions given textual inputs describing multiple simultaneous actions, for example ‘waving hand’ while ‘walking’ at the same time. We refer to generating such simultaneous movements as performing ‘spatial compositions’. In contrast to ‘temporal compositions’, which seek to transition from one action to another in a sequence, spatial composition requires understanding which body parts are involved in which action. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as “what parts of the body are moving when someone is doing the action <action name>?”. Given this action-part mapping, we automatically create new training data by artificially combining body parts from multiple text-motion pairs. We extend previous work on text-to-motion synthesis to train on spatial compositions, and introduce SINC (“SImultaneous actioN Compositions for 3D human motions”). We experimentally validate that our additional GPT-guided data helps to better learn compositionality compared to training only on existing real data of simultaneous actions, which is limited in quantity.},
  pages = {9984--9995},
  month = oct,
  year = {2023},
  slug = {sinc-iccv-2023-96f4db44-d7e5-4adc-a93f-4fb8987ec6da},
  author = {Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, G\"{u}l},
  month_numeric = {10}
}