Perceiving Systems

Research Overview

Behavior generation

Given captured human behavior, our goal is to model it such that we can generate it. For example, TEMOS is a text-conditioned generative model that leverages a variational autoencoder and transformer embeddings of text and motion. TEMOS is a foundation for TMR, which embeds text and motion in latent spaces, enabling text-based queries of large mocap libraries without manual labelling. SAMP generates human movement conditioned on a scene, while MIME does the opposite: it generates a full 3D scene from human movement. GOAL and GraspXL generate hand-object grasping, while EMOTE and AMUSE generate full body motion from audio.
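The text-based querying described for TMR can be illustrated with a minimal sketch. It assumes that a text query and each motion clip have already been embedded as vectors in a common space, and that clips are ranked by cosine similarity; the function names, the three-dimensional toy vectors, and the clip names are hypothetical and are not taken from TMR itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, motion_embeddings, top_k=2):
    """Return the names of the top_k motions most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), name)
        for name, emb in motion_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy stand-ins for learned embeddings (hypothetical values).
library = {
    "walk_forward": [0.9, 0.1, 0.0],
    "sit_down": [0.0, 1.0, 0.1],
    "jump": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of the text "a person walks"

print(retrieve(query, library))
```

In a real system the embeddings would come from trained text and motion encoders, and no clip would need a manual label to be found by a query.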

Behavior understanding

Large multi-modal vision-language models (VLMs) understand a lot about humans. Can we leverage this for 3D behavior capture, and can we train these models to understand 3D humans? ChatPose is the first method that fine-tunes a VLM to understand 3D human pose. When asked about human pose, the method is trained to output a special pose token; the embedding of this token is then decoded by a simple projection layer to produce continuous SMPL pose and shape parameters. We think this is the future. ChatPose is able to reason beyond the image about what 3D human pose means, and it can answer questions about what poses people might adopt in the future. It combines, for the first time, the broad general knowledge of large models with the 3D world of humans.
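The decoding step described above, in which a pose-token embedding passes through a projection layer to yield SMPL parameters, can be sketched roughly as follows. This is an illustration under stated assumptions, not the ChatPose implementation: the projection is a single linear map with random placeholder weights, the hidden size of 16 is hypothetical, and the output is split into the 72 axis-angle pose values (24 joints × 3) and 10 shape coefficients commonly used with SMPL.

```python
import random

POSE_DIM = 72   # SMPL pose: 24 joints x 3 axis-angle values
SHAPE_DIM = 10  # SMPL shape: 10 coefficients (a common choice)


def make_projection(hidden_dim, out_dim, seed=0):
    """Build a linear layer with random placeholder weights (not trained)."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
        for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim
    return weights, bias


def decode_pose_token(embedding, projection):
    """Map a pose-token embedding to (pose, shape) parameter lists."""
    weights, bias = projection
    out = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return out[:POSE_DIM], out[POSE_DIM:]


hidden_dim = 16  # hypothetical; real VLM hidden sizes are far larger
projection = make_projection(hidden_dim, POSE_DIM + SHAPE_DIM)
token_embedding = [0.1] * hidden_dim  # stand-in for the VLM's pose-token state

pose, shape = decode_pose_token(token_embedding, projection)
print(len(pose), len(shape))
```

The point of the design is that the language model itself only needs to emit one extra token; the continuous body parameters are recovered from that token's embedding outside the text vocabulary.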

Broader Impact

While our focus is on capturing, generating and understanding 3D humans, our foundational work contributes to other disciplines. For example, we continue to push the state of the art in animal shape and motion capture (PFERD, VAREN, BARC, BITE). We collaborate with doctors, biomechanics researchers, and psychologists so that our work has an impact outside vision and graphics: predicting the inside of the body from the outside (SKEL, OSSO, HIT), treating eating disorders, and designing custom surgical plates to heal broken limbs. And, while we focus on behavior, the modeling of human appearance is also important. During the reporting period we have developed neural models of clothing (HOOD, ContourCraft, GaussianGarments), hair (HAAR, MonoHair, GaussianHaircut), and overall appearance (TeCH, TECA, ECON).

To have a wide impact on society, we make software and data available for research purposes. Our Software and Data Teams help us acquire the best possible data and share code widely; they are critical to our success. We also actively patent and license our technology. For the first time during this reporting period, several papers (TokenHMR, ChatPose, ChatHuman) were collaborations between MPI and Meshcapade (under the terms of a cooperation agreement); a joint patent was filed for ChatHuman. Cooperations like this are increasingly important because they provide the scale necessary to be competitive today.