Research Overview

Our goal is to create 3D virtual humans that can see, move, and behave just like real people by capturing 3D human behavior at scale from video and using this data to train 3D human foundation models.
Humans have evolved to interact with humans, not computers. Can we make our interactions with computers more human-like? To answer this question, we capture human behavior at scale and use this data to train digital humans that see us, understand us, and behave like us. We call this the Human Foundation Agent. We believe that, in the near future, (1) computers will see us, (2) AI will be embodied in the form of digital humans, and (3) this will fundamentally change our relationship with machines. The research of the Perceiving Systems department is structured to achieve these goals.
The surprise of the last three years is the outsized impact that language has had on solving long-standing problems in vision. At the beginning of the reporting period it was already clear that nearly all our work would combine vision and language in one way or another. Today, large models are central to achieving our goals. For example, we exploit the physical-world knowledge implicit in LLMs and video diffusion models for problems as diverse as inverse graphics programming [], image relighting, human-object interaction reasoning, and bounded video generation []. This is a trend that will only accelerate.
Behavior capture
In training AI systems, scale matters, in particular the scale (and quality) of the data. Human behavior is typically captured in a motion capture (mocap) studio. In Perceiving Systems, we have built, and continue to expand, the world's largest mocap dataset (AMASS), which enabled the field of generative human motion modeling to emerge. AMASS, however, is not enough: studio data is neither realistic nor scalable. Consequently, our goal is to capture human behavior from video at massive scale, along with rich contextual information about the scene and people's interactions within it. To that end, we are pushing the state of the art in 3D human capture from video. Key innovations include:
Humans in context:
Most methods that regress human pose and shape (HPS) from an image do so from a tightly cropped image region around the person. This means that the network cannot exploit scene context, which makes it hard to place people in the 3D scene. With BEV [] and TRACE [] we introduced methods that exploit the full image and integrate detection with pose estimation and tracking, resulting in improved 3D reasoning.
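The geometry behind this is easy to sketch: a crop-based network can only predict a weak-perspective camera relative to the crop, and recovering where the person sits in 3D also requires the crop's location in the full image. The sketch below follows a conversion that is common in the HPS literature; the function name and parametrization are ours for illustration, not the BEV/TRACE implementation:

```python
import numpy as np

def crop_cam_to_full_translation(s, tx, ty, bbox_center, bbox_size,
                                 focal, img_center):
    """Convert a weak-perspective camera (s, tx, ty), predicted inside a
    square person crop of side `bbox_size` pixels, into a 3D root
    translation in the full-image camera frame (focal length in pixels).
    Illustrative helper, not a method from any of the cited papers."""
    cx, cy = bbox_center                # crop center in full-image pixels
    tz = 2.0 * focal / (s * bbox_size)  # depth follows from the crop scale
    # (tx, ty) alone cannot recover X and Y: the offset of the crop from
    # the image center is also needed, which is exactly the context that
    # a tightly cropped network never sees.
    X = tx + 2.0 * (cx - img_center[0]) / (s * bbox_size)
    Y = ty + 2.0 * (cy - img_center[1]) / (s * bbox_size)
    return np.array([X, Y, tz])

# Hypothetical values: a person detected at (850, 400) in a 1920x1080 image.
t = crop_cam_to_full_translation(1.1, 0.02, -0.01, (850, 400), 256,
                                 1000.0, (960, 540))
```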
Humans in world coordinates:
The key problem with prior methods is that they estimate humans in camera coordinates rather than world coordinates. To estimate humans in world coordinates, we need information about the camera, such as its focal length. In scenes containing human motion, however, traditional camera calibration methods can fail. We observe that humans themselves can serve as a form of “calibration object”. WHAM [] exploits human motion over time to estimate the camera's angular velocity and the 3D human pose in a global coordinate system with minimal foot sliding. WHAM is the first video-based method to outperform all single-frame and video methods. With CameraHMR [] we train a method to regress the camera field of view from a single image of a person and integrate this into our training and inference pipelines, resulting in state-of-the-art accuracy for single-image HPS.
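The pinhole relation that makes the field of view so consequential is compact enough to state in code. Below is a minimal sketch (the helper names are ours for illustration, not CameraHMR's API):

```python
import numpy as np

def focal_from_fov(fov_deg, img_height):
    """Pinhole relation: a vertical field of view and the image height
    determine the focal length in pixels."""
    return (img_height / 2.0) / np.tan(np.radians(fov_deg) / 2.0)

def project(points_3d, focal, img_center):
    """Perspective projection of Nx3 camera-frame points to pixels."""
    uv = points_3d[:, :2] / points_3d[:, 2:3]
    return focal * uv + np.asarray(img_center)

# The same 2D keypoints imply very different 3D bodies under different
# focal lengths, e.g. for a 720-pixel-tall image:
f_wide = focal_from_fov(75.0, 720)   # ~469 px
f_tele = focal_from_fov(30.0, 720)   # ~1344 px
```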
Faces:
The face is critical for communication, and our methods capture 3D emotional content (EMOCA []), metrically accurate faces (MICA []), and facial details (SMIRK []), and perform precise 3D tracking from video (SPARK []).
Contact:
Human-object, human-scene, and human-human contact are foundational for understanding and modeling human behavior. We introduced datasets (DAMON [], RICH [], INTERCAP [], HOT [], ARCTIC []) that enable the study of contact in 3D and 2D, as well as methods that reason about contact from images (DECO []) and exploit contact in inferring 3D humans and scenes (MOVER []).
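To make the notion of 3D contact concrete: a simple formulation labels each body vertex as in contact when it lies within a small distance of the scene surface. The sketch below is a simplified stand-in (vertex clouds instead of meshes, and an assumed 2 cm threshold), not the DECO or MOVER implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def vertex_contact_labels(body_verts, scene_verts, thresh=0.02):
    """Label each body vertex as 'in contact' if it lies within `thresh`
    meters of the scene, here approximated by the scene's vertex cloud.
    Returns a boolean array of shape (N,)."""
    tree = cKDTree(scene_verts)
    dist, _ = tree.query(body_verts, k=1)
    return dist < thresh

# Example with random stand-in geometry; a SMPL body has 6890 vertices.
body = np.random.rand(6890, 3)
scene = np.random.rand(50000, 3)
labels = vertex_contact_labels(body, scene)
```

Labels like these can supervise a per-vertex contact estimator from images or act as constraints when jointly fitting humans and scenes.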
Synthetic data:
To enable accurate behavior capture, we created BEDLAM [], the first synthetic training dataset that enables state-of-the-art HPS results without any real training data. Groups worldwide are using BEDLAM and have verified that it is the single most important dataset in the field for achieving accurate results. We are hard at work on a significantly expanded version.
Behavior generation
Given captured human behavior, our goal is to model it such that we can generate it. For example, TEMOS [] is a text-conditioned generative model that leverages a variational autoencoder and transformer embeddings of text and motion. TEMOS is a foundation for TMR [], which embeds text and motion in a shared latent space, enabling text-based queries of large mocap libraries without manual labeling. SAMP generates human movement conditioned on a scene, while MIME [] does the opposite: it generates a full 3D scene from human movement. GOAL [] and GraspXL [] generate hand-object grasping, while EMOTE [] and AMUSE [] generate full-body motion from audio.
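Once text and motion share an embedding space, retrieval reduces to nearest-neighbor search. A minimal sketch, assuming trained TMR-style encoders that are not shown here:

```python
import numpy as np

def retrieve(text_emb, motion_embs, k=5):
    """Rank a mocap library by cosine similarity between one text
    embedding (D,) and precomputed motion embeddings (N, D)."""
    t = text_emb / np.linalg.norm(text_emb)
    m = motion_embs / np.linalg.norm(motion_embs, axis=1, keepdims=True)
    scores = m @ t                   # cosine similarity per motion
    order = np.argsort(-scores)[:k]  # indices of the top-k motions
    return order, scores[order]

# Hypothetical usage with encoders mapping both modalities to one space:
# text_emb = text_encoder("a person jumps over an obstacle")
# motion_embs = np.stack([motion_encoder(m) for m in mocap_library])
```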
Behavior understanding
Large multi-modal vision-language models (VLMs) understand a lot about humans. Can we leverage this for 3D behavior capture, and can we train these models to understand 3D humans? ChatPose [] is the first method that fine-tunes a VLM to understand 3D human pose. When asked about human pose, the model is trained to output a special pose token; the embedding of this token is then decoded by a simple projection layer into continuous SMPL pose and shape parameters. We think this is the future. ChatPose is able to reason beyond the image about what a 3D human pose means, and it can answer questions about what poses people might adopt in the future. It combines, for the first time, the broad general knowledge of large models with the 3D world of humans.
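A minimal sketch of the pose-token idea follows; the hidden size and the SMPL parametrization below are assumptions for illustration, not the exact ChatPose architecture:

```python
import torch
import torch.nn as nn

class PoseTokenHead(nn.Module):
    """Decode the hidden state of a special pose token into SMPL
    parameters. Assumed sizes: a 4096-dim VLM hidden state, 72 axis-angle
    pose parameters (24 joints x 3), and 10 shape coefficients."""
    def __init__(self, hidden_size=4096, n_pose=72, n_shape=10):
        super().__init__()
        self.proj = nn.Linear(hidden_size, n_pose + n_shape)
        self.n_pose = n_pose

    def forward(self, pose_token_embedding):
        out = self.proj(pose_token_embedding)
        return out[..., :self.n_pose], out[..., self.n_pose:]

# During fine-tuning, the VLM learns when to emit the pose token, and
# the head learns to regress continuous parameters from its embedding.
head = PoseTokenHead()
pose, betas = head(torch.randn(1, 4096))
```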
Broader impact
While our focus is on capturing, generating, and understanding 3D humans, our foundational work contributes to other disciplines. For example, we continue to push the state of the art in animal shape and motion capture (PFERD [], VAREN [], BARC [], BITE []). We collaborate with doctors, biomechanics researchers, and psychologists so that our work has an impact outside vision and graphics: predicting the inside of the body from the outside (SKEL [], OSSO [], HIT []), treating eating disorders [][][][], and designing custom surgical plates [] to heal broken limbs. And, while we focus on behavior, the modeling of human appearance is also important. During the reporting period we have developed neural models of clothing (HOOD [], ContourCraft [], GaussianGarments []), hair (HAAR [], MonoHair [], GaussianHaircut []), and overall appearance (TeCH [], TECA [], ECON []).
To have a wide impact on society, we make software and data available for research purposes. Our Software and Data Teams help acquire the best possible data and share code widely; they are critical to our success. We also actively patent and license our technology. For the first time during this reporting period, several papers (TokenHMR [], ChatPose [], ChatHuman []) were collaborations between MPI and Meshcapade (under the terms of a cooperation agreement); a joint patent was filed for ChatHuman. Cooperations like this are increasingly important because they provide the scale necessary to be competitive today.