Research Overview

Our goal is to create 3D virtual humans that can see, move, and behave just like real people by capturing 3D human behavior at scale from video and using this data to train 3D human foundation models.
Humans have evolved to interact with humans, not computers. Can we make our interactions with computers more human-like? To answer this question, we capture human behavior at scale and use this data to train digital humans that see us, understand us, and behave like us. We call this the Human Foundation Agent. We believe that, in the near future, (1) computers will see us, (2) AI will be embodied in the form of digital humans, and (3) this will fundamentally change our relationship with machines. The research of the Perceiving Systems department is structured to achieve these goals.
The surprise of the last three years is the outsized impact that language has had on solving long-standing problems in vision. At the beginning of the reporting period it was already clear that nearly all our work would combine vision and language in one way or another. Today, large models are central to achieving our goals. For example, we exploit the physical-world knowledge implicit in LLMs and video diffusion models for problems as diverse as inverse graphics programming [], image relighting, human-object interaction reasoning, and bounded video generation []. This is a trend that will only accelerate.
Behavior capture
In training AI systems, scale matters, in particular the scale (and quality) of the data. Human behavior is typically captured in a motion capture (mocap) studio. In Perceiving Systems, we have built, and continue to expand, the world's largest mocap dataset (AMASS), which enabled the field of generative human motion modeling to emerge. AMASS, however, is not enough: studio data is neither realistic nor scalable. Consequently, our goal is to capture human behavior from video at massive scale, along with rich contextual information about the scene and people's interactions within it. To that end, we are pushing the state of the art in 3D human capture from video. Key innovations include:
Humans in context:
Most methods that regress human pose and shape (HPS) from an image do so from a tightly cropped image region around the person. This means that the network cannot exploit scene context, which makes it hard to place people in the 3D scene. With BEV [] and TRACE [] we introduced methods that exploit the full image and integrate detection with pose estimation and tracking, resulting in improved 3D reasoning.
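The geometry behind this is easy to sketch: a crop-based network can only predict a weak-perspective camera relative to the crop, and recovering where the person sits in 3D also requires the crop's location in the full image. The sketch below follows a conversion that is common in the HPS literature; the function name and parametrization are ours for illustration, not the BEV/TRACE implementation:

```python
import numpy as np

def crop_cam_to_full_translation(s, tx, ty, bbox_center, bbox_size,
                                 focal, img_center):
    """Convert a weak-perspective camera (s, tx, ty), predicted inside a
    square person crop of side `bbox_size` pixels, into a 3D root
    translation in the full-image camera frame (focal length in pixels).
    Illustrative helper, not a method from any of the cited papers."""
    cx, cy = bbox_center                # crop center in full-image pixels
    tz = 2.0 * focal / (s * bbox_size)  # depth follows from the crop scale
    # (tx, ty) alone cannot recover X and Y: the offset of the crop from
    # the image center is also needed, which is exactly the context that
    # a tightly cropped network never sees.
    X = tx + 2.0 * (cx - img_center[0]) / (s * bbox_size)
    Y = ty + 2.0 * (cy - img_center[1]) / (s * bbox_size)
    return np.array([X, Y, tz])

# Hypothetical values: a person detected at (850, 400) in a 1920x1080 image.
t = crop_cam_to_full_translation(1.1, 0.02, -0.01, (850, 400), 256,
                                 1000.0, (960, 540))
```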
Humans in world coordinates:
The key problem with prior methods is that they estimate humans in camera coordinates rather than world coordinates. To estimate humans in world coordinates, we need information about the camera, such as its focal length. In scenes containing human motion, however, traditional camera calibration methods can fail. We observe that humans themselves can serve as a form of “calibration object”. WHAM [] exploits human motion over time to estimate the camera's angular velocity and the 3D human pose in a global coordinate system with minimal foot sliding. WHAM is the first video-based method to outperform all single-frame and video methods. With CameraHMR [] we train a method to regress the camera field of view from a single image of a person and integrate this into our training and inference pipelines, resulting in state-of-the-art accuracy for single-image HPS.
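The pinhole relation that makes the field of view so consequential is compact enough to state in code. Below is a minimal sketch (the helper names are ours for illustration, not CameraHMR's API):

```python
import numpy as np

def focal_from_fov(fov_deg, img_height):
    """Pinhole relation: a vertical field of view and the image height
    determine the focal length in pixels."""
    return (img_height / 2.0) / np.tan(np.radians(fov_deg) / 2.0)

def project(points_3d, focal, img_center):
    """Perspective projection of Nx3 camera-frame points to pixels."""
    uv = points_3d[:, :2] / points_3d[:, 2:3]
    return focal * uv + np.asarray(img_center)

# The same 2D keypoints imply very different 3D bodies under different
# focal lengths, e.g. for a 720-pixel-tall image:
f_wide = focal_from_fov(75.0, 720)   # ~469 px
f_tele = focal_from_fov(30.0, 720)   # ~1344 px
```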
Faces:
The face is critical for communication, and our methods capture 3D emotional content (EMOCA []), metrically accurate faces (MICA []), and facial details (SMIRK []), and perform precise 3D tracking from video (SPARK []).
Contact:
Human-object, human-scene, and human-human contact are foundational for understanding and modeling human behavior. We introduced datasets (DAMON [], RICH [], INTERCAP [], HOT [], ARCTIC []) that enable the study of contact in 3D and 2D, as well as methods that reason about contact from images (DECO []) and exploit contact in inferring 3D humans and scenes (MOVER []).
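To make the notion of 3D contact concrete: a simple formulation labels each body vertex as in contact when it lies within a small distance of the scene surface. The sketch below is a simplified stand-in (vertex clouds instead of meshes, and an assumed 2 cm threshold), not the DECO or MOVER implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def vertex_contact_labels(body_verts, scene_verts, thresh=0.02):
    """Label each body vertex as 'in contact' if it lies within `thresh`
    meters of the scene, here approximated by the scene's vertex cloud.
    Returns a boolean array of shape (N,)."""
    tree = cKDTree(scene_verts)
    dist, _ = tree.query(body_verts, k=1)
    return dist < thresh

# Example with random stand-in geometry; a SMPL body has 6890 vertices.
body = np.random.rand(6890, 3)
scene = np.random.rand(50000, 3)
labels = vertex_contact_labels(body, scene)
```

Labels like these can supervise a per-vertex contact estimator from images or act as constraints when jointly fitting humans and scenes.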
Synthetic data:
To enable accurate behavior capture, we created BEDLAM [], the first synthetic training dataset that enables state-of-the-art HPS results without any real training data. Groups worldwide are using BEDLAM and have verified that it is the single most important dataset in the field for achieving accurate results. We are hard at work on a significantly expanded version.
Behavior generation
Given captured human behavior, our goal is to model it such that we can generate it. For example, TEMOS [] is a text-conditioned generative model that leverages a variational autoencoder and transformer embeddings of text and motion. TEMOS is a foundation for TMR [], which embeds text and motion in a shared latent space, enabling text-based queries of large mocap libraries without manual labeling. SAMP generates human movement conditioned on a scene, while MIME [] does the opposite: it generates a full 3D scene from human movement. GOAL [] and GraspXL [] generate hand-object grasping, while EMOTE [] and AMUSE [] generate full-body motion from audio.
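Once text and motion share an embedding space, retrieval reduces to nearest-neighbor search. A minimal sketch, assuming trained TMR-style encoders that are not shown here:

```python
import numpy as np

def retrieve(text_emb, motion_embs, k=5):
    """Rank a mocap library by cosine similarity between one text
    embedding (D,) and precomputed motion embeddings (N, D)."""
    t = text_emb / np.linalg.norm(text_emb)
    m = motion_embs / np.linalg.norm(motion_embs, axis=1, keepdims=True)
    scores = m @ t                   # cosine similarity per motion
    order = np.argsort(-scores)[:k]  # indices of the top-k motions
    return order, scores[order]

# Hypothetical usage with encoders mapping both modalities to one space:
# text_emb = text_encoder("a person jumps over an obstacle")
# motion_embs = np.stack([motion_encoder(m) for m in mocap_library])
```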
Behavior understanding
Large multi-modal vision-language models (VLMs) understand a lot about humans. Can we leverage this for 3D behavior capture, and can we train these models to understand 3D humans? ChatPose [] is the first method that fine-tunes a VLM to understand 3D human pose. When asked about human pose, the model is trained to output a special pose token; the embedding of this token is then decoded by a simple projection layer into continuous SMPL pose and shape parameters. We think this is the future. ChatPose is able to reason beyond the image about what a 3D human pose means, and it can answer questions about what poses people might adopt in the future. It combines, for the first time, the broad general knowledge of large models with the 3D world of humans.
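A minimal sketch of the pose-token idea follows; the hidden size and the SMPL parametrization below are assumptions for illustration, not the exact ChatPose architecture:

```python
import torch
import torch.nn as nn

class PoseTokenHead(nn.Module):
    """Decode the hidden state of a special pose token into SMPL
    parameters. Assumed sizes: a 4096-dim VLM hidden state, 72 axis-angle
    pose parameters (24 joints x 3), and 10 shape coefficients."""
    def __init__(self, hidden_size=4096, n_pose=72, n_shape=10):
        super().__init__()
        self.proj = nn.Linear(hidden_size, n_pose + n_shape)
        self.n_pose = n_pose

    def forward(self, pose_token_embedding):
        out = self.proj(pose_token_embedding)
        return out[..., :self.n_pose], out[..., self.n_pose:]

# During fine-tuning, the VLM learns when to emit the pose token, and
# the head learns to regress continuous parameters from its embedding.
head = PoseTokenHead()
pose, betas = head(torch.randn(1, 4096))
```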
Broader impact
While our focus is on capturing, generating, and understanding 3D humans, our foundational work contributes to other disciplines. For example, we continue to push the state of the art in animal shape and motion capture (PFERD [], VAREN [], BARC [], BITE []). We collaborate with doctors, biomechanics researchers, and psychologists so that our work has an impact outside vision and graphics: predicting the inside of the body from the outside (SKEL [], OSSO [], HIT []), treating eating disorders [][][][], and designing custom surgical plates [] to heal broken limbs. And, while we focus on behavior, the modeling of human appearance is also important. During the reporting period we have developed neural models of clothing (HOOD [], ContourCraft [], GaussianGarments []), hair (HAAR [], MonoHair [], GaussianHaircut []), and overall appearance (TeCH [], TECA [], ECON []).
To have a wide impact on society, we make software and data available for research purposes. Our Software and Data Teams help acquire the best possible data and share code widely; they are critical to our success. We also actively patent and license our technology. For the first time during this reporting period, several papers (TokenHMR [], ChatPose [], ChatHuman []) were collaborations between MPI and Meshcapade (under the terms of a cooperation agreement); a joint patent was filed for ChatHuman. Cooperations like this are increasingly important because they provide the scale necessary to be competitive today.