Humans from Video

Humans are in constant motion. Interactions with the world and with each other involve movement. To capture, model, and synthesize human behavior, we need to analyze it in video. Despite this, most methods for 3D human pose and shape (HPS) estimation focus on single images. Intuitively, we should be able to exploit the regularity of human motion and the extra information provided by multiple video frames to improve HPS estimation over single-image methods. To that end, we are pursuing several lines of research to enable accurate markerless motion capture from unconstrained video "in the wild".
A key enabler of video-based analysis of human motion is training data. To that end, we have exploited our 3D body models (SMPL, etc.) and MoSh to create the large-scale AMASS dataset [] of human motions in a common 3D representation. We used an early version of this to generate the SURREAL dataset [], which contains rendered videos of people in motion. We used SURREAL, for example, to train methods to estimate the optical flow of people in video []. We also used AMASS to train a network to estimate 3D human pose from a sparse set of IMUs [].
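As a rough illustration of the last point, sparse-IMU pose estimation can be framed as sequence-to-sequence regression. The following is a minimal PyTorch sketch under assumed dimensions (6 IMUs, each contributing a flattened 3x3 orientation matrix and a 3D acceleration; output is SMPL's 24 joint rotations in axis-angle form); the architecture, losses, and data handling of the published network differ.

```python
import torch
import torch.nn as nn

class IMUPoseNet(nn.Module):
    """Sketch: bidirectional GRU mapping sparse IMU streams to SMPL pose.

    Input per frame: 6 IMUs x (9 rotation-matrix entries + 3 accelerations) = 72-D.
    Output per frame: 24 SMPL joints x 3 axis-angle parameters = 72-D.
    Hyperparameters here are illustrative, not those of the published model.
    """
    def __init__(self, in_dim=72, hidden=256, out_dim=72):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, out_dim)

    def forward(self, imu_seq):            # (batch, frames, 72)
        feats, _ = self.rnn(imu_seq)       # (batch, frames, 2 * hidden)
        return self.head(feats)            # (batch, frames, 72) axis-angle poses

# Training-step sketch: supervise with AMASS poses paired with
# (synthesized) IMU signals; random tensors stand in for real data.
model = IMUPoseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
imu, gt_pose = torch.randn(8, 120, 72), torch.randn(8, 120, 72)
loss = nn.functional.mse_loss(model(imu), gt_pose)
loss.backward()
opt.step()
```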
Synthetic datasets like SURREAL are not fully representative of real-world video. Consequently, we created the 3D Poses in the Wild (3DPW) dataset by combining IMU data with monocular video. IMUs give 3D pose information but are prone to drift; video gives precise 2D alignment with the image pixels but lacks 3D information. By combining these complementary sources, 3DPW provides class-leading pseudo ground truth and is, consequently, widely used for training and evaluation.
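Conceptually, this fusion can be written as a joint objective that anchors the pose to 2D detections while an IMU term supplies the missing 3D. The sketch below is an assumed, deliberately simplified form with stand-in kinematics and projection functions, not the published 3DPW optimization.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-ins for illustration only: a real system would use the SMPL
# kinematic model, calibrated cameras, and rotation-geodesic distances.
rng = np.random.default_rng(0)
A = rng.standard_normal((24 * 3, 72)) * 0.1   # fake linear "kinematics"

def joints_3d(pose):                # pose params -> 24 3D joints
    return (A @ pose).reshape(-1, 3)

def project(points_3d):             # crude orthographic projection
    return points_3d[:, :2]

x2d = rng.standard_normal((24, 2))          # detected 2D keypoints
imu_pose = rng.standard_normal(72) * 0.1    # drifting IMU pose estimate

def energy(pose, w_imu=0.5, w_reg=1e-3):
    """Sketch of a video+IMU fusion objective (assumed form):
    the 2D term pins the pose to pixels, the IMU term supplies 3D."""
    e_2d = np.sum((project(joints_3d(pose)) - x2d) ** 2)
    e_imu = np.sum((pose - imu_pose) ** 2)   # crude 3D-consistency term
    e_reg = np.sum(pose ** 2)                # pose prior / regularizer
    return e_2d + w_imu * e_imu + w_reg * e_reg

result = minimize(energy, x0=imu_pose, method="L-BFGS-B")
print("fused energy:", energy(result.x))
```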
To estimate 3D humans from video, we have pursued both optimization and regression approaches. Multi-View-SMPLify [] optimizes 3D pose over time using a generic DCT temporal prior. In contrast, VIBE [] uses a GRU-based temporal architecture to regress SMPL parameters from video. VIBE exploits adversarial training with a motion discriminator trained on AMASS [] to encourage the network to produce motions that resemble true human movement.
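To make the DCT prior concrete, the sketch below penalizes whatever part of a pose trajectory is not explained by its lowest-frequency DCT components; the basis size and weighting here are assumptions, but the construction is the same low-frequency idea. (VIBE's motion discriminator plays an analogous realism-scoring role, learned from AMASS rather than hand-designed.)

```python
import numpy as np
from scipy.fft import dct, idct

def dct_temporal_prior(pose_seq, k=8):
    """Sketch of a generic DCT temporal prior (assumed form): penalize
    the part of each pose trajectory not explained by its k
    lowest-frequency DCT components.

    pose_seq: (frames, dims) array of pose parameters over time.
    """
    coeffs = dct(pose_seq, axis=0, norm="ortho")  # per-dim frequency coeffs
    coeffs[k:] = 0.0                              # keep only low frequencies
    smooth = idct(coeffs, axis=0, norm="ortho")   # smooth reconstruction
    return np.sum((pose_seq - smooth) ** 2)       # high-frequency penalty

# Example: a jittery trajectory scores higher (is less plausible)
# than a smooth one.
t = np.linspace(0, 1, 120)[:, None]
smooth_seq = np.sin(2 * np.pi * t) * np.ones((1, 72))
jittery_seq = smooth_seq + 0.05 * np.random.default_rng(0).standard_normal((120, 72))
print(dct_temporal_prior(smooth_seq), dct_temporal_prior(jittery_seq))
```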
With SMIL [], we capture the motion of infants in RGB-D sequences and go further, using the sequences to learn the 3D infant shape model itself. By analyzing the infants' movements, we provide an assessment related to cerebral palsy [].
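At its core, learning a statistical shape model from registered meshes amounts to PCA over vertex positions. The sketch below shows that basic construction with stand-in data; the actual SMIL pipeline additionally handles registration to noisy RGB-D data, pose normalization, and infant-specific details.

```python
import numpy as np

def learn_shape_space(registered_verts, n_components=10):
    """Sketch: build a PCA shape space from registered meshes, the basic
    construction behind SMPL/SMIL-style shape models.

    registered_verts: (n_subjects, n_vertices, 3), same mesh topology.
    Returns the mean shape and the top principal shape directions.
    """
    X = registered_verts.reshape(len(registered_verts), -1)  # flatten meshes
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)  # PCA via SVD
    basis = Vt[:n_components]            # (n_components, 3 * n_vertices)
    return mean, basis

# A new shape is mean + betas @ basis, with betas the shape coefficients.
verts = np.random.default_rng(0).standard_normal((50, 1000, 3))  # stand-in data
mean, basis = learn_shape_space(verts)
betas = np.zeros(10)
betas[0] = 2.0
new_shape = (mean + betas @ basis).reshape(1000, 3)
```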