Capturing and Inferring Dense Full-Body Human-Scene Contact

Perceiving Systems | Optics and Sensing Laboratory | Conference Paper | 2022

Perceiving Systems

Chun-Hao Paul Huang

Guest Scientist

Perceiving Systems

Hongwei Yi

Guest Scientist

Perceiving Systems

Markus Höschle

Mechatronics Technician

Optics and Sensing Laboratory

Optics and Sensing Laboratory

Senya Polikovsky

Optics & Sensing Laboratory

Perceiving Systems

Daniel Scharstein

Affiliated Researcher

Perceiving Systems

Michael Black

Director

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for “Real scenes, Interaction, Contact and Humans.” RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.
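As an illustration of what "vertex-level contact labels" and "dense body-scene contacts" can look like in practice, the following minimal sketch represents contact as one binary label per body-mesh vertex and compares a thresholded prediction against ground truth. It is not the BSTRO implementation. The function names, the 0.5 threshold, the toy 8-vertex mesh, and the use of precision, recall, and F1 are all assumptions made here for illustration; the abstract does not state how predictions are scored.

```python
from typing import List, Tuple


def threshold_contacts(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Turn per-vertex contact probabilities into binary labels (1 = in contact).

    The 0.5 default is an illustrative assumption, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall_f1(pred: List[int], truth: List[int]) -> Tuple[float, float, float]:
    """Compare predicted and ground-truth per-vertex contact labels.

    Precision/recall/F1 are a common choice for binary labels; whether the
    authors use exactly these metrics is not stated in the abstract.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have one label per vertex")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with a hypothetical 8-vertex "mesh" (real body meshes have
# thousands of vertices).
truth = [1, 1, 0, 0, 0, 1, 0, 0]                        # ground-truth contact labels
probs = [0.9, 0.4, 0.1, 0.7, 0.2, 0.8, 0.05, 0.3]       # hypothetical network output
pred = threshold_contacts(probs)
precision, recall, f1 = precision_recall_f1(pred, truth)
print(pred, round(precision, 3), round(recall, 3), round(f1, 3))
```

Storing contact as a per-vertex vector keeps the label resolution tied to the body mesh rather than to a handful of predefined body parts, which is the contrast the abstract draws with earlier primitive-based methods.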

Author(s):	Huang, Chun-Hao P and Yi, Hongwei and Höschle, Markus and Safroshkin, Matvey and Alexiadis, Tsvetelina and Polikovsky, Senya and Scharstein, Daniel and Black, Michael J.
Book Title:	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)
Pages:	13264--13275
Year:	2022
Month:	June
Publisher:	IEEE

Project(s):	Capturing Contact
BibTeX Type:	Conference Paper (inproceedings)

Address:	Piscataway, NJ
DOI:	10.1109/CVPR52688.2022.01292
Event Name:	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)
Event Place:	New Orleans, Louisiana
State:	Published
URL:	https://rich.is.tue.mpg.de/

Electronic Archiving:	grant_archive
ISBN:	978-1-6654-6947-0

Links:	project arXiv BSTRO code video

BibTeX

@inproceedings{huang2022rich,
  title = {Capturing and Inferring Dense Full-Body Human-Scene Contact},
  booktitle = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)},
  abstract = {Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for “Real scenes, Interaction, Contact and Humans.” RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.},
  pages = {13264--13275},
  publisher = {IEEE},
  address = {Piscataway, NJ},
  month = jun,
  year = {2022},
  slug = {huang2022rich},
  author = {Huang, Chun-Hao P and Yi, Hongwei and H{\"o}schle, Markus and Safroshkin, Matvey and Alexiadis, Tsvetelina and Polikovsky, Senya and Scharstein, Daniel and Black, Michael J.},
  url = {https://rich.is.tue.mpg.de/},
  month_numeric = {6}
}
