Perceiving Systems Ph.D. Thesis 2024

Learning Digital Humans from Vision and Language


The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare.

However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data.

Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation.

We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers.

Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars.

Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions.

Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
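The last of these contributions rests on the idea that a 3D body pose can be treated as something a language model can read and write, much like a token. The sketch below is a minimal, hypothetical illustration of that idea rather than the thesis' actual architecture; the class name, dimensions, and linear projections are assumptions made purely for the example.

import torch
import torch.nn as nn

class PoseTokenBridge(nn.Module):
    """Illustrative sketch (not the thesis' architecture): map SMPL pose
    parameters into an LLM's token-embedding space and back."""

    def __init__(self, llm_dim: int = 4096, num_joints: int = 24):
        super().__init__()
        pose_dim = num_joints * 3  # SMPL body pose: 24 joints x 3 axis-angle values
        # Project a pose vector into a single "pose token" in the LLM embedding space.
        self.encode = nn.Linear(pose_dim, llm_dim)
        # Project an LLM hidden state back to SMPL pose parameters.
        self.decode = nn.Linear(llm_dim, pose_dim)

    def pose_to_token(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (batch, 72) axis-angle SMPL pose -> (batch, 1, llm_dim)
        return self.encode(pose).unsqueeze(1)

    def token_to_pose(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, llm_dim) hidden state at a dedicated pose position
        return self.decode(hidden)

bridge = PoseTokenBridge()
pose = torch.zeros(1, 72)                 # a neutral SMPL pose
pose_token = bridge.pose_to_token(pose)   # would be spliced into the LLM input sequence
recovered = bridge.token_to_pose(pose_token.squeeze(1))
print(pose_token.shape, recovered.shape)  # torch.Size([1, 1, 4096]) torch.Size([1, 72])

In such a setup the pose token sits in the same sequence as image and text tokens, which is what lets the model reason about, generate, and discuss poses within ordinary dialogue.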

Author(s): Yao Feng
Year: 2024
Month: October
Bibtex Type: Ph.D. Thesis (phdthesis)
Degree Type: PhD
DOI: 10.3929/ethz-b-000712913
Institution: ETH Zürich
State: Published
URL: https://www.research-collection.ethz.ch/handle/20.500.11850/712913

BibTeX

@phdthesis{YaoFengThesis2024,
  title = {Learning Digital Humans from Vision and Language},
  abstract = {The study of realistic digital humans has gained significant attention within
  the research communities of computer vision, computer graphics, and machine learning.
  This growing interest is motivated by the crucial understanding of human selves and the
  essential role digital humans play in enabling the metaverse. Applications span various
  sectors including virtual presence, fitness, digital fashion, entertainment, humanoid
  robots and healthcare.
  However, learning about 3D humans presents significant challenges due to data scarcity.
  In an era where scalability is crucial for AI, this raises the question: can we enhance
  the scalability of learning digital humans? To understand this, consider how humans
  interact: we observe and communicate, forming impressions of others through these
  interactions. This thesis proposes a similar potential for computers: could they be
  taught to understand humans by observing and listening? Such an approach would involve
  processing visual data, like images and videos, and linguistic data from text
  descriptions. Thus, this research endeavors to enable machines to learn about digital
  humans from vision and language, both of which are readily available and scalable
  sources of data.
  Our research begins by developing a framework to create detailed 3D faces from
  in-the-wild images. This framework, capable of generating highly realistic and
  animatable 3D faces from single images, is trained without paired 3D supervision and
  achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles
  identity and expression details, thereby enhancing facial animation.
  We then explore capturing the body, clothing, face, and hair from monocular videos,
  using a novel hybrid explicit-implicit 3D representation. This approach facilitates the
  disentangled learning of digital humans from monocular videos and allows for the easy
  transfer of hair and clothing to different bodies, as demonstrated through experiments
  in disentangled reconstruction, virtual try-ons, and hairstyle transfers.
  Next, we present a method that utilizes text-visual foundation models to generate
  highly realistic 3D faces, complete with hair and accessories, based on text
  descriptions. These foundation models are trained exclusively on in-the-wild images and
  efficiently produce detailed and realistic outputs, facilitating the creation of
  authentic avatars.
  Finally, we introduce a framework that employs Large Language Models (LLMs) to
  interpret and generate 3D human poses from both images and text. This method, inspired
  by how humans intuitively understand postures, merges image interpretation with body
  language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only
  integrates semantic reasoning but also enhances the generation and understanding of 3D
  poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs
  facilitates interactive discussions with users about human poses, enriching
  human-computer interactions.
  Our research on digital humans significantly boosts scalability and controllability.
  By generating digital humans from images, videos, and text, we democratize their
  creation, making it broadly accessible through everyday imagery and straightforward
  text, while enhancing generalization. Disentangled modeling and interactive chatting
  with human poses increase the controllability of digital humans and improve user
  interactions and customizations, showcasing their potential to extend into various
  disciplines.},
  degree_type = {PhD},
  institution = {ETH Zürich},
  month = oct,
  year = {2024},
  slug = {yaofengthesis2024},
  author = {Feng, Yao},
  url = {https://www.research-collection.ethz.ch/handle/20.500.11850/712913},
  month_numeric = {10}
}