Perceiving Systems Ph.D. Thesis 2024

Learning Digital Humans from Vision and Language


The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare.

However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data.

Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation.

We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers.

Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars.

Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions.

Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
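The last of these contributions rests on the idea that a 3D body pose can be treated as something a language model can read and write, much like a token. The sketch below is a minimal, hypothetical illustration of that idea rather than the thesis' actual architecture; the class name, dimensions, and linear projections are assumptions made purely for the example.

import torch
import torch.nn as nn

class PoseTokenBridge(nn.Module):
    """Illustrative sketch (not the thesis' architecture): map SMPL pose
    parameters into an LLM's token-embedding space and back."""

    def __init__(self, llm_dim: int = 4096, num_joints: int = 24):
        super().__init__()
        pose_dim = num_joints * 3  # SMPL body pose: 24 joints x 3 axis-angle values
        # Project a pose vector into a single "pose token" in the LLM embedding space.
        self.encode = nn.Linear(pose_dim, llm_dim)
        # Project an LLM hidden state back to SMPL pose parameters.
        self.decode = nn.Linear(llm_dim, pose_dim)

    def pose_to_token(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (batch, 72) axis-angle SMPL pose -> (batch, 1, llm_dim)
        return self.encode(pose).unsqueeze(1)

    def token_to_pose(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, llm_dim) hidden state at a dedicated pose position
        return self.decode(hidden)

bridge = PoseTokenBridge()
pose = torch.zeros(1, 72)                 # a neutral SMPL pose
pose_token = bridge.pose_to_token(pose)   # would be spliced into the LLM input sequence
recovered = bridge.token_to_pose(pose_token.squeeze(1))
print(pose_token.shape, recovered.shape)  # torch.Size([1, 1, 4096]) torch.Size([1, 72])

In such a setup the pose token sits in the same sequence as image and text tokens, which is what lets the model reason about, generate, and discuss poses within ordinary dialogue.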

Author(s): Yao Feng
Year: 2024
Month: October
Bibtex Type: Ph.D. Thesis (phdthesis)
Degree Type: PhD
DOI: 10.3929/ethz-b-000712913
Institution: ETH Zürich
State: Published
URL: https://www.research-collection.ethz.ch/handle/20.500.11850/712913

BibTeX

@phdthesis{YaoFengThesis2024,
  title = {Learning Digital Humans from Vision and Language},
  abstract = {The study of realistic digital humans has gained significant attention within
  the research communities of computer vision, computer graphics, and machine learning.
  This growing interest is motivated by the crucial understanding of human selves and the
  essential role digital humans play in enabling the metaverse. Applications span various
  sectors including virtual presence, fitness, digital fashion, entertainment, humanoid
  robots and healthcare.
  However, learning about 3D humans presents significant challenges due to data scarcity.
  In an era where scalability is crucial for AI, this raises the question: can we enhance
  the scalability of learning digital humans? To understand this, consider how humans
  interact: we observe and communicate, forming impressions of others through these
  interactions. This thesis proposes a similar potential for computers: could they be
  taught to understand humans by observing and listening? Such an approach would involve
  processing visual data, like images and videos, and linguistic data from text
  descriptions. Thus, this research endeavors to enable machines to learn about digital
  humans from vision and language, both of which are readily available and scalable
  sources of data.
  Our research begins by developing a framework to create detailed 3D faces from
  in-the-wild images. This framework, capable of generating highly realistic and
  animatable 3D faces from single images, is trained without paired 3D supervision and
  achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles
  identity and expression details, thereby enhancing facial animation.
  We then explore capturing the body, clothing, face, and hair from monocular videos,
  using a novel hybrid explicit-implicit 3D representation. This approach facilitates the
  disentangled learning of digital humans from monocular videos and allows for the easy
  transfer of hair and clothing to different bodies, as demonstrated through experiments
  in disentangled reconstruction, virtual try-ons, and hairstyle transfers.
  Next, we present a method that utilizes text-visual foundation models to generate
  highly realistic 3D faces, complete with hair and accessories, based on text
  descriptions. These foundation models are trained exclusively on in-the-wild images and
  efficiently produce detailed and realistic outputs, facilitating the creation of
  authentic avatars.
  Finally, we introduce a framework that employs Large Language Models (LLMs) to
  interpret and generate 3D human poses from both images and text. This method, inspired
  by how humans intuitively understand postures, merges image interpretation with body
  language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only
  integrates semantic reasoning but also enhances the generation and understanding of 3D
  poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs
  facilitates interactive discussions with users about human poses, enriching
  human-computer interactions.
  Our research on digital humans significantly boosts scalability and controllability.
  By generating digital humans from images, videos, and text, we democratize their
  creation, making it broadly accessible through everyday imagery and straightforward
  text, while enhancing generalization. Disentangled modeling and interactive chatting
  with human poses increase the controllability of digital humans and improve user
  interactions and customizations, showcasing their potential to extend into various
  disciplines.},
  degree_type = {PhD},
  institution = {ETH Zürich},
  month = oct,
  year = {2024},
  slug = {yaofengthesis2024},
  author = {Feng, Yao},
  url = {https://www.research-collection.ethz.ch/handle/20.500.11850/712913},
  month_numeric = {10}
}