Learning Digital Humans from Vision and Language
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans?

To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data.

Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation.

We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers.

Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars.

Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions.

Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
@phdthesis{YaoFengThesis2024,
  title = {Learning Digital Humans from Vision and Language},
  abstract = {The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data. Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation. We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers. Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars. Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions. Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.},
  degree_type = {PhD},
  institution = {ETH Zürich},
  month = oct,
  year = {2024},
  slug = {yaofengthesis2024},
  author = {Feng, Yao},
  url = {https://www.research-collection.ethz.ch/handle/20.500.11850/712913},
  month_numeric = {10}
}