Monocular 3D Shape and Pose Estimation for Humans and Animals
Accurately estimating the 3D shape and pose of humans and animals from images is a key problem in computer vision. These estimates have numerous potential applications in areas including virtual reality, health monitoring, sports analysis, and robotics. Although significant progress has been made on monocular 3D human reconstruction in recent years, research on animals has lagged behind, largely due to the scarcity of 3D scans and motion capture data, which hinders the development of expressive shape and pose priors. Exploiting such priors is a common approach for resolving the inherent ambiguities that arise when predicting 3D articulated pose from 2D data. Additionally, the extreme appearance variability and frequent occlusions that occur with quadrupeds pose further challenges for accurate 3D shape and pose recovery. With 3D animal reconstruction in mind, our goal is to advance monocular 3D shape and pose estimation for cases where data is hard to obtain.

We begin by demonstrating a conceptually novel solution to a problem setting with little to no labeled data. Specifically, we learn the underlying relationship between a 3D parametric model and a set of unlabeled images (no keypoints, no segmentation masks) showing the object of interest. Our solution chains two unsupervised cycles that connect representations at three levels of abstraction: image, segmentation, and 3D mesh. We demonstrate the feasibility of our approach on both synthetic and real data for humans.

Subsequently, we investigate how results can be enhanced by leveraging readily available 2D data. Using dogs as a representative class, we start from the key insight that animal class, or breed, is directly related to shape similarity: although intra-class variability is significant, dogs of the same breed generally look more alike than dogs of different breeds. A triplet loss, together with a classification loss, enables us to learn a structured latent shape space, which in turn improves 3D dog shape estimation at test time.

Finally, we focus on 3D pose estimation. We show how a different cue, namely ground contact, can reduce the need for images with 3D ground truth or expressive pose priors, neither of which is available for most animal species. We learn to predict 3D poses that are consistent with ground contact. To that end, we define losses that pull contact vertices towards a common, estimated ground plane, together with a constraint that penalizes interpenetration with the ground. This yields significant improvements over the previous state of the art. Furthermore, our predicted ground contact labels can optionally be used in a test-time optimization loop, further improving 3D shape and pose recovery.
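To make the chained-cycle idea concrete, the following is a minimal sketch in PyTorch of two cycle-consistency losses linking image, segmentation, and 3D model parameters. All module names, dimensions, and the stand-in renderer are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch of the two-cycle chain (not the thesis code).
# Cycle A links images and segmentations; cycle B links segmentations and
# 3D model parameters via a (here: mocked) differentiable silhouette renderer.
import torch
import torch.nn as nn

class Img2Seg(nn.Module):          # image -> soft silhouette
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)

class Seg2Img(nn.Module):          # silhouette -> image
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 3, 3, padding=1))
    def forward(self, s):
        return self.net(s)

class Seg2Params(nn.Module):       # silhouette -> pose/shape parameters
    def __init__(self, n_params=82):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256),
                                 nn.ReLU(), nn.Linear(256, n_params))
    def forward(self, s):
        return self.net(s)

def render_silhouette(params):
    # Stand-in for a differentiable renderer of the parametric body model.
    return torch.sigmoid(params.mean(dim=1, keepdim=True))[:, :, None, None].expand(-1, 1, 64, 64)

img2seg, seg2img, seg2params = Img2Seg(), Seg2Img(), Seg2Params()
images = torch.rand(4, 3, 64, 64)   # unlabeled real images
params = torch.randn(4, 82)         # sampled model parameters

# Cycle A: image -> segmentation -> image (self-supervised reconstruction).
loss_cycle_a = (seg2img(img2seg(images)) - images).abs().mean()

# Cycle B: parameters -> rendered segmentation -> parameters.
loss_cycle_b = (seg2params(render_silhouette(params)) - params).pow(2).mean()

(loss_cycle_a + loss_cycle_b).backward()
```

The point of the chain is that neither cycle needs image-to-mesh pairs: cycle A is trained on unlabeled images, and cycle B on parameters sampled from the 3D model itself.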
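The breed-structured shape space can be sketched as two standard loss terms on the latent code, shown below. Dimensions, names, and the assumption that the latent code also feeds a breed classifier are illustrative.

```python
# Hedged sketch of structuring a latent dog-shape space with breed labels.
import torch
import torch.nn.functional as F

def breed_losses(z_anchor, z_pos, z_neg, logits, breed_labels, margin=0.2):
    # Triplet loss: same-breed latent codes pulled together,
    # different-breed codes pushed apart by at least `margin`.
    loss_triplet = F.triplet_margin_loss(z_anchor, z_pos, z_neg, margin=margin)
    # Classification loss: the latent code should also predict the breed.
    loss_cls = F.cross_entropy(logits, breed_labels)
    return loss_triplet + loss_cls

# Toy usage: 8-dimensional latent shape codes, 120 hypothetical breeds.
z_a, z_p, z_n = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)
logits = torch.randn(16, 120)
labels = torch.randint(0, 120, (16,))
print(breed_losses(z_a, z_p, z_n, logits, labels))
```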
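The ground-contact terms admit a similarly compact sketch: pull predicted contact vertices onto an estimated plane and penalize any vertex that falls below it. The plane parameterization (unit normal plus offset) and all tensor shapes are assumptions for illustration.

```python
# Hedged sketch of the ground-contact losses (illustrative, not the thesis code).
import torch

def ground_contact_losses(vertices, contact_prob, normal, offset):
    """vertices: (B, V, 3); contact_prob: (B, V) in [0, 1];
    normal: (B, 3) unit plane normals; offset: (B,) plane offsets.
    Signed distance of vertex v to the plane is dot(n, v) - d,
    positive above the ground and negative below it."""
    signed_dist = torch.einsum('bvc,bc->bv', vertices, normal) - offset[:, None]
    # Pull likely contact vertices onto the estimated ground plane.
    loss_contact = (contact_prob * signed_dist.abs()).mean()
    # Penalize vertices penetrating the ground (negative signed distance).
    loss_penetration = signed_dist.clamp(max=0.0).pow(2).mean()
    return loss_contact, loss_penetration

# Toy usage with a SMAL-sized mesh and an upward-pointing y-axis normal.
verts = torch.randn(2, 3889, 3)
probs = torch.rand(2, 3889)                           # predicted contact labels
n = torch.tensor([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
d = torch.zeros(2)
print(ground_contact_losses(verts, probs, n, d))
```

Because both terms depend only on the predicted mesh and an estimated plane, they can supervise training without 3D ground truth, and the same terms can be reused in a test-time optimization loop.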
@phdthesis{Ruegg:Thesis:2023,
  title = {Monocular {3D} Shape and Pose Estimation for Humans and Animals},
  author = {Rueegg, Nadine},
  degree_type = {PhD},
  year = {2023},
  slug = {ruegg-thesis-2023}
}