Embodied Vision Ph.D. Thesis 2024

Investigating Shape Priors, Relationships, and Multi-Task Cues for Object-level Scene Understanding

Humans are proficient at intuitively identifying objects and reasoning about their diverse properties from complex visual observations. Despite significant advances in artificial intelligence, computers have yet to achieve a comparable level of understanding, which is crucial for effective reasoning about tasks and interactions within an environment. In this thesis, we explore the benefits of various visual cues when dealing with key challenges in scene understanding, specifically focusing on weak supervision, finding view correspondence, and paradigms for simultaneously learning multiple tasks.

We begin by investigating cues that reduce the need for full supervision. In particular, we propose an approach for learning multi-object 3D scene decomposition and object-wise properties from single images with only weak supervision. Our method utilizes a recurrent encoder to infer a latent representation for each object and a differentiable renderer to obtain a training signal. To guide the training process and constrain the search space of possible solutions, we leverage prior knowledge through pre-trained 3D shape spaces.

Subsequently, we investigate the benefits of reasoning about relations between objects to learn more distinct object representations that allow for matching object detections across viewpoint changes. To address this, we introduce an approach that employs graph neural networks to learn matching features based on appearance as well as inter- and cross-frame relations. We conduct comparisons with keypoint-based methods and propose a methodology to combine these approaches, aiming to achieve overall improved performance.

Finally, we consider the challenge of multi-task learning and analyze related paradigms in the context of basic single-task learning.
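The cross-view matching step can be illustrated with a minimal sketch: given per-object feature vectors from two frames, a cosine-similarity matrix combined with optimal one-to-one assignment (the Hungarian algorithm) pairs detections across the viewpoint change. The features and dimensions below are hypothetical stand-ins, not the GNN-learned representations from the thesis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(feats_a, feats_b):
    """Match object features across two frames via cosine similarity
    and optimal one-to-one assignment (Hungarian algorithm)."""
    # Normalize rows so dot products become cosine similarities.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                             # (n_a, n_b) similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return list(zip(rows, cols)), sim

# Hypothetical 4-D features for three objects seen from two viewpoints;
# frame B permutes the objects of frame A and adds mild noise.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(3, 4))
perm = [2, 0, 1]
feats_b = feats_a[perm] + 0.01 * rng.normal(size=(3, 4))

pairs, _ = match_detections(feats_a, feats_b)
print(pairs)  # each (i, j) pairs object i in frame A with detection j in frame B
```

This only shows the assignment step; the thesis's contribution lies in learning features whose similarities remain discriminative under viewpoint changes.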
In particular, we study the impact of the choice of optimizer, the role of gradient conflicts, and the transferability of features learned under either setup when evaluated on common image corruptions. Our findings reveal surprising similarities between single-task and multi-task learning, suggesting that methods and techniques from one field could be advantageously applied to the other.
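A gradient conflict between two tasks is commonly quantified as a negative cosine similarity between their per-task gradients on the shared parameters. The sketch below illustrates that diagnostic on toy gradient vectors; the vectors are illustrative only, not results from the thesis.

```python
import numpy as np

def gradient_conflict(grad_i, grad_j):
    """Cosine similarity between two task gradients; a negative value
    indicates the tasks pull the shared parameters in opposing directions."""
    cos = np.dot(grad_i, grad_j) / (np.linalg.norm(grad_i) * np.linalg.norm(grad_j))
    return cos, cos < 0.0

# Toy per-task gradients over the same shared parameters (illustrative only).
g_task_a = np.array([1.0, 2.0, -1.0])
g_aligned = np.array([0.5, 1.0, -0.5])    # same direction: no conflict
g_opposed = np.array([-1.0, -2.0, 1.0])   # opposite direction: conflict

cos_a, conflict_a = gradient_conflict(g_task_a, g_aligned)
cos_b, conflict_b = gradient_conflict(g_task_a, g_opposed)
print(f"aligned: cos={cos_a:.2f} conflict={conflict_a}")
print(f"opposed: cos={cos_b:.2f} conflict={conflict_b}")
```

Many multi-task optimization methods are built around exactly this quantity, e.g. by projecting out or reweighting conflicting gradient components before the update.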

Author(s): Elich, Cathrin
Year: 2024
BibTeX Type: Ph.D. Thesis (phdthesis)
Address: Zurich
Degree Type: PhD
DOI: 10.3929/ethz-b-000706421
School: ETH Zürich
State: Published
URL: https://doi.org/10.3929/ethz-b-000706421

BibTeX

@phdthesis{elich2024phdthesis,
  title = {Investigating Shape Priors, Relationships, and Multi-Task Cues for Object-level Scene Understanding},
  abstract = {Humans are proficient at intuitively identifying objects and reasoning about their diverse properties from complex visual observations. Despite significant advances in artificial intelligence, computers have yet to achieve a comparable level of understanding, which is crucial for effective reasoning about tasks and interactions within an environment. In this thesis, we explore the benefits of various visual cues when dealing with key challenges in scene understanding, specifically focusing on weak supervision, finding view correspondence, and paradigms for simultaneously learning multiple tasks.

  We begin by investigating cues that reduce the need for full supervision. In particular, we propose an approach for learning multi-object 3D scene decomposition and object-wise properties from single images with only weak supervision. Our method utilizes a recurrent encoder to infer a latent representation for each object and a differentiable renderer to obtain a training signal. To guide the training process and constrain the search space of possible solutions, we leverage prior knowledge through pre-trained 3D shape spaces. Subsequently, we investigate the benefits of reasoning about relations between objects to learn more distinct object representations that allow for matching object detections across viewpoint changes. To address this, we introduce an approach that employs graph neural networks to learn matching features based on appearance as well as inter- and cross-frame relations. We conduct comparisons with keypoint-based methods and propose a methodology to combine these approaches, aiming to achieve overall improved performance. Finally, we consider the challenge of multi-task learning and analyze related paradigms in the context of basic single-task learning. In particular, we study the impact of the choice of optimizer, the role of gradient conflicts, and the transferability of features learned under either setup when evaluated on common image corruptions. Our findings reveal surprising similarities between single-task and multi-task learning, suggesting that methods and techniques from one field could be advantageously applied to the other.},
  degree_type = {PhD},
  school = {ETH Zürich},
  address = {Zurich},
  year = {2024},
  slug = {elich2024phdthesis},
  author = {Elich, Cathrin},
  url = {https://doi.org/10.3929/ethz-b-000706421}
}