Research Overview

Deep neural networks excel at complex cognitive tasks but have key limitations: they are vulnerable to minor input perturbations, struggle to generalize beyond their training data, and need extensive data to learn new tasks. These limitations stem from shortcut learning: networks rely on superficial statistical patterns rather than the underlying causal structure of their data. Our research combines several empirical and theoretical approaches (adversarial learning, disentanglement, interpretability, self-supervised learning, and nonlinear Independent Component Analysis) to develop principled techniques for learning visual representations that reveal this underlying structure and narrow the gap between human and machine vision.
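As a toy illustration of shortcut learning (a minimal sketch, not drawn from our experiments; the features, dataset, and hyperparameters below are invented for illustration), the following Python snippet trains a logistic-regression classifier on data in which a spurious "background" feature is perfectly correlated with the label during training. The model latches onto that shortcut, and its accuracy collapses once the correlation is broken at test time.

    # Hypothetical toy example of shortcut learning (illustrative only).
    # Two features: a causal but noisy "shape" feature and a spurious
    # "background" feature that matches the label exactly during training.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    shape = rng.normal(size=n)                               # causal, noisy signal
    labels = (shape + rng.normal(size=n) > 0).astype(float)  # label depends on shape
    background = 2 * labels - 1                               # spurious shortcut (+1/-1)
    X_train = np.stack([shape, background], axis=1)

    # Logistic regression fitted with plain gradient descent.
    w = np.zeros(2)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-X_train @ w))
        w -= 0.1 * X_train.T @ (p - labels) / n

    # At test time the background is randomised, so the shortcut no longer works.
    X_test = np.stack([shape, rng.choice([-1.0, 1.0], size=n)], axis=1)
    acc = ((X_test @ w > 0).astype(float) == labels).mean()
    print("weights (shape, background):", w)        # background weight dominates
    print("test accuracy without shortcut:", acc)   # typically near chance

The classifier places most of its weight on the background feature because that is the statistically easiest solution on the training set, even though the shape feature is the one that actually causes the label.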
The long-term research goal of the Robust Machine Learning group is to develop robust and adaptable vision models that reason about the world like humans. Today's machine vision systems require vast amounts of training data, are easily derailed by small input perturbations, infer highly entangled representations, and often behave incomprehensibly in novel situations. The human visual system, by contrast, learns from far less data, is robust to large perturbations, infers a highly structured representation of its environment, and generalizes effortlessly to novel situations.
The current gap between human and machine vision has severe consequences. Training a human to drive safely takes a few thousand kilometers; training a machine to drive safely takes billions. We argue that a root cause of this gap is machines' reliance on statistical shortcuts. As a result, machine vision models lack a true understanding of the compositional and object-centric nature of our visual world, as well as of the natural laws and causal relationships that govern it.
Our research program is organized into two parallel workstreams. In the first, we identify differences between human and machine visual processing: we study which statistical patterns state-of-the-art machine vision models exploit and how this leads to behavioral differences from humans, particularly in scenarios involving novel objects or unusual contexts. In the second workstream, we work on closing these gaps: we exploit a wider range of learning signals that carry rich information about the structure of the world and use theory-driven approaches to enable vision and language models to extract the underlying rules of their environment. This includes developing new architectures and training paradigms that incorporate inductive biases inspired by human cognition.