Mechanistic Interpretability

The success of deep neural networks (DNNs) comes at the expense of interpretability, as these models operate as complex, high-dimensional "black boxes" that obscure the reasoning behind their predictions. This lack of transparency poses significant challenges, particularly in safety-critical applications such as healthcare, autonomous systems, and criminal justice, where understanding the rationale behind a model's decision is paramount. Enhancing our understanding of the inner workings of DNNs could unlock the ability to diagnose and correct errors, increase trust and accountability, and design models that align better with human reasoning. Moreover, by illuminating the pathways of information processing within these systems, we might be able to devise more efficient architectures and training strategies, paving the way for more robust and interpretable machine learning.
To better understand and analyze the inner workings of deep neural networks, the Robust Machine Learning group investigates visualization methods that aim to reveal the causes of intermediate neural responses. A widely used approach in this space is feature visualization, which synthesizes highly activating stimuli to highlight the causes of a unit’s activation.
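Feature visualization is typically implemented as activation maximization: starting from noise, an input is optimized by gradient ascent so that a chosen unit responds as strongly as possible. The following is a minimal sketch of this idea; the backbone, layer, channel index, and optimization settings are illustrative assumptions, and practical pipelines add regularizers and transformation robustness to obtain recognizable images.

```python
import torch
import torchvision.models as models

# Minimal sketch of feature visualization via activation maximization.
# Model, layer, channel index, and hyperparameters are illustrative choices.
model = models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)

feats = {}
model.layer3.register_forward_hook(lambda m, i, o: feats.update(out=o))

unit = 42                                               # channel to visualize
img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(img)
    # Gradient ascent on the mean activation of the chosen channel;
    # real pipelines add priors (e.g. jitter, total variation) for clarity.
    loss = -feats["out"][0, unit].mean()
    loss.backward()
    optimizer.step()

# `img` now holds a synthesized, highly activating stimulus for `unit`.
```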
However, in two large-scale psychophysics experiments~[], we found that feature visualization does not convey more information to humans than simple baselines. Moreover, while many reasoning aspects scale favorably with data and model size, mechanistic per-unit interpretability does not improve, and may even decrease, for visual foundation models~[].
In a subsequent study~[], we demonstrated that high-performance models can be built with nearly arbitrary feature visualizations, showing that these visualizations can lack any meaningful connection to the model’s underlying information processing.
Moving forward, we aim to develop models with more interpretable internal representations. As a step in this direction, we recently introduced an automatic interpretability estimator that predicts how well humans will understand a unit’s response based on minimally and maximally activating reference images~[]. In ongoing work, we are extending this method to serve as a regularization tool during training, with the goal of reorganizing internal representations to improve human interpretability.
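One way to think about such an estimator, though not necessarily the method used in the cited work, is as a function that takes a unit's least and most activating dataset images and scores how easily humans could tell the two sets apart. The sketch below uses a simple proxy, the cosine separation of the two reference sets in a generic embedding space, under assumed choices of backbone, layer, reference-set size, and dataloader.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Hypothetical proxy for per-unit interpretability: rank dataset images by a
# unit's activation, keep the k least / most activating ones, and measure how
# cleanly the two reference sets separate in a generic embedding space.
# Backbone, layer, k, and the scoring rule are assumptions for illustration.
model = models.resnet50(weights="IMAGENET1K_V2").eval()

feats = {}
model.layer3.register_forward_hook(lambda m, i, o: feats.update(act=o))
model.avgpool.register_forward_hook(lambda m, i, o: feats.update(emb=o.flatten(1)))

unit, k = 42, 9  # illustrative channel index and reference-set size


@torch.no_grad()
def interpretability_proxy(dataloader):
    acts, embs = [], []
    for images, _ in dataloader:  # dataloader of (image, label) batches
        model(images)
        acts.append(feats["act"][:, unit].mean(dim=(1, 2)))  # per-image activation
        embs.append(F.normalize(feats["emb"], dim=1))
    acts, embs = torch.cat(acts), torch.cat(embs)
    order = acts.argsort()
    lo, hi = embs[order[:k]], embs[order[-k:]]
    # Score: cosine separation between the means of the min- and max-activating sets.
    return 1.0 - F.cosine_similarity(lo.mean(0), hi.mean(0), dim=0).item()
```

A higher score under this proxy indicates that the unit's extreme reference images look systematically different, which is a necessary, though not sufficient, condition for humans to grasp what the unit responds to.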