Robust Machine Learning

Mechanistic Interpretability

Psychophysics experiment
In a typical psychophysics setup, we show participants two sets of feature visualizations for a given unit in a deep neural network: one set that maximally activates the unit and one that minimally activates it. Participants then decide which of two query images strongly activates that unit.
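As a rough illustration of how responses from such a two-alternative task might be scored, the sketch below assumes that each trial records the unit's activation for the two query images together with the participant's choice. The `Trial` class, its field names, and the example values are hypothetical and not taken from the original experiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One hypothetical trial: two query images and the participant's choice."""
    activation_a: float  # unit activation for query image A (made-up value)
    activation_b: float  # unit activation for query image B (made-up value)
    response: str        # participant's answer, "A" or "B"


def correct_choice(trial: Trial) -> str:
    """Return the label of the query image that activates the unit more strongly."""
    return "A" if trial.activation_a > trial.activation_b else "B"


def accuracy(trials: List[Trial]) -> float:
    """Fraction of trials in which the participant picked the strongly activating image."""
    if not trials:
        raise ValueError("at least one trial is required")
    hits = sum(t.response == correct_choice(t) for t in trials)
    return hits / len(trials)


# Illustrative data only, not experimental results.
trials = [
    Trial(5.2, -1.3, "A"),
    Trial(0.4, 3.9, "B"),
    Trial(2.0, -0.5, "B"),
    Trial(-2.1, 1.7, "B"),
]
print(accuracy(trials))
```

With two alternatives per trial, chance performance is 0.5, so an accuracy well above that level would suggest that the feature visualizations helped participants predict the unit's behavior.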