Science of Machine Learning Benchmarks
Machine learning as a scientific discipline has largely followed an “anything goes” approach to scientific discovery. Since the field’s roots in the cybernetics era of the 1940s, researchers have designed and experimented freely, without following any apparent set of rules.
The one mechanism that moderates the community’s scientific activity is the benchmark. What is less clear is why benchmarks should work, in the sense of providing reliable and valid model rankings, despite the extreme incremental reuse that many benchmarks see.
A cornerstone benchmark of the deep learning era was ImageNet. Much has been written about the specifics of ImageNet and how they may have catalyzed progress in deep learning. In recent work, we conducted an intriguing experiment: we created a dataset, called ImageNot [], that is as different from ImageNet as possible while matching only its scale and number of classes.
Whereas ImageNet was carefully curated by humans, ImageNot is built from a noisy web crawl, with images selected solely by text similarity to their surrounding captions, building on our earlier work []. Surprisingly, the ranking of popular computer vision architectures replicates perfectly from ImageNet to ImageNot, and the same is true for the relative improvement each model makes over its predecessors. The results speak to a remarkable external validity of computer vision benchmarks.
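To make the notion of ranking replication concrete, here is a minimal sketch of how rank agreement between two benchmarks can be measured. The architectures and accuracy values are hypothetical placeholders, not results from ImageNet or ImageNot; only the rank-comparison logic is the point.

```python
# Sketch: quantifying how well a model ranking transfers between two benchmarks.
from scipy.stats import kendalltau

# Hypothetical top-1 accuracies of a few architectures on two benchmarks.
benchmark_a = {"alexnet": 0.57, "vgg": 0.72, "resnet": 0.76, "densenet": 0.77}
benchmark_b = {"alexnet": 0.31, "vgg": 0.44, "resnet": 0.52, "densenet": 0.55}

models = sorted(benchmark_a)                 # fix one model order for both score lists
scores_a = [benchmark_a[m] for m in models]
scores_b = [benchmark_b[m] for m in models]

# Kendall's tau equals 1.0 exactly when both benchmarks order the models identically.
tau, _ = kendalltau(scores_a, scores_b)
print(f"rank agreement (Kendall's tau): {tau:.2f}")
```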
This surprising finding inspired theoretical work proving that noisy labels are, contrary to conventional wisdom, not a problem for reliable benchmarking. In fact, given a fixed budget of annotations from unreliable annotators, it is optimal for benchmarking purposes to collect only a single noisy label per data point rather than aggregating several [].
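To illustrate the intuition behind this result (not its proof), here is a minimal Monte Carlo sketch under assumed numbers: two hypothetical models with true accuracies 0.78 and 0.75 are compared against noisy reference labels on a binary task, spending a fixed annotation budget either on one label per point or on a three-way majority vote over a third as many points. Every quantity below is an illustrative assumption, not a figure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p_strong, p_weak = 0.78, 0.75   # assumed true accuracies of two models
label_noise = 0.2               # assumed annotator error rate (binary task)
budget = 3000                   # total number of annotations we can afford
trials = 2000                   # Monte Carlo repetitions

def correct_ranking_rate(n_points, labels_per_point):
    """How often does the truly better model score higher on the noisy benchmark?"""
    wins = 0
    for _ in range(trials):
        # Whether each model's prediction is correct on each benchmark point.
        strong_ok = rng.random(n_points) < p_strong
        weak_ok = rng.random(n_points) < p_weak
        # Majority vote over labels_per_point noisy annotations per point.
        annotator_errors = rng.random((n_points, labels_per_point)) < label_noise
        label_wrong = annotator_errors.sum(axis=1) > labels_per_point / 2
        # In a binary task, a model agrees with the reference label exactly
        # when both are right or both are wrong.
        strong_score = np.mean(strong_ok == ~label_wrong)
        weak_score = np.mean(weak_ok == ~label_wrong)
        wins += strong_score > weak_score
    return wins / trials

# Same annotation budget, two ways to spend it.
print("one label per point:   ", correct_ranking_rate(budget, 1))
print("three labels per point:", correct_ranking_rate(budget // 3, 3))
```

With parameters like these, the single-label scheme tends to identify the better model more often, because evaluating on more points reduces variance faster than majority voting reduces label noise.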
Following the ImageNet era, researchers proposed dynamic benchmarks to address the potential pitfalls of a fixed test set frozen in time. Dynamic benchmarks interleave model fitting and adversarial data collection, so that the benchmark gets harder as models improve. Recently, we formalized this intuitive proposal and proved that, unfortunately, progress in dynamic benchmarks can stall after a few iterations [].
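The protocol can be sketched as a simple loop. The toy task, threshold model, and adversarial collection step below are stand-ins; only the interleaving structure reflects the dynamic-benchmarking idea, not the formal model analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pool(n=5000):
    # Toy binary task: the label depends on a 1-D feature plus noise.
    x = rng.uniform(-1, 1, n)
    y = (x + rng.normal(0, 0.3, n) > 0).astype(int)
    return x, y

def fit_threshold(x, y):
    # "Model fitting": choose the threshold with the lowest error on the current data.
    candidates = np.linspace(-1, 1, 201)
    errors = [np.mean((x > t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

def collect_adversarial(x, y, threshold, k=500):
    # "Adversarial data collection": keep points the current model misclassifies.
    wrong = (x > threshold).astype(int) != y
    idx = np.flatnonzero(wrong)[:k]
    return x[idx], y[idx]

# Round 0: a static benchmark and an initial model.
x_bench, y_bench = sample_pool()
model = fit_threshold(x_bench, y_bench)

for rnd in range(1, 4):
    # Interleave: harvest hard examples against the current model, then refit.
    x_pool, y_pool = sample_pool()
    x_hard, y_hard = collect_adversarial(x_pool, y_pool, model)
    x_bench = np.concatenate([x_bench, x_hard])
    y_bench = np.concatenate([y_bench, y_hard])
    model = fit_threshold(x_bench, y_bench)
    err = np.mean((x_bench > model).astype(int) != y_bench)
    print(f"round {rnd}: benchmark size = {len(x_bench)}, model error = {err:.3f}")
```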
These and other results from the lab are defining contributions to the emerging science of machine learning benchmarks.