Social Foundations of Computation
Members
Publications
BenchBench is a Python package that makes it easy for practitioners to evaluate the diversity and stability of multi-task benchmarks.
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Practitioners hope that multi-task benchmarks give a diverse picture of a model’s performance. But we show that diversity comes at the cost of decreased stability under irrelevant task transformations.
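One way to make "diversity" concrete is to measure how much the individual tasks disagree about the ranking of models. The following is a minimal sketch, assuming Kendall's coefficient of concordance W as the agreement measure (so that 1 - W can be read as a diversity score) and assuming no tied scores within a task. It does not use the BenchBench API; the function name `kendalls_w` and the toy score matrices are hypothetical and chosen only for illustration.

```python
import numpy as np


def kendalls_w(scores):
    """Kendall's W for a (n_models, n_tasks) score matrix.

    Assumes higher scores are better and that no task has tied scores.
    W = 1 means all tasks rank the models identically;
    W = 0 means the task rankings cancel out completely.
    """
    scores = np.asarray(scores, dtype=float)
    n_models, n_tasks = scores.shape
    # Rank models within each task (1 = lowest score); double argsort
    # yields ranks when there are no ties.
    ranks = scores.argsort(axis=0).argsort(axis=0) + 1
    rank_sums = ranks.sum(axis=1)
    # Sum of squared deviations of the per-model rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (n_tasks ** 2 * (n_models ** 3 - n_models))


# Toy benchmark where every task orders the three models the same way.
agree = np.array([[0.9, 0.8, 0.7],
                  [0.6, 0.5, 0.4],
                  [0.3, 0.2, 0.1]])

# Toy benchmark where the three tasks order the models cyclically.
disagree = np.array([[3.0, 1.0, 2.0],
                     [2.0, 3.0, 1.0],
                     [1.0, 2.0, 3.0]])

w_agree = kendalls_w(agree)
w_disagree = kendalls_w(disagree)
diversity_agree = 1.0 - w_agree
diversity_disagree = 1.0 - w_disagree
```

In the first toy benchmark the tasks are fully concordant, so the diversity score under this assumption is minimal; in the second the task rankings cancel out, so it is maximal.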
Conference Paper
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Zhang, G., Hardt, M.
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, July 2024 (Published)