Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks | Social Foundations of Computation – Max Planck Institute for Intelligent Systems

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

BenchBench: a Python package that makes it easy for practitioners to evaluate the diversity and stability of multi-task benchmarks.

Practitioners hope that multi-task benchmarks give a diverse picture of a model’s performance. But we show that diversity comes at the cost of decreased stability under irrelevant task transformations.
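The trade-off can be illustrated with a toy example. The sketch below is not BenchBench's API and does not reproduce the paper's diversity or stability measures; the function names, the mean-score aggregation, and all scores are hypothetical and chosen only for illustration. It rescales one task's scores by a positive factor, which leaves the ordering of models within that task unchanged, and checks whether the aggregate ranking survives.

```python
from statistics import mean


def rank_by_mean(scores):
    """Return model names ordered best to worst by mean task score."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)


def rescale_task(scores, task, factor):
    """Multiply one task's scores by a positive factor.

    The ordering of models within that task is unchanged, so the
    transformation carries no new information about the models.
    """
    return {
        model: [s * factor if i == task else s for i, s in enumerate(vals)]
        for model, vals in scores.items()
    }


# Hypothetical scores on two tasks. In `diverse` the tasks rank the
# models in opposite orders; in `agreeing` they rank them identically.
diverse = {"A": [0.9, 0.3], "B": [0.6, 0.5], "C": [0.2, 0.8]}
agreeing = {"A": [0.9, 0.8], "B": [0.6, 0.5], "C": [0.2, 0.3]}

diverse_before = rank_by_mean(diverse)                       # ['A', 'B', 'C']
diverse_after = rank_by_mean(rescale_task(diverse, 1, 3))    # ['C', 'B', 'A']
agreeing_before = rank_by_mean(agreeing)                     # ['A', 'B', 'C']
agreeing_after = rank_by_mean(rescale_task(agreeing, 1, 3))  # ['A', 'B', 'C']

print("diverse: ", diverse_before, "->", diverse_after)
print("agreeing:", agreeing_before, "->", agreeing_after)
```

In this toy setting, the benchmark whose tasks disagree reverses its aggregate ranking under the rescaling, while the benchmark whose tasks agree keeps it.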

Members

Social Foundations of Computation

Moritz Hardt

Social Foundations of Computation

Director

Publications

Zhang, G., Hardt, M. Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks. Conference paper. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, July 2024 (published).