Social Foundations of Computation The MIT License 2024-04-29

BenchBench


BenchBench is a Python package to evaluate multi-task benchmarks.

BenchBench is a Python package that provides a suite of tools for evaluating multi-task benchmarks, with a focus on task diversity and sensitivity to irrelevant changes.

Research shows that every multi-task benchmark faces a trade-off between task diversity and sensitivity: the more diverse a benchmark, the more sensitive its ranking is to irrelevant changes. Irrelevant changes include adding weak models to the comparison or changing the metric in ways that should not affect the ranking.
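To make "sensitivity to irrelevant changes" concrete, here is a minimal toy sketch in plain Python. It does not use the BenchBench API. The model names, the scores, the helper `rank_models`, and the choice of min-max normalization as the aggregation rule are all assumptions made for illustration. It shows how adding one weak model can reorder the models that were already being compared.

```python
def rank_models(scores):
    """Rank models (best first) by their summed min-max-normalized task scores.

    `scores` maps a model name to its list of per-task scores.
    Hypothetical helper for illustration; assumes every task has at
    least two distinct scores, so the normalization never divides by zero.
    """
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for t in range(n_tasks):
        column = [scores[m][t] for m in models]
        lo, hi = min(column), max(column)
        for m in models:
            # Rescale each task to [0, 1] using the models present.
            totals[m] += (scores[m][t] - lo) / (hi - lo)
    # Summing is order-equivalent to averaging over tasks.
    return sorted(models, key=lambda m: totals[m], reverse=True)


# Made-up scores on two tasks for three models.
base = {"A": [0.90, 0.50], "B": [0.80, 0.62], "C": [0.70, 0.60]}
# The same benchmark with one weak model "W" added.
with_weak = {**base, "W": [0.10, 0.45]}

ranking_before = rank_models(base)
ranking_after = [m for m in rank_models(with_weak) if m != "W"]

print(ranking_before)  # ['B', 'A', 'C']
print(ranking_after)   # ['B', 'C', 'A']
```

In this toy setup, model W is worse than every other model on both tasks, yet its presence changes the normalization range of each task and swaps the order of A and C. An aggregate ranking should arguably be robust to this kind of change.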

Based on BenchBench, we're maintaining a living benchmark of multi-task benchmarks. Visit the project page to see the results or contribute your own benchmark.

Release Date: 29 April 2024
License: The MIT License
Link (URL): https://socialfoundations.github.io/benchbench/
Repository: https://github.com/socialfoundations/benchbench?tab=readme-ov-file