Limits to Scalable Evaluation at the Frontier: LLM as Judge Won't Beat Twice the Data

Social Foundations of Computation Conference Paper 2025

Florian Dorner, Doctoral Researcher, Social Foundations of Computation

Vivian Nastl, Doctoral Researcher, Social Foundations of Computation

Moritz Hardt, Director, Social Foundations of Computation

High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
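
To make the debiasing idea concrete, here is a minimal Python sketch of one simple estimator of this kind: average the judge's verdicts over a large pool without ground truth, then correct that average by the judge's mean error on a small set that does have ground-truth labels. The function name `debiased_accuracy`, its arguments, and the toy data are illustrative assumptions, not the authors' code, and the abstract does not say which estimators the paper analyzes.

```python
from statistics import fmean


def debiased_accuracy(judge_unlabeled, judge_labeled, truth_labeled):
    """Estimate a model's accuracy from judge verdicts plus a few true labels.

    judge_unlabeled: judge verdicts (1 = correct, 0 = incorrect) on outputs
                     that have no ground-truth label.
    judge_labeled:   judge verdicts on the outputs that do have a label.
    truth_labeled:   ground-truth verdicts for those same outputs.
    All names are hypothetical and chosen for this illustration only.
    """
    if not judge_unlabeled or not judge_labeled:
        raise ValueError("both verdict lists must be non-empty")
    if len(judge_labeled) != len(truth_labeled):
        raise ValueError("labeled verdicts and ground truth must align")

    # Cheap but possibly biased estimate: what the judge says on the big pool.
    judge_mean = fmean(judge_unlabeled)

    # Bias correction: average gap between ground truth and judge verdicts,
    # measured on the small labeled set.
    gaps = [t - j for t, j in zip(truth_labeled, judge_labeled)]
    correction = fmean(gaps)

    return judge_mean + correction


if __name__ == "__main__":
    # Toy data (made up): the judge is too generous on one labeled example.
    judge_pool = [1, 1, 1, 0, 1, 1, 0, 1]
    judge_small = [1, 1, 0, 1]
    truth_small = [1, 0, 0, 1]
    print(debiased_accuracy(judge_pool, judge_small, truth_small))
```

On this toy data the raw judge average is 0.75, and the correction of -0.25 from the labeled set brings the estimate to 0.5. In this sketch the correction term is estimated from only the few labeled examples, so its noise carries over into the final estimate regardless of how large the unlabeled pool is.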

Author(s):	Dorner, Florian E. and Nastl, Vivian Y. and Hardt, Moritz
Book Title:	The Thirteenth International Conference on Learning Representations (ICLR 2025 Oral)
Year:	2025
Month:	January

Bibtex Type:	Conference Paper (conference)

State:	Accepted
URL:	https://openreview.net/pdf?id=NO6Tv6QcDs

Links:	arXiv

BibTex

@conference{dorner2024limits,
  title = {Limits to Scalable Evaluation at the Frontier: LLM as Judge Won't Beat Twice the Data},
  booktitle = {The Thirteenth International Conference on Learning Representations (ICLR 2025 Oral)},
  abstract = {High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.},
  month = jan,
  year = {2025},
  slug = {dorner2024limits},
  author = {Dorner, Florian E. and Nastl, Vivian Y. and Hardt, Moritz},
  url = {https://openreview.net/pdf?id=NO6Tv6QcDs},
  month_numeric = {1}
}
