Evaluating Large Language Models
Evaluating large language models (LLMs) differs from benchmarking image classification models in the ImageNet era, for a few reasons. LLMs don’t solve a single predefined task; rather, they solve an open-ended array of potential tasks that depends on how we prompt the model. In addition, different models are trained on different datasets, often unknown to the evaluator.
Multi-task benchmarks aim to provide a holistic assessment of a model’s performance across many tasks. Working from an analogy between multi-task benchmarks and voting systems, we show that multi-task benchmarks are inherently subject to a trade-off between diversity and stability: the more the individual tasks disagree in how they rank models, the more sensitive the benchmark’s aggregate ranking is to irrelevant task transformations [].
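To make the sensitivity concrete, here is a toy illustration in Python (a construction for this page, not the analysis in the paper): two models, two tasks, and a mean-score aggregation rule. A monotone rescaling of one task’s scores, which leaves every per-task ranking unchanged, flips the aggregate ranking.

```python
import numpy as np

# Rows: models A and B; columns: tasks 1 and 2 (higher score = better).
scores = np.array([
    [0.90, 0.40],   # model A
    [0.60, 0.55],   # model B
])

def aggregate_order(score_matrix):
    """Rank models by mean score across tasks, a common aggregation rule."""
    means = score_matrix.mean(axis=1)
    return np.argsort(-means)  # model indices, best first

print("before rescaling:", aggregate_order(scores))    # [0 1] -> A ranked first

# Monotone rescaling of task 2 only, e.g. the task switches to a 0-10 scale.
# Every per-task ranking of the models is unchanged by this transformation.
rescaled = scores.copy()
rescaled[:, 1] = 10 * rescaled[:, 1]

print("after rescaling: ", aggregate_order(rescaled))  # [1 0] -> B ranked first
```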
Increasingly, researchers hope to use the survey responses of large language models either to understand the model or to understand the population it was trained on. We show that the aggregate responses of language models to surveys lack the statistical patterns found in human populations, limiting the use and interpretation of such survey responses []. This cautionary tale extends to personality tests [].
Current question-answering benchmarks fail to assess a model’s ability to quantify data uncertainty, which makes them unsuitable for evaluating language models as risk assessment tools. To address this shortcoming, we developed the "folktexts" package, which enables the systematic generation of risk scores for unrealizable prediction tasks []. Using the package, we reveal significant calibration issues in instruction-tuned models.
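As an illustration of the kind of calibration check such risk scores enable, the following sketch computes the expected calibration error from model-assigned risk scores and true outcomes. It uses only NumPy and placeholder data; it is not the folktexts API.

```python
import numpy as np

def expected_calibration_error(risk_scores, outcomes, n_bins=10):
    """Bin predictions by score and compare mean score to empirical frequency."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (risk_scores >= lo) & (risk_scores < hi)
        if mask.sum() == 0:
            continue
        gap = abs(risk_scores[mask].mean() - outcomes[mask].mean())
        ece += (mask.sum() / len(risk_scores)) * gap
    return ece

# Placeholder data: a well-calibrated model would give an ECE close to zero.
rng = np.random.default_rng(0)
risk_scores = rng.uniform(size=1000)          # stand-in for model-assigned risks
outcomes = rng.binomial(1, risk_scores)       # outcomes drawn at the stated risk
print(f"ECE: {expected_calibration_error(risk_scores, outcomes):.3f}")
```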
Expert annotations are increasingly a bottleneck, so scalable evaluation methods that avoid costly annotation have become an important research goal. Many hope to use “LLMs as judges” to provide cheap labels. Unfortunately, using models as judges introduces biases that can distort model comparisons. An emerging family of debiasing tools promises relief by using a small number of expert labels to debias a large number of model judgments. We show, however, that these debiasing tools are never better than simply collecting twice as many expert labels [].
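For intuition, the sketch below shows a debiasing estimator of the general kind this family includes, in the spirit of prediction-powered inference: judge verdicts on a large pool are corrected by the gap between expert and judge verdicts on a small labeled subset. The data and function name are illustrative, not code from the paper.

```python
import numpy as np

def debiased_accuracy(judge_all, judge_gold, expert_gold):
    """
    judge_all:   judge's 0/1 verdicts on the full pool (size N)
    judge_gold:  judge's verdicts on the expert-labeled subset (size n)
    expert_gold: expert 0/1 verdicts on that same subset (size n)
    """
    # Cheap-label estimate plus a bias correction from the n expert labels.
    return judge_all.mean() + (expert_gold.mean() - judge_gold.mean())

rng = np.random.default_rng(0)
N, n = 10_000, 200
truth = rng.binomial(1, 0.7, size=N)              # unknown ground truth
judge = np.where(rng.random(N) < 0.9,             # judge agrees 90% of the time,
                 truth, 1 - truth)                # so its mean is systematically off
idx = rng.choice(N, size=n, replace=False)        # the few expert-labeled examples

est = debiased_accuracy(judge, judge[idx], truth[idx])
print(f"naive: {judge.mean():.3f}  debiased: {est:.3f}  truth: {truth.mean():.3f}")
```

The bias correction still hinges on the n expert labels, which gives a sense of why no estimator of this kind can reduce the expert-labeling requirement beyond the factor-of-two limit.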