Evaluating Language Models as Risk Scores

Social Foundations of Computation · Algorithms and Society · Conference Paper · 2024

André Cruz, Doctoral Researcher, Social Foundations of Computation

Moritz Hardt, Director, Social Foundations of Computation

Celestine Mendler-Dünner, Hector Endowed Fellow of the ELLIS Institute, Algorithms and Society

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
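As an illustration of what evaluating a language model as a risk score can involve, the sketch below turns two answer-token logits into a probability and measures miscalibration with a binned expected calibration error. It is a minimal sketch, not the folktexts API: the function names `risk_score` and `expected_calibration_error`, the two-option "Yes"/"No" normalization, and the choice of ten equal-width bins are all assumptions made for this example.

```python
import math


def risk_score(logit_yes: float, logit_no: float) -> float:
    """Hypothetical helper: normalize two answer-token logits into P(yes).

    A softmax restricted to two options equals a sigmoid of the logit difference.
    """
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))


def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE: weighted mean of |mean score - positive rate| over bins.

    Assumes scores lie in [0, 1] and labels are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Equal-width bins; a score of exactly 1.0 falls in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_score - positive_rate)
    return ece


# Over-confident scores: the model is certain, but only half the outcomes occur.
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])

# Hedged scores that match the observed outcome rate are perfectly calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

In this toy example the over-confident scores are penalized, while the hedged scores are not. This mirrors the distinction the abstract draws between predictive signal and calibration: a score can rank individuals well and still misstate how likely the outcome is.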

Author(s):	Cruz, André F and Hardt, Moritz and Mendler-Dünner, Celestine
Book Title:	Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
Year:	2024
Month:	December

Project(s):	Folktexts Evaluating LLMs as risk scores Evaluating Language Models as Risk Scores
Bibtex Type:	Conference Paper (conference)

Event Name:	The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)
State:	Published
URL:	https://openreview.net/attachment?id=qrZxL3Bto9&name=pdf


BibTex

@conference{cruz2024evaluating,
  title = {Evaluating Language Models as Risk Scores},
  booktitle = {Advances in Neural Information Processing Systems 37 (NeurIPS 2024)},
  abstract = {Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.},
  month = dec,
  year = {2024},
  slug = {cruz2024evaluating},
  author = {Cruz, Andr{\'e} F and Hardt, Moritz and Mendler-D{\"u}nner, Celestine},
  url = {https://openreview.net/attachment?id=qrZxL3Bto9&name=pdf},
  month_numeric = {12}
}
