Evaluating LLMs as risk scores

Survey evaluation — Accuracy and calibration of LLMs on human prediction tasks

Evaluating 17 popular language models using folktexts on human outcome prediction tasks of varying uncertainty offers new insights into the suitability of LLMs for risk scoring.

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks: conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate an LLM's ability to quantify ground-truth outcome uncertainty. In this project, we focus on the use of LLMs as risk scores for unrealizable prediction tasks, where the available features do not fully determine the outcome. We use folktexts, a toolbox to derive human outcome prediction tasks from survey data, and evaluate 17 recent LLMs across five benchmark tasks.

We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes the answer distribution regardless of the true underlying data uncertainty. This reveals a general inability of instruction-tuned LLMs to express data uncertainty using multiple-choice answers. A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models. These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and they highlight a blind spot in the current evaluation ecosystem that folktexts covers.
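To make the distinction between predictive signal and calibration concrete, here is a minimal sketch in plain Python. It is not the folktexts API, and the calibration metric used in the paper may differ. The function names `risk_score_from_logits` and `expected_calibration_error`, the two-option yes/no setup, and the toy data are all assumptions chosen for illustration. The sketch turns the logits of two answer tokens into a risk score, then measures calibration with a simple equal-width binned expected calibration error (ECE).

```python
import math


def risk_score_from_logits(logit_yes: float, logit_no: float) -> float:
    """Hypothetical helper: normalize two answer-token logits into P(yes).

    This is a softmax restricted to the two answer options, so the
    result is a risk score in [0, 1].
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def expected_calibration_error(scores, labels, n_bins=10):
    """Equal-width binned ECE for binary outcomes.

    Each bin contributes |mean score - observed positive rate|,
    weighted by the fraction of samples that fall into the bin.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(scores)) * abs(mean_score - pos_rate)
    return ece


# Toy unrealizable task (made-up data): ten individuals with identical
# features, seven of whom have a positive outcome.
labels = [1] * 7 + [0] * 3

calibrated = [0.7] * 10      # score matches the true outcome rate
overconfident = [0.99] * 10  # polarized score, uncertainty underestimated
underconfident = [0.5] * 10  # hedged score, uncertainty overestimated

for name, scores in [("calibrated", calibrated),
                     ("overconfident", overconfident),
                     ("underconfident", underconfident)]:
    print(f"{name}: ECE = {expected_calibration_error(scores, labels):.3f}")
```

In this toy setting all three score vectors give the same thresholded predictions (every score is at or above 0.5), so accuracy alone cannot tell them apart. Only the calibrated scores have an ECE near zero. The polarized scores are off by roughly 0.29 and the hedged scores by roughly 0.2, which mirrors the over- and under-confidence patterns described above.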

Members

André Cruz, Doctoral Researcher (Social Foundations of Computation)

Moritz Hardt, Director (Social Foundations of Computation)

Celestine Mendler-Dünner, Hector Endowed Fellow of the ELLIS Institute (Algorithms and Society)

Publications

Cruz, A. F., Hardt, M., Mendler-Dünner, C. Evaluating Language Models as Risk Scores. Conference paper. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published). Departments: Social Foundations of Computation; Algorithms and Society.