Benchmarking LLMs on prediction tasks derived from survey data
Folktexts offers a Python software package together with ready-to-use natural-language question-answering datasets to evaluate the accuracy, calibration, and fairness of LLMs on human outcome prediction tasks.
>> pip install folktexts
Folktexts provides a suite of Q&A datasets for evaluating the uncertainty, calibration, accuracy, and fairness of LLMs on individual outcome prediction tasks. It provides a flexible framework that derives prediction tasks from survey data, translates them into natural-language prompts, extracts LLM-generated risk scores, and computes statistical properties of these risk scores by comparing them to the ground-truth outcomes.
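The pipeline described above can be illustrated with a small, library-independent sketch. It does not use the folktexts API: every function name, the prompt template, and the binning scheme below are assumptions chosen for illustration only. The sketch renders a survey row as a yes/no prompt, turns the probabilities of the two answer tokens into a risk score, and compares scores to ground-truth labels via accuracy and a simple expected calibration error.

```python
from typing import Dict, List, Sequence


def row_to_prompt(row: Dict[str, object], question: str) -> str:
    """Render one survey row as a yes/no prompt (hypothetical template)."""
    facts = "\n".join(f"- {name}: {value}" for name, value in row.items())
    return (
        "The following describes a survey respondent:\n"
        f"{facts}\n\nQuestion: {question}\nAnswer (Yes/No):"
    )


def risk_score(p_yes: float, p_no: float) -> float:
    """Normalize the two answer-token probabilities into a score in [0, 1]."""
    total = p_yes + p_no
    # Fall back to 0.5 if the model put no mass on either answer token.
    return 0.5 if total == 0 else p_yes / total


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of rows where the thresholded score matches the label."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def expected_calibration_error(scores: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between average score and positive rate per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(labels) * abs(mean_score - positive_rate)
    return ece


if __name__ == "__main__":
    # Toy values, invented purely to exercise the functions.
    prompt = row_to_prompt({"age": 30, "occupation": "teacher"},
                           "Is this person's income above $50,000?")
    print(prompt)
    scores = [0.9, 0.8, 0.2, 0.1]
    labels = [1, 1, 0, 0]
    print("accuracy:", accuracy(scores, labels))
    print("ECE:", expected_calibration_error(scores, labels))
```

In practice the answer-token probabilities would come from the LLM's next-token distribution; here they are passed in directly so the sketch stays self-contained. Equal-width binning is only one of several ways to estimate calibration error.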