Social Foundations of Computation

Training on the Test Task Confounds Evaluation and Emergence

Top panel: Model accuracy on the MMLU benchmark as a function of pretraining compute, with newer models shown in orange and older models in blue. Newer models appear to use compute more effectively, and high accuracy on MMLU appears to be emergent. Bottom panel: After adjusting for training on the test task, newer and older models follow the same scaling law. Moreover, accuracy begins to improve at a much smaller model scale.


Dominguez-Olmedo, R., Dorner, F. E., Hardt, M. Training on the Test Task Confounds Evaluation and Emergence. The Thirteenth International Conference on Learning Representations (ICLR 2025 Oral), January 2025 (Accepted).