Fine-tuning Large Language Models
By fine-tuning all models on the same task-specific data prior to evaluation, we can effectively level the playing field between different model families. We show that this adjustment leads to fair model comparisons and makes model capabilities predictable at much smaller model scales [].
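To make the protocol concrete, here is a minimal sketch: every model gets the identical fine-tuning recipe on the task's training split before being scored on the test split. The models, dataset, and hyperparameters below are illustrative stand-ins, not the ones from our study.

```python
# Fine-tune every model on the same task-specific data, then evaluate
# under the same protocol, so comparisons reflect task ability rather
# than differences in prompting or pretraining recipes.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_then_evaluate(model_name, train_ds, eval_ds, num_labels):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)

    def encode(batch):
        return tok(batch["text"], truncation=True, max_length=512)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out/{model_name}",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds.map(encode, batched=True),
        eval_dataset=eval_ds.map(encode, batched=True),
        tokenizer=tok,  # enables dynamic padding of batches
    )
    trainer.train()
    return trainer.evaluate()  # identical evaluation for every model

# Illustrative models and dataset; substitute the actual task data.
ds = load_dataset("imdb")
for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    print(name, finetune_then_evaluate(name, ds["train"], ds["test"], 2))
```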
Fine-tuning was also key to our recent work on large language models in law, specifically in empirical legal research. The legal system generates a staggering volume of complex documents. Annotating and classifying legal text are central components of empirical legal research, tasks traditionally delegated to research assistants.
The cost and limited scale of human annotation constrain much empirical legal research. Empirical legal scholars are therefore increasingly turning to commercial language models, currently primarily GPT-4, for annotation.
In a recent collaboration between legal experts and computer scientists from several institutions, we developed a model called Lawma that significantly outperforms GPT-4 on more than 100 important legal annotation tasks []. We also released these tasks as benchmarks.
Lawma demonstrates the surprising power of fine-tuning the open-source model Llama-3 on relatively few expert labels. Fine-tuning is an underutilized resource in empirical legal work, and we suggest researchers make better use of it. Legal research also provides interesting test cases for state-of-the-art language models: legal tasks are challenging, yet well-specified enough that high accuracy is possible in principle.
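For readers curious what this looks like in practice, here is a hedged sketch of parameter-efficient fine-tuning of Llama-3 on expert-labeled annotation tasks cast as text completion. The checkpoint, LoRA settings, data schema, and hyperparameters are illustrative assumptions, not the exact Lawma training setup.

```python
# Sketch: LoRA fine-tuning of Llama-3 on a small set of expert labels,
# with each annotation task rendered as a prompt-completion string.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # gated checkpoint; requires HF access
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(        # train only small adapters
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Expert labels as text (hypothetical schema; load real annotations here).
examples = [{"text": "Opinion: ...\nQuestion: Who won the appeal?"
                     "\nAnswer: petitioner"}]
ds = Dataset.from_list(examples).map(
    lambda b: tok(b["text"], truncation=True, max_length=1024), batched=True)

Trainer(model=model,
        args=TrainingArguments(output_dir="lawma-sketch",
                               num_train_epochs=3,
                               per_device_train_batch_size=4,
                               learning_rate=2e-4),
        train_dataset=ds,
        # standard next-token objective (no masked-LM)
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
        ).train()
```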
We are currently working on realizing some of the benefits of fine-tuning more efficiently, using just a few gradient updates at test time, and we have published promising initial results []. First developed in the context of computer vision, the idea of test-time training is now taking off in the language model space as recent commercial models adopt test-time computation.
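As a rough illustration of the idea, the sketch below adapts a copy of a small language model with a handful of gradient steps on the test input itself before generating, then discards the adapted weights. This is one simple variant of test-time training, not the specific procedure from our results; the model, objective, and step count are stand-in assumptions.

```python
# Test-time training, minimal form: a few self-supervised gradient
# updates on the test input, predict, then throw the adaptation away.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in small model for illustration
tok = AutoTokenizer.from_pretrained(name)
base = AutoModelForCausalLM.from_pretrained(name)

def predict_with_ttt(prompt, steps=3, lr=1e-4, max_new_tokens=20):
    model = copy.deepcopy(base)          # keep the base weights untouched
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ids = tok(prompt, return_tensors="pt").input_ids
    model.train()
    for _ in range(steps):               # a few updates at test time
        loss = model(ids, labels=ids).loss   # next-token loss on the input
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0][ids.shape[1]:])  # only the new tokens

print(predict_with_ttt("The court held that"))
```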