Counterfactual Token Generation in Large Language Models

ORGANIZERS

Social Foundations of Computation

Moritz Hardt

Director

Imagine the following story, generated by a large language model: “Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth: she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself.” Now, let’s conduct a thought experiment: how would the story have unfolded if the model had chosen “Captain Maeve” as the protagonist instead?

In this talk, I will begin by illustrating why the stateless nature of state-of-the-art large language models (that is, their lack of internal memory or state) prevents us from answering such counterfactual questions. To address this limitation, I will introduce a causal model of token generation based on the Gumbel-Max structural causal model, along with a method that equips an arbitrary large language model with the ability to perform counterfactual token generation. I will then share experimental results from implementing this approach on Llama 3 8B-Instruct and Ministral-8B-Instruct, highlighting both qualitative and quantitative insights about counterfactual text. Finally, I will conclude with a demonstrative application of counterfactual token generation to bias detection, which reveals interesting insights about the (social) model of the world constructed by large language models.
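To make the idea concrete, here is a minimal Python sketch of Gumbel-Max sampling with reused noise. It is an illustration under stated assumptions, not the implementation presented in the talk: `toy_logits` is a hypothetical stand-in for a language model, and the vocabulary, function names, and sequence length are invented for the example. The point it shows is that, once the Gumbel noise used at each position is stored, the same noise can be replayed after an intervention on the prompt (here, swapping the protagonist's name) to obtain a counterfactual continuation.

```python
import numpy as np

# Hypothetical toy vocabulary; a real LLM would have tens of thousands of tokens.
VOCAB = ["Lyra", "Maeve", "sailed", "fought", "home", "away", "."]


def toy_logits(prefix):
    """Hypothetical stand-in for an LLM: deterministic logits given a prefix."""
    # Derive a seed from the prefix so the same prefix always yields the same logits.
    seed = sum((i + 1) * VOCAB.index(tok) for i, tok in enumerate(prefix))
    return np.random.default_rng(seed).normal(size=len(VOCAB))


def generate(prefix, noise):
    """Gumbel-Max sampling: the token at step t is argmax(logits_t + noise_t)."""
    tokens = list(prefix)
    for u in noise:
        tokens.append(VOCAB[int(np.argmax(toy_logits(tokens) + u))])
    return tokens


# Draw and store one Gumbel noise vector per generated position.
rng = np.random.default_rng(0)
noise = rng.gumbel(size=(5, len(VOCAB)))

# Factual generation, then a counterfactual one: change the name, keep the noise.
factual = generate(["Lyra"], noise)
counterfactual = generate(["Maeve"], noise)

print(" ".join(factual))
print(" ".join(counterfactual))
```

Because the noise is fixed, both sequences are fully determined by their prompts: any difference between them comes from the intervention alone, not from fresh randomness. A standard stateless sampler discards this noise after each step, which is why the counterfactual question cannot be answered after the fact.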

Speaker Biography

Stratis Tsirtsis (MPI for Software Systems, Saarbrücken)

PhD Candidate, Computer Science Department

Stratis Tsirtsis is currently a final-year PhD student in computer science at the MPI for Software Systems, advised by Manuel Gomez-Rodriguez. He works on building AI systems to understand, inform, and complement human decisions and judgments in uncertain and high-stakes environments. During his PhD, he has focused primarily on developing machine learning methods for (i) informing decision making in the presence of strategic human behavior and (ii) enhancing the counterfactual analysis of sequential decision-making tasks.