The Unreasonable Effectiveness of Distributional Reinforcement Learning

Distributional Reinforcement Learning (RL) learns the entire conditional distribution of the return (the reward-to-go) given the current state and action, but then only ever uses its mean (e.g., C51, IQN). While this appears inefficient on its face, it often empirically outperforms analogous approaches (e.g., DQN) that directly learn just the conditional mean, i.e., the Q-function. A principled understanding of why and when this happens has been elusive.
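To make the setup concrete, here is a minimal sketch, assuming numpy and a C51-style categorical parameterization (the atom grid and toy numbers are illustrative, not from the talk): the model represents a full return distribution per action, yet action selection only ever uses its mean.

    import numpy as np

    # Illustrative sketch: a C51-style categorical return distribution.
    # A network would output, per action, probabilities over a fixed grid of
    # return "atoms"; control still acts greedily on the mean of that distribution.
    n_atoms = 51
    v_min, v_max = 0.0, 10.0
    atoms = np.linspace(v_min, v_max, n_atoms)  # fixed support of returns-to-go

    def greedy_action(atom_probs):
        """atom_probs: (n_actions, n_atoms), one return distribution per action.
        The full distribution is modeled, but only its mean (the Q-value) is used."""
        q_values = atom_probs @ atoms           # mean of each distribution
        return int(np.argmax(q_values))

    # Toy example: two actions with (essentially) the same mean but very different spread.
    probs = np.stack([
        np.full(n_atoms, 1.0 / n_atoms),        # broad, high-variance return distribution
        np.eye(n_atoms)[n_atoms // 2],          # near-deterministic return at the midpoint atom
    ])
    print("Q-values:", probs @ atoms)           # both approximately 5.0
    print("greedy action:", greedy_action(probs))

The spread that the distributional head captures is invisible to the greedy choice, which sees only the means; the question the talk addresses is why modeling it nonetheless helps.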
In this talk, I resolve this question by showing that distributional RL enjoys first- and second-order regret bounds in both online and offline RL in general MDPs with function approximation. In many cases these are the first bounds of their kind for any RL algorithm, distributional or otherwise. First-order bounds scale with the optimal average return and, e.g., establish fast convergence in goal-based tasks when the optimal policy reliably reaches the goal. Second-order bounds scale with the variance of returns and, e.g., establish fast convergence under low stochasticity, as is often encountered in robotics. I explain how this phenomenon arises from automatic sensitivity to heteroskedasticity, in contrast to previous heuristic explanations. Going beyond risk-neutral RL, where the benefits of being distributional may be surprising, I will conclude with new results for robust policy evaluation in the presence of unobserved confounding and for minimax-optimal risk-sensitive RL, where being distributional is a necessity.
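As a rough guide to what these bound shapes mean, schematically and under assumptions (returns normalized to [0,1]; C is a placeholder for horizon and function-class complexity factors, which the abstract does not specify), first- and second-order online regret bounds take the form

    % Schematic only: K = number of episodes, V^*_1 = optimal expected return,
    % Z^{\pi_k} = return of the policy played in episode k, C = assumed
    % horizon / function-class factors.
    \mathrm{Regret}(K) \;\le\; \widetilde{O}\!\left(\sqrt{V^{*}_{1}\, K \cdot C}\right)
        \qquad \text{(first-order / small-loss)}
    \mathrm{Regret}(K) \;\le\; \widetilde{O}\!\left(\sqrt{\textstyle\sum_{k=1}^{K} \mathrm{Var}\!\left(Z^{\pi_k}\right)\cdot C}\right)
        \qquad \text{(second-order)}

Under this normalization, \mathrm{Var}(Z^{\pi_k}) \le \mathbb{E}[Z^{\pi_k}] \le V^{*}_{1}, so a second-order bound is never worse than the corresponding first-order one and becomes much smaller when returns are nearly deterministic, matching the low-stochasticity regime described above.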
Speaker Biography
Nathan Kallus (Cornell Tech, Cornell University)
Associate Professor
Nathan Kallus is an Associate Professor at the Cornell Tech campus of Cornell University in NYC and a Research Director at Netflix. Nathan's research interests include the statistics of optimization under uncertainty, causal inference (especially when combined with machine learning), sequential and dynamic decision making, and algorithmic fairness.