Fenosoa Randrianjatovo

Social Foundations of Computation Intern Alumni

Reinforcement learning (RL) algorithms have so far been developed mainly for stationary environments and have difficulty adapting when the system dynamics or the reward function changes. Posterior sampling and Thompson sampling were identified early on as efficient approaches in RL, in part due to their inherent randomisation. While some algorithms, such as UCRL, have recently been adapted to non-stationary environments, no randomised counterpart has been proposed yet. In our project, we aim to propose a randomised, and more practical, algorithm that builds on posterior sampling and is capable of achieving sublinear regret. We will start by studying a (possibly context-dependent) bandit problem and then extend our findings to more complex RL models.
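To illustrate the general idea of posterior sampling in a changing environment (not the project's algorithm, just a common baseline sketch), the snippet below implements discounted Thompson sampling for a non-stationary Bernoulli bandit: Beta pseudo-counts are geometrically discounted so stale observations fade out and the posterior can track drifting arm means. The function name `discounted_thompson`, the discount factor `gamma`, and the toy `reward_fn` are all illustrative assumptions.

```python
import numpy as np


def discounted_thompson(n_arms, horizon, reward_fn, gamma=0.95, seed=0):
    """Discounted Thompson sampling for a non-stationary Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n_arms)  # discounted success counts
    beta = np.zeros(n_arms)   # discounted failure counts
    rewards = []
    for t in range(horizon):
        # Sample one plausible mean per arm from its Beta posterior;
        # the added Beta(1, 1) prior keeps rarely played arms exploratory.
        theta = rng.beta(alpha + 1.0, beta + 1.0)
        arm = int(np.argmax(theta))
        r = reward_fn(t, arm)  # Bernoulli reward in {0, 1}
        rewards.append(r)
        # Forget a little of the past, then credit the played arm.
        alpha *= gamma
        beta *= gamma
        alpha[arm] += r
        beta[arm] += 1 - r
    return np.array(rewards)


if __name__ == "__main__":
    # Toy abruptly-changing bandit: the best arm switches halfway through.
    rng_env = np.random.default_rng(1)

    def reward_fn(t, arm):
        means = (0.8, 0.3) if t < 2500 else (0.3, 0.8)
        return float(rng_env.random() < means[arm])

    total = discounted_thompson(n_arms=2, horizon=5000, reward_fn=reward_fn).sum()
    print(f"cumulative reward: {total:.0f} / 5000")
```

Because exploration comes from sampling the posterior rather than from explicit confidence bonuses (as in UCRL-style methods), the algorithm stays simple to run while still adapting after the change point.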