
Reinforcement Learning with Bounded Information Loss

Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest two reinforcement learning methods, i.e., a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing their return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also show a Bayesian bound motivation for this new approach [8].
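
For orientation, the constrained problem behind this line of work can be sketched as follows. This is a minimal illustration in the notation commonly used for relative-entropy policy search; the symbols p, q, r, epsilon, delta, and eta are introduced here only for illustration and do not appear in this record, and the stationarity/feature constraints of the full method are omitted.

% Sketch (assumed notation): maximize expected reward over a new
% state-action distribution p while bounding its relative entropy
% (information loss) with respect to the old/observed distribution q.
\begin{align*}
  \max_{p} \quad & \sum_{s,a} p(s,a)\, r(s,a) \\
  \text{s.t.} \quad & \sum_{s,a} p(s,a) \log \frac{p(s,a)}{q(s,a)} \le \epsilon ,
  \qquad \sum_{s,a} p(s,a) = 1 .
\end{align*}
% The Lagrangian of this problem yields a closed-form, exponential-family
% solution of the form  p(s,a) \propto q(s,a)\,\exp\bigl(\delta(s,a)/\eta\bigr),
% where \eta is the multiplier of the entropy bound and \delta(s,a) collects
% the reward terms; this is the sense in which an exact update step is obtained.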

Author(s): Peters, J. and Mülling, K. and Altun, Y.
Journal: AIP Conference Proceedings
Volume: 1305
Number (issue): 1
Pages: 365-372
Year: 2011
Bibtex Type: Article (article)
DOI: 10.1063/1.3573639

BibTex

@article{PetersMSA2011,
  title = {Reinforcement Learning with Bounded Information Loss},
  journal = {AIP Conference Proceedings},
  abstract = {Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest two reinforcement learning methods, i.e., a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing their return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also show a Bayesian bound motivation for this new approach [8].},
  volume = {1305},
  number = {1},
  pages = {365--372},
  year = {2011},
  slug = {petersmsa2011},
  author = {Peters, J. and M{\"u}lling, K. and Altun, Y.}
}