We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near)-optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a uniform policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using approximate MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
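For intuition, the following is a minimal sketch of the kind of update the abstract describes: gradient ascent on linear reward weights so that an importance-weighted feature expectation, computed from trajectories sampled under the baseline (uniform) policy, is pushed toward the expert's empirical feature expectation. The function name relent_irl, the inputs expert_feat and sampled_feats, and the step-size settings are illustrative assumptions, and the paper's regularization term is omitted; this is not the authors' implementation.

    # Illustrative sketch only (assumed interface, not the authors' code).
    # expert_feat: empirical feature expectation of the expert demonstrations.
    # sampled_feats: per-trajectory feature counts gathered under a uniform baseline policy.
    # The regularization of the reward weights used in the paper is omitted here.
    import numpy as np

    def relent_irl(expert_feat, sampled_feats, lr=0.1, iters=500):
        """Gradient-style ascent on reward weights theta: match the
        importance-weighted feature expectation of the sampled trajectories
        to the expert's feature expectation."""
        theta = np.zeros_like(expert_feat, dtype=float)
        for _ in range(iters):
            scores = sampled_feats @ theta            # theta^T f_tau for each trajectory
            w = np.exp(scores - scores.max())         # importance weights (numerically stabilized)
            w /= w.sum()
            model_feat = w @ sampled_feats            # weighted feature expectation
            theta += lr * (expert_feat - model_feat)  # (sub)gradient step
        return theta

With feature vectors of dimension d, expert_feat is a length-d array and sampled_feats an (N, d) array of N baseline trajectories; the returned theta defines the recovered reward as a linear function of the features.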
Author(s): Boularias, A., Kober, J., and Peters, J.
Book Title: JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011
Pages: 182–189
Year: 2011
Month: April
Editors: Gordon, G., Dunson, D., and Dudík, M.
Publisher: MIT Press
BibTeX Type: Conference Paper (inproceedings)
Address: Cambridge, MA, USA
Event Name: Fourteenth International Conference on Artificial Intelligence and Statistics
Event Place: Ft. Lauderdale, FL, USA
BibTeX
@inproceedings{BoulariasKP2011,
  title     = {Relative Entropy Inverse Reinforcement Learning},
  booktitle = {JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011},
  abstract  = {We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near)-optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a uniform policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using approximate MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.},
  pages     = {182--189},
  editor    = {Gordon, G. and Dunson, D. and Dudík, M.},
  publisher = {MIT Press},
  address   = {Cambridge, MA, USA},
  month     = apr,
  year      = {2011},
  slug      = {boulariaskp2011},
  author    = {Boularias, A. and Kober, J. and Peters, J.},
  month_numeric = {4}
}