Back
Learning Policy Improvements with Path Integrals
With the goal to generate more scalable algo- rithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classi- cal techniques from optimal control and dy- namic programming with modern learning techniques from statistical estimation the- ory. In this vein, this paper suggests the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parametrized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path inte- gral which has no open parameters other than the exploration noise. The resulting algorithm can be conceived of as model- based, semi-model-based, or even model free, depending on how the learning problem is structured. Our new algorithm demon- strates interesting similarities with previous RL research in the framework of proba- bility matching and provides intuition why the slightly heuristically motivated proba- bility matching approach can actually per- form well. Empirical evaluations demon- strate significant performance improvements over gradient-based policy learning and scal- ability to high-dimensional control problems. We believe that Policy Improvement with Path Integrals (PI2) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs.
@inproceedings{theodorouAISTATS2010, title = {Learning Policy Improvements with Path Integrals}, booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS 2010)}, abstract = {With the goal to generate more scalable algo- rithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classi- cal techniques from optimal control and dy- namic programming with modern learning techniques from statistical estimation the- ory. In this vein, this paper suggests the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parametrized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path inte- gral which has no open parameters other than the exploration noise. The resulting algorithm can be conceived of as model- based, semi-model-based, or even model free, depending on how the learning problem is structured. Our new algorithm demon- strates interesting similarities with previous RL research in the framework of proba- bility matching and provides intuition why the slightly heuristically motivated proba- bility matching approach can actually per- form well. Empirical evaluations demon- strate significant performance improvements over gradient-based policy learning and scal- ability to high-dimensional control problems. We believe that Policy Improvement with Path Integrals (PI2) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs.}, year = {2010}, note = {clmc}, slug = {theodorouaistats2010}, author = {Theodorou, E. A. and Buchli, J. and Schaal, S.}, crossref = {p10310} }