Page - 29 - in Joint Austrian Computer Vision and Robotics Workshop 2020

How does explicit exploration influence Deep Reinforcement Learning?

Jakob J. Hollenstein, Erwan Renaudo, Matteo Saveriano, Justus Piater
University of Innsbruck
{jakob.hollenstein,erwan.renaudo,matteo.saveriano,justus.piater}@uibk.ac.at

Abstract. Most Deep Reinforcement Learning (D-RL) methods perform local search and are therefore prone to getting stuck in non-optimal solutions. To overcome this issue, we exploit simulation models and kinodynamic planners as an exploration mechanism in a model-based reinforcement learning method. We show that, even on a simple toy domain, D-RL methods are not immune to local optima and require additional exploration mechanisms. In contrast, our planning-based exploration exhibits better state-space coverage, which turns into better policies than the ones learned via standard D-RL methods.

1. Introduction

Deep Reinforcement Learning (D-RL) has shown promising results in challenging robotics domains (e.g. [4]), but can be resource demanding and difficult to train. We assume that part of the difficulty of learning good policies is related to insufficient exploration. Other D-RL methods like [1,3,6] partially address the problem by increasing the number of training steps, or by relying on the environment implementation to provide exploring starts that cover a diverse enough state-space region. However, these solutions are impractical and potentially dangerous in robotics applications.

In the robotic context, directed exploration via physically-based simulation appears more promising for finding good solutions more reliably and in less time. Therefore, this work proposes the Planning for Policy Search (PPS) method, which exploits a kinodynamic planner in the exploration phase to collect data that are then used to learn a policy, thereby eliminating the planning time during execution. PPS is tested on the point mass system described in Table 1 and compared with D-RL approaches.

2. Planning for Policy Search

The presented PPS implementation (Figure 1) consists of a Linear Quadratic Regulator (LQR)-Rapidly Exploring Random Tree (RRT) [5] to create a tree of data $D = \{(s, a, r, s'), \ldots\}$ from which Soft Actor-Critic (SAC) [1] learns a policy. In contrast to [5], quadratic programming-based finite-horizon steering is used to extend the tree. In our setup, all the environment interaction data created by RRT are used as training data for the policy, rather than using only successful trajectories as expert demonstrations.

3. Evaluation

PPS is evaluated in the one-dimensional goal reaching task presented in Table 1. The environment contains two distinct goal locations. The agent receives a reward based on the distance to the goal

This research has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 731761, IMAGINE).

Figure 1. Illustration of PPS Method

Dynamics:  $X = \begin{bmatrix} x \\ \dot{x} \end{bmatrix}$, $A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $\dot{x} = Ax + Bu$
Reward:    $\max\left( (1 - \tanh|X - G^*_1|),\; 2(1 - \tanh|X - G^*_2|) \right)$, with $G_1 = \begin{bmatrix} -2.5 \\ 0.0 \end{bmatrix}$, $G_2 = \begin{bmatrix} 6.0 \\ 0.0 \end{bmatrix}$
Limits:    $u \in [-1; 1]$, $x \in [-10; 10]$, $\dot{x} \in [-2.5; 2.5]$
(Illustration labels: position, velocity, M, G1, G2, d1, d2, position wrapping)

Table 1. Description of the 1D double-integrator test environment: a point mass M can be moved in a one-dimensional position-velocity space $X = [x, \dot{x}]$ by applying a continuous-valued force. Reward is received based on the distance to two possible goal locations (G1, G2).
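
To make the Table 1 environment concrete, the following is a minimal Python sketch of the double-integrator point mass, not the authors' implementation. The goal locations, the limits and the reward expression are taken from Table 1. The rest is assumed for illustration: the explicit Euler integration, the step size `DT`, the Euclidean norm for the distance |X − G|, the periodic form of the position wrapping, and the helper names `clip`, `wrap_position`, `step` and `reward`. The dynamics are read as the state-space form over the full state X, i.e. the position changes with the velocity and the velocity changes with the applied force u.

```python
import math

# Values taken from Table 1.
G1 = (-2.5, 0.0)             # goal 1 as (position, velocity)
G2 = (6.0, 0.0)              # goal 2 as (position, velocity)
U_MIN, U_MAX = -1.0, 1.0     # force limits
X_MIN, X_MAX = -10.0, 10.0   # position limits
V_MIN, V_MAX = -2.5, 2.5     # velocity limits

DT = 0.05  # ASSUMPTION: integration step size, not given in the source


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def wrap_position(x):
    """ASSUMPTION: periodic wrapping of the position into [X_MIN, X_MAX)."""
    span = X_MAX - X_MIN
    return (x - X_MIN) % span + X_MIN


def step(state, u, dt=DT):
    """One explicit Euler step of the double integrator.

    With A = [[0, 1], [0, 0]] and B = [0, 1], the dynamics reduce to
    d(position)/dt = velocity and d(velocity)/dt = u.
    """
    x, v = state
    u = clip(u, U_MIN, U_MAX)
    x_next = wrap_position(x + v * dt)
    v_next = clip(v + u * dt, V_MIN, V_MAX)
    return (x_next, v_next)


def reward(state):
    """Reward from Table 1: max(1 - tanh|X - G1|, 2 * (1 - tanh|X - G2|)).

    ASSUMPTION: |.| is the Euclidean norm in position-velocity space.
    """
    d1 = math.hypot(state[0] - G1[0], state[1] - G1[1])
    d2 = math.hypot(state[0] - G2[0], state[1] - G2[1])
    return max(1.0 - math.tanh(d1), 2.0 * (1.0 - math.tanh(d2)))


# Usage: roll out a constant force and accumulate the reward.
state = (0.0, 0.0)
total_reward = 0.0
for _ in range(200):
    state = step(state, 1.0)
    total_reward += reward(state)
```

Under these assumptions the reward peaks at 1 at G1 and at 2 at G2, so the nearer goal G1 acts as a local optimum for an agent that starts between the two goals, which is the situation the abstract describes for local-search D-RL methods.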

Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Computer Science; Technology