Page 30 in Joint Austrian Computer Vision and Robotics Workshop 2020


Figure 2. The reward (heatmap) and reward distributions (plots above and to the right of the heatmap) for the double integrator. The agent starts at x = 0 with ẋ = 0. The reward is based on the distance of the agent to the goal positions 1 and 2.

Alg.      DDPG     PPO      SAC      PPS (RRT)
Non-Ex.   15.5%    20.8%    20.4%    79.3%
Ex.       59.0%    60.9%    61.0%    -

Table 2. Final coverage as percent of visited bins.

...points. The goals (x1 = -2.5 and x2 = 6.0) are chosen such that simply maximizing the reward from the starting position leads to a suboptimal policy, i.e. a local optimum (see Figure 2).

We compare the performance of PPS against the prominent D-RL algorithms Proximal Policy Optimization (PPO) [6], Deep Deterministic Policy Gradient (DDPG) [3], and SAC [1], using the implementation in [2]. The algorithms are run for 10^5 environment steps; the D-RL algorithms use 100-step episodes. To have a broader baseline we included an exploring-starts mechanism where the initial state of the double integrator is sampled uniformly. However, especially in robotic tasks, exploring starts are impractical and potentially dangerous and should be avoided. (Illustrative code sketches of the environment, the coverage metric, and this baseline setup are given after the discussion.)

We first compare the state-space coverage obtained from data collected during the exploration phase of the different D-RL approaches. The coverage is calculated as the percentage of non-empty, uniformly shaped bins. The number of bins is set to sqrt(10^5 / 5) (≈ 141) in each dimension, i.e. we expect 5 data points in each bin on average. See Table 2 for the final coverages.

Second, Figure 3 depicts boxplots of the evaluation returns achieved by the D-RL algorithms after training for 10^5 steps. DDPG achieves higher rewards without exploring starts, while PPO and SAC profit from exploring starts. Our PPS method shows improved performance compared to the non-exploring-starts methods. Moreover, the policies learned with PPS achieve performance comparable to the directly trained SAC policy with exploring starts.

[Figure 3: boxplots over "Evaluation Returns" (roughly 0 to 140) for PPO ex, PPO, PPS, DDPG, DDPG ex, SAC, SAC ex, and the MPC Reward-1 and MPC Reward-2 baselines.]

Figure 3. Boxplot of the return distributions (11 independent runs); each run consists of the mean of 10 evaluation runs. The evaluation runs are performed towards the end of the training process, equally spaced 10 learning episodes apart.

4. Discussion

In this work, we highlighted that standard D-RL algorithms are not immune to getting stuck in suboptimal policies, even in a toy problem with two local optima. The agent controlled by PPS explores a wider part of the state space than D-RL methods that focus on reward accumulation, even with exploring starts. The data gathered by RRT are not biased by reward accumulation and are thus more representative of the environment.
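As a reading aid, the following is a minimal sketch of the double-integrator setup described above, not the authors' implementation. The goal positions (x1 = -2.5, x2 = 6.0) and the start state (x = 0, ẋ = 0) come from the text; the time step, the action bound, and the exact distance-based reward shape (a lower bump at the nearer goal and a higher bump at the farther goal, so that greedy improvement from the start can end at the local optimum) are assumptions.

```python
import numpy as np


class DoubleIntegrator:
    """1-D double integrator: state (x, x_dot), action = acceleration.

    Goal positions and start state follow the paper; dt, the action bound,
    and the reward shape are illustrative assumptions.
    """

    def __init__(self, dt=0.05, goal_1=-2.5, goal_2=6.0):
        self.dt = dt
        self.goal_1 = goal_1  # nearer goal -> local optimum
        self.goal_2 = goal_2  # farther goal -> global optimum
        self.state = np.zeros(2)

    def reset(self):
        # The agent starts at x = 0 with x_dot = 0.
        self.state = np.zeros(2)
        return self.state.copy()

    def reward(self, x):
        # Assumed distance-based reward: a small bump at goal 1 and a larger
        # bump at goal 2, so greedily climbing from x = 0 can get stuck at goal 1.
        return 0.5 * np.exp(-(x - self.goal_1) ** 2) + np.exp(-(x - self.goal_2) ** 2)

    def step(self, action):
        x, x_dot = self.state
        x_dot += self.dt * float(np.clip(action, -1.0, 1.0))  # integrate acceleration
        x += self.dt * x_dot                                   # integrate velocity
        self.state = np.array([x, x_dot])
        return self.state.copy(), self.reward(x)
```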
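The coverage metric described above (percentage of non-empty, uniformly shaped bins, with sqrt(10^5 / 5) bins per dimension so that about 5 samples fall into each bin on average) can be computed along these lines; the state-space bounds used for the binning are not given on this page and are placeholders here.

```python
import numpy as np


def state_space_coverage(states, bounds, n_samples=10**5, points_per_bin=5):
    """Coverage as the percentage of non-empty, uniformly shaped 2-D bins.

    Following the text, the number of bins per dimension is
    sqrt(n_samples / points_per_bin), so uniform exploration would place
    roughly `points_per_bin` samples in each bin. `bounds` is an assumed
    state-space box ((x_min, x_max), (v_min, v_max)).
    """
    bins_per_dim = int(np.sqrt(n_samples / points_per_bin))  # ~141 for 10^5 samples
    counts, _, _ = np.histogram2d(
        states[:, 0], states[:, 1], bins=bins_per_dim, range=bounds
    )
    return 100.0 * np.count_nonzero(counts) / counts.size


# Example: coverage of uniformly sampled states over a placeholder state-space box.
rng = np.random.default_rng(0)
demo_states = rng.uniform([-10.0, -5.0], [10.0, 5.0], size=(10**5, 2))
print(f"coverage: {state_space_coverage(demo_states, ((-10, 10), (-5, 5))):.1f}%")
```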
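The baselines use the Stable Baselines implementation [2]. Below is a minimal sketch of such a training run, assuming the maintained successor library stable-baselines3 (with Gymnasium) rather than the exact version cited, and using a standard continuous-control task as a stand-in because the paper's own environment wrapper is not shown on this page.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in task; max_episode_steps mimics the 100-step episodes from the text.
env = gym.make("Pendulum-v1", max_episode_steps=100)

model = SAC("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)  # 10^5 environment steps, as in the text

# Evaluate the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```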
References

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Int. Conf. Machine Learning (ICML), 2018.
[2] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Stable Baselines. GitHub repository, 2018.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proc. 4th Int. Conf. Learning Representations (ICLR), 2016.
[4] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. Int. J. Robotics Res., 39(1), 2020.
[5] A. Perez, R. Platt, G. Konidaris, L. Kaelbling, and T. Lozano-Perez. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics. In IEEE Int. Conf. Robotics and Automation, 2012.
[6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347, 2017.
Joint Austrian Computer Vision and Robotics Workshop 2020
Title
Joint Austrian Computer Vision and Robotics Workshop 2020
Publisher
Graz University of Technology
Place
Graz
Date
2020
Language
English
License
CC BY 4.0
ISBN
978-3-85125-752-6
Dimensions
21.0 x 29.7 cm
Pages
188
Categories
Computer Science
Technology