Joint Austrian Computer Vision and Robotics Workshop 2020
Page 30
Figure 2. The reward (heatmap) and reward distributions (plots above and to the right of the heatmap) for the double integrator. The agent starts at x = 0 with ẋ = 0. The reward is based on the distance of the agent to the goal positions 1 and 2.

Table 2. Final coverage as percent of visited bins.

  Alg.      DDPG    PPO     SAC     PPS (RRT)
  Non-Ex.   15.5%   20.8%   20.4%   79.3%
  Ex.       59.0%   60.9%   61.0%   -

points. The goals (x1 = −2.5 and x2 = 6.0) are chosen such that simply maximizing the reward from the starting position leads to a suboptimal policy, i.e. a local optimum (see Figure 2; a minimal environment sketch is given below).

We compare the performance of PPS against the prominent D-RL algorithms Proximal Policy Optimization (PPO) [6], Deep Deterministic Policy Gradient (DDPG) [3], and Soft Actor-Critic (SAC) [1], using the implementations in [2]. The algorithms are run for 10^5 environment steps; the D-RL algorithms use 100-step episodes (see the training sketch below). To have a broader baseline, we included an exploring-starts mechanism in which the initial state of the double integrator is sampled uniformly. However, especially in robotic tasks, exploring starts are impractical and potentially dangerous and should be avoided.

We first compare the state-space coverage obtained from the data collected during the exploration phase of the different D-RL approaches. The coverage is calculated as the percentage of non-empty, uniformly-shaped bins (see the coverage sketch below). The number of bins is set to √(10^5/5) in each dimension, i.e. we expect 5 data points per bin on average. See Table 2 for the final coverages.

Second, Figure 3 depicts boxplots of the evaluation returns achieved by the D-RL algorithms after training for 10^5 steps. DDPG achieves higher rewards without exploring starts, while PPO and SAC profit from exploring starts. Our PPS method shows improved performance compared to the methods without exploring starts. Moreover, the policies learned with PPS achieve performance comparable to the directly trained SAC policy with exploring starts.

Figure 3. Boxplot of the return distributions (11 independent runs); each run consists of the mean of 10 evaluation runs. The evaluation runs are performed towards the end of the training process, equally spaced 10 learning episodes apart. (The boxplots compare MPC Reward-1, MPC Reward-2, PPS, and DDPG, PPO, and SAC with and without exploring starts; the horizontal axis shows the evaluation returns.)

4. Discussion

In this work, we highlighted that standard D-RL algorithms are not immune to getting stuck in suboptimal policies, even in a toy problem with two local optima. The agent controlled by PPS explores a wider part of the state space than the D-RL methods that focus on reward accumulation, even with exploring starts. The data gathered by RRT are not biased by reward accumulation and are thus more representative of the environment.
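The page does not give the double integrator's discretization, control limits, or exact reward shape, so the following is only a minimal sketch of such an environment, written against the gymnasium API so it can be plugged into off-the-shelf D-RL implementations. The goal positions (−2.5 and 6.0), the default start state (x = 0, ẋ = 0), the 100-step episodes, and the uniformly sampled exploring starts follow the text above; the time step, control and state bounds, and the two-bump reward (with the farther goal worth more, so that the nearer goal is a local optimum) are illustrative assumptions, and the class name DoubleIntegratorEnv is hypothetical.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class DoubleIntegratorEnv(gym.Env):
    """Double integrator (x_ddot = u) with two distance-based goal rewards.

    Goal positions, start state, and episode length follow the paper; the
    time step, bounds, and reward shape are illustrative assumptions.
    """

    DT = 0.1          # assumed integration step
    MAX_STEPS = 100   # 100-step episodes, as in the paper
    GOALS = (-2.5, 6.0)

    def __init__(self, exploring_starts=False):
        super().__init__()
        self.exploring_starts = exploring_starts
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.state = np.zeros(2, dtype=np.float32)
        self.steps = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        if self.exploring_starts:
            # Exploring starts: initial state sampled uniformly (assumed bounds).
            self.state = self.np_random.uniform([-10.0, -5.0], [10.0, 5.0]).astype(np.float32)
        else:
            self.state = np.zeros(2, dtype=np.float32)  # x = 0, x_dot = 0
        self.steps = 0
        return self.state.copy(), {}

    def step(self, action):
        x, x_dot = self.state
        u = float(np.clip(action, -1.0, 1.0)[0])
        x_dot = x_dot + self.DT * u      # semi-implicit Euler integration of the
        x = x + self.DT * x_dot          # double-integrator dynamics (assumption)
        self.state = np.array([x, x_dot], dtype=np.float32)
        self.steps += 1
        # Illustrative reward: two bumps around the goals; the nearer goal (-2.5)
        # is worth less than the farther one (6.0), creating a local optimum.
        r1 = 1.0 * np.exp(-((x - self.GOALS[0]) ** 2))
        r2 = 2.0 * np.exp(-((x - self.GOALS[1]) ** 2))
        reward = float(max(r1, r2))
        truncated = self.steps >= self.MAX_STEPS
        return self.state.copy(), reward, False, truncated, {}
```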
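The experiments run PPO, DDPG, and SAC for 10^5 environment steps using the Stable Baselines implementations [2], with and without exploring starts. As a hedged sketch of that setup, the snippet below uses the same-named algorithms from the maintained stable-baselines3 package with default hyperparameters (a tooling assumption; the page does not state the hyperparameters) together with the hypothetical DoubleIntegratorEnv from the previous sketch.

```python
from stable_baselines3 import DDPG, PPO, SAC

TOTAL_STEPS = 100_000  # 10^5 environment steps, as in the paper


def train(algo_cls, exploring_starts):
    """Train one algorithm on the illustrative double-integrator environment."""
    env = DoubleIntegratorEnv(exploring_starts=exploring_starts)
    model = algo_cls("MlpPolicy", env, verbose=0)  # default hyperparameters (assumption)
    model.learn(total_timesteps=TOTAL_STEPS)
    return model


# Baselines with and without exploring starts ("Ex." / "Non-Ex." in Table 2).
models = {
    (name, "ex" if ex else "non-ex"): train(algo, ex)
    for name, algo in [("DDPG", DDPG), ("PPO", PPO), ("SAC", SAC)]
    for ex in (False, True)
}
```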
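The coverage numbers in Table 2 count non-empty, uniformly-shaped bins with √(10^5/5) bins per dimension, i.e. about 5 samples per bin on average for 10^5 collected states. A minimal way to compute this with a 2-D histogram is sketched below; the state-space bounds used for binning are not given on this page and are an assumption, as is the helper name state_space_coverage.

```python
import numpy as np


def state_space_coverage(states, bounds=((-10.0, 10.0), (-5.0, 5.0))):
    """Fraction of non-empty bins over the (x, x_dot) plane.

    `states` is an (N, 2) array of visited states; `bounds` are assumed
    state-space limits (not stated on this page).
    """
    n_samples = len(states)
    # sqrt(N / 5) bins per dimension -> roughly 5 samples per bin on average
    bins_per_dim = int(np.round(np.sqrt(n_samples / 5)))
    hist, _, _ = np.histogram2d(
        states[:, 0], states[:, 1],
        bins=bins_per_dim, range=bounds,
    )
    return np.count_nonzero(hist) / hist.size


# Example: with the 10^5 exploration states of one run,
#   coverage = state_space_coverage(visited_states)
#   print(f"coverage: {100 * coverage:.1f}%")
```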
References

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Int. Conf. Machine Learning (ICML), 2018.
[2] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Stable Baselines. GitHub repository, 2018.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proc. 4th Int. Conf. Learning Representations (ICLR), 2016.
[4] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. Int. J. Robotics Res., 39(1), 2020.
[5] A. Perez, R. Platt, G. Konidaris, L. Kaelbling, and T. Lozano-Perez. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics. In IEEE Int. Conf. Robotics and Automation, 2012.
[6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347, 2017.
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik (computer science), Technik (engineering)