Page 30 in Joint Austrian Computer Vision and Robotics Workshop 2020


Figure 2. The reward (heatmap) and reward distributions (plots above and to the right of the heatmap) for the double integrator. The agent starts at x = 0 with ẋ = 0. The reward is based on the distance of the agent to the goal positions 1 and 2.

Alg.      DDPG     PPO      SAC      PPS (RRT)
Non-Ex.   15.5%    20.8%    20.4%    79.3%
Ex.       59.0%    60.9%    61.0%    -

Table 2. Final coverage as percent of visited bins.

...points. The goals (x1 = -2.5 and x2 = 6.0) are chosen such that simply maximizing the reward from the starting position leads to a suboptimal policy, i.e. a local optimum (see Figure 2).

We compare the performance of PPS against the prominent D-RL algorithms Proximal Policy Optimization (PPO) [6], Deep Deterministic Policy Gradient (DDPG) [3], and SAC [1], using the implementation in [2]. The algorithms are run for 10^5 environment steps; the D-RL algorithms use 100-step episodes. To have a broader baseline we included an exploring-starts mechanism where the initial state of the double integrator is sampled uniformly. However, especially in robotic tasks, exploring starts are impractical and potentially dangerous and should be avoided. (Illustrative code sketches of the environment, the coverage metric, and this baseline setup are given after the discussion.)

We first compare the state-space coverage obtained from data collected during the exploration phase of the different D-RL approaches. The coverage is calculated as the percentage of non-empty, uniformly shaped bins. The number of bins is set to sqrt(10^5 / 5) (≈ 141) in each dimension, i.e. we expect 5 data points in each bin on average. See Table 2 for the final coverages.

Second, Figure 3 depicts boxplots of the evaluation returns achieved by the D-RL algorithms after training for 10^5 steps. DDPG achieves higher rewards without exploring starts, while PPO and SAC profit from exploring starts. Our PPS method shows improved performance compared to the non-exploring-starts methods. Moreover, the policies learned with PPS achieve performance comparable to the directly trained SAC policy with exploring starts.

[Figure 3: boxplots over "Evaluation Returns" (roughly 0 to 140) for PPO ex, PPO, PPS, DDPG, DDPG ex, SAC, SAC ex, and the MPC Reward-1 and MPC Reward-2 baselines.]

Figure 3. Boxplot of the return distributions (11 independent runs); each run consists of the mean of 10 evaluation runs. The evaluation runs are performed towards the end of the training process, equally spaced 10 learning episodes apart.

4. Discussion

In this work, we highlighted that standard D-RL algorithms are not immune to getting stuck in suboptimal policies, even in a toy problem with two local optima. The agent controlled by PPS explores a wider part of the state space than D-RL methods that focus on reward accumulation, even with exploring starts. The data gathered by RRT are not biased by reward accumulation and are thus more representative of the environment.
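As a reading aid, the following is a minimal sketch of the double-integrator setup described above, not the authors' implementation. The goal positions (x1 = -2.5, x2 = 6.0) and the start state (x = 0, ẋ = 0) come from the text; the time step, the action bound, and the exact distance-based reward shape (a lower bump at the nearer goal and a higher bump at the farther goal, so that greedy improvement from the start can end at the local optimum) are assumptions.

```python
import numpy as np


class DoubleIntegrator:
    """1-D double integrator: state (x, x_dot), action = acceleration.

    Goal positions and start state follow the paper; dt, the action bound,
    and the reward shape are illustrative assumptions.
    """

    def __init__(self, dt=0.05, goal_1=-2.5, goal_2=6.0):
        self.dt = dt
        self.goal_1 = goal_1  # nearer goal -> local optimum
        self.goal_2 = goal_2  # farther goal -> global optimum
        self.state = np.zeros(2)

    def reset(self):
        # The agent starts at x = 0 with x_dot = 0.
        self.state = np.zeros(2)
        return self.state.copy()

    def reward(self, x):
        # Assumed distance-based reward: a small bump at goal 1 and a larger
        # bump at goal 2, so greedily climbing from x = 0 can get stuck at goal 1.
        return 0.5 * np.exp(-(x - self.goal_1) ** 2) + np.exp(-(x - self.goal_2) ** 2)

    def step(self, action):
        x, x_dot = self.state
        x_dot += self.dt * float(np.clip(action, -1.0, 1.0))  # integrate acceleration
        x += self.dt * x_dot                                   # integrate velocity
        self.state = np.array([x, x_dot])
        return self.state.copy(), self.reward(x)
```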
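The coverage metric described above (percentage of non-empty, uniformly shaped bins, with sqrt(10^5 / 5) bins per dimension so that about 5 samples fall into each bin on average) can be computed along these lines; the state-space bounds used for the binning are not given on this page and are placeholders here.

```python
import numpy as np


def state_space_coverage(states, bounds, n_samples=10**5, points_per_bin=5):
    """Coverage as the percentage of non-empty, uniformly shaped 2-D bins.

    Following the text, the number of bins per dimension is
    sqrt(n_samples / points_per_bin), so uniform exploration would place
    roughly `points_per_bin` samples in each bin. `bounds` is an assumed
    state-space box ((x_min, x_max), (v_min, v_max)).
    """
    bins_per_dim = int(np.sqrt(n_samples / points_per_bin))  # ~141 for 10^5 samples
    counts, _, _ = np.histogram2d(
        states[:, 0], states[:, 1], bins=bins_per_dim, range=bounds
    )
    return 100.0 * np.count_nonzero(counts) / counts.size


# Example: coverage of uniformly sampled states over a placeholder state-space box.
rng = np.random.default_rng(0)
demo_states = rng.uniform([-10.0, -5.0], [10.0, 5.0], size=(10**5, 2))
print(f"coverage: {state_space_coverage(demo_states, ((-10, 10), (-5, 5))):.1f}%")
```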
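The baselines use the Stable Baselines implementation [2]. Below is a minimal sketch of such a training run, assuming the maintained successor library stable-baselines3 (with Gymnasium) rather than the exact version cited, and using a standard continuous-control task as a stand-in because the paper's own environment wrapper is not shown on this page.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in task; max_episode_steps mimics the 100-step episodes from the text.
env = gym.make("Pendulum-v1", max_episode_steps=100)

model = SAC("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)  # 10^5 environment steps, as in the text

# Evaluate the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```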
References

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Int. Conf. Machine Learning (ICML), 2018.
[2] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Stable Baselines. GitHub repository, 2018.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proc. 4th Int. Conf. Learning Representations (ICLR), 2016.
[4] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. Int. J. Robotics Res., 39(1), 2020.
[5] A. Perez, R. Platt, G. Konidaris, L. Kaelbling, and T. Lozano-Perez. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics. In IEEE Int. Conf. Robotics and Automation, 2012.
[6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347, 2017.
Joint Austrian Computer Vision and Robotics Workshop 2020
Title
Joint Austrian Computer Vision and Robotics Workshop 2020
Publisher
Graz University of Technology
Place
Graz
Date
2020
Language
English
License
CC BY 4.0
ISBN
978-3-85125-752-6
Dimensions
21.0 x 29.7 cm
Pages
188
Categories
Computer Science
Technology