# Page 30 in Joint Austrian Computer Vision and Robotics Workshop 2020

## Text of page 30

Figure 2. The reward (heatmap) and reward distributions
(plots above and to the right of the heatmap) for the
double integrator. The agent starts at x = 0 with ẋ = 0.
The reward is based on the distance of the agent to
goal positions 1 and 2.
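
| Alg. | DDPG | PPO | SAC | PPS (RRT) |
|---|---|---|---|---|
| Non-Ex. | 15.5% | 20.8% | 20.4% | 79.3% |
| Ex. | 59.0% | 60.9% | 61.0% | - |

Table 2. Final coverage as percent of visited bins (Non-Ex.: without exploring starts; Ex.: with exploring starts).
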
points. The goals (x1 = −2.5 and x2 = 6.0) are
chosen such that simply maximizing the reward from
the starting position leads to a suboptimal policy, i.e.
a local optimum (see Figure 2).
We compare the performance of PPS against the
prominent D-RL algorithms Proximal Policy Optimization
(PPO) [6], Deep Deterministic Policy Gradient
(DDPG) [3], and Soft Actor-Critic (SAC) [1], using the implementations
in [2]. The algorithms are run for $10^5$ environment
steps; the D-RL algorithms use 100-step episodes.
To have a broader baseline, we included an exploring-starts
mechanism where the initial state of the double
integrator is sampled uniformly. However, especially
in robotic tasks, exploring starts are impractical and
potentially dangerous and should be avoided.
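
The difference between the fixed start and exploring starts can be sketched as follows. This is a minimal illustration, not the authors' environment: the state bounds `X_RANGE` and `V_RANGE`, the time step `dt`, and the explicit-Euler discretization are all assumptions, since this page does not state them.

```python
import numpy as np

# Hypothetical state bounds; not given on this page of the paper.
X_RANGE = (-10.0, 10.0)   # position limits (assumed)
V_RANGE = (-5.0, 5.0)     # velocity limits (assumed)


def reset(rng, exploring_starts=False):
    """Return the initial state [x, x_dot] of the double integrator."""
    if exploring_starts:
        # Exploring starts: draw the initial state uniformly from the bounds.
        return np.array([rng.uniform(*X_RANGE), rng.uniform(*V_RANGE)])
    # Fixed start: at rest at the origin (x = 0, x_dot = 0).
    return np.zeros(2)


def step(state, u, dt=0.1):
    """One explicit-Euler step of x_ddot = u (dt is an assumed value)."""
    x, v = state
    return np.array([x + dt * v, v + dt * u])
```

With `exploring_starts=False`, every episode begins from the same state, so any state-space coverage has to come from the agent's own exploration.
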
We first compare the state-space coverage obtained
from data collected during the exploration
phase of the different D-RL approaches. The coverage
is calculated as the percentage of non-empty,
uniformly-shaped bins. The number of bins is set
to $\sqrt{10^5/5}$ in each dimension, i.e. we expect 5 data
points in each bin on average. See Table 2 for the
final coverages.
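
The coverage measure described above could be computed along these lines. The sketch assumes a two-dimensional state (position and velocity), truncates the bin count to an integer, and takes the bounds `low` and `high` as inputs; the function name and these choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def coverage(states, low, high, points_per_bin=5):
    """Percentage of non-empty bins in a uniform grid over a 2-D state space.

    states: array of shape (n, 2) holding the visited states.
    low, high: per-dimension bounds of the grid (assumed inputs).
    """
    states = np.asarray(states, dtype=float)
    n = states.shape[0]
    # sqrt(n / points_per_bin) bins per dimension, so that a 2-D grid
    # holds points_per_bin samples per bin on average.
    bins = int(np.sqrt(n / points_per_bin))
    hist, _ = np.histogramdd(states, bins=bins, range=list(zip(low, high)))
    return 100.0 * np.count_nonzero(hist) / hist.size
```

For `n = 10**5` samples this gives 141 bins per dimension. Samples outside `[low, high]` are ignored by `np.histogramdd`.
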
Second, Figure 3 depicts boxplots of the evaluation
returns achieved by the D-RL algorithms after
training for $10^5$ steps. DDPG achieves higher rewards
without exploring starts, while PPO and SAC
profit from exploring starts. Our PPS method shows
improved performance compared to non-exploring-starts
methods. Moreover, the policies learned with
PPS achieve performance comparable to the directly-trained
SAC policy with exploring starts.
[Figure 3: boxplot. Horizontal axis: Evaluation Returns (0 to 140). Labels: SAC ex, SAC, DDPG ex, DDPG, PPS, PPO, PPO ex; MPC Reward-1, MPC Reward-2.]
Figure 3. Boxplot of the return distributions (11 independent
runs); each run consists of the mean of 10 evaluation
runs. The evaluation runs are performed towards the
end of the training process, equally spaced 10 learning
episodes apart.
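
The aggregation behind such a boxplot can be sketched as follows: each independent run is reduced to the mean of its evaluation returns, and the box is then drawn over the per-run means. The function names are hypothetical, and the array of returns is synthetic placeholder data, not results from the paper.

```python
import numpy as np


def run_scores(eval_returns):
    """Reduce an (n_runs, n_evals) array to one mean return per run."""
    return np.asarray(eval_returns, dtype=float).mean(axis=1)


def box_stats(scores):
    """Quartiles that define the box of a boxplot."""
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    return {"q1": q1, "median": median, "q3": q3}


# Synthetic placeholder data (NOT from the paper): 11 runs x 10 evaluations.
rng = np.random.default_rng(0)
placeholder_returns = rng.uniform(0.0, 140.0, size=(11, 10))
stats = box_stats(run_scores(placeholder_returns))
```
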
4. Discussion
In this work, we highlighted that standard D-RL
algorithms are not immune to getting stuck in suboptimal
policies, even in a toy problem with two local
optima. The agent controlled by PPS explores a
wider part of the state space than D-RL methods that
focus on reward accumulation, even with exploring
starts. The data gathered by RRT are not biased by
reward accumulation and are thus more representative
of the environment.
References
[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Int. Conf. Machine Learning (ICML), 2018.
[2] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Stable Baselines. GitHub repository, 2018.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proc. 4th Int. Conf. Learning Representations (ICLR), 2016.
[4] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. I. J. Robotics Res., 39(1), 2020.
[5] A. Perez, R. Platt, G. Konidaris, L. Kaelbling, and T. Lozano-Perez. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics. In IEEE Int. Conf. Robotics and Automation, 2012.
[6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347, 2017.
Joint Austrian Computer Vision and Robotics Workshop 2020

- Title: Joint Austrian Computer Vision and Robotics Workshop 2020
- Publisher: Graz University of Technology
- Place: Graz
- Date: 2020
- Language: English
- License: CC BY 4.0
- ISBN: 978-3-85125-752-6
- Dimensions: 21.0 x 29.7 cm
- Pages: 188
- Categories: Computer Science, Technology