How does explicit exploration influence Deep Reinforcement Learning?
Jakob J. Hollenstein, Erwan Renaudo, Matteo Saveriano, Justus Piater
University of Innsbruck
{jakob.hollenstein,erwan.renaudo,matteo.saveriano,justus.piater}@uibk.ac.at
Abstract. Most Deep Reinforcement Learning (D-RL) methods perform local search and are therefore prone to getting stuck in non-optimal solutions. To overcome this issue, we exploit simulation models and kinodynamic planners as an exploration mechanism in a model-based reinforcement learning method. We show that, even on a simple toy domain, D-RL methods are not immune to local optima and require additional exploration mechanisms. In contrast, our planning-based exploration exhibits better state-space coverage, which translates into better policies than the ones learned via standard D-RL methods.
1. Introduction
Deep Reinforcement Learning (D-RL) has shown promising results in challenging robotics domains (e.g. [4]), but can be resource-demanding and difficult to train. We assume that part of the difficulty of learning good policies is related to insufficient exploration. Other D-RL methods like [1, 3, 6] partially address the problem by increasing the number of training steps, or by relying on the environment implementation to provide exploring starts that cover a sufficiently diverse state-space region. However, these solutions are impractical and potentially dangerous in robotics applications.
In the robotic context, directed exploration via physically-based simulation appears more promising for finding good solutions more reliably and in less time. Therefore, this work proposes the Planning for Policy Search (PPS) method, which exploits a kinodynamic planner in the exploration phase to collect data that are then used to learn a policy, thereby eliminating planning time during execution. PPS is tested on the point mass system described in Table 1 and compared with D-RL approaches.
This research has received funding from the European Union's Horizon 2020 research and innovation programme (grant agreement no. 731761, IMAGINE).

Dynamics: X = [x, ẋ]ᵀ, A = [[0, 1], [0, 0]], B = [0, 1]ᵀ, Ẋ = AX + Bu, with position wrapping (schematic: point mass M at distances d1 and d2 from the goals G1 and G2).
Reward: max((1 − tanh|X − G1|), 2(1 − tanh|X − G2|)), with G1 = [−2.5, 0.0]ᵀ and G2 = [6.0, 0.0]ᵀ.
Limits: u ∈ [−1, 1], x ∈ [−10, 10], ẋ ∈ [−2.5, 2.5].
Table 1. Description of the 1D double-integrator test environment: a point mass M can be moved in a one-dimensional position-velocity space X = [x, ẋ] by applying a continuous-valued force. Reward is received based on the distance to two possible goal locations (G1, G2).
Figure 1. Illustration of the PPS method.
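The environment of Table 1 can be summarized in a few lines of code. The following is a minimal sketch, not the authors' implementation: the Euler time step DT, the use of the Euclidean norm in the reward, and the exact wrapping rule are assumptions not specified in the table.

import numpy as np

DT = 0.05                        # assumed integration time step
G1 = np.array([-2.5, 0.0])       # goal 1 (lower maximum reward)
G2 = np.array([6.0, 0.0])        # goal 2 (reward scaled by 2)

def step(state, u):
    # One Euler step of the double integrator X_dot = AX + Bu with X = [x, x_dot].
    x, x_dot = state
    u = np.clip(u, -1.0, 1.0)                     # u in [-1, 1]
    x_dot = np.clip(x_dot + DT * u, -2.5, 2.5)    # x_dot in [-2.5, 2.5]
    x = (x + DT * x_dot + 10.0) % 20.0 - 10.0     # position wrapping onto [-10, 10)
    next_state = np.array([x, x_dot])
    reward = max(1.0 - np.tanh(np.linalg.norm(next_state - G1)),
                 2.0 * (1.0 - np.tanh(np.linalg.norm(next_state - G2))))
    return next_state, reward

Because the reward near G2 is scaled by 2, a policy that settles at the lower-reward goal G1 is exactly the kind of non-optimal solution discussed in the introduction.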
2. Planning for Policy Search
The presented PPS implementation (Figure 1) consists of a Linear Quadratic Regulator Rapidly-exploring Random Tree (LQR-RRT) [5] that creates a tree of data D = {(s, a, r, s′), ...}, from which Soft Actor-Critic (SAC) [1] learns a policy. In contrast to [5], quadratic-programming-based finite-horizon steering is used to extend the tree. In our setup, all the environment interaction data created by the RRT are used as training data for the policy, rather than using only successful trajectories as expert demonstrations.
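The data flow just described can be sketched as follows. The tree-growing step is deliberately simplified to random node extension, since the LQR-RRT and the quadratic-programming-based steering of the actual method are not detailed here; the step function from the previous sketch is reused, and the names grow_exploration_tree and step_fn are illustrative only.

import random
import numpy as np

def grow_exploration_tree(step_fn, start_state, n_extensions=1000, horizon=5):
    # Exploration phase: grow a tree over the state space and keep *every*
    # interaction (s, a, r, s'), not only the successful branches.
    tree = [np.asarray(start_state, dtype=float)]
    dataset = []                                  # D = {(s, a, r, s'), ...}
    for _ in range(n_extensions):
        s = random.choice(tree)                   # node selection (simplified; RRT extends the node nearest to a random sample)
        a = np.random.uniform(-1.0, 1.0)          # action choice (simplified; PPS uses finite-horizon steering)
        for _ in range(horizon):
            s_next, r = step_fn(s, a)
            dataset.append((s, a, r, s_next))
            s = s_next
        tree.append(s)
    return tree, dataset

tree, D = grow_exploration_tree(step, np.array([0.0, 0.0]))
# D is then used as off-policy training data for SAC, so the learned policy
# can be executed without any planning at run time.

The point of the sketch is the data flow: the policy learner consumes the full dataset D generated by the planner, not a curated set of successful demonstrations.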
3. Evaluation
PPS is evaluated in the one-dimensional goal-reaching task presented in Table 1. The environment contains two distinct goal locations. The agent receives a reward based on the distance to the goal