Page - 44 - in Joint Austrian Computer Vision and Robotics Workshop 2020
mounted ASUS Xtion camera are recorded with the
end-effector pose of the robot. Many demonstrations
are shown to create a dataset that is used to train
a deep imitation learning model (Section 3.3). The
learned policy is then executed by the robot using
only the live RGB-D images and end-effector poses.
3.1. Hand Tracking
The hand tracking method developed by Panteleris
et al. [17] is used to estimate the 3D hand pose of
the human demonstrator from RGB images in real
time. This approach consists of three steps:
(1) cropping the user's hand in the image, (2) passing
the cropped image to a 2D joint position estimator,
and (3) mapping the 2D joints onto a 3D hand model
to recover the 3D positions of the joints.
To find an initial bounding box of the hand, a deep
neural network model [25] that detects hands in
real time is applied. Afterwards, the cropped image
of the hand is passed through the hand key-point
localization model of [7] to estimate the 2D locations
of the hand joints. It localizes 21 key-points: the
wrist, 5 fingertips and $5 \times 3 = 15$ finger joints.
This specific model was selected because it matches
or outperforms other state-of-the-art methods but
with much lower computational requirements. Finally,
the 2D locations of the joints are mapped to the 3D
hand model via non-linear least-squares optimization.
The resulting 3D positions are then used as the
starting point for the optimization of the next frame.
The 3D positions of the joints are also used to up-
date the bounding box of the hand, which eliminates
the need to use the hand detector model for each
frame. However, the tracker relies on the hand
position and movement in the previous frames, so a
tracking failure (e.g. due to sudden movements,
occlusion, or failure in 2D localization) results in
poor optimization. Therefore, to make the tracker more
robust, the optimization score is checked and, if
necessary, the optimizer's initial state is reset and
the hand detector is used to find a new bounding box
for the hand.
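
The tracking loop described in this section can be sketched as follows. This is a minimal illustration, not the implementation of [17]: the names `track_frame`, `bbox_from_joints`, `detect_hand` and `fit_hand`, the pixel margin, and the residual threshold are all hypothetical, and the optimization score is assumed to be a residual where lower is better.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bbox_from_joints(joints_2d: List[Point2D], margin: float = 20.0) -> Box:
    """Axis-aligned box around the projected joints, padded by `margin` pixels."""
    xs = [x for x, _ in joints_2d]
    ys = [y for _, y in joints_2d]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


def track_frame(frame, box, state, detect_hand, fit_hand, max_residual=1.0):
    """One tracking step; returns (state, box) to carry into the next frame.

    detect_hand(frame) -> box                       (hypothetical detector, cf. [25])
    fit_hand(frame, box, state) -> (state, joints_2d, residual)
        (hypothetical wrapper around 2D key-point estimation and 3D model fitting)
    """
    if box is None:
        # First frame, or tracking was lost: run the detector, start cold.
        box, state = detect_hand(frame), None

    new_state, joints_2d, residual = fit_hand(frame, box, state)

    if residual > max_residual and state is not None:
        # Poor fit from a warm start: reset the initial state and re-detect once.
        box = detect_hand(frame)
        new_state, joints_2d, residual = fit_hand(frame, box, None)

    if residual > max_residual:
        # Still poor: signal that the next frame must start from the detector.
        return None, None

    # Good fit: the joints define the next bounding box, so no detector call
    # is needed for the next frame.
    return new_state, bbox_from_joints(joints_2d)
```

Passing the detector and the fitting routine in as callables keeps the sketch independent of any particular model; in normal operation the detector is only invoked on the first frame and after a failed fit.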
3.2. Robot Control
For the teleoperation of the robot end-effector, the
3D hand position from the hand tracking system is
compared with an initial hand position. If the
difference between the current and the initial position
for any Cartesian coordinate is above a certain
threshold $\kappa$, a new end-effector position is
calculated and commanded to the robot. This difference
is then transformed from the camera frame to the robot
base frame and denoted as $h$. The transformation
aligns the directions of the hand and end-effector
movement to allow intuitive teleoperation.
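
Because $h$ is a difference of two positions, any translation between the frames cancels and only a rotation acts on it. The sketch below assumes the camera-to-base transformation can be written as a fixed 3×3 rotation matrix; the function name `rotate_to_base` and the matrix values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical axis mapping (a proper rotation, determinant +1), for
# illustration only: camera z -> base x, camera x -> base -y, camera y -> base -z.
R_BASE_CAM = [
    [0, 0, 1],
    [-1, 0, 0],
    [0, -1, 0],
]


def rotate_to_base(rotation: Sequence[Sequence[float]],
                   d_cam: Sequence[float]) -> List[float]:
    """Express a displacement measured in the camera frame in the base frame.

    Only the rotation is applied: a displacement is a difference of two
    points, so the translation between the frames drops out.
    """
    return [sum(rotation[r][c] * d_cam[c] for c in range(3)) for r in range(3)]


# Hand moved 1 unit along camera x, 2 along camera y, 3 along camera z.
h = rotate_to_base(R_BASE_CAM, [1.0, 2.0, 3.0])
```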
The desired end-effector position $p'$ is calculated
by adding the current end-effector position $p$ and the
value $\Delta p$. This is calculated for each Cartesian
coordinate with $p_i, h_i \in p, h$ according to:
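\begin{equation}
\Delta p_i =
\begin{cases}
\alpha \left( \min\{ h_{max},\, h_i - \kappa \} \right) & \forall\, h_i > \kappa, \\
\alpha \left( \max\{ h_{min},\, -h_i - \kappa \} \right) & \forall\, h_i < -\kappa, \\
0 & \text{otherwise},
\end{cases}
\tag{1}
\end{equation}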
where $h_{max}$ is the upper limit, $h_{min}$ is the
lower limit, and $\alpha$ is a parameter that indirectly
allows the sensitivity to speed to be tuned.
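
The thresholded update of Eq. (1) can be sketched in a few lines. The function names and the numeric values below are hypothetical. The lower branch is implemented here as $\max\{h_{min}, h_i + \kappa\}$, which is an assumption chosen so that the step keeps the sign of $h_i$; check it against Eq. (1) before relying on it.

```python
from typing import List, Sequence


def delta_p_component(h_i: float, kappa: float, alpha: float,
                      h_min: float, h_max: float) -> float:
    """Step for one Cartesian coordinate: dead zone of width kappa, then clamp."""
    if h_i > kappa:
        return alpha * min(h_max, h_i - kappa)
    if h_i < -kappa:
        # Assumption: mirrored form of the upper branch, so a negative hand
        # displacement yields a negative step, clamped from below by h_min.
        return alpha * max(h_min, h_i + kappa)
    return 0.0  # inside the dead zone: no motion


def desired_position(p: Sequence[float], h: Sequence[float], kappa: float,
                     alpha: float, h_min: float, h_max: float) -> List[float]:
    """p' = p + delta_p, evaluated independently per coordinate."""
    return [p_i + delta_p_component(h_i, kappa, alpha, h_min, h_max)
            for p_i, h_i in zip(p, h)]


# Hypothetical values: x is inside the dead zone, y is a moderate
# displacement, z is large enough to hit the lower clamp.
p_new = desired_position(p=[100.0, 200.0, 300.0], h=[5.0, 30.0, -100.0],
                         kappa=10.0, alpha=0.5, h_min=-40.0, h_max=40.0)
```

With these values the x coordinate is unchanged, y advances by 0.5 · (30 − 10) = 10, and z moves by 0.5 · (−40) = −20 because the displacement is clamped at $h_{min}$.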
As described in Section 3.3, $\Delta p$ is directly
learned. When executing the learned policy, $\Delta p$
is used to calculate the desired end-effector position.
For both the teleoperation and the task execution by
the learned policy, the desired end-effector position
is updated continuously and commanded to the robot.
The orientation of the end-effector could be changed
similarly, but this was not necessary for our specific
task.
3.3. Deep Imitation Learning
We employed the algorithm presented by [27] and
adapted it in several ways to work with our robotic
setup involving imperfect demonstrations and changing
environment conditions (e.g. brightness). The
adapted network can be seen in Figure 2. The input
$o_t$ at each timestep $t$ consists of the cropped and
scaled color image $I_t \in \mathbb{R}^{120 \times 160 \times 3}$,
the depth image $D_t \in \mathbb{R}^{120 \times 160 \times 1}$,
and the 5 most recent end-effector positions
$p_{t-4:t} \in \mathbb{R}^{15}$. After 3 convolutional layers,
the data is passed through a spatial softmax layer
introduced in [14]. During training, the output of this
layer is used for auxiliary predictions of the current
end-effector position and the end-effector position at
the end of each demonstration, with two fully connected
layers per auxiliary prediction. The output of
the network is the change of the end-effector position
of the robot $\Delta p$ in millimetres. Compared to [27],
we omit one convolutional layer but use more units
in our dense layers, which slightly reduces training
time without deteriorating performance. Since we do
not change the orientation of the end-effector for this
simple task, we can simplify the output of the network
at time $t$ to be $\Delta p_t = \pi_\theta(o_t) \in \mathbb{R}^3$.
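
For readers unfamiliar with the spatial softmax layer, the sketch below shows the usual formulation for a single feature channel: a softmax over all pixel locations followed by the expected pixel coordinates. It is written in plain Python for clarity. The function name is hypothetical, and returning raw pixel indices rather than normalized coordinates is an assumption; the exact variant used in the network is not specified here.

```python
import math
from typing import List, Tuple


def spatial_softmax(feature_map: List[List[float]]) -> Tuple[float, float]:
    """Expected (x, y) pixel location of one channel.

    The activations are turned into a probability distribution over all
    pixels with a softmax; the result is the mean column and row index
    under that distribution.
    """
    peak = max(max(row) for row in feature_map)  # subtract max for stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


# A uniform map puts the expected location at the centre of a 3x3 grid,
# while a strong activation in the top-right corner pulls it there.
centre = spatial_softmax([[0.0] * 3 for _ in range(3)])
corner = spatial_softmax([[0.0, 0.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

Applied to every channel of the last convolutional layer, this yields two numbers per channel, i.e. a compact list of image locations instead of a full feature map.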
The input data is augmented by randomly chang-
ing the brightness during training and batch normal-
ization is added after each layer to better cope with
- Title
- Joint Austrian Computer Vision and Robotics Workshop 2020
- Editor
- Graz University of Technology
- Location
- Graz
- Date
- 2020
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-752-6
- Size
- 21.0 x 29.7 cm
- Pages
- 188
- Categories
- Informatik
- Technik