Page 44 in Joint Austrian Computer Vision and Robotics Workshop 2020
mounted ASUS Xtion camera are recorded together with the end-effector pose of the robot. Many demonstrations are shown to create a dataset that is used to train a deep imitation learning model (Section 3.3). The learned policy is then executed by the robot using only the live RGB-D images and end-effector poses.

3.1. Hand Tracking

The hand tracking method developed by Panteleris et al. [17] is used to estimate the 3D pose of the human demonstrator's hand from RGB images in real time. This approach consists of three steps: (1) cropping the user's hand in the image, (2) passing the cropped image to a 2D joint position estimator, and (3) mapping the 2D joints onto a 3D hand model to recover the 3D positions of the joints.

To find an initial bounding box of the hand, a deep neural network model [25] that detects hands in real time is applied. Afterwards, the cropped image of the hand is passed through the hand key-point localization model of [7] to estimate the 2D locations of the hand joints. It localizes 21 key-points: the wrist, the 5 fingertips, and the 5Ɨ3 = 15 finger joints. This specific model was selected because it matches or outperforms other state-of-the-art methods while having much lower computational requirements. Finally, the 2D locations of the joints are mapped to the 3D hand model via non-linear least-squares optimization. The resulting 3D positions are then used as the initialization for the optimization of the next frame.

The 3D positions of the joints are also used to update the bounding box of the hand, which eliminates the need to run the hand detector model for each frame. However, if this estimate based on the hand position and movement in the previous frames fails (e.g. due to sudden movements, occlusion, or failure of the 2D localization), the optimization yields poor results. Therefore, to make the tracker more robust, the optimization score is checked and, if necessary, the optimizer's initial state is reset and the hand detector is used to find a new bounding box for the hand.

3.2. Robot Control

For the teleoperation of the robot end-effector, the 3D hand position from the hand tracking system is compared with an initial hand position. If the difference between the current and the initial position for any Cartesian coordinate is above a certain threshold Īŗ, a new end-effector position is calculated and commanded to the robot. This difference is transformed from the camera frame to the robot base frame and denoted as h. The transformation aligns the directions of the hand and end-effector movement to allow intuitive teleoperation.

The desired end-effector position p* is calculated by adding the value Ī”p to the current end-effector position p. Ī”p is computed for each Cartesian coordinate, with p_i and h_i denoting the components of p and h, according to:

\[
\Delta p_i =
\begin{cases}
\alpha \left( \min\{h_{max},\, h_i - \kappa\} \right) & \forall\, h_i > \kappa, \\
\alpha \left( \max\{h_{min},\, -h_i - \kappa\} \right) & \forall\, h_i < -\kappa, \\
0 & \text{otherwise},
\end{cases}
\tag{1}
\]

where h_max denotes the upper limit, h_min the lower limit, and α a parameter that indirectly allows the speed sensitivity to be tuned.

As described in Section 3.3, Ī”p is directly learned. When executing the learned policy, Ī”p is used to calculate the desired end-effector position. For both teleoperation and task execution by the learned policy, the desired end-effector position is updated continuously and commanded to the robot. The orientation of the end-effector could be changed in a similar way but was not necessary for our specific task.
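As a concrete illustration of the control law, the following is a minimal sketch of the per-coordinate mapping in Eq. (1), assuming the hand displacement h has already been transformed into the robot base frame. The function name delta_p, the use of NumPy, and all numerical values for Īŗ, α, h_min and h_max are illustrative assumptions, not values from the paper.

```python
import numpy as np

def delta_p(h, kappa, alpha, h_min, h_max):
    """Per-coordinate end-effector offset, mirroring Eq. (1) as printed above."""
    h = np.asarray(h, dtype=float)
    dp = np.zeros_like(h)
    above = h > kappa            # hand moved beyond the dead zone in positive direction
    below = h < -kappa           # hand moved beyond the dead zone in negative direction
    dp[above] = alpha * np.minimum(h_max, h[above] - kappa)
    dp[below] = alpha * np.maximum(h_min, -h[below] - kappa)
    return dp

# Hypothetical teleoperation step: command p* = p + delta_p(h, ...)
p = np.array([0.40, 0.00, 0.30])      # current end-effector position (illustrative)
h = np.array([0.08, -0.01, 0.12])     # hand displacement from its initial position
p_star = p + delta_p(h, kappa=0.03, alpha=0.5, h_min=-0.10, h_max=0.10)
```

The threshold Īŗ acts as a dead zone around the initial hand position, so small unintended hand motions leave the robot at rest, while α scales how strongly the remaining displacement is converted into an end-effector offset.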
3.3. Deep Imitation Learning

We employed the algorithm presented by [27] and adapted it in several ways to work with our robotic setup, which involves imperfect demonstrations and changing environment conditions (e.g. brightness). The adapted network can be seen in Figure 2. The input o_t at each timestep t consists of the cropped and scaled color image I_t ∈ R^{120Ɨ160Ɨ3}, the depth image D_t ∈ R^{120Ɨ160Ɨ1}, and the 5 most recent end-effector positions p_{tāˆ’4:t} ∈ R^{15}. After 3 convolutional layers, the data is passed through a spatial softmax layer introduced in [14]. During training, the output of this layer is used for auxiliary predictions of the current end-effector position and the end-effector position at the end of each demonstration, with two fully connected layers per auxiliary prediction. The output of the network is the change of the end-effector position of the robot, Ī”p, in millimetres. Compared to [27], we omit one convolutional layer but use more units in our dense layers, which slightly reduces training time without deteriorating performance. Since we do not change the orientation of the end-effector for this simple task, we can simplify the output of the network at time t to Ī”p_t = π_θ(o_t) ∈ R^3.
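For illustration, the following is a minimal sketch of a policy network with the shapes described above: the color and depth images are stacked and passed through three convolutional layers and a spatial softmax, the resulting feature points are concatenated with the five most recent end-effector positions, and dense layers regress Ī”p_t ∈ R^3; two auxiliary heads predict the current and final end-effector positions during training. The choice of PyTorch, the kernel sizes, strides, and layer widths are assumptions, since they are not specified on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Spatial softmax: per channel, return the expected (x, y) image coordinate."""
    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        probs = F.softmax(feat.view(b, c, h * w), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)  # expected x per channel: (B, C)
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)  # expected y per channel: (B, C)
        return torch.cat([ex, ey], dim=1)         # feature points: (B, 2C)

class ImitationPolicy(nn.Module):
    def __init__(self, n_channels=32, hidden=128):
        super().__init__()
        # 3 convolutional layers over the stacked 4-channel RGB-D input (120x160);
        # kernel sizes and strides are assumptions
        self.conv = nn.Sequential(
            nn.Conv2d(4, n_channels, 7, stride=2), nn.BatchNorm2d(n_channels), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, 5), nn.BatchNorm2d(n_channels), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, 5), nn.BatchNorm2d(n_channels), nn.ReLU(),
        )
        self.spatial_softmax = SpatialSoftmax()
        feat_dim = 2 * n_channels + 15            # feature points + 5 recent EE positions
        self.policy_head = nn.Sequential(         # main output: delta p in R^3 (mm)
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        # auxiliary heads (training only), two fully connected layers each:
        # current end-effector position and end-effector position at demo end
        self.aux_current = nn.Sequential(nn.Linear(2 * n_channels, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.aux_final = nn.Sequential(nn.Linear(2 * n_channels, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, rgb, depth, ee_history):
        x = torch.cat([rgb, depth], dim=1)        # (B, 4, 120, 160)
        points = self.spatial_softmax(self.conv(x))
        delta_p = self.policy_head(torch.cat([points, ee_history], dim=1))
        return delta_p, self.aux_current(points), self.aux_final(points)
```

At execution time only Ī”p_t is used; the auxiliary heads merely add training losses, as described above.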
The input data is augmented by randomly changing the brightness during training, and batch normalization is added after each layer to better cope with
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik (Computer Science), Technik (Technology)