Page 44 in Joint Austrian Computer Vision and Robotics Workshop 2020
mounted ASUS Xtion camera are recorded together with the end-effector pose of the robot. Many demonstrations are shown to create a dataset that is used to train a deep imitation learning model (Section 3.3). The learned policy is then executed by the robot using only the live RGB-D images and end-effector poses.

3.1. Hand Tracking

The hand tracking method developed by Panteleris et al. [17] is used to estimate the 3D pose of the human demonstrator's hand from RGB images in real time. This approach consists of three steps: (1) cropping the user's hand in the image, (2) passing the cropped image to a 2D joint position estimator, and (3) mapping the 2D joints onto a 3D hand model to recover the 3D positions of the joints.

To find an initial bounding box of the hand, a deep neural network model [25] that detects hands in real time is applied. Afterwards, the cropped image of the hand is passed through the hand key-point localization model of [7] to estimate the 2D locations of the hand joints. It localizes 21 key-points: the wrist, the 5 fingertips, and the 5Ɨ3 = 15 finger joints. This specific model was selected because it matches or outperforms other state-of-the-art methods while having much lower computational requirements. Finally, the 2D locations of the joints are mapped to the 3D hand model via non-linear least-squares optimization. The resulting 3D positions are then used as the initialization for the optimization of the next frame.

The 3D positions of the joints are also used to update the bounding box of the hand, which eliminates the need to run the hand detector model for each frame. However, if this estimate based on the hand position and movement in the previous frames fails (e.g. due to sudden movements, occlusion, or failure of the 2D localization), the optimization yields poor results. Therefore, to make the tracker more robust, the optimization score is checked and, if necessary, the optimizer's initial state is reset and the hand detector is used to find a new bounding box for the hand.

3.2. Robot Control

For the teleoperation of the robot end-effector, the 3D hand position from the hand tracking system is compared with an initial hand position. If the difference between the current and the initial position for any Cartesian coordinate is above a certain threshold Īŗ, a new end-effector position is calculated and commanded to the robot. This difference is transformed from the camera frame to the robot base frame and denoted as h. The transformation aligns the directions of the hand and end-effector movement to allow intuitive teleoperation.

The desired end-effector position p* is calculated by adding the value Ī”p to the current end-effector position p. Ī”p is computed for each Cartesian coordinate, with p_i and h_i denoting the components of p and h, according to:

\[
\Delta p_i =
\begin{cases}
\alpha \left( \min\{h_{max},\, h_i - \kappa\} \right) & \forall\, h_i > \kappa, \\
\alpha \left( \max\{h_{min},\, -h_i - \kappa\} \right) & \forall\, h_i < -\kappa, \\
0 & \text{otherwise},
\end{cases}
\tag{1}
\]

where h_max denotes the upper limit, h_min the lower limit, and α a parameter that indirectly allows the speed sensitivity to be tuned.

As described in Section 3.3, Ī”p is directly learned. When executing the learned policy, Ī”p is used to calculate the desired end-effector position. For both teleoperation and task execution by the learned policy, the desired end-effector position is updated continuously and commanded to the robot. The orientation of the end-effector could be changed in a similar way but was not necessary for our specific task.
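As a concrete illustration of the control law, the following is a minimal sketch of the per-coordinate mapping in Eq. (1), assuming the hand displacement h has already been transformed into the robot base frame. The function name delta_p, the use of NumPy, and all numerical values for Īŗ, α, h_min and h_max are illustrative assumptions, not values from the paper.

```python
import numpy as np

def delta_p(h, kappa, alpha, h_min, h_max):
    """Per-coordinate end-effector offset, mirroring Eq. (1) as printed above."""
    h = np.asarray(h, dtype=float)
    dp = np.zeros_like(h)
    above = h > kappa            # hand moved beyond the dead zone in positive direction
    below = h < -kappa           # hand moved beyond the dead zone in negative direction
    dp[above] = alpha * np.minimum(h_max, h[above] - kappa)
    dp[below] = alpha * np.maximum(h_min, -h[below] - kappa)
    return dp

# Hypothetical teleoperation step: command p* = p + delta_p(h, ...)
p = np.array([0.40, 0.00, 0.30])      # current end-effector position (illustrative)
h = np.array([0.08, -0.01, 0.12])     # hand displacement from its initial position
p_star = p + delta_p(h, kappa=0.03, alpha=0.5, h_min=-0.10, h_max=0.10)
```

The threshold Īŗ acts as a dead zone around the initial hand position, so small unintended hand motions leave the robot at rest, while α scales how strongly the remaining displacement is converted into an end-effector offset.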
3.3. Deep Imitation Learning

We employed the algorithm presented by [27] and adapted it in several ways to work with our robotic setup, which involves imperfect demonstrations and changing environment conditions (e.g. brightness). The adapted network can be seen in Figure 2. The input o_t at each timestep t consists of the cropped and scaled color image I_t ∈ R^{120Ɨ160Ɨ3}, the depth image D_t ∈ R^{120Ɨ160Ɨ1}, and the 5 most recent end-effector positions p_{tāˆ’4:t} ∈ R^{15}. After 3 convolutional layers, the data is passed through a spatial softmax layer introduced in [14]. During training, the output of this layer is used for auxiliary predictions of the current end-effector position and the end-effector position at the end of each demonstration, with two fully connected layers per auxiliary prediction. The output of the network is the change of the end-effector position of the robot, Ī”p, in millimetres. Compared to [27], we omit one convolutional layer but use more units in our dense layers, which slightly reduces training time without deteriorating performance. Since we do not change the orientation of the end-effector for this simple task, we can simplify the output of the network at time t to Ī”p_t = π_θ(o_t) ∈ R^3.
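For illustration, the following is a minimal sketch of a policy network with the shapes described above: the color and depth images are stacked and passed through three convolutional layers and a spatial softmax, the resulting feature points are concatenated with the five most recent end-effector positions, and dense layers regress Ī”p_t ∈ R^3; two auxiliary heads predict the current and final end-effector positions during training. The choice of PyTorch, the kernel sizes, strides, and layer widths are assumptions, since they are not specified on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Spatial softmax: per channel, return the expected (x, y) image coordinate."""
    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        probs = F.softmax(feat.view(b, c, h * w), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)  # expected x per channel: (B, C)
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)  # expected y per channel: (B, C)
        return torch.cat([ex, ey], dim=1)         # feature points: (B, 2C)

class ImitationPolicy(nn.Module):
    def __init__(self, n_channels=32, hidden=128):
        super().__init__()
        # 3 convolutional layers over the stacked 4-channel RGB-D input (120x160);
        # kernel sizes and strides are assumptions
        self.conv = nn.Sequential(
            nn.Conv2d(4, n_channels, 7, stride=2), nn.BatchNorm2d(n_channels), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, 5), nn.BatchNorm2d(n_channels), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, 5), nn.BatchNorm2d(n_channels), nn.ReLU(),
        )
        self.spatial_softmax = SpatialSoftmax()
        feat_dim = 2 * n_channels + 15            # feature points + 5 recent EE positions
        self.policy_head = nn.Sequential(         # main output: delta p in R^3 (mm)
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        # auxiliary heads (training only), two fully connected layers each:
        # current end-effector position and end-effector position at demo end
        self.aux_current = nn.Sequential(nn.Linear(2 * n_channels, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.aux_final = nn.Sequential(nn.Linear(2 * n_channels, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, rgb, depth, ee_history):
        x = torch.cat([rgb, depth], dim=1)        # (B, 4, 120, 160)
        points = self.spatial_softmax(self.conv(x))
        delta_p = self.policy_head(torch.cat([points, ee_history], dim=1))
        return delta_p, self.aux_current(points), self.aux_final(points)
```

At execution time only Ī”p_t is used; the auxiliary heads merely add training losses, as described above.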
The input data is augmented by randomly changing the brightness during training, and batch normalization is added after each layer to better cope with
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik (Computer Science), Technik (Technology)