by the synthesizer. We iterate this update several times to
improve the initial pose estimate. Again, the updater function
is implemented as a CNN. The architecture is inspired by a
Siamese network with two identical paths: one path receives the observed depth image, while the other receives the image produced by the synthesizer.
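As an illustration, the following is a minimal PyTorch sketch of such a two-path updater; the input resolution, layer sizes, and number of joints are our own assumptions, not the configuration of [4]. Whether the two paths share weights is a design choice; here they are kept separate.

    import torch
    import torch.nn as nn

    class PoseUpdater(nn.Module):
        """Siamese-style updater: two identical convolutional paths, one
        for the observed depth image and one for the synthesized image;
        their features are concatenated and regressed to a pose update."""

        def __init__(self, num_joints=14):
            super().__init__()
            def make_path():  # one convolutional path (sizes illustrative)
                return nn.Sequential(
                    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Flatten())
            self.observed_path = make_path()
            self.synthesized_path = make_path()
            feat = 64 * 29 * 29  # flattened size for 128x128 depth inputs
            self.regressor = nn.Sequential(
                nn.Linear(2 * feat, 1024), nn.ReLU(),
                nn.Linear(1024, 3 * num_joints))  # 3D update per joint

        def forward(self, observed, synthesized):
            f = torch.cat([self.observed_path(observed),
                           self.synthesized_path(synthesized)], dim=1)
            return self.regressor(f)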
Ideally, the output of the updater would bring the pose estimate to the correct pose in a single step, but this is very difficult to achieve in practice. However, the only requirement we impose on the updater is that it predicts an update resulting in a pose closer to the ground truth. The introduction of the synthesizer allows us to virtually augment the training data and to add arbitrary poses to the set of poses that the updater might encounter during testing and be asked to correct. We refer to
our paper [4] for more details.
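The resulting feedback loop is compact. The sketch below treats the two trained CNNs as callables synthesizer and updater; these names and their interfaces are hypothetical.

    def refine_pose(depth_image, initial_pose, synthesizer, updater,
                    num_iters=2):
        """Feedback loop (sketch): render a depth image from the current
        pose estimate, compare it with the observation, and apply the
        predicted correction; a few iterations refine the initial pose."""
        pose = initial_pose
        for _ in range(num_iters):
            synthesized = synthesizer(pose)  # depth image from current pose
            pose = pose + updater(depth_image, synthesized)
        return pose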
III. CREATING TRAINING DATA EFFICIENTLY
Since the hand pose estimation method presented above critically relies on labeled training frames, we now describe a method for creating such frames. Given a sequence of depth maps capturing a hand in motion, we want to estimate the 3D joint locations for each depth map with minimal annotation effort.
We start by automatically selecting a subset of depth frames, which we will refer to as reference frames, for which a user is asked to provide annotations. Our method selects these reference frames based on the appearances of the frames over the whole sequence. For this, we train an autoencoder that learns an unsupervised representation sensitive to the image nuances caused by hand articulation. We use this representation to formalize the frame selection as a submodular optimization problem.
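As an illustration of such a selection, the sketch below greedily maximizes a facility-location objective, a standard submodular choice; the feature matrix interface and this particular objective are assumptions here, and the exact formulation of [2] may differ.

    import numpy as np

    def select_reference_frames(features, k):
        """Greedily pick k frames so that every frame has a similar
        selected frame in feature space (facility-location objective).
        features: (num_frames, dim) autoencoder encodings."""
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = f @ f.T                   # cosine similarity, (n, n)
        best = np.zeros(sim.shape[0])   # best similarity covered so far
        selected = []
        for _ in range(k):
            # marginal coverage gain of adding each candidate frame
            gains = np.maximum(sim, best[None, :]).sum(axis=1) - best.sum()
            i = int(np.argmax(gains))
            selected.append(i)
            best = np.maximum(best, sim[i])
        return selected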
A user is then asked to provide the 2D reprojections of the joints in these reference frames, together with visibility information and an indication of whether each joint is closer to or farther from the camera than its parent joint in the hand skeleton tree. We use this information to automatically recover the 3D locations of the joints by solving a least-squares problem.
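One simplified way to set up such a least-squares problem parameterizes each joint by its unknown depth along the camera ray through its 2D annotation; the parameterization and residuals below are our assumptions, not the exact formulation of [2].

    import numpy as np
    from scipy.optimize import least_squares

    def recover_depths(rays, parents, bone_len, closer, z_init):
        """Recover per-joint depths z so that joint j sits at
        z[j] * rays[j], where rays[j] back-projects its 2D annotation
        (K^-1 [u, v, 1]). Residuals enforce known bone lengths and the
        annotated depth order; parents[j] is the parent joint index
        (-1 for the root). All inputs are assumed given."""
        def residuals(z):
            pts = z[:, None] * rays                  # 3D joint positions
            res = []
            for j, p in enumerate(parents):
                if p < 0:
                    continue                         # root has no bone
                res.append(np.linalg.norm(pts[j] - pts[p]) - bone_len[j])
                sign = -1.0 if closer[j] else 1.0    # annotated order
                res.append(max(0.0, -sign * (z[j] - z[p])))  # hinge term
            return np.asarray(res)
        return least_squares(residuals, z_init).x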
Next, we iteratively propagate these 3D locations from the reference frames to the remaining frames. We initialize the pose of each frame with the pose of the visually closest reference frame and optimize the local appearance together with spatial constraints. This gives us an initialization for the joint locations in all frames; however, each frame is still processed independently. We can improve the estimates further by introducing temporal constraints on the 3D locations and performing a global optimization that enforces appearance, temporal, and spatial constraints over the 3D locations of all frames, as sketched schematically below. If this inference fails for some frames, the user can still provide additional 2D reprojections; by running the global inference again, a single additional annotation typically fixes many frames. See our paper [2] for more details.
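Schematically, such a global objective can be written (in our notation; the actual energy terms of [2] differ in detail) as

    E({p_t}) = Σ_t E_app(p_t) + λ_s Σ_t E_spat(p_t) + λ_tmp Σ_t ‖p_{t+1} − p_t‖²,

where p_t collects the 3D joint locations of frame t, E_app measures agreement with the local depth appearance, E_spat encodes skeleton constraints such as bone lengths, and the last term enforces temporal smoothness.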
IV. EVALUATION
We evaluate our hand pose estimation method on the NYU Hand Pose Dataset [7], a challenging real-world benchmark for hand pose estimation. This dataset is publicly available, contains a large number of annotated samples, and exhibits a high variability of poses.
We show the benefit of using our proposed feedback loop
to increase the accuracy of the 3D joint localization in Fig. 2.
While [7] and [3] have an average 3D joint error of 21 mm
and 20 mm, respectively, our proposed method reduces the error to 16.5 mm. The initialization with our simple and efficient predictor has an error of 27 mm. When
we use a more complex initialization [3] with an error of
23 mm, we can decrease the average error to 16 mm.
[Plot for Fig. 2: distance threshold (mm) versus fraction of frames, with curves for Tompson et al., Oberweger et al., our initialization, and our method.]
Fig. 2. Quantitative evaluation of hand pose estimation. The figure shows
the fraction of frames where all joints are within a maximum distance. A
higher area under the curve denotes better results. We compare our method
to the baseline of Tompson et al. [7] and Oberweger et al. [3]. Although
our initialization is worse than both baselines, we can boost the accuracy
of the joint locations using our proposed feedback loop.
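For reference, the two measures used in this evaluation can be computed as follows; pred and gt are assumed to be arrays of predicted and ground-truth 3D joint locations in millimeters.

    import numpy as np

    def joint_error_metrics(pred, gt, thresholds):
        """Average 3D joint error, and per threshold the fraction of
        frames in which *all* joints lie within that distance of the
        ground truth (the measure plotted in Fig. 2).
        pred, gt: arrays of shape (num_frames, num_joints, 3), in mm."""
        dists = np.linalg.norm(pred - gt, axis=2)  # per-joint errors
        avg_error = dists.mean()
        worst_per_frame = dists.max(axis=1)        # worst joint per frame
        fractions = [(worst_per_frame <= t).mean() for t in thresholds]
        return avg_error, fractions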
To demonstrate our training data creation approach, we
evaluate it on a synthetic dataset, which is the only way to
have depth maps with ground truth 3D locations of the joints.
On this dataset we evaluate the accuracy of the automatically
inferred 3D locations for the reference frames. We obtain an
average 3D joint error of 3.6 mm from the 2D reprojections with visibility and depth-order annotations alone. Our method is also robust to annotation noise. We can propagate the 3D joint locations to the remaining frames, for which we achieve an average 3D joint error of 5.5 mm over the full sequence while requiring manual 2D annotations for only 10% of all frames.
REFERENCES
[1] A. Dosovitskiy, J. T. Springenberg, and T. Brox, “Learning to Generate Chairs with Convolutional Neural Networks,” in CVPR, 2015.
[2] M. Oberweger, G. Riegler, P. Wohlhart, and V. Lepetit, “Efficiently Creating 3D Training Data for Fine Hand Pose Estimation,” in CVPR, 2016.
[3] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands Deep in Deep Learning for Hand Pose Estimation,” in Proc. of CVWW, 2015.
[4] M. Oberweger, P. Wohlhart, and V. Lepetit, “Training a Feedback Loop for Hand Pose Estimation,” in ICCV, 2015.
[5] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, “Realtime and Robust Hand Tracking from Depth,” in CVPR, 2014.
[6] D. Tang, H. J. Chang, A. Tejani, and T.-K. Kim, “Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture,” in CVPR, 2014.
[7] J. Tompson, M. Stein, Y. LeCun, and K. Perlin, “Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks,” ACM Transactions on Graphics, vol. 33, no. 5, pp. 169–179, 2014.