by the synthesizer. We iterate this update several times to
improve the initial pose estimate. Again, the updater function
is implemented as a CNN. The architecture is inspired by a
Siamese network with two identical paths: one path receives the observed depth image, while the other receives the image produced by the synthesizer.
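As an illustration, the following is a minimal PyTorch sketch of such a two-path updater; the input resolution, layer sizes, and number of joints are our own assumptions, not the configuration of [4]. Whether the two paths share weights is a design choice; here they are kept separate.

    import torch
    import torch.nn as nn

    class PoseUpdater(nn.Module):
        """Siamese-style updater: two identical convolutional paths, one
        for the observed depth image and one for the synthesized image;
        their features are concatenated and regressed to a pose update."""

        def __init__(self, num_joints=14):
            super().__init__()
            def make_path():  # one convolutional path (sizes illustrative)
                return nn.Sequential(
                    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Flatten())
            self.observed_path = make_path()
            self.synthesized_path = make_path()
            feat = 64 * 29 * 29  # flattened size for 128x128 depth inputs
            self.regressor = nn.Sequential(
                nn.Linear(2 * feat, 1024), nn.ReLU(),
                nn.Linear(1024, 3 * num_joints))  # 3D update per joint

        def forward(self, observed, synthesized):
            f = torch.cat([self.observed_path(observed),
                           self.synthesized_path(synthesized)], dim=1)
            return self.regressor(f)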
Ideally, the output of the updater would bring the pose estimate to the correct pose in a single step, but this is very difficult to achieve in practice. However, the only requirement we impose on the updater is that it predicts an update resulting in a pose closer to the ground truth. The introduction of the synthesizer allows us to virtually augment the training data and to add arbitrary poses to the set of poses that the updater might encounter during testing and be asked to correct. We refer to
our paper [4] for more details.
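The resulting feedback loop is compact. The sketch below treats the two trained CNNs as callables synthesizer and updater; these names and their interfaces are hypothetical.

    def refine_pose(depth_image, initial_pose, synthesizer, updater,
                    num_iters=2):
        """Feedback loop (sketch): render a depth image from the current
        pose estimate, compare it with the observation, and apply the
        predicted correction; a few iterations refine the initial pose."""
        pose = initial_pose
        for _ in range(num_iters):
            synthesized = synthesizer(pose)  # depth image from current pose
            pose = pose + updater(depth_image, synthesized)
        return pose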
III. CREATING TRAINING DATA EFFICIENTLY
Since the hand pose estimation method presented above critically relies on labeled training frames, we now describe a method for creating such frames. Given a sequence of depth maps capturing a hand in motion, we want to estimate the 3D joint locations for each depth map with minimal annotation effort.
We start by automatically selecting a subset of depth frames, which we will refer to as reference frames, for which a user is asked to provide annotations. Our method selects these reference frames based on the appearances of the frames over the whole sequence. For this, we train an autoencoder that learns an unsupervised representation sensitive to the image nuances caused by hand articulation. We use this representation to formalize the frame selection as a submodular optimization problem.
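As an illustration of such a selection, the sketch below greedily maximizes a facility-location objective, a standard submodular choice; the feature matrix interface and this particular objective are assumptions here, and the exact formulation of [2] may differ.

    import numpy as np

    def select_reference_frames(features, k):
        """Greedily pick k frames so that every frame has a similar
        selected frame in feature space (facility-location objective).
        features: (num_frames, dim) autoencoder encodings."""
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = f @ f.T                   # cosine similarity, (n, n)
        best = np.zeros(sim.shape[0])   # best similarity covered so far
        selected = []
        for _ in range(k):
            # marginal coverage gain of adding each candidate frame
            gains = np.maximum(sim, best[None, :]).sum(axis=1) - best.sum()
            i = int(np.argmax(gains))
            selected.append(i)
            best = np.maximum(best, sim[i])
        return selected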
A user is then asked to provide the 2D reprojections of the joints in these reference frames, together with visibility information and an indication of whether each joint is closer to or farther from the camera than its parent joint in the hand skeleton tree. We use this information to automatically recover the 3D locations of the joints by solving a least-squares problem.
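One simplified way to set up such a least-squares problem parameterizes each joint by its unknown depth along the camera ray through its 2D annotation; the parameterization and residuals below are our assumptions, not the exact formulation of [2].

    import numpy as np
    from scipy.optimize import least_squares

    def recover_depths(rays, parents, bone_len, closer, z_init):
        """Recover per-joint depths z so that joint j sits at
        z[j] * rays[j], where rays[j] back-projects its 2D annotation
        (K^-1 [u, v, 1]). Residuals enforce known bone lengths and the
        annotated depth order; parents[j] is the parent joint index
        (-1 for the root). All inputs are assumed given."""
        def residuals(z):
            pts = z[:, None] * rays                  # 3D joint positions
            res = []
            for j, p in enumerate(parents):
                if p < 0:
                    continue                         # root has no bone
                res.append(np.linalg.norm(pts[j] - pts[p]) - bone_len[j])
                sign = -1.0 if closer[j] else 1.0    # annotated order
                res.append(max(0.0, -sign * (z[j] - z[p])))  # hinge term
            return np.asarray(res)
        return least_squares(residuals, z_init).x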
Next, we iteratively propagate these 3D locations from the reference frames to the remaining frames. We initialize the pose of each frame with the pose of the visually closest reference frame and optimize the local appearance together with spatial constraints. This gives us an initialization for the joint locations in all frames; however, each frame is still processed independently. We can improve the estimates further by introducing temporal constraints on the 3D locations and performing a global optimization that enforces appearance, temporal, and spatial constraints over the 3D locations of all frames, as sketched schematically below. If this inference fails for some frames, the user can still provide additional 2D reprojections; by running the global inference again, a single additional annotation typically fixes many frames. See our paper [2] for more details.
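Schematically, such a global objective can be written (in our notation; the actual energy terms of [2] differ in detail) as

    E({p_t}) = Σ_t E_app(p_t) + λ_s Σ_t E_spat(p_t) + λ_tmp Σ_t ‖p_{t+1} − p_t‖²,

where p_t collects the 3D joint locations of frame t, E_app measures agreement with the local depth appearance, E_spat encodes skeleton constraints such as bone lengths, and the last term enforces temporal smoothness.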
IV. EVALUATION
We evaluate our hand pose estimation method on the NYU Hand Pose Dataset [7], a challenging real-world benchmark for hand pose estimation. This dataset is publicly available, contains a large number of annotated samples, and exhibits a high variability of poses.
We show the benefit of using our proposed feedback loop
to increase the accuracy of the 3D joint localization in Fig. 2.
While [7] and [3] have an average 3D joint error of 21 mm
and 20 mm, respectively, our proposed method reduces the error to 16.5 mm. The initialization with our simple and efficient predictor has an error of 27 mm. When
we use a more complex initialization [3] with an error of
23 mm, we can decrease the average error to 16 mm.
[Plot for Fig. 2: distance threshold (mm) versus fraction of frames, with curves for Tompson et al., Oberweger et al., our initialization, and our method.]
Fig. 2. Quantitative evaluation of hand pose estimation. The figure shows
the fraction of frames where all joints are within a maximum distance. A
higher area under the curve denotes better results. We compare our method
to the baseline of Tompson et al. [7] and Oberweger et al. [3]. Although
our initialization is worse than both baselines, we can boost the accuracy
of the joint locations using our proposed feedback loop.
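For reference, the two measures used in this evaluation can be computed as follows; pred and gt are assumed to be arrays of predicted and ground-truth 3D joint locations in millimeters.

    import numpy as np

    def joint_error_metrics(pred, gt, thresholds):
        """Average 3D joint error, and per threshold the fraction of
        frames in which *all* joints lie within that distance of the
        ground truth (the measure plotted in Fig. 2).
        pred, gt: arrays of shape (num_frames, num_joints, 3), in mm."""
        dists = np.linalg.norm(pred - gt, axis=2)  # per-joint errors
        avg_error = dists.mean()
        worst_per_frame = dists.max(axis=1)        # worst joint per frame
        fractions = [(worst_per_frame <= t).mean() for t in thresholds]
        return avg_error, fractions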
To demonstrate our training data creation approach, we
evaluate it on a synthetic dataset, which is the only way to
have depth maps with ground truth 3D locations of the joints.
On this dataset we evaluate the accuracy of the automatically
inferred 3D locations for the reference frames. We obtain an
average 3D joint error of 3.6 mm from the 2D reprojections with visibility and depth-order annotations alone. Our method is also robust to annotation noise. We can propagate the 3D joint locations to the remaining frames, for which we achieve an average 3D joint error of 5.5 mm over the full sequence while requiring manual 2D annotations for only 10% of all frames.
REFERENCES
[1] A. Dosovitskiy, J. T. Springenberg, and T. Brox, “Learning to Generate Chairs with Convolutional Neural Networks,” in CVPR, 2015.
[2] M. Oberweger, G. Riegler, P. Wohlhart, and V. Lepetit, “Efficiently Creating 3D Training Data for Fine Hand Pose Estimation,” in CVPR, 2016.
[3] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands Deep in Deep Learning for Hand Pose Estimation,” in Proc. of CVWW, 2015.
[4] M. Oberweger, P. Wohlhart, and V. Lepetit, “Training a Feedback Loop for Hand Pose Estimation,” in ICCV, 2015.
[5] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, “Realtime and Robust Hand Tracking from Depth,” in CVPR, 2014.
[6] D. Tang, H. J. Chang, A. Tejani, and T.-K. Kim, “Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture,” in CVPR, 2014.
[7] J. Tompson, M. Stein, Y. LeCun, and K. Perlin, “Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks,” ACM Transactions on Graphics, vol. 33, no. 5, pp. 169–179, 2014.