Proceedings of the OAGM&ARW Joint Workshop - Vision, Automation and Robotics
Feedback Loop and Accurate Training Data for 3D Hand Pose Estimation†
Markus Oberweger1, Gernot Riegler1, Paul Wohlhart1 and Vincent Lepetit1,2
Abstract—In this work, we present an entirely data-driven
approach to estimating the 3D pose of a hand given a depth
image. We show that the mistakes made by a Convolutional Neural Network (CNN) trained to predict an estimate of the 3D pose can be corrected by a feedback loop of Deep Networks, which itself uses a CNN architecture.
Since this approach critically relies on a training set of
labeled frames, we further present a method for creating the
required training data. We propose a semi-automated method
for efficiently and accurately labeling each frame of a depth
video of a hand with the 3D locations of the joints.
I. INTRODUCTION
Accurate hand pose estimation is an important requirement for many Human Computer Interaction or Augmented Reality tasks. Due to the emergence of 3D sensors, there has been
an increased research interest in hand pose estimation in the
past few years [3], [6], [7]. Despite the additionally available
information from 3D sensors, it is still a very challenging
problem, because of the large number of degrees of freedom,
and because images of hands exhibit self-similarity and self-
occlusions.
A popular approach to predicting the positions of the joints is to use discriminative methods [3], [7], which are now robust and fast. To further refine the pose, such methods are
often used to initialize a complex optimization where a 3D
model of the hand is fit to the input depth data [5].
In this paper, we build upon recent work that learns to
generate images from training data [1] in order to remove the
requirement of a 3D hand model. We introduce a method that
learns to provide updates for improving the current estimate
of the pose, given the input image and the image generated
for this pose estimate. Running these steps iteratively, we
can correct the mistakes of an initial estimate provided by a
simple discriminative method.
However, this approach, amongst other recent work
(e.g. [6], [7]), has shown that a large amount of accurate
training data is required for reliable and precise pose estimation. Although accurate training data is very important, its creation has received only limited scientific interest, and authors have had to rely on ad hoc methods that are prone to errors [6]. These errors result in noisy training and test data, and make both training and evaluation uncertain.
Therefore, we developed a semi-automated approach that makes it easy to annotate sequences of articulated poses in 3D from a single depth sensor only.

†This work is based on published work in ICCV'15 [4] and CVPR'16 [2].
1The authors are with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria {oberweger,riegler,wohlhart,lepetit}@icg.tugraz.at
2The author is with the Laboratoire Bordelais de Recherche en Informatique, Université de Bordeaux, Bordeaux, France
In the next two sections, we first describe our proposed feedback loop, and then we present our method for efficiently creating training data.
II. TRAINING A FEEDBACK LOOP
We aim at estimating the pose of a hand in the form of
the 3D locations of its joints from a single depth image. We
assume that a training set of depth images labeled with the
corresponding 3D joint locations is available. An overview
of our method is shown in Fig. 1.
[Fig. 1: block diagram of the Predictor CNN (1), Synthesizer CNN (2), and Updater CNN (3), connecting the input image, pose, synthesized image, and pose update.]
Fig. 1. Overview of our method: We use a CNN (1) to predict an initial
estimate of the 3D pose given an input depth image of the hand. The pose
is then used to synthesize an image (2), which is used together with the
input depth image to derive a pose update (3). The update is applied to the
pose and the process is iterated.
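The iteration of steps (1)-(3) in Fig. 1 can be sketched as follows. This is a minimal illustration only, with placeholder functions standing in for the three learned CNNs (the joint count of 14 and the 64x64 image size are assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical stand-ins for the three learned components; in the
# paper each of these is a trained CNN.
def predictor(depth_image):
    # (1) Initial pose estimate: J x 3 joint locations (J = 14 assumed).
    return np.zeros((14, 3))

def synthesizer(pose):
    # (2) Renders a depth image for the given pose; a flat image here.
    return np.full((64, 64), pose.mean(), dtype=np.float32)

def updater(depth_image, synthesized):
    # (3) Predicts a pose update from the input/synthesized image pair.
    return np.zeros((14, 3))

def feedback_loop(depth_image, n_iters=3):
    """Iterate the steps of Fig. 1: predict once, then repeatedly
    synthesize an image for the current pose and apply an additive
    pose update derived from the image pair."""
    pose = predictor(depth_image)
    for _ in range(n_iters):
        synth = synthesizer(pose)
        pose = pose + updater(depth_image, synth)
    return pose

pose = feedback_loop(np.zeros((64, 64), dtype=np.float32))
```

The additive update means each iteration only has to learn a correction of the current estimate, which is an easier target than regressing the full pose in one shot.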
We first train a predictor to predict an initial pose estimate
in a discriminative manner given an input depth image.
We use a Convolutional Neural Network to implement this
predictor with a very simple architecture [3].
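To make the predictor's role concrete, here is a toy forward pass of such a discriminative regressor in plain numpy: one convolution with ReLU followed by a fully connected layer regressing J x 3 joint coordinates. The layer sizes, joint count, and random weights are illustrative assumptions, not the architecture of [3]:

```python
import numpy as np

def conv2d(x, w):
    # Naive 'valid' 2D convolution, single channel, for illustration only.
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def predictor_forward(depth, conv_w, fc_w):
    # Conv + ReLU feature map, then a fully connected layer that
    # regresses J * 3 joint coordinates (J = 14 assumed).
    h = np.maximum(conv2d(depth, conv_w), 0.0)
    return (h.reshape(-1) @ fc_w).reshape(14, 3)

rng = np.random.default_rng(0)
depth = rng.standard_normal((32, 32))       # toy input depth image
conv_w = rng.standard_normal((5, 5)) * 0.1  # 5x5 conv kernel
fc_w = rng.standard_normal((28 * 28, 42)) * 0.01
pose = predictor_forward(depth, conv_w, fc_w)
```

A real implementation would stack several convolution and pooling layers and train the weights by regression on the labeled depth images.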
In practice, the initial pose is never perfect, and following
the motivation provided in the introduction, we introduce
a hand model learned from the training data. This CNN-
based model, referred to as synthesizer, can synthesize the
depth image corresponding to a given pose. The network
architecture is strongly inspired by [1]. It predicts an initial latent representation of feature maps, followed by subsequent unpooling and convolution layers to generate a depth image.
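The unpooling-then-convolution decoder can be sketched as below: each stage doubles the spatial resolution of the latent feature map and then smooths it with a convolution. This is a schematic single-channel sketch under assumed sizes (8x8 latent map, 3x3 kernels), not the network of [1]:

```python
import numpy as np

def unpool(x, factor=2):
    # Nearest-neighbour unpooling: repeat each latent cell factor x factor.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def conv2d_same(x, w):
    # 'same' convolution via zero padding (single channel, odd kernel).
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def synthesize(latent, w1, w2):
    # Two unpool + conv stages map an 8x8 latent map to a 32x32 depth image.
    h = np.tanh(conv2d_same(unpool(latent), w1))
    return conv2d_same(unpool(h), w2)

rng = np.random.default_rng(1)
depth = synthesize(rng.standard_normal((8, 8)),
                   rng.standard_normal((3, 3)) * 0.1,
                   rng.standard_normal((3, 3)) * 0.1)
```

In the actual synthesizer the latent representation is itself predicted from the input pose, so the whole pipeline from pose to depth image is differentiable and trainable end-to-end.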
Further, we introduce a third function that we call the updater. It learns to predict updates to improve the pose estimate, given the input image and the image produced by the synthesizer.
- Title
- Proceedings of the OAGM&ARW Joint Workshop
- Subtitle
- Vision, Automation and Robotics
- Authors
- Peter M. Roth
- Markus Vincze
- Wilfried Kubinger
- Andreas Müller
- Bernhard Blaschitz
- Svorad Stolc
- Publisher
- Verlag der Technischen Universität Graz
- Place
- Wien
- Date
- 2017
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-524-9
- Dimensions
- 21.0 x 29.7 cm
- Pages
- 188
- Keywords
- Conference proceedings
- Categories
- International
- Conference proceedings