Proceedings of the OAGM&ARW Joint Workshop - Vision, Automation and Robotics
Feedback Loop and Accurate Training Data for 3D Hand Pose Estimation†
Markus Oberweger1, Gernot Riegler1, Paul Wohlhart1 and Vincent Lepetit1,2
Abstract—In this work, we present an entirely data-driven
approach to estimating the 3D pose of a hand given a depth
image. We show that the mistakes made by a Convolutional Neural Network (CNN) trained to predict an estimate of the 3D pose can be corrected by a feedback loop of Deep Networks, which itself uses a CNN architecture.
Since this approach critically relies on a training set of
labeled frames, we further present a method for creating the
required training data. We propose a semi-automated method
for efficiently and accurately labeling each frame of a depth
video of a hand with the 3D locations of the joints.
I. INTRODUCTION
Accurate hand pose estimation is an important requirement for many Human Computer Interaction or Augmented Reality tasks. Due to the emergence of 3D sensors, there has been
an increased research interest in hand pose estimation in the
past few years [3], [6], [7]. Despite the additionally available
information from 3D sensors, it is still a very challenging
problem, because of the large number of degrees of freedom,
and because images of hands exhibit self-similarity and self-
occlusions.
A popular approach to predicting the positions of the joints is to use discriminative methods [3], [7], which are now robust and fast. To further refine the pose, such methods are
often used to initialize a complex optimization where a 3D
model of the hand is fit to the input depth data [5].
In this paper, we build upon recent work that learns to
generate images from training data [1] in order to remove the
requirement of a 3D hand model. We introduce a method that
learns to provide updates for improving the current estimate
of the pose, given the input image and the image generated
for this pose estimate. Running these steps iteratively, we
can correct the mistakes of an initial estimate provided by a
simple discriminative method.
However, this approach, amongst other recent work
(e.g. [6], [7]), has shown that a large amount of accurate
training data is required for reliable and precise pose estimation. Although accurate training data is very important, its creation has received only limited scientific interest, and authors have had to rely on ad hoc methods that are prone to errors [6]. These errors result in noisy training and test data, and make both training and evaluation uncertain.
Therefore, we developed a semi-automated approach that makes it easy to annotate sequences of articulated poses in 3D from a single depth sensor only.

†This work is based on published work in ICCV'15 [4] and CVPR'16 [2].
1The authors are with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria {oberweger,riegler,wohlhart,lepetit}@icg.tugraz.at
2The author is with the Laboratoire Bordelais de Recherche en Informatique, Université de Bordeaux, Bordeaux, France
In the next two sections, we first describe our proposed feedback loop, and then we present our method for efficiently creating training data.
II. TRAINING A FEEDBACK LOOP
We aim at estimating the pose of a hand in the form of
the 3D locations of its joints from a single depth image. We
assume that a training set of depth images labeled with the
corresponding 3D joint locations is available. An overview
of our method is shown in Fig. 1.
[Fig. 1: block diagram of the Predictor CNN (1), Synthesizer CNN (2), and Updater CNN (3), connecting the input image, pose, synthesized image, and pose update.]
Fig. 1. Overview of our method: We use a CNN (1) to predict an initial
estimate of the 3D pose given an input depth image of the hand. The pose
is then used to synthesize an image (2), which is used together with the
input depth image to derive a pose update (3). The update is applied to the
pose and the process is iterated.
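The iteration of steps (1)-(3) in Fig. 1 can be sketched as follows. This is a minimal illustration only, with placeholder functions standing in for the three learned CNNs (the joint count of 14 and the 64x64 image size are assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical stand-ins for the three learned components; in the
# paper each of these is a trained CNN.
def predictor(depth_image):
    # (1) Initial pose estimate: J x 3 joint locations (J = 14 assumed).
    return np.zeros((14, 3))

def synthesizer(pose):
    # (2) Renders a depth image for the given pose; a flat image here.
    return np.full((64, 64), pose.mean(), dtype=np.float32)

def updater(depth_image, synthesized):
    # (3) Predicts a pose update from the input/synthesized image pair.
    return np.zeros((14, 3))

def feedback_loop(depth_image, n_iters=3):
    """Iterate the steps of Fig. 1: predict once, then repeatedly
    synthesize an image for the current pose and apply an additive
    pose update derived from the image pair."""
    pose = predictor(depth_image)
    for _ in range(n_iters):
        synth = synthesizer(pose)
        pose = pose + updater(depth_image, synth)
    return pose

pose = feedback_loop(np.zeros((64, 64), dtype=np.float32))
```

The additive update means each iteration only has to learn a correction of the current estimate, which is an easier target than regressing the full pose in one shot.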
We first train a predictor to predict an initial pose estimate
in a discriminative manner given an input depth image.
We use a Convolutional Neural Network to implement this
predictor with a very simple architecture [3].
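To make the predictor's role concrete, here is a toy forward pass of such a discriminative regressor in plain numpy: one convolution with ReLU followed by a fully connected layer regressing J x 3 joint coordinates. The layer sizes, joint count, and random weights are illustrative assumptions, not the architecture of [3]:

```python
import numpy as np

def conv2d(x, w):
    # Naive 'valid' 2D convolution, single channel, for illustration only.
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def predictor_forward(depth, conv_w, fc_w):
    # Conv + ReLU feature map, then a fully connected layer that
    # regresses J * 3 joint coordinates (J = 14 assumed).
    h = np.maximum(conv2d(depth, conv_w), 0.0)
    return (h.reshape(-1) @ fc_w).reshape(14, 3)

rng = np.random.default_rng(0)
depth = rng.standard_normal((32, 32))       # toy input depth image
conv_w = rng.standard_normal((5, 5)) * 0.1  # 5x5 conv kernel
fc_w = rng.standard_normal((28 * 28, 42)) * 0.01
pose = predictor_forward(depth, conv_w, fc_w)
```

A real implementation would stack several convolution and pooling layers and train the weights by regression on the labeled depth images.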
In practice, the initial pose is never perfect, and following
the motivation provided in the introduction, we introduce
a hand model learned from the training data. This CNN-
based model, referred to as synthesizer, can synthesize the
depth image corresponding to a given pose. The network
architecture is strongly inspired by [1]. It predicts an initial latent representation of feature maps, followed by subsequent unpooling and convolution layers to generate a depth image.
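The unpooling-then-convolution decoder can be sketched as below: each stage doubles the spatial resolution of the latent feature map and then smooths it with a convolution. This is a schematic single-channel sketch under assumed sizes (8x8 latent map, 3x3 kernels), not the network of [1]:

```python
import numpy as np

def unpool(x, factor=2):
    # Nearest-neighbour unpooling: repeat each latent cell factor x factor.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def conv2d_same(x, w):
    # 'same' convolution via zero padding (single channel, odd kernel).
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def synthesize(latent, w1, w2):
    # Two unpool + conv stages map an 8x8 latent map to a 32x32 depth image.
    h = np.tanh(conv2d_same(unpool(latent), w1))
    return conv2d_same(unpool(h), w2)

rng = np.random.default_rng(1)
depth = synthesize(rng.standard_normal((8, 8)),
                   rng.standard_normal((3, 3)) * 0.1,
                   rng.standard_normal((3, 3)) * 0.1)
```

In the actual synthesizer the latent representation is itself predicted from the input pose, so the whole pipeline from pose to depth image is differentiable and trainable end-to-end.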
Further, we introduce a third function that we call the updater. It learns to predict updates to improve the pose estimate, given the input image and the image produced by the synthesizer.
- Title
- Proceedings of the OAGM&ARW Joint Workshop
- Subtitle
- Vision, Automation and Robotics
- Authors
- Peter M. Roth
- Markus Vincze
- Wilfried Kubinger
- Andreas Müller
- Bernhard Blaschitz
- Svorad Stolc
- Publisher
- Verlag der Technischen Universität Graz
- Place
- Wien
- Date
- 2017
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-524-9
- Dimensions
- 21.0 x 29.7 cm
- Pages
- 188
- Keywords
- Conference proceedings
- Categories
- International
- Conference proceedings