Page - 88 - in Proceedings of the OAGM&ARW Joint Workshop - Vision, Automation and Robotics
which are collected with a stereo camera widely used
in industrial applications.
The remainder of the paper is organized as follows. In
Section II we provide an overview of related work. Our
proposed approach for Deep CNN based pose estimation
is introduced in Section III. In Section IV, we present
experiments with our trained pose estimator with test images
containing real bananas. We conclude the paper with final
remarks and plans for further work in Section V.
II. RELATED WORKS
Object detection and pose estimation are essential
for robots and industrial applications, especially for
picking and placing tasks. The exact 6D pose information
of an object is required to decide about grasp points for
picking and to define proper locations for placing. Therefore,
pose estimation in 3D space has received a lot of attention,
with approaches that predominantly include feature
matching based methods and, more recently, convolutional neural
network based methods. State of the art methods are able to
perform classification of objects and pose estimation at the
same time [1], [18]. In the brief review below we focus on
feature based approaches with a local or global descriptor
and CNN based approaches.
A. Feature based approaches
Typical feature based approaches extract features from training
and test data, match correspondences, and calculate a single
transformation from a trained model to the target scene.
Features for the 3D domain are
designed to provide a generalized representation of the object
shape using local attributes. One popular example is SHOT
developed by Tombari et al. [17]. In [1], Aldoma et al. developed
an approach which uses various features to generate
possible hypotheses and selects the hypotheses which minimize
a cost function in order to remove false positives. These
feature based pose estimation approaches generally compute
rigid transformations, which implicitly assumes that training
models and target objects have the same shape. Wohlkinger
et al. [19] use CAD models to train global features to
recognize real objects. This method shows robustness to
shape variations, but it needs a large number of template
images.
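The rigid-transformation step of such pipelines can be illustrated with a minimal sketch. It assumes that correspondences between model and scene points are already known and uses the SVD-based least-squares solution (often called the Kabsch method); this is a common choice and not necessarily the solver used in [1] or [17]. The function name `estimate_rigid_transform` and the sample points are hypothetical.

```python
import numpy as np

def estimate_rigid_transform(model_pts, scene_pts):
    """Least-squares R, t such that scene ~= R @ model + t.

    Assumes row i of model_pts corresponds to row i of scene_pts.
    """
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    model_c = model_pts.mean(axis=0)
    scene_c = scene_pts.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (model_pts - model_c).T @ (scene_pts - scene_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the solution
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = scene_c - R @ model_c
    return R, t

# Hypothetical example: rotate 90 degrees about z, then translate
model = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
t_true = np.array([0.5, -0.2, 1.0])
scene = model @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform(model, scene)
```

Because a single rotation and translation is fitted to all correspondences, any shape difference between the model and the target object shows up as residual error, which is the limitation noted above.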
B. CNN based approaches
To apply recent convolutional neural networks, which have been
successfully used in the 2D image domain, to the 3D domain,
where training data is scarce, researchers have tried
to use pre-trained CNNs as feature descriptors and trained
additional classifiers for recognition and linear regression
for pose estimation [16]. However, [16] constrains object poses
to in-plane rotation on the table, i.e. a single degree
of freedom. Generation of synthetic data is an option for
training a CNN with depth images as input. [3] uses 3D
CAD models to train a typical CNN structure and
obtains a descriptor for single-channel depth images.
This model was used for object classification tasks. Also, [3]
considers object classification tasks, but this approach
generates depth images from CAD models containing both
varied viewpoints and randomly morphed shapes. CNN
based 6D pose estimation is also described in [18], [5]. Both
use pair-wise training to guide intermediate features to have
larger distances for larger pose deviations. They design a
small CNN, which has only two convolutional layers,
so that it can be trained with a small number of training
examples. In contrast to these approaches, we use a deep
CNN which has five convolutional layers and pre-trained
weights computed from a large number of 2D images. However,
we draw on their pairwise training approaches to obtain robust
pose estimation performance.
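The idea behind pairwise training, that feature distance should grow with pose deviation, can be sketched as a simple loss on one pair of samples. This is only an illustration of the principle under our own assumptions; it is not the loss used in [18] or [5], whose exact form is not given here. The names `euclidean` and `pairwise_loss` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_loss(feat_a, feat_b, pose_dist):
    """Squared gap between feature distance and pose distance.

    Assumption of this sketch: the loss is zero when the distance
    between the two feature vectors equals the pose deviation, so
    larger pose deviations pull the features further apart.
    """
    return (euclidean(feat_a, feat_b) - pose_dist) ** 2

# Hypothetical pair: feature distance is 5, pose deviation is 5
matched = pairwise_loss((0.0, 0.0), (3.0, 4.0), 5.0)
# Identical features but a pose deviation of 1 are penalized
collapsed = pairwise_loss((0.0, 0.0), (0.0, 0.0), 1.0)
```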
III. METHOD
In the following we provide a detailed description
of the proposed pose estimation approach, which
consists of a deep CNN, generation of synthetic images and
a pose refinement step for the final result, as shown in Fig. 1.
To be able to exploit the structure and pre-trained weights
of well-established and tested CNNs taking three channels
of a 2D color image as input, we transform single-channel
depth images to three-channel color images. Finally, the pose
estimation procedure at test time is described, including the
refinement step to minimize the translational error.
A. Deep CNN for pose estimation with depth images
We employ AlexNet, which has proven results for 2D
image classification tasks. The only difference is the last
fully connected layer, which in our case has only four output
channels for estimating the rotational transformation as a
quaternion, instead of a thousand channels for classification.
In addition, the final output is passed through a tanh function
to provide normalized results between -1 and 1. We
use a quaternion representation instead of Euler angles with
three parameters because of the non-linearity and periodicity
of Euler angles. For example, the numerical difference between 0 and
359 degrees is large, although the difference between the angles
is small. In contrast, the quaternion representation allows the
pose difference to be calculated as the distance between the
components of the quaternion values [12]. Most state of the art
CNN models, including AlexNet, use a 2D color image as
input. The state of the art for CNNs applied to depth images is to
convert the single-channel depth image to a three-channel
color coded image [6]. Among the possible color
coding methods, directly mapping each axis component of
a surface normal to a separate image channel has shown
superior performance [6]. Optionally, we use the depth value
to scale the values of each pixel as described in (1) and (2).
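To make the Euler-angle wrap-around argument above concrete, the following minimal sketch compares rotations of 0 and 359 degrees about a single axis. Since q and -q encode the same rotation, the sketch takes the smaller of the two component-wise distances; this sign handling is an assumption of the sketch and is not stated in the text. The function names are hypothetical.

```python
import math

def z_rotation_quaternion(angle_deg):
    """Unit quaternion (w, x, y, z) for a rotation about the z axis."""
    half = math.radians(angle_deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

def quaternion_distance(q1, q2):
    """Component-wise Euclidean distance between two unit quaternions.

    Assumption: q and -q describe the same rotation, so the smaller
    of |q1 - q2| and |q1 + q2| is returned.
    """
    d_minus = math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))
    d_plus = math.sqrt(sum((a + b) ** 2 for a, b in zip(q1, q2)))
    return min(d_minus, d_plus)

euler_gap = abs(359.0 - 0.0)  # numerically large, although the poses are 1 degree apart
quat_gap = quaternion_distance(z_rotation_quaternion(0.0),
                               z_rotation_quaternion(359.0))
half_turn_gap = quaternion_distance(z_rotation_quaternion(0.0),
                                    z_rotation_quaternion(180.0))
```

In this sketch the quaternion distance for the 0/359 degree pair stays small, while a genuine half turn yields a much larger value.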
$$I_D = 1.0 - \frac{P_z - \mathrm{min}_z + \delta}{\mathrm{max}_z - \mathrm{min}_z + 2\delta} \tag{1}$$
$$P_{data} = I_D \, [N_x \; N_y \; N_z] \tag{2}$$
where $P_{data}$ describes a single data point represented in the
three channels. $I_D$ is the scaled depth value and the remaining
three values $N_x$, $N_y$ and $N_z$ are the individual axis of the
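Equations (1) and (2) can be sketched for a single data point as follows. The sketch reads (1) as a fraction with numerator P_z - min_z + delta and denominator max_z - min_z + 2 delta, and takes N_x, N_y and N_z to be the components of the surface normal mentioned above. The function names and the example values for min_z, max_z and delta are hypothetical.

```python
def scaled_depth(p_z, min_z, max_z, delta):
    """Eq. (1): scaled depth value I_D, larger for closer points.

    Assumed reading of (1): the margin delta keeps the result
    strictly between 0 and 1 for min_z <= p_z <= max_z.
    """
    return 1.0 - (p_z - min_z + delta) / (max_z - min_z + 2.0 * delta)

def color_code_point(p_z, normal, min_z, max_z, delta):
    """Eq. (2): scale the normal components (N_x, N_y, N_z) by I_D."""
    i_d = scaled_depth(p_z, min_z, max_z, delta)
    return tuple(i_d * n for n in normal)

# Hypothetical depth range of 0.5 to 1.5 with delta = 0.1
nearest = scaled_depth(0.5, 0.5, 1.5, 0.1)   # closest point in the range
farthest = scaled_depth(1.5, 0.5, 1.5, 0.1)  # most distant point in the range
pixel = color_code_point(1.0, (0.0, 0.0, 1.0), 0.5, 1.5, 0.1)
```

With these example values, the nearest point receives the largest scale factor and the farthest the smallest, so closer surfaces appear brighter in the color coded image.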
- Title
- Proceedings of the OAGM&ARW Joint Workshop
- Subtitle
- Vision, Automation and Robotics
- Authors
- Peter M. Roth
- Markus Vincze
- Wilfried Kubinger
- Andreas Müller
- Bernhard Blaschitz
- Svorad Stolc
- Publisher
- Verlag der Technischen Universität Graz
- Location
- Wien
- Date
- 2017
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-524-9
- Size
- 21.0 x 29.7 cm
- Pages
- 188
- Keywords
- Tagungsband
- Categories
- International
- Tagungsbände