surface normal. The depth value I_D is normalized using the
maximum and minimum values of the point cloud with a
margin δ to avoid zero values. Furthermore, the normalized
depth value is subtracted from one so that closer points
receive higher values. Finally, every point is projected to a
pixel in the 2D image.
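A minimal numpy sketch of this normalization, assuming the point cloud is given as an (N, 3) array; the placement of the margin δ in the denominator is an assumption, as the exact formula is not given, and the camera projection itself is omitted:

```python
import numpy as np

def normalize_depth(points, delta=0.05):
    """Normalize the z values of an (N, 3) point cloud and invert them so
    that closer points receive higher values; the margin delta (assumed
    placement) keeps the inverted value strictly above zero."""
    z = points[:, 2]
    z_min, z_max = z.min(), z.max()
    z_norm = (z - z_min) / (z_max - z_min + delta)  # in [0, 1)
    return 1.0 - z_norm                             # closer points -> higher values, never zero
```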
B. Generation of synthetic training data
To train a CNN, a large number of training examples
is required, covering every possible viewpoint of the
object. We developed a fully autonomous data generation
framework, which is able to cover all possible poses and
shape variations. A 3D CAD model, e.g. from a public web
resource or a reconstructed 3D scanned model, can be used
as a reference model for this framework. The first step is
to convert the CAD model to a point cloud format and to
transform the reference coordinate system to the centroid of
the model. After that, rotations for each axis are defined with
5 degree increments, which results in about 373K possible
poses. In addition to the pose transformation, the shape
transformation, i.e., scaling and shear, is also defined for
each pose. Scale and shear factors for each axis are randomly
selected within a specified range in order to cover possible
variations of the object. The reference model is transformed
with the defined transformation matrix. Then it is placed at
a distance to the camera that is typically found in the pose
estimation scenario. Self-occluded points
are removed using a standard ray tracing of a camera view.
Additionally, a randomly placed 2D rectangle is used to
remove small parts of the object, in order to simulate partial
occlusions and segmentation errors. Finally, the remaining
points are used to render a depth image and non-object
points or background points are filled with mean values
(e.g., P_data = [0.5, 0.5, 0.5] in case the normalized values are
within [0, 1]). The finally generated image is stored together
with the pose transformation as quaternions, i.e., in the same
format the deep CNN provides.
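The following sketch outlines the pose and shape sampling described above; it assumes scipy is available, scale_range, shear_range and the rectangle size are hypothetical parameters, and the ray-traced self-occlusion removal as well as the actual depth rendering are left out:

```python
import itertools
import numpy as np
from scipy.spatial.transform import Rotation

def generate_pose_samples(model_points, step_deg=5,
                          scale_range=(0.9, 1.1), shear_range=(-0.05, 0.05)):
    """Yield transformed copies of the reference point cloud together with
    the ground-truth rotation as a quaternion. Self-occlusion removal and
    depth rendering are omitted in this sketch."""
    angles = np.arange(0, 360, step_deg)
    for rx, ry, rz in itertools.product(angles, repeat=3):   # 72^3 ~ 373K poses
        rot = Rotation.from_euler('xyz', [rx, ry, rz], degrees=True)
        # random scale and shear per axis to cover shape variations
        scale = np.diag(np.random.uniform(*scale_range, size=3))
        shear = np.eye(3)
        shear[np.triu_indices(3, 1)] = np.random.uniform(*shear_range, size=3)
        pts = (rot.as_matrix() @ scale @ shear @ model_points.T).T
        yield pts, rot.as_quat()   # pose stored as quaternion, matching the CNN output

def cut_random_rectangle(depth_img, max_size=0.3):
    """Simulate partial occlusion and segmentation errors: overwrite a random
    rectangle with the background fill value 0.5."""
    h, w = depth_img.shape
    rh = np.random.randint(1, int(h * max_size))
    rw = np.random.randint(1, int(w * max_size))
    y, x = np.random.randint(0, h - rh), np.random.randint(0, w - rw)
    depth_img[y:y + rh, x:x + rw] = 0.5
    return depth_img
```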
C. Pairwise training for robust pose estimation
As proposed in [18], [5] our network is trained with
input pairs to minimize feature distances of similar poses
and maximize feature distances of different poses. The pose
difference of a training pair is defined as the Euclidean
distance between the quaternion components. Hence, a pair
of training examples with a pose distance less than ρ_s is
regarded as a positive pair, and if the distance is larger than
ρ_d it is regarded as a negative example (cf. (3)).
\omega =
\begin{cases}
1, & \text{if } \|q_{anchor} - q_{pair}\|_2 < \rho_s, \\
0, & \text{if } \|q_{anchor} - q_{pair}\|_2 > \rho_d.
\end{cases}
\qquad (3)
ω is given to the loss function to determine whether the
current pair of images is positive or not, as described in (6).
q_anchor and q_pair denote the four-dimensional vectors of the
respective pose transformations in quaternion representation.
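A small helper illustrating the labelling rule of (3); the threshold values ρ_s and ρ_d and the handling of distances between the two thresholds (here: the pair is skipped) are assumptions:

```python
import numpy as np

def pair_label(q_anchor, q_pair, rho_s=0.1, rho_d=0.5):
    """Return omega for a training pair following Eq. (3):
    1 for a positive pair (similar poses), 0 for a negative pair."""
    dist = np.linalg.norm(np.asarray(q_anchor) - np.asarray(q_pair))
    if dist < rho_s:
        return 1   # positive pair
    if dist > rho_d:
        return 0   # negative pair
    return None    # ambiguous distance: not used when filling the batch (assumption)
```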
The whole input batch for each iteration is filled with
positive and negative pairs.

Fig. 2: Streamlines for pairwise training using shared weights
for the CNNs. The outputs of both streamlines, i.e., the 7th
layers and the last layers, are used to compute the loss for
the annotated training pairs.

As described in Fig. 2, a data
pair is fed into the CNNs with the same weights and
computed separately. To calculate the loss in each iteration,
we use the output of the seventh fully connected layer with
4096 dimensions and the last fully connected layer with 4
dimensions, which is also used to predict the rotation
as a quaternion.
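A reduced PyTorch sketch of the two streamlines with shared weights; the convolutional backbone is replaced by a single placeholder layer, and only the 4096-d feature output and the 4-d quaternion output mentioned above are modelled, with the input size being an assumed value:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Placeholder for the deep CNN: exposes the 4096-d feature layer
    (7th fully connected layer) and the 4-d quaternion layer."""
    def __init__(self, in_dim=64 * 64, feat_dim=4096):
        super().__init__()
        self.fc7 = nn.Linear(in_dim, feat_dim)   # stands in for the backbone + FC7
        self.fc_quat = nn.Linear(feat_dim, 4)    # last layer: quaternion output

    def forward(self, x):
        f = torch.relu(self.fc7(x.flatten(1)))
        return f, self.fc_quat(f)

net = PoseNet()                              # one set of weights
anchor_imgs = torch.rand(8, 1, 64, 64)       # dummy depth-image batches
pair_imgs = torch.rand(8, 1, 64, 64)
f_a, q_a = net(anchor_imgs)                  # streamline 1
f_p, q_p = net(pair_imgs)                    # streamline 2: same module, shared weights
```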
The loss function L for training can be separated into two
parts, as described in (4):

L = l_r + l_f \qquad (4)
For N batch images per iteration, l_r represents a regression
error between the annotated pose and the estimated pose,
defined as the Euclidean distance (cf. (5)), while l_f in (6)
represents a contrastive loss that guides the features to have a
smaller distance for similar poses and a larger distance for
different poses.
l_r = \frac{1}{2N} \sum_{n=1}^{N} \|q_{est} - q_{exact}\|_2^2 \qquad (5)

l_f = \frac{1}{2N} \sum_{n=1}^{N} \omega d^2 + 2(1-\omega)\max(1-d, 0)^2 \qquad (6)
d = ||f_anchor − f_pair||_2 denotes the Euclidean distance between
the features computed from the seventh fully connected layer.
ω, the parameter that classifies training pairs as positive or
negative examples, i.e., with similar or different poses, is set in the
data generation process. This contrastive loss has generally
been used to train Siamese networks, which compare pairs
of images [8]. In each iteration weights of the CNNs are
updated to minimize the loss function using a stochastic
gradient descent (SGD) solver. For this, l_r is used to update
all weights of the CNN, while l_f affects all weights except
those of the last fully connected layer.
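A sketch of the combined loss of (4), (5) and (6) under these assumptions; the inputs are taken to be batched tensors, the factor 2 in the second term follows the formula as printed, and since l_f depends only on the 7th-layer features it naturally produces no gradient for the last fully connected layer:

```python
import torch

def pairwise_loss(q_est, q_exact, f_anchor, f_pair, omega):
    """Combined loss L = l_r + l_f (Eq. 4) for a batch of N pairs.
    q_*: (N, 4) quaternions, f_*: (N, 4096) features, omega: (N,) pair labels."""
    n = q_est.shape[0]
    # Eq. (5): squared Euclidean regression error on the quaternions
    l_r = (q_est - q_exact).pow(2).sum(dim=1).sum() / (2 * n)
    # Eq. (6): contrastive loss on the features of the 7th fully connected layer
    d = torch.norm(f_anchor - f_pair, dim=1)
    l_f = (omega * d.pow(2)
           + 2 * (1 - omega) * torch.clamp(1 - d, min=0).pow(2)).sum() / (2 * n)
    return l_r + l_f
```

The loss can then be minimized with a standard SGD optimizer over the shared weights, e.g. torch.optim.SGD(net.parameters(), lr=1e-3), where the learning rate is an assumed placeholder value.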
D. Estimation procedure
In contrast to the training, for pose estimation only a
single streamline with one deep CNN is used. The last
fully connected layer directly predicts the pose represented
as a quaternion. Given a depth image or a point cloud, we classify