surface normal. The depth value I_D is normalized using the
maximum and minimum values of the point cloud with a
margin δ to avoid zero values. Furthermore, the normalized
depth value is subtracted from one so that closer points
receive higher values. Finally, every point is projected to a
pixel in the 2D image.
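A minimal numpy sketch of this normalization, assuming the point cloud is given as an (N, 3) array; the placement of the margin δ in the denominator is an assumption, as the exact formula is not given, and the camera projection itself is omitted:

```python
import numpy as np

def normalize_depth(points, delta=0.05):
    """Normalize the z values of an (N, 3) point cloud and invert them so
    that closer points receive higher values; the margin delta (assumed
    placement) keeps the inverted value strictly above zero."""
    z = points[:, 2]
    z_min, z_max = z.min(), z.max()
    z_norm = (z - z_min) / (z_max - z_min + delta)  # in [0, 1)
    return 1.0 - z_norm                             # closer points -> higher values, never zero
```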
B. Generation of synthetic training data
To train a CNN, a large number of training examples
is required, covering every possible viewpoint of the
object. We developed a fully autonomous data generation
framework, which is able to cover all possible poses and
shape variations. A 3D CAD model, e.g. from a public web
resource or a reconstructed 3D scanned model, can be used
as a reference model for this framework. The first step is
to convert the CAD model to a point cloud format and to
transform the reference coordinate system to the centroid of
the model. After that, rotations for each axis are defined with
5 degree increments, which results in about 373K possible
poses. In addition to the pose transformation, the shape
transformation, i.e., scaling and shear, is also defined for
each pose. Scale and shear factors for each axis are randomly
selected within a specified range in order to cover possible
variations of the object. The reference model is transformed
with the defined transformation matrix. Then it is placed at
a distance to the camera that is typically found in the pose
estimation scenario. Self-occluded points
are removed using a standard ray tracing of a camera view.
Additionally, a randomly placed 2D rectangle is used to
remove small parts of the object, in order to simulate partial
occlusions and segmentation errors. Finally, the remaining
points are used to render a depth image and non-object
points or background points are filled with mean values
(e.g., P_data = [0.5, 0.5, 0.5] in case the normalized values are
within [0, 1]). The finally generated image is stored together
with the pose transformation as quaternions, i.e., in the same
format the deep CNN provides.
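The following sketch outlines the pose and shape sampling described above; it assumes scipy is available, scale_range, shear_range and the rectangle size are hypothetical parameters, and the ray-traced self-occlusion removal as well as the actual depth rendering are left out:

```python
import itertools
import numpy as np
from scipy.spatial.transform import Rotation

def generate_pose_samples(model_points, step_deg=5,
                          scale_range=(0.9, 1.1), shear_range=(-0.05, 0.05)):
    """Yield transformed copies of the reference point cloud together with
    the ground-truth rotation as a quaternion. Self-occlusion removal and
    depth rendering are omitted in this sketch."""
    angles = np.arange(0, 360, step_deg)
    for rx, ry, rz in itertools.product(angles, repeat=3):   # 72^3 ~ 373K poses
        rot = Rotation.from_euler('xyz', [rx, ry, rz], degrees=True)
        # random scale and shear per axis to cover shape variations
        scale = np.diag(np.random.uniform(*scale_range, size=3))
        shear = np.eye(3)
        shear[np.triu_indices(3, 1)] = np.random.uniform(*shear_range, size=3)
        pts = (rot.as_matrix() @ scale @ shear @ model_points.T).T
        yield pts, rot.as_quat()   # pose stored as quaternion, matching the CNN output

def cut_random_rectangle(depth_img, max_size=0.3):
    """Simulate partial occlusion and segmentation errors: overwrite a random
    rectangle with the background fill value 0.5."""
    h, w = depth_img.shape
    rh = np.random.randint(1, int(h * max_size))
    rw = np.random.randint(1, int(w * max_size))
    y, x = np.random.randint(0, h - rh), np.random.randint(0, w - rw)
    depth_img[y:y + rh, x:x + rw] = 0.5
    return depth_img
```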
C. Pairwise training for robust pose estimation
As proposed in [18], [5] our network is trained with
input pairs to minimize feature distances of similar poses
and maximize feature distances of different poses. The pose
difference of a training pair is defined as the Euclidean
distance between the quaternion components. Hence, a pair
of training examples with a pose distance less than ρ_s is
regarded as a positive pair, and if the distance is larger than
ρ_d it is regarded as a negative example (cf. (3)).
\omega =
\begin{cases}
1, & \text{if } \|q_{anchor} - q_{pair}\|_2 < \rho_s, \\
0, & \text{if } \|q_{anchor} - q_{pair}\|_2 > \rho_d.
\end{cases}
\qquad (3)
ω is given to the loss function to determine whether the
current pair of images is positive or not, as described in (6).
q_anchor and q_pair denote the four-dimensional vectors of the
respective pose transformations in quaternion representation.
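A small helper illustrating the labelling rule of (3); the threshold values ρ_s and ρ_d and the handling of distances between the two thresholds (here: the pair is skipped) are assumptions:

```python
import numpy as np

def pair_label(q_anchor, q_pair, rho_s=0.1, rho_d=0.5):
    """Return omega for a training pair following Eq. (3):
    1 for a positive pair (similar poses), 0 for a negative pair."""
    dist = np.linalg.norm(np.asarray(q_anchor) - np.asarray(q_pair))
    if dist < rho_s:
        return 1   # positive pair
    if dist > rho_d:
        return 0   # negative pair
    return None    # ambiguous distance: not used when filling the batch (assumption)
```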
The whole input batch for each iteration is filled with
positive and negative pairs.

Fig. 2: Streamlines for pairwise training using shared weights
for the CNNs. The outputs of both streamlines, i.e., the 7th
layers and the last layers, are used to compute the loss for
the annotated training pairs.

As described in Fig. 2, a data
pair is fed into the CNNs with the same weights and
computed separately. To calculate the loss in each iteration,
we use the output of the seventh fully connected layer with
4096 dimensions and the last fully connected layer with 4
dimensions, which is also used to predict the rotation
as a quaternion.
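A reduced PyTorch sketch of the two streamlines with shared weights; the convolutional backbone is replaced by a single placeholder layer, and only the 4096-d feature output and the 4-d quaternion output mentioned above are modelled, with the input size being an assumed value:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Placeholder for the deep CNN: exposes the 4096-d feature layer
    (7th fully connected layer) and the 4-d quaternion layer."""
    def __init__(self, in_dim=64 * 64, feat_dim=4096):
        super().__init__()
        self.fc7 = nn.Linear(in_dim, feat_dim)   # stands in for the backbone + FC7
        self.fc_quat = nn.Linear(feat_dim, 4)    # last layer: quaternion output

    def forward(self, x):
        f = torch.relu(self.fc7(x.flatten(1)))
        return f, self.fc_quat(f)

net = PoseNet()                              # one set of weights
anchor_imgs = torch.rand(8, 1, 64, 64)       # dummy depth-image batches
pair_imgs = torch.rand(8, 1, 64, 64)
f_a, q_a = net(anchor_imgs)                  # streamline 1
f_p, q_p = net(pair_imgs)                    # streamline 2: same module, shared weights
```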
The loss function L for training can be separated into two
parts, as described in (4):

L = l_r + l_f \qquad (4)
For N batch images per iteration, l_r represents a regression
error between the annotated pose and the estimated pose,
defined as the Euclidean distance (cf. (5)), while l_f in (6)
represents a contrastive loss that guides the features to have a
smaller distance for similar poses and a larger distance for
different poses.
l_r = \frac{1}{2N} \sum_{n=1}^{N} \|q_{est} - q_{exact}\|_2^2 \qquad (5)

l_f = \frac{1}{2N} \sum_{n=1}^{N} \omega d^2 + 2(1-\omega)\max(1-d, 0)^2 \qquad (6)
d = ||f_anchor − f_pair||_2 denotes the Euclidean distance between
the features computed from the seventh fully connected layer.
ω, the parameter that classifies training pairs as positive or
negative examples, i.e., with similar or different poses, is set in the
data generation process. This contrastive loss has generally
been used to train Siamese networks, which compare pairs
of images [8]. In each iteration weights of the CNNs are
updated to minimize the loss function using a stochastic
gradient descent (SGD) solver. For this, l_r is used to update
all weights of the CNN, while l_f affects all weights except
those of the last fully connected layer.
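A sketch of the combined loss of (4), (5) and (6) under these assumptions; the inputs are taken to be batched tensors, the factor 2 in the second term follows the formula as printed, and since l_f depends only on the 7th-layer features it naturally produces no gradient for the last fully connected layer:

```python
import torch

def pairwise_loss(q_est, q_exact, f_anchor, f_pair, omega):
    """Combined loss L = l_r + l_f (Eq. 4) for a batch of N pairs.
    q_*: (N, 4) quaternions, f_*: (N, 4096) features, omega: (N,) pair labels."""
    n = q_est.shape[0]
    # Eq. (5): squared Euclidean regression error on the quaternions
    l_r = (q_est - q_exact).pow(2).sum(dim=1).sum() / (2 * n)
    # Eq. (6): contrastive loss on the features of the 7th fully connected layer
    d = torch.norm(f_anchor - f_pair, dim=1)
    l_f = (omega * d.pow(2)
           + 2 * (1 - omega) * torch.clamp(1 - d, min=0).pow(2)).sum() / (2 * n)
    return l_r + l_f
```

The loss can then be minimized with a standard SGD optimizer over the shared weights, e.g. torch.optim.SGD(net.parameters(), lr=1e-3), where the learning rate is an assumed placeholder value.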
D. Estimation procedure
In contrast to the training, for pose estimation only a
single streamline with one deep CNN is used. The last
fully connected layer directly predicts the pose represented
as a quaternion. Given a depth image or a point cloud, we classify