TABLE I: Pose estimation results with the proposed CNN

              Proposed CNN   Proposed CNN   ICP from
              with ICP       without ICP    Random Pose
Precision     0.956          0.822          0.265
Time (ms)     140±32         129±32         155±33
segmented objects. For the sake of simplicity, in this paper we use a simple dominant-plane segmentation and a nearest-neighbour clustering of 3D points. The pre-processing that provides the input to the CNN is identical to that used for training (cf. III-B). The trained CNN directly estimates the rotation for the input segment. The corresponding tentative translation is computed from the centroids of the reference model and the segmented point cloud. Finally, a pose refinement step is performed. The translational error is dominated by the difference between the centroids of the reference model and the test image: the centroid of the reference model is derived from the whole object shape, while the test image lacks the occluded parts. To minimize this error, self-occluded parts of the reference model are removed after the initial alignment, and the centroid of the reference model is recalculated. As a final step, we apply an iterative closest point (ICP) algorithm.
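To illustrate this refinement step, the following is a minimal sketch in Python using Open3D. The paper does not name its ICP implementation, and the parameter values (occlusion-removal radius, ICP correspondence distance) are assumptions for illustration only.

import numpy as np
import open3d as o3d

def refine_pose(model, segment, R_cnn, camera=np.zeros(3)):
    """Refine a CNN pose estimate as described in the text: tentative
    translation from centroids, self-occlusion removal, then ICP."""
    # 1) Rotate the model by the CNN estimate and translate so that the
    #    model centroid matches the segment centroid (tentative pose).
    aligned = o3d.geometry.PointCloud(model)
    aligned.rotate(R_cnn, center=aligned.get_center())
    aligned.translate(segment.get_center() - aligned.get_center())

    # 2) Drop model points that are self-occluded from the camera view,
    #    then recompute the centroid and correct the translation.
    diameter = np.linalg.norm(aligned.get_max_bound() - aligned.get_min_bound())
    _, visible_idx = aligned.hidden_point_removal(camera, radius=diameter * 100)
    visible = aligned.select_by_index(visible_idx)
    visible.translate(segment.get_center() - visible.get_center())

    # 3) Final rigid ICP between the visible model part and the segment.
    result = o3d.pipelines.registration.registration_icp(
        visible, segment, 0.01, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # residual transform on top of the tentative pose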
IV. EXPERIMENTS
We perform experiments to prove our concept with real
bananas. An artificial 3D CAD model of a banana is selected
and converted into a point cloud, which is then used to generate training images and the corresponding ground-truth poses. Scaling and shear transformations are randomly varied between 0.8 and 1.2 for each of the three directions of the views, which are generated every 5 degrees along each axis. The margin δ for the depth-to-color conversion is set to 0.5.
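A minimal sketch of how such a random deformation could be sampled (our illustration, not the authors' code; we interpret the stated 0.8 to 1.2 range as applying to both the per-axis scale and the shear factors):

import numpy as np

def random_scale_shear(rng, low=0.8, high=1.2):
    """Sample a 3x3 scale-and-shear matrix with factors in [low, high]."""
    A = np.diag(rng.uniform(low, high, size=3))                  # per-axis scaling
    A[0, 1], A[0, 2], A[1, 2] = rng.uniform(low, high, size=3)   # shear terms
    return A

# Deform the reference model once per rendered view.
rng = np.random.default_rng(0)
model = rng.normal(size=(2048, 3))          # stand-in for the banana point cloud
deformed = model @ random_scale_shear(rng).T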
The CNN is implemented with the Caffe framework [11]. We initialize the weights with a network pre-trained on ImageNet data [4]. To decide on positive and negative training pairs, we set the threshold to ρs = 0.2 for positive and ρd = 1.0 for negative examples (see the sketch below). Positive and negative pairs are randomly selected during the first epoch; the set of pairs is then fixed for subsequent iterations to reduce training time. Every input image is resized to 64×64 pixels while keeping the aspect ratio of the rendered view. Test images are captured with an Ensenso N35, an industrial stereo sensor that provides only depth information at a resolution of 640×512. We assume robust segmentation results for the test scenes; therefore, we placed the bananas on the table with enough distance between them so that segments can be robustly extracted after detecting the dominant plane. We prepare five test scenes consisting of multiple bananas and approximately 278 scenes containing a single banana per image, using four different bananas. Estimated poses are evaluated manually.
The criterion for the evaluation is based on the graspability of the detected object, i.e., if the estimated pose is accurate enough to successfully grasp the object, it is counted as positive. All experiments are performed with an Intel i7-6700K CPU and an NVIDIA GTX 1080 GPU, the latter used to train the CNN.

Fig. 3: Visualization of the estimated poses of multiple bananas. Red: real bananas in the test scene; yellow: estimation results.

Fig. 4: Example of a bad alignment after ICP. In this example, ICP converged to match only an edge part of the banana.
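As referenced above, a minimal sketch (not the authors' code) of how training pairs could be labeled with ρs and ρd, assuming the similarity between two views is measured as the geodesic distance between their ground-truth rotations:

import numpy as np

RHO_S = 0.2   # pairs closer than this are positives
RHO_D = 1.0   # pairs farther than this are negatives

def rotation_distance(R1, R2):
    """Geodesic distance (radians) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def label_pair(R1, R2):
    """Return +1 (positive), -1 (negative), or 0 (ignored) for a pair."""
    d = rotation_distance(R1, R2)
    if d < RHO_S:
        return 1
    if d > RHO_D:
        return -1
    return 0  # ambiguous pairs are not used for training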
A. Results for bananas
Fig. 3 shows the results for the test scenes containing multiple bananas. As shown in Table I, the overall precision after pose refinement is about 95.6%, and the computation time is about 0.14 seconds per segment, which is well suited for robot grasping tasks.
B. Side effects of the refinement step using ICP
ICP generally improves the results. However, it sometimes causes a worse alignment, as shown in Fig. 4. This is caused by the shape difference between the reference model and the target scene. The standard ICP we use assumes a rigid transformation between the reference model and the target. Hence, depending on the inlier threshold, ICP may converge to a partial fit to the scene, while the remaining point cloud does not contribute.
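A pragmatic guard, which goes beyond what the paper describes, is to check the fitness of the ICP result and keep the CNN pose when too few points are matched; a sketch with Open3D, whose use here is an assumption:

import numpy as np
import open3d as o3d

def guarded_icp(model, segment, init=np.eye(4), dist=0.005, min_fitness=0.6):
    """Run rigid ICP but reject results that explain only a small part of
    the segment, as in the degenerate edge alignment of Fig. 4."""
    result = o3d.pipelines.registration.registration_icp(
        model, segment, dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    # result.fitness is the fraction of points with an inlier correspondence;
    # min_fitness = 0.6 is an arbitrary illustrative threshold
    if result.fitness < min_fitness:
        return init  # fall back to the initial (CNN) pose
    return result.transformation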
V. CONCLUSIONS
In this paper, we proved the concept of estimating poses
of objects with a high shape variance using a deep CNN
estimator. Furthermore, the proposed framework is able to
use any kind of artificial or real scanned 3D model in order
to generate enough data for training the deep CNN. This ongoing research will be further improved with the following ideas:
• A general rigid-transformation ICP is not sufficient to refine the pose because of the shape difference between the reference model and the individual objects. We refer to