Pose Estimation of Similar Shape Objects using Convolutional Neural
Network trained by Synthetic data
Kiru Park, Johann Prankl, Michael Zillich and Markus Vincze
Abstract—The objective of this paper is accurate 6D pose
estimation from 2.5D point clouds for object classes with a
high shape variation, such as vegetables and fruit. General
pose estimation methods usually focus on calculating rigid
transformations between known models and the target scene,
and do not explicitly consider shape variations. We employ
deep convolutional neural networks (CNN), which show robust
and state of the art performance for the 2D image domain.
In contrast, normally the performance of pose estimation from
point clouds is weak, because it is hard to prepare large enough
annotated training data. To overcome this issue, we propose an
autonomous generation process of synthetic 2.5D point clouds
covering different shape variations of the objects. The synthetic
data is used to train the deep CNN model in order to estimate
the object poses. We propose a novel loss function to guide
the estimator to have larger feature distances for different
poses, and to directly estimate the correct object pose. We
performed an evaluation using real objects, where the training
was conducted with artificial CAD models downloaded from a
public web resource. The results indicate that our approach is
suitable for real world robotic applications.
I. INTRODUCTION
Pose estimation of objects in color and depth images is es-
sential for bin-picking tasks to determine grasping points for
robotic grippers. Man-made objects are usually manufactured
using 3D CAD models having exactly the same shapes with
negligible errors. The well-constrained environment enables
the robot to identify each pose by comparing features of the
pre-created template and an input image [14]. However, it is
not possible to provide 3D CAD models for natural objects,
such as vegetables or fish, where each object has a slightly
different shape. Object pose estimation with template based
approaches would need a huge number of templates in order
to cover each individual pose and the different shape variants.
Hence, these approaches would lead to large databases and
high processing times for template matching.
Recently, CNN-based approaches have provided reasonable
results for most computer vision tasks, including image
classification and object detection in 2D images [13], [15]. This
achievement is accomplished with a large number of training
examples, e.g., [4], [7]. The 2D image datasets are usually
collected from web resources and annotated by non-experts
with user-friendly labeling tools. For RGB-D images or
2.5D point clouds, it is difficult to collect a large number
of examples from public web services, and it is also hard
for non-experts to annotate the exact poses.

All authors are with the Vision4Robotics group, Automation and Control
Institute, Vienna University of Technology, Austria {park, prankl,
zillich, vincze}@acin.tuwien.ac.at

Fig. 1: Overview of the proposed framework. An artificial
3D CAD model is used to generate synthetic scenes with
varied shapes and poses in order to train the deep CNN. The
trained network computes the pose of each segmented cluster.

The resulting lack of training data causes additional
complexity when training a CNN for estimating 6D
poses in the 3D space. Therefore, pre-trained CNNs are
used for extracting features from color or depth images, and
the extracted features are used to train linear regressors to
estimate the poses [16]. Although several datasets provide
6D pose information for more than 15K images [9], [10],
this is still not enough to train a deep CNN, and none of
them consider object classes with large shape variations.
In this paper, we propose a simple pose estimator that can
be used to estimate poses of objects with shape variations,
such as vegetables or fruit, using a CNN and a single depth
image as input. Synthetic depth images containing various
poses and shapes of a CAD model are generated to train the
proposed CNN. No template information is required after
training. This simplicity is one of the advantages of
the proposed model for the robust estimation of object poses
with different shape variants. The experiments show that our
concept is suitable for real world robotic applications.
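To make the data generation step more concrete, the following is a minimal Python sketch of how synthetic depth images with shape variation could be produced from a single CAD model. It is not the authors' exact pipeline: the anisotropic scaling range, the pinhole camera intrinsics, the object distance, and the placeholder point set are illustrative assumptions standing in for details not given in this section.

# Minimal sketch (assumptions noted above): generate one synthetic depth image
# of a CAD model under a random shape variation and a random pose.
import numpy as np

def random_shape_variation(vertices, max_scale=0.2):
    # Anisotropically scale the model to mimic natural shape variation.
    scale = 1.0 + np.random.uniform(-max_scale, max_scale, size=3)
    return vertices * scale

def random_rotation():
    # Sample a uniformly distributed 3D rotation via QR decomposition.
    q, r = np.linalg.qr(np.random.randn(3, 3))
    q = q * np.sign(np.diag(r))   # fix column signs so the sample is uniform
    if np.linalg.det(q) < 0:      # enforce a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def render_depth(points, fx=525.0, fy=525.0, cx=320.0, cy=240.0,
                 width=640, height=480, distance=0.8):
    # Project 3D points into a simple z-buffered depth image (pinhole camera).
    depth = np.full((height, width), np.inf)
    pts = points.copy()
    pts[:, 2] += distance                              # place object in front of the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height) & (pts[:, 2] > 0)
    for ui, vi, zi in zip(u[ok], v[ok], pts[ok, 2]):
        depth[vi, ui] = min(depth[vi, ui], zi)         # keep the closest surface
    depth[np.isinf(depth)] = 0.0                       # background pixels
    return depth

# Usage: vertices would come from an artificial CAD model; a random placeholder
# point set is used here. The sampled rotation R is the pose label of the image.
vertices = np.random.rand(5000, 3) * 0.1 - 0.05
R = random_rotation()
depth_image = render_depth(random_shape_variation(vertices) @ R.T)

Each rendered depth map, together with the sampled rotation used to produce it, would serve as one annotated training example for the CNN.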
In summary, our paper provides the following contributions:
• We propose a framework that generates synthetic training
images and comprises a deep CNN pose estimator for
natural object classes such as vegetables and fruit.
• Pairwise training is applied to train the deep CNN with
a loss function that minimizes both the errors between
the estimated and ground-truth poses and the low-level
feature distances between similar poses (a sketch of such
a loss follows this list).
• We show that our estimator successfully estimates poses
of real fruit using more than two hundred test images,
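The pairwise loss described in the second contribution can be illustrated with a small sketch. The following is an assumed formulation, not the authors' exact loss: a squared pose-regression term combined with a contrastive term on low-level features, so that similar poses keep small feature distances while different poses are pushed apart by at least a margin. The similarity threshold, margin, and weighting factor are illustrative assumptions.

# Illustrative sketch of a pairwise loss combining pose regression and a
# contrastive feature term (threshold, margin, and alpha are assumptions).
import numpy as np

def pairwise_loss(pred_pose_a, pred_pose_b, gt_pose_a, gt_pose_b,
                  feat_a, feat_b, sim_threshold=0.1, margin=1.0, alpha=0.5):
    # Pose-regression term: squared error against the ground truth for both
    # samples of the pair.
    pose_term = (np.sum((pred_pose_a - gt_pose_a) ** 2)
                 + np.sum((pred_pose_b - gt_pose_b) ** 2))

    # Contrastive term on low-level CNN features: pairs whose ground-truth
    # poses are close should have close features; otherwise the features
    # are pushed apart by at least the margin.
    feat_dist = np.linalg.norm(feat_a - feat_b)
    if np.linalg.norm(gt_pose_a - gt_pose_b) < sim_threshold:
        feat_term = feat_dist ** 2
    else:
        feat_term = max(0.0, margin - feat_dist) ** 2

    return pose_term + alpha * feat_term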