Figure 3. Our automatic annotation pipeline. a) Two consecutive depth images with one object removed (marked in red).
Calculating the difference of the depth images gives a rough segmentation mask of the removed object. b) Refinement
of the mask using morphological operations and Gaussian filtering. c) Geometric features (object edges, skeleton, center
of mass) are calculated using the refined segmentation mask and are used afterwards to calculate the final position of the
grasping point proposals. The last step transfers the proposed bounding boxes to the corresponding RGB image.
the scene, without any additional costs. The only time humans are involved is when checking all the predicted labels via manual inspection to find images which contain erroneous labels. In this process we drop roughly 10% of the images to avoid inaccurately labeled training data. Figure 4 shows results of our automatically labeled dataset.
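As an illustration only, a minimal sketch of steps a) and b) of the pipeline in Figure 3 could look as follows; it uses OpenCV, and the difference threshold, kernel size, and blur sigma are assumed values chosen for this sketch, not the parameters of our setup:

```python
import cv2
import numpy as np

def rough_mask_from_depth(depth_before, depth_after, diff_thresh=0.01):
    """Step a): rough segmentation mask of the removed object as the
    thresholded absolute difference of two consecutive depth images."""
    diff = np.abs(depth_after.astype(np.float32) - depth_before.astype(np.float32))
    return (diff > diff_thresh).astype(np.uint8)

def refine_mask(mask, kernel_size=5, blur_sigma=2.0):
    """Step b): morphological opening/closing to remove speckle noise and
    fill holes, followed by Gaussian filtering and re-thresholding."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    blurred = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), blur_sigma)
    return (blurred > 0.5).astype(np.uint8)

def center_of_mass(mask):
    """Part of step c): object center of mass from binary image moments."""
    m = cv2.moments(mask, binaryImage=True)
    return m["m10"] / m["m00"], m["m01"] / m["m00"]
```

The remaining geometric features of step c) (object edges and skeleton) and the placement of the grasping point proposals follow analogously from the refined mask.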
3.3. Human-based Data Annotation
In addition to our automatic labeling approach, we also labeled the whole dataset manually. The idea is to train a grasp prediction network on both types of labels independently, and then compare the performance of both approaches. All hand-labeled data were checked by human experts with domain knowledge to verify the correctness of the annotations.
4. Grasping Point Prediction in a Cluttered
Environment
Chu et al. [4] proposed a deep neural network to
predict multiple grasping points for multiple objects
in the scene. We adapted their approach and retrained the network with our specific dataset.
4.1. Network Architecture and Loss Function
The network architecture is based on the Faster
R-CNN object detection framework [14] using a
ResNet-50 [6] as backbone. It takes a three-channel RGB image as input and predicts a number of grasping point candidates, where one candidate g is defined as described in Equation 1. Note that the rotation angle θ is quantized into R = 19 intervals, which makes the prediction of this parameter a classification problem. All other parameters (see Equation 1) are predicted using regression. During training, the composite loss function L_total is defined as
L_total = L_gpn + L_gcr,    (4)
where L_gpn describes the loss according to the grasp proposal net and L_gcr is the grasp configuration prediction loss. The loss term L_gpn is used to define initial rectangular bounding box proposals without orientation ({x, y, w, h}), whereas L_gcr is used to define the orientation and the refined bounding box prediction {x, y, θ, w, h}. Figure 5 shows the structure of the prediction network and indicates how the loss parts L_gpn and L_gcr are calculated. Further information about the network architecture and the loss function can be found in [4].
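Purely as an illustration of Equation 4 and of the angle quantization described above, a sketch in PyTorch could look as follows; the tensor names, the individual loss components, and the bin layout are assumptions made for this sketch, and the exact formulation is given in [4]:

```python
import math
import torch
import torch.nn.functional as F

R = 19  # number of discrete orientation classes

def quantize_angle(theta):
    """Map continuous rotation angles (radians, assumed in [0, pi)) to one of
    R orientation classes, turning angle prediction into classification."""
    bin_width = math.pi / R
    return torch.clamp((theta / bin_width).long(), 0, R - 1)

def total_loss(gpn_cls, gpn_cls_target, gpn_box, gpn_box_target,
               gcr_theta_logits, theta_target, gcr_box, gcr_box_target):
    """Composite loss L_total = L_gpn + L_gcr (Equation 4).
    L_gpn supervises the axis-aligned proposals {x, y, w, h};
    L_gcr supervises the orientation class and the refined box.
    The individual terms chosen here are illustrative only."""
    l_gpn = (F.binary_cross_entropy_with_logits(gpn_cls, gpn_cls_target)
             + F.smooth_l1_loss(gpn_box, gpn_box_target))
    l_gcr = (F.cross_entropy(gcr_theta_logits, quantize_angle(theta_target))
             + F.smooth_l1_loss(gcr_box, gcr_box_target))
    return l_gpn + l_gcr
```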
4.2. Data Preprocessing and Augmentation
Our dataset for training the prediction network consists of only 52 images. Therefore, data augmentation is used to increase the size of the training data by a factor of 100. Figure 6 shows examples of the augmented data. This increases the variation in the training data and decreases the possibility of overfitting during training. After augmentation, each image was resized to 227 × 227 px to fit the input dimension of the network.
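A minimal sketch of such an augmentation loop is given below; the specific transformations (random rotation and horizontal flipping) and their parameter ranges are assumptions chosen for illustration, and in practice the grasp rectangle annotations have to be transformed consistently with each image:

```python
import cv2
import numpy as np

def augment_image(image, rng, n_copies=100, out_size=227):
    """Create n_copies augmented variants of one training image and resize
    them to the 227 x 227 px network input. Transformations are illustrative."""
    h, w = image.shape[:2]
    samples = []
    for _ in range(n_copies):
        angle = rng.uniform(-30.0, 30.0)                      # random rotation
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        aug = cv2.warpAffine(image, M, (w, h))
        if rng.random() < 0.5:                                # random horizontal flip
            aug = cv2.flip(aug, 1)
        samples.append(cv2.resize(aug, (out_size, out_size)))
    return samples

# Hypothetical usage:
# rng = np.random.default_rng(0)
# augmented = augment_image(img, rng)
```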
4.3. Training Schedule
Pre-trained ImageNet [5] weights are used as initialization for the ResNet-50 backbone to avoid overfitting and ease the training process. All other layers beyond ResNet-50 are trained from scratch. The whole structure of the network can be seen in Figure 5. We used the Adam optimizer and trained our