Figure 3. Our automatic annotation pipeline. a) Two consecutive depth images with one object removed (marked in red).
Calculating the difference of the depth images gives a rough segmentation mask of the removed object. b) Refinement
of the mask using morphological operations and Gaussian filtering. c) Geometric features (object edges, skeleton, center
of mass) are calculated using the refined segmentation mask and are used afterwards to calculate the final position of the
grasping point proposals. The last step transfers the proposed bounding boxes to the corresponding RGB image.
the scene, without any additional costs. The only time humans are involved is when checking all the predicted labels via manual inspection to find images which contain erroneous labels. In this process we drop roughly 10% of the images to avoid inaccurately labeled training data. Figure 4 shows results of our automatically labeled dataset.
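As an illustration only, a minimal sketch of steps a) and b) of the pipeline in Figure 3 could look as follows; it uses OpenCV, and the difference threshold, kernel size, and blur sigma are assumed values chosen for this sketch, not the parameters of our setup:

```python
import cv2
import numpy as np

def rough_mask_from_depth(depth_before, depth_after, diff_thresh=0.01):
    """Step a): rough segmentation mask of the removed object as the
    thresholded absolute difference of two consecutive depth images."""
    diff = np.abs(depth_after.astype(np.float32) - depth_before.astype(np.float32))
    return (diff > diff_thresh).astype(np.uint8)

def refine_mask(mask, kernel_size=5, blur_sigma=2.0):
    """Step b): morphological opening/closing to remove speckle noise and
    fill holes, followed by Gaussian filtering and re-thresholding."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    blurred = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), blur_sigma)
    return (blurred > 0.5).astype(np.uint8)

def center_of_mass(mask):
    """Part of step c): object center of mass from binary image moments."""
    m = cv2.moments(mask, binaryImage=True)
    return m["m10"] / m["m00"], m["m01"] / m["m00"]
```

The remaining geometric features of step c) (object edges and skeleton) and the placement of the grasping point proposals follow analogously from the refined mask.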
3.3. Human-based Data Annotation
In addition to our automatic labeling approach, we also labeled the whole dataset manually. The idea is to train a grasp prediction network on both types of labels independently, and then compare the performance of both approaches. All hand-labeled data were checked by human experts with domain knowledge to verify the correctness of the annotations.
4. Grasping Point Prediction in a Cluttered
Environment
Chu et al. [4] proposed a deep neural network to
predict multiple grasping points for multiple objects
in the scene. We adapted their approach and retrained the network with our specific dataset.
4.1. Network Architecture and Loss Function
The network architecture is based on the Faster
R-CNN object detection framework [14] using a
ResNet-50 [6] as backbone. It takes a three-channel RGB image as input and predicts a number of grasping point candidates, where one candidate g is defined as described in Equation 1. Note that the rotation angle θ is quantized into R = 19 intervals, which makes the prediction of this parameter a classification problem. All other parameters (see Equation 1) are predicted using regression. During training, the composite loss function L_total is defined as
L_total = L_gpn + L_gcr,    (4)
where L_gpn describes the loss according to the grasp proposal net and L_gcr is the grasp configuration prediction loss. The loss term L_gpn is used to define initial rectangular bounding box proposals without orientation ({x, y, w, h}), whereas L_gcr is used to define the orientation and the refined bounding box prediction {x, y, θ, w, h}. Figure 5 shows the structure of the prediction network and indicates how the loss parts L_gpn and L_gcr are calculated. Further information about the network architecture and the loss function can be found in [4].
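Purely as an illustration of Equation 4 and of the angle quantization described above, a sketch in PyTorch could look as follows; the tensor names, the individual loss components, and the bin layout are assumptions made for this sketch, and the exact formulation is given in [4]:

```python
import math
import torch
import torch.nn.functional as F

R = 19  # number of discrete orientation classes

def quantize_angle(theta):
    """Map continuous rotation angles (radians, assumed in [0, pi)) to one of
    R orientation classes, turning angle prediction into classification."""
    bin_width = math.pi / R
    return torch.clamp((theta / bin_width).long(), 0, R - 1)

def total_loss(gpn_cls, gpn_cls_target, gpn_box, gpn_box_target,
               gcr_theta_logits, theta_target, gcr_box, gcr_box_target):
    """Composite loss L_total = L_gpn + L_gcr (Equation 4).
    L_gpn supervises the axis-aligned proposals {x, y, w, h};
    L_gcr supervises the orientation class and the refined box.
    The individual terms chosen here are illustrative only."""
    l_gpn = (F.binary_cross_entropy_with_logits(gpn_cls, gpn_cls_target)
             + F.smooth_l1_loss(gpn_box, gpn_box_target))
    l_gcr = (F.cross_entropy(gcr_theta_logits, quantize_angle(theta_target))
             + F.smooth_l1_loss(gcr_box, gcr_box_target))
    return l_gpn + l_gcr
```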
4.2. Data Preprocessing and Augmentation
Our dataset for training the prediction network consists of only 52 images. Therefore, data augmentation is used to increase the size of the training data by a factor of 100. Figure 6 shows examples of the augmented data. This increases the variation in the training data and decreases the possibility of overfitting during training. After augmentation, each image was resized to 227 × 227 px to fit the input dimension of the network.
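A minimal sketch of such an augmentation loop is given below; the specific transformations (random rotation and horizontal flipping) and their parameter ranges are assumptions chosen for illustration, and in practice the grasp rectangle annotations have to be transformed consistently with each image:

```python
import cv2
import numpy as np

def augment_image(image, rng, n_copies=100, out_size=227):
    """Create n_copies augmented variants of one training image and resize
    them to the 227 x 227 px network input. Transformations are illustrative."""
    h, w = image.shape[:2]
    samples = []
    for _ in range(n_copies):
        angle = rng.uniform(-30.0, 30.0)                      # random rotation
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        aug = cv2.warpAffine(image, M, (w, h))
        if rng.random() < 0.5:                                # random horizontal flip
            aug = cv2.flip(aug, 1)
        samples.append(cv2.resize(aug, (out_size, out_size)))
    return samples

# Hypothetical usage:
# rng = np.random.default_rng(0)
# augmented = augment_image(img, rng)
```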
4.3. Training Schedule
Pre-trained ImageNet [5] weights are used as initialization for the ResNet-50 backbone to avoid overfitting and ease the training process. All other layers beyond ResNet-50 are trained from scratch. The whole structure of the network can be seen in Figure 5. We used the Adam optimizer and trained our