Joint Austrian Computer Vision and Robotics Workshop 2020
in the case of deformable objects. Objects like folding headphones, scissors, chains, and cables can vary in appearance depending on their current usage. This poses a problem for CNN-based object detectors. We propose a simple RGB-based method for recognition of both rigid and deformable objects and synthesize images for training a neural network. We then test this method by training the YOLOv3 [13] network with the fully synthetic dataset and explore how the shape of an object, i.e., its symmetry and deformability, affects detection performance.
The contributions of this work include:
• An automated pipeline for synthetic data generation used for detection and recognition of both rigid and deformable objects.
• A novel RGB-based method for quick and effortless acquisition of object masks.
• We explore the effect of the deformability of an object on its detection performance.
2. Related work
Computer vision tasks depend on large amounts of annotated training data. For the task of detecting object classes such as cars or airplanes there are numerous hand-annotated datasets available: COCO [9], PASCAL VOC [3] and the Open Images Dataset [7]. These datasets are built by researchers or companies and consist of a large number of images. Each image has annotations of the objects of interest, which may be a bounding box only or may also contain the mask of the object. The COCO (Common Objects in Context) dataset consists of over 330 thousand images containing objects that are split into 80 classes. However, sometimes, especially in robotics-related tasks, we are interested in detecting a specific object: for example, not just any mug but the user's favourite coffee mug. The mentioned datasets are of little use in these cases, so a specialized dataset is necessary. Such datasets are normally difficult to obtain, so there is a lot of research concerning synthesizing datasets.
Jungwoo Huh et al. [6] proposed a method for synthesizing training data that, similarly to ours, relies on obtaining masks of an object. In order to produce the synthetic images they use pure pasting, whereas we use a combination of pasting and Poisson image editing. Additionally, they evaluate their method on rigid objects only, for example a baseball bat, a bottle, a toy rifle, etc. The only deformable object they use is an umbrella, but since they keep it closed during training and testing, we can consider it a rigid object in this case. Furthermore, they use YOLOv2, which has a lower mAP (mean Average Precision) than YOLOv3 while likewise being able to process images in real time. For obtaining the masks of the objects they use a semi-automatic segmentation method, while ours is fully automated and does not involve any manual post-processing.
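The difference between the two compositing strategies mentioned above can be illustrated with the pure-pasting baseline. The following is a minimal NumPy sketch (function and parameter names are ours, not from either paper): masked object pixels are copied verbatim into the background, whereas Poisson image editing would instead blend the object's gradients across the seam.

```python
import numpy as np

def paste_object(background, obj, mask, top_left):
    """Cut-and-paste compositing: copy the masked object pixels into the
    background at the given (row, col) position. A sketch of the pure-pasting
    baseline; Poisson editing would additionally blend the boundary seam."""
    out = background.copy()
    y, x = top_left
    h, w = obj.shape[:2]
    region = out[y:y + h, x:x + w]     # view into the output image
    region[mask > 0] = obj[mask > 0]   # overwrite only where the mask is set
    return out
```

The hard seam this produces is exactly what gradient-domain (Poisson) blending is meant to hide, which is why we combine both techniques.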
Debidatta Dwibedi et al. [2] assume that object images covering diverse viewpoints are available. They apply a CNN to obtain a mask of the object and then randomly place the object into a scene image using Poisson cloning. Next, they train the Faster R-CNN [14] network on the synthetic images and evaluate the method on the GMU-Kitchens dataset [5]. For the evaluation of the method they also use exclusively rigid objects like bottles, detergents, cups, cornflakes packages, etc. Although simple, the method achieves an mAP of 88%, which is similar to what we report on detection of rigid objects.
Georgakis et al. [4] propose a method for synthesizing training data that takes into consideration the geometry and semantic information of the scene. They use publicly available RGB-D datasets, GMU-Kitchens [5] and Washington RGB-D Scenes v2 [8], as backgrounds for the object images. Using RANSAC they detect planes in the image and artificially place objects on top of them, while also scaling their size according to the distance from the camera. This method produces natural-looking images because, instead of being placed randomly in an image, objects such as a cup or a bottle are placed on a flat desk surface or on the ground. They test their method using SSD and Faster R-CNN [14] and report an mAP between 70% and 85% depending on how much real data they use. Considering that the scenes they use for evaluation are cluttered, this is a good result. The objects used for evaluation are a bowl, a cup, a cereal box, a coffee mug and a soda can. These are all non-deformable objects.
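The distance-dependent scaling described above follows directly from the pinhole camera model, where apparent size is inversely proportional to depth. This is a minimal sketch of that idea (the function and parameter names are ours, and the reference-depth formulation is an assumption, not taken from [4]):

```python
def scaled_size(obj_w, obj_h, ref_depth, plane_depth):
    """Scale a pasted object's pixel dimensions inversely with depth.

    ref_depth   -- depth (in metres) at which the object crop was captured
    plane_depth -- depth of the detected support plane at the paste location

    Pinhole model: apparent size is proportional to 1 / distance, so the
    scale factor is ref_depth / plane_depth.
    """
    s = ref_depth / plane_depth
    return round(obj_w * s), round(obj_h * s)
```

For instance, an object crop captured at 1 m and pasted onto a plane 2 m from the camera would be rendered at half its original pixel size.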
3. Synthetic Data Generation
Object detection is required in applications such as self-driving cars, unmanned aerial vehicles, robotics, etc. Besides detecting rigid objects like cars, chairs or cups, it is often necessary to detect deformable objects