Page - 99 - in Joint Austrian Computer Vision and Robotics Workshop 2020

object detection. Fragmented occlusion occurs when objects are viewed through the leaves of trees and bushes. Contrary to partial occlusion, fragmented occlusion gives no clear view of the minimal recognisable parts of the object [10] which are used to detect the object [7].

We show in this work that state-of-the-art object detection fails on fragmented occlusion even in the moderate case. For this, we created a new dataset (Figure 1) capturing people behind trees. We labelled nearly 40,000 images in three representative videos. This data raises new challenges for labelling and evaluation, which we only partially answer in this paper. For example, bounding boxes are the standard in the current evaluation of detectors, but such labels are hard to find in data that contains fragmented occlusion. As state-of-the-art detectors deliver bounding boxes, fragmented occlusion poses new questions for the evaluation methodology.

Furthermore, we augmented the Microsoft COCO¹ training data by occluding the ground truth masks, similar to the way leaves occlude people behind bushes and trees. We then show results of training Mask R-CNN [4] on this new data, showing an improvement over Mask R-CNN trained on the original data for slight fragmented occlusion.

2. Related Work

State-of-the-art object detection is based on deep learning. Two-stage detectors work by first finding bounding box proposals [3, 2] on the feature maps of the backbone CNN as an intermediate step. A region proposal network further improves efficiency [9, 4]. One-stage detectors regress the bounding boxes directly [8, 6], which is computationally efficient on GPUs, but this approach is inherently less accurate as it assumes a coarsely discretised search space. Although these methods usually show excellent performance for fully visible objects, they break down in the case of fragmented occlusion. Fragmented occlusion has not been considered for object detection so far; however, there is literature on this topic in the field of motion analysis [1].
3. Methodology

We created a dataset recorded in a forest consisting of three videos with a total of 18,360 frames and 33,933 bounding boxes, which were manually defined by human annotators. These bounding boxes are divided into four different occlusion levels, including the unoccluded case (Figure 1).

¹ http://cocodataset.org

Then, we extended the Microsoft COCO dataset by adding artificial trees as foreground to the images of objects (Figure 2). We chose this dataset because it contains pixel-wise segmentation masks in the ground truth as well as a large number of different categories, including the human person.

Figure 2: A training image from Microsoft COCO (http://images.cocodataset.org/train2017/000000001700.jpg). Top left: the image. Top right: segmentation mask of the image. Bottom left: image overlaid with artificial trees. Bottom right: mask of the overlaid image.

The basic idea of our approach is to add artificial fragmented occlusion to Microsoft COCO and train Mask R-CNN on this new data. In this way we can adapt the original distribution of data to the case of fragmentally occluded objects. Since we are only interested in humans, we apply this augmentation only to images containing humans and use only these images for training.

The trees used for the augmentation are generated from real images we have obtained from the test data. The method generates whole artificial trees by randomly adding branches to previously manually segmented tree trunks. In total, 14 such trunks are extracted from the test dataset. The branches attached to these trunks are also randomly generated, by adding a few manually segmented leaves.

The trees are placed in front of objects by randomly selecting the x-coordinate at which they will be placed and an angle by which the tree will be rotated. The calculated foreground is applied to the image and its negative mask is multiplied by the segmentation mask of the objects in the image. The Mask R-CNN model is then trained with the aug-
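
The compositing step described above (apply the foreground to the image, then multiply the object segmentation mask by the negative foreground mask) can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function name `overlay_occluder`, its parameters, the binary mask convention (1 = occluder or object pixel), and the assumption that the occluder patch has the same height as the image are all hypothetical. The random rotation mentioned in the text is omitted here for brevity.

```python
import numpy as np


def overlay_occluder(image, object_mask, occluder_rgb, occluder_alpha, rng):
    """Paste an occluder at a random x-offset and update the object mask.

    Hypothetical helper, assuming:
      image:          (H, W, 3) uint8 image
      object_mask:    (H, W) uint8 binary mask, 1 = object pixel
      occluder_rgb:   (H, w, 3) uint8 occluder patch, same height as image
      occluder_alpha: (H, w) uint8 binary mask, 1 = occluder pixel
      rng:            a numpy.random.Generator
    """
    h, w = object_mask.shape
    ow = occluder_alpha.shape[1]

    # Randomly select the x-coordinate at which the occluder is placed.
    x0 = int(rng.integers(0, w - ow + 1))

    # Build a full-size foreground mask and foreground colour image.
    fg = np.zeros((h, w), dtype=np.uint8)
    fg[:, x0:x0 + ow] = occluder_alpha
    fg_rgb = np.zeros_like(image)
    fg_rgb[:, x0:x0 + ow] = occluder_rgb

    # Apply the foreground to the image: occluder pixels replace image pixels.
    out_image = np.where(fg[..., None] == 1, fg_rgb, image)

    # Multiply the object mask by the negative (inverted) foreground mask,
    # so that occluded object pixels are removed from the ground truth.
    out_mask = object_mask * (1 - fg)
    return out_image, out_mask, x0
```

Called with `rng = np.random.default_rng()`, the function returns the occluded image, the reduced object mask, and the chosen x-offset. Keeping the mask update as an element-wise product mirrors the wording of the text; a rotation step would be applied to the occluder patch and its mask before this compositing.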

Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik; Technik