Joint Austrian Computer Vision and Robotics Workshop 2020
in the case of deformable objects. Objects like folding headphones, scissors, chains, and cables can vary in appearance depending on their current usage. This poses a problem for CNN-based object detectors. We propose a simple RGB-based method for recognition of both rigid and deformable objects and synthesize images for training a neural network. We then test this method by training the YOLOv3 [13] network with the fully synthetic dataset and explore how the shape of an object, i.e., its symmetry and deformability, affects detection performance.
The contributions of this work include:
• An automated pipeline for synthetic data generation used for detection and recognition of both rigid and deformable objects.
• A novel RGB-based method for quick and effortless acquisition of object masks.
• We explore the effect of the deformability of an object on its detection performance.
2. Related work
Computer vision tasks depend on large amounts of annotated training data. For the task of detecting object classes such as cars or airplanes there are numerous hand-annotated datasets available: COCO [9], PASCAL VOC [3] and the Open Images Dataset [7]. These datasets are built by researchers or companies and consist of a large number of images. Each image has annotations of the objects of interest, which may be a bounding box only or may also contain the mask of the object. The COCO (Common Objects in Context) dataset consists of over 330 thousand images containing objects that are split into 80 classes. However, sometimes, especially in robotics-related tasks, we are interested in detecting a specific object: for example, not just any mug but the user's favourite coffee mug. The mentioned datasets are of little use in these cases, so a specialized dataset is necessary. Such datasets are normally difficult to obtain, so there is a lot of research concerning synthesizing datasets.
Jungwoo Huh et al. [6] proposed a method for synthesizing training data that, similarly to ours, relies on obtaining masks of an object. In order to produce the synthetic images they use pure pasting, whereas we use a combination of pasting and Poisson image editing. Additionally, they evaluate their method on rigid objects only, for example a baseball bat, a bottle, a toy rifle, etc. The only deformable object they use is an umbrella, but since they keep it closed during training and testing, we can consider it a rigid object in this case. Furthermore, they use YOLOv2, which has a lower mAP (mean Average Precision) than YOLOv3 while likewise being able to process images in real time. For obtaining the masks of the objects they use a semi-automatic segmentation method, while ours is fully automated and does not involve any manual post-processing.
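The difference between the two compositing strategies mentioned above can be illustrated with the pure-pasting baseline. The following is a minimal NumPy sketch (function and parameter names are ours, not from either paper): masked object pixels are copied verbatim into the background, whereas Poisson image editing would instead blend the object's gradients across the seam.

```python
import numpy as np

def paste_object(background, obj, mask, top_left):
    """Cut-and-paste compositing: copy the masked object pixels into the
    background at the given (row, col) position. A sketch of the pure-pasting
    baseline; Poisson editing would additionally blend the boundary seam."""
    out = background.copy()
    y, x = top_left
    h, w = obj.shape[:2]
    region = out[y:y + h, x:x + w]     # view into the output image
    region[mask > 0] = obj[mask > 0]   # overwrite only where the mask is set
    return out
```

The hard seam this produces is exactly what gradient-domain (Poisson) blending is meant to hide, which is why we combine both techniques.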
Debidatta Dwibedi et al. [2] assume that object images covering diverse viewpoints are available. They apply a CNN to obtain a mask of the object and then randomly place the object into a scene image using Poisson cloning. Next, they train the Faster R-CNN [14] network on the synthetic images and evaluate the method on the GMU-Kitchens dataset [5]. For the evaluation of the method they also use exclusively rigid objects like bottles, detergents, cups, cornflakes packages, etc. Although simple, the method achieves an mAP of 88%, which is similar to what we report on detection of rigid objects.
Georgakis et al. [4] propose a method for synthesizing training data that takes into consideration the geometry and semantic information of the scene. They use publicly available RGB-D datasets, GMU-Kitchens [5] and Washington RGB-D Scenes v2 [8], as backgrounds for the object images. Using RANSAC they detect planes in the image and artificially place objects on top of them, while also scaling their size according to the distance from the camera. This method produces natural-looking images because, instead of being placed randomly in an image, objects such as a cup or a bottle are placed on a flat desk surface or on the ground. They test their method using SSD and Faster R-CNN [14] and report an mAP between 70% and 85% depending on how much real data they use. Considering that the scenes they use for evaluation are cluttered, this is a good result. The objects used for evaluation are a bowl, a cup, a cereal box, a coffee mug and a soda can. These are all non-deformable objects.
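The distance-dependent scaling described above follows directly from the pinhole camera model, where apparent size is inversely proportional to depth. This is a minimal sketch of that idea (the function and parameter names are ours, and the reference-depth formulation is an assumption, not taken from [4]):

```python
def scaled_size(obj_w, obj_h, ref_depth, plane_depth):
    """Scale a pasted object's pixel dimensions inversely with depth.

    ref_depth   -- depth (in metres) at which the object crop was captured
    plane_depth -- depth of the detected support plane at the paste location

    Pinhole model: apparent size is proportional to 1 / distance, so the
    scale factor is ref_depth / plane_depth.
    """
    s = ref_depth / plane_depth
    return round(obj_w * s), round(obj_h * s)
```

For instance, an object crop captured at 1 m and pasted onto a plane 2 m from the camera would be rendered at half its original pixel size.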
3. Synthetic Data Generation
Object detection is required in applications such as self-driving cars, unmanned aerial vehicles, robotics, etc. Besides detecting rigid objects like cars, chairs or cups, it is often necessary to detect deformable objects