mented images. The selected backbone model is the Inception v2 [5] network, chosen for its comparatively fast computation.
4. Evaluation
To evaluate whether training with the augmented dataset is useful, the model trained on the augmented data must be compared with the model not trained on this data. However, the Intersection over Union (IoU) measure is not meaningful in this case.
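For reference, the IoU of a ground truth region $A$ and a detection $B$ is the ratio of the overlap area to the combined area,

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

so a detection covering only a small fragment of the object necessarily scores a low IoU.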
Standard evaluation metrics such as the mean average precision (mAP) define an IoU threshold (e.g. 0.5) and check whether a ground truth object and a detected object have an IoU above this threshold. If this is the case, the detected object is defined as a True Positive (TP). If an object is detected but there is no respective ground truth with an IoU above this threshold, the detected object is defined as a False Positive (FP). If there is ground truth but no detected object with an IoU above the threshold, the object is defined as a False Negative (FN).
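As a minimal sketch of this standard matching scheme (the function names are illustrative, not from the paper; boxes are axis-aligned (x1, y1, x2, y2) tuples):

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match(gt_boxes, det_boxes, thr=0.5):
    """Count TP/FP/FN with the usual one-to-one IoU-threshold matching."""
    unmatched = set(range(len(gt_boxes)))
    tp = fp = 0
    for det in det_boxes:
        # Pick the best still-unmatched ground truth box for this detection.
        best = max(unmatched, key=lambda i: iou(det, gt_boxes[i]), default=None)
        if best is not None and iou(det, gt_boxes[best]) >= thr:
            unmatched.discard(best)
            tp += 1
        else:
            fp += 1  # no sufficiently overlapping ground truth
    fn = len(unmatched)  # ground truth boxes no detection matched
    return tp, fp, fn
```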
These evaluation methods cannot be easily applied to ground truth showing fragmented occlusion, because of the following two observations:
IoU too small: Since the data is based on fragmented detections, a detector can only detect parts of the person. An image where this problem occurs is shown in Figure 3. The bounding box is clearly a TP, given that fragmented objects should be detected, but due to the occlusion by the branches of the tree the whole body cannot be recognized. This leads to an IoU of only ≈ 0.2 (a correct detection covering a fifth of the full-body ground truth box scores exactly this value).
Multiple detections: Another major problem with the standard evaluation metrics is that they assume exactly one detected bounding box matches one ground truth bounding box. However, when handling fragmented objects, human heads and/or other body parts should be detected separately if body parts are covered. This creates the problem that a part of the body (like a head) is detected as well as the whole body. Figure 4 shows some examples.
Figure 3: Ground truth (green) and the detection (blue) vary substantially due to the occlusion effects.

To tackle these two problems, this paper proposes a different evaluation metric. For each bounding box in the evaluation dataset, we calculate the maximum region in the image that has no overlap with another ground truth bounding box. This region is then extracted and fed into the model. If the model detects an object, we define it as a TP, otherwise as an FN. To assess FPs, we create an additional dataset that represents the maximum region in an image without overlap with any ground truth bounding box. We extracted in total 45,340 such regions with different aspect ratios, from different parts of the image and at different time instants. In addition to FPs, we can also calculate the TNs using this evaluation metric.
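The following Python sketch illustrates this protocol under a few assumptions not stated in the paper: boxes are axis-aligned (x1, y1, x2, y2) tuples, images are NumPy-style arrays, and `model` is a callable returning a list of detections for a crop. The paper also does not specify how the maximum overlap-free region is computed, so the region-growing rule below is only one plausible heuristic:

```python
def overlap_free_region(box, other_boxes, img_w, img_h):
    """Heuristic sketch: grow a ground truth box towards the image borders,
    stopping each side where it would start to overlap another ground
    truth box (the paper's exact construction is not specified)."""
    x1, y1, x2, y2 = box
    left, top, right, bottom = 0, 0, img_w, img_h
    for ox1, oy1, ox2, oy2 in other_boxes:
        if oy1 < y2 and oy2 > y1:          # shares a vertical span with `box`
            if ox2 <= x1:
                left = max(left, ox2)      # neighbour on the left
            elif ox1 >= x2:
                right = min(right, ox1)    # neighbour on the right
        if ox1 < x2 and ox2 > x1:          # shares a horizontal span with `box`
            if oy2 <= y1:
                top = max(top, oy2)        # neighbour above
            elif oy1 >= y2:
                bottom = min(bottom, oy1)  # neighbour below
    return left, top, right, bottom

def evaluate_box(model, image, box, other_boxes):
    """TP if the model fires on the extracted region, FN otherwise."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = overlap_free_region(box, other_boxes, w, h)
    crop = image[y1:y2, x1:x2]
    return "TP" if len(model(crop)) > 0 else "FN"
```

FP regions would be handled analogously: crops that overlap no ground truth box at all are fed to the model, and any detection there counts as an FP, no detection as a TN.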
Figure 5 shows these results as a precision-recall curve. There is no significant difference between Mask R-CNN trained on Microsoft COCO and on the augmented dataset for L0 occlusion. However, a clear improvement has been achieved for L1 and L2 occlusion, which demonstrates the applicability of the idea to model fragmented occlusion by the masks. Nevertheless, none of the approaches reaches the expected robustness and accuracy for moderate L2 and heavy L3 occlusion. One reason for this is that our current technique is not accurate enough to model fragmented occlusion. Furthermore, clear limits exist, as heavy fragmented occlusion removes the local spatial and structural information necessary for current approaches to object detection.
We further recognise that bounding box labelling is not the appropriate approach for labelling data showing fragmented occlusion. Especially for L3 and L4 occlusion, it is frequently impossible to manually define the bounding box. Such occlusion levels allow an approximate localisation of the object in the image but make the observation of the object's extent impossible. While the recall in Figure 5 is still meaningful, the precision is basically undefined. This observation has severe consequences on the labelling,