make predictions based on time sequences. We prefer the ConvLSTM [23] cell for this dense prediction task. Compared to the fully connected LSTM, it removes unnecessary connections. For instance, the connection of features from the top left corner of the previous frame to features of the bottom right corner of the current frame is not needed. We assume that if we ensure consistency locally by the convolution operator, we will generate overall results which are consistent, as long as motion between frames can be detected in the local window. Therefore, we need to choose the filter size large enough to allow the ConvLSTM to detect local consistencies and motion between frames without explicit optical flow information. Furthermore, the ConvLSTM allows us to process images at different resolutions and reduces the number of parameters significantly compared to the fully connected LSTM. The definition of the ConvLSTM cell is given in [23]. We use two different networks in which we include the ConvLSTM cell.
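As a point of reference, the following is a minimal sketch of such a ConvLSTM cell in PyTorch. It follows the common formulation of [23] but omits the peephole connections; the class name, the fused gate convolution, and the zero state initialisation are our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell after [23], peephole terms omitted.

    All four gates are computed by one convolution over the
    concatenated input and previous hidden state, so temporal
    connections exist only within the local filter window.
    """

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=kernel_size // 2)  # keep resolution

    def forward(self, x, state=None):
        if state is None:  # zero-initialise hidden and cell state
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros.clone())
        h_prev, c_prev = state
        i, f, o, g = self.gates(torch.cat([x, h_prev], 1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```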
First, we introduce the VideoSSNet (VSSNet) architecture, which consists of six layers of 3×3 convolutions with dilation rates {1, 1, 2, 2, 4, 4} and 64 channels. Compared to the SSNet, we replace the last convolutional layer with a ConvLSTM in the VSSNet.
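A sketch of how such a VSSNet could be assembled, reusing the ConvLSTMCell above. Letting the ConvLSTM emit the class scores directly, the choice of activations, and dropping the dilation in the replaced sixth layer are assumptions on our part:

```python
class VSSNet(nn.Module):
    """Sketch of the VSSNet: 3x3 convolutions with dilation rates
    1, 1, 2, 2, 4 and 64 channels; the sixth convolution is replaced
    by a ConvLSTM that produces the per-class scores."""

    def __init__(self, in_channels=3, num_classes=19, channels=64):
        super().__init__()
        layers, c_in = [], in_channels
        for d in (1, 1, 2, 2, 4):
            layers += [nn.Conv2d(c_in, channels, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]  # activation choice assumed
            c_in = channels
        self.features = nn.Sequential(*layers)
        self.convlstm = ConvLSTMCell(channels, num_classes)

    def forward(self, frames):
        # frames: (T, B, 3, H, W); the ConvLSTM state carries
        # information across the frames of the clip
        state, outputs = None, []
        for x in frames:
            scores, state = self.convlstm(self.features(x), state)
            outputs.append(scores)
        return torch.stack(outputs)  # (T, B, num_classes, H, W)
```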
Second, we also extend the ESPNet [19] with a ConvLSTM layer. Although it would be reasonable to propagate features at every layer of a CNN architecture, this is not feasible because of the rapidly growing computational complexity. Figure 2 shows the ESPNet architecture with four possible positions for the ConvLSTM. The proposed architectures are enumerated alphabetically from ESPNet L1a to ESPNet L1d, starting with the ConvLSTM at the highest feature level, which means that it is located closest to the output layer. Besides the ConvLSTM layer, we implement two ESP modules at the first spatial level and three ESP modules at the second spatial level, which is the simplest configuration introduced in [19]. All other aspects of the ESPNet architecture remain unchanged.
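To illustrate the simplest variant, L1a, which adds a ConvLSTM on top of the otherwise unchanged network, one can wrap an existing single-frame segmentation model. The wrapper below is our illustration only; it assumes a `backbone` module that maps one frame to 19 class channels and does not reproduce the ESP modules themselves:

```python
class L1aWrapper(nn.Module):
    """Sketch of the L1a idea: keep a single-frame segmentation
    backbone unchanged and add a class-level ConvLSTM (19 -> 19
    channels, cf. Figure 2) before the final softmax."""

    def __init__(self, backbone, num_classes=19):
        super().__init__()
        self.backbone = backbone  # hypothetical single-frame network
        self.convlstm = ConvLSTMCell(num_classes, num_classes)

    def forward(self, frames):
        # frames: (T, B, 3, H, W) video clip
        state, outputs = None, []
        for x in frames:
            scores, state = self.convlstm(self.backbone(x), state)
            outputs.append(scores)
        return torch.stack(outputs)
```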
3.2. Temporal Consistency Loss
Our second building block to enforce consistency is an additional error term in our loss function. The resulting loss function $\mathcal{L}(\cdot)$ is defined as

$$\mathcal{L}(S,P) = \lambda_{ce}\,\mathcal{L}_{ce}(S,P) + \lambda_{incons}\,\mathcal{L}_{incons}(S,P), \tag{1}$$
where $S \in \mathcal{S}^{T \times M \times N}$ contains the semantic ground truth and $P \in \mathbb{R}^{T \times M \times N \times |\mathcal{S}|}$ contains the predictions. The set $\mathcal{S}$ contains all semantic labels. We bound the dimensions by the sequence length $T$, the image dimensions $M \times N$, and the number of semantic labels $|\mathcal{S}|$.
Figure 2: ESPNet with ConvLSTM. Four different positions for including a ConvLSTM (orange) into the existing ESPNet architecture are depicted. Dashed boxes indicate that only one ConvLSTM is present in a single architecture. L1b, L1c and L1d replace 1×1 channel reduction convolutions, while L1a adds an additional layer to the architecture of the original ESPNet. Red boxes indicate a spatial dimensionality reduction by a factor of two, while green boxes indicate a spatial dimensionality increase by a factor of two.
The function $\mathcal{L}_{ce}(\cdot)$ computes the cross-entropy loss and $\mathcal{L}_{incons}(\cdot)$ penalizes inconsistencies. The hyper-parameters $\lambda_{ce}$ and $\lambda_{incons}$ are introduced to balance training between prediction accuracy and consistency.
We define the inconsistency loss as

$$\mathcal{L}_{incons}(S,P) = \frac{1}{\omega_{norm}(S)} \sum_{t,m,n=1}^{T-1,M,N} \omega_{vcc}(S,P,t,m,n) \cdot \sum_{s=1}^{|\mathcal{S}|} \delta(S_{t,m,n} = s) \cdot \left( P_{t,m,n,s} - P_{t+1,m,n,s} \right)^2, \tag{2}$$
where $\delta(\cdot)$ refers to the indicator function defined as

$$\delta(\varphi(\cdot)) = \begin{cases} 1 & \text{if } \varphi(\cdot) \text{ is true} \\ 0 & \text{else.} \end{cases} \tag{3}$$
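A minimal sketch of Eqs. (1)–(3) in PyTorch, assuming $P$ holds softmax probabilities. Since $\omega_{vcc}(\cdot)$ and $\omega_{norm}(\cdot)$ are only referenced on this page, the sketch fixes $\omega_{vcc} \equiv 1$ and normalizes by the number of penalized terms; both are placeholders for the paper's actual weighting:

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(S, P, lambda_ce=1.0, lambda_incons=1.0):
    """Sketch of Eq. (1) with the inconsistency term of Eq. (2).

    S: (T, B, H, W) integer ground-truth labels
    P: (T, B, C, H, W) per-class probabilities (softmax outputs)
    omega_vcc is fixed to 1 and omega_norm to the number of summed
    terms here; both are assumptions, not the paper's definitions.
    """
    # cross-entropy term of Eq. (1) over all frames
    log_p = torch.log(P.clamp_min(1e-8))
    ce = F.nll_loss(log_p.flatten(0, 1), S.flatten(0, 1))

    # delta(S_{t,m,n} = s): one-hot mask restricting the squared
    # difference between consecutive frames to the correct class
    one_hot = F.one_hot(S[:-1], num_classes=P.shape[2])   # (T-1,B,H,W,C)
    one_hot = one_hot.permute(0, 1, 4, 2, 3).float()      # (T-1,B,C,H,W)
    sq_diff = (P[:-1] - P[1:]) ** 2                       # Eq. (2)
    incons = (one_hot * sq_diff).sum() / one_hot.sum().clamp(min=1.0)

    return lambda_ce * ce + lambda_incons * incons
```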
The inconsistency loss penalizes pixels with different predictions in consecutive frames which are already predicted correctly in at least one frame of the consecutive pair. This ensures that all other incorrect pixels are only affected by the cross-entropy loss. Additionally, $\delta(S_{t,m,n} = s)$ selects only the correct semantic class for consistency enforcement. We nor-