Page 81 - Joint Austrian Computer Vision and Robotics Workshop 2020

make predictions based on time sequences. We prefer the ConvLSTM [23] cell for this dense prediction task. Compared to the fully connected LSTM, it removes unnecessary connections. For instance, the connection of features from the top left corner of the previous frame to features of the bottom right corner of the current frame is not needed. We assume that if we ensure consistency locally by the convolution operator, we will generate overall results which are consistent, as long as motion between frames can be detected in the local window. Therefore, we need to choose the filter size large enough to allow the ConvLSTM to detect local consistencies and motion between frames without explicit optical flow information. Furthermore, the ConvLSTM allows us to process images at different resolutions and reduces the number of parameters significantly compared to the fully connected LSTM. The definition of the ConvLSTM cell is shown in [23].

We use two different networks in which we include the ConvLSTM cell. First, we introduce the VideoSSNet (VSSNet) architecture, which consists of six layers of 3×3 convolutions with dilation rates {1, 1, 2, 2, 4, 4} and 64 channels. Compared to the SSNet, we replace the last convolutional layer with a ConvLSTM in the VSSNet. Second, we also extend the ESPNet [19] with a ConvLSTM layer. Although it would be reasonable to propagate features at every layer of a CNN architecture, this is not feasible because of the fast-growing computational complexity. Figure 2 shows the ESPNet architecture with four possible positions for the ConvLSTM. The proposed architectures are enumerated alphabetically from ESPNet L1a to ESPNet L1d, starting with the ConvLSTM at the highest feature level, which means that it is located closest to the output layer. Besides the ConvLSTM layer, we implement two ESP modules at the first spatial level and three ESP modules at the second spatial level, which is the simplest configuration introduced in [19]. All other aspects of the ESPNet architecture remain unchanged.

[Figure 2: ESPNet with ConvLSTM. Four different positions for including a ConvLSTM (orange) into the existing ESPNet architecture are depicted. Dashed boxes indicate that only one ConvLSTM is present in a single architecture. L1b, L1c and L1d replace 1×1 channel reduction convolutions, while L1a adds an additional layer to the architecture of the original ESPNet. Red boxes indicate a spatial dimensionality reduction by the factor two, while green boxes indicate a spatial dimensionality increase of two.]
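To make the recurrence concrete, the following is a minimal PyTorch sketch of a ConvLSTM cell in the spirit of [23]. The class and variable names are ours, the four gates are fused into a single convolution for brevity, and the Hadamard (peephole) terms on the cell state that appear in the full definition in [23] are omitted; this is a sketch under those simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell in the spirit of [23]: every gate is
    computed by a convolution over a local window, so the state at a
    pixel only depends on a neighborhood of the previous state.
    Peephole terms from the full definition are omitted for brevity."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (i, f, o, g) at once;
        # the padding keeps the spatial resolution unchanged.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state=None):
        # x: (B, C_in, M, N); state: tuple (hidden H, cell C) or None.
        if state is None:
            b, _, m, n = x.shape
            h = x.new_zeros(b, self.hidden_channels, m, n)
            c = x.new_zeros(b, self.hidden_channels, m, n)
        else:
            h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # cell state updated from a local window
        h = o * torch.tanh(c)          # new hidden state
        return h, (h, c)
```

Because every gate is a convolution, the kernel size (here 3×3) directly bounds how much inter-frame motion the cell can absorb, which is the filter-size argument made above.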
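Building on that cell, here is a sketch of the VSSNet as described above: six 3×3 layers with dilation rates {1, 1, 2, 2, 4, 4} and 64 channels, with the last convolution replaced by a ConvLSTM. The RGB input channel count, the ReLU activations, the use of an undilated 3×3 ConvLSTM in the sixth slot, and reading class scores directly from the hidden state are assumptions made for this sketch; the page does not spell them out.

```python
class VSSNet(nn.Module):
    """Sketch of the VSSNet: six 3x3 layers with dilation rates
    {1, 1, 2, 2, 4, 4} and 64 channels, where the sixth layer's
    convolution is replaced by a ConvLSTM. Undilated ConvLSTM and
    hidden-state-as-logits are assumptions of this sketch."""

    def __init__(self, num_classes: int, in_channels: int = 3, channels: int = 64):
        super().__init__()
        layers, c = [], in_channels
        for d in (1, 1, 2, 2, 4):  # the first five of the six layers
            layers += [nn.Conv2d(c, channels, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
            c = channels
        self.features = nn.Sequential(*layers)
        # Sixth layer: the convolution is replaced by a ConvLSTM.
        self.convlstm = ConvLSTMCell(channels, num_classes)

    def forward(self, frames):
        # frames: (B, T, C, M, N) -> per-frame logits: (B, T, |S|, M, N)
        state, logits = None, []
        for t in range(frames.size(1)):  # unroll over the sequence
            h, state = self.convlstm(self.features(frames[:, t]), state)
            logits.append(h)
        return torch.stack(logits, dim=1)
```

For example, with frames = torch.rand(2, 4, 3, 128, 256) and num_classes=19 (the class count that appears in Figure 2), the output has shape (2, 4, 19, 128, 256).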
3.2. Temporal Consistency Loss

Our second building block to enforce consistency is an additional error term in our loss function. The resulting loss function $L(\cdot)$ is defined as

$$L(S, P) = \lambda_{ce} L_{ce}(S, P) + \lambda_{incons} L_{incons}(S, P), \tag{1}$$

where $S \in \mathcal{S}^{T \times M \times N}$ contains the semantic ground truth and $P \in \mathbb{R}^{T \times M \times N \times |\mathcal{S}|}$ contains the predictions. The set $\mathcal{S}$ contains all semantic labels. We bound the dimensions by the sequence length $T$, the image dimensions $M \times N$ and the number of semantic labels $|\mathcal{S}|$. The function $L_{ce}(\cdot)$ computes the cross entropy loss and $L_{incons}(\cdot)$ penalizes inconsistencies. The hyper-parameters $\lambda_{ce}$ and $\lambda_{incons}$ are introduced to influence the balance between training with a focus on prediction accuracy or on consistency.

We define the inconsistency loss as

$$L_{incons}(S, P) = \frac{1}{\omega_{norm}(S)} \sum_{t,m,n=1}^{T-1,M,N} \omega_{vcc}(S, P, t, m, n) \cdot \left[ \sum_{s=1}^{|\mathcal{S}|} \delta(S_{t,m,n} = s) \cdot \left( P_{t,m,n,s} - P_{t+1,m,n,s} \right)^2 \right], \tag{2}$$

where $\delta(\cdot)$ refers to the indicator function defined as

$$\delta(\phi(\cdot)) = \begin{cases} 1 & \text{if } \phi(\cdot) \text{ is true} \\ 0 & \text{else.} \end{cases} \tag{3}$$

The inconsistency loss penalizes pixels with different predictions in consecutive frames which are already predicted correctly in at least one frame of the consecutive pair. This ensures that all other incorrect pixels are only affected by the cross-entropy loss. Additionally, $\delta(S_{t,m,n} = s)$ selects only the correct semantic class for consistency enforcement. We nor-
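A hedged sketch of Eqs. (1)-(3) in the same PyTorch setting follows. Because the weighting functions $\omega_{vcc}$ and $\omega_{norm}$ are defined beyond the point where this page cuts off, the sketch takes the per-pixel weights as an optional argument and falls back to uniform weighting with a simple count-based normalization; the default $\lambda$ values are placeholders, not the paper's.

```python
import torch
import torch.nn.functional as F

def inconsistency_loss(labels, probs, w_vcc=None):
    """Sketch of Eq. (2).
    labels: (B, T, M, N) integer ground truth S.
    probs:  (B, T, M, N, S) predicted probabilities P.
    w_vcc:  optional (B, T-1, M, N) weights; the paper's omega_vcc and
    omega_norm are defined beyond this page, so uniform weights and a
    count-based normalization stand in for them here."""
    num_classes = probs.size(-1)
    # delta(S_{t,m,n} = s): one-hot mask selecting the correct class.
    correct = F.one_hot(labels[:, :-1], num_classes).float()
    # Squared difference of consecutive predictions, correct class only.
    diff = (probs[:, :-1] - probs[:, 1:]) ** 2
    per_pixel = (correct * diff).sum(dim=-1)      # (B, T-1, M, N)
    if w_vcc is not None:
        per_pixel = per_pixel * w_vcc             # assumed weighting hook
    return per_pixel.sum() / per_pixel.numel()    # stand-in for 1/omega_norm

def total_loss(labels, logits, lambda_ce=1.0, lambda_incons=1.0):
    """Sketch of Eq. (1): weighted sum of cross entropy and Eq. (2).
    logits: (B, T, S, M, N) raw network outputs."""
    b, t, s, m, n = logits.shape
    ce = F.cross_entropy(logits.reshape(b * t, s, m, n),
                         labels.reshape(b * t, m, n))
    probs = logits.softmax(dim=2).permute(0, 1, 3, 4, 2)  # (B, T, M, N, S)
    return lambda_ce * ce + lambda_incons * inconsistency_loss(labels, probs)
```

The one-hot mask implements the $\delta(S_{t,m,n} = s)$ selection: only the ground-truth class probability is pulled toward its value in the next frame. The gating by "correct in at least one frame" described above is presumably carried by $\omega_{vcc}$, which is why the sketch leaves it as an input rather than guessing its form.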
Title
Joint Austrian Computer Vision and Robotics Workshop 2020
Editor
Graz University of Technology
Location
Graz
Date
2020
Language
English
License
CC BY 4.0
ISBN
978-3-85125-752-6
Size
21.0 x 29.7 cm
Pages
188
Categories
Computer Science
Technology