tion changes, occlusions and other variations. Figure 1 illustrates the differences between a temporally inconsistent prediction of video frames by a trained ESPNet [19] and our consistent model.
We address this issue by introducing methods
which alter existing single frame CNN architectures
such that their prediction accuracy benefits from having multiple frames of the same scene. Our method is designed such that it can be applied to any single frame CNN architecture. Potential applications include robotics and autonomous vehicles, where video data can be recorded easily. Since we aim for a real-life application scenario, our method does not access future frames. Instead, we only utilize information from past frames to predict the current frame. We implement our online method on the lightweight CNN architecture ESPNet. We include a recurrent neural network (RNN) layer in the ESPNet which allows past image features to be combined with current image features and thus computes consistent semantic segmentation over time. To train the parameters of our novel model for consistency, we introduce an inconsistency error term to our objective function. We verify our methods on a second architecture, which we name Semantic Segmentation Network (SSNet).
The reason for developing SSNet is to ensure that our methods do not work only on a specific CNN. We train the parameters of the two models on street scenes using supervised learning. The data is provided by the Cityscapes sequences [6] and a synthetic data set which we generate with the Carla [7] simulator. To avoid the large effort required to manually label video data, we use the pre-trained Xception model [3] to predict highly accurate video semantic segmentation.
2. Related Work
The best performances on semantic segmentation
benchmark tasks such as PASCAL VOC [8] and
Cityscapes [6] are reached by CNN architectures.
Lightweight CNN architectures [19, 12, 29, 25, 27]
have been developed to achieve high accuracy with
low computational effort. We select the highly effi-
cient ESPNet [19] as a basis for our work because
it predicts semantic segmentation in real-time while
maintaining high prediction accuracy. It uses point-
wise convolutions together with a spatial pyramid
of dilated convolutions [28]. The dilated convolu-
tions allow the network to create a large receptive
field while maintaining a shallow architecture. Although ESPNet processes images quickly and accurately, it lacks temporal consistency when predicting consecutive frames. Therefore, we extend the ESPNet and enforce video consistency.
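As a minimal sketch of this idea (not the exact ESPNet module), the block below reduces the channel count with a pointwise convolution and then applies a pyramid of dilated convolutions; the hierarchical summation before concatenation mirrors the feature fusion ESPNet uses to avoid gridding artifacts. The class name and channel arithmetic are our own simplifications.

```python
import torch
import torch.nn as nn

class DilatedPyramid(nn.Module):
    """ESP-style block sketch: a pointwise convolution reduces the channel
    count, then parallel dilated 3x3 convolutions enlarge the receptive
    field at low cost. Assumes `channels` is divisible by `branches`."""

    def __init__(self, channels, branches=4):
        super().__init__()
        d = channels // branches
        self.reduce = nn.Conv2d(channels, d, kernel_size=1)  # pointwise
        self.pyramid = nn.ModuleList(
            nn.Conv2d(d, d, kernel_size=3, padding=2 ** k, dilation=2 ** k)
            for k in range(branches)  # dilation rates 1, 2, 4, 8
        )

    def forward(self, x):
        r = self.reduce(x)
        outs = [conv(r) for conv in self.pyramid]
        # Hierarchically sum lower-dilation outputs into higher ones
        # before concatenating, so the output keeps `channels` channels.
        for k in range(1, len(outs)):
            outs[k] = outs[k] + outs[k - 1]
        return torch.cat(outs, dim=1)
```

Because the padding of each branch equals its dilation rate, every branch preserves the spatial resolution, so the branch outputs can be fused directly.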
Video Consistency Kundu et al. [16] and Siddhartha et al. [2] base their work on the traditional graph cut [14, 15] approach towards semantic segmentation. They extend the traditional 2D CRF to a 3D CRF by adding a temporal dimension, which allows them to predict temporally consistent semantic segmentation on video. Compared to our approach, additional optical flow information needs to be computed and the size of the temporal dimension must be defined in advance. This results in additional computational complexity and less flexibility when changing parameters such as the frame rate. Therefore, we decided to implement RNNs [11, 23, 4], which offer a more flexible approach towards processing video data.
RNNs are trained to learn which features of past frames are relevant for current [18, 21] or future [24, 22] frames. In general, it is not clear whether the LSTM, the GRU or any other RNN architecture is superior [5, 13, 10]. Depending on the application, one architecture might perform slightly better than the others [5]. Variations obtained by modifying the proposed architectures might work even better in some cases [13]. The work of Jozefowicz et al. [13] shows the importance of the elements inside an RNN cell.
Lu et al. [18] use the plain LSTM to associate objects in a video. To enforce a frame-to-frame consistent prediction, they use an association loss during the training of the LSTM. Similarly, we implement a ConvLSTM and an inconsistency loss to tackle semantic segmentation. We place the ConvLSTM at different image feature levels in our architecture as suggested by [22, 21].
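To make the recurrent extension concrete, the following PyTorch sketch inserts a ConvLSTM cell between a generic single-frame encoder and decoder. ConvLSTMCell, RecurrentSegNet and the encoder/decoder placeholders are illustrative names of our own, not code from the cited works, and the state handling is simplified for online, frame-by-frame inference.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: all four gates come from one 2D convolution over
    the concatenated input and hidden feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class RecurrentSegNet(nn.Module):
    """Single-frame encoder/decoder with a ConvLSTM fusing past features."""

    def __init__(self, encoder, decoder, feat_channels):
        super().__init__()
        self.encoder = encoder            # frame -> image features
        self.decoder = decoder            # fused features -> class scores
        self.cell = ConvLSTMCell(feat_channels, feat_channels)
        self.state = None                 # (h, c), carried across frames

    def forward(self, frame):
        feat = self.encoder(frame)
        if self.state is None:            # first frame: zero-initialised state
            zeros = torch.zeros_like(feat)
            self.state = (zeros, zeros)
        fused, self.state = self.cell(feat, self.state)
        return self.decoder(fused)
```

Since only the hidden state from past frames enters the prediction, the model never needs future frames, which matches the online setting described in the introduction.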
3. Consistent Video Semantic Segmentation
In this section, we introduce our methods towards frame-to-frame consistent semantic segmentation. We present different architectures to propagate features through time. To train the architectures for temporal consistency, we extend the cross entropy loss function with a novel inconsistency error term.
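As an illustration of how such an objective can be assembled, the sketch below adds a penalty on the difference between consecutive soft predictions to the per-frame cross entropy. The specific penalty, the function name and the weighting are assumptions for illustration, not necessarily the exact term we use.

```python
import torch.nn.functional as F

def video_segmentation_loss(logits_t, logits_prev, target_t, weight=0.5):
    """Cross entropy on the current frame plus an inconsistency penalty.
    The penalty below (mean absolute difference between consecutive
    softmax outputs) is an illustrative stand-in for the actual term."""
    ce = F.cross_entropy(logits_t, target_t)
    inconsistency = (F.softmax(logits_t, dim=1)
                     - F.softmax(logits_prev, dim=1)).abs().mean()
    return ce + weight * inconsistency
```

Note that a raw frame-to-frame difference also penalizes legitimate changes caused by scene motion, so the weight of the inconsistency term has to balance temporal stability against responsiveness.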
3.1. Temporal Feature Propagation
The propagation of image features from the past
to the current time step allows the neural network to