Joint Austrian Computer Vision and Robotics Workshop 2020
Page - 80 -
tion changes, occlusions and other variations. Figure 1 illustrates the differences between a temporally inconsistent prediction of video frames by a trained ESPNet [19] and our consistent model.

We address this issue by introducing methods which alter existing single frame CNN architectures such that their prediction accuracy benefits from having multiple frames of the same scene. Our method is designed such that it can be applied to any single frame CNN architecture. Potential applications include robotics and autonomous vehicles, where video data can be recorded easily. Since we aim for a real-life application scenario, our method does not access future frames. Instead, we only utilize information from past frames to predict the current frame. We implement our online method on the lightweight CNN architecture ESPNet. We include a recurrent neural network (RNN) layer in the ESPNet which allows past image features to be combined with current image features and thus computes consistent semantic segmentation over time. To train the parameters of our novel model for consistency, we introduce an inconsistency error term in our objective function. We verify our methods on a second architecture, which we name Semantic Segmentation Network (SSNet). The reason for the development of SSNet is to ensure that our methods do not only work on a specific CNN. We train the parameters of the two models on street scenes using supervised learning. The data is provided by the Cityscapes sequence [6] and a synthetic data set which we generate from the Carla [7] simulator. To avoid the large effort required to manually label video data, we use the pre-trained Xception model [3] to predict highly accurate video semantic segmentation.

2. Related Work

The best performances on semantic segmentation benchmark tasks such as PASCAL VOC [8] and Cityscapes [6] are reached by CNN architectures.
Lightweight CNN architectures [19, 12, 29, 25, 27] have been developed to achieve high accuracy with low computational effort. We select the highly efficient ESPNet [19] as a basis for our work because it predicts semantic segmentation in real time while maintaining high prediction accuracy. It uses point-wise convolutions together with a spatial pyramid of dilated convolutions [28]. The dilated convolutions allow the network to create a large receptive field while maintaining a shallow architecture. Although ESPNet processes images fast and accurately, it lacks temporal consistency when predicting consecutive frames. Therefore, we extend the ESPNet and enforce video consistency.

Video Consistency. Kundu et al. [16] and Siddhartha et al. [2] base their work on the traditional graph cut [14, 15] approach towards semantic segmentation. They extend the traditional 2D CRF to a 3D CRF by adding a temporal dimension, which allows them to predict temporally consistent semantic segmentation on video. Compared to our approach, additional optical flow information needs to be computed and the size of the temporal dimension must be predefined. This results in additional computational complexity and less flexibility when changing parameters such as the frame rate. Therefore, we decided to implement RNNs [11, 23, 4], which offer a more flexible approach towards processing video data. RNNs are trained to learn which features of past frames are relevant for current [18, 21] or future [24, 22] frames. In general, it is not clear if LSTM, GRU or any other RNN architecture is superior [5, 13, 10]. Depending on the application, one architecture might perform slightly better than the other [5]. Variations through modifying the proposed architectures might work even better in some cases [13]. The work of Jozefowicz et al. [13] shows the importance of the elements inside an RNN cell.

Lu et al. [18] use the plain LSTM to associate objects in a video.
To enforce a frame-to-frame consistent prediction, they use an association loss during the training of the LSTM. Similarly, we implement a ConvLSTM and an inconsistency loss to tackle semantic segmentation. We place the ConvLSTM on different image feature levels in our architecture, as suggested by [22, 21].

3. Consistent Video Semantic Segmentation

In this section, we introduce our methods towards frame-to-frame consistent semantic segmentation. We present different architectures to propagate features through time. To train the architectures for temporal consistency, we extend the cross entropy loss function with a novel inconsistency error term.

3.1. Temporal Feature Propagation

The propagation of image features from the past to the current time step allows the neural network to
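The idea of extending the cross entropy loss with an inconsistency term can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the specific inconsistency measure (here a mean squared difference between consecutive per-pixel class probabilities) and the weighting factor `lam` are assumptions for the sake of the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(probs, labels, eps=1e-12):
    # Mean per-pixel cross entropy; probs is (H, W, C), labels is an
    # integer class map of shape (H, W).
    h, w = labels.shape
    p = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(p + eps).mean()

def inconsistency(probs_t, probs_prev):
    # Hypothetical inconsistency term: mean squared difference between
    # the probability maps predicted for consecutive frames. It is zero
    # when both frames receive identical predictions.
    return ((probs_t - probs_prev) ** 2).mean()

def total_loss(logits_t, logits_prev, labels_t, lam=0.5):
    # Extended objective: cross entropy on the current frame plus a
    # weighted penalty for frame-to-frame prediction changes.
    p_t = softmax(logits_t)
    p_prev = softmax(logits_prev)
    return cross_entropy(p_t, labels_t) + lam * inconsistency(p_t, p_prev)
```

During training, the inconsistency term pushes the network to keep predictions for the same scene stable across frames, while the cross entropy term keeps them anchored to the labels of the current frame.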
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik, Technik