Fig. 2. Iterative pose refinement from an initial sensor estimate: (a) test image with overlaid ground truth pose, (b) initial noisy sensor pose, (c) segmented image, (d) final pose obtained with our method.
IV. 3D LOCALIZATION
Building on the segmentation approach trained on the data described in Secs. II and III, we proposed two different approaches for pose estimation.
A. Direct Pose Selection [1]
Given a coarse initial estimate $\tilde{p}$ of the pose provided by the sensors and a 2.5D map of its surroundings, the goal is to estimate the correct pose $\hat{p}$. To this end, we sample poses on a regular grid around $\tilde{p}$ and estimate
$$\hat{p} = \arg\max_p L(p), \quad (1)$$
where $L(p)$ is the log-likelihood
$$L(p) = \sum_x \log P_{c(p,x)}(x). \quad (2)$$
The sum runs over all image locations $x$, where $c(p,x)$ is the class at location $x$ when rendering the model under pose $p$, and $P_c(x)$ is the probability for class $c$ at location $x$, with $P_c$ being one of the probability maps predicted by the semantic segmentation.
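The following minimal sketch illustrates how Eqs. (1) and (2) could be evaluated in practice; it is not the authors' implementation. The function `render_classes` and the array `prob_maps`, as well as all parameter names, are hypothetical stand-ins for the pose-dependent rendering $c(p,\cdot)$ and the predicted probability maps $P_c$:

```python
# Sketch of the direct pose selection of Eqs. (1)-(2); names are hypothetical.
import numpy as np

def log_likelihood(pose, render_classes, prob_maps, eps=1e-12):
    """L(p) = sum_x log P_{c(p,x)}(x), cf. Eq. (2).

    prob_maps: array of shape (num_classes, H, W) with the per-class
    probability maps P_c predicted by the semantic segmentation.
    """
    class_map = render_classes(pose)          # c(p, x) for every pixel x
    h, w = class_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each pixel, look up the probability of the class rendered there.
    probs = prob_maps[class_map, ys, xs]
    return np.log(probs + eps).sum()

def select_pose(initial_pose, candidate_offsets, render_classes, prob_maps):
    """Evaluate a regular grid of poses around the sensor estimate and
    return the maximizer of Eq. (1)."""
    candidates = [initial_pose + d for d in candidate_offsets]
    scores = [log_likelihood(p, render_classes, prob_maps) for p in candidates]
    return candidates[int(np.argmax(scores))]
```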
B. CNN-based Refinement [2]
As this brute-force strategy is not very efficient, we additionally proposed a CNN-based approach for iterative pose refinement. To refine the location, we discretize the directions along the ground plane into 8 possible directions and train a network to predict the best direction in which to shift the currently estimated location. We also add a class indicating that the estimated location is already correct and should not be changed. Thus, given the semantic segmentation of the current input image and a rendering of the 2.5D map from the current pose estimate, the network, denoted by $\text{CNN}_t$, yields a 9-dimensional output vector:
$$d_t = \text{CNN}_t(R_F, R_{HE}, R_{VE}, R_{BG}, S_F, S_{HE}, S_{VE}, S_{BG}). \quad (3)$$
Here, $S_F$, $S_{HE}$, $S_{VE}$, and $S_{BG}$ denote the probability maps computed by the semantic segmentation for the classes façade, horizontal edge, vertical edge, and background, respectively; $R_F$, $R_{HE}$, $R_{VE}$, and $R_{BG}$ are binary maps for the same classes, created by rendering the 2.5D map for the current pose estimate. In addition, we train a second network to refine the orientation:
$$d_o = \text{CNN}_o(R_F, R_{HE}, R_{VE}, R_{BG}, S_F, S_{HE}, S_{VE}, S_{BG}), \quad (4)$$
where $d_o$ is a 3-dimensional vector covering the probabilities of rotating the camera to the right, to the left, or not at all.
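As a hypothetical illustration of how these discrete outputs might be mapped to pose updates (not taken from [1] or [2]; the step sizes and class index conventions below are assumptions), one can interpret classes 0-7 of $\text{CNN}_t$ as the eight ground-plane directions, class 8 as "stay", and the three classes of $\text{CNN}_o$ as rotate right, rotate left, or keep the orientation:

```python
# Hypothetical mapping from predicted classes to pose updates.
import numpy as np

STEP = 1.0                   # translation step in map units (assumed)
ROT_STEP = np.deg2rad(5.0)   # rotation step (assumed)

# 8 unit directions on the ground plane, spaced every 45 degrees.
DIRECTIONS = [(np.cos(a), np.sin(a))
              for a in np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)]

def apply_translation(pose, t_class):
    """pose = (x, y, theta); t_class in {0..8}, where 8 means 'stay'."""
    if t_class == 8:
        return pose
    dx, dy = DIRECTIONS[t_class]
    x, y, theta = pose
    return (x + STEP * dx, y + STEP * dy, theta)

def apply_rotation(pose, o_class):
    """o_class in {0,1,2}: rotate right, rotate left, or keep orientation."""
    x, y, theta = pose
    delta = {0: -ROT_STEP, 1: ROT_STEP, 2: 0.0}[o_class]
    return (x, y, theta + delta)
```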
Starting from the initial estimate $\tilde{p}$, we iteratively apply $\text{CNN}_t$ and $\text{CNN}_o$ and update the current pose. These steps are iterated until both networks have converged and predict no further movement. Having two networks offers two main advantages: (a) as the networks for translation and orientation are treated separately, we do not need to balance between them; (b) the two decoupled problems are much easier to solve, reducing both the training and the inference effort.
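A minimal sketch of this refinement loop, reusing the hypothetical helpers above and assuming that `cnn_t` and `cnn_o` return the argmax classes of Eqs. (3) and (4), while `render_maps` and `seg_maps` provide the binary renderings $R_*$ and the probability maps $S_*$ as tuples:

```python
# Sketch of the iterative refinement loop; all names are assumptions.
def refine_pose(pose, cnn_t, cnn_o, render_maps, seg_maps, max_iters=50):
    for _ in range(max_iters):
        inputs = render_maps(pose) + seg_maps   # (R_F,...,R_BG, S_F,...,S_BG)
        t_class = cnn_t(inputs)                 # best translation direction
        o_class = cnn_o(inputs)                 # best rotation direction
        if t_class == 8 and o_class == 2:       # both networks predict "stay"
            break
        pose = apply_rotation(apply_translation(pose, t_class), o_class)
    return pose
```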
V. RESULTS AND SUMMARY
Two illustrative results obtained by the approach described in Sec. IV-B are shown in Fig. 2. It can clearly be seen that the initial sensor poses (Fig. 2(b)) do not cover the ground truth (Fig. 2(a)) very well, whereas the finally estimated poses (Fig. 2(d)), obtained using the segmentation results (Fig. 2(c)), fit the buildings very well. Overall, this demonstrates that adopting ideas from semantic segmentation in combination with convolutional neural networks and the information provided by 2.5D maps can successfully be used for estimating the camera pose relative to buildings and thus its exact location. For more details, we refer to [1] and [2].
REFERENCES
[1] A. Armagan, M. Hirzer, and V. Lepetit, "Semantic Segmentation for 3D Localization in Urban Environments," in JURSE, 2017. Best Paper Award.
[2] A. Armagan, M. Hirzer, P. M. Roth, and V. Lepetit, “Learning to Align
Semantic Segmentation and 2.5D Maps for Geolocalization,” in CVPR,
2017.
[3] C. Arth, C. Pirchheim, J. Ventura, D. Schmalstieg, and V. Lepetit, "Instant Outdoor Localization and SLAM Initialization from 2.5D Maps," in ISMAR, 2015. Best Paper Award.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep
Convolutional Encoder-Decoder Architecture for Image Segmentation,”
CoRR, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks
for Semantic Segmentation,” in CVPR, 2015.
[6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” CoRR, 2014.