Fig. 2. Iterative pose refinement from an initial sensor estimate: (a) test image with overlaid ground truth pose, (b) initial noisy sensor pose, (c) segmented image, (d) final pose obtained with our method.
IV. 3D LOCALIZATION
Building on the segmentation approach trained on the data described in Secs. II and III, we proposed two different approaches for pose estimation.
A. Direct Pose Selection [1]
Given a coarse initial estimate $\tilde{p}$ of the pose provided by the sensors and a 2.5D map of its surroundings, the goal is to estimate the correct pose $\hat{p}$. To this end, we sample poses on a regular grid around $\tilde{p}$ and estimate
$$\hat{p} = \arg\max_p L(p), \quad (1)$$
where $L(p)$ is the log-likelihood
$$L(p) = \sum_x \log P_{c(p,x)}(x). \quad (2)$$
The sum runs over all image locations $x$, where $c(p,x)$ is the class at location $x$ when rendering the model under pose $p$, and $P_c(x)$ is the probability for class $c$ at location $x$, with $P_c$ being one of the probability maps predicted by the semantic segmentation.
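The following minimal sketch illustrates how Eqs. (1) and (2) could be evaluated in practice; it is not the authors' implementation. The function `render_classes` and the array `prob_maps`, as well as all parameter names, are hypothetical stand-ins for the pose-dependent rendering $c(p,\cdot)$ and the predicted probability maps $P_c$:

```python
# Sketch of the direct pose selection of Eqs. (1)-(2); names are hypothetical.
import numpy as np

def log_likelihood(pose, render_classes, prob_maps, eps=1e-12):
    """L(p) = sum_x log P_{c(p,x)}(x), cf. Eq. (2).

    prob_maps: array of shape (num_classes, H, W) with the per-class
    probability maps P_c predicted by the semantic segmentation.
    """
    class_map = render_classes(pose)          # c(p, x) for every pixel x
    h, w = class_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each pixel, look up the probability of the class rendered there.
    probs = prob_maps[class_map, ys, xs]
    return np.log(probs + eps).sum()

def select_pose(initial_pose, candidate_offsets, render_classes, prob_maps):
    """Evaluate a regular grid of poses around the sensor estimate and
    return the maximizer of Eq. (1)."""
    candidates = [initial_pose + d for d in candidate_offsets]
    scores = [log_likelihood(p, render_classes, prob_maps) for p in candidates]
    return candidates[int(np.argmax(scores))]
```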
B. CNN-based Refinement [2]
As this brute-force strategy is not very efficient, we additionally proposed a CNN-based approach for iterative pose refinement. To refine the location, we discretize the directions along the ground plane into 8 possible directions and train a network to predict the best direction in which to shift the currently estimated location. We also add a class indicating that the estimated location is already correct and should not be changed. Thus, given the semantic segmentation of the current input image and a rendering of the 2.5D map from the current pose estimate, the network, denoted by $\text{CNN}_t$, yields a 9-dimensional output vector:
$$d_t = \text{CNN}_t(R_F, R_{HE}, R_{VE}, R_{BG}, S_F, S_{HE}, S_{VE}, S_{BG}). \quad (3)$$
Here, $S_F$, $S_{HE}$, $S_{VE}$, and $S_{BG}$ denote the probability maps computed by the semantic segmentation for the classes façade, horizontal edge, vertical edge, and background, respectively; $R_F$, $R_{HE}$, $R_{VE}$, and $R_{BG}$ are binary maps for the same classes, created by rendering the 2.5D map for the current pose estimate. In addition, we train a second network to refine the orientation:
$$d_o = \text{CNN}_o(R_F, R_{HE}, R_{VE}, R_{BG}, S_F, S_{HE}, S_{VE}, S_{BG}), \quad (4)$$
where $d_o$ is a 3-dimensional vector covering the probabilities of rotating the camera to the right, to the left, or not at all.
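As a hypothetical illustration of how these discrete outputs might be mapped to pose updates (not taken from [1] or [2]; the step sizes and class index conventions below are assumptions), one can interpret classes 0-7 of $\text{CNN}_t$ as the eight ground-plane directions, class 8 as "stay", and the three classes of $\text{CNN}_o$ as rotate right, rotate left, or keep the orientation:

```python
# Hypothetical mapping from predicted classes to pose updates.
import numpy as np

STEP = 1.0                   # translation step in map units (assumed)
ROT_STEP = np.deg2rad(5.0)   # rotation step (assumed)

# 8 unit directions on the ground plane, spaced every 45 degrees.
DIRECTIONS = [(np.cos(a), np.sin(a))
              for a in np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)]

def apply_translation(pose, t_class):
    """pose = (x, y, theta); t_class in {0..8}, where 8 means 'stay'."""
    if t_class == 8:
        return pose
    dx, dy = DIRECTIONS[t_class]
    x, y, theta = pose
    return (x + STEP * dx, y + STEP * dy, theta)

def apply_rotation(pose, o_class):
    """o_class in {0,1,2}: rotate right, rotate left, or keep orientation."""
    x, y, theta = pose
    delta = {0: -ROT_STEP, 1: ROT_STEP, 2: 0.0}[o_class]
    return (x, y, theta + delta)
```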
Starting from the initial estimate $\tilde{p}$, we iteratively apply $\text{CNN}_t$ and $\text{CNN}_o$ and update the current pose. These steps are iterated until both networks have converged and predict no further movement. Having two networks offers two main advantages: (a) as the networks for translation and orientation are treated separately, we do not need to balance between them; (b) the two decoupled problems are much easier to solve, reducing both the training and the inference effort.
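A minimal sketch of this refinement loop, reusing the hypothetical helpers above and assuming that `cnn_t` and `cnn_o` return the argmax classes of Eqs. (3) and (4), while `render_maps` and `seg_maps` provide the binary renderings $R_*$ and the probability maps $S_*$ as tuples:

```python
# Sketch of the iterative refinement loop; all names are assumptions.
def refine_pose(pose, cnn_t, cnn_o, render_maps, seg_maps, max_iters=50):
    for _ in range(max_iters):
        inputs = render_maps(pose) + seg_maps   # (R_F,...,R_BG, S_F,...,S_BG)
        t_class = cnn_t(inputs)                 # best translation direction
        o_class = cnn_o(inputs)                 # best rotation direction
        if t_class == 8 and o_class == 2:       # both networks predict "stay"
            break
        pose = apply_rotation(apply_translation(pose, t_class), o_class)
    return pose
```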
V. RESULTS AND SUMMARY
Two illustrative results obtained by the approach described in Sec. IV-B are shown in Fig. 2. It can clearly be seen that the initial sensor poses (Fig. 2(b)) do not cover the ground truth (Fig. 2(a)) very well, whereas the finally estimated poses (Fig. 2(d)), obtained using the segmentation results (Fig. 2(c)), fit the buildings very well. Overall, this demonstrates that adopting ideas from semantic segmentation in combination with convolutional neural networks and the information provided by 2.5D maps can successfully be used for estimating the camera pose relative to buildings and thus its exact location. For more details, we refer to [1] and [2].
REFERENCES
[1] A. Armagan, M. Hirzer, and V. Lepetit, "Semantic Segmentation for 3D Localization in Urban Environments," in JURSE, 2017. Best Paper Award.
[2] A. Armagan, M. Hirzer, P. M. Roth, and V. Lepetit, “Learning to Align
Semantic Segmentation and 2.5D Maps for Geolocalization,” in CVPR,
2017.
[3] C. Arth, C. Pirchheim, J. Ventura, D. Schmalstieg, and V. Lepetit, "Instant Outdoor Localization and SLAM Initialization from 2.5D Maps," in ISMAR, 2015. Best Paper Award.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep
Convolutional Encoder-Decoder Architecture for Image Segmentation,”
CoRR, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks
for Semantic Segmentation,” in CVPR, 2015.
[6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” CoRR, 2014.