3D Localization in Urban Environments from Single Images*
Anil Armagan1, Martin Hirzer1, Peter M. Roth1 and Vincent Lepetit1,2
Abstract—In this paper, we tackle the problem of geo-localization in urban environments, overcoming the limited accuracy of sensors such as GPS, compass and accelerometer. For that purpose, we adopt recent findings in image segmentation and machine learning and combine them with the valuable information given by 2.5D maps of buildings. In particular, we first extract the façades of buildings and their edges and use this information to estimate the orientation and location that best align an input image to a 3D rendering of the given 2.5D map. As this step builds on a learned semantic segmentation procedure, a large amount of training data is required. Thus, we also discuss how the required training data can be generated efficiently via a 3D tracking system.
I. INTRODUCTION
Accurate geo-localization of images is a very active area
in Computer Vision, as it can potentially be used for appli-
cations such as autonomous driving and Augmented Reality.
As the typically available GPS and compass information is often not accurate enough for such applications, we recently proposed a method that builds only on untextured 2.5D maps [3].
[3]. In general, 2.5D maps hold the 2D information about
the environment, more precisely the buildings’ outlines and
their heights. However, this approach is limited in practice,
as it heavily relies on the often unreliable and error-prone extraction of straight line segments to find the re-projections of the building corners.
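For illustration only, the following sketch shows how such a 2.5D map entry, i.e., a building outline together with a height, can be extruded into a simple 3D model of the kind used later for rendering. The data layout and the `extrude` helper are assumptions made for this example, not the authors' representation.

```python
# Minimal sketch of a 2.5D map entry and its extrusion into a simple 3D
# model. The data layout (footprint as a list of 2D vertices plus a single
# height value) is an assumption for illustration, not the paper's format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Building25D:
    footprint: List[Tuple[float, float]]  # 2D outline vertices (x, y) in metres
    height: float                         # building height in metres

def extrude(building: Building25D):
    """Turn a 2.5D building into 3D vertices and quad faces for its facades."""
    n = len(building.footprint)
    # Ground and roof vertices: (x, y, 0) and (x, y, height).
    vertices = [(x, y, 0.0) for x, y in building.footprint] + \
               [(x, y, building.height) for x, y in building.footprint]
    # One vertical quad per footprint edge; these quads are the facades whose
    # re-projections are later matched against the segmentation.
    faces = [(i, (i + 1) % n, (i + 1) % n + n, i + n) for i in range(n)]
    return vertices, faces

# Example: a 10 m x 6 m building, 15 m high.
verts, quads = extrude(Building25D([(0, 0), (10, 0), (10, 6), (0, 6)], 15.0))
```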
To overcome this limitation, as shown in Fig. 1, we replace this step by semantic segmentation (cf. [4], [5]) to extract the visible façades and their edges, which is described in more detail in Sec. II. Since learning the necessary model requires a large amount of training data, we use a 3D tracking algorithm to semi-automatically label the required training images, as detailed in Sec. III. To estimate the correct pose, we introduce two strategies. The first strategy samples random poses around the initial pose given by the sensors and selects the best one. The second strategy builds on a more advanced search algorithm, using CNNs to iteratively update the pose. Both approaches are discussed in Sec. IV.
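To make the first strategy concrete, the following is a minimal sketch, not the authors' implementation: candidate poses are drawn around the sensor prior, the 2.5D model is rendered for each, and the candidate whose rendering agrees best with the segmentation is kept. The function `render_semantic_view`, the segmentation array `seg`, and the sampling parameters are hypothetical placeholders.

```python
# Hedged sketch of the pose-sampling strategy: perturb the sensor pose,
# render the 2.5D model for each candidate, and score it against the
# semantic segmentation. `render_semantic_view` and `seg` are assumed
# placeholders, not the paper's actual interfaces.
import numpy as np

def sample_poses(sensor_pose, n=1000, pos_sigma=5.0, yaw_sigma=np.deg2rad(10)):
    """Draw random candidate poses (x, y, yaw) around the sensor estimate."""
    x, y, yaw = sensor_pose
    return np.column_stack([
        np.random.normal(x, pos_sigma, n),
        np.random.normal(y, pos_sigma, n),
        np.random.normal(yaw, yaw_sigma, n),
    ])

def best_pose(sensor_pose, seg, render_semantic_view):
    """Keep the candidate whose rendered facade/edge labels agree most with `seg`."""
    candidates = sample_poses(sensor_pose)
    # Score = fraction of pixels where the rendered class matches the predicted class.
    scores = [np.mean(render_semantic_view(p) == seg) for p in candidates]
    return candidates[int(np.argmax(scores))]
```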
* This work was funded by the Christian Doppler Laboratory for Semantic 3D Computer Vision.
1 Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria {armagan, hirzer, pmroth, lepetit}@icg.tugraz.at
2 Laboratoire Bordelais de Recherche en Informatique, Université de Bordeaux, Bordeaux, France

Fig. 1. Overview of our approach: Given an input image (a), we segment the façades and their edges (b). We can either sample poses around the pose provided by the sensors or use CNNs to move the camera starting from the sensor pose (c), and keep the pose that aligns the 2.5D map and the segmentation best (d).

II. SEMANTIC SEGMENTATION

Given a color input image I, we train a fully convolutional network (FCN) [5] to perform semantic segmentation. The FCN applies a series of convolutional and pooling layers to the input image, followed by deconvolution layers that produce a segmentation map of the whole image at the original resolution. In our case, we aim at segmenting the façades and the edges at building corners or between different façades. Everything else is referred to as “background”. We therefore consider four classes: façade, vertical edges, horizontal edges and background. We use a stage-wise training procedure, where we start with a coarse network (FCN-32s) initialized from VGG-16 [6], fine-tune it on our data, and then use the resulting model to initialize the weights of a finer-grained network (FCN-16s). This process is repeated to obtain the final segmentation network with a prediction stride of 8 pixels (FCN-8s).
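As an illustration of the first, coarse stage, the following is a minimal FCN-32s-style sketch in PyTorch for the four classes. The exact layer configuration, the upsampling layer, and the torchvision weight-loading call (which differs between versions) are assumptions for this sketch, not the authors' code.

```python
# Minimal FCN-32s-style sketch for the four classes (facade, vertical edge,
# horizontal edge, background). Illustration of the stage-wise idea only.
import torch
import torch.nn as nn
from torchvision.models import vgg16

NUM_CLASSES = 4  # facade, vertical edges, horizontal edges, background

class FCN32s(nn.Module):
    def __init__(self):
        super().__init__()
        # VGG-16 convolutional layers as backbone; `pretrained` may be replaced
        # by the `weights` argument in newer torchvision versions.
        self.backbone = vgg16(pretrained=True).features
        self.classifier = nn.Conv2d(512, NUM_CLASSES, kernel_size=1)  # 1x1 class scores
        # Upsample the coarse score map back to the input resolution (stride 32).
        self.upsample = nn.ConvTranspose2d(NUM_CLASSES, NUM_CLASSES,
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        h = self.backbone(x)     # coarse features at 1/32 of the input size
        h = self.classifier(h)   # per-location class scores at 1/32 resolution
        return self.upsample(h)  # dense prediction at the original resolution

# Training uses a standard per-pixel cross-entropy loss; the FCN-16s and
# FCN-8s stages would be initialized from this model and add skip connections.
model = FCN32s()
loss_fn = nn.CrossEntropyLoss()
```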
III. ACQUISITION OF TRAINING DATA
Deep-learning segmentation methods require a large number of training images to generalize well; however, manual annotation is costly. We therefore use a 3D tracking system [3] to easily annotate frames of video sequences. First, we create simple 3D models from the 2.5D maps. Then, for each sequence, we initialize the pose for the first frame manually, and the tracker estimates the poses for the remaining frames. This allows us to label façades and their edges very efficiently. More precisely, we recorded 95 short video sequences using a mobile device. To ensure an accurate labeling, in particular for the edges, we only keep frames in which the re-projection of the 3D model is well aligned with the real image, and remove frames that suffer from tracking errors or drift.
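As a rough illustration of this filtering step, a short sketch is given below. The paper does not specify the alignment metric, so `alignment_score` (e.g., the overlap between rendered model edges and image edges) and the threshold are hypothetical placeholders.

```python
# Hedged sketch of the frame-filtering step: keep only frames whose tracked
# pose re-projects the 3D model in good agreement with the image.
# `alignment_score` is an assumed placeholder, not the paper's criterion.
def filter_frames(frames, poses, model_3d, alignment_score, threshold=0.8):
    """Return (frame, pose) pairs whose re-projection is well aligned."""
    kept = []
    for frame, pose in zip(frames, poses):
        if alignment_score(frame, model_3d, pose) >= threshold:
            kept.append((frame, pose))  # accurate enough to label facades/edges
    return kept
```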