Our contributions are:
• We propose a Trilinear Interpolation Layer suited for creating equivariant feature spaces in SO(3).
• We provide quantitative and qualitative evidence for the advantage of equivariant feature spaces by predicting unseen views in SO(3) of objects from the LineMOD dataset [4].
The remainder of the paper is structured as follows. Section 2 reviews related work. In Section 3 we describe our approach. Section 4 presents our experimental results. Finally, Section 5 concludes the paper.
2. Related Work
Object pose refiners rely on the availability of prior stages to produce pose hypotheses [7, 10, 12, 16, 18, 20, 21]. When depth data is available, the Iterative Closest Point (ICP) algorithm can be used to refine initial pose estimates [18, 7, 20]. Recent RGB-based approaches do not rely on the availability of depth data for pose refinement [7, 10, 12, 16, 21]. CNN-based object pose refinement architectures such as [10, 12, 21] pass two input images to the network in order to estimate the relative rotation between them. These images are an observation of the object in the desired pose and a rendering of the prediction. In [10] the authors base their network architecture on an approach for optical flow estimation [1] and predict optical flow, mask and relative pose deviation in SE(3). The authors of [21] use a similar approach with two encoders, one per input image. The encoders' outputs are subtracted and further encoded to predict the refined pose in SE(3). We present a concept suitable for enhancing such methods by guiding the network to learn an equivariant feature space.
The STN introduced by [6] is widely used for feature and image space transformation [2, 13, 14, 15, 19]. It consists of the combination of a localization network, a grid generator and a sampler. The authors of [2] apply the STL to properly align the features to their inputs. In [13] the authors predict deep heatmaps from randomly sampled object patches to predict poses under occlusion. They apply the STL to upsample their predictions. In [14, 15] an analog of the localization network is used to produce feature maps invariant to input transformations. The authors of [11] leverage the methodology of the STN to generate realistic-looking images from the intersection of the natural image manifold and the geometric manifold, using an adapted Generative Adversarial Network. Conversely to these approaches, we modify the STL component of the STN to enable SO(3) transformations of input feature maps with spatial dimensions.

Figure 2: Encoder-decoder architecture for image synthesis.
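To make the STL pipeline from the paragraph above concrete, the following is a minimal sketch of a standard 2D spatial transformer in PyTorch: a localization network regresses affine parameters, a grid generator builds sampling coordinates, and a bilinear sampler warps the input. The layer sizes are illustrative assumptions, not the configuration used in the cited works; only the identity initialization follows [6].

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer2D(nn.Module):
    # Minimal STL sketch: localization network -> grid generator -> sampler.
    def __init__(self):
        super().__init__()
        # Localization network regressing a 2x3 affine matrix (covers SE(2)).
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(32), nn.ReLU(), nn.Linear(32, 6),
        )
        # Initialize the regressor to the identity transform, as proposed in [6].
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                          # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)          # bilinear sampler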
3. Approach
This section presents our approach for learning equivariant features in SO(3) in order to synthesize images from unseen viewpoints. We first give a problem definition, then describe the Trilinear Interpolation Layer (TIL). Finally, we outline how the TIL is used in an encoder-decoder architecture for image synthesis.
3.1. Problem Statement
Let $X = \{x_c, (\tilde{x}^0_{\theta_0}, \ldots, \tilde{x}^n_{\theta_i})\}$ be a set of training examples, where $x_c$ refers to the projection $\prod$ of object $o_c$, in its canonical pose, to the image space $I$. The $\tilde{x}^n_{\theta_i}$ are the projections of transformed objects $o_{\theta_i}$, where $\theta_i$ represents the transformation in SO(3) for the projection into $I$. Our goal is to learn the inverse of the mapping function, $\prod^{-1}$, in order to produce transformed images; in other words, to learn $\tilde{x}^n_{\theta_i} = \prod[\prod^{-1}(x_c), \theta_i]$ given an image of the object in its canonical pose and transformation parameters.
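The training objective is left implicit here. One natural way to instantiate it, stated purely as an assumption, is to train a network $f_w$ (the encoder-decoder of Figure 2 with weights $w$; our notation, not the paper's) to approximate the composed mapping $\prod[\prod^{-1}(x_c), \theta_i]$ by minimizing a pixel-wise reconstruction loss over the training set:

% Hypothetical objective in our notation, not taken verbatim from the paper.
\min_{w} \; \sum_{i} \big\| f_w(x_c, \theta_i) - \tilde{x}^{n}_{\theta_i} \big\|_2^2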
In order to model the inversion of the mapping function $\prod$, we utilize a CNN due to its power to encode statistical relationships from visual data into feature spaces [8]. To provide our model with information regarding relative transformations $\theta_i$ in SO(3) between pairs of images, we modify the STL of [6]. An overview of the encoder-decoder architecture for image synthesis using the modified STL is presented in Figure 2.
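As a concrete illustration of such a modification, the sketch below applies a given SO(3) rotation to a 3D feature volume with trilinear sampling. Reshaping encoder features into an (N, C, D, H, W) volume and using PyTorch's 3D grid sampling are our assumptions for illustration, not necessarily the exact implementation of the TIL.

import torch
import torch.nn.functional as F

def rotate_feature_volume(feats, R):
    # feats: (N, C, D, H, W) feature volume; R: (N, 3, 3) rotation matrices.
    t = torch.zeros(R.size(0), 3, 1, device=R.device, dtype=R.dtype)
    theta = torch.cat([R, t], dim=2)                   # (N, 3, 4), rotation only
    grid = F.affine_grid(theta, feats.size(), align_corners=False)
    # On 5D inputs, mode='bilinear' performs trilinear interpolation.
    return F.grid_sample(feats, grid, align_corners=False, mode='bilinear')

# Usage: rotate a random volume by 90 degrees about the z-axis.
R = torch.tensor([[[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]]])
out = rotate_feature_volume(torch.randn(1, 8, 16, 16, 16), R)

Note that affine_grid maps output coordinates to input coordinates, so depending on the desired direction of rotation one may need to pass the transpose (inverse) of R instead.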
3.2. Trilinear Interpolation
The STL [6] allows SE(2) transformations to be applied to feature maps. This works well in image