Image Synthesis in SO(3) by Learning Equivariant Feature Spaces

Marco Peer, Stefan Thalhammer, and Markus Vincze

Faculty of Electrical Engineering and Information Technology, TU Wien, 1040 Vienna, Austria

marco.peer@tuwien.ac.at, {thalhammer,vincze}@acin.tuwien.ac.at
Abstract. Equivariance is a desired property for feature spaces designed to make transformations between samples, such as object views, predictable. Encoding this property in two-dimensional feature spaces for 3D transformations is beneficial for tasks such as image synthesis and object pose refinement. We propose the Trilinear Interpolation Layer, which applies SO(3) transformations to the bottleneck feature map of an encoder-decoder network. By employing a 3D grid to trilinearly interpolate in the feature map, we create models suited for view synthesis with three degrees of rotational freedom. We evaluate quantitatively and qualitatively on image synthesis in SO(3), providing evidence of the suitability of our approach.
1. Introduction
Invariant feature spaces are agnostic to input transformations in order to help models overcome variations in the data capturing process. Equivariant feature spaces, in contrast, are exploitable with respect to image space transformations and are thus more suited for reasoning about changes in image space [9]. As a consequence, the property of equivariance is desired for feature spaces that are used for predicting transformations of or in the image space. More precisely, equivariant feature spaces can be exploited to predict unseen views based on known transformations, or to estimate relative transformations between two inputs. Feature spaces that correlate an input with a transformed output via observable transformation parameters are desired for applications such as image synthesis or object pose refinement.
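To make the distinction precise, it can be stated compactly; the notation below is introduced here for illustration and is not taken verbatim from [9]. A feature extractor $\Phi$ is equivariant to a group of transformations $G$ if, for every $g \in G$, there exists a predictable feature-space transformation $M_g$ such that

$$\Phi(g \cdot x) = M_g \, \Phi(x) \quad \text{for all inputs } x,$$

whereas invariance is the special case $M_g = \mathrm{id}$ for all $g$. In our setting, $g$ is a rotation in $SO(3)$ acting on the object view, and $M_g$ is the corresponding transformation of the bottleneck feature map.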
In this work we study the equivariance of feature spaces of Convolutional Neural Networks (CNNs), motivated by the task of object pose refinement. This motivation arises from recent RGB-based object pose refinement methods that use pairs of images [10, 21]: one image represents the observation of the desired object, and the other image is usually a rendering of the object in a hypothesized pose. A network is trained to predict the relative transformation between the input pair. We study how to correlate such an image pair in feature space in order to achieve predictability of the relative object transformations.

Figure 1: Given an object view and a relative 3D rotation, unseen views are synthesized.
The Spatial Transformer Network (STN) [6] provides a means to learn image space transformations, conditioned on the input, to produce a transformed output feature map. Studies such as [2, 13] apply a sub-part of the STN, known as the Spatial Transformer Layer (STL), to properly align the network's output with its input by applying image space translations. The authors of [19] wrap a projection function around the in- and outputs of the STL in order to make image properties such as lighting and SO(3) transformations in a limited range predictable. In contrast to their approach, we directly modify the structure of the STL. We extend the STL to enable trilinear interpolation of a feature map in order to interpret transformations in all of SO(3). In the remainder of the paper it is referred to as the Trilinear Interpolation Layer (TIL).
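To illustrate the operation such a layer performs, the sketch below rotates a normalized 3D sampling grid by R in SO(3) and trilinearly samples a volumetric bottleneck feature map. It is a minimal sketch, assuming the bottleneck is reshaped into an (N, C, D, H, W) volume and using PyTorch's grid_sample, which performs trilinear interpolation on 5D inputs; the function name, shapes, and this implementation route are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def trilinear_interpolation_layer(feat_volume, R):
    """Sample a volumetric feature map under a rotation R in SO(3).

    feat_volume: (N, C, D, H, W) bottleneck features reshaped into a
                 volume (hypothetical layout; the paper's may differ).
    R:           (N, 3, 3) rotation matrices.
    """
    N, C, D, H, W = feat_volume.shape
    device = feat_volume.device
    # Build a normalized 3D grid of sampling coordinates in [-1, 1].
    zs = torch.linspace(-1, 1, D, device=device)
    ys = torch.linspace(-1, 1, H, device=device)
    xs = torch.linspace(-1, 1, W, device=device)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    coords = torch.stack([x, y, z], dim=-1).reshape(-1, 3)  # (D*H*W, 3)
    # Inverse warp: sampling the input at R^T p renders the volume rotated by R.
    coords = coords.unsqueeze(0).expand(N, -1, -1)          # (N, D*H*W, 3)
    rotated = torch.bmm(coords, R)                          # row-wise p @ R = (R^T p)^T
    grid = rotated.reshape(N, D, H, W, 3)
    # grid_sample with a 5D input performs trilinear interpolation.
    return F.grid_sample(feat_volume, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

The design mirrors the STL: a parameterized transformation defines a sampling grid, and differentiable interpolation pulls feature values at the transformed locations, so gradients flow through both the features and the rotation.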