Image Synthesis in SO(3) by Learning Equivariant Feature Spaces

Marco Peer, Stefan Thalhammer, and Markus Vincze

Faculty of Electrical Engineering and Information Technology, TU Wien, 1040 Vienna, Austria

marco.peer@tuwien.ac.at, {thalhammer,vincze}@acin.tuwien.ac.at
Abstract. Equivariance is a desired property for feature spaces designed to make transformations between samples, such as object views, predictable. Encoding this property in two-dimensional feature spaces for 3D transformations is beneficial for tasks such as image synthesis and object pose refinement. We propose the Trilinear Interpolation Layer, which applies SO(3) transformations to the bottleneck feature map of an encoder-decoder network. By employing a 3D grid to trilinearly interpolate in the feature map, we create models suited for view synthesis with three degrees of rotational freedom. We evaluate quantitatively and qualitatively on image synthesis in SO(3), providing evidence of the suitability of our approach.
1. Introduction
Invariant feature spaces are agnostic to input transformations in order to help models overcome variations in the data capturing process. Equivariant feature spaces, in contrast, are exploitable with respect to image space transformations and are thus more suited for reasoning about changes in image space [9]. As a consequence, the property of equivariance is desired for feature spaces that are used for predicting transformations of or in the image space. More precisely, equivariant feature spaces can be exploited to predict unseen views based on known transformations, or to estimate relative transformations between two inputs. Feature spaces that correlate an input with a transformed output via observable transformation parameters are desired for applications such as image synthesis or object pose refinement.
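To make the distinction precise, it can be stated compactly; the notation below is introduced here for illustration and is not taken verbatim from [9]. A feature extractor $\Phi$ is equivariant to a group of transformations $G$ if, for every $g \in G$, there exists a predictable feature-space transformation $M_g$ such that

$$\Phi(g \cdot x) = M_g \, \Phi(x) \quad \text{for all inputs } x,$$

whereas invariance is the special case $M_g = \mathrm{id}$ for all $g$. In our setting, $g$ is a rotation in $SO(3)$ acting on the object view, and $M_g$ is the corresponding transformation of the bottleneck feature map.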
In this work we study the equivariance of feature spaces of Convolutional Neural Networks (CNNs), motivated by the task of object pose refinement. This motivation arises from recent RGB-based object pose refinement methods that use pairs of images [10, 21]: one image represents the observation of the desired object, and the other image is usually a rendering of the object in a hypothesized pose. A network is trained to predict the relative transformation between the input pair. We study how to correlate such an image pair in feature space in order to achieve predictability of the relative object transformations.

Figure 1: Given an object view and a relative 3D rotation, unseen views are synthesized.
The Spatial Transformer Network (STN) [6] provides a means to learn image space transformations, conditioned on the input, to produce a transformed output feature map. Studies such as [2, 13] apply a sub-part of the STN, known as the Spatial Transformer Layer (STL), to properly align the network's output with its input by applying image space translations. The authors of [19] wrap a projection function around the in- and outputs of the STL in order to make image properties such as lighting and SO(3) transformations in a limited range predictable. In contrast to their approach, we directly modify the structure of the STL. We extend the STL to enable trilinear interpolation of a feature map in order to interpret transformations in all of SO(3). In the remainder of the paper it is referred to as the Trilinear Interpolation Layer (TIL).
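To illustrate the operation such a layer performs, the sketch below rotates a normalized 3D sampling grid by R in SO(3) and trilinearly samples a volumetric bottleneck feature map. It is a minimal sketch, assuming the bottleneck is reshaped into an (N, C, D, H, W) volume and using PyTorch's grid_sample, which performs trilinear interpolation on 5D inputs; the function name, shapes, and this implementation route are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def trilinear_interpolation_layer(feat_volume, R):
    """Sample a volumetric feature map under a rotation R in SO(3).

    feat_volume: (N, C, D, H, W) bottleneck features reshaped into a
                 volume (hypothetical layout; the paper's may differ).
    R:           (N, 3, 3) rotation matrices.
    """
    N, C, D, H, W = feat_volume.shape
    device = feat_volume.device
    # Build a normalized 3D grid of sampling coordinates in [-1, 1].
    zs = torch.linspace(-1, 1, D, device=device)
    ys = torch.linspace(-1, 1, H, device=device)
    xs = torch.linspace(-1, 1, W, device=device)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    coords = torch.stack([x, y, z], dim=-1).reshape(-1, 3)  # (D*H*W, 3)
    # Inverse warp: sampling the input at R^T p renders the volume rotated by R.
    coords = coords.unsqueeze(0).expand(N, -1, -1)          # (N, D*H*W, 3)
    rotated = torch.bmm(coords, R)                          # row-wise p @ R = (R^T p)^T
    grid = rotated.reshape(N, D, H, W, 3)
    # grid_sample with a 5D input performs trilinear interpolation.
    return F.grid_sample(feat_volume, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

The design mirrors the STL: a parameterized transformation defines a sampling grid, and differentiable interpolation pulls feature values at the transformed locations, so gradients flow through both the features and the rotation.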