Page - 109 - in Joint Austrian Computer Vision and Robotics Workshop 2020

Our contributions are:
• We propose a Trilinear Interpolation Layer suited for creating equivariant feature spaces in SO(3).
• We provide quantitative and qualitative evidence for the advantage of equivariant feature spaces by predicting unseen views in SO(3) of objects from the LineMOD dataset [4].

The remainder of the paper is structured as follows. Section 2 reviews related work. In Section 3 we describe our approach. Section 4 presents our experimental results. Finally, Section 5 concludes the paper.

2. Related Work

Object pose refiners rely on the availability of prior stages to produce pose hypotheses [7, 10, 12, 16, 18, 20, 21]. When depth data is available, the Iterative Closest Point (ICP) algorithm can be used to refine initial pose estimates [18, 7, 20]. Recent RGB-based approaches do not rely on the availability of depth data for pose refinement [7, 10, 12, 16, 21]. CNN-based object pose refinement architectures such as [10, 12, 21] pass two input images to the network in order to estimate the relative rotation between them. These images are an observation of the object in the desired pose and a rendering of the prediction. In [10] the authors base their network architecture on an approach for optical flow estimation [1] and predict optical flow, mask and relative pose deviation in SE(3). The authors of [21] use a similar approach with two encoders, one per input image. The encoders' outputs are subtracted and further encoded to predict the refined pose in SE(3). We present a concept suitable for enhancing such methods by guiding the network to learn an equivariant feature space.

The STN introduced by [6] is widely used for feature and image space transformation [2, 13, 14, 15, 19]. It consists of the combination of a localization network, a grid generator and a sampler. The authors of [2] apply the STL to properly align the features to their inputs. In [13] the authors predict deep heatmaps from randomly sampled object patches to predict poses under occlusion.
They apply the STL to upsample their predictions. In [14, 15] an analog of the localization network is used to produce feature maps invariant to input transformations. The authors of [11] leverage the methodology of the STN to generate realistic-looking images from the intersection of the natural image and geometric manifold, using an adapted Generative Adversarial Network. In contrast to these approaches, we modify the STL component of the STN to enable SO(3) transformations of input feature maps with spatial dimension.

Figure 2: Encoder-decoder architecture for image synthesis.

3. Approach

This section presents our approach for learning equivariant features in SO(3) in order to synthesize images from unseen viewpoints. We first give a problem definition, then describe the Trilinear Interpolation Layer (TIL). Finally, we outline how the TIL is used in an encoder-decoder architecture for image synthesis.

3.1. Problem Statement

Let $X = \{x_c, (\tilde{x}^0_{\theta_0}, \ldots, \tilde{x}^n_{\theta_i})\}$ be a set of training examples, where $x_c$ refers to the projection $\Pi$ of object $o_c$, in its canonical pose, to the image space $I$. The $\tilde{x}^n_{\theta_i}$ are the projections of transformed objects $o_{\theta_i}$, where $\theta_i$ represents the transformation in SO(3) for the projection into $I$. Our goal is to learn the inverse $\Pi^{-1}$ of the mapping function in order to produce transformed images. In other words, we aim to learn $\tilde{x}^n_{\theta_i} = \Pi[\Pi^{-1}(x_c), \theta_i]$ given an image of the object in its canonical pose and transformation parameters.

In order to model the inversion of the mapping function $\Pi$, we utilize a CNN because of the power of CNNs to encode statistical relationships from visual data into feature spaces [8]. To provide information regarding relative transformations $\theta_i$ in SO(3) between pairs of images to our model, we modify the STL of [6]. An overview of the encoder-decoder architecture for image synthesis using the modified STL is presented in Figure 2.

3.2. Trilinear Interpolation

The STL [6] allows SE(2) transformations to be applied to feature maps. This works well in image
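
To make the SE(2) case concrete, the following is a minimal sketch of the grid generator and sampler stages of a spatial transformer in the spirit of [6], applied to a single-channel feature map. It is an illustration under stated assumptions, not the authors' implementation and not the Trilinear Interpolation Layer: the function names (se2_grid, bilinear_sample, se2_transform), the normalized [-1, 1] coordinate convention, the zero padding outside the map, and the use of plain Python lists are all choices made here for clarity.

```python
import math


def se2_grid(height, width, angle, tx=0.0, ty=0.0):
    """Grid generator (hypothetical helper).

    For every output pixel, compute the source location to sample from,
    in coordinates normalized to [-1, 1]. The rotation is about the centre
    of the feature map. Assumes height >= 2 and width >= 2.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    grid = []
    for i in range(height):
        y = 2.0 * i / (height - 1) - 1.0
        row = []
        for j in range(width):
            x = 2.0 * j / (width - 1) - 1.0
            # SE(2): rotation followed by translation.
            xs = cos_a * x - sin_a * y + tx
            ys = sin_a * x + cos_a * y + ty
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(feature, grid):
    """Sampler (hypothetical helper): bilinear interpolation, zero padding."""
    height, width = len(feature), len(feature[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Map normalized coordinates back to pixel coordinates.
            px = (xs + 1.0) * (width - 1) / 2.0
            py = (ys + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            # Accumulate the four neighbours, weighted by proximity.
            for yi in (y0, y0 + 1):
                for xi in (x0, x0 + 1):
                    if 0 <= xi < width and 0 <= yi < height:
                        w = (1.0 - abs(px - xi)) * (1.0 - abs(py - yi))
                        value += w * feature[yi][xi]
            out_row.append(value)
        out.append(out_row)
    return out


def se2_transform(feature, angle, tx=0.0, ty=0.0):
    """Apply an SE(2) transformation to a 2D feature map."""
    grid = se2_grid(len(feature), len(feature[0]), angle, tx, ty)
    return bilinear_sample(feature, grid)


feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
rotated = se2_transform(feature, math.pi / 2)  # quarter turn about the centre
shifted = se2_transform(feature, 0.0, tx=1.0)  # one-pixel shift on a 3x3 map
```

In a full spatial transformer the transformation parameters would come from a localization network; here they are passed in directly, which matches the setting of Section 3.1 where the transformation parameters are given. The sketch only rotates within the image plane, so it cannot represent out-of-plane rotations in SO(3).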
