Joint Austrian Computer Vision and Robotics Workshop 2020

Our contributions are:

• We propose a Trilinear Interpolation Layer suited for creating equivariant feature spaces in SO(3).
• We provide quantitative and qualitative evidence for the advantage of equivariant feature spaces by predicting unseen views in SO(3) of objects from the LineMOD dataset [4].

The remainder of the paper is structured as follows. Section 2 reviews related work. In Section 3 we describe our approach. Section 4 presents our experimental results. Finally, Section 5 concludes the paper.

2. Related Work

Object pose refiners rely on the availability of prior stages to produce pose hypotheses [7, 10, 12, 16, 18, 20, 21]. When depth data is available, the Iterative-Closest-Point algorithm (ICP) can be used to refine initial pose estimates [18, 7, 20]. Recent RGB-based approaches do not rely on the availability of depth data for pose refinement [7, 10, 12, 16, 21].

CNN-based object pose refinement architectures such as [10, 12, 21] pass two input images to the network in order to estimate the relative rotation between them. These images are an observation of the object in the desired pose and a rendering of the prediction. In [10] the authors base their network architecture on an approach for optical flow estimation [1] and predict optical flow, mask and relative pose deviation in SE(3). The authors of [21] use a similar approach with two encoders, one per input image. The encoders' outputs are subtracted and further encoded to predict the refined pose in SE(3). We present a concept suitable for enhancing such methods by guiding the network to learn an equivariant feature space.

The STN introduced by [6] is widely used for feature and image space transformation [2, 13, 14, 15, 19]. It consists of the combination of a localization network, a grid generator and a sampler. The authors of [2] apply the STL to properly align the features to their inputs. In [13] the authors predict deep heatmaps from randomly sampled object patches to predict poses under occlusion. They apply the STL to upsample their predictions. In [14, 15] an analog of the localization network is used to produce feature maps invariant to input transformations. The authors of [11] leverage the methodology of the STN to generate realistic-looking images from the intersection of the natural image and geometric manifold, using an adapted Generative Adversarial Network. In contrast to these approaches, we modify the STL component of the STN to enable SO(3) transformations of input feature maps with spatial dimension.

Figure 2: Encoder-decoder architecture for image synthesis.

3. Approach

This section presents our approach for learning equivariant features in SO(3) in order to synthesize images from unseen viewpoints. We first give a problem definition, then describe the Trilinear Interpolation Layer (TIL). Finally, we outline how the TIL is used in an encoder-decoder architecture for image synthesis.

3.1. Problem Statement

Let $X = \{ x_c, (\tilde{x}^0_{\theta_0}, \ldots, \tilde{x}^n_{\theta_i}) \}$ be a set of training examples, where $x_c$ refers to the projection $\Pi$ of object $o_c$, in its canonical pose, to the image space $I$. The $\tilde{x}^n_{\theta_i}$ are the projections of the transformed objects $o_{\theta_i}$, where $\theta_i$ represents the transformation in SO(3) applied before the projection into $I$. Our goal is to learn the inverse of the mapping function, $\Pi^{-1}$, in order to produce transformed images. In other words, we want to learn $\tilde{x}^n_{\theta_i} = \Pi[\Pi^{-1}(x_c), \theta_i]$, given an image of the object in its canonical pose and the transformation parameters.

In order to model the inversion of the mapping function $\Pi$, we utilize a CNN, because CNNs are able to encode statistical relationships from visual data into feature spaces [8]. To provide our model with information regarding the relative transformations $\theta_i$ in SO(3) between pairs of images, we modify the STL of [6]. An overview of the encoder-decoder architecture for image synthesis using the modified STL is presented in Figure 2.

3.2. Trilinear Interpolation

The STL [6] allows SE(2) transformations to be applied to feature maps. This works well in image
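To make the idea of applying an SO(3) transformation to a feature volume concrete, the following is a minimal NumPy sketch. It is not the authors' Trilinear Interpolation Layer: it only rotates a dense 3D volume about its centre and resamples it with trilinear interpolation, in the spirit of the grid generator and sampler of [6]. The function names (rotation_z, trilinear_sample, rotate_volume), the (D, H, W) array layout, the zero padding outside the grid and the choice of the grid centre as the rotation origin are all assumptions made for illustration.

```python
import numpy as np


def rotation_z(angle_rad):
    """Rotation about the third array axis, an element of SO(3) (hypothetical helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def trilinear_sample(volume, coords):
    """Sample a (D, H, W) volume at float coords of shape (N, 3).

    Assumption: positions outside the grid contribute zero.
    """
    shape = np.array(volume.shape)
    base = np.floor(coords).astype(int)   # lower corner of the enclosing cell
    frac = coords - base                  # fractional position inside the cell
    out = np.zeros(len(coords), dtype=float)
    for corner in np.ndindex(2, 2, 2):    # the 8 corners of the cell
        offset = np.array(corner)
        idx = base + offset
        # Per axis: weight is frac for the upper corner, 1 - frac for the lower.
        weight = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
        inside = np.all((idx >= 0) & (idx < shape), axis=1)
        safe = np.clip(idx, 0, shape - 1)  # avoid out-of-range indexing
        values = volume[safe[:, 0], safe[:, 1], safe[:, 2]]
        out += weight * inside * values
    return out


def rotate_volume(volume, rotation):
    """Resample `volume` so that its content is rotated by `rotation`."""
    shape = volume.shape
    centre = (np.array(shape) - 1) / 2.0
    axes = [np.arange(n) for n in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    target = grid.reshape(-1, 3).astype(float)
    # Inverse mapping: each output voxel looks up its source location,
    # R^T (p - centre) + centre, written here for row vectors.
    source = (target - centre) @ rotation + centre
    return trilinear_sample(volume, source).reshape(shape)


if __name__ == "__main__":
    features = np.random.default_rng(0).random((8, 8, 8))
    rotated = rotate_volume(features, rotation_z(np.deg2rad(30.0)))
    print(rotated.shape)
```

The sketch uses inverse mapping (each output voxel queries its source position), so every output location receives a value; a learned layer would apply the same resampling per feature channel and would need to be differentiable with respect to the features.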
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188