Figure 3: Trilinear Interpolation Layer (TIL) components. (a) Three-dimensional grid. The feature map is centered in D on initialization. (b) Interpolation scheme of P using adjacent grid cell values.
space, however, requires adaptation for the SO(3) domain [19]. The STL is composed of a grid generator and a sampler.
The grid generator is modified by adding a depth dimension D. The input feature map of space R^{H×W×C} thus becomes R^{H×W×D×C}, where H, W and C are the height, width and number of channels, respectively. The feature map is centered along D as shown in Figure 3a. The sampler of the STL bilinearly interpolates between corner points using the corresponding areas. For volumes, this scheme is unsuitable. Therefore, trilinear interpolation is used instead, as shown in Figure 3b. Feature maps are interpolated channel-wise and projected back to 2D by averaging along the depth dimension. In order to guarantee proper interpolation in 3D, H and W must be greater than 1. The proposed modification enables transformations in SO(3) and only affects non-trainable layers. Thus, the additional computational overhead compared to the STL is negligible.
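A minimal sketch of how such a trilinear sampling and depth-averaging step could be implemented is shown below, here in PyTorch. The grid construction and the centering of the feature map at the middle depth slice are assumptions, and F.grid_sample on a 5D input serves as a stand-in trilinear sampler rather than the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def til(features_2d, rotation, depth=8):
    """Hedged sketch of the Trilinear Interpolation Layer (TIL).

    features_2d: (N, C, H, W) encoder output; rotation: (N, 3, 3) in SO(3).
    The feature map is lifted into a depth dimension D, a rotated 3D grid is
    sampled trilinearly channel-wise, and the volume is projected back to 2D
    by averaging over depth.
    """
    n, c, h, w = features_2d.shape
    # Center the feature map along D: middle slice holds the features, rest is zero.
    volume = features_2d.new_zeros(n, c, depth, h, w)
    volume[:, :, depth // 2] = features_2d

    # Regular 3D grid in [-1, 1]^3, rotated by the SO(3) parameters.
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, depth),
        torch.linspace(-1, 1, h),
        torch.linspace(-1, 1, w),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, zs], dim=-1).reshape(1, -1, 3)   # (1, D*H*W, 3)
    grid = grid.expand(n, -1, -1) @ rotation.transpose(1, 2)     # rotate grid points
    grid = grid.reshape(n, depth, h, w, 3)

    # grid_sample on a 5D input interpolates trilinearly; then project to 2D.
    sampled = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
    return sampled.mean(dim=2)                                   # (N, C, H, W)
```

Note that, consistent with the requirement above, H and W must be greater than 1 for the interpolation to be well defined.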
Since averaging over D is used for projecting the grid back to 2D, no feature map scaling can be applied while sampling. Modifying the trilinear interpolation to allow scaling along depth would enable transformations in SE(3), thus yielding full 6DoF. However, this is out of the scope of this paper.
3.3. Network Architecture
The network in Figure 2 is an encoder-decoder architecture. The encoder consists of a truncated ResNet18 [3], pretrained on ImageNet [17], for feature encoding. ResNet18 consists of five stages. In order to preserve a larger spatial image dimension, we remove the fourth and the fifth stage and take the outputs of the last Rectified Linear Unit (ReLU) of stage three. The final output is a tensor of size 8×8 with 128 feature maps.
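Such a truncation can be sketched with torchvision's ResNet18, assuming the standard stage numbering (conv stem, then the four residual layers); for a 64×64 input this reproduces the stated 8×8×128 output.

```python
import torch
import torch.nn as nn
from torchvision import models

# Truncated ResNet18 encoder: keep the conv stem (stage 1) plus the first
# two residual layers (stages 2 and 3); drop stages 4 and 5.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,  # 64x64 -> 16x16
    backbone.layer1,                                                # 16x16, 64 channels
    backbone.layer2,                                                # 8x8, 128 channels
)

print(encoder(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 128, 8, 8])
```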
The encoded image Π^{-1}(x_c) as well as the transformation parameters θ_i are passed to the TIL. Feature maps are trilinearly interpolated to produce the mapping of the encoded transformed image x̃^n_{θ_i}. The transformed encoding is forwarded to the decoder stage of the network.
The design of the decoder is rather ad-hoc to show that the TIL is not restricted to a certain architecture. A transposed convolution with ReLU activation is followed by stacks of deconvolution layers with ReLU activation and upsampling layers. These stacks are repeated twice and a final transposed convolution layer with linear activation is added. Feature channels are reduced gradually. Kernel sizes of the transposed convolutional layers are 5-3-3-5 and upsampling kernel sizes of 3×3 are used. All strides are set to 1. The output of the decoder is an image of size 64×64.
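One plausible reconstruction of this decoder is sketched below; the channel counts and the ×2 upsampling factor are assumptions, while the kernel sizes 5-3-3-5, the unit strides, and the 8×8 to 64×64 spatial path (8→12→14→28→30→60→64 with unpadded transposed convolutions) follow from the description above.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=5),  # 8x8   -> 12x12
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=3),   # 12x12 -> 14x14
    nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # 14x14 -> 28x28
    nn.ConvTranspose2d(32, 16, kernel_size=3),   # 28x28 -> 30x30
    nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # 30x30 -> 60x60
    nn.ConvTranspose2d(16, 3, kernel_size=5),    # 60x60 -> 64x64, linear activation
)

print(decoder(torch.randn(1, 128, 8, 8)).shape)  # torch.Size([1, 3, 64, 64])
```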
In each training iteration, the deviation of x̃_θ from x_θ is minimized. The loss function to be optimized is l2. The network is trained to correlate object views with their corresponding transformations in SO(3) in the camera frame. Consequently, a feature space is created that enables the synthesis of views not included in the training set.
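An illustrative training step, reusing the hypothetical encoder, til, and decoder sketches above (all tensors here are dummy data, and the variable names are not from the paper's code):

```python
import torch
import torch.nn.functional as F

# x_c is the input view, x_theta the target view under rotation R.
x_c, x_theta = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
R = torch.eye(3).expand(2, 3, 3)

x_tilde = decoder(til(encoder(x_c), R))  # reconstructed view x~_theta
loss = F.mse_loss(x_tilde, x_theta)      # l2 deviation of x~_theta from x_theta
loss.backward()
```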
4. Experiments
This section presents experiments for image synthesis of unseen views of household objects with little texture. These experiments show that the extension of an encoder-decoder network with the proposed TIL reconstructs object views in SO(3). In addition, the method can also reconstruct views in regions of SO(3) where no data was provided to the network during training.
4.1. Dataset
Our experiments are conducted on a subset of the LineMOD dataset [4]. We use the object models of Benchvise, Cat, Glue, Camera and Lamp. These objects represent elongated and asymmetric shapes as well as complex shapes with self-occlusion. With this subset we cover the representative challenges when synthesizing views for objects.
4.2. Dataset Creation
Dataset images are rendered using the renderer provided by [5]. For our purposes, the RGB images are scaled to 64×64 pixels. To each object's canonical pose, 45° are added to the elevation in order to train only on views of the upper hemisphere of the object.
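The elevation offset can be expressed as a fixed rotation composed with the canonical pose; the sketch below assumes rotation about the camera x-axis, since the paper does not specify the axis convention.

```python
import numpy as np

elev = np.deg2rad(45.0)
R_elev = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(elev), -np.sin(elev)],
    [0.0, np.sin(elev),  np.cos(elev)],
])
R_canonical = np.eye(3)         # placeholder canonical orientation
R_train = R_elev @ R_canonical  # training views stay on the upper hemisphere
```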