Page 112 in Joint Austrian Computer Vision and Robotics Workshop 2020

Figure 4: View synthesis from SO(3) transformations unseen during training time. First row: reconstructed Lamp with varying azimuth from -43° to 43°. Second row: reconstructed Glue with elevation variation from -43° to 43°. Rows three to five: objects Benchvise, Camera and Cat reconstructed with azimuth/elevation range from (-43°, -43°) to (43°, 43°). Object poses outside the green box are samples out of the training distribution. Centered images, in the red box, mark the canonical poses.

[Figure 5 plot: MAE, MSE and DSSIM error curves over angle in degrees; x-axis from 0° to 175°, y-axis from 0.02 to 0.16.]

Figure 5: Error values and their variance over azimuth angle. The network was trained on its corresponding loss function with a spatial bottleneck dimension of 8×8×128. The vertical line shows the training set range.

…some of the synthesized views outside of the training range, it is visible that views can be predicted properly based on SO(3) transformations.

Figure 5 provides the reconstruction error and variance over an extended azimuth and elevation angle range of [0°, 180°]. The results in the figure are averaged over all objects. The training dataset contains images with azimuth angles up to 37°. A sharp rise in error and variance is observed at an azimuth angle of approximately 45°; for angles above this value, error and variance increase rapidly. As such, the network cannot properly reconstruct these views.

These results show that our formulation for creating equivariant feature spaces has the desired property of correlating spatial transformations with 2D views of the transformed object. Thus, the proposed Trilinear interpolation layer guides the network towards learning an equivariant feature space in SO(3).

5. Conclusion

We extend recent work on learning equivariant feature spaces for synthesizing object views in SO(3). The proposed extension of the Spatial Transformer Network [6], which we call the Trilinear interpolation layer, applies SO(3) transformations to feature maps derived from 2D data. The validity of the approach is demonstrated by training a simple encoder-decoder network architecture. Our experiments show that our formulation enables the prediction of views not only unseen during training time but also in a small range outside the training distribution.

The current formulation enables control over 5 DoF: SO(3) rotations and translations in image space. Future work will tackle adapting the proposed layer to create object view synthesis in all of SE(3). We then plan to integrate this into a pose refinement strategy to improve object pose estimation.
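The conclusion above hinges on applying an SO(3) transformation directly to a feature volume via trilinear interpolation. As a minimal sketch (not the authors' implementation), the following Python code shows one way such a layer can be realized with PyTorch's affine_grid and grid_sample, which perform trilinear resampling on 5D inputs; the function names and the 128-channel 8x8x8 volume shape are illustrative assumptions.

    import math
    import torch
    import torch.nn.functional as F

    def rotation_matrix_z(angle_rad: float) -> torch.Tensor:
        """3x3 rotation about the z-axis; any matrix in SO(3) works the same way."""
        c, s = math.cos(angle_rad), math.sin(angle_rad)
        return torch.tensor([[c, -s, 0.0],
                             [s,  c, 0.0],
                             [0.0, 0.0, 1.0]])

    def rotate_feature_volume(volume: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        """Resample an (N, C, D, H, W) feature volume under a rotation R in SO(3).

        affine_grid turns the 3x4 matrix [R | 0] into normalized sampling
        coordinates; grid_sample with mode='bilinear' on a 5D input performs
        trilinear interpolation, keeping the operation differentiable.
        """
        n = volume.shape[0]
        theta = torch.cat([R, torch.zeros(3, 1)], dim=1)   # [R | t] with t = 0
        theta = theta.unsqueeze(0).expand(n, -1, -1)       # one matrix per sample
        grid = F.affine_grid(theta, list(volume.shape), align_corners=False)
        return F.grid_sample(volume, grid, mode='bilinear',
                             padding_mode='zeros', align_corners=False)

    # Example: rotate a hypothetical 128-channel 8x8x8 bottleneck volume by 30 deg.
    feats = torch.randn(1, 128, 8, 8, 8)
    rotated = rotate_feature_volume(feats, rotation_matrix_z(math.pi / 6))
    print(rotated.shape)  # torch.Size([1, 128, 8, 8, 8])

In this sketch the rotation enters only through the sampling grid, so gradients flow through the interpolation weights back to the feature volume, which is what allows an encoder-decoder to learn an equivariant bottleneck.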
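Similarly, Figure 5 reports MAE, MSE and DSSIM between synthesized and ground-truth views. As a reference point, here is a hedged sketch of how such per-view errors could be computed: the helper view_errors is hypothetical, scikit-image (0.19 or newer for the channel_axis argument) is assumed for SSIM, and DSSIM is taken as the common convention (1 - SSIM)/2, which this page does not itself define.

    import numpy as np
    from skimage.metrics import structural_similarity

    def view_errors(pred: np.ndarray, target: np.ndarray) -> dict:
        """MAE, MSE and DSSIM between a synthesized view and its ground truth.

        Expects float images in [0, 1], shaped (H, W) or (H, W, C).
        """
        mae = float(np.abs(pred - target).mean())
        mse = float(((pred - target) ** 2).mean())
        channel_axis = -1 if pred.ndim == 3 else None
        ssim = structural_similarity(pred, target, data_range=1.0,
                                     channel_axis=channel_axis)
        # DSSIM = (1 - SSIM) / 2 is an assumption, not defined on this page.
        return {"MAE": mae, "MSE": mse, "DSSIM": (1.0 - ssim) / 2.0}

    # Example with random stand-ins for a ground-truth view and a noisy rendering.
    rng = np.random.default_rng(0)
    gt = rng.random((64, 64, 3))
    noisy = np.clip(gt + 0.05 * rng.standard_normal(gt.shape), 0.0, 1.0)
    print(view_errors(noisy, gt))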
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik (Computer Science), Technik (Technology)