Figure 3: Trilinear Interpolation Layer (TIL) components. (a) Three-dimensional grid. The feature map is centered in D on initialization. (b) Interpolation scheme of P using adjacent grid cell values.
space, however, requires adaptation for the SO(3) domain [19]. The STL is composed of a grid generator and a sampler.
The grid generator is modified by adding a depth dimension D. The input feature map of space R^{H×W×C} thus becomes R^{H×W×D×C}, where H, W and C are the height, width and number of channels, respectively. The feature map is centered along D as shown in Figure 3a. The sampler of the STL bilinearly interpolates between corner points using the corresponding areas. For volumes, this scheme is unsuitable. Therefore, trilinear interpolation is used instead, as shown in Figure 3b. Feature maps are interpolated channel-wise and projected back to 2D by averaging along the depth dimension. In order to guarantee proper interpolation in 3D, H and W must be greater than 1. The proposed modification enables transformations in SO(3) and only affects non-trainable layers. Thus, the additional computational overhead compared to the STL is negligible.
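A minimal sketch of how such a trilinear sampling and depth-averaging step could be implemented is shown below, here in PyTorch. The grid construction and the centering of the feature map at the middle depth slice are assumptions, and F.grid_sample on a 5D input serves as a stand-in trilinear sampler rather than the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def til(features_2d, rotation, depth=8):
    """Hedged sketch of the Trilinear Interpolation Layer (TIL).

    features_2d: (N, C, H, W) encoder output; rotation: (N, 3, 3) in SO(3).
    The feature map is lifted into a depth dimension D, a rotated 3D grid is
    sampled trilinearly channel-wise, and the volume is projected back to 2D
    by averaging over depth.
    """
    n, c, h, w = features_2d.shape
    # Center the feature map along D: middle slice holds the features, rest is zero.
    volume = features_2d.new_zeros(n, c, depth, h, w)
    volume[:, :, depth // 2] = features_2d

    # Regular 3D grid in [-1, 1]^3, rotated by the SO(3) parameters.
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, depth),
        torch.linspace(-1, 1, h),
        torch.linspace(-1, 1, w),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, zs], dim=-1).reshape(1, -1, 3)   # (1, D*H*W, 3)
    grid = grid.expand(n, -1, -1) @ rotation.transpose(1, 2)     # rotate grid points
    grid = grid.reshape(n, depth, h, w, 3)

    # grid_sample on a 5D input interpolates trilinearly; then project to 2D.
    sampled = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
    return sampled.mean(dim=2)                                   # (N, C, H, W)
```

Note that, consistent with the requirement above, H and W must be greater than 1 for the interpolation to be well defined.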
Since averaging over D is used for projecting the grid back to 2D, no feature map scaling can be applied while sampling. Modifying the trilinear interpolation to allow scaling along depth would enable transformations in SE(3), thus yielding full 6DoF. However, this is out of the scope of this paper.
3.3. Network Architecture
The network in Figure 2 is an encoder-decoder architecture. The encoder consists of a truncated ResNet18 [3], pretrained on ImageNet [17], for feature encoding. ResNet18 consists of five stages. In order to preserve a larger spatial image dimension, we remove the fourth and the fifth stage and take the outputs of the last Rectified Linear Unit (ReLU) of stage three. The final output is a tensor of size 8×8 with 128 feature maps.
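Such a truncation can be sketched with torchvision's ResNet18, assuming the standard stage numbering (conv stem, then the four residual layers); for a 64×64 input this reproduces the stated 8×8×128 output.

```python
import torch
import torch.nn as nn
from torchvision import models

# Truncated ResNet18 encoder: keep the conv stem (stage 1) plus the first
# two residual layers (stages 2 and 3); drop stages 4 and 5.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,  # 64x64 -> 16x16
    backbone.layer1,                                                # 16x16, 64 channels
    backbone.layer2,                                                # 8x8, 128 channels
)

print(encoder(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 128, 8, 8])
```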
The encoded image Π^{-1}(x_c) as well as the transformation parameters θ_i are passed to the TIL. Feature maps are trilinearly interpolated to produce the mapping of the encoded transformed image x̃^n_{θ_i}. The transformed encoding is forwarded to the decoder stage of the network.
The design of the decoder is rather ad-hoc to show that the TIL is not restricted to a certain architecture. A transposed convolution with ReLU activation is followed by stacks of deconvolution layers with ReLU activation and upsampling layers. These stacks are repeated twice and a final transposed convolution layer with linear activation is added. Feature channels are reduced gradually. Kernel sizes of the transposed convolutional layers are 5-3-3-5 and upsampling kernel sizes of 3×3 are used. All strides are set to 1. The output of the decoder is an image of size 64×64.
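One plausible reconstruction of this decoder is sketched below; the channel counts and the ×2 upsampling factor are assumptions, while the kernel sizes 5-3-3-5, the unit strides, and the 8×8 to 64×64 spatial path (8→12→14→28→30→60→64 with unpadded transposed convolutions) follow from the description above.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=5),  # 8x8   -> 12x12
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=3),   # 12x12 -> 14x14
    nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # 14x14 -> 28x28
    nn.ConvTranspose2d(32, 16, kernel_size=3),   # 28x28 -> 30x30
    nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # 30x30 -> 60x60
    nn.ConvTranspose2d(16, 3, kernel_size=5),    # 60x60 -> 64x64, linear activation
)

print(decoder(torch.randn(1, 128, 8, 8)).shape)  # torch.Size([1, 3, 64, 64])
```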
In each training iteration, the deviation of x̃_θ from x_θ is minimized. The loss function to be optimized is l2. The network is trained to correlate object views with their corresponding transformations in SO(3) in the camera frame. Consequently, a feature space is created that enables the synthesis of views not included in the training set.
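An illustrative training step, reusing the hypothetical encoder, til, and decoder sketches above (all tensors here are dummy data, and the variable names are not from the paper's code):

```python
import torch
import torch.nn.functional as F

# x_c is the input view, x_theta the target view under rotation R.
x_c, x_theta = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
R = torch.eye(3).expand(2, 3, 3)

x_tilde = decoder(til(encoder(x_c), R))  # reconstructed view x~_theta
loss = F.mse_loss(x_tilde, x_theta)      # l2 deviation of x~_theta from x_theta
loss.backward()
```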
4. Experiments
This section presents experiments for image synthesis of unseen views of household objects with little texture. These experiments show that the extension of an encoder-decoder network with the proposed TIL reconstructs object views in SO(3). In addition, the method can also reconstruct views in regions of SO(3) where no data was provided to the network during training.
4.1. Dataset
Our experiments are conducted on a subset of the LineMOD dataset [4]. We use the object models of Benchvise, Cat, Glue, Camera and Lamp. These objects represent elongated and asymmetric shapes as well as complex shapes with self-occlusion. With this subset we cover the representative challenges when synthesizing views for objects.
4.2. Dataset Creation
Dataset images are rendered using the renderer provided by [5]. For our purposes, the RGB images are scaled to 64×64 pixels. To each object's canonical pose, 45° are added to the elevation in order to train only on views of the upper hemisphere of the object.
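The elevation offset can be expressed as a fixed rotation composed with the canonical pose; the sketch below assumes rotation about the camera x-axis, since the paper does not specify the axis convention.

```python
import numpy as np

elev = np.deg2rad(45.0)
R_elev = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(elev), -np.sin(elev)],
    [0.0, np.sin(elev),  np.cos(elev)],
])
R_canonical = np.eye(3)         # placeholder canonical orientation
R_train = R_elev @ R_canonical  # training views stay on the upper hemisphere
```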