TABLE I: Pose estimation results with the proposed CNN

              Proposed CNN   Proposed CNN   ICP from
              with ICP       without ICP    Random Pose
Precision     0.956          0.822          0.265
Time (ms)     140±32         129±32         155±33
segmented objects. For the sake of simplicity, in this paper we use a simple dominant-plane segmentation and a nearest-neighbour clustering of 3D points. The pre-processing that provides the input to the CNN is identical to that used for training (cf. III-B). The trained CNN directly estimates the rotation for the input segment. The corresponding tentative translation is computed from the centroids of the reference model and the segmented point cloud. Finally, a pose refinement step is performed. The translational error is dominated by the difference between the centroids of the reference model and the test image: the centroid of the reference model is derived from the whole object shape, while the test image lacks the occluded parts. To minimize this error, self-occluded parts of the reference model are removed after the initial alignment, and the centroid of the reference model is recalculated. As a final step, we apply an iterative closest point (ICP) algorithm.
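To illustrate this refinement step, the following is a minimal sketch in Python using Open3D. The paper does not name its ICP implementation, and the parameter values (occlusion-removal radius, ICP correspondence distance) are assumptions for illustration only.

import numpy as np
import open3d as o3d

def refine_pose(model, segment, R_cnn, camera=np.zeros(3)):
    """Refine a CNN pose estimate as described in the text: tentative
    translation from centroids, self-occlusion removal, then ICP."""
    # 1) Rotate the model by the CNN estimate and translate so that the
    #    model centroid matches the segment centroid (tentative pose).
    aligned = o3d.geometry.PointCloud(model)
    aligned.rotate(R_cnn, center=aligned.get_center())
    aligned.translate(segment.get_center() - aligned.get_center())

    # 2) Drop model points that are self-occluded from the camera view,
    #    then recompute the centroid and correct the translation.
    diameter = np.linalg.norm(aligned.get_max_bound() - aligned.get_min_bound())
    _, visible_idx = aligned.hidden_point_removal(camera, radius=diameter * 100)
    visible = aligned.select_by_index(visible_idx)
    visible.translate(segment.get_center() - visible.get_center())

    # 3) Final rigid ICP between the visible model part and the segment.
    result = o3d.pipelines.registration.registration_icp(
        visible, segment, 0.01, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # residual transform on top of the tentative pose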
IV. EXPERIMENTS
We perform experiments to prove our concept with real
bananas. An artificial 3D CAD model of a banana is selected
and converted into a point cloud, which is then used to generate training images and the corresponding ground-truth poses. Scaling and shear transformations are randomly varied between 0.8 and 1.2 for each of the three directions of the views, which are generated every 5 degrees along each axis. The margin δ for the depth-to-color conversion is set to 0.5.
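A minimal sketch of how such a random deformation could be sampled (our illustration, not the authors' code; we interpret the stated 0.8 to 1.2 range as applying to both the per-axis scale and the shear factors):

import numpy as np

def random_scale_shear(rng, low=0.8, high=1.2):
    """Sample a 3x3 scale-and-shear matrix with factors in [low, high]."""
    A = np.diag(rng.uniform(low, high, size=3))                  # per-axis scaling
    A[0, 1], A[0, 2], A[1, 2] = rng.uniform(low, high, size=3)   # shear terms
    return A

# Deform the reference model once per rendered view.
rng = np.random.default_rng(0)
model = rng.normal(size=(2048, 3))          # stand-in for the banana point cloud
deformed = model @ random_scale_shear(rng).T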
The CNN is implemented with the Caffe framework [11]. We initialize the weights with a network pre-trained on ImageNet data [4]. To decide on positive and negative training pairs, we set the threshold to ρs = 0.2 for positive and ρd = 1.0 for negative examples (see the sketch below). Positive and negative pairs are randomly selected during the first epoch; the set of pairs is then fixed for subsequent iterations to reduce training time. Every input image is resized to 64×64 pixels while keeping the aspect ratio of the rendered view. Test images are captured with an Ensenso N35, an industrial stereo sensor that provides only depth information at a resolution of 640×512. We assume robust segmentation results for the test scenes; therefore, we placed the bananas on the table with enough distance between them so that segments can be robustly extracted after detecting the dominant plane. We prepare five test scenes consisting of multiple bananas and approximately 278 scenes containing a single banana per image, using four different bananas. Estimated poses are evaluated manually.
The criterion for the evaluation is based on the graspability of the detected object, i.e., if the estimated pose is accurate enough to successfully grasp the object, it is counted as positive. All experiments are performed with an Intel i7-6700K CPU and an NVIDIA GTX 1080 GPU, the latter used to train the CNN.

Fig. 3: Visualization of the estimated poses of multiple bananas. Red: real bananas in the test scene; yellow: estimation results.

Fig. 4: Example of a bad alignment after ICP. In this example, ICP converged to match only an edge part of the banana.
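As referenced above, a minimal sketch (not the authors' code) of how training pairs could be labeled with ρs and ρd, assuming the similarity between two views is measured as the geodesic distance between their ground-truth rotations:

import numpy as np

RHO_S = 0.2   # pairs closer than this are positives
RHO_D = 1.0   # pairs farther than this are negatives

def rotation_distance(R1, R2):
    """Geodesic distance (radians) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def label_pair(R1, R2):
    """Return +1 (positive), -1 (negative), or 0 (ignored) for a pair."""
    d = rotation_distance(R1, R2)
    if d < RHO_S:
        return 1
    if d > RHO_D:
        return -1
    return 0  # ambiguous pairs are not used for training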
A. Results for bananas
Fig. 3 shows the results for the test scenes containing multiple bananas. As shown in Table I, the overall precision after pose refinement is about 95.6%, and the computation time is about 0.14 seconds per segment, which is well suited for robot grasping tasks.
B. Side effects of the refinement step using ICP
ICP generally improves the results. However, it sometimes causes a worse alignment, as shown in Fig. 4. This is caused by the shape difference between the reference model and the target scene. The standard ICP we use assumes a rigid transformation between the reference model and the target. Hence, depending on the inlier threshold, ICP may converge to a partial fit to the scene, while the remaining point cloud does not contribute.
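A pragmatic guard, which goes beyond what the paper describes, is to check the fitness of the ICP result and keep the CNN pose when too few points are matched; a sketch with Open3D, whose use here is an assumption:

import numpy as np
import open3d as o3d

def guarded_icp(model, segment, init=np.eye(4), dist=0.005, min_fitness=0.6):
    """Run rigid ICP but reject results that explain only a small part of
    the segment, as in the degenerate edge alignment of Fig. 4."""
    result = o3d.pipelines.registration.registration_icp(
        model, segment, dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    # result.fitness is the fraction of points with an inlier correspondence;
    # min_fitness = 0.6 is an arbitrary illustrative threshold
    if result.fitness < min_fitness:
        return init  # fall back to the initial (CNN) pose
    return result.transformation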
V. CONCLUSIONS
In this paper, we proved the concept of estimating poses
of objects with a high shape variance using a deep CNN
estimator. Furthermore, the proposed framework is able to
use any kind of artificial or real scanned 3D model in order
to generate enough data for training the deep CNN. This ongoing research will be further improved with the following ideas:
• A general rigid-transformation ICP is not sufficient to refine the pose because of the shape difference between the reference model and the individual objects. We refer to