Page 100 in Proceedings OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics"
voxels. A kd-tree would allow for exact nearest neighbor queries, but would have O(m · log(n)) complexity. The speedup achieved by the O(m) verification allows us to evaluate more candidate transformations in the same time, boosting accuracy.
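The O(m) verification can be sketched as follows: scene points are hashed into a sparse voxel grid, so checking each of the m transformed model points is a constant-time lookup. This is an illustrative sketch under our own assumptions (function names, dict-of-cells grid), not the paper's implementation:

```python
import numpy as np

def build_voxel_grid(scene_points, voxel_size):
    """Hash scene points into a sparse voxel grid (dict keyed by cell index)."""
    grid = {}
    for p in scene_points:
        key = tuple(np.floor(p / voxel_size).astype(int))
        grid.setdefault(key, []).append(p)
    return grid

def count_inliers(model_points, pose_R, pose_t, grid, voxel_size):
    """O(m) verification: transform each of the m model points and check
    whether an occupied voxel exists at its location."""
    inliers = 0
    for p in model_points:
        q = pose_R @ p + pose_t
        key = tuple(np.floor(q / voxel_size).astype(int))
        if key in grid:  # constant-time hash lookup per model point
            inliers += 1
    return inliers
```

A kd-tree query would replace the dict lookup with an O(log n) search per point, which is exactly the overhead the voxel grid avoids.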
Filtering Candidate Solutions. After the random sampling and matching process is over, the candidate solutions are filtered, since it is likely that multiple similar solutions have been found. In [3], a pose clustering approach is used: multiple similar poses are combined to find an average pose from the candidate solutions. This approach falls short for symmetric objects. For example, a sphere whose reference frame is off center will result in many different potential poses, each with a completely different translation and rotation. In RANGO, we have therefore replaced the clustering approach with a filtering approach. All candidate solutions are sorted by their number of inliers, highest first. Each solution is then checked in turn whether it still meets a given inlier threshold; if it does, all scene voxels that were used in this inlier check are removed from the 3D voxel grid. This way, only the best-fitting candidate solution for a potential pose is kept, while it is still possible to find multiple instances of the same object in the scene data. This approach works well for both complex and symmetric objects.
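The greedy filtering step can be sketched as follows. The candidate representation and names are our own simplified assumptions (each candidate stores its sampled inlier count and its transformed model points); the key idea from the text is that accepted solutions consume their scene voxels, so weaker duplicates of the same pose fail the re-check:

```python
import numpy as np

def filter_candidates(candidates, grid, voxel_size, inlier_threshold):
    """Greedy filtering of candidate poses.

    candidates: list of (inlier_count, transformed_model_points) tuples.
    grid: dict keyed by voxel cell index (the scene's 3D voxel grid).
    Returns the accepted solutions, best-fitting first.
    """
    accepted = []
    # Sort by inlier count, highest number first.
    for _, points in sorted(candidates, key=lambda c: c[0], reverse=True):
        keys = {tuple(np.floor(p / voxel_size).astype(int)) for p in points}
        # Re-check against the current (possibly shrunken) grid.
        inliers = sum(1 for k in keys if k in grid)
        if inliers >= inlier_threshold:
            accepted.append(points)
            for k in keys:  # remove the scene voxels this solution explains
                grid.pop(k, None)
    return accepted
```

Because the voxels of an accepted solution are removed, a second, slightly different pose of the same object instance no longer reaches the threshold, while a genuinely separate instance elsewhere in the scene still does.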
3.2. Multi-Forest Tracking
Our multi-forest tracking approach is a variation of the multi-forest tracking algorithm described in [18]. Our modification to this algorithm retains the performance characteristics of [18] while having a significantly lower memory and training overhead, which allows the algorithm to be used on devices with limited computational power such as a tablet PC. It is noteworthy that we only use depth data for both tracking and object localization. The reason is that our main goal is to robustly track industrial parts, and these usually do not carry much color information.
Single-View Tracking. The multi-forest tracker described in [18] uses 6 · n_c · n_t random forests, where 6 is the number of dimensions needed to represent a pose, n_c is the number of camera positions, and n_t is the number of trees in each forest. For each camera position, sample points of the object are extracted and used to train 6 random regression forests for tracking of this camera view. An algorithm switches between the camera views that are currently best visible. They parameterize this as n_c = 42 and n_t = 100, resulting in 25,200 random trees. Each tree is generated from a test set of 50,000 samples, resulting in a significant training effort. In our approach we have reduced this effort significantly to only 6 · n_t trees. After the samples have been generated, each random forest is trained with all samples for a single dimension of the pose vector. That is, random forests 1 to 3 are trained on the changes in translation (x, y, and z), and random forests 4 to 6 on the changes in rotation (roll, pitch, and yaw), respectively. During tracking, the observed depth changes are used to predict the change in the pose vector by simply combining the predictions of the six random forests.
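The per-dimension training scheme can be sketched with scikit-learn regression forests standing in for the paper's implementation. The synthetic data, the number of sample points, and all names are placeholder assumptions; only the structure (six forests, each trained on all samples but regressing one pose dimension) follows the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in training data: each row holds the depth changes observed
# at the object's sample points, paired with the 6-D pose delta that caused them.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))   # 40 hypothetical sample-point depth changes
Y = rng.normal(size=(500, 6))    # pose deltas: x, y, z, roll, pitch, yaw

# One forest per pose dimension; every forest sees ALL samples but is
# trained to regress only a single component of the pose vector.
forests = []
for dim in range(6):
    f = RandomForestRegressor(n_estimators=70, random_state=0)  # n_t = 70
    f.fit(X, Y[:, dim])
    forests.append(f)

def predict_pose_delta(depth_changes):
    """Combine the six scalar predictions into one 6-D pose update."""
    x = np.asarray(depth_changes).reshape(1, -1)
    return np.array([f.predict(x)[0] for f in forests])
```

During tracking, each frame's depth differences at the sample points would be fed to `predict_pose_delta` and the result applied as an incremental pose update.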
In practice, we set n_t = 70, resulting in only 420 random trees, which in turn leads to a 60 times faster training time and 60 times lower memory requirements. It typically takes about 3 minutes on an Intel(R) Core(TM) i5-3570 CPU (the workstation on which the proposed approach is implemented and tested) to train a new object for tracking. The low memory requirement also allows the tracking to run in real time.
We initially sample a set of approximately 400 points from the surface of the object. Sampling is done by raytracing points onto the object's surface, creating a 3D grid around the object, and then sampling a single point per grid cell. This sampling approach leads to evenly spaced sample points all over the visible surface of the object. The depth distance of these sample points to the visible depth map is then used in the training data. Since we have sampled points from all around the object, many points will not lie on the visible surface but will be behind it. We rely on the random