Page - 111 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
Image of the Page - 111 -
Text of the Page - 111 -
input v3 v2 v1 v0
shuffle v2 v3 v0 v1
max max32 max32 max10 max10
shuffle max10 max10 max32 max32
max max3210 max3210 max3210 max3210
Table2. Getting minimum/maximum ofa vectorregister holding4 elements.
PL
O P2 P3 P4
X1X2X3X4
P1
Figure2. Dependencies forcomputingnew values for an integral imageswith vector units.
instruction has the same indexing capabilities but operates on a doubled data amount. The AVX
instruction simply takes the immediate value and applies the shuffle for each lane. Only a small
number of special instructions allow crosslane data exchange in some restricted ways [17, Volume 1
Chapter14].
3.1. Vectorization
Taking intoaccount thepointsmentionedabove,wewill nowdevelop avectorization scheme. Using
vector units for the comparisons is straight forward. Vector units will help us to comparenelements
atonce. Finally,wehave toget themaximumandminimumof thevector itself,whichresults inover-
head because of operating horizontally. To be precise, additional comparisons of lognare necessary
and the same amount of data permutations. The basic idea is to compare pairs of values and then
use the result again for pairwise comparisons but with half the number of pairs. Vector units permit
comparing several pairs with a single instruction at the same time. Data shuffling assures we are
comparingdifferentpairs in thenext step. Table2 illustrates theprocedure foravectorunitholding4
elements.
More complicated is the vectorization of computing the integral image, which can be interpreted
as a 2D version of the prefix sum. [18] provides a good summary of prefix sums in general, their
applications and a parallel version. The proposed parallelization model is well suited for using GPU
acceleration. This was proven by [19]. On the other hand, the GPU version turned out to be only
useful for large dataset. Moreover, this was tested for traditional prefix sums and not for 2D versions
suitable for integral images which would consist of two passes: one prefix sum over the rows, a
second one over the columns, taking the result of the first pass as the input. A similar two-pass
algorithm for integral images using GPU was developed and tested by [20]. A notable speedup was
not gained for images with a pixel count less than 0.5 million. Furthermore, the data transfer to
and from the GPU was not included in the time measurement. Using SIMD extension for integral
image computation is a quite new approach, leading to the fact that there are few literature reference
that use SIMD extensions. [16] applies SSE for a part of the computation algorithm. Nonetheless,
finally, this version is slower than a sequential algorithm presented in the same work. We will show
an implementation consisting of a single pass instead of two traditional prefix sum passes. Figure 2
shows the dependencies for computing the new pixels X1 .. X4 in the integral image. All new pixels
111
Proceedings
OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
- Title
- Proceedings
- Subtitle
- OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
- Authors
- Peter M. Roth
- Kurt Niel
- Publisher
- Verlag der Technischen Universität Graz
- Location
- Wels
- Date
- 2017
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-527-0
- Size
- 21.0 x 29.7 cm
- Pages
- 248
- Keywords
- Tagungsband
- Categories
- International
- Tagungsbände