Page - 111 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Image of the Page - 111 -

Text of the Page - 111 -

input v3 v2 v1 v0 shuffle v2 v3 v0 v1 max max32 max32 max10 max10 shuffle max10 max10 max32 max32 max max3210 max3210 max3210 max3210 Table2. Getting minimum/maximum ofa vectorregister holding4 elements. PL O P2 P3 P4 X1X2X3X4 P1 Figure2. Dependencies forcomputingnew values for an integral imageswith vector units. instruction has the same indexing capabilities but operates on a doubled data amount. The AVX instruction simply takes the immediate value and applies the shuffle for each lane. Only a small number of special instructions allow crosslane data exchange in some restricted ways [17, Volume 1 Chapter14]. 3.1. Vectorization Taking intoaccount thepointsmentionedabove,wewill nowdevelop avectorization scheme. Using vector units for the comparisons is straight forward. Vector units will help us to comparenelements atonce. Finally,wehave toget themaximumandminimumof thevector itself,whichresults inover- head because of operating horizontally. To be precise, additional comparisons of lognare necessary and the same amount of data permutations. The basic idea is to compare pairs of values and then use the result again for pairwise comparisons but with half the number of pairs. Vector units permit comparing several pairs with a single instruction at the same time. Data shuffling assures we are comparingdifferentpairs in thenext step. Table2 illustrates theprocedure foravectorunitholding4 elements. More complicated is the vectorization of computing the integral image, which can be interpreted as a 2D version of the prefix sum. [18] provides a good summary of prefix sums in general, their applications and a parallel version. The proposed parallelization model is well suited for using GPU acceleration. This was proven by [19]. On the other hand, the GPU version turned out to be only useful for large dataset. Moreover, this was tested for traditional prefix sums and not for 2D versions suitable for integral images which would consist of two passes: one prefix sum over the rows, a second one over the columns, taking the result of the first pass as the input. A similar two-pass algorithm for integral images using GPU was developed and tested by [20]. A notable speedup was not gained for images with a pixel count less than 0.5 million. Furthermore, the data transfer to and from the GPU was not included in the time measurement. Using SIMD extension for integral image computation is a quite new approach, leading to the fact that there are few literature reference that use SIMD extensions. [16] applies SSE for a part of the computation algorithm. Nonetheless, finally, this version is slower than a sequential algorithm presented in the same work. We will show an implementation consisting of a single pass instead of two traditional prefix sum passes. Figure 2 shows the dependencies for computing the new pixels X1 .. X4 in the integral image. All new pixels 111

back to the book Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“"

Proceedings OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Title: Proceedings
Subtitle: OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
Authors: Peter M. Roth; Kurt Niel
Publisher: Verlag der Technischen Universität Graz
Location: Wels
Date: 2017
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-527-0
Size: 21.0 x 29.7 cm
Pages: 248
Keywords: Tagungsband
Categories: International; Tagungsbände

Page - 111 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Image of the Page - 111 -

Text of the Page - 111 -

Table of contents