Page - 112 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Image of the Page - 112 -

Text of the Page - 112 -

have the same row offset (O) and the same compensation factor PL, which is optimal for a vector unit. Thesameis true foradding thedifferentpreviousvaluesP1 .. P4. UsingPLandOforallvector elements requires one additional instruction to broadcast the single value to all vector elements. We will refer to the summing of X1 to X4 as partial prefix sum. The approach is similar to horizontal minimum / maximum within a vector register. The difference is that a shuffle would not help, but shiftingsolves theproblem. Table3 lists thesingle steps. Twoadditionsand twoshifts arenecessary. This definitely is an improvement over [16] as their scheme required three additions and the same amount of shifts. It might be the reason that their SSE version was slower than an enhanced serial algorithm. Unfortunately,asAVXdoesnothaveshiftoperations, theyareconsideredtobeonlyuseful for integer data. So the shift has to be emulated by combining a shuffle and a blend. The shuffle rearranges data and the blend masks the first element with zero, which is not supported by the shuffle or other operations. AVX2 added integer support and shift operations at the same time. Due to the lane concept, a special cross-lane operation is necessary. The idea is to do the partial sum for each lane. In the last step, the overall sum of the lower lane is broadcasted to all elements in the higher lane of a register and added. What is helpful is that the partial prefix sum in the first step is independent from the other values. Without any doubt,O−PL+Px does not depend on the sum at first. In the final step, both temporal results have to be merged with a vector addition. In the first place, we have two independent data streams, which helps exploring instruction level parallelism. This is especially important due to the fact that—asstatedbefore—summing within thevector isnot ideal for vector units. The data preparation is another step optimal for vector units. If pattern matching is done using a norm without an inner product, the similarity measure is applied to the difference between pattern and test candidate. We can estimate the expected speedup. For the regular version, we require 3 additions (or subtrac- tions). Thevectorizedversionhasanoverheadof2 · logn,wheren is thenumberofvectorelements. Then, there are three additions and one broadcast, but this already computesnpixels at once. Note, that this isaveryroughestimation. Wehavenot taken intoaccount instruction levelparallelism. This means instructions differ in latency and throughput. Moreover, the processor might have more oper- ational units for some instructions than for others [21]. Another fact we did not consider is moving data around. SSE and AVX are — like the whole x86 instruction set — based on load and store. The normal version requires a load for each single element, however, the corresponding instructions for vector units load data chunks as large as the vector unit in a single step at the same time. Making the process faster, the bandwidth is alsoexceeded faster. The complexity of the analysis above should make it clear that it is nearly impossible to give an estimated speedup for the whole discrepancy norm calculation. Thus, we will use practical tests to evaluate theperformance impact. input v3 v2 v1 v0 shift 0 v3 v2 v1 add v3 v3+2 v2+1 v1+0 shift 0 0 v3 v3+2 add v3 v3+2 v3+2+1 v3+2+1+0 Table3. Computing partialprefixsumfor avectorregister holding4elements. 112

back to the book Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“"

Proceedings OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Title: Proceedings
Subtitle: OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
Authors: Peter M. Roth; Kurt Niel
Publisher: Verlag der Technischen Universität Graz
Location: Wels
Date: 2017
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-527-0
Size: 21.0 x 29.7 cm
Pages: 248
Keywords: Tagungsband
Categories: International; Tagungsbände

Page - 112 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“

Image of the Page - 112 -

Text of the Page - 112 -

Table of contents