Seite - 112 - in Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
Bild der Seite - 112 -
Text der Seite - 112 -
have the same row offset (O) and the same compensation factor PL, which is optimal for a vector
unit. Thesameis true foradding thedifferentpreviousvaluesP1 .. P4. UsingPLandOforallvector
elements requires one additional instruction to broadcast the single value to all vector elements. We
will refer to the summing of X1 to X4 as partial prefix sum. The approach is similar to horizontal
minimum / maximum within a vector register. The difference is that a shuffle would not help, but
shiftingsolves theproblem. Table3 lists thesingle steps. Twoadditionsand twoshifts arenecessary.
This definitely is an improvement over [16] as their scheme required three additions and the same
amount of shifts. It might be the reason that their SSE version was slower than an enhanced serial
algorithm.
Unfortunately,asAVXdoesnothaveshiftoperations, theyareconsideredtobeonlyuseful for integer
data. So the shift has to be emulated by combining a shuffle and a blend. The shuffle rearranges
data and the blend masks the first element with zero, which is not supported by the shuffle or other
operations. AVX2 added integer support and shift operations at the same time. Due to the lane
concept, a special cross-lane operation is necessary. The idea is to do the partial sum for each lane.
In the last step, the overall sum of the lower lane is broadcasted to all elements in the higher lane
of a register and added. What is helpful is that the partial prefix sum in the first step is independent
from the other values. Without any doubt,O−PL+Px does not depend on the sum at first. In the
final step, both temporal results have to be merged with a vector addition. In the first place, we have
two independent data streams, which helps exploring instruction level parallelism. This is especially
important due to the fact that—asstatedbefore—summing within thevector isnot ideal for vector
units. The data preparation is another step optimal for vector units. If pattern matching is done using
a norm without an inner product, the similarity measure is applied to the difference between pattern
and test candidate.
We can estimate the expected speedup. For the regular version, we require 3 additions (or subtrac-
tions). Thevectorizedversionhasanoverheadof2 · logn,wheren is thenumberofvectorelements.
Then, there are three additions and one broadcast, but this already computesnpixels at once. Note,
that this isaveryroughestimation. Wehavenot taken intoaccount instruction levelparallelism. This
means instructions differ in latency and throughput. Moreover, the processor might have more oper-
ational units for some instructions than for others [21]. Another fact we did not consider is moving
data around. SSE and AVX are — like the whole x86 instruction set — based on load and store. The
normal version requires a load for each single element, however, the corresponding instructions for
vector units load data chunks as large as the vector unit in a single step at the same time. Making the
process faster, the bandwidth is alsoexceeded faster.
The complexity of the analysis above should make it clear that it is nearly impossible to give an
estimated speedup for the whole discrepancy norm calculation. Thus, we will use practical tests to
evaluate theperformance impact.
input v3 v2 v1 v0
shift 0 v3 v2 v1
add v3 v3+2 v2+1 v1+0
shift 0 0 v3 v3+2
add v3 v3+2 v3+2+1 v3+2+1+0
Table3. Computing partialprefixsumfor avectorregister holding4elements.
112
Proceedings
OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
- Titel
- Proceedings
- Untertitel
- OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics“
- Autoren
- Peter M. Roth
- Kurt Niel
- Verlag
- Verlag der Technischen Universität Graz
- Ort
- Wels
- Datum
- 2017
- Sprache
- englisch
- Lizenz
- CC BY 4.0
- ISBN
- 978-3-85125-527-0
- Abmessungen
- 21.0 x 29.7 cm
- Seiten
- 248
- Schlagwörter
- Tagungsband
- Kategorien
- International
- Tagungsbände