Proceedings - OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics"

have the same row offset (O) and the same compensation factor PL, which is optimal for a vector unit. The same is true for adding the different previous values P1 .. P4. Using PL and O for all vector elements requires one additional instruction to broadcast the single value to all vector elements.

We will refer to the summing of X1 to X4 as partial prefix sum. The approach is similar to computing a horizontal minimum / maximum within a vector register. The difference is that a shuffle would not help, but shifting solves the problem. Table 3 lists the single steps. Two additions and two shifts are necessary. This is an improvement over [16], whose scheme required three additions and the same number of shifts. This might be the reason that their SSE version was slower than an enhanced serial algorithm.

Unfortunately, AVX does not have shift operations, as they are considered to be useful only for integer data. So the shift has to be emulated by combining a shuffle and a blend. The shuffle rearranges the data and the blend masks the first element with zero, which is not supported by the shuffle or other operations. AVX2 added integer support and shift operations at the same time. Due to the lane concept, a special cross-lane operation is necessary. The idea is to do the partial sum for each lane. In the last step, the overall sum of the lower lane is broadcast to all elements in the higher lane of a register and added.

What is helpful is that the partial prefix sum in the first step is independent of the other values: O − PL + Px does not depend on the sum at first. In the final step, both temporary results have to be merged with a vector addition. Initially, we therefore have two independent data streams, which helps to exploit instruction-level parallelism. This is especially important because, as stated before, summing within the vector is not ideal for vector units. The data preparation is another step that is optimal for vector units.
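The lane-wise scheme described above can be illustrated with a small model. The following is a minimal sketch in plain Python, not the authors' implementation: a register is modelled as a list, `LANE = 4` assumes four elements per lane, and the helper names (`shuffle`, `blend`, `emulated_shift`, `lane_prefix_sum`, `two_lane_prefix_sum`) are hypothetical. The list index order is a modelling choice; the first `LANE` entries play the role of the lower lane.

```python
LANE = 4  # assumption: four elements per lane (e.g. 32-bit values in a 128-bit lane)

def shuffle(vec, idx):
    """Model of a shuffle: result[i] = vec[idx[i]]."""
    return [vec[i] for i in idx]

def blend(a, b, mask):
    """Model of a blend: take b[i] where mask[i] is set, otherwise a[i]."""
    return [y if m else x for x, y, m in zip(a, b, mask)]

def emulated_shift(lane, k):
    """Shift a lane by k elements without a shift instruction.

    The shuffle rotates the data; the blend then overwrites the
    k wrapped-around elements with zero.
    """
    n = len(lane)
    rotated = shuffle(lane, [(i - k) % n for i in range(n)])
    mask = [i < k for i in range(n)]
    return blend(rotated, [0] * n, mask)

def lane_prefix_sum(lane):
    """Partial prefix sum inside one lane using shift-and-add steps."""
    k = 1
    while k < len(lane):
        shifted = emulated_shift(lane, k)
        lane = [x + y for x, y in zip(lane, shifted)]
        k *= 2
    return lane

def two_lane_prefix_sum(vec):
    """Prefix sum over a register that consists of two lanes."""
    lower = lane_prefix_sum(vec[:LANE])
    upper = lane_prefix_sum(vec[LANE:])
    # Cross-lane step: broadcast the overall sum of the lower lane
    # to every element of the higher lane and add it.
    carry = [lower[-1]] * LANE
    upper = [x + c for x, c in zip(upper, carry)]
    return lower + upper

if __name__ == "__main__":
    print(two_lane_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The two lane sums are independent of each other until the final broadcast and addition, which mirrors the observation that independent data streams leave room for instruction-level parallelism.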
If pattern matching is done using a norm without an inner product, the similarity measure is applied to the difference between pattern and test candidate.

We can estimate the expected speedup. For the regular version, we require 3 additions (or subtractions). The vectorized version has an overhead of 2 · log n, where n is the number of vector elements. Then, there are three additions and one broadcast, but this already computes n pixels at once. Note that this is a very rough estimate. We have not taken instruction-level parallelism into account: instructions differ in latency and throughput. Moreover, the processor might have more operational units for some instructions than for others [21].

Another fact we did not consider is moving data around. SSE and AVX are, like the whole x86 instruction set, based on load and store. The normal version requires a load for each single element, whereas the corresponding vector instructions load a data chunk as large as the vector unit in a single step. While this makes the process faster, the bandwidth limit is also reached sooner.

The complexity of the analysis above should make it clear that it is nearly impossible to give an estimated speedup for the whole discrepancy norm calculation. Thus, we will use practical tests to evaluate the performance impact.

input   v3   v2     v1       v0
shift   0    v3     v2       v1
add     v3   v3+2   v2+1     v1+0
shift   0    0      v3       v3+2
add     v3   v3+2   v3+2+1   v3+2+1+0

Table 3. Computing the partial prefix sum for a vector register holding 4 elements. A combined index such as v3+2 denotes the sum v3 + v2.
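The steps of Table 3 can be replayed with a short model. This is a minimal sketch in plain Python under stated assumptions: the register is a list written in the same left-to-right order as the table (v3 first), the number of elements is a power of two, and the function names `shift_in_zeros`, `add` and `partial_prefix_sum` are hypothetical. It also counts the shift and add steps, which gives the 2 · log n overhead from the estimate if the logarithm is taken to base 2 (an interpretation consistent with the two shifts and two additions for four elements).

```python
def shift_in_zeros(vec, k):
    """Move every element k positions to the right, filling with zeros."""
    return [0] * k + vec[:len(vec) - k]

def add(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]

def partial_prefix_sum(vec):
    """Return (prefix sums, number of shift/add steps).

    Each step performs one shift and one addition; the shift
    distance doubles from step to step.
    """
    n = len(vec)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    steps = 0
    k = 1
    while k < n:
        vec = add(vec, shift_in_zeros(vec, k))
        steps += 1
        k *= 2
    return vec, steps

if __name__ == "__main__":
    # v3, v2, v1, v0 = 1, 2, 3, 4 in the order used by Table 3
    sums, steps = partial_prefix_sum([1, 2, 3, 4])
    print(sums)       # v3, v3+v2, v3+v2+v1, v3+v2+v1+v0
    print(2 * steps)  # total number of shifts plus additions
```

For four elements the model needs two steps, that is two shifts and two additions, matching the rows of Table 3.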
Title: Proceedings
Subtitle: OAGM & ARW Joint Workshop 2016 on "Computer Vision and Robotics"
Authors: Peter M. Roth, Kurt Niel
Publisher: Verlag der Technischen Universität Graz
Location: Wels
Date: 2017
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-527-0
Size: 21.0 x 29.7 cm
Pages: 248
Keywords: Conference proceedings
Categories: International, Conference proceedings

Table of contents

  1. Learning / Recognition 24
  2. Signal & Image Processing / Filters 43
  3. Geometry / Sensor Fusion 45
  4. Tracking / Detection 85
  5. Vision for Robotics I 95
  6. Vision for Robotics II 127
  7. Poster OAGM & ARW 167
  8. Task Planning 191
  9. Robotic Arm 207