AA-sort with SSE4.1


              Cybozu Labs
  2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)
Agenda
 Introduction to AA-sort
   classic combsort
   vectorized combsort
   vectorized merge
 benchmark




2012/6/16 #x86opti 4        2 /29
AA-sort
 Aligned-Access sort
   proposed by Hiroshi Inoue et al. in
    "A high-performance sorting algorithm for multicore
    single-instruction multiple-data processors," 2011
      http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
      http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
   For SIMD
     fewer conditional branches, no unaligned data access
   For multicore processors
     they implemented it for PowerPC and Cell BE
   O(n log n) complexity
 I tried it on Intel CPUs (not complete)
   https://github.com/herumi/opti/blob/master/intsort.hpp
     the current version runs on a single processor only
AA-sort
 vectorized combsort for each block (<= L2 cache?)
 vectorized merge of sorted blocks

                                         input array

          block 0          block 1        block 2        block3   ...

             sort             sort           sort          sort

             <               <               <             <      ...

                           merge                        merge
                       <                            <             ...
                                         merge
                                     <                            ...
AA-sort algorithm
 sort each block
   O(n log n)
 merge sorted blocks
   O(n) per merge pass




classic combsort(1/2)
 an improved bubble sort
   unstable
   O(n log n) on average (empirical; the worst case is O(n^2))
   compares two elements separated by a gap (>= 1)
     the gap is divided by the shrink factor (about 1.3) after each pass
     size_t nextGap(size_t N) { return (N * 10) / 13; }

     void combsort(uint32_t *a, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
         }
         gap = nextGap(gap);
       }
       …
classic combsort(2/2)
 gap = 1 means bubble sort
   loop until the array is fully sorted

           …
           for (;;) {
             bool isSwapped = false;
             for (size_t i = 0; i < N - 1; i++) {
               if (a[i] > a[i + 1]) {
                 std::swap(a[i], a[i + 1]);
                 isSwapped = true;
               }
             }
             if (!isSwapped) return;
           }
       }


gap function
 Combsort11
   a final gap sequence of [11, 8, 6, 4, 3, 2, 1] seems good
    (according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm)


      size_t nextGap(size_t n) {
          n = (n * 10) / 13;
          if (n == 9 || n == 10) return 11; // (*)
          return n;
      }



   a little faster when line (*) is added
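To see what line (*) does, it may help to list the gaps that nextGap produces from a given start. The helper `gapSequence` below is a hypothetical name introduced only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Combsort11 gap function from the slide.
size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;  // (*)
    return n;
}

// Hypothetical helper: collects every gap visited when starting from n,
// down to and including the final gap of 1.
std::vector<size_t> gapSequence(size_t n) {
    std::vector<size_t> gaps;
    for (size_t g = nextGap(n); g >= 1; g = nextGap(g)) gaps.push_back(g);
    return gaps;
}
```

Starting from n = 100 the sequence is 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1. Without line (*), the step after 14 would be 10, and the tail would become 10, 7, 5, 3, 2, 1 instead.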



vectorized combsort
 step1 : sort values within each vector(32bitx4)
 step2 : SIMD version combsort
 step3 : reorder data
       6       8        9    3    5      7       12    14    0    4        1        20     11    ...

                                 step1
                                                      sort                                sort
  +0       3       5        0     …          …                        0         1          3       …   101
  +1       9       7        1     …          …                    102          104        105      …   380
  +2       6       12       4     …          …                    389          391        392      …   502
  +3       8       14       20    …          …
                                                        step2
                                                                  511          515        612      …   973
        v0         v1       v2    v3
                                                                                    step3

       0       1        3    …   101   102   104       105   …   380      389       391   392    …

step1
 step1.1 : sort [v[i][j] | i <- [0..3]] for each j = 0, 1, 2, 3
 step1.2 : transpose
  3       5      0     8

  2       7      1     2
                            step1.1
  8      12      4     13

  9      14      20    15
                                          sort

 v0      v1     v2     v3        0    3          5    8
                                                           step1.2
                                 1    2          2    7

                                 4    8          12   13
                                                                     transpose
                                 9    14         15   20
                                                                 0   1     4     9

                                                                 3   2     8     14

                                                                 5   2    12     15

                                                                 8   7    13     20

sort of 4 items
 use pmaxud, pminud for uint32_t x 4
        a                 b

                  <                 v0                v1              v2              v3

    min(a,b)           max(a,b)             <                                   <

                                   min01            max01           min23           max23

                                                <                           <
                                                s=max(min          t=min(max
                                  min0123                                           max0123
                                                01,min23)          01,max23)
                                                               <

                                  min0123           min(s,t)        max(s,t)        max0123


                                                                           sorted

source of step1.1
 V128 is a type of 32-bit integer x 4
   pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

                 void sort_step1_vec(V128 x[4])
                 {
                     V128 min01 = pminud(x[0], x[1]);
                     V128 max01 = pmaxud(x[0], x[1]);
                     V128 min23 = pminud(x[2], x[3]);
                     V128 max23 = pmaxud(x[2], x[3]);
                     x[0] = pminud(min01, min23);
                     x[3] = pmaxud(max01, max23);
                     V128 s = pmaxud(min01, min23);
                     V128 t = pminud(max01, max23);
                     x[1] = pminud(s, t);
                     x[2] = pmaxud(s, t);
                 }
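The network above can be checked without SSE4.1 hardware by modelling V128 as four scalar lanes. This is a sketch under that assumption: `Vec4`, `pminudModel`, `pmaxudModel` and `sortStep1VecModel` are hypothetical names, and the lane-wise loops stand in for the pminud / pmaxud instructions.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// Lane-wise unsigned minimum, modelling pminud.
Vec4 pminudModel(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int j = 0; j < 4; j++) r[j] = std::min(a[j], b[j]);
    return r;
}

// Lane-wise unsigned maximum, modelling pmaxud.
Vec4 pmaxudModel(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int j = 0; j < 4; j++) r[j] = std::max(a[j], b[j]);
    return r;
}

// Same 4-input sorting network as sort_step1_vec: afterwards, for every
// lane j, x[0][j] <= x[1][j] <= x[2][j] <= x[3][j].
void sortStep1VecModel(Vec4 x[4]) {
    Vec4 min01 = pminudModel(x[0], x[1]);
    Vec4 max01 = pmaxudModel(x[0], x[1]);
    Vec4 min23 = pminudModel(x[2], x[3]);
    Vec4 max23 = pmaxudModel(x[2], x[3]);
    x[0] = pminudModel(min01, min23);   // overall minimum
    x[3] = pmaxudModel(max01, max23);   // overall maximum
    Vec4 s = pmaxudModel(min01, min23); // the two middle candidates
    Vec4 t = pminudModel(max01, max23);
    x[1] = pminudModel(s, t);
    x[2] = pmaxudModel(s, t);
}
```

With the step1 slide's input (v0 = 3, 2, 8, 9; v1 = 5, 7, 12, 14; v2 = 0, 1, 4, 20; v3 = 8, 2, 13, 15) the model yields v0 = 0, 1, 4, 9 and v3 = 8, 7, 13, 20, matching the diagram after step1.1.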


transpose of 4x4 matrix
 use unpcklps and unpckhps
                                 t0=unpcklps(x0,x2)
+0     3       5        0   8                         3    5     8    12

+1     2       7        1   2
                                 t2=unpckhps(x0,x2)   0    8     4    13

+2     8      12        4   13                        2    7     9    14

+3     9      14       20   15   t1=unpcklps(x1,x3)   1    2     20   15
                                 t3=unpckhps(x1,x3)
      x0      x1       x2   x3                        t0   t1   t2    t3



       3       5        8   12   x0=unpcklps(t0,t1)   3    2    8     9
       0       8        4   13                        5    7    12    14
       2       7        9   14   x1=unpckhps(t0,t1)   0    1    4     20
       1       2       20   15                        8    2    13    15
                                 x2=unpcklps(t2,t3)
      t0      t1       t2   t3   x3=unpckhps(t2,t3)   x0   x1   x2    x3

source of transpose and step1
  void transpose(V128 x[4])
  {
    V128 x0 = x[0];
    V128 x1 = x[1];
    V128 x2 = x[2];
    V128 x3 = x[3];
    V128 t0 = unpcklps(x0, x2);
    V128 t1 = unpcklps(x1, x3);
    V128 t2 = unpckhps(x0, x2);
    V128 t3 = unpckhps(x1, x3);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
  }

  void sort_step1(V128 *va, size_t N)
  {
    for (size_t i = 0; i < N; i += 4) {
      sort_step1_vec(&va[i]);
      transpose(&va[i]);
    }
  }
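The unpack-based transpose can be traced lane by lane with a scalar model. In this sketch `Vec4`, `unpackLo`, `unpackHi` and `transposeModel` are hypothetical names; the two helpers mimic how unpcklps and unpckhps interleave the low and high halves of their operands.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// Models unpcklps(a, b): interleave the low halves -> [a0, b0, a1, b1].
Vec4 unpackLo(const Vec4& a, const Vec4& b) { return {a[0], b[0], a[1], b[1]}; }

// Models unpckhps(a, b): interleave the high halves -> [a2, b2, a3, b3].
Vec4 unpackHi(const Vec4& a, const Vec4& b) { return {a[2], b[2], a[3], b[3]}; }

// Same two-round unpack sequence as transpose() on the slide.
void transposeModel(Vec4 x[4]) {
    Vec4 t0 = unpackLo(x[0], x[2]);
    Vec4 t1 = unpackLo(x[1], x[3]);
    Vec4 t2 = unpackHi(x[0], x[2]);
    Vec4 t3 = unpackHi(x[1], x[3]);
    x[0] = unpackLo(t0, t1);
    x[1] = unpackHi(t0, t1);
    x[2] = unpackLo(t2, t3);
    x[3] = unpackHi(t2, t3);
}
```

Feeding in the transpose slide's columns (x0 = 3, 2, 8, 9; x1 = 5, 7, 12, 14; x2 = 0, 1, 4, 20; x3 = 8, 2, 13, 15) gives x0 = 3, 5, 0, 8 and so on, so element (i, j) ends up at (j, i).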



SIMD version combsort
 the first half of the code uses
   vector_cmpswap
   vector_cmpswap_skew
     bool sort_step2(V128 *va, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           vector_cmpswap(va[i], va[i + gap]);
         }
         for (size_t i = N - gap; i < N; i++) {
           vector_cmpswap_skew(va[i], va[i + gap - N]);
         }
         gap = nextGap(gap);
       }
       ...


vector_cmpswap
 no conditional branch
           a              b

                  <

       min(a,b)        max(a,b)


     if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

                                  vectorised

     void vector_cmpswap(V128& a, V128& b)
     {
       V128 t = pmaxud(a, b);
       a = pminud(a, b);
       b = t;
     }

vector_cmpswap_skew
 for boundary of array

       a               a3      a2           a1           a0



       b               b3      b2           b1           b0


                                           (a',b') = vector_cmpswap_skew(a,b)

       a'              a3   min(a2,b3)   min(a1,b2)   min(a0,b1)



       b'        max(a2,b3) max(a1,b2) max(a0,b1)        b0




isSortedVec
 check whether array is sorted
   ptest_zf(a, b) is true if (a & b) == 0
   a <= b <=> max(a,b) == b <=> c := max(a,b) - b == 0
   pcmpgtd compares signed int32_t, so it cannot be used directly for uint32_t
          bool isSortedVec(const V128 *va, size_t N) {
            for (size_t i = 0; i < N - 1; i++) {
              V128 a = va[i];
              V128 b = va[i + 1];
              V128 c = pmaxud(a, b);
              c = psubd(c, b);
              if (!ptest_zf(c, c)) {
                return false;
              }
            }
            return true;
          }
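The reason for the max-then-subtract test can be shown on single lanes. The sketch below uses hypothetical helper names: `leViaMax` is the unsigned test used in isSortedVec, and `leSigned` is what a signed comparison such as pcmpgtd would conclude for the same bit patterns. The cast to int32_t assumes the usual two's-complement wrap-around.

```cpp
#include <algorithm>
#include <cstdint>

// a <= b  <=>  max(a, b) - b == 0, evaluated on unsigned values.
bool leViaMax(uint32_t a, uint32_t b) { return std::max(a, b) - b == 0; }

// The same question answered with a signed comparison of the bit patterns.
// Values >= 0x80000000 are seen as negative, so the answer can be wrong.
bool leSigned(uint32_t a, uint32_t b) {
    return static_cast<int32_t>(a) <= static_cast<int32_t>(b);
}
```

For a = 1 and b = 0x80000000 the unsigned test reports a <= b, while the signed view treats b as negative and reports the opposite.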
loop for gap == 1
 vectorised bubble sort for gap == 1
   give up if the loop count reaches maxLoop
     fall back to std::sort
         this rarely happens
            const int maxLoop = 10;
            for (int loop = 0; loop < maxLoop; loop++) {
              for (size_t i = 0; i < N - 1; i++) {
                vector_cmpswap(va[i], va[i + 1]);
              }
              vector_cmpswap_skew(va[N - 1], va[0]);
              if (isSortedVec(va, N)) return true;
            }




AA-sort algorithm
 sort each block
   O(n log n)
 merge sorted blocks
   O(n) per merge pass




merge two sorted vectors
   a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are sorted
   c = [b:a] = merge and sort (a, b)
                                                      sorted

                   a        a0   a1     a2       a3

                                                       sorted
                   b        b0   b1     b2       b3


                                      [b:a] = vector_merge(a,b)


        c0             c1   c2   c3     c0       c1       c2      c3

                                                                  sorted



data flow of merge
                                   sorted                                          sorted


     a0          a1        a2          a3            b0          b1         b2          b3




           <                       <                       <                       <
   min00       max00       min11       max11       min22       max22       min33       max33



                       <                                               <




                       <                       <                       <


source of vector_merge
 Too complex
   good idea?

     void vector_merge(V128& a, V128& b) {
       V128 m = pminud(a, b);
       V128 M = pmaxud(a, b);
       V128 s0 = punpckhqdq(m, m);
       V128 s1 = pminud(s0, M);
       V128 s2 = pmaxud(s0, M);
       V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
       V128 s4 = punpcklqdq(s2, m);
       s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
       V128 s5 = pminud(s3, s4);
       V128 s6 = pmaxud(s3, s4);
       V128 s7 = pinsrd<2>(s5, movd(s6));
       V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
       a = pshufd<PACK(1, 2, 0, 3)>(s7);
       b = pshufd<PACK(3, 2, 0, 1)>(s8);
     }
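The comparison pattern behind vector_merge is easier to follow without the shuffles. The data-flow slide shows three rounds of 4, 2 and 3 comparisons, which is the shape of Batcher's odd-even merge for two sorted 4-element inputs; the scalar sketch below implements that network. It is an assumption that this is the network the SSE code realises (the first two rounds do line up with m, M, s1 and s2), and `Vec4` and `vectorMergeNetwork` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// Merges two ascending 4-element vectors. Afterwards a holds the four
// smallest values and b the four largest, both in ascending order.
void vectorMergeNetwork(Vec4& a, Vec4& b) {
    std::array<uint32_t, 8> x = {a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]};
    auto cmpswap = [&x](int i, int j) {
        if (x[i] > x[j]) std::swap(x[i], x[j]);
    };
    // Round 1: a_i against b_i (min_ii / max_ii on the slide).
    cmpswap(0, 4); cmpswap(1, 5); cmpswap(2, 6); cmpswap(3, 7);
    // Round 2: the upper minima against the lower maxima.
    cmpswap(2, 4); cmpswap(3, 5);
    // Round 3: fix up the remaining neighbouring pairs.
    cmpswap(1, 2); cmpswap(3, 4); cmpswap(5, 6);
    std::copy(x.begin(), x.begin() + 4, a.begin());
    std::copy(x.begin() + 4, x.end(), b.begin());
}
```

For a = 1, 4, 6, 9 and b = 2, 3, 7, 8 the result is a = 1, 2, 3, 4 and b = 6, 7, 8, 9. Every comparison is a min/max pair with no data-dependent branch, which is what makes the network a fit for pminud and pmaxud.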
std::merge()
 merge [begin1, end1) and [begin2, end2) (simplified; assumes both ranges are non-empty)
 template <class In1, class In2, class Out>
 Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
 {
   for (;;) {
     *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
     if (begin1 == end1) return copy(begin2, end2, out);
     if (begin2 == end2) return copy(begin1, end1, out);
   }
 }




vectorised merge
 merge arrays with vector_merge()
 void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN){
   size_t aPos = 0, bPos = 0, outPos = 0;
   V128 vMin = va[aPos++];
   V128 vMax = vb[bPos++];
   for (;;) {
     vector_merge(vMin, vMax);
     vo[outPos++] = vMin;
     if (aPos < aN) {
       if (bPos < bN) {
         V128 ta = va[aPos];
         V128 tb = vb[bPos];          // compare ta0 with tb0
         if (movd(ta) <= movd(tb)) {
           vMin = ta;
           aPos++;
         } else {
           vMin = tb;
           bPos++;
         }
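The listing above stops at the slide boundary. One way the whole loop could look is sketched below as a scalar model; the handling of an exhausted input and the final flush of vMax are assumptions here, not taken from the slide. `Vec4`, `vectorMergeModel` and `mergeModel` are hypothetical names, and std::merge on eight values stands in for vector_merge.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// Stand-in for vector_merge: lo receives the four smallest of the eight
// values and hi the four largest, both ascending. Inputs must be ascending.
void vectorMergeModel(Vec4& lo, Vec4& hi) {
    std::array<uint32_t, 8> t;
    std::merge(lo.begin(), lo.end(), hi.begin(), hi.end(), t.begin());
    std::copy(t.begin(), t.begin() + 4, lo.begin());
    std::copy(t.begin() + 4, t.end(), hi.begin());
}

// Merges two sorted sequences of vectors. Both inputs must be non-empty.
std::vector<Vec4> mergeModel(const std::vector<Vec4>& va, const std::vector<Vec4>& vb) {
    std::vector<Vec4> vo;
    vo.reserve(va.size() + vb.size());
    size_t aPos = 0, bPos = 0;
    Vec4 vMin = va[aPos++];
    Vec4 vMax = vb[bPos++];
    for (;;) {
        vectorMergeModel(vMin, vMax);
        vo.push_back(vMin);  // the four smallest pending values are final
        if (aPos < va.size() && bPos < vb.size()) {
            // Load from whichever input has the smaller first element.
            if (va[aPos][0] <= vb[bPos][0]) vMin = va[aPos++];
            else vMin = vb[bPos++];
        } else if (aPos < va.size()) {
            vMin = va[aPos++];  // b is exhausted (assumed handling)
        } else if (bPos < vb.size()) {
            vMin = vb[bPos++];  // a is exhausted (assumed handling)
        } else {
            vo.push_back(vMax);  // flush the last four values
            return vo;
        }
    }
}
```

The only data-dependent branch per output vector is the choice of which input to load next; the merging itself is branch-free in the SIMD version.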

block size and rate of sort
 What is a good block size for the vectorised sort?
   half the L2 size is recommended for PowerPC 970MP
     L2 = 1MiB => 512KiB => block size = 128Ki uint32_t elements
 BS = 32Ki seems good for Xeon, Core i7
 profile of sort and merge
   [Figure: share of run time spent in sort(%) and merge(%), y-axis from 0 to 100]




Benchmark(1/3)
 AA-sort vs std::sort for random data
   Xeon X5650 + gcc-4.6.3
      4 times faster for fewer than 64Ki elements, 2.85 times faster for 4Mi elements
   [Figure: clock cycles (log scale, 1 to 10000000) against # of uint32_t (16 to 4Mi) for std::sort and AA-sort; a "fast" label marks the faster direction]


Benchmark(2/3)
 sort 64Ki uint on Xeon + gcc-4.6.3
   AA-sort speed does not depend strongly on the input pattern
   [Figure: std::sort and AA-sort compared across input patterns, y-axis from 0 to 25000; a "fast" label marks the faster direction]




Benchmark(3/3)
 sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11

   [Figure: std::sort(gcc), AA-sort(gcc), std::sort(VC) and AA-sort(VC) compared, y-axis from 0 to 16000; a "fast" label marks the faster direction]




