[WIP] Vectorized partition function for floats on avx2 machines. #18344

umanwizard · 2019-03-22T18:45:31Z

Speeds up kthvalue on cpu; can be used for topk and sort in the future.

The idea is to compare 8 elements at a time to the pivot, then reshuffle them within a 256-bit register by using a lookup table according to the comparison, and then write them back into the array at the required positions. (I got most of the ideas from here and then modified the algorithm a bit to be in-place.)

Check the comments in qs_partition_inplace of QuickPartition.cpp for more details.

This can be easily extended to doubles, but the speedup might not be as great because you'll only sort 4 elements at a time.

Future possibilities:

Port topk and sort to use the same logic
Make it work for types other than float32
Figure out why it is only slightly faster on a large, fast-moving dimension on multi-core (see the 1000x1000000 case below)

Timings:

Experiment	Before	After	Speedup
Many cores, 1000000x1000 floats, dim 0	24.1s	3.27s	7.4x
Many cores, 1000000x1000 floats, dim 1	3.62s	577ms	6.3x
Many cores, 1000x1000000 floats, dim 0	4.06s	727ms	5.6x
Many cores, 1000x1000000 floats, dim 1	3.92s	3.4s	1.15x
One core, 1000000x1000 floats, dim 0	1m7s	26.9s	2.5x
One core, 1000000x1000 floats, dim 1	18s	7.54s	2.4x
One core, 1000x1000000 floats, dim 0	27.2s	9.29s	2.9x
One core, 1000x1000000 floats, dim 1	14.5s	4.1s	3.5x

Speeds up kthvalue on cpu; can be used for topk and sort in the future.

apaszke · 2019-03-24T18:33:33Z

aten/src/ATen/native/cpu/QuickPartition.cpp

+  static const uint32_t __attribute__((aligned(256))) t[256*8];
+};
+
+const uint32_t __attribute__((aligned(256))) Table<float>::t[] = {


A comment explaining how this has been computed and what it is would be very useful

Good point, I have added a comment

The comment says what the table is (good!) but it doesn't say how to compute the table (you probably ran a Python script or something to generate the table, right? Paste it in here.)

I'd rather see something like this in vec256 directly, since it's very specific to 256bit vectors.

ezyang · 2019-03-25T20:45:12Z

aten/src/ATen/native/cpu/QuickPartition.cpp

+using at::vec256::Vec256;
+using at::vec256::permute;
+using at::vec256::int_same_size_t;
+


Somewhere in this file we should quote the paper where this algorithm comes from.

Ok, updated to cite that paper at the top of the file.

umanwizard · 2019-04-05T17:31:09Z

@ezyang are you able to review this anytime soon? If not it's no problem, will add others.

t-vi · 2019-04-05T17:47:47Z

I haven't looked at the code or algorithm, but test failures look legit. Apparently there is a case when it's not working as expected. The other question I have is whether this falls back as expected when the AVX isn't available at run time (I think the code in cpu usually is only called when it is).

ezyang · 2019-04-11T18:39:20Z

aten/src/ATen/cpu/vec256/vec256_base.h

  DEFINE_COMP(<)
 #undef DEFINE_COMP

+  c10::guts::enable_if_t<size() <= 32, int32_t> msb_mask() const {


It would be nice to have a comment here just briefly saying what this does.

ezyang · 2019-04-11T18:39:28Z

aten/src/ATen/cpu/vec256/vec256_base.h

+
 };

+template <class T> inline Vec256<T> permute(const Vec256<T>& src,


ezyang · 2019-04-11T18:40:32Z

aten/src/ATen/cpu/vec256/vec256_int.h

    values = _mm256_setr_epi32(val1, val2, val3, val4, val5, val6, val7, val8);
  }
+  explicit operator __m256() const {
+    return (__m256)values;


Nit: use static_cast here

ezyang · 2019-04-11T18:40:42Z

aten/src/ATen/cpu/vec256/vec256_int.h


+template <>
+Vec256<int32_t> permute(const Vec256<int32_t>& src, const Vec256<int32_t>& indices) {
+  return (__m256i)_mm256_permutevar8x32_ps((__m256)src, indices);


ezyang · 2019-04-11T18:42:35Z

aten/src/ATen/native/cpu/QuickPartition.h

+#pragma once
+#include <ATen/NumericUtils.h>
+#include <stdint.h>
+#include <aten/src/ATen/cpu/vec256/vec256.h>


Ya, I desperately need comments in this file.

ezyang · 2019-04-11T18:54:54Z