Conversation

@umanwizard
Contributor

Speeds up kthvalue on CPU; the same approach can be used for topk and sort in the future.

The idea is to compare 8 elements at a time to the pivot, then reshuffle them within a 256-bit register using a lookup table indexed by the comparison result, and then write them back into the array at the required positions. (I got most of the ideas from here and then modified the algorithm a bit to be in-place.)
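As a rough sketch of the core step (hypothetical names, simplified to a single out-of-place store; the real, in-place logic lives in qs_partition_inplace):

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical 256-entry permutation table: for each 8-bit comparison mask,
// a permutation that moves the lanes comparing "less than pivot" to the
// front of the register (see the table-generation discussion below).
extern const uint32_t lt_table[256 * 8];

// Process 8 floats at once: compare to the pivot, compact the "less than"
// lanes to the front via the lookup table, and report how many there were.
inline int partition_step(float* dst, const float* src, float pivot) {
  __m256 v   = _mm256_loadu_ps(src);                  // load 8 elements
  __m256 p   = _mm256_set1_ps(pivot);                 // broadcast the pivot
  __m256 cmp = _mm256_cmp_ps(v, p, _CMP_LT_OQ);       // lane-wise v < pivot
  int mask   = _mm256_movemask_ps(cmp);               // 8-bit result mask
  __m256i perm = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(lt_table + mask * 8));
  __m256 packed = _mm256_permutevar8x32_ps(v, perm);  // reshuffle in-register
  _mm256_storeu_ps(dst, packed);
  return __builtin_popcount(mask);                    // count of "< pivot" lanes
}
```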

Check the comments in qs_partition_inplace of QuickPartition.cpp for more details.

This can easily be extended to doubles, but the speedup might not be as great, because you'd only be processing 4 elements at a time.

Future possibilities:

  • Port topk and sort to use the same logic
  • Make it work for types other than float32
  • Figure out why it is only slightly faster on a large, fast-moving dimension on multi-core (see the 1000x1000000 case below)

Timings:

| Experiment                              | Before | After | Speedup |
|-----------------------------------------|--------|-------|---------|
| Many cores, 1000000x1000 floats, dim 0  | 24.1s  | 3.27s | 7.4x    |
| Many cores, 1000000x1000 floats, dim 1  | 3.62s  | 577ms | 6.3x    |
| Many cores, 1000x1000000 floats, dim 0  | 4.06s  | 727ms | 5.6x    |
| Many cores, 1000x1000000 floats, dim 1  | 3.92s  | 3.4s  | 1.15x   |
| One core, 1000000x1000 floats, dim 0    | 1m7s   | 26.9s | 2.5x    |
| One core, 1000000x1000 floats, dim 1    | 18s    | 7.54s | 2.4x    |
| One core, 1000x1000000 floats, dim 0    | 27.2s  | 9.29s | 2.9x    |
| One core, 1000x1000000 floats, dim 1    | 14.5s  | 4.1s  | 3.5x    |

@umanwizard umanwizard requested review from ezyang and t-vi March 22, 2019 18:45
static const uint32_t __attribute__((aligned(256))) t[256*8];
};

const uint32_t __attribute__((aligned(256))) Table<float>::t[] = {
Contributor

A comment explaining how this has been computed and what it is would be very useful

Contributor Author

Good point; I have added a comment.

Contributor

The comment says what the table is (good!) but it doesn't say how to compute the table (you probably ran a Python script or something to generate the table, right? Paste it in here.)
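For reference, a table with this shape could be generated by something like the following (a guess at the convention, not the script actually used for this PR): for every possible 8-bit mask, list the set-bit lanes first, then the clear-bit lanes.

```cpp
#include <cstdio>

// Hypothetical generator: for each possible comparison mask, print a
// permutation that compacts the matching lanes to the front of the register.
int main() {
  for (int mask = 0; mask < 256; ++mask) {
    int perm[8];
    int n = 0;
    for (int lane = 0; lane < 8; ++lane)      // lanes whose compare bit is set
      if (mask & (1 << lane)) perm[n++] = lane;
    for (int lane = 0; lane < 8; ++lane)      // then the remaining lanes
      if (!(mask & (1 << lane))) perm[n++] = lane;
    for (int i = 0; i < 8; ++i)
      std::printf("%d%s", perm[i], i == 7 ? ",\n" : ", ");
  }
}
```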

Contributor

I'd rather see something like this in vec256 directly, since it's very specific to 256-bit vectors.

using at::vec256::Vec256;
using at::vec256::permute;
using at::vec256::int_same_size_t;

Contributor

Somewhere in this file we should quote the paper where this algorithm comes from.

Contributor Author

Ok, updated to cite that paper at the top of the file.

@umanwizard
Contributor Author

@ezyang are you able to review this anytime soon? If not, no problem; I'll add other reviewers.

@t-vi
Collaborator

t-vi commented Apr 5, 2019

I haven't looked at the code or algorithm, but the test failures look legit. Apparently there is a case where it's not working as expected. The other question I have is whether this falls back as expected when AVX isn't available at run time (I think the code in cpu/ is usually only called when it is).

DEFINE_COMP(<)
#undef DEFINE_COMP

c10::guts::enable_if_t<size() <= 32, int32_t> msb_mask() const {
Contributor

It would be nice to have a comment here just briefly saying what this does.
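From the name and the size() <= 32 constraint, this presumably collects the most significant (sign) bit of every lane into the low bits of an int32_t; for floats that is a single movemask. A sketch of the assumed float implementation:

```cpp
// Assumed implementation for Vec256<float>: gather the sign bit of each of
// the 8 lanes into bits 0..7, turning a full-width comparison result into a
// compact 8-bit mask.
int32_t msb_mask() const {
  return _mm256_movemask_ps(values);
}
```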


};

template <class T> inline Vec256<T> permute(const Vec256<T>& src,
Contributor

Same here.

values = _mm256_setr_epi32(val1, val2, val3, val4, val5, val6, val7, val8);
}
explicit operator __m256() const {
return (__m256)values;
Contributor

Nit: use static_cast here
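If static_cast between __m256i and __m256 turns out not to compile on every toolchain, the dedicated no-op cast intrinsic is another spelling (a suggestion, not necessarily what the PR adopted):

```cpp
explicit operator __m256() const {
  // Reinterprets the integer register as floats; compiles to no instruction.
  return _mm256_castsi256_ps(values);
}
```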


template <>
Vec256<int32_t> permute(const Vec256<int32_t>& src, const Vec256<int32_t>& indices) {
return (__m256i)_mm256_permutevar8x32_ps((__m256)src, indices);
Contributor

and here

#pragma once
#include <ATen/NumericUtils.h>
#include <stdint.h>
#include <aten/src/ATen/cpu/vec256/vec256.h>
Contributor

Ya, I desperately need comments in this file.

*mode_value = tmp_values[k - 1];
*mode_index = tmp_indices[k - 1];
});
bool can_use_small_index = self.size(dim) <= std::numeric_limits<int_same_size_t<scalar_t>>::max();
Contributor

I'm a little confused by the small index logic here (in particular, I don't see why index size should have anything to do with the size of the scalar type; that seems really weird to me!) An explanatory note would help here.
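One plausible reading (an assumption; the diff doesn't spell it out): when the index type has the same width as the scalar type, each value and its index occupy the same lane position, so the single permutation derived from the comparison mask can reorder both registers identically. Roughly:

```cpp
#include <immintrin.h>

// Hypothetical helper: with 32-bit floats and 32-bit indices, one permutation
// shuffles values and indices in lockstep, keeping each value paired with its
// original position.
inline void permute_pair(__m256& vals, __m256i& idxs, __m256i perm) {
  vals = _mm256_permutevar8x32_ps(vals, perm);
  idxs = _mm256_permutevar8x32_epi32(idxs, perm);
}
```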



template <typename scalar_t, typename index_t, typename ValFn, typename IdxFn>
void qsel_with_indices(const Tensor& self, Tensor& values, Tensor& indices, ValFn finalize_vals, IdxFn finalize_idxs, index_t k, bool largest, int64_t dim) {
Contributor

Once again, it would be super helpful to know what finalize_vals and finalize_idxs are supposed to do.

Contributor

It doesn't look like largest is ever passed as false. What's going on?

using index_t = int_same_size_t<scalar_t>;
qsel_with_indices<scalar_t, index_t>(self, values, indices,
[k](const scalar_t *vals, const Tensor& val_slice) {
*val_slice.data<scalar_t>() = vals[k - 1];
Contributor

data<>() is a checked pointer access, which I think is inappropriate here because I believe you're calling this in a loop. However, I don't see an unsafe_data member on Tensor; I guess we should add one.

Contributor

It is a little shocking that you're passing an entire honking Tensor reference into this function; that sounds like a bunch of unnecessary allocations, especially when you just want to poke some memory at one location.

[&](int64_t i, TensorList tl) {
auto self = tl[0].accessor<scalar_t, 1>();
auto sz = self.size(0);
std::unique_ptr<index_t[]> tmp_indices_(new index_t[sz]);
Contributor

Err, why not just use a std::vector here?

index_t R,
scalar_t piv,
bool largest,
int) -> decltype(vec_qs_partition_inplace((scalar_t *)(nullptr), (scalar_t *)(nullptr), (index_t *)(nullptr), scalar_t(), bool()), index_t()){
Contributor

Ummmm... this is just an int32_t, right? Can we just say that?

Contributor

Oh, are these shenanigans because the type may change based on index_t? But I am still not sure why the return type cannot be straightforwardly written in terms of index_t...
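It looks like expression SFINAE plus int/long tag dispatch: the comma-operator decltype does evaluate to index_t, but writing it as a decltype over the vec_qs_partition_inplace call removes this overload from consideration when no vectorized partition exists for the type pair. A self-contained toy (hypothetical names) of the same trick:

```cpp
// Toy model of the pattern: run(p, 0) prefers the first overload (exact
// `int` match), but that overload only exists when vec_kernel(p) is
// well-formed; otherwise overload resolution falls through to the scalar one.
void vec_kernel(float*) {}  // pretend a vectorized path exists for float only

template <typename T>
auto run(T* p, int) -> decltype(vec_kernel(p), 0) {
  vec_kernel(p);
  return 0;  // vectorized path taken
}

template <typename T>
int run(T*, long) {
  return 1;  // scalar fallback
}

// run((float*)ptr, 0) picks the first overload; run((double*)ptr, 0) the second.
```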

}

template <typename scalar_t, typename index_t>
auto partition(
Contributor

I'd love a comment here saying, at the very least, what the contract between L, R, piv and the output values is.

// Re-set active partition
if (j <= k)
L = i;
L = j;
Contributor

What happened here? (I'm sure it's right, but I found it surprising!)

@ezyang
Contributor

ezyang commented Apr 11, 2019

This PR is not complete because it needs to use the registration mechanism to decide whether or not to use the AVX2 kernel. That likely explains the test failures; I don't see any use of our existing dispatching mechanisms for AVX2 kernels.

namespace at {
namespace native {
namespace {
// fallback to scalar when we don't have avx2
Contributor

Yeah, so this says it falls back when we don't have avx2, but you're doing this as a compile-time test, which is not sufficient. It needs to be a run-time test.

Contributor

Grep for REGISTER_DISPATCH to see how to do this properly.
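For anyone following along, the DispatchStub pattern looks roughly like this (illustrative names and signature; the real ones depend on what the kernel needs, and the snippet assumes the usual ATen headers):

```cpp
#include <ATen/ATen.h>
#include <ATen/native/DispatchStub.h>

// In a shared header: declare a stub with the kernel's function type.
using kthvalue_fn = void (*)(at::Tensor& values, at::Tensor& indices,
                             const at::Tensor& self, int64_t k, int64_t dim);
DECLARE_DISPATCH(kthvalue_fn, kthvalue_stub);

// In the generic native .cpp: define the stub and call through it; the
// first argument selects the backend at run time.
DEFINE_DISPATCH(kthvalue_stub);
//   kthvalue_stub(at::kCPU, values, indices, self, k, dim);

// In the cpu/ kernel file, which the build compiles once per ISA level
// (the run-time CPU capability check picks the right copy):
//   REGISTER_DISPATCH(kthvalue_stub, &kthvalue_kernel_impl);
```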

Contributor

Actually, I suspect you may have to do more work. The existing code for handling machines which can vectorize or not does this purely by recompiling Vec256 multiple times under different AVX settings. But in your case you need something a little different: you don't want to run the algorithm above at all when you don't have sufficient AVX support; you just want to do something completely different. I'm not exactly sure how to spell that presently. cc @colesbury @cpuhrsch

Contributor

@cpuhrsch cpuhrsch Apr 11, 2019

You can guard using __AVX__ macros

Contributor

Err, really? I thought __AVX__ says whether or not the compiler has AVX support. But that's not what we need to check here; we need to know whether or not the runtime has AVX support.
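A minimal illustration of the distinction, using GCC/Clang's CPU-detection builtin (the DispatchStub machinery does the equivalent check internally):

```cpp
// __AVX2__ only tells us how this translation unit was *compiled*; whether
// the CPU actually executing the binary supports AVX2 has to be queried at
// run time.
bool can_use_avx2_kernel() {
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;  // be pessimistic on compilers without the builtin
#endif
}
```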

};

template <typename T, bool largest>
int_same_size_t<T> nan_partition(T *begin, T *end,
Contributor

So, uh, what exactly does this do? Just partition away the NaNs? Or something else? A comment would be worth a lot here!

Contributor

Oh, it looks like this is what you call when the pivot is a NaN. That's all I needed in the comment :)
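In scalar form, such a pivot-is-NaN path presumably amounts to something like the sketch below (hypothetical signature; the real function is templated on largest and uses int_same_size_t<T> indices):

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Sketch: NaNs compare false against everything, so when the pivot is NaN a
// normal partition can't make progress. Instead, treat NaNs as larger than
// any element: compact the non-NaNs to the front (carrying indices along)
// and return how many there are.
template <typename T>
int32_t nan_partition_sketch(T* begin, T* end, int32_t* indices) {
  int64_t n = end - begin;
  int64_t out = 0;
  for (int64_t i = 0; i < n; ++i) {
    if (!std::isnan(begin[i])) {
      std::swap(begin[out], begin[i]);
      std::swap(indices[out], indices[i]);
      ++out;
    }
  }
  return static_cast<int32_t>(out);
}
```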

Contributor

OOC: which test exercises this case?

return _mm256_cmp_ps(values, other.values, _CMP_GE_OQ);
}

Vec256<float> not_leq(const Vec256<float>& other) const {
Contributor

What is "not less than or equal to"?


int_same_size_t<T> *indices_left_out = indices;
int_same_size_t<T> *indices_right_out = indices + sz;
while (true) {
Contributor

Really not a fan of 'while (true)'.

@cpuhrsch
Contributor

Can't some of this stuff be done using TensorIterator reductions?

matches_jit_signature: True
python_module: nn


Contributor

of course you'll delete this

@@ -0,0 +1,572 @@
// AVX2 optimized partition algorithm
// Based on the following paper:
// Shay Gueron, Vlad Krasnov, Fast Quicksort Implementation Using AVX Instructions, The Computer Journal, Volume 59, Issue 1, January 2016, Pages 83–90, https://doi.org/10.1093/comjnl/bxv063

@umanwizard umanwizard changed the title from "Vectorized partition function for floats on avx2 machines." to "[WIP] Vectorized partition function for floats on avx2 machines." Apr 11, 2019
@ezyang
Contributor

ezyang commented Apr 11, 2019

General comment: I have a hard time telling what largest is supposed to do. I thought it lets you invert which way the sort goes, but there are a few places where this expectation feels violated. E.g., https://github.com/pytorch/pytorch/pull/18344/files#diff-625f0146fdeb14ca4158305031e2c5e8R303

int_same_size_t<T> nan_partition(T *begin, T *end,
int_same_size_t<T> *indices) {
// We consider NaNs to be larger than any element.
if (!largest) {
Contributor

I'm not sure what's going on here.

@ezyang
Contributor

ezyang commented Apr 12, 2019

General comment: I have a hard time telling what largest is supposed to do. I thought it lets you invert which way the sort goes, but there are a few places where this expectation feels violated.

I think the problem is that what I think of as the logical sense for largest is inverted in this PR. I would have expected largest == true when I am attempting to compute the kth largest element. torch.kthvalue is defined to return the kth smallest element. Thus, I expect largest == false in this PR. However, in https://github.com/pytorch/pytorch/pull/18344/files#diff-dfc3d7c56800ce763e3798bc86d964dcR224 you have passed largest = true. So this seems to imply that the opposite sense is being used here.

@umanwizard
Contributor Author

It seems unlikely that this is ever going to land. Closing the PR.

@umanwizard umanwizard closed this Jan 3, 2020
@edelsohn edelsohn mentioned this pull request Aug 9, 2020