Speed up sum over a dimension #6026

colesbury · 2018-03-26T21:56:03Z

Perf numbers:
https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88

I've changed the dispatch stub so that it doesn't require every kernel
to be compiled for every instruction set. Kernel implementations are
stored in the stub's table with the REGISTER_DISPATCH macro.

I've also moved vec256 to it's own folder and split up the
specializations before they get too unwieldy.

aten/src/ATen/cpu/vec256/vec256_base.h

aten/src/ATen/cpu/vec256/vec256_int.h

aten/src/ATen/native/cpu/ReduceOpsKernel.cpp

cpuhrsch · 2018-03-27T04:15:13Z

Generally the dispatch code looks more compact, which is nice. But, I think we might to revisit it again in the future. Having the ability to pass a table or function for non-standard cases is definitely good, but we might need to add support for a wider range of objects, such as templated functions (I'm not sure this supports that).

All calls to tbb will need to be prefaced by a call to the init_tbb_num_threads function as otherwise the scheduler will be initialized with the wrong number of threads. Maybe we should setup tbb so that this is required by making it unavailable anywhere outside of Parallel.h/.cpp and therefore force all devs to wrap the tbb calls into an ATen api call. Alternatively it would be great if this wasn't necessary to begin with, but I don't think this will happen any time soon. There are preview features that appear to attempt to support this, but I didn't get them to work correctly for our case.

Everything else looks good to me.

EDIT: I'm wondering if we could delay merging the dispatch and vec256 changes until after the Unary ops have been merged?

aten/src/ATen/native/cpu/ReduceOpsKernel.cpp

Perf numbers: https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88 I've changed the dispatch stub so that it doesn't require every kernel to be compiled for every instruction set. Kernel implementations are stored in the stub's table with the REGISTER_DISPATCH macro. I've also moved vec256 to it's own folder and split up the specializations before they get too unwieldy.

- Prefer signed integers. Mixing signed and unsigned integers is a pain and ATen mostly uses signed integers (int64_t). - Use inline lambda instead of struct for UnaryOps - Rename partial load overload "load_partial"

aten/src/ATen/native/cpu/UnaryOpsKernel.cpp

+  });
+}
+
+static void sqrt_kernel(Tensor& result, const Tensor& self) {


colesbury requested a review from cpuhrsch March 26, 2018 21:56

onnxbot-worker-3 mentioned this pull request Mar 26, 2018

[auto] pytorch-pr-6026 onnxbot/onnx-fb-universe#1314

Closed

colesbury force-pushed the vectorized_sum branch 3 times, most recently from 146234c to cad1390 Compare March 27, 2018 01:04