-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Speed up sum over a dimension #6026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
146234c to
cad1390
Compare
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Generally the dispatch code looks more compact, which is nice. But, I think we might to revisit it again in the future. Having the ability to pass a table or function for non-standard cases is definitely good, but we might need to add support for a wider range of objects, such as templated functions (I'm not sure this supports that). All calls to tbb will need to be prefaced by a call to the init_tbb_num_threads function as otherwise the scheduler will be initialized with the wrong number of threads. Maybe we should setup tbb so that this is required by making it unavailable anywhere outside of Parallel.h/.cpp and therefore force all devs to wrap the tbb calls into an ATen api call. Alternatively it would be great if this wasn't necessary to begin with, but I don't think this will happen any time soon. There are preview features that appear to attempt to support this, but I didn't get them to work correctly for our case. Everything else looks good to me. EDIT: I'm wondering if we could delay merging the dispatch and vec256 changes until after the Unary ops have been merged? |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
6eb1995 to
81ddaa6
Compare
Perf numbers: https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88 I've changed the dispatch stub so that it doesn't require every kernel to be compiled for every instruction set. Kernel implementations are stored in the stub's table with the REGISTER_DISPATCH macro. I've also moved vec256 to it's own folder and split up the specializations before they get too unwieldy.
- Prefer signed integers. Mixing signed and unsigned integers is a pain and ATen mostly uses signed integers (int64_t). - Use inline lambda instead of struct for UnaryOps - Rename partial load overload "load_partial"
81ddaa6 to
9b9386f
Compare
| }); | ||
| } | ||
|
|
||
| static void sqrt_kernel(Tensor& result, const Tensor& self) { |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
Perf numbers:
https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88
I've changed the dispatch stub so that it doesn't require every kernel
to be compiled for every instruction set. Kernel implementations are
stored in the stub's table with the REGISTER_DISPATCH macro.
I've also moved vec256 to it's own folder and split up the
specializations before they get too unwieldy.