Small fixes to improve TensorIterator overhead for the common case of inputs and outputs of the same type #27457
Conversation
This pull request was exported from Phabricator. Differential Revision: D17784532
indentation
Should this check be sunk into assert_no_partial_overlap?
I had already added a short-circuit in get_overlap_status, but it sat behind .contiguous, so it was slower. I've now pulled it up there and removed it from here; seems to be OK.
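The discussion above is about checking the cheap identity case before any expensive stride analysis. A minimal sketch of that idea follows; the type and struct names here are simplified stand-ins (the real PyTorch code in MemOverlap.h differs in detail), but get_overlap_status and the FULL status come from the thread itself:

```cpp
// Illustrative sketch only: TensorRef is a hypothetical stand-in for
// the real tensor type; the overlap enum mirrors MemoryOverlap::FULL
// mentioned in the PR summary.
enum class MemOverlapStatus { NO, PARTIAL, FULL };

struct TensorRef {
  const void* data;  // base data pointer; sizes/strides elided
};

// Cheap identity check first: if both operands are literally the same
// tensor, they fully overlap, so the slower contiguity/stride walk
// (the part that was "behind .contiguous") is skipped entirely.
MemOverlapStatus get_overlap_status(const TensorRef& a, const TensorRef& b) {
  if (&a == &b || a.data == b.data) {
    return MemOverlapStatus::FULL;  // short-circuit before any stride analysis
  }
  // ... fall through to the slower stride-based overlap analysis ...
  return MemOverlapStatus::NO;
}
```

The point of hoisting the check is that the identity comparison costs two pointer compares, while the full analysis has to materialize contiguity information first.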
… inputs and outputs of the same type (pytorch#27457)

Summary:
Pull Request resolved: pytorch#27457

1) Short-circuits computing the common type and the type-promotion logic for the common case of operands and result of the same type.
2) Improves the performance of the memory-overlap check by returning MemoryOverlap::FULL if the tensors are the same, and skips the call from TensorIterator when the tensors are the same.
3) Changes the default size of DimVector from 5 to 6, so it need not be resized in the common case of a binary operation. The `strides` DimVector is forced to have at least 2*num_tensors elements, which for an operation with 2 inputs and one output is 6.
4) If `offset` is 0 (the common non-broadcasting case), don't fill the `strides` vector with zeros, because all the values will subsequently be written to.

These changes combined reduce the overhead from 1.02 us to 0.74 us for a simple in-place operation.

Test Plan: should be covered by existing tests

Differential Revision: D17784532

fbshipit-source-id: 5e523e1892cfe4b7aee0d90cc8e820409c309e9f
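Point 1 of the summary is the classic fast path for type promotion: when every operand already has the same dtype, the common type is known immediately and the promotion lattice is never consulted. A minimal sketch, assuming a toy ScalarType enum and promote() function in place of PyTorch's real c10 machinery:

```cpp
#include <vector>

// Hypothetical stand-ins for c10::ScalarType and promoteTypes.
enum class ScalarType { Int, Float, Double };

ScalarType promote(ScalarType a, ScalarType b) {
  // Toy promotion rule: equal types stay put, mixed types widen to Double.
  return a == b ? a : ScalarType::Double;
}

// Fast path first: if all operands share a dtype, return it immediately
// and skip the pairwise promotion walk entirely.
ScalarType compute_common_dtype(const std::vector<ScalarType>& operands) {
  bool all_same = true;
  for (ScalarType t : operands) {
    if (t != operands.front()) { all_same = false; break; }
  }
  if (all_same) {
    return operands.front();  // short-circuit: no promotion logic needed
  }
  ScalarType common = operands.front();
  for (ScalarType t : operands) {
    common = promote(common, t);
  }
  return common;
}
```

The same-type scan is a handful of equality comparisons, which is why it pays off for the dominant case of homogeneous operands.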
```cpp
struct CAFFE2_API TensorIterator {
  using DimMask = std::bitset<64>;
  using PtrVector = SmallVector<char*, 4>;
  using StrideVector = SmallVector<int64_t, 6>;
```
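The inline capacities above (4 and 6) matter because a SmallVector<T, N> stores up to N elements in-place and only heap-allocates beyond that. Picking 6 for StrideVector matches the 2*num_tensors requirement from the summary, so a typical binary op (2 inputs + 1 output) never allocates. A sketch of that arithmetic, with hypothetical constant names along the lines the reviewer suggested:

```cpp
#include <cstddef>

// Hypothetical named constants for the magic numbers in the diff.
constexpr std::size_t kTypicalTensors = 3;                        // 2 inputs + 1 output
constexpr std::size_t kStrideInlineSize = 2 * kTypicalTensors;    // = 6, matches SmallVector<int64_t, 6>

// The strides vector needs at least 2*num_tensors entries; it stays in
// the SmallVector's inline storage (no heap allocation) only while that
// requirement fits within the inline capacity.
bool strides_fit_inline(std::size_t num_tensors) {
  return 2 * num_tensors <= kStrideInlineSize;
}
```

With inline size 5 (the previous DimVector default), the 6-entry strides vector for a binary op would spill to the heap on every call, which is exactly the resize the PR avoids.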
super nit: maybe move these magic numbers into constants with good names
super super nit: make PtrVector the same size as StrideVector
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.