Conversation
@ngimel ngimel commented Oct 7, 2019

Summary:

  1. Short-circuits the common-type computation and type-promotion logic for the common case where the operands and the result have the same type.
  2. Improves the performance of memory-overlap checking by returning MemoryOverlap::FULL when the tensors are the same, and skips the call from TensorIterator in that case.
  3. Changes the default inline size of DimVector from 5 to 6, so it does not need to be resized in the common binary-op case: the `strides` DimVector must hold at least 2*num_tensors elements, which is 6 for an operation with two inputs and one output.
  4. If `offset` is 0 (the common non-broadcasting case), skips zero-filling the `strides` vector, because all of its values will subsequently be overwritten.

These changes combined reduce the overhead of a simple in-place operation from 1.02 µs to 0.74 µs.
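Point 1 of the summary can be sketched as follows. This is an illustrative stand-in, not the actual ATen code: `ScalarType`, `promote`, and `compute_common_type` are simplified hypothetical names, and the real promotion lattice is much richer.

```cpp
#include <cassert>
#include <vector>

enum class ScalarType { Int, Float, Double };

// Full promotion lattice; only reached when operand types differ.
ScalarType promote(ScalarType a, ScalarType b) {
    if (a == ScalarType::Double || b == ScalarType::Double) return ScalarType::Double;
    if (a == ScalarType::Float || b == ScalarType::Float) return ScalarType::Float;
    return ScalarType::Int;
}

ScalarType compute_common_type(const std::vector<ScalarType>& operands) {
    // Fast path (point 1): if all operands already share one type,
    // skip the promotion lattice entirely.
    bool all_same = true;
    for (ScalarType t : operands) {
        if (t != operands.front()) { all_same = false; break; }
    }
    if (all_same) return operands.front();

    // Slow path: fold the promotion rule over all operands.
    ScalarType result = operands.front();
    for (ScalarType t : operands) result = promote(result, t);
    return result;
}
```

The same-type check is a single linear scan with trivially predictable branches, whereas the full promotion logic inspects every operand pair, which is where the saved overhead comes from.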

Test Plan: should be covered by existing tests

Differential Revision: D17784532
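Point 4 of the summary can be sketched like this. `copy_strides` and its signature are hypothetical, not the actual TensorIterator code; the real code writes into a reusable DimVector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// When an operand's strides are written into a shared vector at `offset`,
// the leading `offset` slots act as zero (broadcast) strides and must be
// cleared -- but when offset == 0, every slot is overwritten below, so the
// zero-fill pass can be skipped.
void copy_strides(std::vector<int64_t>& out, size_t offset,
                  const std::vector<int64_t>& src) {
    out.resize(offset + src.size());
    if (offset != 0) {
        // Broadcasting case: the leading slots act as zero strides and must
        // be cleared (the buffer may be reused and hold stale values).
        std::fill(out.begin(), out.begin() + offset, 0);
    }
    // Common offset == 0 case: every slot below is overwritten, so no
    // zero-fill pass is needed.
    for (size_t i = 0; i < src.size(); ++i) {
        out[offset + i] = src[i];
    }
}
```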

@pytorchbot pytorchbot added the module: internals (Related to internal abstractions in c10 and ATen) and module: operators labels Oct 7, 2019
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D17784532

@ngimel
Collaborator Author

ngimel commented Oct 7, 2019

Member

indentation

Contributor

Should this check be sunk into assert_no_partial_overlap?

Collaborator Author

I had already added a short-circuit in get_overlap_status, but it was behind the .contiguous call, so it was slower; I've now pulled it up there and removed it from here, and that seems to be ok.
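The same-tensor fast path under discussion can be sketched as below. The types and signature are illustrative stand-ins, not the actual ATen `get_overlap_status`; the point is only that the identity check happens before any expensive analysis.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins (not the actual ATen types).
enum class MemOverlapStatus { NO, PARTIAL, FULL, TOO_HARD };

struct TensorView {
    const void* data;   // base data pointer
    int64_t numel;      // number of elements viewed
};

MemOverlapStatus get_overlap_status(const TensorView& a, const TensorView& b) {
    // Fast path (point 2): identical tensors fully overlap, so the
    // expensive contiguity-based analysis can be skipped entirely.
    if (a.data == b.data && a.numel == b.numel) {
        return MemOverlapStatus::FULL;
    }
    // Fallback: a real implementation compares storages and strides here;
    // this sketch conservatively gives up.
    return MemOverlapStatus::TOO_HARD;
}
```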

… inputs and outputs of the same type (pytorch#27457)

Summary:
Pull Request resolved: pytorch#27457

1) Short-circuits the common-type computation and type-promotion logic for the common case where the operands and the result have the same type
2) Improves the performance of memory-overlap checking by returning MemoryOverlap::FULL when the tensors are the same, and skips the call from TensorIterator in that case
3) Changes the default inline size of DimVector from 5 to 6, so it does not need to be resized in the common binary-op case: the `strides` DimVector must hold at least 2*num_tensors elements, which is 6 for an operation with two inputs and one output
4) If `offset` is 0 (the common non-broadcasting case), skips zero-filling the `strides` vector, because all of its values will subsequently be overwritten

These changes combined reduce the overhead of a simple in-place operation from 1.02 µs to 0.74 µs.

Test Plan: should be covered by existing tests

Differential Revision: D17784532

fbshipit-source-id: 5e523e1892cfe4b7aee0d90cc8e820409c309e9f

struct CAFFE2_API TensorIterator {
using DimMask = std::bitset<64>;
using PtrVector = SmallVector<char*, 4>;
using StrideVector = SmallVector<int64_t, 6>;
Contributor
super nit: maybe move these magic numbers into named constants
super super nit: make PtrVector the same inline size as StrideVector
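The reviewer's suggestion could look like the following. The constant names are hypothetical, not actual ATen identifiers; the point is only that 6 is derived from the typical operand count rather than written as a literal.

```cpp
#include <cassert>

// Named constants instead of the magic 4 / 6 (illustrative names only).
constexpr int kTypicalOperands = 3;                         // 2 inputs + 1 output
constexpr int kInlineStrideCapacity = 2 * kTypicalOperands; // 6 stride slots

// With named constants, the aliases in TensorIterator could read:
// using PtrVector    = SmallVector<char*,   kInlineStrideCapacity>;
// using StrideVector = SmallVector<int64_t, kInlineStrideCapacity>;
```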

Contributor

@facebook-github-bot facebook-github-bot left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 16, 2019
… inputs and outputs of the same type (#27457)

Summary: (same as the pull-request summary above)

Pull Request resolved: pytorch/pytorch#27457

Test Plan: should be covered by existing tests

Differential Revision: D17784532

Pulled By: ngimel

fbshipit-source-id: e6a8ee58be5de14461bdbc2e2b0b6d16a96c309f
@facebook-github-bot
Contributor

@ngimel merged this pull request in 174e1ba.

@ngimel ngimel deleted the export-D17784532 branch December 5, 2019 03:48
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
… inputs and outputs of the same type (pytorch#27457)

Summary: (same as the pull-request summary above)

Pull Request resolved: pytorch#27457

Test Plan: should be covered by existing tests

Differential Revision: D17784532

Pulled By: ngimel

fbshipit-source-id: e6a8ee58be5de14461bdbc2e2b0b6d16a96c309f
Labels

Merged · module: internals (Related to internal abstractions in c10 and ATen)

7 participants