Enable non-synchronizing cub scan for cum* operations #42036
Conversation
constexpr int max_cub_size = std::numeric_limits<int>::max() / 2 + 1; // 2**30
for (int64_t i = 0; i < size; i += max_cub_size) {
  int size_cub = std::min<int64_t>(size - i, max_cub_size);
  Tensor first_elem; // need to save it for all iterations other than first
Very clever :)
// need to temporarily transform first element of the range we are
// operating on; self might be multi-d, but we need to index a single
// element
auto self_view = at::_unsafe_view(self, -1);
Why view self inside the loop?
In most cases we won't enter this loop; in some others we'll enter it once. Taking the view out of the loop would make us take it unconditionally, which we don't want.
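To illustrate the point (a minimal sketch, not the PR's actual code; names follow the snippets above): only multi-chunk tensors ever reach the branch that needs the flattened view, so taking the view lazily keeps the common single-chunk path free of it.

```cpp
// Sketch only: the flat view is taken on the rare multi-chunk path.
for (int64_t i = 0; i < size; i += max_cub_size) {
  if (i > 0) {
    // Reached only when the tensor spans more than one chunk.
    auto self_view = at::_unsafe_view(self, -1);
    // ... patch self_view[i] with the previous chunk's carry ...
  }
  // ... run the cub scan on the current chunk ...
}
```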
x_cpu = x.cpu().float()
expected = fn(x_cpu)
actual = fn(x).cpu().float()
self.assertEqual(expected, actual.cpu().float())
Suggested change:
- self.assertEqual(expected, actual.cpu().float())
+ self.assertEqual(expected, actual)
I always forget whether assertEqual can handle different devices and dtypes
You are already doing .cpu().float() in the line above, so no need to do it here again.
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mruberry left a comment:
LGTM!
This uses cub for cum* operations because, unlike thrust, cub is non-synchronizing.
Cub does not support tensors with more than 2**31 elements out of the box (in fact, due to cub bugs the cutoff point is even smaller). To support larger tensors, I split the tensor into 2**30-element chunks and modify the first value of the second and subsequent chunks to contain the cumulative result of the previous chunks. Since this modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is restored to its original state, the source tensor will be corrupted; but in most cases errors will invalidate the full CUDA context anyway.
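A minimal sketch of the chunking scheme for cumsum (illustrative only, assuming a 1-d tensor `self`, a preallocated output `result` of the same size, and a hypothetical helper `cub_inclusive_scan` wrapping `cub::DeviceScan::InclusiveSum`; the real implementation lives in the CUDA scan kernels):

```cpp
constexpr int64_t max_cub_size = int64_t{1} << 30; // 2**30 elements per chunk
Tensor first_elem; // original first element of the current chunk
for (int64_t i = 0; i < size; i += max_cub_size) {
  int64_t size_cub = std::min(size - i, max_cub_size);
  if (i > 0) {
    // Fold the cumulative result of all previous chunks into the first
    // element of this chunk (in place), saving the original value.
    first_elem = self[i].clone();
    self[i] += result[i - 1];
  }
  // cub_inclusive_scan is a hypothetical wrapper around cub; each call
  // scans at most 2**30 elements, which cub handles with int indexing.
  cub_inclusive_scan(self.narrow(0, i, size_cub),
                     result.narrow(0, i, size_cub));
  if (i > 0) {
    self[i].copy_(first_elem); // restore the source to its original state
  }
}
```

If an error is thrown between the in-place update and the restore, `self` is left modified, which is the corruption caveat mentioned above.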