Enable non-synchronizing cub scan for cum* operations #42036
Conversation
constexpr int max_cub_size = std::numeric_limits<int>::max() / 2 + 1; // 2**30
for (int64_t i = 0; i < size; i += max_cub_size) {
  int size_cub = std::min<int64_t>(size - i, max_cub_size);
  Tensor first_elem; // need to save it for all iterations other than first
Very clever :)
// need to temporarily transform first element of the range we are
// operating on; self might be multi-d, but we need to index a single
// element
auto self_view = at::_unsafe_view(self, -1);
Why view self inside the loop?
In most cases we won't enter this loop; in some others we'll enter it once. Taking the view out of the loop would make us take it unconditionally, which we don't want.
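To illustrate the point (a minimal sketch, not the PR's actual code; names follow the snippets above): only multi-chunk tensors ever reach the branch that needs the flattened view, so taking the view lazily keeps the common single-chunk path free of it.

```cpp
// Sketch only: the flat view is taken on the rare multi-chunk path.
for (int64_t i = 0; i < size; i += max_cub_size) {
  if (i > 0) {
    // Reached only when the tensor spans more than one chunk.
    auto self_view = at::_unsafe_view(self, -1);
    // ... patch self_view[i] with the previous chunk's carry ...
  }
  // ... run the cub scan on the current chunk ...
}
```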
x_cpu = x.cpu().float()
expected = fn(x_cpu)
actual = fn(x).cpu().float()
self.assertEqual(expected, actual.cpu().float())
Suggested change:
- self.assertEqual(expected, actual.cpu().float())
+ self.assertEqual(expected, actual)
I always forget whether assertEqual can handle different devices and dtypes
You are already doing .cpu().float() in the line above, so no need to do it here again.
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mruberry left a comment:
LGTM!
This uses cub for cum* operations because, unlike thrust, cub is non-synchronizing.
Cub does not support tensors with more than 2**31 elements out of the box (in fact, due to cub bugs the cutoff point is even smaller). To support larger tensors, I split the tensor into 2**30-element chunks and modify the first value of the second and subsequent chunks to contain the cumulative result of the previous chunks. Since this modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is restored to its original state, the source tensor will be corrupted; but in most cases errors will invalidate the full CUDA context anyway.
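A minimal sketch of the chunking scheme for cumsum (illustrative only, assuming a 1-d tensor `self`, a preallocated output `result` of the same size, and a hypothetical helper `cub_inclusive_scan` wrapping `cub::DeviceScan::InclusiveSum`; the real implementation lives in the CUDA scan kernels):

```cpp
constexpr int64_t max_cub_size = int64_t{1} << 30; // 2**30 elements per chunk
Tensor first_elem; // original first element of the current chunk
for (int64_t i = 0; i < size; i += max_cub_size) {
  int64_t size_cub = std::min(size - i, max_cub_size);
  if (i > 0) {
    // Fold the cumulative result of all previous chunks into the first
    // element of this chunk (in place), saving the original value.
    first_elem = self[i].clone();
    self[i] += result[i - 1];
  }
  // cub_inclusive_scan is a hypothetical wrapper around cub; each call
  // scans at most 2**30 elements, which cub handles with int indexing.
  cub_inclusive_scan(self.narrow(0, i, size_cub),
                     result.narrow(0, i, size_cub));
  if (i > 0) {
    self[i].copy_(first_elem); // restore the source to its original state
  }
}
```

If an error is thrown between the in-place update and the restore, `self` is left modified, which is the corruption caveat mentioned above.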