[BC-breaking] Use ScatterGatherKernel for scatter_reduce (CPU-only) #74226
Conversation
CI failures summary (Dr. CI): as of commit 241b898, 1 new failure was recognized by patterns; it does not appear to be due to an upstream breakage.
Update signature of `scatter_reduce_` to match `scatter_`/`scatter_add_`:

`Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce, *, bool include_input=True)`

- Update `scatter_reduce` to call into the cpu/cuda kernels for `scatter.reduce`
- `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_`
- Add an argument `include_input` which indicates whether the value in the `self` Tensor at a given position is included in the reduction with the elements from `src` scattered to that position (see the sketch after this description). Formally, let
  - `I_self = {all indices of self}`
  - `I_src = {all indices of src}`
  - `S = {indices of self modified by the scatter}`
  - `self_indices_to_src_indices : I_self --> I_src` map each index of `self` to the tuple of indices in `src` scattered to it.

  Then for `s ∈ S` and `t ∈ I_self \ S`, when `include_input=False`:

  `self[s] = reduction_op(src[self_indices_to_src_indices[s]])`
  `self[t] = self[t]`

  and when `include_input=True` (the regular `scatter_(reduce=op)` behavior):

  `self[s] = reduction_op(self[s], src[self_indices_to_src_indices[s]])`
  `self[t] = self[t]`

The [`optional_out` case of pytorch_scatter.scatter](https://github.com/rusty1s/pytorch_scatter/blob/master/csrc/scatter.cpp#L32) can then be handled by

`torch.zeros(shape).scatter_reduce_(dim, index, src, reduce, include_input=False)`
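To make the `include_input` semantics concrete, here is a minimal sketch using the ATen C++ API. This is illustrative only: the flag shipped in released PyTorch under the name `include_self` (which is what the code below uses), and the expected values in the comments assume the `"sum"` reduction.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto src   = torch::tensor({1.0, 2.0, 3.0, 4.0});
  auto index = torch::tensor({0, 0, 1, 1});  // int64 by default
  auto base  = torch::full({3}, 10.0);

  // include_input=True: self[s] joins the reduction.
  // out[0] = 10 + 1 + 2 = 13, out[1] = 10 + 3 + 4 = 17, out[2] = 10 (t ∈ I_self \ S)
  auto with_self = base.clone().scatter_reduce_(0, index, src, "sum",
                                                /*include_self=*/true);

  // include_input=False: only the scattered src values are reduced.
  // out[0] = 1 + 2 = 3, out[1] = 3 + 4 = 7, out[2] = 10 (untouched)
  auto without_self = base.clone().scatter_reduce_(0, index, src, "sum",
                                                   /*include_self=*/false);

  std::cout << with_self << "\n" << without_self << "\n";
  return 0;
}
```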
`auto index_stride = self_dim_stride;`
…
`AT_DISPATCH_FLOATING_TYPES_AND2(`
If the only difference between this kernel and `cuda_scatter_gather_base_kernel` is `AT_DISPATCH_FLOATING_TYPES_AND2` vs `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3`, it might be easier to explicitly check for a specific scalar type and error out in a unified kernel than to copy-paste the code and change just this line.
The error message to throw can be found in the dispatch macro definitions.
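For illustration, a minimal sketch of that suggestion. The names `scatter_reduce_kernel_sketch` and `needs_floating_point` are hypothetical stand-ins (this is not the actual PyTorch kernel code); the point is the check-then-single-dispatch pattern.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

// Hypothetical unified kernel: reject unsupported scalar types up front,
// then dispatch once over the full dtype set, instead of maintaining a
// copy-pasted kernel whose only difference is the dispatch macro.
void scatter_reduce_kernel_sketch(const at::Tensor& self,
                                  bool needs_floating_point) {
  if (needs_floating_point) {
    TORCH_CHECK(c10::isFloatingType(self.scalar_type()),
                "scatter_reduce(): reduction not implemented for ",
                self.scalar_type());
  }
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
      at::ScalarType::Half, at::ScalarType::BFloat16, at::ScalarType::Bool,
      self.scalar_type(), "scatter_reduce_sketch", [&] {
        // shared kernel body, instantiated once per supported scalar_t
        (void)sizeof(scalar_t);
      });
}
```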
cpuhrsch left a comment:
Looks pretty good, but I'd try to avoid the code duplication of copy-pasting `cpu_scatter_gather_base_kernel` to deal with the different set of dtypes.
`cpu_scatter_gather_base_kernel<>()(self, dim, index, value,`
`  "scatter_scalar_reduce_multiply_", reduce_multiply);`
`break;`
`default:`
Why did you add this?
Without this, the build on CI will fail. I think it's due to this:
> Using the `-Werror` compiler flag, a switch statement over a value of an enum type without a `default` label will fail to compile if any enumerator of the enum doesn't have a corresponding case. This is sometimes called an exhaustive or defaultless switch statement.
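For reference, a standalone reproduction of that behavior (not the kernel code itself): compiled with something like `g++ -c -Wall -Werror`, the switch below only builds because of the `default:` label.

```cpp
// Minimal reproduction: with -Wall -Werror, GCC's -Wswitch turns a
// defaultless switch that omits an enumerator into a hard error, e.g.
//   error: enumeration value 'Mean' not handled in switch [-Werror=switch]
// Adding the `default:` label (as done in the kernel) makes the build pass.
enum class ReduceOp { Add, Multiply, Mean };

const char* name(ReduceOp op) {
  switch (op) {
    case ReduceOp::Add:      return "add";
    case ReduceOp::Multiply: return "multiply";
    default:                 return "unknown";  // covers ReduceOp::Mean
  }
}
```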
cpuhrsch left a comment:
Looks great, just added two small comments.
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` [ghstack-poisoned]
|
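For reference, a small sketch of how the new reduction options surface through the ATen C++ API. The set of `reduce` strings shown ("sum", "prod", "mean", "amax", "amin") is taken from the released `scatter_reduce` documentation, not from this thread, so treat it as an assumption about the final option set.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto src   = torch::tensor({1.0, 2.0, 3.0, 4.0});
  auto index = torch::tensor({0, 0, 1, 1});
  auto base  = torch::ones({2});

  // Each reduce string selects a different reduction in the kernel. With the
  // released default include_self=true, the base values join the reduction.
  for (const char* reduce : {"sum", "prod", "mean", "amax", "amin"}) {
    auto out = base.clone().scatter_reduce_(0, index, src, reduce);
    std::cout << reduce << ":\n" << out << "\n";
  }
  return 0;
}
```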
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` Differential Revision: [D35222842](https://our.internmc.facebook.com/intern/diff/D35222842) [ghstack-poisoned]
|
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` Differential Revision: [D35222842](https://our.internmc.facebook.com/intern/diff/D35222842) [ghstack-poisoned]
|
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` Differential Revision: [D35222842](https://our.internmc.facebook.com/intern/diff/D35222842) [ghstack-poisoned]
|
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` Differential Revision: [D35222842](https://our.internmc.facebook.com/intern/diff/D35222842) [ghstack-poisoned]
|
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…CPU-only)" Update signature of `scatter_reduce_` to match `scatter_/scatter_add_` `Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)` - Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce` - `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_` - Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py` Differential Revision: [D35222842](https://our.internmc.facebook.com/intern/diff/D35222842) [ghstack-poisoned]
|
@mikaylagawarecki has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Summary:
Pull Request resolved: #74226

Update signature of `scatter_reduce_` to match `scatter_`/`scatter_add_`:
`Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)`
- Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce`
- `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_`
- Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py`

Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D35222842
Pulled By: mikaylagawarecki
fbshipit-source-id: 84930add2ad30baf872c495251373313cb7428bd
Stack from ghstack:
Update signature of
scatter_reduce_to matchscatter_/scatter_add_Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)scatter_reduceto call into the cpu kernel forscatter.reducescatter_reducenow has the same shape constraints asscatter_andscatter_add_test/test_torch.py:test_scatter_reducetotest/test_scatter_gather_ops.pyDifferential Revision: D35222842