
@mingfeima
Collaborator

@mingfeima mingfeima commented Aug 3, 2022

Stack from ghstack:

Motivation of this PR

This PR targets improving the performance of scatter_add for GNN usage scenarios on PyG. Currently only CPU optimizations are covered.

Message Passing is the major step in GNN learning: it exchanges/aggregates info between nodes. From the perf point of view, if the EdgeIndex is stored as [2, num_edges], scatter_reduce is a major perf hotspot in the current PyTorch implementation.

To be more specific, in the process of message passing, scatter_add is used in a very similar way to index_select, except that the self tensor is written to, while index_select only reads. Therefore, the index tensor passed to scatter_add is an expanded tensor on dim0, which means all the remaining dims end up with the same value.
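
For illustration, here is a minimal sketch of that index pattern (toy sizes and tensor names are made up for this example, not taken from the PR):

```python
import torch

num_nodes, num_edges, feat = 5, 8, 16
src = torch.randn(num_edges, feat)                 # per-edge messages
row = torch.randint(0, num_nodes, (num_edges,))    # destination node id of each edge

# The index handed to scatter_add_ is the 1-D row id expanded over the feature
# dimension, so for a given row every remaining dim holds the same value.
index = row.view(-1, 1).expand(-1, feat)

out = torch.zeros(num_nodes, feat)
out.scatter_add_(0, index, src)                    # aggregate messages per node

# The read-side analogue mentioned above: gather with the same 1-D index.
gathered = out.index_select(0, row)
```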

Algorithm

The current scatter implementation parallelizes over the inner dims for such a case, which causes bad perf: a non-contiguous memory access pattern and no vectorization.

This PR sorts the index to resolve the write conflicts that arise when parallelizing directly on dim0. The algorithm is equivalent to the following (a sketch follows the list):

  • convert memory format from COO to CSR
  • do spmm reduce
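
A minimal Python sketch of this equivalence, assuming a 1-D row index (the real kernel is C++ in ScatterGatherKernel.cpp and uses a radix sort plus parallel_for; the function name and the torch.sort stand-in below are illustrative only):

```python
import torch

def scatter_add_via_csr(out, row, src):
    """Illustrative equivalent of out.scatter_add_(0, row.view(-1, 1).expand_as(src), src):
    sort the 1-D row index (COO -> CSR), then reduce each output row independently,
    which makes parallelization over dim0 free of write conflicts."""
    # 1. sort the index (the PR uses a radix sort; torch.sort stands in here)
    sorted_row, perm = torch.sort(row)
    sorted_src = src[perm]

    # 2. build CSR row pointers from the sorted COO row ids
    counts = torch.bincount(sorted_row, minlength=out.size(0))
    rowptr = torch.zeros(out.size(0) + 1, dtype=torch.long)
    rowptr[1:] = torch.cumsum(counts, dim=0)

    # 3. spmm-style reduce: each output row sums its own contiguous slice of src
    #    (in the C++ kernel this is the loop that gets parallelized over dim0)
    for i in range(out.size(0)):
        beg, end = rowptr[i].item(), rowptr[i + 1].item()
        if end > beg:
            out[i] += sorted_src[beg:end].sum(dim=0)
    return out
```

On the toy tensors from the earlier snippet, scatter_add_via_csr(torch.zeros(num_nodes, feat), row, src) matches out up to floating-point summation order.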

Perf improvement

The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples: python reddit.py, which runs the SAGE model on the Reddit dataset.

CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

aten::scatter_add_ self CPU time has been reduced from 37.797s to 5.989s:

  • breakdown before
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::scatter_add_        49.00%       37.797s        49.00%       37.797s      41.445ms           912
                                     aten::index_select        19.74%       15.223s        19.74%       15.227s       6.678ms          2280
                                           aten::linear         0.01%       5.706ms        15.04%       11.602s      12.721ms           912
                                            aten::addmm         6.62%        5.108s         7.92%        6.112s      13.403ms           456
                                           aten::matmul         0.00%       2.339ms         7.10%        5.475s      12.006ms           456
  • breakdown after
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::index_select        32.41%       14.677s        32.42%       14.681s       6.439ms          2280
                                           aten::linear         0.01%       6.665ms        26.43%       11.968s      13.123ms           912
                                            aten::addmm        11.76%        5.328s        13.76%        6.232s      13.667ms           456
                                     aten::scatter_add_        13.22%        5.989s        13.22%        5.989s       6.566ms           912
                                           aten::matmul         0.01%       2.303ms        12.63%        5.720s      12.543ms           456
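
For reference, a self-contained sketch of how a breakdown in the format above can be collected with torch.profiler (toy sizes, not the Reddit benchmark; the shapes are invented for this example):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for the message-passing hotspot: a scatter_add_ with an expanded
# index, the same pattern reddit.py exercises at a much larger scale.
num_nodes, num_edges, feat = 1_000, 100_000, 64
src = torch.randn(num_edges, feat)
row = torch.randint(0, num_nodes, (num_edges,))
index = row.view(-1, 1).expand(-1, feat)
out = torch.zeros(num_nodes, feat)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    out.scatter_add_(0, index, src)

# Prints a table in the same format as the breakdowns above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```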

cc @VitalyFedyunin @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 3, 2022

❌ 15 New Failures

As of commit 97c4f58 (more details on the Dr. CI page):

  • 15/15 failures introduced in this PR

🕵️ 15 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge) (1/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:43:48.1471359Z RuntimeError: test_ops failed!
2022-08-15T02:43:47.3129230Z FAILED test_ops.py::TestCommonCPU::test_noncontiguous_samples_scatter_add_cpu_float32
2022-08-15T02:43:47.3129825Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:43:47.3130149Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:43:47.3134495Z = 1 failed, 1182 passed, 294 skipped, 8 xfailed, 61 warnings, 2 rerun in 125.58s (0:02:05) =
2022-08-15T02:43:47.3379688Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:43:48.1466833Z Traceback (most recent call last):
2022-08-15T02:43:48.1469113Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:43:48.1469308Z     main()
2022-08-15T02:43:48.1469504Z   File "test/run_test.py", line 968, in main
2022-08-15T02:43:48.1471087Z     raise RuntimeError(err_message)
2022-08-15T02:43:48.1471359Z RuntimeError: test_ops failed!
2022-08-15T02:43:48.4494770Z 
2022-08-15T02:43:48.4495011Z real	2m12.306s
2022-08-15T02:43:48.4495272Z user	11m18.926s
2022-08-15T02:43:48.4495443Z sys	0m17.345s
2022-08-15T02:43:48.4527143Z ##[error]Process completed with exit code 1.
2022-08-15T02:43:48.4563675Z Prepare all required actions
2022-08-15T02:43:48.4564000Z Getting action download info
2022-08-15T02:43:48.6382096Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:43:48.6382319Z with:
2022-08-15T02:43:48.6382647Z   github-token: ***

See GitHub Actions build pull / win-vs2019-cuda11.6-py3 / build (2/15)

Step: "Build" (full log | diagnosis details)

2022-08-15T02:42:39.7559263Z C:\actions-runner\...error C3861: '__builtin_clz': identifier not found
2022-08-15T02:42:39.7553626Z         with
2022-08-15T02:42:39.7553891Z         [
2022-08-15T02:42:39.7554174Z             _Ty=int64_t,
2022-08-15T02:42:39.7554485Z             K=int64_t,
2022-08-15T02:42:39.7554765Z             V=int64_t
2022-08-15T02:42:39.7555045Z         ]
2022-08-15T02:42:39.7555939Z C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp(711): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::cpu_scatter_add_contig_kernel<scalar_t>(const at::Tensor &,const at::Tensor &,const at::Tensor &)' being compiled
2022-08-15T02:42:39.7557069Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): error C2131: expression did not evaluate to a constant
2022-08-15T02:42:39.7557854Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): note: failure was caused by a read of a variable outside its lifetime
2022-08-15T02:42:39.7558606Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): note: see usage of 'maxthreads'
2022-08-15T02:42:39.7559263Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(151): error C3861: '__builtin_clz': identifier not found
2022-08-15T02:42:40.6957750Z [4998/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\passes\bailout_graph.cpp.obj
2022-08-15T02:42:40.8620701Z [4999/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\passes\batch_mm.cpp.obj
2022-08-15T02:42:41.1086670Z [5000/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\autograd\generated\VariableType_0.cpp.obj
2022-08-15T02:42:41.7574944Z [5001/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\autograd\generated\Functions.cpp.obj
2022-08-15T02:42:41.9333048Z [5002/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\autograd\generated\VariableType_4.cpp.obj
2022-08-15T02:42:42.0976104Z [5003/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\autograd\generated\VariableType_2.cpp.obj
2022-08-15T02:42:42.2086477Z [5004/6393] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\autograd\generated\VariableType_1.cpp.obj
2022-08-15T02:42:42.2280204Z ninja: build stopped: subcommand failed.
2022-08-15T02:42:42.2387935Z -- Building version 1.13.0a0+git97c4f58
2022-08-15T02:42:42.2389628Z cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.6-py3 -DBUILD_PYTHON=True -DBUILD_SPLIT_CUDA=ON -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_GENERATOR=Ninja -DCMAKE_INCLUDE_PATH=C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\mkl\include -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=1.13.0a0+git97c4f58 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge) (3/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:51:08.0500586Z RuntimeError: /var.../jenkins/workspace/functorch/test/test_ops failed!
2022-08-15T02:51:06.6731040Z 
2022-08-15T02:51:06.6731156Z FAILED (errors=61, skipped=2191, expected failures=482)
2022-08-15T02:51:06.6731161Z 
2022-08-15T02:51:06.6731243Z Generating XML reports...
2022-08-15T02:51:07.5216594Z Generated XML report: test-reports/python-unittest/functorch.test.test_ops/TEST-TestOperatorsCPU-20220815023902.xml
2022-08-15T02:51:08.0496230Z Traceback (most recent call last):
2022-08-15T02:51:08.0496510Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:51:08.0498392Z     main()
2022-08-15T02:51:08.0498597Z   File "test/run_test.py", line 968, in main
2022-08-15T02:51:08.0500314Z     raise RuntimeError(err_message)
2022-08-15T02:51:08.0500586Z RuntimeError: /var/lib/jenkins/workspace/functorch/test/test_ops failed!
2022-08-15T02:51:08.3249670Z ##[error]Process completed with exit code 1.
2022-08-15T02:51:08.3284668Z Prepare all required actions
2022-08-15T02:51:08.3284963Z Getting action download info
2022-08-15T02:51:08.5044917Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:51:08.5045142Z with:
2022-08-15T02:51:08.5045465Z   github-token: ***
2022-08-15T02:51:08.5045715Z env:
2022-08-15T02:51:08.5045929Z   GIT_DEFAULT_BRANCH: master
2022-08-15T02:51:08.5046100Z ##[endgroup]
2022-08-15T02:51:08.5071332Z ##[group]Run nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge) (4/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:43:42.9554218Z RuntimeError: test_ops failed!
2022-08-15T02:43:42.2113320Z FAILED test_ops.py::TestCommonCPU::test_noncontiguous_samples_scatter_add_cpu_float32
2022-08-15T02:43:42.2113976Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:43:42.2114659Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:43:42.2119273Z = 1 failed, 1229 passed, 392 skipped, 8 xfailed, 63 warnings, 2 rerun in 128.03s (0:02:08) =
2022-08-15T02:43:42.2321172Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:43:42.9548962Z Traceback (most recent call last):
2022-08-15T02:43:42.9549286Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:43:42.9551446Z     main()
2022-08-15T02:43:42.9551639Z   File "test/run_test.py", line 968, in main
2022-08-15T02:43:42.9553959Z     raise RuntimeError(err_message)
2022-08-15T02:43:42.9554218Z RuntimeError: test_ops failed!
2022-08-15T02:43:43.2098369Z 
2022-08-15T02:43:43.2098895Z real	2m14.114s
2022-08-15T02:43:43.2099111Z user	10m51.967s
2022-08-15T02:43:43.2099344Z sys	0m16.351s
2022-08-15T02:43:43.2131447Z ##[error]Process completed with exit code 1.
2022-08-15T02:43:43.2169566Z Prepare all required actions
2022-08-15T02:43:43.2169869Z Getting action download info
2022-08-15T02:43:43.3961681Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:43:43.3961902Z with:
2022-08-15T02:43:43.3962222Z   github-token: ***

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge) (5/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:42:03.5075311Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:02.8786218Z FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_fwgrad_bwgrad_cummax_cpu_float64
2022-08-15T02:42:02.8787781Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:42:02.8788822Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:42:02.8793684Z = 1 failed, 66 passed, 930 skipped, 1 xfailed, 27 warnings, 2 rerun in 34.73s ==
2022-08-15T02:42:02.9055754Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:42:03.5069397Z Traceback (most recent call last):
2022-08-15T02:42:03.5069851Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:42:03.5072266Z     main()
2022-08-15T02:42:03.5072623Z   File "test/run_test.py", line 968, in main
2022-08-15T02:42:03.5074913Z     raise RuntimeError(err_message)
2022-08-15T02:42:03.5075311Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:03.8000210Z 
2022-08-15T02:42:03.8000603Z real	0m40.792s
2022-08-15T02:42:03.8000905Z user	2m49.626s
2022-08-15T02:42:03.8002234Z sys	0m11.847s
2022-08-15T02:42:03.8033398Z ##[error]Process completed with exit code 1.
2022-08-15T02:42:03.8068039Z Prepare all required actions
2022-08-15T02:42:03.8068319Z Getting action download info
2022-08-15T02:42:03.9528326Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:42:03.9528545Z with:
2022-08-15T02:42:03.9528851Z   github-token: ***

See GitHub Actions build Lint / lintrunner (6/15)

Step: "Run lintrunner on all files" (full log | diagnosis details)

2022-08-15T02:30:26.3018707Z ##[error]This line has trailing spaces; please remove them.
2022-08-15T02:30:26.2564713Z ##[group]Run # Use jq to massage the JSON lint output into GitHub Actions workflow commands.
2022-08-15T02:30:26.2565692Z # Use jq to massage the JSON lint output into GitHub Actions workflow commands.
2022-08-15T02:30:26.2566128Z jq --raw-output \
2022-08-15T02:30:26.2566715Z   '"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \
2022-08-15T02:30:26.2567207Z   lint.json
2022-08-15T02:30:26.2612405Z shell: /usr/bin/bash -e {0}
2022-08-15T02:30:26.2612694Z env:
2022-08-15T02:30:26.2612951Z   pythonLocation: /opt/hostedtoolcache/Python/3.8.13/x64
2022-08-15T02:30:26.2613383Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.13/x64/lib
2022-08-15T02:30:26.2613678Z ##[endgroup]
2022-08-15T02:30:26.3018707Z ##[error]This line has trailing spaces; please remove them.
2022-08-15T02:30:26.3103813Z Post job cleanup.
2022-08-15T02:30:26.3134056Z Post job cleanup.
2022-08-15T02:30:26.4172869Z [command]/usr/bin/git version
2022-08-15T02:30:26.4220802Z git version 2.37.1
2022-08-15T02:30:26.4264122Z Temporarily overriding HOME='/home/runner/work/_temp/232cbb5d-f0f2-488a-b1b4-234c41b48133' before making global git config changes
2022-08-15T02:30:26.4264708Z Adding repository directory to the temporary git global config as a safe directory
2022-08-15T02:30:26.4270855Z [command]/usr/bin/git config --global --add safe.directory /home/runner/work/pytorch/pytorch
2022-08-15T02:30:26.4318871Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2022-08-15T02:30:26.4361608Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2022-08-15T02:30:26.4613515Z Entering 'android/libs/fbjni'

See GitHub Actions build pull / win-vs2019-cpu-py3 / build (7/15)

Step: "Build" (full log | diagnosis details)

2022-08-15T02:41:04.0276255Z C:\actions-runner\...error C3861: '__builtin_clz': identifier not found
2022-08-15T02:41:04.0272874Z         with
2022-08-15T02:41:04.0273019Z         [
2022-08-15T02:41:04.0273208Z             _Ty=int64_t,
2022-08-15T02:41:04.0273390Z             K=int64_t,
2022-08-15T02:41:04.0273547Z             V=int64_t
2022-08-15T02:41:04.0273702Z         ]
2022-08-15T02:41:04.0274206Z C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp(711): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::cpu_scatter_add_contig_kernel<scalar_t>(const at::Tensor &,const at::Tensor &,const at::Tensor &)' being compiled
2022-08-15T02:41:04.0274880Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): error C2131: expression did not evaluate to a constant
2022-08-15T02:41:04.0275354Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): note: failure was caused by a read of a variable outside its lifetime
2022-08-15T02:41:04.0275810Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(147): note: see usage of 'maxthreads'
2022-08-15T02:41:04.0276255Z C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/native/cpu/radix_sort.h(151): error C3861: '__builtin_clz': identifier not found
2022-08-15T02:41:04.1337508Z [4606/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\core\Dict.cpp.obj
2022-08-15T02:41:04.1509861Z [4607/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\record_function.cpp.obj
2022-08-15T02:41:04.1648793Z [4608/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\core\DeprecatedTypeProperties.cpp.obj
2022-08-15T02:41:04.2841918Z [4609/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\detail\MetaGuardImpl.cpp.obj
2022-08-15T02:41:04.2849279Z [4610/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\detail\ORTHooksInterface.cpp.obj
2022-08-15T02:41:04.2864411Z [4611/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\detail\CUDAHooksInterface.cpp.obj
2022-08-15T02:41:04.3831766Z [4612/5898] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\core\BackendSelectFallbackKernel.cpp.obj
2022-08-15T02:41:04.3832837Z ninja: build stopped: subcommand failed.
2022-08-15T02:41:04.3924820Z -- Building version 1.13.0a0+git97c4f58
2022-08-15T02:41:04.3926162Z cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cpu-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_GENERATOR=Ninja -DCMAKE_INCLUDE_PATH=C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\mkl\include -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=1.13.0a0+git97c4f58 -DUSE_CUDA=0 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch

See GitHub Actions build pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) (8/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T03:50:10.2653240Z ##[error]Process completed with exit code 1.
2022-08-15T03:50:10.2619031Z Non-cacheable calls                   0
2022-08-15T03:50:10.2619287Z Non-compilation calls                 0
2022-08-15T03:50:10.2619536Z Unsupported compiler calls            0
2022-08-15T03:50:10.2619750Z Average cache write               0.000 s
2022-08-15T03:50:10.2620044Z Average cache read miss           0.000 s
2022-08-15T03:50:10.2620375Z Average cache read hit            0.000 s
2022-08-15T03:50:10.2620599Z Failed distributed compilations       0
2022-08-15T03:50:10.2621208Z Cache location                  S3, bucket: Bucket(name=ossci-compiler-cache-circleci-v2, base_url=http://ossci-compiler-cache-circleci-v2.s3.amazonaws.com/)
2022-08-15T03:50:10.2621639Z + echo ::endgroup::
2022-08-15T03:50:10.2622329Z ##[endgroup]
2022-08-15T03:50:10.2653240Z ##[error]Process completed with exit code 1.
2022-08-15T03:50:10.2735724Z Prepare all required actions
2022-08-15T03:50:10.2736084Z Getting action download info
2022-08-15T03:50:10.4359679Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T03:50:10.4359902Z with:
2022-08-15T03:50:10.4360223Z   github-token: ***
2022-08-15T03:50:10.4360380Z env:
2022-08-15T03:50:10.4360556Z   GIT_DEFAULT_BRANCH: master
2022-08-15T03:50:10.4360742Z ##[endgroup]
2022-08-15T03:50:10.4386364Z ##[group]Run nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a
2022-08-15T03:50:10.4386614Z with:

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (9/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:43:33.3992242Z RuntimeError: test_ops failed!
2022-08-15T02:43:32.5832049Z FAILED test_ops.py::TestCommonCPU::test_noncontiguous_samples_scatter_add_cpu_float32
2022-08-15T02:43:32.5833128Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:43:32.5833782Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:43:32.5840222Z = 1 failed, 1157 passed, 292 skipped, 8 xfailed, 60 warnings, 2 rerun in 112.94s (0:01:52) =
2022-08-15T02:43:32.6043948Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:43:33.3987111Z Traceback (most recent call last):
2022-08-15T02:43:33.3987410Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:43:33.3989251Z     main()
2022-08-15T02:43:33.3989524Z   File "test/run_test.py", line 968, in main
2022-08-15T02:43:33.3991840Z     raise RuntimeError(err_message)
2022-08-15T02:43:33.3992242Z RuntimeError: test_ops failed!
2022-08-15T02:43:33.6811428Z 
2022-08-15T02:43:33.6811816Z real	1m59.622s
2022-08-15T02:43:33.6812106Z user	9m49.432s
2022-08-15T02:43:33.6812365Z sys	0m15.613s
2022-08-15T02:43:33.6843178Z ##[error]Process completed with exit code 1.
2022-08-15T02:43:33.6876899Z Prepare all required actions
2022-08-15T02:43:33.6877195Z Getting action download info
2022-08-15T02:43:33.8991396Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:43:33.8991617Z with:
2022-08-15T02:43:33.8991940Z   github-token: ***

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge) (10/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:51:43.5862320Z RuntimeError: /var.../jenkins/workspace/functorch/test/test_ops failed!
2022-08-15T02:51:42.2049227Z 
2022-08-15T02:51:42.2049341Z FAILED (errors=61, skipped=2191, expected failures=482)
2022-08-15T02:51:42.2049347Z 
2022-08-15T02:51:42.2049428Z Generating XML reports...
2022-08-15T02:51:43.0504311Z Generated XML report: test-reports/python-unittest/functorch.test.test_ops/TEST-TestOperatorsCPU-20220815024044.xml
2022-08-15T02:51:43.5857779Z Traceback (most recent call last):
2022-08-15T02:51:43.5858066Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:51:43.5860091Z     main()
2022-08-15T02:51:43.5860375Z   File "test/run_test.py", line 968, in main
2022-08-15T02:51:43.5862064Z     raise RuntimeError(err_message)
2022-08-15T02:51:43.5862320Z RuntimeError: /var/lib/jenkins/workspace/functorch/test/test_ops failed!
2022-08-15T02:51:43.8369756Z ##[error]Process completed with exit code 1.
2022-08-15T02:51:43.8405805Z Prepare all required actions
2022-08-15T02:51:43.8406105Z Getting action download info
2022-08-15T02:51:44.0332146Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:51:44.0332366Z with:
2022-08-15T02:51:44.0332697Z   github-token: ***
2022-08-15T02:51:44.0332854Z env:
2022-08-15T02:51:44.0333027Z   GIT_DEFAULT_BRANCH: master
2022-08-15T02:51:44.0333397Z ##[endgroup]
2022-08-15T02:51:44.0358400Z ##[group]Run nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge) (11/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:42:01.6514252Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:00.9775253Z FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_fwgrad_bwgrad_cummax_cpu_float64
2022-08-15T02:42:00.9775599Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:42:00.9776763Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:42:00.9780699Z = 1 failed, 69 passed, 930 skipped, 1 xfailed, 27 warnings, 2 rerun in 28.39s ==
2022-08-15T02:42:00.9983269Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:42:01.6509954Z Traceback (most recent call last):
2022-08-15T02:42:01.6510281Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:42:01.6511459Z     main()
2022-08-15T02:42:01.6511659Z   File "test/run_test.py", line 968, in main
2022-08-15T02:42:01.6513999Z     raise RuntimeError(err_message)
2022-08-15T02:42:01.6514252Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:01.9667748Z 
2022-08-15T02:42:01.9668139Z real	0m34.536s
2022-08-15T02:42:01.9668373Z user	2m4.650s
2022-08-15T02:42:01.9668589Z sys	0m10.410s
2022-08-15T02:42:01.9698682Z ##[error]Process completed with exit code 1.
2022-08-15T02:42:01.9732888Z Prepare all required actions
2022-08-15T02:42:01.9733242Z Getting action download info
2022-08-15T02:42:02.1260834Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:42:02.1261058Z with:
2022-08-15T02:42:02.1261405Z   github-token: ***

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge) (12/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:42:18.6054707Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:17.9579612Z FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_fwgrad_bwgrad_cummax_cpu_float64
2022-08-15T02:42:17.9580989Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:42:17.9583393Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:42:17.9587378Z = 1 failed, 87 passed, 936 skipped, 1 xfailed, 27 warnings, 2 rerun in 37.45s ==
2022-08-15T02:42:17.9803775Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:42:18.6050377Z Traceback (most recent call last):
2022-08-15T02:42:18.6050652Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:42:18.6052185Z     main()
2022-08-15T02:42:18.6052496Z   File "test/run_test.py", line 968, in main
2022-08-15T02:42:18.6054452Z     raise RuntimeError(err_message)
2022-08-15T02:42:18.6054707Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:42:18.9116777Z 
2022-08-15T02:42:18.9117105Z real	0m44.108s
2022-08-15T02:42:18.9117332Z user	2m53.791s
2022-08-15T02:42:18.9117508Z sys	0m11.845s
2022-08-15T02:42:18.9151718Z ##[error]Process completed with exit code 1.
2022-08-15T02:42:18.9186271Z Prepare all required actions
2022-08-15T02:42:18.9186561Z Getting action download info
2022-08-15T02:42:19.1849380Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:42:19.1849606Z with:
2022-08-15T02:42:19.1849948Z   github-token: ***

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge) (13/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:40:13.0681579Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:40:12.4476717Z FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_fwgrad_bwgrad_cummax_cpu_float64
2022-08-15T02:40:12.4491926Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:40:12.4496545Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:40:12.4500035Z = 1 failed, 69 passed, 930 skipped, 1 xfailed, 27 warnings, 2 rerun in 25.70s ==
2022-08-15T02:40:12.4761063Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:40:13.0674055Z Traceback (most recent call last):
2022-08-15T02:40:13.0674464Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:40:13.0680504Z     main()
2022-08-15T02:40:13.0680938Z   File "test/run_test.py", line 968, in main
2022-08-15T02:40:13.0681351Z     raise RuntimeError(err_message)
2022-08-15T02:40:13.0681579Z RuntimeError: test_ops_gradients failed!
2022-08-15T02:40:13.3462450Z 
2022-08-15T02:40:13.3462980Z real	0m31.749s
2022-08-15T02:40:13.3463254Z user	1m31.328s
2022-08-15T02:40:13.3463506Z sys	0m4.062s
2022-08-15T02:40:13.3493673Z ##[error]Process completed with exit code 1.
2022-08-15T02:40:13.3528671Z Prepare all required actions
2022-08-15T02:40:13.3528970Z Getting action download info
2022-08-15T02:40:13.6327210Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:40:13.6327428Z with:
2022-08-15T02:40:13.6327741Z   github-token: ***

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (14/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T04:05:37.2475282Z RuntimeError: test_torch failed!
2022-08-15T04:05:36.6465218Z Generated XML report: test-reports/python-unittest/test_torch/TEST-TestTorchDeviceTypeCUDA-20220815040346.xml
2022-08-15T04:05:36.6468547Z Generated XML report: test-reports/python-unittest/test_torch/TEST-TestVitalSignsCudaCUDA-20220815040346.xml
2022-08-15T04:05:37.1199192Z [TORCH_VITAL] Dataloader.enabled		 True
2022-08-15T04:05:37.1199669Z [TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
2022-08-15T04:05:37.1200036Z [TORCH_VITAL] CUDA.used		 true
2022-08-15T04:05:37.2470933Z Traceback (most recent call last):
2022-08-15T04:05:37.2471553Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 990, in <module>
2022-08-15T04:05:37.2472406Z     main()
2022-08-15T04:05:37.2472917Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 968, in main
2022-08-15T04:05:37.2474908Z     raise RuntimeError(err_message)
2022-08-15T04:05:37.2475282Z RuntimeError: test_torch failed!
2022-08-15T04:05:37.4966713Z 
2022-08-15T04:05:37.4967430Z real	54m39.668s
2022-08-15T04:05:37.4967904Z user	67m43.025s
2022-08-15T04:05:37.4968140Z sys	5m26.490s
2022-08-15T04:05:37.5006006Z ##[error]Process completed with exit code 1.
2022-08-15T04:05:37.5041946Z Prepare all required actions
2022-08-15T04:05:37.5042372Z Getting action download info
2022-08-15T04:05:37.6730990Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T04:05:37.6731288Z with:
2022-08-15T04:05:37.6731775Z   github-token: ***

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (15/15)

Step: "Test" (full log | diagnosis details)

2022-08-15T02:41:47.5190872Z RuntimeError: test_ops failed!
2022-08-15T02:41:46.7159955Z FAILED test_ops.py::TestCommonCPU::test_noncontiguous_samples_scatter_add_cpu_float32
2022-08-15T02:41:46.7160732Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-08-15T02:41:46.7161264Z !!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
2022-08-15T02:41:46.7166569Z = 1 failed, 1157 passed, 292 skipped, 8 xfailed, 60 warnings, 2 rerun in 115.87s (0:01:55) =
2022-08-15T02:41:46.7428031Z Skip info is located in the xml test reports, please either go to s3 or the hud to download them
2022-08-15T02:41:47.5186438Z Traceback (most recent call last):
2022-08-15T02:41:47.5187112Z   File "test/run_test.py", line 990, in <module>
2022-08-15T02:41:47.5188376Z     main()
2022-08-15T02:41:47.5188731Z   File "test/run_test.py", line 968, in main
2022-08-15T02:41:47.5190479Z     raise RuntimeError(err_message)
2022-08-15T02:41:47.5190872Z RuntimeError: test_ops failed!
2022-08-15T02:41:47.8060438Z 
2022-08-15T02:41:47.8060711Z real	2m2.351s
2022-08-15T02:41:47.8061050Z user	7m54.011s
2022-08-15T02:41:47.8061367Z sys	0m4.941s
2022-08-15T02:41:47.8091972Z ##[error]Process completed with exit code 1.
2022-08-15T02:41:47.8125116Z Prepare all required actions
2022-08-15T02:41:47.8125468Z Getting action download info
2022-08-15T02:41:47.9914713Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-15T02:41:47.9914930Z with:
2022-08-15T02:41:47.9915255Z   github-token: ***

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mingfeima added a commit that referenced this pull request Aug 3, 2022
ghstack-source-id: 1cf90bf
Pull Request resolved: #82703
@mingfeima mingfeima marked this pull request as draft August 3, 2022 05:41
@mingfeima
Collaborator Author

mingfeima commented Aug 3, 2022

This PR is part of the effort to optimize pytorch-geometric performance on CPU, per plan #4891

TODO:

Delete the following items since their perf impact is minor:

  • try cpu memory allocator
  • add logic of can_use_32bit_index
  • add optimization when index and src are 1d, i.e. inner_size is 1

Move the following items to the next step:

  • (TBD) extend the current optimization to other reduction types, e.g. max, min, mean
  • (TBD) extend the current optimization to gather, which will be used for training

mingfeima added a commit that referenced this pull request Aug 15, 2022
ghstack-source-id: c28a917
Pull Request resolved: #82703
@yanbing-j yanbing-j added the intel This tag is for PR from Intel label Aug 22, 2022
@pytorch-bot

pytorch-bot bot commented Sep 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82703

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 4f7a242:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mingfeima mingfeima marked this pull request as ready for review September 22, 2022 05:59
@mingfeima
Collaborator Author

@rusty1s could you please help review this one?

@rusty1s
Contributor

rusty1s commented Sep 23, 2022

Yes, I can take a look ASAP.

@rusty1s
Contributor

rusty1s commented Sep 26, 2022

cc @mananshah99

Contributor

@rusty1s rusty1s left a comment

Thanks for this PR. This is really cool :)

The overall implementation makes sense to me. I guess there might exist some threshold on K and index.numel() when spmm is actually preferred.

In addition, we can also think about wrapping this into a more high-level segment_add implementation which assumes sorted indices as input.
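
For context, a hypothetical sketch of what such a segment-style add over already-sorted indices could look like (not an API introduced by this PR; the name segment_add and the searchsorted-based bucketing are illustrative assumptions):

```python
import torch

def segment_add(src, sorted_index, num_segments):
    """Sum rows of `src` into `num_segments` buckets, assuming `sorted_index`
    is a non-decreasing 1-D tensor with one bucket id per row of `src`."""
    ids = torch.arange(num_segments)
    # Sortedness lets us find each segment's slice with two binary searches.
    beg = torch.searchsorted(sorted_index, ids, right=False)
    end = torch.searchsorted(sorted_index, ids, right=True)
    out = torch.zeros(num_segments, *src.shape[1:], dtype=src.dtype)
    for seg in range(num_segments):
        if end[seg] > beg[seg]:
            out[seg] = src[beg[seg]:end[seg]].sum(dim=0)
    return out
```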

@mingfeima mingfeima requested a review from rusty1s September 29, 2022 06:51
@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@zhuhaozhe zhuhaozhe closed this Oct 20, 2022
@zhuhaozhe zhuhaozhe reopened this Oct 20, 2022
@mingfeima mingfeima requested review from ezyang, jgong5 and rusty1s and removed request for rusty1s January 3, 2023 03:25
@ezyang ezyang changed the title from "optimize scatte_add performance for gnn usage on CPU" to "optimize scatter_add performance for gnn usage on CPU" Jan 3, 2023
@ezyang ezyang left a comment

just reviewed for the framework side of things, did not review algos

@ezyang ezyang commented Jan 10, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

@mingfeima

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/84/head branch June 8, 2023 18:01

Labels: ciflow/trunk, cla signed, intel priority, intel, Merged, module: cpu, open source, topic: performance
