Conversation

@v0dro
Contributor

@v0dro v0dro commented Apr 12, 2020

This PR implements gh-33389.

As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, add, subtract, multiply and divide have been implemented, and adding new ones is not hard.

While we now allow dynamic runtime selection of reduction modes, performance is the same as for the scatter_add_ method on the master branch. Proof can be seen in the graph below, which compares scatter_add_ on the master branch (blue) against scatter_(reduce="add") from this PR (orange).
(Benchmark graph: scatter-regression.py.csv)

The script used for benchmarking is as follows:

import os
import sys
import torch
import time
import numpy
from IPython import get_ipython

Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()

plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)

        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")

        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")

fname.close()

Additionally, one can see that various reduction modes take almost the same time to execute:

op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Script:

import torch
import time
import numpy
from IPython import get_ipython

ipython = get_ipython()

nrows = 3000
ncols = 10000
dims = [nrows, ncols]

res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))

for op in ["add", "subtract", "multiply", "divide"]:
    print(f"op: {op}")
    ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
    ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")

@dr-ci

dr-ci bot commented Apr 12, 2020

💊 CI failures summary and remediations

As of commit 4b15a86 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@v0dro
Contributor Author

v0dro commented Apr 16, 2020

@nikitaved can you review this? Also, I'm getting a failure for XLA; is that OK?

@v0dro
Contributor Author

v0dro commented Jun 7, 2020

@ngimel @rgommers please have a look at the latest commit.

Comment on lines 27 to 30
template <typename scalar_t>
void operator()(scalar_t * self_data, scalar_t * src_data) {
  *self_data += *src_data;
};
Collaborator

@nikitaved nikitaved Jun 11, 2020


Suggested change
template <typename scalar_t>
void operator()(scalar_t * self_data, scalar_t * src_data) {
  *self_data += *src_data;
};
template <typename scalar_t>
constexpr void operator()(scalar_t * self_data, scalar_t * src_data) const {
  *self_data += *src_data;
};

Starting from C++14 this implies inlining. Same for the other functors. Could you please give it a try? That way it works for a wider range of compilers, as long as they follow the standard.

Collaborator

@nikitaved nikitaved Jun 11, 2020


Hmm, it could be that constexpr does not work with void functions or functions with side effects; at least in C++11 that is the case.

@ngimel
Collaborator

ngimel commented Jun 11, 2020

Dispatch needlessly compiles for all the dtypes, not just the floating-point types. Fixing it, however, would require a large copy-paste, so in the interest of code clarity (but not binary size :-) ) I'm inclined to let it go as is, after you try @nikitaved's suggestion. Thoughts? If you come up with a clever way to avoid the extra dispatch without copy-paste, that would be awesome too, but if not, it's not a blocker.

@v0dro
Contributor Author

v0dro commented Jun 12, 2020

I have introduced constexpr, but I don't think reintroducing the unordered map will make things any faster. In the following MWE, optimizations do not kick in when the functor is called through an unordered map, even though it is declared constexpr:

#include <unordered_map>
#include <functional>
#include <iostream>
#include <chrono>
#include <cstdlib>  // for atoi

class F {
public:
  constexpr void operator() (int & a, int & b) const {
    a += b;
  }
};

int main(int argc, char* argv[]) {
  F fun;
  
  int a = atoi(argv[1]);
  int b = atoi(argv[2]);
  std::chrono::time_point<std::chrono::system_clock> start, stop;
  auto time = 0.0;
  using binary_t = std::function<void(int&, int&)>;
  std::unordered_map<int, binary_t> funcs;
  
  funcs[0] = fun;
  start = std::chrono::system_clock::now();
  for (long long i = 0; i < 10000000; ++i) {
    funcs[0](a, b);
  }
  stop = std::chrono::system_clock::now();
  
  std::cout << "time with unordered_map: " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() << std::endl;
  std::cout << "a: " << a << std::endl;
  a = 1;

  start = std::chrono::system_clock::now();
  for (long long i = 0; i < 10000000; ++i) {
    fun(a, b);
  }
  stop = std::chrono::system_clock::now();
  
  std::cout << "time without map: " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() << std::endl;
  std::cout << "a: " << a << std::endl;
}

Results:

time with unordered_map: 50887068
a: 30000001
time without map: 44
a: 30000001
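For illustration only, the same shape of comparison can be sketched in Python (all names here are hypothetical; Python cannot reproduce the inlining and constant-folding that make the direct C++ call collapse to nearly zero time, but the structure of the two call paths is the same):

```python
# Analogue of the C++ MWE: fetching the reduction callable from a map on
# every iteration vs. binding it once and calling it directly.
def add_inplace(state, b):
    state[0] += b

funcs = {0: add_inplace}  # analogue of std::unordered_map<int, binary_t>

def via_map(n):
    state = [0]
    for _ in range(n):
        funcs[0](state, 1)  # hash lookup plus indirect call each iteration
    return state[0]

def direct(n):
    state = [0]
    f = add_inplace  # resolved once, like invoking the functor directly
    for _ in range(n):
        f(state, 1)
    return state[0]

print(via_map(100000), direct(100000))  # both compute the same sum
```

This is why the PR resolves the reduction mode once, before the hot loop, rather than keeping a map of type-erased callables inside it.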

@v0dro
Contributor Author

v0dro commented Jun 12, 2020

New benchmark:
scatter-regression.py.csv.pdf

@ngimel
Collaborator

ngimel commented Jun 16, 2020

Is this ready? If so, can you please rebase so that we can get a signal from CI; right now too many tests are failing.

@ngimel
Collaborator

ngimel commented Jun 19, 2020

CI errors are real.

@v0dro
Contributor Author

v0dro commented Jun 24, 2020

@ngimel the errors are fixed. Could you please have a look? The Windows failures seem unrelated.

@ngimel
Collaborator

ngimel commented Jun 24, 2020

Yeah, the Windows failures are unrelated, but can you please rebase so that we can get a signal from the Windows builds?

@v0dro
Contributor Author

v0dro commented Jun 27, 2020

@ngimel updated and ready to merge.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 9ca4a46.


facebook-github-bot pushed a commit that referenced this pull request Jul 6, 2020
…ction methods. (#40962)

Summary:
Follow up to #36447 . Update for #33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: #40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
…ction methods. (pytorch#40962)

Summary:
Follow up to pytorch#36447 . Update for pytorch#33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: pytorch#40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8

Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scatter_ supporting different reduction modes

8 participants