Parallelize cpu index_put accumulate float path with cpu_atomic_add_float #29705
Conversation
An index_put performance test case is provided in run_index_put.sh. Running this test with and without this enhancement on an SKX machine gives the results below:
@VitalyFedyunin @jgong5 @Jianhui-Li please review this code.
💊 CircleCI build failures summary (Dr. CI): as of commit fd03348, there are no CircleCI failures.
You are benchmarking a single-threaded solution against a multithreaded one. Please also measure the performance impact between running the old code single-threaded and the new code single-threaded.
VitalyFedyunin
left a comment
Linter failures are valid.
Please check the comments in #13420 for why this looks like duplicated code.
Keeping it untouched at this time.
Please add a comment referencing the GitHub issue/PR that explains this value selection.
This is based on DLRM experience and the unit test I wrote. As with the one above, finding a more optimal grain size would require more time to analyze the relationship between thread-launch overhead and computation; I haven't dug into that yet. Thanks.
More extensive benchmarking for different sizes would be valuable; if you've already done this, please let us know.
Test results have been updated against the latest PyTorch code.
So why would this PR improve single-threaded performance? Is it just flakiness of the benchmark? Also, why are the original numbers for 1 core and 28 cores different? They should be the same.
This is most likely variance. On a single core, execution takes the previous sequential index_put path and does not use the atomic float add, because I added an if condition there:
bool use_parallel_for = (iter.numel() >= internal::GRAIN_SIZE) && (at::get_num_threads() > 1);
if (iter.dtype() == at::ScalarType::Float && use_parallel_for) {
  // cpu_atomic_add_float path
} else {
  // original path
}
I enhanced the test case to run 100 iterations instead of 10 to get the average performance, and updated the numbers above. As for why the modified code looks a little better, it may be due to code layout changes.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…l PyTorch team. Except for the catArray-related code, all other code is in the following four PRs: pytorch#23057 add_out_dense_sparse enhancement; pytorch#24385 embedding_bag_forward index_select_add enhancement; pytorch#27804 embedding_backward_sparse_fast_path_sum; pytorch#29705 index_put accumulate-path enhancement for float type.
Single-thread performance improvement in the last table is highly suspect: an atomic implementation of index_add cannot possibly be faster than regular addition, and grain sizes should not affect single-threaded execution.
@pytorchbot retest please
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fails the Android and iOS builds with errors like:
With @ngimel's help we identified that it is the fault of
_mm_pause is x86-specific: https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/x86-specific-memory-model-extensions-for-transactional-memory.html
@ngimel please take a look at the modification.
Looks good; we'll need to rerun internal builds. Unfortunately I don't know how to reproduce them externally; we'll probably need to put some mobile build into public CI.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@VitalyFedyunin merged this pull request in 65ff064.
This PR tries to parallelize the index_put accumulate path for float type on CPU. cpu_atomic_add_float is implemented using the atomic_compare_exchange_strong function.
For the DLRM benchmark, index_put_impl time can be reduced from 827.741 ms to 116.646 ms over 1000 batches.
A "grain_size" parameter is added to TensorIterator::for_each to fine-tune index_put performance.
The default value of grain_size is internal::GRAIN_SIZE. The index_put grain size is tuned to 3000 and the cpu_kernel_vec grain size to 1024. The following shows the grain-size impact on the DLRM ops
(index_put_impl based on index_put parallelized with cpu_atomic_add_float):
For more of the reasoning behind the grain-size tuning, please see PR#30803.
To reproduce the DLRM performance here, please also have a look at PR#23057, PR#24385 and PR#27804,
and expose the env vars as below: