optimize norm on ATen CPU backend #11565
Conversation
@pytorchbot retest this please

@ezyang please help review the failures. They seem unrelated to this PR.

Looks like this PR is a duplicate of #10535

The failures are unrelated. @mingfeima is it safe to assume you resubmitted this PR on @xhzhao's behalf?

@cpuhrsch can you take a look at this?

For my part, the patch looks basically reasonable, but I am not an expert in the AVX parallelization.
facebook-github-bot left a comment:

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
cc @colesbury, who has looked into related code. @mingfeima - thank you for this! Could I also ask you to look into smaller tensors, to make sure that the constant overhead etc. stays the same and that this scales as expected? And what about single-core performance and NUMA behavior, i.e. try binding memory and CPU to the same node? Also, could you (briefly) look at the behavior regarding different CPU capabilities, using the environment variables [here]?
```cpp
int64_t n_rounded = round_down(n, WIDTH);
scalar_t result1 = norm_reduce128(data, n_rounded, pval);
scalar_t result2 = norm_reduce_sequential(data + n_rounded, n - n_rounded, stride, pval);
result = std::pow(std::pow(result1, pval) + std::pow(result2, pval), 1.0/pval);
```
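The split works because the p-norm is (Σᵢ |xᵢ|^p)^(1/p): raising each partial norm back to the p-th power recovers its partial sum of |x|^p, so the vectorized head and the scalar tail combine exactly as in the last line above. As a minimal sketch of what the scalar tail could look like (the body below is an illustrative assumption, not the PR's actual norm_reduce_sequential):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical tail reduction over the elements left after rounding
// n down to a multiple of the SIMD width; illustrative only.
template <typename scalar_t>
scalar_t norm_reduce_sequential(const scalar_t* data, int64_t n,
                                int64_t stride, double pval) {
  double sum = 0.0;
  for (int64_t i = 0; i < n; i++) {
    sum += std::pow(std::abs(static_cast<double>(data[i * stride])), pval);
  }
  return static_cast<scalar_t>(std::pow(sum, 1.0 / pval));
}
```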
```cpp
if (self.type().is_sparse()) {
  return at::native_norm(self, p);
} else {
  AT_CHECK(self.type().backend() == Backend::CPU || self.type().backend() == Backend::CUDA,
```
Generally looks good - @mingfeima I'd also be curious about, and much appreciative of, your opinion on the user experience of using the abstractions (vec256 and parallel), and any feedback related to them, if you want to give any. EDIT: One more comment: how does this behave with respect to numerical stability (for, say, the 2-norm or 3-norm) in comparison to the TH implementation?
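For context on the stability question: naively accumulating Σ|x|^p in single precision can overflow even when the norm itself is representable. A minimal illustration (not code from this PR; the scaled loop mirrors the classic BLAS nrm2 technique):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Naive 2-norm: the sum of squares overflows float even though
  // the true norm (~1.4e20) is representable.
  float data[2] = {1e20f, 1e20f};
  float acc = 0.0f;
  for (float x : data) acc += x * x;  // 1e40 > FLT_MAX, so acc becomes inf
  std::printf("naive:  %g\n", std::sqrt(acc));  // prints inf

  // Scaled accumulation keeps intermediate values near 1.
  float scale = 0.0f, ssq = 1.0f;
  for (float x : data) {
    float ax = std::fabs(x);
    if (ax > scale) {
      float r = scale / ax;
      ssq = 1.0f + ssq * r * r;
      scale = ax;
    } else if (scale > 0.0f) {
      float r = ax / scale;
      ssq += r * r;
    }
  }
  std::printf("scaled: %g\n", scale * std::sqrt(ssq));  // ~1.41421e+20
  return 0;
}
```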
@cpuhrsch thanks for the review, will do.
facebook-github-bot left a comment:

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Currently, torch.norm() runs sequentially on CPU. This PR parallelizes and vectorizes torch.norm() on the ATen CPU path, providing roughly two orders of magnitude of performance improvement.
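The overall pattern is a two-level reduction: threads reduce disjoint chunks in parallel, and within each chunk the |x|^p accumulation is vectorized. A rough sketch of the idea using plain OpenMP (the PR itself uses ATen's parallel and vec256 abstractions, not this code):

```cpp
#include <cmath>
#include <cstdint>

// Rough sketch of the parallel p-norm reduction; illustrative only.
double norm_p(const float* data, int64_t n, double p) {
  double acc = 0.0;
  // Each thread accumulates a private partial sum of |x|^p,
  // and the partials are combined by the reduction clause.
  #pragma omp parallel for reduction(+ : acc)
  for (int64_t i = 0; i < n; i++) {
    // This inner loop body is what the AVX path vectorizes.
    acc += std::pow(std::fabs(static_cast<double>(data[i])), p);
  }
  return std::pow(acc, 1.0 / p);
}
```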
Performance was benchmarked on a Xeon Skylake 8180 (2×28 cores @ 2.5 GHz) using the following script:
```python
import torch
from time import time

count = 1000
size = 1000*1000

def test_norm(p=2):
    a = torch.randn(size)
    tstart = time()
    for i in range(count):
        torch.norm(a, p)
    tend = time()
    print("norm on size %d tensor p = %d: %f s" % (size, p, (tend-tstart)))

for p in range(4):
    test_norm(p)
```
without this optimization,
```
(intel-pytorch) [mingfeim@mlt-skx065 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 1.071235 s
norm on size 1000000 tensor p = 1: 1.069149 s
norm on size 1000000 tensor p = 2: 1.068212 s
norm on size 1000000 tensor p = 3: 69.735312 s
```
and with this optimization,
```
(pytorch-tf) [mingfeim@mlt-skx053 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 0.127507 s
norm on size 1000000 tensor p = 1: 0.011867 s
norm on size 1000000 tensor p = 2: 0.011907 s
norm on size 1000000 tensor p = 3: 0.014470 s
```
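From these numbers, the speedup is roughly 8× for p = 0, about 90× for p = 1 and p = 2, and nearly 4800× for p = 3.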
Pull Request resolved: pytorch/pytorch#11565
Differential Revision: D9804484
Pulled By: ezyang
fbshipit-source-id: 52899f30ac26139d00684d07edfb47cb9b25d871