
Conversation

@mingfeima
Collaborator

Currently, torch.norm() runs sequentially on CPU. This PR parallelizes and vectorizes torch.norm() on the ATen CPU path, providing roughly a two-order-of-magnitude performance boost.

Performance was benchmarked on a Xeon Skylake 8180 (2 sockets × 28 cores @ 2.5 GHz) using the following script:

```python
import torch
from time import time

count = 1000
size = 1000*1000

def test_norm(p=2):
    a = torch.randn(size)
    tstart = time()
    for i in range(count):
        torch.norm(a, p)
    tend = time()
    print("norm on size %d tensor p = %d: %f s" % (size, p, (tend-tstart)))

for p in range(4):
    test_norm(p)
```

without this optimization,

```
(intel-pytorch) [mingfeim@mlt-skx065 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 1.071235 s
norm on size 1000000 tensor p = 1: 1.069149 s
norm on size 1000000 tensor p = 2: 1.068212 s
norm on size 1000000 tensor p = 3: 69.735312 s
```

and with this optimization,

```
(pytorch-tf) [mingfeim@mlt-skx053 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 0.127507 s
norm on size 1000000 tensor p = 1: 0.011867 s
norm on size 1000000 tensor p = 2: 0.011907 s
norm on size 1000000 tensor p = 3: 0.014470 s
```

@ezyang
Contributor

ezyang commented Sep 12, 2018

@pytorchbot retest this please

@mingfeima
Collaborator Author

@ezyang please help take a look at the failures; they seem unrelated to this PR.

@fmassa
Member

fmassa commented Sep 12, 2018

Looks like this PR is a duplicate of #10535

@ezyang
Contributor

ezyang commented Sep 12, 2018

The failures are unrelated. @mingfeima is it safe to assume you resubmitted this PR on @xhzhao's behalf?

@ezyang
Contributor

ezyang commented Sep 12, 2018

@cpuhrsch can you take a look at this?

@ezyang
Contributor

ezyang commented Sep 13, 2018

For my part, the patch looks basically reasonable, but I am not an expert in the AVX parallelization.

Contributor

@facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@cpuhrsch
Contributor

cc @colesbury who has looked into related code

@mingfeima - Thank you for this! Could I also ask you to look into smaller tensors, to make sure the constant overhead stays the same and that this scales as expected? And what about single-core performance and NUMA behavior, i.e. try binding memory and CPU to the same node?

Also, could you (briefly) look at the behavior under different CPU capabilities, using the environment variable checked here:

if (strcmp(envar, "avx2") == 0) {

Just in case the performance is much worse for some capabilities.
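
For readers unfamiliar with that env-var check, here is a minimal sketch of the kind of capability override it enables. The names (compute_cpu_capability, CPU_CAPABILITY_OVERRIDE) are hypothetical and this is not the actual ATen dispatch code; it only illustrates how setting the variable to a lower value lets you benchmark the plain, AVX, and AVX2 paths on the same machine.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch, not ATen's dispatcher: pick a kernel capability from an
// environment-variable override, otherwise fall back to what the CPU supports.
enum class CPUCapability { DEFAULT, AVX, AVX2 };

CPUCapability compute_cpu_capability(const char* override_var) {
  const char* envar = std::getenv(override_var);
  if (envar != nullptr) {
    if (std::strcmp(envar, "avx2") == 0)    return CPUCapability::AVX2;
    if (std::strcmp(envar, "avx") == 0)     return CPUCapability::AVX;
    if (std::strcmp(envar, "default") == 0) return CPUCapability::DEFAULT;
    std::fprintf(stderr, "ignoring invalid value of %s: %s\n", override_var, envar);
  }
  // With no override, a real implementation would detect AVX/AVX2 via CPUID here.
  return CPUCapability::AVX2;
}

int main() {
  // "CPU_CAPABILITY_OVERRIDE" is a placeholder name for the variable being checked.
  CPUCapability cap = compute_cpu_capability("CPU_CAPABILITY_OVERRIDE");
  std::printf("selected capability: %d\n", static_cast<int>(cap));
  return 0;
}
```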

int64_t n_rounded = round_down(n, WIDTH);
scalar_t result1 = norm_reduce128(data, n_rounded, pval);
scalar_t result2 = norm_reduce_sequential(data + n_rounded, n - n_rounded, stride, pval);
result = std::pow(std::pow(result1, pval) + std::pow(result2, pval), 1.0/pval);
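
For context, the recombination on the last line of the hunk above relies on the identity ||x||_p^p = ||head||_p^p + ||tail||_p^p: the bulk of the buffer is reduced by the vectorized routine, the leftover elements by the scalar one, and the two partial results are merged. Below is a standalone sketch of the same split-then-combine pattern; the helper norm_to_p is hypothetical and stands in for the PR's actual kernels.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sum of |x_i|^p over n elements, i.e. the p-norm raised to the p-th power.
static double norm_to_p(const double* data, int64_t n, double p) {
  double acc = 0.0;
  for (int64_t i = 0; i < n; ++i) {
    acc += std::pow(std::abs(data[i]), p);
  }
  return acc;
}

int main() {
  std::vector<double> x = {1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0};
  const double p = 3.0;
  const int64_t n = static_cast<int64_t>(x.size());
  const int64_t n_rounded = (n / 4) * 4;  // pretend WIDTH == 4 vector lanes

  double head = norm_to_p(x.data(), n_rounded, p);                   // "vectorized" body
  double tail = norm_to_p(x.data() + n_rounded, n - n_rounded, p);   // scalar tail
  double combined = std::pow(head + tail, 1.0 / p);
  double reference = std::pow(norm_to_p(x.data(), n, p), 1.0 / p);

  std::printf("combined = %f, reference = %f\n", combined, reference);
  return 0;
}
```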

This comment was marked as off-topic.

if (self.type().is_sparse()) {
return at::native_norm(self, p);
} else {
AT_CHECK(self.type().backend() == Backend::CPU || self.type().backend() == Backend::CUDA,

This comment was marked as off-topic.

This comment was marked as off-topic.

@cpuhrsch
Copy link
Contributor

cpuhrsch commented Sep 13, 2018

Generally looks good - @mingfeima I'd also be curious about, and much appreciative of, your opinion on the user experience of the abstractions (vec256 and parallel), and any feedback related to them, if you want to give any.

EDIT: One more comment: How does this behave with respect to numerical stability (for say the 2-norm or 3-norm) in comparison to the TH implementation?

@mingfeima
Collaborator Author

@fmassa @ezyang this one was authored by @xhzhao; we are on the same team, so you are pretty much seeing the entire intel-pytorch team on this thread. Previously we also had @MlWoo on the team, but he left this summer.
I thought xhzhao hadn't submitted a pull request for it... my mistake... anyway, either one should be OK.

@mingfeima
Collaborator Author

@cpuhrsch thanks for the review, will do.
Honestly, I'd say parallel and vec256 work like a charm. With parallel and vec256 it is very easy to construct high-performance kernels in ATen/native without touching any OpenMP pragmas or intrinsics.
One small suggestion: you probably need to add an omp_in_parallel check inside parallel_for and parallel_reduce. parallel could unintentionally be called from inside a nested loop, and in that case it would cause thread oversubscription; a sketch of such a guard is below.
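
To make the oversubscription point concrete, here is a minimal sketch of such a guard. The wrapper name and signature are hypothetical, not ATen's actual parallel_for; it only shows the idea of falling back to serial execution when the caller is already inside an OpenMP parallel region.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstdint>

// Hypothetical sketch, not ATen's implementation: a parallel_for-style wrapper
// that runs the whole range serially if it is already inside an OpenMP parallel
// region (or if the range is too small), so nested calls do not spawn another
// team of threads and oversubscribe the machine.
template <typename F>
void parallel_for_guarded(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  if (omp_in_parallel() || end - begin <= grain_size) {
    f(begin, end);  // nested or tiny range: stay on the current thread
    return;
  }
  #pragma omp parallel
  {
    const int64_t num_threads = omp_get_num_threads();
    const int64_t tid = omp_get_thread_num();
    const int64_t chunk = (end - begin + num_threads - 1) / num_threads;
    const int64_t chunk_begin = begin + tid * chunk;
    const int64_t chunk_end = std::min(chunk_begin + chunk, end);
    if (chunk_begin < chunk_end) {
      f(chunk_begin, chunk_end);  // each thread handles one contiguous chunk
    }
  }
}
```

With such a check in place, a kernel that ends up being called from inside another parallel region simply runs its range sequentially instead of spawning a second team of threads.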

Contributor

@facebook-github-bot left a comment


ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 14, 2018
Pull Request resolved: pytorch/pytorch#11565

Differential Revision: D9804484

Pulled By: ezyang

fbshipit-source-id: 52899f30ac26139d00684d07edfb47cb9b25d871
@cpuhrsch mentioned this pull request Sep 19, 2018