optimize norm on ATen CPU backend #11565
Conversation
@pytorchbot retest this please

@ezyang please help review the failures. They seem unrelated to this PR.

Looks like this PR is a duplicate of #10535

The failures are unrelated. @mingfeima is it safe to assume you resubmitted this PR on @xhzhao's behalf?

@cpuhrsch can you take a look at this?

For my part, the patch looks basically reasonable, but I am not an expert in the AVX parallelization.
facebook-github-bot left a comment:

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
cc @colesbury, who has looked into related code. @mingfeima - thank you for this! Could I also ask you to look into smaller tensors, to make sure that the constant overhead etc. stays the same and that this scales as expected? And what about single-core performance and NUMA behavior, i.e. try binding memory and CPU to the same node? Also, could you (briefly) look at the behavior regarding different CPU capabilities, using the environment variables [here]?
```cpp
int64_t n_rounded = round_down(n, WIDTH);
scalar_t result1 = norm_reduce128(data, n_rounded, pval);
scalar_t result2 = norm_reduce_sequential(data + n_rounded, n - n_rounded, stride, pval);
result = std::pow(std::pow(result1, pval) + std::pow(result2, pval), 1.0/pval);
```
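The split works because the p-norm is (Σᵢ |xᵢ|^p)^(1/p): raising each partial norm back to the p-th power recovers its partial sum of |x|^p, so the vectorized head and the scalar tail combine exactly as in the last line above. As a minimal sketch of what the scalar tail could look like (the body below is an illustrative assumption, not the PR's actual norm_reduce_sequential):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical tail reduction over the elements left after rounding
// n down to a multiple of the SIMD width; illustrative only.
template <typename scalar_t>
scalar_t norm_reduce_sequential(const scalar_t* data, int64_t n,
                                int64_t stride, double pval) {
  double sum = 0.0;
  for (int64_t i = 0; i < n; i++) {
    sum += std::pow(std::abs(static_cast<double>(data[i * stride])), pval);
  }
  return static_cast<scalar_t>(std::pow(sum, 1.0 / pval));
}
```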
```cpp
if (self.type().is_sparse()) {
  return at::native_norm(self, p);
} else {
  AT_CHECK(self.type().backend() == Backend::CPU || self.type().backend() == Backend::CUDA,
```
Generally looks good - @mingfeima I'd also be curious about, and much appreciative of, your opinion on the user experience of using the abstractions (vec256 and parallel), and any feedback related to them, if you want to give any. EDIT: One more comment: how does this behave with respect to numerical stability (for, say, the 2-norm or 3-norm) in comparison to the TH implementation?
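For context on the stability question: naively accumulating Σ|x|^p in single precision can overflow even when the norm itself is representable. A minimal illustration (not code from this PR; the scaled loop mirrors the classic BLAS nrm2 technique):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Naive 2-norm: the sum of squares overflows float even though
  // the true norm (~1.4e20) is representable.
  float data[2] = {1e20f, 1e20f};
  float acc = 0.0f;
  for (float x : data) acc += x * x;  // 1e40 > FLT_MAX, so acc becomes inf
  std::printf("naive:  %g\n", std::sqrt(acc));  // prints inf

  // Scaled accumulation keeps intermediate values near 1.
  float scale = 0.0f, ssq = 1.0f;
  for (float x : data) {
    float ax = std::fabs(x);
    if (ax > scale) {
      float r = scale / ax;
      ssq = 1.0f + ssq * r * r;
      scale = ax;
    } else if (scale > 0.0f) {
      float r = ax / scale;
      ssq += r * r;
    }
  }
  std::printf("scaled: %g\n", scale * std::sqrt(ssq));  // ~1.41421e+20
  return 0;
}
```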
@cpuhrsch thanks for the review, will do.
facebook-github-bot left a comment:

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Currently, torch.norm() runs sequentially on CPU. This PR parallelizes and vectorizes torch.norm() on the ATen CPU path, providing roughly two orders of magnitude of performance improvement.
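The overall pattern is a two-level reduction: threads reduce disjoint chunks in parallel, and within each chunk the |x|^p accumulation is vectorized. A rough sketch of the idea using plain OpenMP (the PR itself uses ATen's parallel and vec256 abstractions, not this code):

```cpp
#include <cmath>
#include <cstdint>

// Rough sketch of the parallel p-norm reduction; illustrative only.
double norm_p(const float* data, int64_t n, double p) {
  double acc = 0.0;
  // Each thread accumulates a private partial sum of |x|^p,
  // and the partials are combined by the reduction clause.
  #pragma omp parallel for reduction(+ : acc)
  for (int64_t i = 0; i < n; i++) {
    // This inner loop body is what the AVX path vectorizes.
    acc += std::pow(std::fabs(static_cast<double>(data[i])), p);
  }
  return std::pow(acc, 1.0 / p);
}
```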
Performance was benchmarked on a Xeon Skylake 8180 (2×28 cores @ 2.5 GHz) using the following script:
```python
import torch
from time import time

count = 1000
size = 1000*1000

def test_norm(p=2):
    a = torch.randn(size)
    tstart = time()
    for i in range(count):
        torch.norm(a, p)
    tend = time()
    print("norm on size %d tensor p = %d: %f s" % (size, p, (tend-tstart)))

for p in range(4):
    test_norm(p)
```
without this optimization,
```
(intel-pytorch) [mingfeim@mlt-skx065 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 1.071235 s
norm on size 1000000 tensor p = 1: 1.069149 s
norm on size 1000000 tensor p = 2: 1.068212 s
norm on size 1000000 tensor p = 3: 69.735312 s
```
and with this optimization,
```
(pytorch-tf) [mingfeim@mlt-skx053 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 0.127507 s
norm on size 1000000 tensor p = 1: 0.011867 s
norm on size 1000000 tensor p = 2: 0.011907 s
norm on size 1000000 tensor p = 3: 0.014470 s
```
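From these numbers, the speedup is roughly 8× for p = 0, about 90× for p = 1 and p = 2, and nearly 4800× for p = 3.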
Pull Request resolved: pytorch/pytorch#11565
Differential Revision: D9804484
Pulled By: ezyang
fbshipit-source-id: 52899f30ac26139d00684d07edfb47cb9b25d871