Parallelize cpu index_put accumulate float path with cpu_atomic_add_float #29705
Conversation
An index_put performance test case is provided in run_index_put.sh. Running this test with and without this enhancement on an SKX machine gives the results below:
@VitalyFedyunin @jgong5 @Jianhui-Li please review this code.
💊 CircleCI build failures summary (Dr. CI): as of commit fd03348, there are no CircleCI failures.
You are benchmarking a single-threaded solution against a multithreaded one. Please also measure the performance impact between running the old code single-threaded and the new code single-threaded.
VitalyFedyunin
left a comment
Linter failures are valid.
Please check the comments in #13420 for why this looks like duplicated code.
Keeping it untouched at this time.
Please add a comment referencing the GitHub issue/PR that explains this value selection.
This is based on DLRM experience and the unit test I wrote. As with the one above, finding a more optimal grain size would require more time to analyze the relationship between thread-launch overhead and computation; I haven't dug into that yet. Thanks.
More extensive benchmarking for different sizes would be valuable; if you've already done this, please let us know.
Test results have been updated against the latest PyTorch code.
So why would this PR improve single-threaded performance? Is it just flakiness of the benchmark? Also, why are the original numbers for 1 core and 28 cores different? They should be the same.
This is most likely variance. On a single core, execution takes the previous sequential index_put path and does not use the atomic float add, because I added an if condition there:
bool use_parallel_for = (iter.numel() >= internal::GRAIN_SIZE) && (at::get_num_threads() > 1);
if (iter.dtype() == at::ScalarType::Float && use_parallel_for) {
  // cpu_atomic_add_float path
} else {
  // original path
}
I enhanced the test case to run 100 iterations instead of 10 to get the average performance, and updated the numbers above. As for why the modified code looks a little better, it may be due to code layout changes.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…l PyTorch team. Except for the catArray-related code, all other code is in the following four PRs: pytorch#23057 add_out_dense_sparse enhancement; pytorch#24385 embedding_bag_forward index_select_add enhancement; pytorch#27804 embedding_backward_sparse_fast_path_sum; pytorch#29705 index_put accumulate-path enhancement for float type.
Single-thread performance improvement in the last table is highly suspect: an atomic implementation of index_add cannot possibly be faster than regular addition, and grain sizes should not affect single-threaded execution.
@pytorchbot retest please
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fails the Android and iOS builds with errors like:
With @ngimel's help we identified that it is the fault of
_mm_pause is x86-specific: https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/x86-specific-memory-model-extensions-for-transactional-memory.html
@ngimel please take a look at the modification.
Looks good; we'll need to rerun internal builds. Unfortunately I don't know how to reproduce them externally; we'll probably need to put some mobile build into public CI.
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@VitalyFedyunin merged this pull request in 65ff064.
This PR tries to parallelize the index_put accumulate path for float type on CPU. cpu_atomic_add_float is implemented using the atomic_compare_exchange_strong function.
For the DLRM benchmark, index_put_impl time can be reduced from 827.741 ms to 116.646 ms over 1000 batches.
A "grain_size" parameter is added to TensorIterator::for_each to fine-tune index_put performance.
The default value of grain_size is internal::GRAIN_SIZE. The index_put grain size is tuned to 3000 and the cpu_kernel_vec grain size to 1024. The following shows the grain-size impact on the DLRM ops
(index_put_impl based on index_put parallelized with cpu_atomic_add_float):
For more of the reasoning behind the grain-size tuning, please see PR#30803.
To reproduce the DLRM performance here, please also have a look at PR#23057, PR#24385 and PR#27804,
and expose the env vars as below: