
Conversation

@JianpingChen066 (Contributor) commented Nov 13, 2019

This PR parallelizes the index_put accumulate path for the float type on CPU. cpu_atomic_add_float is implemented using the atomic_compare_exchange_strong function.
For the DLRM benchmark, the index_put_impl function time is reduced from 827.741ms to 116.646ms over 1000 batches.
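
For illustration, here is a minimal sketch of the compare-exchange idea behind such an atomic float add (the function shape and details below are illustrative, not the exact kernel code in this PR):

#include <atomic>

// Sketch: add `value` to `*addr` atomically using a CAS retry loop.
static inline void cpu_atomic_add_float(float* addr, float value) {
  auto* atomic_addr = reinterpret_cast<std::atomic<float>*>(addr);
  float expected = atomic_addr->load(std::memory_order_relaxed);
  // If another thread updates *addr between our load and the exchange,
  // compare_exchange_strong fails, refreshes `expected`, and we retry.
  while (!atomic_addr->compare_exchange_strong(expected, expected + value)) {
  }
}

Each thread only retries when it loses a race on the same element, so writes to disjoint indices accumulate fully in parallel.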

Add a parameter "grain_size" to TensorIterator::for_each to fine-tune index_put performance.
The default value of grain_size is internal::GRAIN_SIZE. The index_put grain size is tuned to 3000 and the cpu_kernel_vec grain size is tuned to 1024. The table below shows the grain size impact on the DLRM ops
(the index_put_impl numbers assume index_put has been parallelized with cpu_atomic_add_float); a sketch of the chunking idea follows the table:

| Op Name | without small grain_size | with 1024 as grain_size in cpu_kernel_vec and 3000 in cpu_index_kernel |
| --- | --- | --- |
| add_ | 11.985s | 11.601s |
| mm | 9.706s | 9.518s |
| addmm | 5.380s | 5.247s |
| _embedding_bag | 2.992s | 2.663s |
| _embedding_bag_backward | 1.330s | 1.354s |
| threshold_backward | 686.920ms | 659.169ms |
| index_put_impl | 489.411ms | 116.646ms |
| bmm | 413.129ms | 362.967ms |
| zero_ | 379.659ms | 310.623ms |
| add | 205.904ms | 171.111ms |
| cat | 187.101ms | 175.621ms |
| Self CPU time total (s) | 36.544 | 34.742 |
| Average ms per iteration | 38.25 | 36.44 |
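
For reference, here is a self-contained sketch of what a grain_size parameter controls; it mimics the spirit of grain-based chunking in at::parallel_for / TensorIterator::for_each rather than the actual PyTorch implementation:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Split [0, numel) across threads only in chunks of at least `grain_size`
// elements (grain_size > 0 assumed), so small workloads stay serial and
// thread-launch overhead stays bounded; a larger grain size means fewer,
// larger chunks.
void parallel_for_sketch(int64_t numel, int64_t grain_size,
                         const std::function<void(int64_t, int64_t)>& body) {
  int64_t max_threads = std::max<int64_t>(1, std::thread::hardware_concurrency());
  int64_t num_chunks = std::min(max_threads, std::max<int64_t>(1, numel / grain_size));
  if (num_chunks <= 1) {
    body(0, numel);  // below the grain size (or one thread): run serially
    return;
  }
  int64_t chunk = (numel + num_chunks - 1) / num_chunks;
  std::vector<std::thread> workers;
  for (int64_t begin = 0; begin < numel; begin += chunk) {
    workers.emplace_back(body, begin, std::min(begin + chunk, numel));
  }
  for (auto& w : workers) {
    w.join();
  }
}

Tuning grain_size trades available parallelism against per-chunk overhead, which is why different values (3000 for cpu_index_kernel, 1024 for cpu_kernel_vec) are chosen here.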

For more on the rationale behind the grain size tuning, please see PR#30803.
To reproduce the DLRM performance reported here, please also have a look at
PR#23057, PR#24385 and PR#27804,
and export the environment variables as below:

export LD_PRELOAD=$HOME/anaconda3/lib/libjemalloc.so  (conda install jemalloc)
export KMP_BLOCKTIME=1
export KMP_AFFINITY="granularity=fine,compact,1,0"

@JianpingChen066 force-pushed the index_put branch 4 times, most recently from 0a13c4a to 77fe6b7 on December 2, 2019 04:50
@JianpingChen066 (Contributor Author) commented Dec 2, 2019

An index_put performance test case is provided in run_index_put.sh.
Its code is as follows:

import time
import random
import torch

from torch.testing._internal.common_utils import TestCase, run_tests
from hypothesis import given, settings
from hypothesis import strategies as st

class indexPutTest(TestCase):
  @settings(deadline=None)
  @given(batch_size=st.sampled_from([1000, 10000, 100000, 1000000]))
  def test_index_put(self, batch_size):
    print('\ntest_index_put:')
    random.seed(100)
    index0 = random.sample(range(batch_size), batch_size//10)
    index1 = random.sample(range(64), 30)
    index2 = [(x, y) for x in index0 for y in index1]
    indices = torch.LongTensor(index2)
    indices = torch.cat([indices, indices], dim=0)
    src = torch.zeros(batch_size, 64, device='cpu')
    value = torch.ones(indices.shape[0])
    indices_tuple = tuple(indices.t())
    start = time.time()
    for i in range(100):
      src.index_put_(indices_tuple, value, accumulate=True)
    end = time.time()
    # Each (x, y) pair appears twice in indices (after the cat above), so 100
    # accumulating index_put_ calls should leave src[x, y] == 200.0.
    i = 0
    sizeofList = len(index2)
    while i < sizeofList:
      x, y = index2[i]
      if src[x, y] != 200.0:
        print('not match !! src[{},{}] = {:.2f}'.format(x, y, src[x, y]))
      i += 1
    print('indices size:{}, time: {:.6f} ms/per_index_put'.format(len(indices), (end-start)*10))

if __name__=='__main__':
  run_tests()

The test script is:

ncores=28
max_core=$((ncores-1))
export KMP_BLOCKTIME=1
export KMP_AFFINITY="granularity=fine,compact,1,0"
numactl --physcpubind=0-$max_core python ./test_index_put.py

Running this test with and without this enhancement gives results like the below on an SKX machine:

| indices size | Origin with 28 cores | Origin with 1 core | This enhancement with 28 cores | This enhancement with 1 core |
| --- | --- | --- | --- | --- |
| 6000 | 0.051765 ms | 0.042577 ms | 0.049045 ms | 0.040627 ms |
| 60000 | 0.436990 ms | 0.416648 ms | 0.103984 ms | 0.397174 ms |
| 600000 | 4.992383 ms | 5.002289 ms | 0.492473 ms | 4.733541 ms |
| 6000000 | 84.923289 ms | 82.634895 ms | 4.652929 ms | 79.517519 ms |

@VitalyFedyunin @jgong5 @Jianhui-Li please review this code.

@JianpingChen066 changed the title Parallelize cpu index_put accumulate float type path by using cpu_ato… Parallelize cpu index_put accumulate float path with cpu_atomic_add_float Dec 5, 2019
@kostmo (Member) commented Dec 16, 2019

💊 CircleCI build failures summary and remediations

As of commit fd03348 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no CircleCI failures yet. 💚 💚

@JianpingChen066 force-pushed the index_put branch 2 times, most recently from 7a768e6 to a600266 on December 25, 2019 05:33
@ezyang requested a review from VitalyFedyunin February 3, 2020 16:01
@ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 3, 2020
@VitalyFedyunin (Contributor) commented:

You are benchmarking a single-threaded solution against a multi-threaded one. Please also measure the performance impact of running the old code single-threaded versus the new code single-threaded.

@VitalyFedyunin (Contributor) left a comment:

Linter failures are valid.

Contributor:

Please check the #13420 comments for why this looks like duplicated code.

Contributor Author:

Keeping it untouched at this time.

Contributor:

Please put a comment with the GitHub issue/PR explaining the value selection.

Contributor Author:

This is based on the DLRM experience and the test unit I wrote. As with the one above, it needs more time to analyze the relation between thread-launch overhead and computation to find a better grain size; I haven't dug into that yet. Thanks.

Collaborator:

More extensive benchmarking for different sizes would be valuable; if you've already done this, please let us know.

Contributor Author:

Test results have been updated against the latest PyTorch code.

Collaborator:

So why would this PR improve single-threaded performance? Is it just flakiness of the benchmark? Also, why are the original numbers for 1 core and 28 cores different? They should be the same.

Contributor Author:

This is most likely variance. For a single core, it goes into the previous sequential index_put and does not use the atomic float add, because I added an if condition there:

bool use_parallel_for = ((iter.numel() >= internal::GRAIN_SIZE) && (at::get_num_threads() > 1));
if (iter.dtype() == at::ScalarType::Float && use_parallel_for) {
  // cpu_atomic_add_float path
} else {
  // original path
}

I enhanced the test case to run 100 iterations instead of 10 to get the average performance over 100 iterations, and updated the numbers above. As for why the modified code looks a little better, it may be due to the code layout change.

@facebook-github-bot (Contributor) left a comment:

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ddkalamk added a commit to ddkalamk/pytorch that referenced this pull request Feb 13, 2020
…l PyTorch team

Except for the catArray-related code, all other code is in the following four PRs:
pytorch#23057 add_out_dense_sparse enhancement
pytorch#24385 embedding_bag_forward index_select_add enhancement
pytorch#27804 embedding_backward_sparse_fast_path_sum
pytorch#29705 index_put accumulate path enhancement for float type
@JianpingChen066 force-pushed the index_put branch 3 times, most recently from 089299a to a2d9b63 on February 24, 2020 09:20
@ngimel (Collaborator) commented Mar 10, 2020

The single-thread performance improvement in the last table is highly suspect: an atomic implementation of index_add cannot possibly be faster than regular addition, and grain sizes should not affect single-threaded execution.

@JianpingChen066 force-pushed the index_put branch 2 times, most recently from 8234391 to d1916be on March 11, 2020 06:36
@JianpingChen066 (Contributor Author) commented:
@pytorchbot retest please

@facebook-github-bot (Contributor) left a comment:

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@VitalyFedyunin (Contributor) commented:

This fails the Android and iOS builds with errors like:

... /usr/lib/clang/8.0.1/include/mmintrin.h:47:5: error: use of undeclared identifier '__builtin_ia32_emms'; did you mean '__builtin_isless'?
    __builtin_ia32_emms();
    ^
... /usr/lib/clang/8.0.1/include/mmintrin.h:47:25: error: too few arguments to function call, expected 2, have 0
    __builtin_ia32_emms();
                        ^
... /usr/lib/clang/8.0.1/include/mmintrin.h:64:19: error: use of undeclared identifier '__builtin_ia32_vec_init_v2si'
    return (__m64)__builtin_ia32_vec_init_v2si(__i, 0);

@VitalyFedyunin (Contributor) commented:

With @ngimel's help we identified that the fault lies with #include <immintrin.h>.

@ngimel (Collaborator) commented Mar 18, 2020

_mm_pause is x86-specific: https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/x86-specific-memory-model-extensions-for-transactional-memory.html.
Guarding immintrin.h and _mm_pause for mobile builds would probably solve the issue.
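
For illustration, a guard along these lines might look like the following (the macro name CPU_RELAX and the exact placement are assumptions, not the actual patch):

// Only pull in immintrin.h / _mm_pause on non-mobile x86 targets.
#if !defined(C10_MOBILE) && (defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86))
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()
#else
// Mobile / non-x86 builds: fall back to a no-op spin hint.
#define CPU_RELAX() static_cast<void>(0)
#endif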

@JianpingChen066 force-pushed the index_put branch 2 times, most recently from c161aba to dd3cf0a on March 19, 2020 05:54
@JianpingChen066 (Contributor Author) commented:

@ngimel please have a look at the modification.
By the way, can the mobile build be cross-compiled? If I want to check the mobile build locally, which command line should I use? Thanks.

@ngimel (Collaborator) commented Mar 19, 2020

Looks good; we'll need to rerun the internal builds. Unfortunately I don't know how to reproduce them externally; we'll probably need to put some mobile build into public CI.

@facebook-github-bot (Contributor) left a comment:

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:
@VitalyFedyunin merged this pull request in 65ff064.

XiaobingSuper pushed a commit to XiaobingSuper/pytorch that referenced this pull request Apr 30, 2020

Labels

Merged · open source · triaged