
Conversation

@JianpingChen066
Copy link
Contributor

This tries to reduce the overhead of index_select on the CPU path in DLRM (https://github.com/facebookresearch/dlrm). Making src contiguous lets it take the parallelized path in the Tensor indexSelect function.
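For context, the effect is visible from Python: a non-contiguous src (here produced by select(), with hypothetical shapes standing in for DLRM's embedding tables) can be made contiguous before the call. A minimal sketch:

```python
import torch

# A non-contiguous src, e.g. a view produced by select() on the last dim
# (hypothetical shapes, standing in for DLRM's actual tensors)
base = torch.randn(1000, 64, 2)
src = base.select(2, 0)          # shape (1000, 64), non-contiguous view
idx = torch.randint(0, 1000, (100,), dtype=torch.long)

assert not src.is_contiguous()
# Making src contiguous first lets index_select use the parallelized path
out = torch.index_select(src.contiguous(), 0, idx)
assert torch.equal(out, torch.index_select(src, 0, idx))
```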

@pytorchbot pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: operators labels Jul 19, 2019
@JianpingChen066
Copy link
Contributor Author

@VitalyFedyunin @fmassa @jgong5 please review this. Thanks

@fmassa
Copy link
Member

fmassa commented Jul 19, 2019

Can you share some representative sizes of the tensors that are used in this model?

Also, can you share some benchmark numbers before / after?

@cpuhrsch cpuhrsch added module: performance Issues related to performance, either of kernel code or framework glue enhancement Not as big of a feature, but technically not a bug. Should be easy to fix high priority labels Jul 19, 2019
@cpuhrsch
Copy link
Contributor

@zou3519 - you might care about this as well since it's related to EmbeddingBag conceptually.

@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 19, 2019
@JianpingChen066
Copy link
Contributor Author

@fmassa
For the DLRM benchmark, run the test as below:
numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=1000 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling

When running this benchmark, it's better to set "export KMP_BLOCKTIME=1".

Without this patch, the CPU-path benchmark result using the latest PyTorch version (2ac9abf) is:
Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 2997.01 ms/it, loss 0.085002, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 2971.44 ms/it, loss 0.084647, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 2976.13 ms/it, loss 0.084571, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 2972.50 ms/it, loss 0.084110, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 2977.38 ms/it, loss 0.084307, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 2976.58 ms/it, loss 0.084587, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 2980.39 ms/it, loss 0.084244, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 3008.62 ms/it, loss 0.084789, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 2974.28 ms/it, loss 0.084187, accuracy 0.000 %

Applying this patch on the same PyTorch version, the result is:
Finished training it 100/1000 of epoch 0, 1596.95 ms/it, loss 0.220505, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 1598.42 ms/it, loss 0.085002, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 1619.37 ms/it, loss 0.084647, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 1597.37 ms/it, loss 0.084571, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 1595.12 ms/it, loss 0.084110, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 1596.97 ms/it, loss 0.084307, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 1595.59 ms/it, loss 0.084587, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 1597.95 ms/it, loss 0.084244, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 1598.83 ms/it, loss 0.084789, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 1601.12 ms/it, loss 0.084187, accuracy 0.000 %

Thanks

@Jianhui-Li
Copy link

@jspark1105 @dmudiger

@dmudiger
Copy link

dmudiger commented Jul 23, 2019

cc: @bddppq (Junjie Bai), in reference to our earlier discussion

@bddppq
Copy link
Contributor

bddppq commented Jul 23, 2019

@pytorchbot rebase this please

@VitalyFedyunin VitalyFedyunin self-requested a review July 23, 2019 16:48
Copy link
Contributor


You can get better performance by moving all src_is_contiguous related code into this if branch.

Also feel free to simplify the code by:

src = THTensor_(newContiguous)(src);
// do something with src
c10::raw::intrusive_ptr::decref(src);

as THTensor_(newContiguous)(src) doesn't copy already contiguous tensors.
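The same no-copy behavior is visible from Python via Tensor.contiguous(), which returns the same tensor object when it is already contiguous. A quick check:

```python
import torch

x = torch.randn(4, 4)
assert x.is_contiguous()
assert x.contiguous() is x        # already contiguous: no copy, same object

y = x.t()                         # transpose is a non-contiguous view
assert not y.is_contiguous()
z = y.contiguous()                # this one does allocate and copy
assert z is not y
assert torch.equal(z, y)
```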

Copy link
Contributor Author


Yes, the code has been refined.
Thanks

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin
VitalyFedyunin previously approved these changes Aug 1, 2019
Copy link
Contributor

@apaszke apaszke left a comment


Are we sure this is always beneficial? While this might make it faster in some cases, for some (e.g. small/medium sized arrays that fit in cache) it's only likely to be slower...

@vincentqb
Copy link
Contributor

vincentqb commented Aug 2, 2019

Are we sure this is always beneficial? While this might make it faster in some cases, for some (e.g. small/medium sized arrays that fit in cache) it's only likely to be slower...

Due to this comment, I'm holding on landing this PR from the "needs land" queue.

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 6, 2019

@apaszke
Hi, Adam
Are there any existing performance test cases for small/medium sized arrays so I can check this? Or is there a heuristic limit similar to TH_OMP_OVERHEAD_THRESHOLD that I could use to set a threshold?
Thanks

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 6, 2019

Hi, Adam
What do you think about also using TH_OMP_OVERHEAD_THRESHOLD as the threshold for calling newContiguous on src, like below:

auto src_size0 = THTensor_sizeLegacyNoScalars(src, 0);
ptrdiff_t rowsize = src_size0 == 0 ? 1 : THTensor_(nElement)(src) / src_size0;
auto omp_threshold = TH_OMP_OVERHEAD_THRESHOLD;
if (src->dim() > 1) {
  omp_threshold = TH_OMP_OVERHEAD_THRESHOLD / rowsize;
}
bool copied = !THTensor_(isContiguous)(src) && (numel > omp_threshold);
if (copied) {
  src = THTensor_(newContiguous)(src);
}
...
if (copied) {  // re-checking isContiguous here would skip the decref and leak
  c10::raw::intrusive_ptr::decref(src);
}

Would this ensure the parallel code path is only taken when it can outperform the sequential code?
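For illustration, the heuristic above can be mirrored in Python (assuming numel is the number of indices; should_make_contiguous is a hypothetical helper, not part of the patch):

```python
import torch

TH_OMP_OVERHEAD_THRESHOLD = 100000  # value defined in THTensorApply.hpp

def should_make_contiguous(src, num_indices):
    # Mirror of the proposed C heuristic: only pay for the copy when the
    # parallel path would be taken anyway.
    src_size0 = src.shape[0] if src.dim() > 0 else 1
    rowsize = max(src.numel() // src_size0, 1) if src_size0 else 1
    threshold = TH_OMP_OVERHEAD_THRESHOLD
    if src.dim() > 1:
        threshold = TH_OMP_OVERHEAD_THRESHOLD // rowsize
    return (not src.is_contiguous()) and num_indices > threshold

# Large non-contiguous src with many indices: the copy pays off
big = torch.randn(10000, 64, 2).select(2, 0)
assert should_make_contiguous(big, 100000)

# Small workloads or already-contiguous src: skip the copy
small = torch.randn(10, 4, 2).select(2, 0)
assert not should_make_contiguous(small, 5)
assert not should_make_contiguous(torch.randn(100, 64), 1000000)
```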

With just this kind of modification, the DLRM benchmark also achieves:
Finished training it 1000/1000 of epoch 0, 1542.06 ms/it, loss 0.098095, accuracy 0.000 %
about a 2x improvement over the original:
Finished training it 1000/1000 of epoch 0, 2974.28 ms/it, loss 0.084187, accuracy 0.000 %

Regards,
Jianping

@JianpingChen066
Copy link
Contributor Author

@VitalyFedyunin @fmassa @jgong5 please take another look at this. Thanks

@apaszke
Copy link
Contributor

apaszke commented Aug 8, 2019

@JianpingChen066 It's hard in general to pick the right threshold for every occasion, because it depends on the interactions between multiple non-trivial CPU features. Can you please try to run this operation on tensors of varying sizes (with indices both coalesced/sorted and completely scattered over the range) and report the timings before and after the patch?
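A minimal sketch of the requested sweep (the sizes and the bench_ms helper are illustrative assumptions, not a final benchmark):

```python
import time
import torch

def bench_ms(src, idx, iters=20):
    # Warm up once, then time index_select over several iterations
    torch.index_select(src, 0, idx)
    start = time.time()
    for _ in range(iters):
        torch.index_select(src, 0, idx)
    return (time.time() - start) / iters * 1000

for n in (1000, 100000, 1000000):
    src = torch.randn(n, 64)
    scattered = torch.randint(0, n, (n // 10,))   # scattered over the range
    coalesced = scattered.sort().values           # sorted/coalesced variant
    print(n, bench_ms(src, scattered), bench_ms(src, coalesced))
```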

@JianpingChen066
Copy link
Contributor Author

@apaszke
Hi, Adam
Are there any existing test cases for the indexSelect functions?
We only want src to be as contiguous as possible so that the parallel version outperforms the sequential one; shouldn't this be unrelated to whether the indices are coalesced, sorted, or scattered over the range?

@ezyang
Copy link
Contributor

ezyang commented Aug 9, 2019

@pytorchbot rebase this please

@VitalyFedyunin VitalyFedyunin dismissed their stale review August 9, 2019 15:13

No OMP references please

@ezyang
Copy link
Contributor

ezyang commented Aug 9, 2019

@VitalyFedyunin @ilia-cher Can we add some sort of lint for this rule? It's very easy to forget...

@facebook-github-bot
Copy link
Contributor

@ezyang merged this pull request in 0002448.

@VitalyFedyunin
Copy link
Contributor

PR reverted from the master. You can rebase and keep working on it.

@VitalyFedyunin VitalyFedyunin reopened this Aug 9, 2019
@zou3519 zou3519 reopened this Aug 9, 2019
yf225 pushed a commit to yf225/pytorch that referenced this pull request Aug 11, 2019
@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 12, 2019

@VitalyFedyunin Yes, though TH_OMP_OVERHEAD_THRESHOLD is in the previous code and defined in THTensorApply.hpp. It is used by indexSelect as the parallel threshold for at::parallel_for. May I rename it to TH_PARALLEL_OVERHEAD_THRESHOLD? Thanks

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 12, 2019

@VitalyFedyunin @apaszke
I have written a test case with different index sizes, combined with src being contiguous or not, as below:
import time
import random
import torch

from common_utils import TestCase, run_tests
from hypothesis import given, settings
from hypothesis import strategies as st

class indexSelectTest(TestCase):
    @settings(deadline=None)
    @given(batch_size=st.sampled_from([10000000, 1000, 10000, 1000000]))
    def test_index_select(self, batch_size):
        print('\ntest_index_select:')
        random.seed(100)
        index = random.sample(range(batch_size), batch_size//10)
        src = torch.randn(batch_size, 64, device='cpu')
        idx = torch.tensor(index, dtype=torch.long, device='cpu')
        start = time.time()
        for i in range(100):
            dst = torch.index_select(src, 0, idx)
        end = time.time()
        print('index size:{}, time: {:.6f} ms/per_index_select'.format(len(index), (end-start)*10))

    @settings(deadline=None)
    @given(batch_size=st.sampled_from([10000000, 1000, 10000, 1000000]))
    def test_index_select_non_contiguous(self, batch_size):
        print('\ntest_index_select_non_contiguous:')
        random.seed(100)
        index = random.sample(range(batch_size), batch_size//10)
        src = torch.randn(batch_size, 64, 2, device='cpu')
        non_contiguous_src = src.select(2, 0)
        idx = torch.tensor(index, dtype=torch.long, device='cpu')
        start = time.time()
        for i in range(100):
            dst = torch.index_select(non_contiguous_src, 0, idx)
        end = time.time()
        print('index size:{}, time: {:.6f} ms/per_index_select'.format(len(index), (end-start)*10))

if __name__ == '__main__':
    run_tests()

The results for the original indexSelect function are:
test_index_select:
index size:1000000, time: 65.922041 ms/per_index_select
index size:100000, time: 6.105187 ms/per_index_select
index size:1000, time: 0.055368 ms/per_index_select
index size:100, time: 0.005496 ms/per_index_select
test_index_select_non_contiguous:
index size:1000000, time: 2191.913469 ms/per_index_select
index size:100000, time: 213.450601 ms/per_index_select
index size:1000, time: 1.596482 ms/per_index_select
index size:100, time: 0.161395 ms/per_index_select

The results with this patch are:
test_index_select:
index size:1000000, time: 66.463988 ms/per_index_select
index size:100000, time: 6.164207 ms/per_index_select
index size:1000, time: 0.056939 ms/per_index_select
index size:100, time: 0.005827 ms/per_index_select
test_index_select_non_contiguous:
index size:1000000, time: 750.100091 ms/per_index_select
index size:100000, time: 72.395074 ms/per_index_select
index size:1000, time: 1.787453 ms/per_index_select
index size:100, time: 0.172462 ms/per_index_select

Small sizes see some impact that is almost negligible, while larger sizes (more than 100000 indices) with non-contiguous src get almost a 3x improvement.

Should I also check in the test case?
Thanks

@fmassa
Copy link
Member

fmassa commented Aug 15, 2019

Can you also add a test case to your benchmark where the number of indexed elements differs from the size of the dimension being indexed? Similar to https://github.com//pull/13420
For example, 1024×1024 -> 128×1024?
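That case can be sketched directly (shapes taken from the comment above):

```python
import torch

src = torch.randn(1024, 1024)
idx = torch.randint(0, 1024, (128,), dtype=torch.long)

# Number of indexed elements (128) differs from the indexed dim size (1024)
out = torch.index_select(src, 0, idx)
assert out.shape == (128, 1024)
```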

@JianpingChen066 JianpingChen066 force-pushed the indexSelect branch 2 times, most recently from a6ed8c7 to c644da6 Compare August 16, 2019 05:53
@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 16, 2019

@fmassa
Hi Francisco
I modified the code a little so it has less impact when dim is not 0,
and added some test cases to test_index_select.py.
The result is like below:
test_index_select:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 1000000 | 40.03756 | 40.477538 |
| 100000 | 5.590582 | 5.971432 |
| 1000 | 0.078225 | 0.077724 |
| 100 | 0.007057 | 0.006747 |

test_index_select_non_contiguous:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 1000000 | 2231.867933 | 427.461815 |
| 100000 | 240.299988 | 213.389063 |
| 1000 | 1.578188 | 1.731372 |
| 100 | 0.167203 | 0.173616 |

test_index_select_non_contiguous_dim0:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 128 | 0.341105 | 0.34554 |
| 131072 | 891.35623 | 771.395373 |
| 16384 | 111.90021 | 111.182475 |
| 512 | 1.488233 | 1.529646 |

test_index_select_non_contiguous_dim1:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 128 | 0.943565 | 0.925708 |
| 131072 | 6306.76558 | 6298.30513 |
| 16384 | 447.111034 | 445.347142 |
| 512 | 7.793069 | 7.451367 |

I put my test case and test scripts at https://gist.github.com/JianpingChen066/189380ee159313b644ab1be6d601b57b

Copy link
Contributor

@VitalyFedyunin VitalyFedyunin left a comment


Overall good, but you can get MUCH better performance if you move this operator to ATen (see examples in #24507).

#define ORDIN_TH_OMP_OVERHEAD_THRESHOLD 20000
#define UNCERTAIN_TH_OMP_OVERHEAD_THRESHOLD 50000
#define TH_OMP_OVERHEAD_THRESHOLD 100000
#define TH_PARALLEL_OVERHEAD_THRESHOLD 80000
Copy link
Contributor


Please move it into THTensorEvenMoreMath.cpp as a constexpr; we are trying to get rid of THTensorApply.
Also, a name like parallel_overhead_threshold_index_select would be better (plus a comment on how we selected the 80000 value).

@ifedan
Copy link
Contributor

ifedan commented Oct 16, 2019

Overall good, but you can get MUCH better performance if you move this operator to aten (see examples in #24507

I agree with @VitalyFedyunin; I think it's a good candidate to move from TH to ATen and to start using TensorIterator.

@JianpingChen066
Copy link
Contributor Author

@ifedan @VitalyFedyunin
We are working on moving indexSelect from TH to ATen. Thanks

@JianpingChen066
Copy link
Contributor Author

@ifedan @VitalyFedyunin
Mingfei is improving index_select performance and moving it to aten/src/ATen/native/Indexing.cpp; please refer to his PR #30598. I will close this PR now.

Thanks
