
Conversation

@JianpingChen066
Copy link
Contributor

This tries to reduce the overhead of index_select on the CPU path in DLRM (https://github.com/facebookresearch/dlrm). Making src contiguous lets it take the parallelized path in the Tensor indexSelect function.
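For context, the effect is visible from Python: a non-contiguous src (here produced by select(), with hypothetical shapes standing in for DLRM's embedding tables) can be made contiguous before the call. A minimal sketch:

```python
import torch

# A non-contiguous src, e.g. a view produced by select() on the last dim
# (hypothetical shapes, standing in for DLRM's actual tensors)
base = torch.randn(1000, 64, 2)
src = base.select(2, 0)          # shape (1000, 64), non-contiguous view
idx = torch.randint(0, 1000, (100,), dtype=torch.long)

assert not src.is_contiguous()
# Making src contiguous first lets index_select use the parallelized path
out = torch.index_select(src.contiguous(), 0, idx)
assert torch.equal(out, torch.index_select(src, 0, idx))
```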

@pytorchbot pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: operators labels Jul 19, 2019
@JianpingChen066
Copy link
Contributor Author

@VitalyFedyunin @fmassa @jgong5 please review this. Thanks

@fmassa
Copy link
Member

fmassa commented Jul 19, 2019

Can you share some representative sizes of the tensors that are used in this model?

Also, can you share some benchmark numbers before / after?

@cpuhrsch cpuhrsch added module: performance Issues related to performance, either of kernel code or framework glue enhancement Not as big of a feature, but technically not a bug. Should be easy to fix high priority labels Jul 19, 2019
@cpuhrsch
Copy link
Contributor

@zou3519 - you might care about this as well since it's related to EmbeddingBag conceptually.

@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 19, 2019
@JianpingChen066
Copy link
Contributor Author

@fmassa
For the DLRM benchmark, run the test as below:
numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=1000 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling

When running this benchmark, it's better to set "export KMP_BLOCKTIME=1".

Without this patch, the CPU-path benchmark result using the latest PyTorch version (2ac9abf) is:
Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 2997.01 ms/it, loss 0.085002, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 2971.44 ms/it, loss 0.084647, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 2976.13 ms/it, loss 0.084571, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 2972.50 ms/it, loss 0.084110, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 2977.38 ms/it, loss 0.084307, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 2976.58 ms/it, loss 0.084587, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 2980.39 ms/it, loss 0.084244, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 3008.62 ms/it, loss 0.084789, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 2974.28 ms/it, loss 0.084187, accuracy 0.000 %

Applying this patch on the same PyTorch version, the result is:
Finished training it 100/1000 of epoch 0, 1596.95 ms/it, loss 0.220505, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 1598.42 ms/it, loss 0.085002, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 1619.37 ms/it, loss 0.084647, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 1597.37 ms/it, loss 0.084571, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 1595.12 ms/it, loss 0.084110, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 1596.97 ms/it, loss 0.084307, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 1595.59 ms/it, loss 0.084587, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 1597.95 ms/it, loss 0.084244, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 1598.83 ms/it, loss 0.084789, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 1601.12 ms/it, loss 0.084187, accuracy 0.000 %

Thanks

@Jianhui-Li
Copy link

@jspark1105 @dmudiger

@dmudiger
Copy link

dmudiger commented Jul 23, 2019

cc: @bddppq (Junjie Bai), in reference to our earlier discussion

@bddppq
Copy link
Contributor

bddppq commented Jul 23, 2019

@pytorchbot rebase this please

@VitalyFedyunin VitalyFedyunin self-requested a review July 23, 2019 16:48
Copy link
Contributor


You can get better performance by moving all src_is_contiguous related code into this if branch.

Also feel free to simplify the code by:

src = THTensor_(newContiguous)(src);
// do something with src
c10::raw::intrusive_ptr::decref(src);

as THTensor_(newContiguous)(src) doesn't copy already contiguous tensors.
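The same no-copy behavior is visible from Python via Tensor.contiguous(), which returns the same tensor object when it is already contiguous. A quick check:

```python
import torch

x = torch.randn(4, 4)
assert x.is_contiguous()
assert x.contiguous() is x        # already contiguous: no copy, same object

y = x.t()                         # transpose is a non-contiguous view
assert not y.is_contiguous()
z = y.contiguous()                # this one does allocate and copy
assert z is not y
assert torch.equal(z, y)
```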

Copy link
Contributor Author


Yes, the code has been refined.
Thanks

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin
VitalyFedyunin previously approved these changes Aug 1, 2019
Copy link
Contributor

@apaszke apaszke left a comment


Are we sure this is always beneficial? While this might make it faster in some cases, for some (e.g. small/medium sized arrays that fit in cache) it's only likely to be slower...

@vincentqb
Copy link
Contributor

vincentqb commented Aug 2, 2019

Are we sure this is always beneficial? While this might make it faster in some cases, for some (e.g. small/medium sized arrays that fit in cache) it's only likely to be slower...

Due to this comment, I'm holding on landing this PR from the "needs land" queue.

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 6, 2019

@apaszke
Hi, Adam
Are there any existing performance test cases for small/medium sized arrays so I can check this? Or is there a heuristic limit similar to TH_OMP_OVERHEAD_THRESHOLD that I could use to set a threshold?
Thanks

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 6, 2019

Hi, Adam
What do you think about also using TH_OMP_OVERHEAD_THRESHOLD as the threshold for calling newContiguous on src, like below:

auto src_size0 = THTensor_sizeLegacyNoScalars(src, 0);
ptrdiff_t rowsize = src_size0 == 0 ? 1 : THTensor_(nElement)(src) / src_size0;
auto omp_threshold = TH_OMP_OVERHEAD_THRESHOLD;
if (src->dim() > 1) {
  omp_threshold = TH_OMP_OVERHEAD_THRESHOLD / rowsize;
}
bool copied = !THTensor_(isContiguous)(src) && (numel > omp_threshold);
if (copied) {
  src = THTensor_(newContiguous)(src);
}
...
if (copied) {  // re-checking isContiguous here would skip the decref and leak
  c10::raw::intrusive_ptr::decref(src);
}

Would this ensure the parallel code path is only taken when it can outperform the sequential code?
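For illustration, the heuristic above can be mirrored in Python (assuming numel is the number of indices; should_make_contiguous is a hypothetical helper, not part of the patch):

```python
import torch

TH_OMP_OVERHEAD_THRESHOLD = 100000  # value defined in THTensorApply.hpp

def should_make_contiguous(src, num_indices):
    # Mirror of the proposed C heuristic: only pay for the copy when the
    # parallel path would be taken anyway.
    src_size0 = src.shape[0] if src.dim() > 0 else 1
    rowsize = max(src.numel() // src_size0, 1) if src_size0 else 1
    threshold = TH_OMP_OVERHEAD_THRESHOLD
    if src.dim() > 1:
        threshold = TH_OMP_OVERHEAD_THRESHOLD // rowsize
    return (not src.is_contiguous()) and num_indices > threshold

# Large non-contiguous src with many indices: the copy pays off
big = torch.randn(10000, 64, 2).select(2, 0)
assert should_make_contiguous(big, 100000)

# Small workloads or already-contiguous src: skip the copy
small = torch.randn(10, 4, 2).select(2, 0)
assert not should_make_contiguous(small, 5)
assert not should_make_contiguous(torch.randn(100, 64), 1000000)
```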

With just this kind of modification, the DLRM benchmark also achieves:
Finished training it 1000/1000 of epoch 0, 1542.06 ms/it, loss 0.098095, accuracy 0.000 %
about a 2x improvement over the original:
Finished training it 1000/1000 of epoch 0, 2974.28 ms/it, loss 0.084187, accuracy 0.000 %

Regards,
Jianping

@JianpingChen066
Copy link
Contributor Author

@VitalyFedyunin @fmassa @jgong5 please take another look at this. Thanks

@apaszke
Copy link
Contributor

apaszke commented Aug 8, 2019

@JianpingChen066 It's hard in general to pick the right threshold for every occasion, because it depends on the interactions between multiple non-trivial CPU features. Can you please try to run this operation on tensors of varying sizes (with indices both coalesced/sorted and completely scattered over the range) and report the timings before and after the patch?
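A minimal sketch of the requested sweep (the sizes and the bench_ms helper are illustrative assumptions, not a final benchmark):

```python
import time
import torch

def bench_ms(src, idx, iters=20):
    # Warm up once, then time index_select over several iterations
    torch.index_select(src, 0, idx)
    start = time.time()
    for _ in range(iters):
        torch.index_select(src, 0, idx)
    return (time.time() - start) / iters * 1000

for n in (1000, 100000, 1000000):
    src = torch.randn(n, 64)
    scattered = torch.randint(0, n, (n // 10,))   # scattered over the range
    coalesced = scattered.sort().values           # sorted/coalesced variant
    print(n, bench_ms(src, scattered), bench_ms(src, coalesced))
```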

@JianpingChen066
Copy link
Contributor Author

@apaszke
Hi, Adam
Are there any existing test cases for the indexSelect functions?
We only want src to be as contiguous as possible so that the parallel version outperforms the sequential one; shouldn't this be unrelated to whether the indices are coalesced, sorted, or scattered over the range?

@ezyang
Copy link
Contributor

ezyang commented Aug 9, 2019

@pytorchbot rebase this please

@VitalyFedyunin VitalyFedyunin dismissed their stale review August 9, 2019 15:13

No OMP references please

@ezyang
Copy link
Contributor

ezyang commented Aug 9, 2019

@VitalyFedyunin @ilia-cher Can we add some sort of lint for this rule? It's very easy to forget...

@facebook-github-bot
Copy link
Contributor

@ezyang merged this pull request in 0002448.

@VitalyFedyunin
Copy link
Contributor

PR reverted from the master. You can rebase and keep working on it.

@VitalyFedyunin VitalyFedyunin reopened this Aug 9, 2019
@zou3519 zou3519 reopened this Aug 9, 2019
yf225 pushed a commit to yf225/pytorch that referenced this pull request Aug 11, 2019
@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 12, 2019

@VitalyFedyunin Yes, though TH_OMP_OVERHEAD_THRESHOLD is in the previous code and defined in THTensorApply.hpp. It is used by indexSelect as the parallel threshold for at::parallel_for. May I rename it to TH_PARALLEL_OVERHEAD_THRESHOLD? Thanks

@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 12, 2019

@VitalyFedyunin @apaszke
I have written a test case with different index sizes, combined with src being contiguous or not, as below:
import time
import random
import torch

from common_utils import TestCase, run_tests
from hypothesis import given, settings
from hypothesis import strategies as st

class indexSelectTest(TestCase):
    @settings(deadline=None)
    @given(batch_size=st.sampled_from([10000000, 1000, 10000, 1000000]))
    def test_index_select(self, batch_size):
        print('\ntest_index_select:')
        random.seed(100)
        index = random.sample(range(batch_size), batch_size//10)
        src = torch.randn(batch_size, 64, device='cpu')
        idx = torch.tensor(index, dtype=torch.long, device='cpu')
        start = time.time()
        for i in range(100):
            dst = torch.index_select(src, 0, idx)
        end = time.time()
        print('index size:{}, time: {:.6f} ms/per_index_select'.format(len(index), (end-start)*10))

    @settings(deadline=None)
    @given(batch_size=st.sampled_from([10000000, 1000, 10000, 1000000]))
    def test_index_select_non_contiguous(self, batch_size):
        print('\ntest_index_select_non_contiguous:')
        random.seed(100)
        index = random.sample(range(batch_size), batch_size//10)
        src = torch.randn(batch_size, 64, 2, device='cpu')
        non_contiguous_src = src.select(2, 0)
        idx = torch.tensor(index, dtype=torch.long, device='cpu')
        start = time.time()
        for i in range(100):
            dst = torch.index_select(non_contiguous_src, 0, idx)
        end = time.time()
        print('index size:{}, time: {:.6f} ms/per_index_select'.format(len(index), (end-start)*10))

if __name__ == '__main__':
    run_tests()

The results for the original indexSelect function are:
test_index_select:
index size:1000000, time: 65.922041 ms/per_index_select
index size:100000, time: 6.105187 ms/per_index_select
index size:1000, time: 0.055368 ms/per_index_select
index size:100, time: 0.005496 ms/per_index_select
test_index_select_non_contiguous:
index size:1000000, time: 2191.913469 ms/per_index_select
index size:100000, time: 213.450601 ms/per_index_select
index size:1000, time: 1.596482 ms/per_index_select
index size:100, time: 0.161395 ms/per_index_select

The results with this patch are:
test_index_select:
index size:1000000, time: 66.463988 ms/per_index_select
index size:100000, time: 6.164207 ms/per_index_select
index size:1000, time: 0.056939 ms/per_index_select
index size:100, time: 0.005827 ms/per_index_select
test_index_select_non_contiguous:
index size:1000000, time: 750.100091 ms/per_index_select
index size:100000, time: 72.395074 ms/per_index_select
index size:1000, time: 1.787453 ms/per_index_select
index size:100, time: 0.172462 ms/per_index_select

Small sizes see some impact that is almost negligible, while larger sizes (more than 100000 indices) with non-contiguous src get almost a 3x improvement.

Should I also check in the test case?
Thanks

@fmassa
Copy link
Member

fmassa commented Aug 15, 2019

Can you also add a test case to your benchmark where the number of indexed elements differs from the size of the dimension being indexed? Similar to https://github.com//pull/13420
For example, 1024×1024 -> 128×1024?
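That case can be sketched directly (shapes taken from the comment above):

```python
import torch

src = torch.randn(1024, 1024)
idx = torch.randint(0, 1024, (128,), dtype=torch.long)

# Number of indexed elements (128) differs from the indexed dim size (1024)
out = torch.index_select(src, 0, idx)
assert out.shape == (128, 1024)
```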

@JianpingChen066 JianpingChen066 force-pushed the indexSelect branch 2 times, most recently from a6ed8c7 to c644da6 Compare August 16, 2019 05:53
@JianpingChen066
Copy link
Contributor Author

JianpingChen066 commented Aug 16, 2019

@fmassa
Hi Francisco
I modified the code a little so it has less impact when dim is not 0,
and added some test cases to test_index_select.py.
The result is like below:
test_index_select:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 1000000 | 40.03756 | 40.477538 |
| 100000 | 5.590582 | 5.971432 |
| 1000 | 0.078225 | 0.077724 |
| 100 | 0.007057 | 0.006747 |

test_index_select_non_contiguous:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 1000000 | 2231.867933 | 427.461815 |
| 100000 | 240.299988 | 213.389063 |
| 1000 | 1.578188 | 1.731372 |
| 100 | 0.167203 | 0.173616 |

test_index_select_non_contiguous_dim0:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 128 | 0.341105 | 0.34554 |
| 131072 | 891.35623 | 771.395373 |
| 16384 | 111.90021 | 111.182475 |
| 512 | 1.488233 | 1.529646 |

test_index_select_non_contiguous_dim1:

| index size | Old (ms) | New (ms) |
| --- | --- | --- |
| 128 | 0.943565 | 0.925708 |
| 131072 | 6306.76558 | 6298.30513 |
| 16384 | 447.111034 | 445.347142 |
| 512 | 7.793069 | 7.451367 |

I put my test case and test scripts at https://gist.github.com/JianpingChen066/189380ee159313b644ab1be6d601b57b

Copy link
Contributor

@VitalyFedyunin VitalyFedyunin left a comment


Overall good, but you can get MUCH better performance if you move this operator to ATen (see examples in #24507).

#define ORDIN_TH_OMP_OVERHEAD_THRESHOLD 20000
#define UNCERTAIN_TH_OMP_OVERHEAD_THRESHOLD 50000
#define TH_OMP_OVERHEAD_THRESHOLD 100000
#define TH_PARALLEL_OVERHEAD_THRESHOLD 80000
Copy link
Contributor


Please move it into THTensorEvenMoreMath.cpp as a constexpr; we are trying to get rid of THTensorApply.
Also, a name like parallel_overhead_threshold_index_select would be better (plus a comment on how we selected the 80000 value).

@ifedan
Copy link
Contributor

ifedan commented Oct 16, 2019

Overall good, but you can get MUCH better performance if you move this operator to aten (see examples in #24507

I agree with @VitalyFedyunin; I think it's a good candidate to move from TH to ATen and to start using TensorIterator.

@JianpingChen066
Copy link
Contributor Author

@ifedan @VitalyFedyunin
We are working on moving indexSelect from TH to ATen. Thanks

@JianpingChen066
Copy link
Contributor Author

@ifedan @VitalyFedyunin
Mingfei is improving index_select performance and moving it to aten/src/ATen/native/Indexing.cpp; please refer to his PR #30598. I will close this PR now.

Thanks
