Enhance Tensor indexSelect performance #23055
Conversation
@VitalyFedyunin @fmassa @jgong5, please review this. Thanks
Can you share some representative sizes of the tensors that are used in this model? Also, can you share some benchmark numbers before/after?
@zou3519 - you might care about this as well since it's related to EmbeddingBag conceptually. |
@fmassa When running this benchmark, it is better to `export KMP_BLOCKTIME=1` first. Without this patch, the CPU-path benchmark result using the latest PyTorch version (2ac9abf) is as below: Applying this patch on the same PyTorch version, the result is: Thanks
cc: @bddppq (Junjie Bai), in reference to our earlier discussion |
@pytorchbot rebase this please |
You can get better performance by moving all the `src_is_contiguous`-related code into this `if` branch.
Also feel free to simplify the code by:

```cpp
src = THTensor_(newContiguous)(src);
// do something with src
c10::raw::intrusive_ptr::decref(src);
```

as `THTensor_(newContiguous)(src)` doesn't copy already-contiguous tensors.
Yes, the code has been refined. Thanks
Force-pushed bacc00d to 060c90e
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
apaszke left a comment:
Are we sure this is always beneficial? While this might make it faster in some cases, for others (e.g. small/medium-sized arrays that fit in cache) it's likely to only be slower...
Due to this comment, I'm holding off on landing this PR from the "needs land" queue.
@apaszke |
Hi Adam,

```cpp
auto src_size0 = THTensor_sizeLegacyNoScalars(src, 0);
```

So this can ensure that the parallel code is executed and gets better performance than the sequential code? Just with this kind of modification, the DLRM benchmark also gets the performance as: Regards,
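For context, a minimal sketch of the threshold-gated parallel pattern under discussion, assuming a contiguous 2-D `src`; the function name, constant placement, and row-copy loop are illustrative, not the PR's exact code:

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cstdint>

// Illustrative grain size: when the index count is below this value,
// at::parallel_for runs the lambda sequentially, so the parallel path
// only pays off for larger inputs.
constexpr int64_t kParallelOverheadThreshold = 80000;

// Hypothetical helper: gather rows of a contiguous 2-D float src.
void index_select_rows(float* dst, const float* src, const int64_t* idx,
                       int64_t num_indices, int64_t row_size) {
  at::parallel_for(0, num_indices, kParallelOverheadThreshold,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      const float* src_row = src + idx[i] * row_size;
      std::copy(src_row, src_row + row_size, dst + i * row_size);
    }
  });
}
```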
Force-pushed 060c90e to 74d1fdb
@VitalyFedyunin @fmassa @jgong5, please take a further look at this. Thanks
@JianpingChen066 It's hard in general to pick the right threshold for every occasion, because it depends on the interactions between multiple non-trivial CPU features. Can you please try to run this operation on tensors of varying sizes (with indices both coalesced/sorted and completely scattered over the range) and report the timings before and after the patch?
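A hypothetical libtorch sketch of the benchmark being requested here (the sizes, repeat count, and embedding width of 64 are arbitrary choices, not from the PR):

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <iostream>

int main() {
  for (int64_t n : {1000, 10000, 100000, 1000000}) {
    auto src = at::randn({n, 64});
    auto sorted = at::arange(n, at::kLong);       // coalesced/sorted indices
    auto scattered = at::randperm(n, at::kLong);  // scattered over the range
    for (const auto& idx : {sorted, scattered}) {
      auto t0 = std::chrono::steady_clock::now();
      for (int rep = 0; rep < 100; ++rep) {
        auto out = at::index_select(src, 0, idx);
      }
      auto t1 = std::chrono::steady_clock::now();
      std::cout << "n=" << n << " avg="
                << std::chrono::duration<double>(t1 - t0).count() / 100
                << "s\n";
    }
  }
  return 0;
}
```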
@apaszke |
@pytorchbot rebase this please |
@VitalyFedyunin @ilia-cher Can we add some sort of lint for this rule? It's very easy to forget...
The PR was reverted from master. You can rebase and keep working on it.
This reverts commit 0002448.
@VitalyFedyunin Yes, though TH_OMP_OVERHEAD_THRESHOLD is from the previous code and is defined in THTensorApply.hpp. It is used by indexSelect as the parallel threshold in at::parallel_for. May I change it to TH_PARALLEL_OVERHEAD_THRESHOLD? Thanks
@VitalyFedyunin @apaszke

```python
from common_utils import TestCase, run_tests
from hypothesis import settings  # assumed import for @settings below

class indexSelectTest(TestCase):
    @settings(deadline=None)
    def test_index_select(self):  # method name reconstructed
        ...  # test body elided in the original comment

if __name__ == '__main__':
    run_tests()
```

The result for the original indexSelect function is as below: The result for this patch is: Small sizes may see some impact, but it is almost negligible, while larger sizes with a non-contiguous src (larger than 100000) get an almost 3x improvement. Do I also need to check in a test case?
Force-pushed e59bedd to 42b934e
Can you also add a test case to your benchmark where the number of indexed elements is different from the dimension being indexed? Similar to https://github.com/pytorch/pytorch/pull/13420
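For instance, a hypothetical shape combination where 5000 rows are selected out of a dimension of size 100000:

```cpp
#include <ATen/ATen.h>

int main() {
  // Select fewer rows (5000) than the size of the indexed dim (100000).
  auto src = at::randn({100000, 64});
  auto idx = at::randint(100000, {5000}, at::kLong);
  auto out = at::index_select(src, 0, idx);  // out has shape {5000, 64}
  return 0;
}
```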
Force-pushed a6ed8c7 to c644da6
@fmassa I've put my test case and test scripts at https://gist.github.com/JianpingChen066/189380ee159313b644ab1be6d601b57b
Force-pushed c644da6 to da6dc65
VitalyFedyunin left a comment:
Overall good, but you can get MUCH better performance if you move this operator to ATen (see examples in #24507).
```cpp
#define ORDIN_TH_OMP_OVERHEAD_THRESHOLD 20000
#define UNCERTAIN_TH_OMP_OVERHEAD_THRESHOLD 50000
#define TH_OMP_OVERHEAD_THRESHOLD 100000
#define TH_PARALLEL_OVERHEAD_THRESHOLD 80000
```
Please move it into THTensorEvenMoreMath.cpp as a `constexpr`; we are trying to get rid of THTensorApply.
Also, something like `parallel_overhead_threshold_index_select` would be a better name (plus a comment on how we selected this 80000 value).
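A sketch of what that suggestion could look like (the rationale in the comment is inferred from this thread, not quoted from the PR):

```cpp
#include <cstdint>

// In THTensorEvenMoreMath.cpp, replacing the TH_PARALLEL_OVERHEAD_THRESHOLD
// macro from THTensorApply.hpp. The value 80000 was chosen empirically:
// below roughly this many elements, the thread-launch overhead outweighs
// the parallel speedup in the indexSelect benchmarks.
constexpr int64_t parallel_overhead_threshold_index_select = 80000;
```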
I agree with @VitalyFedyunin; I think it's a good candidate to be moved from TH to ATen and to start using TensorIterator.
@ifedan @VitalyFedyunin |
@ifedan @VitalyFedyunin Thanks |
This tries to reduce the overhead of index_select on the CPU path in DLRM (https://github.com/facebookresearch/dlrm). Making src contiguous allows it to take the parallelized path in the Tensor indexSelect function.