Remove sync in embedding #70943
Conversation
Benchmarked at https://github.com/zasdfgbnm/things/blob/master/2022/embedding-benchmark.ipynb, this PR improves performance compared to master. Code:

```python
import torch

embedding1 = torch.nn.Embedding(28996, 768, padding_idx=0, device='cuda')
embedding2 = torch.nn.Embedding(512, 768, device='cuda')
embedding3 = torch.nn.Embedding(2, 768, device='cuda')

input1 = torch.randint(embedding1.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input2 = torch.randint(embedding2.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input3 = torch.randint(embedding3.num_embeddings, (32, 128), dtype=torch.long, device='cuda')

# Warm up the CUDA context and caching allocator before timing.
for _ in range(100):
    torch.arange(1000000, device='cuda')

def run50sync(f):
    # Launch 50 iterations back to back, then synchronize once, so the
    # timing reflects GPU work rather than per-launch overhead alone.
    for _ in range(50):
        f()
    torch.cuda.synchronize()

def benchmark():
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding1(input1).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding2(input2).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding3(input3).sum().backward())
```

Shapes are extracted from the Hugging Face BERT example: https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification
Marking as ready for review.
Rebased and should be ready for review.
So there's no perf regression, but still a memory regression?
@ngimel You are right. There is a memory regression. For speed, I am comparing
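For anyone wanting to quantify that memory regression, here is a minimal sketch of how one might compare peak allocator usage between master and this branch (illustration only, reusing the first shape from the benchmark above; not part of the PR):

```python
import torch

embedding = torch.nn.Embedding(28996, 768, padding_idx=0, device='cuda')
inp = torch.randint(embedding.num_embeddings, (32, 128), dtype=torch.long, device='cuda')

torch.cuda.reset_peak_memory_stats()
embedding(inp).sum().backward()
torch.cuda.synchronize()

# Run once on each branch and compare the two numbers.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```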
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge this please |
Summary:
This together with #66580 and #68376 will remove all syncs in embedding. This PR includes #68376; please review after merging #68376.

This PR introduces perf regressions and increases memory usage:
- `exclusive_sum` now computes the entire `numel` elements instead of `num_of_segments` elements, and the trailing `numel - num_of_segments` results are discarded.
- Some memory allocations now need `numel` space instead of `num_of_segments` or `num_of_partial_segments`.

These are the prices we must pay in order to get a sync-free implementation. I haven't done any benchmark yet; I will do it later.

Pull Request resolved: #70943
Reviewed By: H-Huang
Differential Revision: D34881660
Pulled By: ngimel
fbshipit-source-id: b0760fa33608c46cd4145ceb09878bf94a9f959d
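To see where the syncs come from, here is an illustrative Python-level sketch (not the actual CUDA kernel code): the segment count is produced on the device, so sizing an allocation by its exact value forces a blocking device-to-host copy, whereas sizing by the `numel` upper bound does not.

```python
import torch

indices = torch.randint(1024, (1 << 20,), device='cuda')
sorted_idx, _ = indices.sort()

# A segment is a run of equal values in the sorted indices;
# the count lives on the GPU as a zero-dim tensor.
num_of_segments = (sorted_idx[1:] != sorted_idx[:-1]).sum() + 1

# Exact-size allocation: .item() copies the count to the host and blocks
# until the GPU catches up -- this is the kind of sync being removed.
exact = torch.empty(num_of_segments.item(), device='cuda')

# Sync-free alternative: allocate for the worst case (numel) and let the
# kernels ignore the unused tail. Costs memory, saves the sync.
upper_bound = torch.empty(indices.numel(), device='cuda')
```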
`numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions. This PR moves to a tighter upper bound. Benchmark with a Jupyter notebook:

```python
import torch

num_embeddings = 1024
embedding_dim = 512
e = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()
size = 1 * 1024 * 1024
i = torch.arange(size, device='cuda') % num_embeddings
o = e(i)
g = torch.randn_like(o)
torch.cuda.synchronize()
```

```python
%%timeit
o.backward(g, retain_graph=True)
torch.cuda.synchronize()
```

Before #70943: 3.6 ms
After #70943: 6.9 ms
With this PR: 3.55 ms

Pull Request resolved: #78588
Approved by: https://github.com/ngimel
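The bound can be tightened because each segment in the sorted index list corresponds to one distinct embedding row. A rough illustration, assuming the tighter bound is `min(numel, num_embeddings)` (an assumption here; see the PR diff for the exact bound used):

```python
import torch

num_embeddings = 1024
indices = torch.arange(1 << 20, device='cuda') % num_embeddings
sorted_idx, _ = indices.sort()

# Segments are runs of equal values in the sorted indices, so there is at
# most one segment per distinct index value, i.e. at most num_embeddings.
num_of_segments = ((sorted_idx[1:] != sorted_idx[:-1]).sum() + 1).item()

print(indices.numel(), num_embeddings, num_of_segments)  # 1048576 1024 1024
```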
Summary: `numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions. This PR moves to a tighter upper bound (full description and benchmark in the comment above).

Pull Request resolved: #78588
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5bcbad76148a2808ae8331b8cfbd42885269a8d9
Reviewed By: seemethere
Differential Revision: D36815606
Pulled By: seemethere
fbshipit-source-id: 0d4c349c7bd8aa8e2c9209d6ec15415867f722a0
This together with #66580 and #68376 will remove all syncs in embedding.
This PR includes #68376; please review after merging #68376.
This PR introduces perf regressions and increases memory usage:
- `exclusive_sum` now computes the entire `numel` elements instead of `num_of_segments` elements, and the trailing `numel - num_of_segments` results are discarded.
- Some memory allocations now need `numel` space instead of `num_of_segments` or `num_of_partial_segments`.

These are the prices we must pay in order to get a sync-free implementation.
I haven't done any benchmark yet. I will do it later.
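As a rough illustration of the first point above (a Python sketch with made-up numbers, not the CUDA implementation): the exclusive prefix sum runs over all `numel` slots so its output size is known at kernel-launch time, and the tail past `num_of_segments` is simply discarded.

```python
import torch

numel = 8
num_of_segments = 3  # in the real kernel this value exists only on the GPU

# Per-segment counts, padded with zeros out to numel.
counts = torch.tensor([4, 2, 2, 0, 0, 0, 0, 0], device='cuda')

# Exclusive sum over all numel elements; only the first num_of_segments
# entries are meaningful, the trailing numel - num_of_segments are waste.
offsets = torch.cumsum(counts, dim=0) - counts

print(offsets[:num_of_segments])  # tensor([0, 4, 6], device='cuda:0')
```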