
Conversation

@zasdfgbnm (Collaborator) commented Jan 6, 2022

This, together with #66580 and #68376, will remove all syncs in embedding.

This PR includes #68376; please review it after #68376 is merged.

This PR introduces perf regressions and increases memory usage:

  • exclusive_sum now computes over all numel elements instead of only num_of_segments elements, and the trailing numel - num_of_segments results are discarded.
  • Some allocations now need space for numel elements instead of num_of_segments or num_of_partial_segments.

This is the price we must pay for a sync-free implementation.

I haven't benchmarked this yet; I will do it later.
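To illustrate the buffer-sizing tradeoff described above, here is a minimal PyTorch-level sketch (the names `segment_counts`, `segment_offsets`, and the toy sizes are illustrative assumptions, not the actual kernel code):

```python
import torch

# Illustration only: in a sync-free implementation the host never reads back
# num_of_segments, so device buffers are sized for the worst case (numel) and
# the exclusive sum runs over the whole buffer; the tail is simply ignored.
numel = 8
num_of_segments = 3          # known only on the device in the real kernel

segment_counts = torch.zeros(numel, dtype=torch.long)
segment_counts[:num_of_segments] = torch.tensor([4, 2, 2])

# Exclusive prefix sum over all numel entries (inclusive cumsum minus self).
exclusive_sum = torch.cumsum(segment_counts, dim=0) - segment_counts

segment_offsets = exclusive_sum[:num_of_segments]   # tensor([0, 4, 6]); the rest is discarded
```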

@pytorch-probot (bot) commented Jan 6, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/01b9dc565e8cdd6fd13ad2ab8a80de76e192b880/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflow | Labels (bold = enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot (Contributor) commented Jan 6, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8a79275 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@zasdfgbnm marked this pull request as draft on January 6, 2022 19:24
@zasdfgbnm requested a review from ngimel on January 6, 2022 19:26
@zasdfgbnm (Collaborator, Author) commented:

Benchmarked at https://github.com/zasdfgbnm/things/blob/master/2022/embedding-benchmark.ipynb; this PR improves performance compared to master:

# master (fcb7e528d986fb4df148a43959cd25fe3cfa5842)
36.5 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
18 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# this PR (4214bd808e136baec1e6aad7c16f975772370aca)
33.9 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13.4 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.1 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

code:

import torch

# Embedding sizes taken from BERT-base: word, position, and token-type embeddings.
embedding1 = torch.nn.Embedding(28996, 768, padding_idx=0, device='cuda')
embedding2 = torch.nn.Embedding(512, 768, device='cuda')
embedding3 = torch.nn.Embedding(2, 768, device='cuda')

input1 = torch.randint(embedding1.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input2 = torch.randint(embedding2.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input3 = torch.randint(embedding3.num_embeddings, (32, 128), dtype=torch.long, device='cuda')

# Warm up the GPU and the CUDA caching allocator before timing.
for _ in range(100):
    torch.arange(1000000, device='cuda')

def run50sync(f):
    # Enqueue 50 iterations, then wait for the GPU so the measured wall-clock
    # time reflects actual kernel execution time.
    for _ in range(50):
        f()
    torch.cuda.synchronize()

def benchmark():
    # %timeit is an IPython magic, so this runs in a Jupyter notebook.
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding1(input1).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding2(input2).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding3(input3).sum().backward())

benchmark()  # produces the three timings shown above

The shapes are extracted from the Hugging Face BERT text-classification example at https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification, run as:

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --dataset_name imdb  \
  --do_train \
  --do_predict \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/imdb/

@zasdfgbnm (Collaborator, Author) commented:

Marking as ready for review

@zasdfgbnm marked this pull request as ready for review on March 8, 2022 01:37
@bdhirsh added the 'triaged' label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Mar 8, 2022
@zasdfgbnm (Collaborator, Author) commented:

Rebased and should be ready for review.

@ngimel (Collaborator) commented Mar 9, 2022

So there's no perf regression anymore, but still a memory regression?

@zasdfgbnm (Collaborator, Author) commented:

@ngimel You are right, there is still a memory regression. For speed, I am comparing cub::DeviceSelect::UniqueByKey with this PR against thrust::unique_by_key without this PR, and the former is faster.
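For context, that step is conceptually a unique-by-key / run-length pass over the sorted indices; below is a rough PyTorch-level sketch of what it produces (illustrative only; the actual code runs cub/thrust kernels on the GPU, and the values are made up):

```python
import torch

# Sorted indices from the embedding backward (illustrative values).
sorted_indices = torch.tensor([0, 0, 0, 2, 2, 5, 5, 5, 5])

# One segment per distinct index value; the counts say how many gradient rows
# must be accumulated into each embedding row.
segments, counts = torch.unique_consecutive(sorted_indices, return_counts=True)
print(segments)  # tensor([0, 2, 5])
print(counts)    # tensor([3, 2, 4])
```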

@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented Mar 16, 2022

@pytorchbot merge this please

facebook-github-bot pushed a commit that referenced this pull request Mar 16, 2022
Summary:
This, together with #66580 and #68376, will remove all syncs in embedding.

This PR includes #68376; please review it after #68376 is merged.

This PR introduces perf regressions and increases memory usage:
- `exclusive_sum` now computes over all `numel` elements instead of only `num_of_segments` elements, and the trailing `numel - num_of_segments` results are discarded.
- Some allocations now need space for `numel` elements instead of `num_of_segments` or `num_of_partial_segments`.

This is the price we must pay for a sync-free implementation.

I haven't benchmarked this yet; I will do it later.

Pull Request resolved: #70943

Reviewed By: H-Huang

Differential Revision: D34881660

Pulled By: ngimel

fbshipit-source-id: b0760fa33608c46cd4145ceb09878bf94a9f959d
@github-actions (Contributor) commented:

Hey @zasdfgbnm.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@zasdfgbnm deleted the rm-more-sync branch on March 16, 2022 17:48
@zasdfgbnm added the 'topic: performance' label (topic category) on Mar 16, 2022
zasdfgbnm added a commit that referenced this pull request May 31, 2022
pytorchmergebot pushed a commit that referenced this pull request Jun 1, 2022
`numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions.

This PR moves to a tighter upper bound.

Benchmark in a Jupyter notebook:
```python
import torch
num_embeddings = 1024
embedding_dim = 512
e = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()

size = 1*1024*1024
i = torch.arange(size, device='cuda') % num_embeddings
o = e(i)
g = torch.randn_like(o)
torch.cuda.synchronize()
```
```python
%%timeit
o.backward(g, retain_graph=True)
torch.cuda.synchronize()
```

Before #70943: 3.6 ms
After #70943: 6.9 ms
With this PR: 3.55 ms
Pull Request resolved: #78588
Approved by: https://github.com/ngimel
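For intuition (my illustration, not text from the commit, and not necessarily the exact bound the PR adopts): the number of distinct segments can never exceed min(numel, num_embeddings), which for the benchmark above is 1024 rather than roughly one million:

```python
import torch

num_embeddings = 1024
size = 1 * 1024 * 1024
i = torch.arange(size) % num_embeddings

num_of_segments = torch.unique(i).numel()        # 1024 distinct indices
loose_bound = i.numel()                          # 1048576 (numel)
tighter_bound = min(i.numel(), num_embeddings)   # 1024
assert num_of_segments <= tighter_bound <= loose_bound
```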
facebook-github-bot pushed a commit that referenced this pull request Jun 2, 2022
Summary:
`numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions.

This PR moves to a tighter upper bound.

Benchmark in a Jupyter notebook:
```python
import torch
num_embeddings = 1024
embedding_dim = 512
e = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()

size = 1*1024*1024
i = torch.arange(size, device='cuda') % num_embeddings
o = e(i)
g = torch.randn_like(o)
torch.cuda.synchronize()
```
```python
%%timeit
o.backward(g, retain_graph=True)
torch.cuda.synchronize()
```

Before #70943: 3.6 ms
After #70943: 6.9 ms
With this PR: 3.55 ms

Pull Request resolved: #78588
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5bcbad76148a2808ae8331b8cfbd42885269a8d9

Reviewed By: seemethere

Differential Revision: D36815606

Pulled By: seemethere

fbshipit-source-id: 0d4c349c7bd8aa8e2c9209d6ec15415867f722a0