
Conversation

@zasdfgbnm (Collaborator) commented Jan 6, 2022

This, together with #66580 and #68376, will remove all syncs in embedding.

This PR includes #68376; please review it after #68376 is merged.

This PR introduces perf regressions and increases memory usage:

  • exclusive_sum now computes over all numel elements instead of only num_of_segments elements, and the trailing numel - num_of_segments results are discarded.
  • Some allocations now need space for numel elements instead of num_of_segments or num_of_partial_segments.

This is the price we must pay for a sync-free implementation.

I haven't benchmarked this yet; I will do it later.
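To illustrate the buffer-sizing tradeoff described above, here is a minimal PyTorch-level sketch (the names `segment_counts`, `segment_offsets`, and the toy sizes are illustrative assumptions, not the actual kernel code):

```python
import torch

# Illustration only: in a sync-free implementation the host never reads back
# num_of_segments, so device buffers are sized for the worst case (numel) and
# the exclusive sum runs over the whole buffer; the tail is simply ignored.
numel = 8
num_of_segments = 3          # known only on the device in the real kernel

segment_counts = torch.zeros(numel, dtype=torch.long)
segment_counts[:num_of_segments] = torch.tensor([4, 2, 2])

# Exclusive prefix sum over all numel entries (inclusive cumsum minus self).
exclusive_sum = torch.cumsum(segment_counts, dim=0) - segment_counts

segment_offsets = exclusive_sum[:num_of_segments]   # tensor([0, 4, 6]); the rest is discarded
```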

@pytorch-probot (bot) commented Jan 6, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/01b9dc565e8cdd6fd13ad2ab8a80de76e192b880/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflow | Labels (bold = enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot (Contributor) commented Jan 6, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8a79275 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@zasdfgbnm marked this pull request as draft on January 6, 2022 19:24
@zasdfgbnm requested a review from ngimel on January 6, 2022 19:26
@zasdfgbnm (Collaborator, Author) commented:

Benchmarked at https://github.com/zasdfgbnm/things/blob/master/2022/embedding-benchmark.ipynb; this PR improves performance compared to master:

# master (fcb7e528d986fb4df148a43959cd25fe3cfa5842)
36.5 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
18 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# this PR (4214bd808e136baec1e6aad7c16f975772370aca)
33.9 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13.4 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.1 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

code:

import torch

# Embedding sizes taken from BERT-base: word, position, and token-type embeddings.
embedding1 = torch.nn.Embedding(28996, 768, padding_idx=0, device='cuda')
embedding2 = torch.nn.Embedding(512, 768, device='cuda')
embedding3 = torch.nn.Embedding(2, 768, device='cuda')

input1 = torch.randint(embedding1.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input2 = torch.randint(embedding2.num_embeddings, (32, 128), dtype=torch.long, device='cuda')
input3 = torch.randint(embedding3.num_embeddings, (32, 128), dtype=torch.long, device='cuda')

# Warm up the GPU and the CUDA caching allocator before timing.
for _ in range(100):
    torch.arange(1000000, device='cuda')

def run50sync(f):
    # Enqueue 50 iterations, then wait for the GPU so the measured wall-clock
    # time reflects actual kernel execution time.
    for _ in range(50):
        f()
    torch.cuda.synchronize()

def benchmark():
    # %timeit is an IPython magic, so this runs in a Jupyter notebook.
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding1(input1).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding2(input2).sum().backward())
    torch.cuda.synchronize()
    %timeit run50sync(lambda: embedding3(input3).sum().backward())

benchmark()  # produces the three timings shown above

The shapes are extracted from the Hugging Face BERT text-classification example at https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification, run as:

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --dataset_name imdb  \
  --do_train \
  --do_predict \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/imdb/

@zasdfgbnm (Collaborator, Author) commented:

Marking as ready for review

@zasdfgbnm marked this pull request as ready for review on March 8, 2022 01:37
@bdhirsh added the 'triaged' label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Mar 8, 2022
@zasdfgbnm (Collaborator, Author) commented:

Rebased and should be ready for review.

@ngimel (Collaborator) commented Mar 9, 2022

So there's no perf regression anymore, but still a memory regression?

@zasdfgbnm (Collaborator, Author) commented:

@ngimel You are right, there is still a memory regression. For speed, I am comparing cub::DeviceSelect::UniqueByKey with this PR against thrust::unique_by_key without this PR, and the former is faster.
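For context, that step is conceptually a unique-by-key / run-length pass over the sorted indices; below is a rough PyTorch-level sketch of what it produces (illustrative only; the actual code runs cub/thrust kernels on the GPU, and the values are made up):

```python
import torch

# Sorted indices from the embedding backward (illustrative values).
sorted_indices = torch.tensor([0, 0, 0, 2, 2, 5, 5, 5, 5])

# One segment per distinct index value; the counts say how many gradient rows
# must be accumulated into each embedding row.
segments, counts = torch.unique_consecutive(sorted_indices, return_counts=True)
print(segments)  # tensor([0, 2, 5])
print(counts)    # tensor([3, 2, 4])
```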

@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented Mar 16, 2022

@pytorchbot merge this please

facebook-github-bot pushed a commit that referenced this pull request Mar 16, 2022
Summary:
This, together with #66580 and #68376, will remove all syncs in embedding.

This PR includes #68376; please review it after #68376 is merged.

This PR introduces perf regressions and increases memory usage:
- `exclusive_sum` now computes over all `numel` elements instead of only `num_of_segments` elements, and the trailing `numel - num_of_segments` results are discarded.
- Some allocations now need space for `numel` elements instead of `num_of_segments` or `num_of_partial_segments`.

This is the price we must pay for a sync-free implementation.

I haven't benchmarked this yet; I will do it later.

Pull Request resolved: #70943

Reviewed By: H-Huang

Differential Revision: D34881660

Pulled By: ngimel

fbshipit-source-id: b0760fa33608c46cd4145ceb09878bf94a9f959d
@github-actions (Contributor) commented:

Hey @zasdfgbnm.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@zasdfgbnm deleted the rm-more-sync branch on March 16, 2022 17:48
@zasdfgbnm added the 'topic: performance' label (topic category) on Mar 16, 2022
zasdfgbnm added a commit that referenced this pull request May 31, 2022
pytorchmergebot pushed a commit that referenced this pull request Jun 1, 2022
`numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions.

This PR moves to a tighter upper bound.

Benchmark in a Jupyter notebook:
```python
import torch
num_embeddings = 1024
embedding_dim = 512
e = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()

size = 1*1024*1024
i = torch.arange(size, device='cuda') % num_embeddings
o = e(i)
g = torch.randn_like(o)
torch.cuda.synchronize()
```
```python
%%timeit
o.backward(g, retain_graph=True)
torch.cuda.synchronize()
```

Before #70943: 3.6 ms
After #70943: 6.9 ms
With this PR: 3.55 ms
Pull Request resolved: #78588
Approved by: https://github.com/ngimel
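For intuition (my illustration, not text from the commit, and not necessarily the exact bound the PR adopts): the number of distinct segments can never exceed min(numel, num_embeddings), which for the benchmark above is 1024 rather than roughly one million:

```python
import torch

num_embeddings = 1024
size = 1 * 1024 * 1024
i = torch.arange(size) % num_embeddings

num_of_segments = torch.unique(i).numel()        # 1024 distinct indices
loose_bound = i.numel()                          # 1048576 (numel)
tighter_bound = min(i.numel(), num_embeddings)   # 1024
assert num_of_segments <= tighter_bound <= loose_bound
```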
facebook-github-bot pushed a commit that referenced this pull request Jun 2, 2022
Summary:
`numel` is too loose an upper bound for `num_of_segments` and `num_of_partial_segments`, which causes perf regressions.

This PR moves to a tighter upper bound.

Benchmark in a Jupyter notebook:
```python
import torch
num_embeddings = 1024
embedding_dim = 512
e = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()

size = 1*1024*1024
i = torch.arange(size, device='cuda') % num_embeddings
o = e(i)
g = torch.randn_like(o)
torch.cuda.synchronize()
```
```python
%%timeit
o.backward(g, retain_graph=True)
torch.cuda.synchronize()
```

Before #70943: 3.6 ms
After #70943: 6.9 ms
With this PR: 3.55 ms

Pull Request resolved: #78588
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5bcbad76148a2808ae8331b8cfbd42885269a8d9

Reviewed By: seemethere

Differential Revision: D36815606

Pulled By: seemethere

fbshipit-source-id: 0d4c349c7bd8aa8e2c9209d6ec15415867f722a0