
Conversation

@crcrpar (Collaborator) commented Feb 7, 2023

This duplicates the core functionality of https://github.com/pytorch/pytorch/blob/master/torch/utils/_foreach_utils.py#L21 in C++.

The changes are

  • manually setting device_guard and calling device_check in each fused kernel
  • incrementing step counts and optionally decrementing them

Ideally there should be a C++ equivalent that we can use for the foreach optimizers and for each foreach function, so that we can remove the for-loop in Python. Note that once each foreach function takes care of this grouping itself, the foreach optimizers would be doing the grouping redundantly; so it could still make sense for the foreach optimizers to iterate over the lists of grouped tensors.
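For reference, a minimal Python sketch of the kind of (device, dtype) bucketing the linked `_foreach_utils.py` helper performs; the function name and signature below are illustrative, not the actual PyTorch API:

```python
# Illustrative sketch only, not the actual torch.utils._foreach_utils code:
# bucket index-aligned tensor lists (e.g. params, grads, exp_avgs) by the
# (device, dtype) of the reference list so that each fused/foreach kernel
# launch only sees homogeneous tensors.
from typing import Dict, List, Tuple

import torch


def group_by_device_and_dtype(
    tensorlists: List[List[torch.Tensor]],
) -> Dict[Tuple[torch.device, torch.dtype], List[List[torch.Tensor]]]:
    grouped: Dict[Tuple[torch.device, torch.dtype], List[List[torch.Tensor]]] = {}
    for i, ref in enumerate(tensorlists[0]):  # the first list decides the bucket
        key = (ref.device, ref.dtype)
        bucket = grouped.setdefault(key, [[] for _ in tensorlists])
        for j, tensorlist in enumerate(tensorlists):
            bucket[j].append(tensorlist[i])
    return grouped
```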

related: #58833, #89591 (comment)

cc @ptrblck @ngimel

@pytorch-bot (bot) commented Feb 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94344

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit f0b08c8:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn (release notes category) label Feb 7, 2023
@albanD (Collaborator) commented Feb 7, 2023

Nice!
Any numbers on the perf improvement?

@soulitzer soulitzer added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Feb 8, 2023
@crcrpar (Collaborator, Author) commented Feb 9, 2023

With the huggingface/transformers gpt2-medium example (script) on A100x4:

master (2.0.0a0+git1767026)

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     2.8572
  train_runtime            = 0:01:49.16
  train_samples            =       1200
  train_samples_per_second =     32.978
  train_steps_per_second   =      1.044

this PR

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     2.8572
  train_runtime            = 0:01:39.96
  train_samples            =       1200
  train_samples_per_second =     36.014
  train_steps_per_second   =       1.14

machine: DGXStation (A100x4, AMD EPYC7742 x 64)

Command:

python run_clm.py --model_name_or_path gpt2-medium --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir /workspace/results --max_train_samples 1200 --warmup_steps 8 --overwrite_output_dir --optim adamw_torch

In the first run, this check

if not all(

errored out, so I removed it and things worked. I'll remove it.

@crcrpar (Collaborator, Author) commented Feb 9, 2023

t5-large

master

***** train metrics *****
  epoch                    =        1.0
  train_loss               =     2.6017
  train_runtime            = 0:00:39.23
  train_samples            =       5000
  train_samples_per_second =     127.43
  train_steps_per_second   =      1.019

this PR

***** train metrics *****
  epoch                    =        1.0
  train_loss               =     2.6017
  train_runtime            = 0:00:39.14
  train_samples            =       5000
  train_samples_per_second =    127.737
  train_steps_per_second   =      1.022

command:

python run_translation.py --model_name_or_path t5-large --do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 --max_source_length 512 --max_target_length 512 --num_train_epochs 1 --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config ro-en --source_prefix "translate English to Romanian:"  --warmup_steps 50 --max_train_samples 5000 --dataloader_num_workers 2 --output_dir /workspace/result_t5-large --overwrite_output_dir --optim adamw_torch

@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 8ff00e0 to cff18cb on February 10, 2023 01:53
@janeyx99 (Contributor) commented

Thanks for looking into this! Is your long-term goal here to replace the Python grouping util with a generalized C++ util, or is this PR just for fusedAdamW?

@crcrpar (Collaborator, Author) commented Feb 16, 2023

Is your long-term goal here to replace the Python grouping util with a generalized C++ util, or is this PR just for fusedAdamW?

The former. That said, I'm not sure how to avoid multiple groupings for the _multi_tensor optimizer functions at the moment.

@janeyx99 (Contributor) commented
Yea, I agree with the concern if we were to expand foreach grouping to C++ as a generic util.

I have a more general concern about adding grouping for general foreach ops (like add, mul, etc.): will this add significant perf cost for foreach calls where the tensors were already properly grouped? What is the perf difference if we added grouping for every foreach op vs. the existing checks we do today? Do you have benchmarks comparing the existing check logic vs. the grouping logic?

If the difference is significant, would a potential solution be to allow users to pass in a skip_grouping flag that says "I'm sure I don't need grouping, just run + if it fails it's my fault"? This would require an API change though, which is likely BC-breaking.

cc'ing @ngimel as well

@crcrpar (Collaborator, Author) commented Feb 17, 2023

If the difference is significant, would a potential solution be to allow users to pass in a skip_grouping flag that says "I'm sure I don't need grouping, just run + if it fails it's my fault"? This would require an API change though, which is likely BC-breaking.

I had a similar thought: adding a flag to each foreach function and/or exposing from C++ to Python some function that takes a list of Tensors and returns Dict[torch.device, Dict[torch.dtype, List[List[Tensor]]]], which could be utilized by

def _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16):
as well
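As a rough illustration (not a concrete proposal), a Python sketch of how `_unscale_grads_` could consume such a helper: bucket the grads by (device, dtype), keep one `inv_scale`/`found_inf` replica per device, and run the fused check-and-unscale per bucket. Everything here except `torch._amp_foreach_non_finite_check_and_unscale_` is an assumed/illustrative name:

```python
# Sketch under assumptions: the grouping is inlined here for brevity, and the
# real GradScaler._unscale_grads_ uses _MultiDeviceReplicator for the
# per-device copies of inv_scale / found_inf.
from collections import defaultdict

import torch


def unscale_grads_sketch(grads, inv_scale, found_inf):
    buckets = defaultdict(list)  # (device, dtype) -> grads in that bucket
    for g in grads:
        buckets[(g.device, g.dtype)].append(g)

    per_device = {}  # device -> (inv_scale replica, found_inf replica)
    for (device, _dtype), device_grads in buckets.items():
        if device not in per_device:
            per_device[device] = (
                inv_scale.to(device, non_blocking=True),
                found_inf.to(device, non_blocking=True),
            )
        dev_inv_scale, dev_found_inf = per_device[device]
        # Fused foreach op: checks for inf/nan and unscales in one pass,
        # writing the result of the check into dev_found_inf in place.
        torch._amp_foreach_non_finite_check_and_unscale_(
            device_grads, dev_found_inf, dev_inv_scale
        )
    return {device: fi for device, (_, fi) in per_device.items()}
```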

@crcrpar (Collaborator, Author) commented Feb 22, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented
@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator) commented
Successfully rebased fused-optimizers-faster-grouping onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fused-optimizers-faster-grouping && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fused-optimizers-faster-grouping branch from cff18cb to 93384db on February 22, 2023 22:55
@crcrpar crcrpar mentioned this pull request Feb 23, 2023
@janeyx99 (Contributor) commented
I would agree that having a separate C++ util exposed that could be optionally called from relevant endpoints would be better than changing the API haha. Do you have the numbers for how long the grouping would take with the gpt2-medium params?

pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2023
…95847)

Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. As a result, parameters can get mistakenly updated by gradients some of whose elements are invalid, i.e. nan or inf.

Related #94060

I forgot about this wrong handling after #94344

Pull Request resolved: #95847
Approved by: https://github.com/janeyx99
crcrpar added a commit to crcrpar/pytorch that referenced this pull request Mar 6, 2023
(same fix as #95847 above)
@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 93384db to 39681b9 on March 8, 2023 17:40
@crcrpar (Collaborator, Author) commented Mar 10, 2023

Some more numbers (top: this PR, bottom: master).

"Self" numbers look good.

FP32 - the same gpt2 script as above

                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                              Optimizer.step#AdamW.step         0.14%      21.845ms         0.46%      72.387ms       3.619ms       0.000us         0.00%     161.130ms       8.056ms            20  
                                    aten::_fused_adamw_         0.04%       6.821ms         0.32%      50.472ms       1.262ms     161.130ms         1.68%     161.130ms       4.028ms            40  
                    Optimizer.zero_grad#AdamW.zero_grad         0.09%      14.349ms         5.52%     865.262ms      43.263ms       0.000us         0.00%      19.468ms     973.400us            2

---

                              Optimizer.step#AdamW.step         0.19%      29.690ms         0.46%      71.838ms       3.592ms       0.000us         0.00%     161.619ms       8.081ms            20  
                                    aten::_foreach_add_         0.02%       2.650ms         0.02%       3.294ms      82.350us     180.000us         0.00%     180.000us       4.500us            40  
                                    aten::_fused_adamw_         0.02%       3.216ms         0.25%      38.790ms     969.750us     161.439ms         1.69%     161.439ms       4.036ms            40  
                    Optimizer.zero_grad#AdamW.zero_grad         0.09%      14.145ms         5.53%     867.559ms      43.378ms       0.000us         0.00%      19.454ms     972.700us            2

FP16 AMP (autocast & GradScaler)

                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                              Optimizer.step#AdamW.step         0.06%       8.618ms         0.13%      19.301ms     965.050us       0.000us         0.00%     162.830ms       8.142ms            20
                                    aten::_fused_adamw_         0.06%       8.422ms         0.07%      10.665ms     266.625us     162.830ms         1.91%     162.830ms       4.071ms            40
                    Optimizer.zero_grad#AdamW.zero_grad         0.10%      14.414ms         0.36%      52.794ms       2.640ms       0.000us         0.00%      19.235ms     961.750us            20

---

                              Optimizer.step#AdamW.step         0.13%      18.071ms         0.20%      28.279ms       1.414ms       0.000us         0.00%     162.600ms       8.130ms            20
                                    aten::_foreach_add_         0.02%       2.582ms         0.02%       3.014ms      75.350us     160.000us         0.00%     160.000us       4.000us            40  
                                    aten::_fused_adamw_         0.02%       2.775ms         0.03%       4.242ms     106.050us     162.197ms         1.90%     162.197ms       4.055ms            40  
                                    aten::_foreach_sub_         0.02%       2.478ms         0.02%       2.938ms      73.450us     243.000us         0.00%     243.000us       6.075us            40
                    Optimizer.zero_grad#AdamW.zero_grad         0.10%      13.828ms         0.37%      53.763ms       2.688ms       0.000us         0.00%      19.232ms     961.600us            20

ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
(same fix as #95847 above)
Review comment (Contributor):

Is only_device_check simply is_step_tensor? Nit: a more indicative name would be clearer if so.

@janeyx99 (Contributor) commented
The numbers do look good! I have some questions:

  1. Is it correct that this PR moved the grouping logic onto CUDA by calling it within the kernel? If so, are the numbers for fused_adam(w) basically showing that the grouping adds no significant time at all (and takes less time than the previous checking)?
  2. I'm trying to think ahead to what the generic grouping function in c++ would look like. It seems like the distance between the current grouping operator and the generic one is mainly the check ensuring the tensorlists were similar enough. Could we already make this generic by allowing a flag to do this check vs not? Or we could have another util that will just group a list of tensors instead of a nested list.
  3. I would prefer if we could already land the generalized version in this PR (and have it live in foreach_utils) if there are no downsides.

@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 39681b9 to 0c82432 on March 25, 2023 01:09
@crcrpar (Collaborator, Author) commented Mar 28, 2023

  1. Is it correct that this PR moved the grouping logic onto CUDA by calling it within the kernel? If so, are the numbers for fused_adam(w) basically showing that the grouping adds no significant time at all (and takes less time than the previous checking)?

Yes. IIUC the difference mainly comes from doing this in C++ vs. Python, including setting the device guard and handling the map/dict of device to Tensor for the grad scaler & found_inf.

  • I'm trying to think ahead to what the generic grouping function in c++ would look like. It seems like the distance between the current grouping operator and the generic one is mainly the check ensuring the tensorlists were similar enough. Could we already make this generic by allowing a flag to do this check vs not? Or we could have another util that will just group a list of tensors instead of a nested list.
  • I would prefer if we could already land the generalized version in this PR (and have it live in foreach_utils) if there are no downsides.

I've been playing with torch/csrc/Module.cpp and have yet to figure out what will suit SGD's logic here better:

momentum_buffer_list.append(None)

I guess having a flag to tell whether or not the momentum buffers need initialization would work well, e.g. https://github.com/NVIDIA/apex/blob/7150e20cc3adef34e3f36261a1070fc0882f16a7/apex/optimizers/fused_sgd.py#L121-L136, though I think that would be BC-breaking.

crcrpar added a commit to crcrpar/pytorch that referenced this pull request Mar 29, 2023
(same fix as #95847 above)
Review comment (Contributor) on lines 212 to 245:

Should this logic just be moved before the for loop so it doesn't keep checking with count?

Review comment (Contributor):

To my understanding, this function would only be used by the optimizers, yes?

Review comment (Contributor):

Will you be switching this to call the generic foreach util once that's ready? I think that would be ideal (vs. having a copy in the fused_adam_utils), and I'd be curious whether the perf results would be different.

@janeyx99 (Contributor) commented
Yes. IIUC the difference mainly comes from doing this in C++ vs. Python, including setting the device guard and handling the map/dict of device to Tensor for the grad scaler & found_inf.

Ah, so this means the wins won't be as prominent for the other optimizers/foreach ops, but should be a general win nonetheless.

I've been playing with torch/csrc/Module.cpp and have yet to figure out what will suit SGD's logic here better:

momentum_buffer_list.append(None)

I guess having a flag to tell whether or not the momentum buffers need initialization would work well, e.g. https://github.com/NVIDIA/apex/blob/7150e20cc3adef34e3f36261a1070fc0882f16a7/apex/optimizers/fused_sgd.py#L121-L136, though I think that would be BC-breaking.

Is the problem here that the momentum buffer list may be all Nones initially? I think the way I dealt with it before was just including the with_indices flag and having SGD use the indices to update the original buffer. Is that insufficient?
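For concreteness, a minimal sketch of the with_indices idea (illustrative helper names, not the actual torch.optim internals): each bucket also carries the positions of its tensors in the original flat lists, so SGD can lazily create momentum buffers and write them back into the caller's momentum_buffer_list.

```python
# Sketch under assumptions: plain SGD with momentum only (no weight decay,
# dampening, or nesterov); the grouping is done in Python for illustration.
from collections import defaultdict

import torch


def group_indices_by_device_and_dtype(params):
    buckets = defaultdict(list)  # (device, dtype) -> indices into the flat lists
    for i, p in enumerate(params):
        buckets[(p.device, p.dtype)].append(i)
    return buckets


@torch.no_grad()
def sgd_step_sketch(params, grads, momentum_buffer_list, lr=1e-3, momentum=0.9):
    for indices in group_indices_by_device_and_dtype(params).values():
        device_params = [params[i] for i in indices]
        bufs = []
        for i in indices:
            buf = momentum_buffer_list[i]
            if buf is None:
                # Lazily initialize the buffer and propagate it back to the
                # caller's list through the original index.
                buf = torch.clone(grads[i]).detach()
                momentum_buffer_list[i] = buf
            else:
                buf.mul_(momentum).add_(grads[i])
            bufs.append(buf)
        # One foreach update per homogeneous (device, dtype) bucket.
        torch._foreach_add_(device_params, bufs, alpha=-lr)
```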

atalman pushed a commit that referenced this pull request Apr 5, 2023
…95847) (#97885): same fix as #95847 above
crcrpar added 7 commits April 5, 2023 17:59
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> (×7)
@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 0c82432 to 7f92a8b on April 6, 2023 05:00
crcrpar added 3 commits April 5, 2023 22:13
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> (×3)
@crcrpar crcrpar closed this Apr 23, 2023
@janeyx99 (Contributor) commented
Hey! Why did you close the PR? Sorry if you had been waiting on a review; I did not think it was ready before and would be happy to review.

@crcrpar (Collaborator, Author) commented Apr 27, 2023

@janeyx99 I found this branch a bit too messed up, so I refactored it into #100007.

@janeyx99 (Contributor) commented
Thanks for the update; I'll look there!

pytorchmergebot pushed a commit that referenced this pull request Jun 9, 2023
malfet pushed a commit that referenced this pull request Jun 20, 2023
rel: #94344
Pull Request resolved: #100007
Approved by: https://github.com/janeyx99

(cherry picked from commit 74b7a6c)

Labels: open source, release notes: nn, triaged