
Conversation

@crcrpar (Collaborator) commented Feb 7, 2023

This duplicates the core functionality of https://github.com/pytorch/pytorch/blob/master/torch/utils/_foreach_utils.py#L21 in C++.

The changes are

  • manually setting device_guard and calling device_check in each fused kernel
  • incrementing step counts and optionally decrementing them

Ideally there should be a C++ equivalent that we can use for the foreach optimizers and for each foreach function, so that we can remove the for-loop in Python. Note that once each foreach function takes care of this grouping itself, the foreach optimizers would be doing the grouping redundantly; so it could still make sense for the foreach optimizers to iterate over the lists of grouped tensors.
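For reference, a minimal Python sketch of the kind of (device, dtype) bucketing the linked `_foreach_utils.py` helper performs; the function name and signature below are illustrative, not the actual PyTorch API:

```python
# Illustrative sketch only, not the actual torch.utils._foreach_utils code:
# bucket index-aligned tensor lists (e.g. params, grads, exp_avgs) by the
# (device, dtype) of the reference list so that each fused/foreach kernel
# launch only sees homogeneous tensors.
from typing import Dict, List, Tuple

import torch


def group_by_device_and_dtype(
    tensorlists: List[List[torch.Tensor]],
) -> Dict[Tuple[torch.device, torch.dtype], List[List[torch.Tensor]]]:
    grouped: Dict[Tuple[torch.device, torch.dtype], List[List[torch.Tensor]]] = {}
    for i, ref in enumerate(tensorlists[0]):  # the first list decides the bucket
        key = (ref.device, ref.dtype)
        bucket = grouped.setdefault(key, [[] for _ in tensorlists])
        for j, tensorlist in enumerate(tensorlists):
            bucket[j].append(tensorlist[i])
    return grouped
```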

related: #58833, #89591 (comment)

cc @ptrblck @ngimel

@pytorch-bot (bot) commented Feb 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94344

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit f0b08c8:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn (release notes category) label Feb 7, 2023
@albanD (Collaborator) commented Feb 7, 2023

Nice!
Any numbers on the perf improvement?

@soulitzer soulitzer added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Feb 8, 2023
@crcrpar (Collaborator, Author) commented Feb 9, 2023

With the huggingface/transformers gpt2-medium example (script) on A100x4:

master (2.0.0a0+git1767026)

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     2.8572
  train_runtime            = 0:01:49.16
  train_samples            =       1200
  train_samples_per_second =     32.978
  train_steps_per_second   =      1.044

this PR

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     2.8572
  train_runtime            = 0:01:39.96
  train_samples            =       1200
  train_samples_per_second =     36.014
  train_steps_per_second   =       1.14

machine: DGXStation (A100x4, AMD EPYC7742 x 64)

Command:

python run_clm.py --model_name_or_path gpt2-medium --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir /workspace/results --max_train_samples 1200 --warmup_steps 8 --overwrite_output_dir --optim adamw_torch

In the first run, this check

if not all(

errored out, so I removed it and things worked. I'll remove it.

@crcrpar (Collaborator, Author) commented Feb 9, 2023

t5-large

master

***** train metrics *****
  epoch                    =        1.0
  train_loss               =     2.6017
  train_runtime            = 0:00:39.23
  train_samples            =       5000
  train_samples_per_second =     127.43
  train_steps_per_second   =      1.019

this PR

***** train metrics *****
  epoch                    =        1.0
  train_loss               =     2.6017
  train_runtime            = 0:00:39.14
  train_samples            =       5000
  train_samples_per_second =    127.737
  train_steps_per_second   =      1.022

command:

python run_translation.py --model_name_or_path t5-large --do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 --max_source_length 512 --max_target_length 512 --num_train_epochs 1 --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config ro-en --source_prefix "translate English to Romanian:"  --warmup_steps 50 --max_train_samples 5000 --dataloader_num_workers 2 --output_dir /workspace/result_t5-large --overwrite_output_dir --optim adamw_torch

@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 8ff00e0 to cff18cb on February 10, 2023 01:53
@janeyx99 (Contributor) commented

Thanks for looking into this! Is your long-term goal here to replace the Python grouping util with a generalized C++ util, or is this PR just for fusedAdamW?

@crcrpar (Collaborator, Author) commented Feb 16, 2023

Is your long-term goal here to replace the Python grouping util with a generalized C++ util, or is this PR just for fusedAdamW?

The former. That said, I'm not sure how to avoid multiple groupings for the _multi_tensor optimizer functions at the moment.

@janeyx99 (Contributor) commented
Yea, I agree with the concern if we were to expand foreach grouping to C++ as a generic util.

I have a more general concern about adding grouping for general foreach ops (like add, mul, etc.): will this add significant perf cost for foreach calls where the tensors were already properly grouped? What is the perf difference if we added grouping for every foreach op vs. the existing checks we do today? Do you have benchmarks comparing the existing check logic vs. the grouping logic?

If the difference is significant, would a potential solution be to allow users to pass in a skip_grouping flag that says "I'm sure I don't need grouping, just run + if it fails it's my fault"? This would require an API change though, which is likely BC-breaking.

cc'ing @ngimel as well

@crcrpar (Collaborator, Author) commented Feb 17, 2023

If the difference is significant, would a potential solution be to allow users to pass in a skip_grouping flag that says "I'm sure I don't need grouping, just run + if it fails it's my fault"? This would require an API change though, which is likely BC-breaking.

I had a similar thought: adding a flag to each foreach function and/or exposing from C++ to Python some function that takes a list of Tensors and returns Dict[torch.device, Dict[torch.dtype, List[List[Tensor]]]], which could be utilized by

def _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16):
as well
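As a rough illustration (not a concrete proposal), a Python sketch of how `_unscale_grads_` could consume such a helper: bucket the grads by (device, dtype), keep one `inv_scale`/`found_inf` replica per device, and run the fused check-and-unscale per bucket. Everything here except `torch._amp_foreach_non_finite_check_and_unscale_` is an assumed/illustrative name:

```python
# Sketch under assumptions: the grouping is inlined here for brevity, and the
# real GradScaler._unscale_grads_ uses _MultiDeviceReplicator for the
# per-device copies of inv_scale / found_inf.
from collections import defaultdict

import torch


def unscale_grads_sketch(grads, inv_scale, found_inf):
    buckets = defaultdict(list)  # (device, dtype) -> grads in that bucket
    for g in grads:
        buckets[(g.device, g.dtype)].append(g)

    per_device = {}  # device -> (inv_scale replica, found_inf replica)
    for (device, _dtype), device_grads in buckets.items():
        if device not in per_device:
            per_device[device] = (
                inv_scale.to(device, non_blocking=True),
                found_inf.to(device, non_blocking=True),
            )
        dev_inv_scale, dev_found_inf = per_device[device]
        # Fused foreach op: checks for inf/nan and unscales in one pass,
        # writing the result of the check into dev_found_inf in place.
        torch._amp_foreach_non_finite_check_and_unscale_(
            device_grads, dev_found_inf, dev_inv_scale
        )
    return {device: fi for device, (_, fi) in per_device.items()}
```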

@crcrpar (Collaborator, Author) commented Feb 22, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented
@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator) commented
Successfully rebased fused-optimizers-faster-grouping onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fused-optimizers-faster-grouping && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fused-optimizers-faster-grouping branch from cff18cb to 93384db on February 22, 2023 22:55
@crcrpar crcrpar mentioned this pull request Feb 23, 2023
@janeyx99 (Contributor) commented
I would agree that having a separate C++ util exposed that could be optionally called from relevant endpoints would be better than changing the API haha. Do you have the numbers for how long the grouping would take with the gpt2-medium params?

pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2023
…95847)

Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. As a result, parameters can get mistakenly updated by gradients some of whose elements are invalid, i.e. nan or inf.

Related #94060

I forgot about this wrong handling after #94344

Pull Request resolved: #95847
Approved by: https://github.com/janeyx99
crcrpar added a commit to crcrpar/pytorch that referenced this pull request Mar 6, 2023
(same fix as #95847 above)
@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 93384db to 39681b9 on March 8, 2023 17:40
@crcrpar (Collaborator, Author) commented Mar 10, 2023

Some more numbers (top: this PR, bottom: master).

"Self" numbers look good.

FP32 - the same gpt2 script as above

                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                              Optimizer.step#AdamW.step         0.14%      21.845ms         0.46%      72.387ms       3.619ms       0.000us         0.00%     161.130ms       8.056ms            20  
                                    aten::_fused_adamw_         0.04%       6.821ms         0.32%      50.472ms       1.262ms     161.130ms         1.68%     161.130ms       4.028ms            40  
                    Optimizer.zero_grad#AdamW.zero_grad         0.09%      14.349ms         5.52%     865.262ms      43.263ms       0.000us         0.00%      19.468ms     973.400us            2

---

                              Optimizer.step#AdamW.step         0.19%      29.690ms         0.46%      71.838ms       3.592ms       0.000us         0.00%     161.619ms       8.081ms            20  
                                    aten::_foreach_add_         0.02%       2.650ms         0.02%       3.294ms      82.350us     180.000us         0.00%     180.000us       4.500us            40  
                                    aten::_fused_adamw_         0.02%       3.216ms         0.25%      38.790ms     969.750us     161.439ms         1.69%     161.439ms       4.036ms            40  
                    Optimizer.zero_grad#AdamW.zero_grad         0.09%      14.145ms         5.53%     867.559ms      43.378ms       0.000us         0.00%      19.454ms     972.700us            2

FP16 AMP (autocast & GradScaler)

                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                              Optimizer.step#AdamW.step         0.06%       8.618ms         0.13%      19.301ms     965.050us       0.000us         0.00%     162.830ms       8.142ms            20
                                    aten::_fused_adamw_         0.06%       8.422ms         0.07%      10.665ms     266.625us     162.830ms         1.91%     162.830ms       4.071ms            40
                    Optimizer.zero_grad#AdamW.zero_grad         0.10%      14.414ms         0.36%      52.794ms       2.640ms       0.000us         0.00%      19.235ms     961.750us            20

---

                              Optimizer.step#AdamW.step         0.13%      18.071ms         0.20%      28.279ms       1.414ms       0.000us         0.00%     162.600ms       8.130ms            20
                                    aten::_foreach_add_         0.02%       2.582ms         0.02%       3.014ms      75.350us     160.000us         0.00%     160.000us       4.000us            40  
                                    aten::_fused_adamw_         0.02%       2.775ms         0.03%       4.242ms     106.050us     162.197ms         1.90%     162.197ms       4.055ms            40  
                                    aten::_foreach_sub_         0.02%       2.478ms         0.02%       2.938ms      73.450us     243.000us         0.00%     243.000us       6.075us            40
                    Optimizer.zero_grad#AdamW.zero_grad         0.10%      13.828ms         0.37%      53.763ms       2.688ms       0.000us         0.00%      19.232ms     961.600us            20

ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
(same fix as #95847 above)
Review comment (Contributor):

Is only_device_check simply is_step_tensor? Nit: a more indicative name would be clearer if so.

@janeyx99 (Contributor) commented
The numbers do look good! I have some questions:

  1. Is it correct that this PR moved the grouping logic onto CUDA by calling it within the kernel? If so, are the numbers for fused_adam(w) basically showing that the grouping adds no significant time at all (and takes less time than the previous checking)?
  2. I'm trying to think ahead to what the generic grouping function in c++ would look like. It seems like the distance between the current grouping operator and the generic one is mainly the check ensuring the tensorlists were similar enough. Could we already make this generic by allowing a flag to do this check vs not? Or we could have another util that will just group a list of tensors instead of a nested list.
  3. I would prefer if we could already land the generalized version in this PR (and have it live in foreach_utils) if there are no downsides.

@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 39681b9 to 0c82432 on March 25, 2023 01:09
@crcrpar (Collaborator, Author) commented Mar 28, 2023

  1. Is it correct that this PR moved the grouping logic onto CUDA by calling it within the kernel? If so, are the numbers for fused_adam(w) basically showing that the grouping adds no significant time at all (and takes less time than the previous checking)?

Yes. IIUC the difference mainly comes from doing this in C++ vs. Python, including setting the device guard and handling the map/dict of device to Tensor for the grad scaler & found_inf.

  • I'm trying to think ahead to what the generic grouping function in c++ would look like. It seems like the distance between the current grouping operator and the generic one is mainly the check ensuring the tensorlists were similar enough. Could we already make this generic by allowing a flag to do this check vs not? Or we could have another util that will just group a list of tensors instead of a nested list.
  • I would prefer if we could already land the generalized version in this PR (and have it live in foreach_utils) if there are no downsides.

I've been playing with torch/csrc/Module.cpp and have yet to figure out what will suit SGD's logic here better:

momentum_buffer_list.append(None)

I guess having a flag to tell whether or not the momentum buffers need initialization would work well, e.g. https://github.com/NVIDIA/apex/blob/7150e20cc3adef34e3f36261a1070fc0882f16a7/apex/optimizers/fused_sgd.py#L121-L136, though I think that would be BC-breaking.

crcrpar added a commit to crcrpar/pytorch that referenced this pull request Mar 29, 2023
(same fix as #95847 above)
Review comment (Contributor) on lines 212 to 245:

Should this logic just be moved before the for loop so it doesn't keep checking with count?

Review comment (Contributor):

To my understanding, this function would only be used by the optimizers, yes?

Review comment (Contributor):

Will you be switching this to call the generic foreach util once that's ready? I think that would be ideal (vs. having a copy in the fused_adam_utils), and I'd be curious whether the perf results would be different.

@janeyx99 (Contributor) commented
Yes. IIUC the difference mainly comes from doing this in C++ vs. Python, including setting the device guard and handling the map/dict of device to Tensor for the grad scaler & found_inf.

Ah, so this means the wins won't be as prominent for the other optimizers/foreach ops, but should be a general win nonetheless.

I've been playing with torch/csrc/Module.cpp and have yet to figure out what will suit SGD's logic here better:

momentum_buffer_list.append(None)

I guess having a flag to tell whether or not the momentum buffers need initialization would work well, e.g. https://github.com/NVIDIA/apex/blob/7150e20cc3adef34e3f36261a1070fc0882f16a7/apex/optimizers/fused_sgd.py#L121-L136, though I think that would be BC-breaking.

Is the problem here that the momentum buffer list may be all Nones initially? I think the way I dealt with it before was just including the with_indices flag and having SGD use the indices to update the original buffer. Is that insufficient?
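For concreteness, a minimal sketch of the with_indices idea (illustrative helper names, not the actual torch.optim internals): each bucket also carries the positions of its tensors in the original flat lists, so SGD can lazily create momentum buffers and write them back into the caller's momentum_buffer_list.

```python
# Sketch under assumptions: plain SGD with momentum only (no weight decay,
# dampening, or nesterov); the grouping is done in Python for illustration.
from collections import defaultdict

import torch


def group_indices_by_device_and_dtype(params):
    buckets = defaultdict(list)  # (device, dtype) -> indices into the flat lists
    for i, p in enumerate(params):
        buckets[(p.device, p.dtype)].append(i)
    return buckets


@torch.no_grad()
def sgd_step_sketch(params, grads, momentum_buffer_list, lr=1e-3, momentum=0.9):
    for indices in group_indices_by_device_and_dtype(params).values():
        device_params = [params[i] for i in indices]
        bufs = []
        for i in indices:
            buf = momentum_buffer_list[i]
            if buf is None:
                # Lazily initialize the buffer and propagate it back to the
                # caller's list through the original index.
                buf = torch.clone(grads[i]).detach()
                momentum_buffer_list[i] = buf
            else:
                buf.mul_(momentum).add_(grads[i])
            bufs.append(buf)
        # One foreach update per homogeneous (device, dtype) bucket.
        torch._foreach_add_(device_params, bufs, alpha=-lr)
```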

atalman pushed a commit that referenced this pull request Apr 5, 2023
…95847) (#97885): same fix as #95847 above
crcrpar added 7 commits April 5, 2023 17:59
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> (×7)
@crcrpar crcrpar force-pushed the fused-optimizers-faster-grouping branch from 0c82432 to 7f92a8b on April 6, 2023 05:00
crcrpar added 3 commits April 5, 2023 22:13
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> (×3)
@crcrpar crcrpar closed this Apr 23, 2023
@janeyx99 (Contributor) commented
Hey! Why did you close the PR? Sorry if you had been waiting on a review; I did not think it was ready before and would be happy to review.

@crcrpar (Collaborator, Author) commented Apr 27, 2023

@janeyx99 I found this branch a bit too messed up, so I refactored it into #100007.

@janeyx99 (Contributor) commented
Thanks for the update; I'll look there!

pytorchmergebot pushed a commit that referenced this pull request Jun 9, 2023
malfet pushed a commit that referenced this pull request Jun 20, 2023
rel: #94344
Pull Request resolved: #100007
Approved by: https://github.com/janeyx99

(cherry picked from commit 74b7a6c)

Labels: open source, release notes: nn, triaged