Allow torch.cuda.amp.GradScaler to support sparse gradients #36786
Conversation
💊 CI failures summary and remediations (as of commit 0c7bc4f; more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 This comment was automatically generated by Dr. CI.
torch/cuda/amp/grad_scaler.py
Outdated
# coalesce() deduplicates indices and adds all values that have the same index.
# For scaled fp16 values, there's a good chance coalescing will cause overflow,
# so we should double check the coalesced _values().
torch._amp_non_finite_check_and_unscale_(g.coalesce()._values(),
should you just replace g with its coalesced version in this case, and not do the check twice?
I like that, but my original thinking was that I didn't want to replace param.grad with a new reference, because unscale_ advertises itself as in-place (last paragraph of the PR wall of text). If you think it's ok for unscale_ to replace sparse .grads, it's an easy change.
0c7bc4f replaces param.grad with the coalesced, unscaled version if param.grad was fp16 and uncoalesced.
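For readers skimming the thread, here is a minimal sketch of what that single-check variant looks like in spirit. The helper name, the argument plumbing, and the use of public ops (mul_, torch.isfinite) in place of torch._amp_non_finite_check_and_unscale_ are illustrative assumptions, not the actual diff:

```python
import torch

def coalesce_then_unscale_(grad, inv_scale, found_inf):
    # Sketch (names illustrative): an fp16, uncoalesced sparse grad is replaced
    # by its coalesced version up front, so only one unscale / non-finite check
    # is needed, and it runs on the summed (coalesced) values.
    if grad.is_sparse:
        if grad.dtype is torch.float16 and not grad.is_coalesced():
            grad = grad.coalesce()
        to_unscale = grad._values()
    else:
        to_unscale = grad

    # Public-op stand-in for the internal unscale + non-finite check.
    to_unscale.mul_(inv_scale)
    if not torch.isfinite(to_unscale).all():
        found_inf.fill_(1.0)

    return grad  # the caller assigns this back to param.grad
```

The trade-off is the one discussed above: param.grad may end up pointing at a new tensor, which bends unscale_'s in-place contract for this one case.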
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Should close #35810.
I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into _amp_non_finite_check_and_unscale_) without much trouble.

For non-fp16 sparse grads the logic is simple (call _amp_non_finite_check_and_unscale_ on grad._values() instead of grad itself). At least I hope it's that easy.

For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the Note:

An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.** If I checked _values() on the uncoalesced version, it might not report overflows, but I think it should.

So, if the grad is sparse, fp16, and uncoalesced, I still call _amp_non_finite_check_and_unscale_ to unscale grad._values() in-place, but I also double-check the coalesced version by calling a second _amp_non_finite_check_and_unscale_ on grad.coalesce()._values(). coalesce() is out-of-place, so this call doesn't redundantly affect grad._values(), but it does have the power to populate the same found_inf tensor. The is_coalesced() check and coalesce() probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing, and memory-wise they'll only create a copy of nnz gradient values + indices, which is still way better than changing the whole table to FP32.

An unscale variant with the liberty to create unscaled grads out-of-place, and replace param.grad instead of writing through it, could get away with just one _amp_non_finite_check_and_unscale_. It could say coalesced = grad.coalesce(), do only the stronger _amp_non_finite_check_and_unscale_ on coalesced._values(), and set param.grad = coalesced. I could even avoid replacing param.grad itself by going one level deeper and setting param.grad's indices and values to coalesced's, but that seems brittle and still isn't truly "in place".

** You could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
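For concreteness, a rough sketch of the two-check path described above, with public ops (mul_, torch.isfinite) standing in for the internal _amp_non_finite_check_and_unscale_ kernel. The function name, the found_inf / inv_scale plumbing, and the ordering of the two checks are illustrative, not the actual grad_scaler.py internals:

```python
import torch

def unscale_sparse_grad_(grad, inv_scale, found_inf):
    # grad: a sparse (COO) gradient; inv_scale / found_inf: 1-element float tensors.

    # fp16 only: values at duplicate coordinates that are each finite may still
    # overflow when coalesce() sums them, so check a coalesced copy as well.
    # coalesce() is out-of-place, so grad._values() is untouched; the only side
    # effect of this branch is (possibly) setting found_inf.
    if grad.dtype is torch.float16 and not grad.is_coalesced():
        if not torch.isfinite(grad.coalesce()._values()).all():
            found_inf.fill_(1.0)

    # Unscale the (possibly uncoalesced) values in place, preserving unscale_'s
    # in-place contract: param.grad itself is never replaced.
    values = grad._values()
    values.mul_(inv_scale)
    if not torch.isfinite(values).all():
        found_inf.fill_(1.0)
```

The coalesced copy is discarded after its check, so the extra memory is bounded by the nnz values + indices mentioned above.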