Conversation

@albanD
Collaborator

@albanD albanD commented Jul 1, 2020

Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like t.grad().defined() is not thread safe.

  • This introduces t.mutable_grad(), which should be used when a mutable version of the saved gradient is needed. This function is not thread safe.
  • The Tensor& grad() API is now removed. We could not do a deprecation cycle because most of our call sites use non-const Tensors, which resolve to the non-const overload; most calls would hit the warning, which would be too verbose for users.

@albanD albanD marked this pull request as draft July 1, 2020 22:20
Contributor

@facebook-github-bot facebook-github-bot left a comment


@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 9ee011a (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Jul 14 10:40:53  
Jul 14 10:40:53 At: 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/jit/_script.py(570): __getstate__ 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Jul 14 10:40:53  
Jul 14 10:40:53 At: 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/jit/_script.py(570): __getstate__ 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 ok (1.360s) 
Jul 14 10:40:55   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (1.207s) 
Jul 14 10:40:56   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (1.305s) 
Jul 14 10:40:57   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (1.209s) 

This comment was automatically generated by Dr. CI.


@albanD
Collaborator Author

albanD commented Jul 9, 2020

Even though the code compiles just fine right now (and all tests pass), if I add back the bad overload (for BC reasons, with a warning), the compiler actually uses the bad overload...
Do we actually want that? Or are we happy with a hard BC break here (not adding back the bad overload with a warning, and merging as-is)?

cc @glaringlee

@glaringlee
Contributor

@albanD
BC breaking is not the end of the world, but it would be a bad user experience if their code suddenly won't compile.
I am not sure how bad this would be. Given that all the PyTorch-internal code has already been migrated to mutable_grad(), can we do the following?

Tensor& grad() {
    // Warn that non-const grad() will be deprecated in the next release;
    // call mutable_grad() instead if the tensor is actually mutable, or make it const.
    TORCH_WARN_ONCE("...");
    return impl_->mutable_grad();
}

@albanD
Collaborator Author

albanD commented Jul 10, 2020

I did that in my local commit, and when I recompile and run the tests, the warning actually fires many times (when I remove the "once"). Note that the exact same code with just that function removed works fine!

My understanding is that the compiler uses the non-const overload even though it could use the const one. That means potentially all users would see this warning if we add this overload back...

@albanD albanD marked this pull request as ready for review July 14, 2020 14:24

@facebook-github-bot
Contributor

@albanD merged this pull request in 45c5bac.


Labels

Merged · module: bc-breaking (Related to a BC-breaking change)
