Conversation

@albanD
Collaborator

@albanD albanD commented Jul 1, 2020

Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like t.grad().defined() is not thread safe.

  • This introduces t.mutable_grad(), which should be used when a mutable version of the saved gradient is needed. This function is not thread safe.
  • The Tensor& grad() API is now removed. We could not do a deprecation cycle because most of our call sites use non-const Tensors, which resolve to the non-const overload; most calls would hit the warning, which would be too verbose for users.

@albanD albanD marked this pull request as draft July 1, 2020 22:20
Contributor

@facebook-github-bot facebook-github-bot left a comment


@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 9ee011a (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Jul 14 10:40:53  
Jul 14 10:40:53 At: 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/jit/_script.py(570): __getstate__ 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 [E request_callback_impl.cpp:168] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Jul 14 10:40:53  
Jul 14 10:40:53 At: 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/jit/_script.py(570): __getstate__ 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Jul 14 10:40:53   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Jul 14 10:40:53  
Jul 14 10:40:53 ok (1.360s) 
Jul 14 10:40:55   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (1.207s) 
Jul 14 10:40:56   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (1.305s) 
Jul 14 10:40:57   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (1.209s) 

This comment was automatically generated by Dr. CI.


@albanD
Collaborator Author

albanD commented Jul 9, 2020

Even though the code compiles just fine right now (and all tests pass), if I add back the bad overload (for BC reasons, with a warning), the compiler actually uses the bad overload...
Do we actually want that? Or are we happy with a hard BC break here (not adding back the bad overload with a warning, and merging as-is)?

cc @glaringlee

@glaringlee
Contributor

@albanD
BC breaking is not the end of the world, but it would be a bad user experience if their code suddenly won't compile.
I am not sure how bad this would be. Given that all the PyTorch-internal code has already been migrated to mutable_grad(), can we do the following?

Tensor& grad() {
    // Warn that non-const grad() will be deprecated in the next release;
    // call mutable_grad() instead if the tensor is actually mutable, or make it const.
    TORCH_WARN_ONCE("...");
    return impl_->mutable_grad();
}

@albanD
Collaborator Author

albanD commented Jul 10, 2020

I did that in my local commit, and when I recompile and run the tests, the warning actually fires many times (when I remove the "once"). Note that the exact same code with just that function removed works fine!

My understanding is that the compiler uses the non-const overload even though it could use the const one. That means potentially all users would see this warning if we add this overload back...

@albanD albanD marked this pull request as ready for review July 14, 2020 14:24

@facebook-github-bot
Contributor

@albanD merged this pull request in 45c5bac.


Labels

Merged · module: bc-breaking (Related to a BC-breaking change)
