@rohan-varma rohan-varma commented Sep 14, 2020

Stack from ghstack:

Since toHere() does not execute operations (torch operators) over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message. Enabling it only adds unnecessary overhead.

Since toHere() is a blocking call, we already profile it on the local node using RECORD_USER_SCOPE, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).

Differential Revision: D23641466

rohan-varma added a commit that referenced this pull request Sep 14, 2020

Differential Revision: [D23641466](https://our.internmc.facebook.com/intern/diff/D23641466/)

ghstack-source-id: 112012912
Pull Request resolved: #44655

codecov bot commented Sep 15, 2020

Codecov Report

❗ No coverage uploaded for pull request base (gh/rohan-varma/172/base@f0a18e2).
The diff coverage is n/a.

@@                    Coverage Diff                     @@
##             gh/rohan-varma/172/base   #44655   +/-   ##
==========================================================
  Coverage                           ?   67.85%           
==========================================================
  Files                              ?      384           
  Lines                              ?    50020           
  Branches                           ?        0           
==========================================================
  Hits                               ?    33940           
  Misses                             ?    16080           
  Partials                           ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f0a18e2...5382e89.

  // If profiler is enabled, wrap this message with profiling metadata that will
  // tell the remote end to process this request with the profiler enabled.
- if (torch::autograd::profiler::profilerEnabled()) {
+ if (!forceDisableProfiling && torch::autograd::profiler::profilerEnabled()) {
Contributor

This looks OK to me. But I wonder if we can avoid this new arg by making torch::autograd::profiler::profilerEnabled() return false? E.g., is it possible to use a guard in rref_impl.cpp to toggle the status of profilerEnabled?

Contributor Author

One thing we could do is add a `thread_local` flag that allows this override, plus a guard that sets and restores it. When the flag is set, `profilerEnabled()` would always be treated as false. I don't think this should live in `profiler::profilerEnabled()` itself, though, since that may add too much complexity to the profiler. Instead, we could have a separate RPC-specific function that wraps `profilerEnabled()` and checks the guard value.

Contributor

I see. If it's not that easy to toggle the profiler on/off, the current version LGTM. Thanks!
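
For reference, the guard-based alternative discussed in this thread could look roughly like the sketch below. It is written in Python for brevity (the real code is C++), and every name in it (`profiler_enabled`, `disable_rpc_profiling_guard`, `rpc_profiling_enabled`) is a hypothetical stand-in, not an existing PyTorch API. It only illustrates the idea of a thread-local override plus an RPC-specific wrapper; it is not what this PR implements.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for torch::autograd::profiler::profilerEnabled().
_profiler_on = True


def profiler_enabled():
    return _profiler_on


# Thread-local override flag (the "thread_local" from the discussion).
_tls = threading.local()


@contextmanager
def disable_rpc_profiling_guard():
    """Set the override on entry and restore the previous value on exit."""
    previous = getattr(_tls, "force_disable", False)
    _tls.force_disable = True
    try:
        yield
    finally:
        _tls.force_disable = previous


def rpc_profiling_enabled():
    """RPC-specific wrapper: check the override first, then ask the profiler."""
    if getattr(_tls, "force_disable", False):
        return False
    return profiler_enabled()
```

With this shape, a call site such as `toHere()` would wrap its send in `with disable_rpc_profiling_guard():` instead of passing an extra argument, and the profiler itself would stay unchanged. The override is per thread, so other threads keep seeing the real profiler state.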

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* **#44655 [RPC profiling] Don't wrap toHere() calls with profiling**
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

rohan-varma added a commit that referenced this pull request Sep 18, 2020
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* **#44646 Remove thread_local RecordFunctionGuard from profiler.**

Per a discussion with @ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.

Differential Revision: [D23683998](https://our.internmc.facebook.com/intern/diff/D23683998/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
…another thread. "

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* **#44653 [RPC profiling] Allow disableProfiler() to be called from another thread.**
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Per an offline discussion with @ilia-cher, this changes the profiler so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until an async callback runs, possibly on a different thread, in order to support RPC async function profiling.

This is done by introducing two flags, `cleanupTLSState` and `consolidate`, which control whether we clean up thread-local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we consolidate all profiled events. Backwards compatibility is preserved since both options are true by default.
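
As a rough illustration of how the two flags interact, here is a toy Python model. It is an assumption-laden sketch, not the C++ implementation: `ToyProfiler` and its methods are hypothetical, and returning an empty list when `consolidate` is false is a choice made for this toy only.

```python
import threading


class ToyProfiler:
    """Toy model: per-thread event lists plus a thread-local 'enabled' flag."""

    def __init__(self):
        self._tls = threading.local()
        self._lock = threading.Lock()
        self._events_by_thread = {}

    def enable(self):
        # Mark the profiler as enabled on the calling thread only.
        self._tls.enabled = True

    def is_enabled_on_this_thread(self):
        return getattr(self._tls, "enabled", False)

    def record(self, name):
        # Events are stored per thread, keyed by thread id.
        with self._lock:
            self._events_by_thread.setdefault(threading.get_ident(), []).append(name)

    def disable(self, cleanup_tls_state=True, consolidate=True):
        # cleanup_tls_state: reset the calling thread's local profiler state.
        if cleanup_tls_state:
            self._tls.enabled = False
        # consolidate: merge the events recorded by all threads and hand them back.
        if not consolidate:
            return []
        with self._lock:
            merged = [e for events in self._events_by_thread.values() for e in events]
            self._events_by_thread.clear()
        return merged
```

In this model, a callback thread can call `disable(cleanup_tls_state=False)` to collect events from every thread without touching thread-local state, while the default `disable()` keeps the original behavior of cleaning up and consolidating.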

Added a test in `test_misc.cpp` to test this.

Differential Revision: [D23638499](https://our.internmc.facebook.com/intern/diff/D23638499/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23638499/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only for the duration of the blocking `processRPC` call, as was done previously). By the time that future completes, the async function has been kicked off and the future it returned has completed, so we are able to capture any RPCs the function issued and the actual work done on the other node.
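
The deferral described above can be sketched in plain Python with `concurrent.futures`. This is only an analogy under stated assumptions: `process_rpc_with_profiling` and the `handler` callable are hypothetical, and appending to `events` stands in for enabling and disabling the real profiler.

```python
from concurrent.futures import ThreadPoolExecutor


def process_rpc_with_profiling(handler, executor, events):
    """Hypothetical server-side wrapper; `handler` must return a Future."""
    events.append("profiler enabled")
    response_future = handler(executor)

    def _on_done(_future):
        # Runs once the response future has completed, possibly on another
        # thread; this is where event collection and disabling would happen.
        events.append("profiler disabled")

    # Defer disabling until completion instead of disabling when the
    # (non-blocking) handler call returns.
    response_future.add_done_callback(_on_done)
    return response_future


def example_async_handler(executor, events):
    """Hypothetical async handler: the real work happens on a worker thread."""

    def work():
        events.append("async work")
        return 3

    return executor.submit(work)
```

For instance, calling `process_rpc_with_profiling(lambda ex: example_async_handler(ex, events), executor, events)` records the async work before the "profiler disabled" marker, which is the ordering this PR needs for async functions.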

For example, if the following async function is run on a server over RPC:

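```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    # Simulate a slow operation, then do the actual tensor work.
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the RPC response is sent once it completes.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```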

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```
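
When reading output like this, it can help to split an event name into its chain of hops. The helper below is a small sketch that assumes nested remote events are joined with the `#remote_op: ` separator seen in the sample output; `split_remote_event` is a hypothetical name, not part of the profiler API.

```python
# Separator observed between nesting levels in the sample output above.
REMOTE_OP_SEP = "#remote_op: "


def split_remote_event(name):
    """Split a profiler event name into its nested RPC hops and the final op."""
    return name.split(REMOTE_OP_SEP)
```

Applied to the `aten::add` row, this yields the outer `slow_async_add` RPC, the nested `slow_add` RPC, and the operator itself, matching the three node IDs in the table.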

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. That assumption caused a bug in the
previous diff (since fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]

dr-ci bot commented Sep 18, 2020

💊 CI failures summary and remediations

As of commit 5382e89 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.



rohan-varma added a commit that referenced this pull request Sep 20, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

When enabling the profiler on the server, the server may be a different
machine that does not have CUDA even though the caller does. Previously we
would crash in this case; now we fall back to CPU profiling and log a warning.
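
A minimal sketch of that fallback logic, written in Python for illustration. The `ProfilerState` enum and `resolve_profiler_state` function here are hypothetical stand-ins, and the warning text is invented for the example rather than taken from the diff.

```python
import logging
from enum import Enum


class ProfilerState(Enum):
    # Hypothetical stand-in for the profiler state sent by the caller.
    CPU = "cpu"
    CUDA = "cuda"


def resolve_profiler_state(requested, cuda_available):
    """Return the state to profile with on this machine."""
    if requested is ProfilerState.CUDA and not cuda_available:
        # Fall back instead of crashing when the server has no CUDA.
        logging.warning(
            "CUDA profiling requested but CUDA is unavailable; falling back to CPU."
        )
        return ProfilerState.CPU
    return requested
```

The caller's requested state is passed through unchanged whenever the server can honor it, so only the CUDA-without-CUDA case changes behavior.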

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that we only have single-machine tests and the machine either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 21, 2020
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* **#44646 Remove thread_local RecordFunctionGuard from profiler.**
rohan-varma added a commit that referenced this pull request Sep 21, 2020
…another thread. "

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* **#44653 [RPC profiling] Allow disableProfiler() to be called from another thread.**
* #44646 Remove thread_local RecordFunctionGuard from profiler.
rohan-varma added a commit that referenced this pull request Sep 21, 2020
…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only for the duration of the blocking `processRPC` call, as was done previously). By the time that future completes, the async function has been kicked off and the future it returned has completed, so we are able to capture any RPCs the function issued and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node ID. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us

This PR also moves much of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 22, 2020
…tion execution over RPC."

rohan-varma added a commit that referenced this pull request Sep 22, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. That assumption caused a bug in the
previous diff, which has since been fixed.

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 22, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when enabling the profiler on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously this would crash; now we
fall back to CPU and log a warning.
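
The fallback amounts to a small guard around the requested profiler state. The following is only a sketch of that decision in Python. The enum and `resolve_profiler_state` are hypothetical names chosen for illustration, and the real check lives in the C++ RPC code.

```python
import enum
import logging

logger = logging.getLogger(__name__)


class ProfilerState(enum.Enum):
    # Hypothetical mirror of the profiler modes discussed here.
    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_profiler_state(requested, cuda_available):
    """Pick the state the server should use for a caller-requested state."""
    if requested is ProfilerState.CUDA and not cuda_available:
        # The caller has CUDA but this machine does not: degrade, don't crash.
        logger.warning(
            "CUDA profiling was requested but CUDA is unavailable on this "
            "node; falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested
```

Passing `cuda_available` as an argument, rather than querying the machine inside the function, lets both branches be exercised on one machine, which is roughly what forcing the CUDA state by hand achieves.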

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that our tests run on a single machine that either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in d4a634c.

rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

rohan-varma added a commit that referenced this pull request Sep 23, 2020
…pport async function execution over RPC."

rohan-varma added a commit that referenced this pull request Sep 23, 2020
…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
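
To make the nesting in these event names easier to read, here is a small helper sketch. `split_remote_chain` and `nesting_depth` are hypothetical names, and the `#remote_op: ` separator is inferred only from the sample output above, so treat it as an assumption rather than a documented format:

```python
REMOTE_OP_SEP = "#remote_op: "  # separator as it appears in the sample output


def split_remote_chain(event_name):
    """Split a profiled event name into its chain of nested calls."""
    return event_name.split(REMOTE_OP_SEP)


def nesting_depth(event_name):
    """0 for a local event, 1 for an op run by the first callee, and so on."""
    return len(split_remote_chain(event_name)) - 1


name = (
    "rpc_async#slow_async_add(worker1 -> worker2)#remote_op: "
    "rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add"
)
for hop in split_remote_chain(name):
    print(hop)
```

For the `aten::add` row above, this yields the outer RPC, the nested RPC, and the operator itself, matching the three node IDs in the table.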

This PR also moves much of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. That assumption already caused a bug in
the previous diff (since fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU profiling and log a warning.
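
The fallback can be pictured roughly as follows. This is a Python sketch of the decision only, not the C++ code in this PR; `resolve_profiler_state` and the string states `"CPU"` and `"CUDA"` are hypothetical names chosen for illustration:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_profiler_state(requested_state, cuda_available):
    """Pick the profiler state to use on the callee (illustrative only)."""
    if requested_state == "CUDA" and not cuda_available:
        # The caller profiles with CUDA, but this node cannot: degrade
        # to CPU profiling instead of crashing.
        logger.warning(
            "CUDA profiling was requested but CUDA is unavailable on this "
            "node; falling back to CPU profiling."
        )
        return "CPU"
    return requested_state
```

Passing the availability flag in as an argument, rather than querying it inside the function, is one way such logic could be exercised on a single machine regardless of whether it has CUDA.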

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that our tests run on a single machine, which either has CUDA or doesn't.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…pport async function execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, just as builtins, JIT functions, and blocking Python UDFs already can be. The motivation is to give RPC profiling complete coverage of the types of functions users can run.


[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't bake in the assumption of multiple server
threads when working on this code. For example, that assumption caused a
bug in the previous diff (since fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when enabling the profiler on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU profiling and log a warning.
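
The fallback described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the actual C++ change: `ProfilerState`, `resolve_profiler_state`, and the `cuda_available` flag are hypothetical names introduced here. In a real PyTorch process the flag would presumably come from a check such as `torch.cuda.is_available()`.

```python
import enum
import logging

logger = logging.getLogger(__name__)


class ProfilerState(enum.Enum):
    # Hypothetical stand-in for the profiler state sent with the RPC request.
    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_profiler_state(requested, cuda_available):
    """Return the state the server should actually profile with.

    If the caller asked for CUDA profiling but this machine has no CUDA,
    fall back to CPU profiling and log a warning instead of crashing.
    """
    if requested is ProfilerState.CUDA and not cuda_available:
        logger.warning(
            "Caller requested CUDA profiling, but CUDA is unavailable "
            "on this node; falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested
```

Passing the availability flag as an argument, rather than querying it inside the function, is what makes a sketch like this testable on a single machine regardless of whether that machine has CUDA.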

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that our tests run on a single machine that either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…pport async function execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is for RPC profiling to cover every type of function users can run over RPC.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes, rather than disabling it as soon as the blocking `processRPC` call returns, as was done previously. By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we capture any RPCs the function issued as well as the actual work done on the other node.
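
The deferral can be illustrated with a small standard-library sketch. It is an analogy rather than the actual C++ code: `FakeProfiler`, `handle_request`, and `slow_async_fn` are hypothetical names, and `concurrent.futures.Future` stands in for the RPC future.

```python
import threading
from concurrent.futures import Future


class FakeProfiler:
    """Hypothetical profiler that is safe to disable from any thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self.enabled = False
        self.events = []

    def enable(self):
        with self._lock:
            self.enabled = True

    def record(self, name):
        with self._lock:
            if self.enabled:
                self.events.append(name)

    def disable(self):
        with self._lock:
            self.enabled = False
            return list(self.events)


def handle_request(async_fn, profiler):
    """Enable profiling, start the async function, and defer disabling."""
    profiler.enable()
    response = Future()
    inner = async_fn(profiler)  # returns a Future, like an async_execution UDF

    def on_done(fut):
        # Runs on whichever thread completes `inner`, so disable() must be
        # callable from a thread other than the one that called enable().
        events = profiler.disable()
        response.set_result((fut.result(), events))

    inner.add_done_callback(on_done)
    return response


def slow_async_fn(profiler):
    fut = Future()

    def work():
        profiler.record("nested_work")  # happens after handle_request returns
        fut.set_result(3)

    threading.Timer(0.05, work).start()
    return fut


result, events = handle_request(slow_async_fn, FakeProfiler()).result(timeout=5)
```

Had `disable()` been called as soon as `async_fn` returned, the `nested_work` event would have been recorded after profiling was already off and would be missing from `events`.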

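For example, if the following async function is run on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    # Simulate slow remote work before the actual add.
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the RPC response is sent once the nested RPC completes.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```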

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node ID. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
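
The event names in this output encode the RPC nesting: each `#remote_op: ` separator marks a hop to another node. As a small illustration, the chain can be recovered with plain string splitting. The helper name `split_remote_event` is hypothetical, and the key format is inferred only from the output above, so treat this as a sketch rather than a stable contract.

```python
REMOTE_OP_SEPARATOR = "#remote_op: "


def split_remote_event(name):
    """Split a profiled event name into (list of enclosing RPCs, innermost op)."""
    parts = name.split(REMOTE_OP_SEPARATOR)
    return parts[:-1], parts[-1]


name = (
    "rpc_async#slow_async_add(worker1 -> worker2)#remote_op: "
    "rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add"
)
rpc_chain, op = split_remote_event(name)
```

For the `aten::add` row, `rpc_chain` holds the two enclosing RPCs in call order and `op` is the operator that ran on node 3. An event with no separator, such as the local `aten::empty`, yields an empty chain.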

This PR also moves much of the profiling logic into `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…tion execution over RPC."

rohan-varma added a commit that referenced this pull request Sep 24, 2020
… single threaded

server"

rohan-varma added a commit that referenced this pull request Sep 24, 2020
…filing"

rohan-varma added a commit that referenced this pull request Sep 25, 2020
…ave CUDA for profiling"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
server
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server,
the server may be a different machine that does not have CUDA even though the caller does.
Previously this would crash; now we fall back to CPU profiling and log a warning.
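
The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `ProfilerState`, `resolve_profiler_state`, and the `cuda_available` parameter are hypothetical names introduced here, and the real change lives in the RPC profiling code.

```python
import logging
from enum import Enum

logger = logging.getLogger("rpc_profiler_sketch")


class ProfilerState(Enum):
    # Hypothetical stand-in for the profiler states discussed above.
    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_profiler_state(requested: ProfilerState, cuda_available: bool) -> ProfilerState:
    """Return the state the server should actually profile with.

    If the caller asked for CUDA profiling but this node has no CUDA,
    log a warning and fall back to CPU instead of failing.
    """
    if requested is ProfilerState.CUDA and not cuda_available:
        logger.warning(
            "Caller requested CUDA profiling but CUDA is not available on "
            "this node; falling back to CPU profiling."
        )
        return ProfilerState.CPU
    # Every other request is honored unchanged.
    return requested
```

In this sketch CUDA availability is passed in as a parameter, so both branches can be exercised on a single machine regardless of its hardware.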

For testing, I forced it to return the CUDA profiler state and validated that it falls back to CPU. I'm not sure how to add a unit test, given that our tests run on a single machine that either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 25, 2020
…filing"

rohan-varma added a commit that referenced this pull request Sep 26, 2020
…ave CUDA for profiling"

rohan-varma added a commit that referenced this pull request Sep 26, 2020
…filing"

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/172/head branch September 26, 2020 14:15