@rohan-varma rohan-varma commented Sep 14, 2020

Stack from ghstack:

Closes #39971. This PR adds support for profiling functions decorated with @rpc.functions.async_execution over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one in the stack makes it safe to call disableProfiler() from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than when the blocking processRPC call returns, as was done previously). By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we are able to capture any RPCs the function issued as well as the actual work done on the other node.
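The deferral described above can be illustrated with a toy model. This is only a sketch using the Python standard library, not the C++ code in this PR: `ToyProfiler`, `process_request`, and the event names are hypothetical stand-ins, and the only idea carried over is that the profiler is disabled in a completion callback on the future rather than when the request handler returns.

```python
import threading
import time
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for a profiler: records names while enabled."""

    def __init__(self):
        self.enabled = False
        self.events = []
        self._lock = threading.Lock()

    def enable(self):
        with self._lock:
            self.enabled = True

    def record(self, name):
        with self._lock:
            if self.enabled:
                self.events.append(name)

    def disable(self):
        # In this toy model, disabling is safe from any thread.
        with self._lock:
            self.enabled = False
            return list(self.events)


def process_request(profiler, async_fn):
    """Enable profiling, start the async work, disable only once it is done."""
    profiler.enable()
    profiler.record("request_received")
    work_future = async_fn(profiler)  # returns before the work has finished
    result_future = Future()

    def on_done(fut):
        # Runs on whichever thread completes work_future.
        events = profiler.disable()
        result_future.set_result((fut.result(), events))

    work_future.add_done_callback(on_done)
    return result_future


def slow_async_add(profiler):
    """Toy async function: does its work on another thread."""
    fut = Future()

    def worker():
        time.sleep(0.05)
        profiler.record("add")
        fut.set_result(3)

    threading.Thread(target=worker).start()
    return fut


value, events = process_request(ToyProfiler(), slow_async_add).result(timeout=5)
print(value, events)
```

If the profiler were disabled as soon as `process_request` returned, the "add" event recorded on the worker thread would be missed, which is the gap this PR describes closing.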

For example, if the following async function is run on a server over RPC:

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node ID. Here is an example profiling output:

-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
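The event names in the table follow a visible pattern: each remote event is prefixed with the name of the RPC that triggered it, joined by `#remote_op: `. The helpers below are hypothetical and only reproduce that string pattern as it appears in the output above; they are not the functions PyTorch uses to build these names.

```python
def rpc_event_name(fn_name: str, src: str, dst: str) -> str:
    """Name of an rpc_async event, in the format shown in the table."""
    return f"rpc_async#{fn_name}({src} -> {dst})"


def remote_event_name(parent_event: str, remote_event: str) -> str:
    """Name of an event recorded on the remote side of parent_event."""
    return f"{parent_event}#remote_op: {remote_event}"


# worker1 calls slow_async_add on worker2, which calls slow_add on worker3,
# which finally runs aten::add.
outer = rpc_event_name("slow_async_add", "worker1", "worker2")
nested = remote_event_name(outer, rpc_event_name("slow_add", "worker2", "worker3"))
leaf = remote_event_name(nested, "aten::add")
print(leaf)
```

Reading a name from right to left therefore gives the chain of RPCs that led to the operation, which matches the Node ID column increasing from 1 to 3.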

This PR also moves much of the profiling logic to rpc/utils.cpp to declutter the request_callback code.

Differential Revision: D23638387

rohan-varma added a commit that referenced this pull request Sep 14, 2020
…ion over RPC.

Pull Request resolved: #44664

ghstack-source-id: 112032360

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

codecov bot commented Sep 15, 2020

Codecov Report

❗ No coverage uploaded for pull request base (gh/rohan-varma/173/base@4e68a6a).
The diff coverage is n/a.

@@                    Coverage Diff                     @@
##             gh/rohan-varma/173/base   #44664   +/-   ##
==========================================================
  Coverage                           ?   68.04%           
==========================================================
  Files                              ?      393           
  Lines                              ?    51019           
  Branches                           ?        0           
==========================================================
  Hits                               ?    34714           
  Misses                             ?    16305           
  Partials                           ?        0           

dr-ci bot commented Sep 15, 2020

💊 CI failures summary and remediations

As of commit 9250e79 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



};

// A struct to control settings of disableProfiler options, to be used in
// conjunction with TlSProfilerGuard.
Contributor

Tls -> TLS

  event_lists = disableProfiler();
}
if (cb_) {
  (*cb_)(event_lists);
Contributor

This predates this PR. Just curious: what if cb_ throws? Or is there logic that prevents it from throwing in this dtor?

Contributor Author (@rohan-varma, Sep 18, 2020)

I don't think we handle the case where it throws, unfortunately. If this happens, then in the only current use case for this we would simply fail to return the remotely profiled event data. This probably isn't a critical error since the underlying RPC could still run fine, so we can probably wrap this with a try/catch and log a warning if that happens.
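The try/catch idea in the reply above can be sketched in Python with a context manager standing in for the C++ guard's destructor. `ProfilerGuard`, `collect_events`, and the logger name are hypothetical and chosen for illustration; the point is only that a failing callback is logged and does not affect the guarded work.

```python
import logging

logger = logging.getLogger("toy_profiler")


class ProfilerGuard:
    """Hypothetical guard: on exit, collects events and passes them to a callback."""

    def __init__(self, collect_events, callback=None):
        self._collect_events = collect_events
        self._callback = callback
        self.callback_failed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        events = self._collect_events()
        if self._callback is not None:
            try:
                self._callback(events)
            except Exception:
                # Losing profiling data should not fail the underlying request.
                self.callback_failed = True
                logger.warning(
                    "Profiler callback failed; dropping events", exc_info=True
                )
        return False  # never swallow exceptions raised by the guarded body


def bad_callback(events):
    raise RuntimeError("serialization failed")


with ProfilerGuard(lambda: ["aten::add"], bad_callback) as guard:
    result = 1 + 2

print(result, guard.callback_failed)
```

In C++ the same concern is sharper, because an exception escaping a destructor typically terminates the program, so catching inside the destructor is the usual choice.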

thread_event_lists event_lists = disableProfiler();
thread_event_lists event_lists;
if (profilerDisableOptions_) {
event_lists = disableProfiler(
Contributor

nit: would it be better to let disableProfiler take an optional profilerDisableOptions_ instead of branching here? That way, if profilerDisableOptions_ changes in the future, the change can be contained in one place.

// Enable the profiler with the config from the sender.
std::vector<torch::autograd::profiler::Event> profiledEvents;
torch::autograd::profiler::ProfilerDisableOptions requestThreadOptions;
requestThreadOptions.cleanupTLSState = true;
Contributor

nit, and this can come in a follow-up PR: should we add a ctor for requestThreadOptions instead of manually setting its fields? Or do we need to modify these fields later?

Contributor Author

Added in this PR itself, since I did some refactoring to make disableProfiler take in this struct anyways.
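The shape of that refactoring can be sketched as follows. The two field names mirror `cleanupTLSState` and `consolidate` from this PR's diff (both default to true there); the step names and what each flag controls in this sketch are assumptions made for illustration, not the documented semantics of the C++ struct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProfilerDisableOptions:
    # Field names follow the C++ struct in this PR; meanings here are assumed.
    cleanup_tls_state: bool = True
    consolidate: bool = True


def disable_profiler(options: Optional[ProfilerDisableOptions] = None) -> List[str]:
    """Return the steps a disable call would perform under the given options."""
    opts = options if options is not None else ProfilerDisableOptions()
    steps = ["stop_recording"]
    if opts.consolidate:
        steps.append("consolidate_events")
    if opts.cleanup_tls_state:
        steps.append("cleanup_tls_state")
    return steps


default_steps = disable_profiler()
callback_steps = disable_profiler(ProfilerDisableOptions(cleanup_tls_state=False))
print(default_steps, callback_steps)
```

Callers that want the default behavior pass nothing, and the branching on options lives in one function, which is the containment the reviewer asked for.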


// Only clean up TLS states of profiler if we are disabling on
// the main thread.
bool shouldCleanUpTLSStates = (std::this_thread::get_id() == tid);
Contributor

Could you please elaborate on why we don't need to clean up TLS on the child thread?

Contributor Author

Per some discussions with @ilia-cher, the way I understand it is that child threads whose thread-local states were propagated over by parent threads will pop those thread-local states after the async task has finished executing.

Since the profiler wasn't explicitly enabled in this thread (rather, it was carried over by at::wrapPropagateTLSState), we shouldn't explicitly clean up the TLS states of the profiler either (which is what disableProfiler() would do by default). Those states will be cleaned up after the guard in wrapPropagateTLSState exits.

Contributor

I'm assuming shouldCleanUpTLSStates is only True when wrappedRpcResponseFuture has already completed and is running inline as a result? If so, should we do this cleanup logic after calling addCallback so that it runs on the main thread and cleans up the state accordingly? We could run the cleanup logic only if we know the future has completed.

It's messy to have code like this which is conditional on thread ids.

Contributor Author (@rohan-varma, Sep 18, 2020)

In addition to the case that you mentioned, I believe it's also possible to encounter this when running with a single-threaded server, where all callbacks run on the same thread. Also, with a thread pool, I think there is a chance that we run the callback on the same thread that created the future (this would happen if the thread in the pool that called markCompleted() happened to be the same one).

Although, I think we can remove the thread_id conditioning and just never clean up TLS state in the callback. The reason is, in the case where we run inline/on the main thread, we already wrap this callback with an extra layer of TLSState, so we would not want to clean up this extra state (it would be automatically cleaned by the ThreadLocalStateGuard destructor). I'll validate that this passes with the single-threaded server test cases.

Contributor Author

Looks like this works, also added a test in jit/test_misc.cpp which should simulate this issue much more cleanly.

Contributor

> The reason is, in the case where we run inline/on the main thread, we already wrap this callback with an extra layer of TLSState, so we would not want to clean up this extra state (it would be automatically cleaned by the ThreadLocalStateGuard destructor).

Is this true for all functions run by threads from the pool?

Contributor Author

It's actually only true for specific callbacks such as this one: ones that are wrapped with at::wrapPropagateTLSState (or otherwise use ThreadLocalStateGuard, but wrapPropagateTLSState is the recommended API for continuation callbacks). It won't carry over thread-local state implicitly for an arbitrary function run by the thread pool.
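A rough Python analogue of this behavior can be built with `contextvars`. This is not how at::wrapPropagateTLSState is implemented; `wrap_propagate_state` and `profiler_enabled` are hypothetical names, and the sketch only shows the two properties discussed in this thread: state is visible in a pool thread only for wrapped callbacks, and it is gone again once the wrapped callback returns, with no explicit cleanup.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical piece of "thread-local" state, off by default.
profiler_enabled = contextvars.ContextVar("profiler_enabled", default=False)


def wrap_propagate_state(fn):
    """Capture the caller's context now; run fn inside that copy later."""
    captured = contextvars.copy_context()

    def wrapped(*args, **kwargs):
        return captured.run(fn, *args, **kwargs)

    return wrapped


def callback():
    return profiler_enabled.get()


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the single worker thread before any state is set on the caller.
    pool.submit(lambda: None).result()

    profiler_enabled.set(True)  # enable on the calling thread only

    plain = pool.submit(callback).result()  # arbitrary function, no propagation
    wrapped = pool.submit(wrap_propagate_state(callback)).result()
    after = pool.submit(callback).result()  # state did not stick to the worker

print(plain, wrapped, after)
```

Because the captured context is left behind when the wrapped call returns, the worker thread has nothing to clean up, which parallels the argument for not cleaning up profiler TLS state inside the callback.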

    profilerStart = &e;
    found_cpu_start = true;
  }
  if (cuda_profiling_enabled && 0 == strcmp(e.name(), "__cuda_start_event")) {
Contributor

this can be an else if?

// We should always find __start_profile.
TORCH_CHECK(
    profilerStart != nullptr, "Expected to find __start_profile event.");
// Should have >= 1 CUDA start event.
Contributor

is this true even if the user function is CPU only but on a machine with CUDA devices? Or is it true that in this case cuda_profiling_enabled will be false?

Contributor Author (@rohan-varma, Sep 18, 2020)

If the user function is CPU-only but the machine has a CUDA device, it should still be true, since we push __cuda_start_event whenever we run with the profile_cuda flag, regardless of what the function does. This is a good point, though: let me check what happens when we enable this on a CPU-only machine (I expect we crash during profiler creation).

And the comment should probably more accurately read "Should have >= 1 CUDA start event if cuda_profiling_enabled."
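The checks discussed in this thread can be summarized in a small sketch. The event names `__start_profile` and `__cuda_start_event` come from the diff; representing events as dictionaries and the function name `find_start_events` are assumptions for illustration. The sketch uses the `elif` the reviewer suggested and applies the CUDA check only when CUDA profiling is enabled.

```python
def find_start_events(events, cuda_profiling_enabled):
    """Locate the profiler start event and any CUDA start events."""
    profiler_start = None
    cuda_starts = []
    for event in events:
        if profiler_start is None and event["name"] == "__start_profile":
            profiler_start = event
        elif cuda_profiling_enabled and event["name"] == "__cuda_start_event":
            cuda_starts.append(event)

    # We should always find __start_profile.
    if profiler_start is None:
        raise ValueError("Expected to find __start_profile event.")
    # Should have >= 1 CUDA start event if cuda_profiling_enabled.
    if cuda_profiling_enabled and not cuda_starts:
        raise ValueError("Expected to find at least one __cuda_start_event.")
    return profiler_start, cuda_starts


sample = [
    {"name": "__start_profile"},
    {"name": "__cuda_start_event"},
    {"name": "aten::add"},
]
start, cuda = find_start_events(sample, cuda_profiling_enabled=True)
cpu_start, cpu_cuda = find_start_events(sample[:1], cuda_profiling_enabled=False)
print(start, len(cuda), len(cpu_cuda))
```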

Contributor Author

On the caller, if there's no CUDA device, then we fail as expected with a message that CUDA is not enabled.

On the callee, we get a stack trace with the following error message:

> W0918 11:52:41.866964 1090162 record_function.cpp:171] Exception in RecordFunction callback: CUDA used in profiler but not enabled.

That can happen if machine A sends an RPC to B and A has CUDA but B does not. I wonder if in this case, instead of crashing if we want to support this, it's worth it to check if B also has CUDA, and fall back to the CPU-only profiler in the case that it does not instead of crashing?

Contributor

> I wonder if in this case, instead of crashing if we want to support this

Yep, I would vote for supporting this. I did see users on the forum who want to connect CUDA and non-CUDA servers using RPC.
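The fallback being proposed here could look roughly like the sketch below. It is a hypothetical illustration, not code from this PR: `ProfilerConfig`, `resolve_profiler_config`, and the `local_cuda_available` argument are assumed names, and a real implementation would query the callee's actual CUDA availability.

```python
import warnings
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProfilerConfig:
    """Hypothetical profiler config sent along with the RPC request."""
    use_cuda: bool


def resolve_profiler_config(requested, local_cuda_available):
    """Downgrade a CUDA profiling request to CPU-only if this node lacks CUDA."""
    if requested.use_cuda and not local_cuda_available:
        warnings.warn(
            "CUDA profiling was requested but CUDA is not available on this "
            "node; falling back to CPU-only profiling."
        )
        return replace(requested, use_cuda=False)
    return requested


# Machine A has CUDA and asks for CUDA profiling; machine B does not have CUDA.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    downgraded = resolve_profiler_config(
        ProfilerConfig(use_cuda=True), local_cuda_available=False
    )

unchanged = resolve_profiler_config(
    ProfilerConfig(use_cuda=True), local_cuda_available=True
)
print(downgraded, unchanged, len(caught))
```

Warning instead of raising keeps the RPC itself working while still telling the user that the returned profile contains no CUDA timings.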

std::vector<char>& payload,
const rpc::Message& message);

TORCH_API void populateRemoteProfiledEvents(
Contributor

let's add some comments to describe this func

Contributor Author

Added these in the update.

INIT_METHOD_TEMPLATE = "file://{file_name}"



Contributor

one redundant new line?


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)
Contributor

do we need to add a test with CUDA ops as well?

Contributor Author

Yes, that's a good point.

…tion execution over RPC."

Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
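To read the nested event names in the output above, it can help to split them into the chain of enclosing RPCs and the final operation. The sketch below assumes that `#remote_op: ` is the separator, which is inferred from the sample output rather than from a documented format; `split_event_name` is a hypothetical helper.

```python
# Separator inferred from the sample profiler output (an assumption).
REMOTE_SEPARATOR = "#remote_op: "


def split_event_name(name):
    """Split an event name into (enclosing RPC chain, final op name)."""
    *rpc_chain, op = name.split(REMOTE_SEPARATOR)
    return rpc_chain, op


name = (
    "rpc_async#slow_async_add(worker1 -> worker2)#remote_op: "
    "rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add"
)
rpc_chain, op = split_event_name(name)
for depth, rpc_name in enumerate(rpc_chain):
    print("  " * depth + rpc_name)
print("  " * len(rpc_chain) + op)
```

An event recorded locally, such as the plain `aten::empty` row, yields an empty chain.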

This PR also moves much of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
Comment on lines 377 to 378
bool cleanupTLSState = true;
bool consolidate = true;
Contributor

Can you add some comments about what each of these options means?

Contributor Author

Added in the update.


// Only clean up TLS states of profiler if we are disabling on
// the main thread.
bool shouldCleanUpTLSStates = (std::this_thread::get_id() == tid);
Contributor

I'm assuming shouldCleanUpTLSStates is only True when wrappedRpcResponseFuture has already completed and is running inline as a result? If so, should we do this cleanup logic after calling addCallback so that it runs on the main thread and cleans up the state accordingly? We could run the cleanup logic only if we know the future has completed.

It's messy to have code like this which is conditional on thread ids.

Comment on lines 524 to 525
auto event_lists = torch::autograd::profiler::disableProfiler(
shouldCleanUpTLSStates, true);
Contributor

It looks like we're doing some cleanup as part of the destructor of torch::autograd::profiler::TLSProfilerGuard g and then there is some additional cleanup here. Could you explain what is the difference between the two?

Contributor Author

The destructor of TLSProfilerGuard cleans up the profiler thread local states in the main thread, but it does not call consolidate(), which clears out the event lists. In the continuation thread, we don't clean up thread local states (those are restored by the destructor of ThreadLocalStateGuard in at::wrapPropagateTLSState), but we do call consolidate() to get all of the events, even the async ones (which would not have been logged if we had called consolidate() on the main thread).

…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

rohan-varma added a commit that referenced this pull request Sep 18, 2020
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* **#44646 Remove thread_local RecordFunctionGuard from profiler.**

Per a discussion with @ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.

Differential Revision: [D23683998](https://our.internmc.facebook.com/intern/diff/D23683998/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
…another thread. "

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* **#44653 [RPC profiling] Allow disableProfiler() to be called from another thread.**
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Per an offline discussion with @ilia-cher, this changes the profiler so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until an async callback runs, possibly on a different thread, in order to support RPC async function profiling.

This is done by introducing two flags, `cleanupTLSState` and `consolidate`, which control, respectively, whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.

Differential Revision: [D23638499](https://our.internmc.facebook.com/intern/diff/D23638499/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23638499/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* **#44655 [RPC profiling] Don't wrap toHere() calls with profiling**
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Since `toHere()` does not execute operations (torch operators) over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message. Doing so only adds unnecessary overhead.

Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).

Differential Revision: [D23641466](https://our.internmc.facebook.com/intern/diff/D23641466/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
… single threaded server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. For example, that assumption led to a
bug in the previous diff (which has since been fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2020
…ion over RPC.

Pull Request resolved: #44664

ghstack-source-id: 112403438

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)
rohan-varma added a commit that referenced this pull request Sep 22, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when enabling the profiler on the server, the server may be a different machine that does
not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU profiling and log a warning.
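The fallback can be sketched as follows. This is a simplified Python illustration rather than the actual server-side C++ code; `resolve_profiler_state` and the string states `"CUDA"` and `"CPU"` are hypothetical.

```python
import warnings


def resolve_profiler_state(requested_state, cuda_available):
    """Pick the profiler state to use on the server (hypothetical helper)."""
    if requested_state == "CUDA" and not cuda_available:
        # Warn and degrade instead of crashing on a CUDA-less server.
        warnings.warn(
            "Profiler was requested with CUDA, but CUDA is not available "
            "on this node; falling back to CPU profiling."
        )
        return "CPU"
    return requested_state


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    state = resolve_profiler_state("CUDA", cuda_available=False)

print(state, len(caught))
```

Passing `cuda_available` as an argument also makes the fallback path testable on a single machine, since the test can simply pass `False`.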

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that we only have single-machine tests and the machine either has or doesn't have CUDA.
Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
@rohan-varma
Contributor Author

Test failure existed before this diff and should be resolved by #45162

// });
// Code to profile
// }

Contributor

unintentional new line? the comment above is for this struct.

)
try:
    return self._rpc_backend_options
except AttributeError:
Contributor

when will this error be triggered?

Contributor Author

@rohan-varma Sep 23, 2020

This error actually occurs in the default flow, when we are not running under the single-threaded PG agent. In single-threaded mode we set _rpc_backend_options (see the setter method below); otherwise we call construct_rpc_backend_options below, as before.

Not sure if there's a better way to write this, but something like the following would also work if we don't want to use try/except:

if hasattr(self, '_single_threaded_options'):
    return self._single_threaded_options  # means we want to run in single-threaded mode
else:
    return construct_rpc_backend_options(...)
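
For comparison, here is a minimal, self-contained sketch of the try/except variant with a property and a setter. The class name `RpcTestFixture`, the helper `_construct_default_options`, and the option values are hypothetical stand-ins for illustration; the real fixture calls `construct_rpc_backend_options(...)` instead.

```python
class RpcTestFixture:
    """Hypothetical stand-in for the test fixture discussed above."""

    def _construct_default_options(self):
        # Placeholder for the real construct_rpc_backend_options(...) call.
        return {"num_send_recv_threads": 8}

    @property
    def rpc_backend_options(self):
        try:
            # Only set when a test opted into single-threaded mode.
            return self._rpc_backend_options
        except AttributeError:
            # Default flow: nothing was overridden, so build the usual options.
            return self._construct_default_options()

    @rpc_backend_options.setter
    def rpc_backend_options(self, new_options):
        self._rpc_backend_options = new_options


fixture = RpcTestFixture()
default_options = fixture.rpc_backend_options

# Opt into single-threaded mode by overriding the options.
fixture.rpc_backend_options = {"num_send_recv_threads": 1}
single_threaded_options = fixture.rpc_backend_options
```

Both forms behave the same here; the `hasattr` version makes the "was it overridden?" check explicit, while try/except keeps the common case to a single attribute lookup.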

rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU and log a warning.

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that we have single-machine tests and the machine either has CUDA or doesn't.
Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU and log a warning.

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that we have single-machine tests and the machine either has CUDA or doesn't.
Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is for RPC profiling to cover every type of function users can run.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes, rather than disabling it as soon as the blocking `processRPC` call returns, as was done previously. By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we capture any RPCs the function issued as well as the actual work done on the other node.
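
The deferral pattern can be illustrated with a small Python sketch. It uses `concurrent.futures.Future` from the standard library rather than the C++ futures in the RPC layer, and `ToyProfiler`, `process_request`, and `slow_async_work` are hypothetical names invented for this example. The point it shows is that the profiler is disabled in a completion callback, which may run on a different thread from the one that enabled it.

```python
import threading
import time
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler; not a PyTorch API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.enabled = False
        self.events = []

    def enable(self):
        with self._lock:
            self.enabled = True

    def record(self, name):
        with self._lock:
            if self.enabled:
                self.events.append(name)

    def disable(self):
        # Lock-protected, so it can be called from any thread.
        with self._lock:
            self.enabled = False
            return list(self.events)


def process_request(profiler, async_fn):
    """Run async_fn under profiling; return a future of the recorded events."""
    profiled = Future()
    profiler.enable()
    inner = async_fn(profiler)  # returns right away with a Future

    def on_done(_fut):
        # Runs when the async work finishes, possibly on another thread.
        profiled.set_result(profiler.disable())

    inner.add_done_callback(on_done)
    return profiled


def slow_async_work(profiler):
    fut = Future()

    def work():
        time.sleep(0.05)
        profiler.record("slow_add")
        fut.set_result(None)

    profiler.record("dispatch")
    threading.Thread(target=work).start()
    return fut


events = process_request(ToyProfiler(), slow_async_work).result(timeout=5)
```

Had the profiler been disabled as soon as `async_fn` returned, only the `dispatch` event would have been recorded; deferring to the callback also captures the work done after the function returned its future.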

For example, if the following async function is run on a server over RPC:

```
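import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the nested RPC runs slow_add on worker `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))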
```

we expect to see the original RPC, the nested RPC, and the actual `torch.add()` work all profiled. Each of these events should be recorded with the correct node ID. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
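
In the output above, nested remote events are named by chaining each hop with the `#remote_op: ` separator. As a rough sketch, assuming that separator stays as shown in this output, a name can be split back into its hops like this (`split_remote_event` is a hypothetical helper, not part of the profiler API):

```python
REMOTE_OP_SEPARATOR = "#remote_op: "  # taken from the sample output above


def split_remote_event(name):
    """Split a profiled event name into its chain of hops, outermost first."""
    return name.split(REMOTE_OP_SEPARATOR)


nested_name = (
    "rpc_async#slow_async_add(worker1 -> worker2)"
    "#remote_op: rpc_async#slow_add(worker2 -> worker3)"
    "#remote_op: aten::add"
)
hops = split_remote_event(nested_name)
# hops[0] is the original RPC, hops[1] the nested RPC, hops[-1] the actual op.
```

The last element is the operator that actually ran, and the number of elements minus one is how many RPC hops away from the caller it executed.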

This PR also moves much of the profiling logic into `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
…pport async function execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is for RPC profiling to cover every type of function users can run.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes, rather than disabling it as soon as the blocking `processRPC` call returns, as was done previously. By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we capture any RPCs the function issued as well as the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
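import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the nested RPC runs slow_add on worker `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))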
```

we expect to see the original RPC, the nested RPC, and the actual `torch.add()` work all profiled. Each of these events should be recorded with the correct node ID. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```

This PR also moves much of the profiling logic into `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. For example, that assumption caused a
bug in the previous diff (since fixed).
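
One generic way to guard against a hidden multi-thread assumption is to run the same workload against a pool with a single worker and against a larger pool, and check that both finish with the same result. The sketch below uses Python's `ThreadPoolExecutor` purely as an analogy for the server's thread pool; `run_chain` is a hypothetical helper and is not how the tests in this PR are written.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain(num_threads):
    """Run two dependent tasks on a pool with `num_threads` workers."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        first = pool.submit(lambda: 1 + 2)
        # Wait for the first task from the calling thread, not from inside a
        # worker, so that no worker ever blocks waiting on another worker.
        second = pool.submit(lambda x: x * 10, first.result())
        return second.result(timeout=5)


single_threaded_result = run_chain(1)
multi_threaded_result = run_chain(4)
```

If a task blocked inside a worker while waiting on another task, the single-worker run would hang, which is the kind of problem a single-threaded test configuration is meant to surface.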

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that
does not have CUDA even though the caller does. Previously we would crash in this case; now we
fall back to CPU and log a warning.

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that we have single-machine tests and the machine either has CUDA or doesn't.
Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works in single-threaded server
scenarios and that we don't assume multiple threads are available
when working on this code. For example, that assumption caused a
bug in the previous diff (since fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 23, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that does not have CUDA even though the caller does. Previously this would crash; now we fall back to CPU profiling and log a warning.
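
As a rough illustration of the fallback described above, here is a minimal Python sketch. It is not the C++ code from this PR: `ProfilerState`, `resolve_profiler_state`, and the `cuda_available` flag are hypothetical names chosen for the example, and the real check presumably lives on the server side of the RPC layer.

```python
import enum
import logging

logger = logging.getLogger(__name__)


class ProfilerState(enum.Enum):
    # Hypothetical stand-in for the profiler state sent along with an RPC.
    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_profiler_state(requested, cuda_available):
    """Return the profiler state the server should actually enable.

    If the caller asked for CUDA profiling but this node has no CUDA,
    log a warning and fall back to CPU profiling instead of failing.
    """
    if requested is ProfilerState.CUDA and not cuda_available:
        logger.warning(
            "Caller requested CUDA profiling but CUDA is not available on "
            "this node; falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested
```

Passing `cuda_available` as an argument, rather than querying the device inside the function, also makes it possible to exercise both branches on a single machine.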

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I am not sure how to add a unit test, given that our tests run on a single machine that either has or does not have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
…tion execution over RPC."

* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes, rather than disabling it when the blocking `processRPC` call returns, as was done previously. By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we capture any RPCs the function issued as well as the actual work done on the other node.
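
The idea of deferring the disable step can be sketched in plain Python with `concurrent.futures.Future`. This is only a toy model under stated assumptions: `ToyProfiler` and `process_request` are hypothetical names, and the real change is in the C++ RPC request handling, not in Python.

```python
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler."""

    def __init__(self):
        self.enabled = False
        self.events = []

    def enable(self):
        self.enabled = True

    def record(self, name):
        # Events are only kept while the profiler is enabled.
        if self.enabled:
            self.events.append(name)

    def disable(self):
        self.enabled = False
        return list(self.events)


def process_request(profiler, start_async_fn):
    """Handle one request whose user function returns a Future."""
    profiler.enable()
    profiler.record("process_rpc")
    work_future = start_async_fn()  # returns before the work is finished
    response = Future()

    def on_done(fut):
        # Runs when the work completes, possibly on another thread.
        # Disabling here, instead of right after start_async_fn(),
        # keeps events recorded after process_request has returned.
        events = profiler.disable()
        response.set_result((fut.result(), events))

    work_future.add_done_callback(on_done)
    return response
```

If the profiler were disabled immediately after `start_async_fn()` returned, anything recorded while the returned future was still pending would be lost, which is the gap this PR closes for async functions.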

For example, if the following async function is run on a server over RPC:

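```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    # Simulate slow work before the actual add.
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the nested RPC runs slow_add on worker `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```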

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node ID. Here is an example of the profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves much of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…pport async function execution over RPC."

rohan-varma added a commit that referenced this pull request Sep 24, 2020
… single threaded

server"

rohan-varma added a commit that referenced this pull request Sep 24, 2020
…filing"

…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```

This PR also moves much of the profiling logic into `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…pport async function execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as is already possible for builtins, JIT functions, and blocking Python UDFs. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes, rather than disabling it as soon as the blocking `processRPC` call returns, as was done previously. By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we capture any RPCs the function issued as well as the actual work done on the other node.
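
The deferral described above can be illustrated with a small, self-contained Python sketch. This is only a toy model under stated assumptions: `ToyProfiler`, `process_request`, and `start_async_work` are hypothetical names made up for illustration, and the real change lives in the C++ RPC request handling, not in code like this. The sketch shows the one idea that matters: `disable()` is attached to the completion of the work's future instead of being called when the handler returns.

```python
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for a profiler; not the PyTorch profiler."""

    def __init__(self):
        self.enabled = False
        self.events = []

    def enable(self):
        self.enabled = True

    def record(self, name):
        # Events are only captured while the profiler is enabled.
        if self.enabled:
            self.events.append(name)

    def disable(self):
        self.enabled = False
        return list(self.events)


def process_request(profiler, start_async_work):
    """Enable profiling, start the work, and defer disable() to completion."""
    profiler.enable()
    work_future = start_async_work()  # returns immediately with a Future
    profiled = Future()
    # Disable only when the work's future completes (possibly on another
    # thread), instead of right after start_async_work() returns.
    work_future.add_done_callback(
        lambda fut: profiled.set_result((fut.result(), profiler.disable()))
    )
    return profiled


profiler = ToyProfiler()
pending = Future()


def start_async_work():
    profiler.record("request received")
    return pending


response = process_request(profiler, start_async_work)
still_enabled_after_return = profiler.enabled  # handler returned, work pending
profiler.record("nested work")  # happens after the handler has returned
pending.set_result(3)
result, events = response.result()
```

If the profiler were disabled as soon as `process_request` returned, the `"nested work"` event would be lost, which mirrors why async functions could not be profiled before this change.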

For example, if the following async function is run on a server over RPC:

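```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    # Simulate a slow operation before doing the actual add.
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a Future; the response is sent once that future completes.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```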

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node ID. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total % Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
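
In the sample output, each remote event name appears to be built by joining the enclosing RPC names with `#remote_op: `. A minimal sketch of splitting such a name back into its chain, assuming that separator is stable (this is inferred only from the output above; `split_event_name` and `nesting_depth` are hypothetical helpers, not part of the profiler API):

```python
# Separator observed in the sample profiler output; assumed, not documented here.
REMOTE_OP_SEPARATOR = "#remote_op: "


def split_event_name(name):
    """Split a profiled event name into its chain of enclosing RPCs and the op."""
    return name.split(REMOTE_OP_SEPARATOR)


def nesting_depth(name):
    """Number of RPC hops the event is nested under (0 for a local event)."""
    return len(split_event_name(name)) - 1


name = (
    "rpc_async#slow_async_add(worker1 -> worker2)#remote_op: "
    "rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add"
)
parts = split_event_name(name)
depth = nesting_depth(name)
local_depth = nesting_depth("aten::empty")
```

Here `parts` ends with `aten::add`, and `depth` is 2 because the add ran two RPC hops away from the caller, which matches the row with Node ID 3.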

This PR also moves much of the profiling logic into `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
… single threaded

server"

server**
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

server

This ensures that RPC profiling works when the server is single-threaded,
and that we do not assume multiple threads are available when working on
this code. That assumption caused a bug in the previous diff, which has
since been fixed.

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 24, 2020
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that does not have CUDA even though the caller does. Previously we would crash in this case; now we fall back to CPU profiling and log a warning.
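
The fallback can be sketched in a few lines of Python. This is an illustrative model only: the `ProfilerState` enum and `resolve_profiler_state` below are hypothetical names, and the actual check is made in the C++ RPC code when the server enables the profiler for an incoming request.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class ProfilerState(Enum):
    """Hypothetical profiler states used only for this sketch."""

    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_profiler_state(requested, cuda_available):
    """Return the state the server should actually use for a request."""
    if requested is ProfilerState.CUDA and not cuda_available:
        # The caller asked for CUDA profiling but this machine has no CUDA:
        # warn and fall back to CPU profiling instead of crashing.
        logger.warning(
            "CUDA profiling was requested but CUDA is not available; "
            "falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested


fallback = resolve_profiler_state(ProfilerState.CUDA, cuda_available=False)
unchanged = resolve_profiler_state(ProfilerState.CUDA, cuda_available=True)
cpu_only = resolve_profiler_state(ProfilerState.CPU, cuda_available=False)
```

Taking CUDA availability as a parameter is a design choice of this sketch; it lets both branches run on one machine regardless of its hardware.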

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I am not sure how to add a unit test, given that our tests run on a single machine that either has or does not have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 25, 2020
…ave CUDA for profiling"

rohan-varma added a commit that referenced this pull request Sep 25, 2020
…filing"

@facebook-github-bot
Contributor

This pull request has been merged in 27ab9bc.

rohan-varma added a commit that referenced this pull request Sep 26, 2020
…ave CUDA for profiling"

rohan-varma added a commit that referenced this pull request Sep 26, 2020
…filing"

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/173/head branch September 29, 2020 14:23