[RPC profiling] Extend RPC profiling to support async function execution over RPC. #44664
…ion over RPC.

Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC, as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call, as was done previously). By the time the future completes, the async function has been kicked off and the future corresponding to it has completed, so we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:
```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id.

Here is an example profiling output:
```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
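To make the "defer disabling until the future completes" idea concrete, here is a minimal stdlib-only sketch. It is an analogy, not the PR's C++ code: `ToyProfiler`, `handle_request`, and `on_profiled` are hypothetical names, and `concurrent.futures.Future` stands in for the RPC response future.

```python
import threading
import time
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler (not a PyTorch API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []
        self._enabled = False

    def enable(self):
        with self._lock:
            self._enabled = True

    def record(self, name):
        with self._lock:
            if self._enabled:
                self._events.append(name)

    def disable(self):
        # Safe to call from any thread: returns a copy of everything recorded.
        with self._lock:
            self._enabled = False
            return list(self._events)


def handle_request(profiler, on_profiled):
    """Enable profiling, start async work, disable only when the work is done."""
    profiler.enable()
    response = Future()

    def async_work():
        time.sleep(0.05)  # stands in for the nested RPC round trip
        profiler.record("aten::add")
        response.set_result(3)

    profiler.record("rpc_async#slow_async_add")
    threading.Thread(target=async_work).start()
    # Deferred disable: runs once the response future completes, possibly on
    # another thread, so the event recorded by async_work is still captured.
    response.add_done_callback(lambda fut: on_profiled(profiler.disable()))
    return response


collected = []
done = threading.Event()


def on_profiled(events):
    collected.append(events)
    done.set()


response = handle_request(ToyProfiler(), on_profiled)
result = response.result(timeout=5)
done.wait(timeout=5)  # the done-callback may finish slightly after result()
```

If `disable()` were called right after starting the thread instead, the `aten::add` event would be missing, which is the gap this PR describes for async functions.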
Pull Request resolved: #44664
ghstack-source-id: 112032360
Codecov Report
@@ Coverage Diff @@
## gh/rohan-varma/173/base #44664 +/- ##
==========================================================
Coverage ? 68.04%
==========================================================
Files ? 393
Lines ? 51019
Branches ? 0
==========================================================
Hits ? 34714
Misses ? 16305
Partials ? 0
💊 CI failures summary and remediations
As of commit 9250e79 (more details on the Dr. CI page):
💚 💚 Looks good so far! There are no failures yet. 💚 💚
This comment was automatically generated by Dr. CI and has been revised 42 times.
torch/csrc/autograd/profiler.h
Outdated
    };

    // A struct to control settings of disableProfiler options, to be used in
    // conjunction with TlSProfilerGuard.
Tls -> TLS
torch/csrc/autograd/profiler.h
Outdated
    event_lists = disableProfiler();
    }
    if (cb_) {
      (*cb_)(event_lists);
This predates this PR. Just curious, what if cb_ throws? Or is there logic that prevents it from throwing in this dtor?
I don't think we handle the case where it throws, unfortunately. If this happens, then in the only current use case for this we would simply fail to return the remotely profiled event data. This probably isn't a critical error since the underlying RPC could still run fine, so we can probably wrap this in a try/catch and log a warning if that happens.
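As a rough illustration of the "wrap it and log a warning" suggestion, here is a Python analogue in which a context manager's `__exit__` plays the role of the guard's destructor. `ProfilerGuard`, `collect_events`, and `bad_callback` are hypothetical names for this sketch; the real guard is C++.

```python
import logging

logger = logging.getLogger("profiler_guard_sketch")


class ProfilerGuard:
    """Hypothetical guard: collects events on exit and hands them to a callback."""

    def __init__(self, collect_events, callback=None):
        self._collect_events = collect_events
        self._callback = callback
        self.callback_failed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        events = self._collect_events()
        if self._callback is not None:
            try:
                self._callback(events)
            except Exception:
                # Losing remote profiling data should not fail the RPC itself,
                # so record the failure and warn instead of propagating.
                self.callback_failed = True
                logger.warning("profiler callback raised; dropping events", exc_info=True)
        return False  # never suppress exceptions raised by the guarded body


def bad_callback(events):
    raise RuntimeError("could not serialize profiled events")


with ProfilerGuard(lambda: ["aten::add"], bad_callback) as guard:
    work_result = 1 + 2  # the guarded "RPC" work still completes
```

The guarded work finishes normally even though the callback raised; only the profiling data is lost.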
torch/csrc/autograd/profiler.h
Outdated
    thread_event_lists event_lists = disableProfiler();
    thread_event_lists event_lists;
    if (profilerDisableOptions_) {
      event_lists = disableProfiler(
nit: would it be better to let disableProfiler take an optional profilerDisableOptions_ instead of branching here? That way, if profilerDisableOptions_ changes in the future, the change can be contained in one place.
    // Enable the profiler with the config from the sender.
    std::vector<torch::autograd::profiler::Event> profiledEvents;
    torch::autograd::profiler::ProfilerDisableOptions requestThreadOptions;
    requestThreadOptions.cleanupTLSState = true;
nit, and this can come in a follow-up PR: should we add a ctor for requestThreadOptions instead of manually setting its fields? Or do we need to modify these fields later?
Added in this PR itself, since I did some refactoring to make disableProfiler take in this struct anyways.
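The two nits above (an optional options argument, and constructing the options in one step) could look roughly like this in Python. This is only a sketch of the API shape: `DisableOptions` and `disable_profiler` are hypothetical mirrors of the C++ struct and function, and the returned list of action names is purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DisableOptions:
    """Hypothetical mirror of the options struct; both flags default to True."""

    cleanup_tls_state: bool = True
    consolidate: bool = True


def disable_profiler(options: Optional[DisableOptions] = None):
    """Return the actions that would be taken; None falls back to the defaults."""
    opts = options if options is not None else DisableOptions()
    actions = []
    if opts.consolidate:
        actions.append("consolidate")
    if opts.cleanup_tls_state:
        actions.append("cleanup_tls_state")
    return actions


# Callers that pass nothing get the old behavior (both steps run).
default_actions = disable_profiler()
# A continuation callback can opt out of TLS cleanup in a single expression.
callback_actions = disable_profiler(DisableOptions(cleanup_tls_state=False))
```

Keeping the defaulting inside the function means call sites never branch on whether options were supplied.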
    // Only clean up TLS states of profiler if we are disabling on
    // the main thread.
    bool shouldCleanUpTLSStates = (std::this_thread::get_id() == tid);
Could you please elaborate on why we don't need to clean up TLS on the child thread?
Per some discussions with @ilia-cher, the way I understand it is that child threads whose thread-local states were propagated over by parent threads will pop those states after the async task has finished executing.
Since the profiler wasn't explicitly enabled in this thread (rather, it was carried over by at::wrapPropagateTLSState), we shouldn't explicitly clean up the TLS states of the profiler either (which is what disableProfiler() would do by default). Those states will be cleaned up after the guard in wrapPropagateTLSState exits.
I'm assuming shouldCleanUpTLSStates is only True when wrappedRpcResponseFuture has already completed and is running inline as a result? If so, should we do this cleanup logic after calling addCallback so that it runs on the main thread and cleans up the state accordingly? We could run the cleanup logic only if we know the future has completed.
It's messy to have code like this which is conditional on thread ids.
In addition to the case that you mentioned, I believe it's also possible to encounter this when running with a single-threaded server, where all callbacks run on the same thread. Also, with a thread pool there might be a chance that we run the callback on the same thread that created the future (this would happen if the pool thread that called markCompleted() happened to be the same one).
That said, I think we can remove the thread_id conditioning and just never clean up TLS state in the callback. The reason is that, in the case where we run inline/on the main thread, we already wrap this callback with an extra layer of TLSState, so we would not want to clean up this extra state (it would be automatically cleaned up by the ThreadLocalStateGuard destructor). I'll validate that this passes with the single-threaded server test cases.
Looks like this works, also added a test in jit/test_misc.cpp which should simulate this issue much more cleanly.
> The reason is, in the case where we run inline/on the main thread, we already wrap this callback with an extra layer of TLSState, so we would not want to clean up this extra state (it would be automatically cleaned by the ThreadLocalStateGuard destructor).
Is this true for all functions run by threads from the pool?
It's actually only true for specific callbacks such as this one: ones that are wrapped with at::wrapPropagateTLSState (or otherwise use ThreadLocalStateGuard, but wrapPropagateTLSState is the recommended API for continuation callbacks). Thread-local state is not carried over implicitly for an arbitrary function run by the thread pool.
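A small Python analogue of the behavior described in this thread, assuming a single thread-local flag. `wrap_propagate_state` is a hypothetical stand-in for at::wrapPropagateTLSState; it captures state on the creating thread, installs it around the callback, and restores the previous state on exit, so the callback itself has nothing to clean up.

```python
import threading

_tls = threading.local()


def get_state():
    """Read the thread-local 'profiler enabled' flag (False if never set)."""
    return getattr(_tls, "profiler_enabled", False)


def wrap_propagate_state(fn):
    """Hypothetical analogue of at::wrapPropagateTLSState for one flag."""
    captured = get_state()  # captured on the thread that creates the callback

    def wrapped(*args, **kwargs):
        previous = get_state()
        _tls.profiler_enabled = captured  # install the parent's state
        try:
            return fn(*args, **kwargs)
        finally:
            # The "guard destructor": restore whatever was there before.
            _tls.profiler_enabled = previous

    return wrapped


seen = {}


def callback():
    seen["inside"] = get_state()


def plain():
    seen["plain"] = get_state()


_tls.profiler_enabled = True  # "profiler enabled" on the main thread
wrapped_cb = wrap_propagate_state(callback)


def worker():
    plain()  # an arbitrary function on the worker sees no propagated state
    wrapped_cb()  # the wrapped callback sees the parent's state
    seen["after"] = get_state()  # and the state is gone again afterwards


t = threading.Thread(target=worker)
t.start()
t.join()
```

Only the wrapped callback observes the propagated flag, matching the point that propagation is not implicit for arbitrary thread-pool work.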
torch/csrc/distributed/rpc/utils.cpp
Outdated
    profilerStart = &e;
    found_cpu_start = true;
    }
    if (cuda_profiling_enabled && 0 == strcmp(e.name(), "__cuda_start_event")) {
this can be an else if?
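For readers following along, the scan being discussed might be sketched like this in Python, with the second check written as an `elif` since one event name cannot match both markers. `find_start_events` is a hypothetical helper; the marker names `__start_profile` and `__cuda_start_event` are taken from the quoted code, and the CUDA check is only applied when CUDA profiling is enabled.

```python
def find_start_events(event_names, cuda_profiling_enabled):
    """Return (cpu_start_index, cuda_start_indices) for a list of event names."""
    cpu_start = None
    cuda_starts = []
    for i, name in enumerate(event_names):
        if cpu_start is None and name == "__start_profile":
            cpu_start = i
        elif cuda_profiling_enabled and name == "__cuda_start_event":
            # Only reached when the first branch did not match this event.
            cuda_starts.append(i)
    if cpu_start is None:
        raise ValueError("Expected to find __start_profile event.")
    if cuda_profiling_enabled and not cuda_starts:
        raise ValueError("Expected at least one __cuda_start_event.")
    return cpu_start, cuda_starts


names = ["__start_profile", "__cuda_start_event", "aten::add"]
with_cuda = find_start_events(names, cuda_profiling_enabled=True)
cpu_only = find_start_events(names, cuda_profiling_enabled=False)
```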
torch/csrc/distributed/rpc/utils.cpp
Outdated
    // We should always find __start_profile.
    TORCH_CHECK(
        profilerStart != nullptr, "Expected to find __start_profile event.");
    // Should have >= 1 CUDA start event.
is this true even if the user function is CPU only but on a machine with CUDA devices? Or is it true that in this case cuda_profiling_enabled will be false?
If the user function is CPU-only but the machine has a CUDA device, it should still be true, since we push __cuda_start_event regardless when running with the profile_cuda flag. This is a good point though; let me check what happens when we enable this on a CPU-only machine (I think we would probably crash during profiler creation).
And the comment should probably more accurately read "Should have >= 1 CUDA start event if cuda_profiling_enabled."
On the caller, if there's no CUDA device, then we fail as expected with a message that CUDA is not enabled.
On the callee, we get a stacktrace with the following error message:
> W0918 11:52:41.866964 1090162 record_function.cpp:171] Exception in RecordFunction callback: CUDA used in profiler but not enabled.
That can happen if machine A sends an RPC to B and A has CUDA but B does not. If we want to support this case, I wonder whether it's worth checking if B also has CUDA and falling back to the CPU-only profiler when it does not, instead of crashing?
> I wonder if in this case, instead of crashing if we want to support this
Yep, I would vote for supporting this. I have seen users on the forum who want to connect CUDA with non-CUDA servers using RPC.
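The fallback proposed here could be sketched as a small decision helper. This is an assumption-laden illustration, not what the PR implements: `resolve_profiler_mode` and the `"CUDA"`/`"CPU"` mode labels are hypothetical, and a real callee would query its own CUDA availability rather than take it as an argument.

```python
def resolve_profiler_mode(requested_mode, callee_has_cuda):
    """Pick the mode the callee should profile with, plus an optional warning."""
    if requested_mode == "CUDA" and not callee_has_cuda:
        warning = (
            "caller requested CUDA profiling but the callee has no CUDA; "
            "falling back to CPU-only profiling"
        )
        return "CPU", warning
    return requested_mode, None


# Machine A (with CUDA) profiles an RPC to machine B (without CUDA).
mode, warning = resolve_profiler_mode("CUDA", callee_has_cuda=False)
```

Returning the warning alongside the mode lets the callee log the downgrade instead of raising inside the profiler callback.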
    std::vector<char>& payload,
    const rpc::Message& message);

    TORCH_API void populateRemoteProfiledEvents(
let's add some comments to describe this func
Added these in the update.
    INIT_METHOD_TEMPLATE = "file://{file_name}"


one redundant new line?
    def slow_add(x, y):
        time.sleep(1)
        return torch.add(x, y)
do we need to add a test with CUDA ops as well?
Yes, that's a good point.
torch/csrc/autograd/profiler.h
Outdated
    bool cleanupTLSState = true;
    bool consolidate = true;
Can you add some comments about what each of these options means?
Added in the update.
    auto event_lists = torch::autograd::profiler::disableProfiler(
        shouldCleanUpTLSStates, true);
It looks like we're doing some cleanup as part of the destructor of torch::autograd::profiler::TLSProfilerGuard g and then there is some additional cleanup here. Could you explain what is the difference between the two?
The destructor of TLSProfilerGuard cleans up the profiler thread-local states in the main thread, but does not call consolidate(), which clears out the event lists. In the continuation thread, we don't clean up thread-local states (those are restored by the destructor of ThreadLocalStateGuard in at::wrapPropagateTLSState), but we do call consolidate() to get all of the events, even the async ones (which would not have been logged if we had called consolidate() on the main thread).
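To illustrate why consolidation is left to the continuation, here is a toy model with per-thread event lists. `ToyProfilerState`, `pending_count`, and `consolidate` are hypothetical names for this sketch and only loosely mirror the C++ profiler's behavior.

```python
import threading


class ToyProfilerState:
    """Hypothetical profiler state keeping one event list per recording thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lists = {}  # thread id -> list of event names

    def record(self, name):
        with self._lock:
            self._lists.setdefault(threading.get_ident(), []).append(name)

    def pending_count(self):
        with self._lock:
            return sum(len(events) for events in self._lists.values())

    def consolidate(self):
        # Gather every thread's events and clear the lists, like consolidate().
        with self._lock:
            merged = [list(events) for events in self._lists.values()]
            self._lists.clear()
            return merged


state = ToyProfilerState()
state.record("rpc_async#slow_async_add")  # recorded on the request thread

# What a consolidate() on the request thread would have seen at this point.
early_count = state.pending_count()


def continuation():
    state.record("aten::add")  # async work recorded on another thread


t = threading.Thread(target=continuation)
t.start()
t.join()

# Consolidating after the async work sees both threads' events.
event_lists = state.consolidate()
all_names = sorted(name for events in event_lists for name in events)
```

Consolidating early would have returned one event and cleared the lists, so the later `aten::add` would never reach the caller.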
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* **#44646 Remove thread_local RecordFunctionGuard from profiler.**

Per a discussion with @ilia-cher, this is not needed anymore and removing it would make some future changes to support async RPC profiling easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.

Differential Revision: [D23683998](https://our.internmc.facebook.com/intern/diff/D23683998/)

[ghstack-poisoned]
…another thread. "
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* **#44653 [RPC profiling] Allow disableProfiler() to be called from another thread.**
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Per an offline discussion with @ilia-cher, this changes the profiler so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.

This is done by introducing 2 flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.

Differential Revision: [D23638499](https://our.internmc.facebook.com/intern/diff/D23638499/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23638499/)!

[ghstack-poisoned]
server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* **#44655 [RPC profiling] Don't wrap toHere() calls with profiling**
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Since `toHere()` does not execute operations (torch operators) over RPC and simply transfers the value to the local node, we don't need to enable the profiler remotely for this message. Doing so only adds unnecessary overhead. Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).

Differential Revision: [D23641466](https://our.internmc.facebook.com/intern/diff/D23641466/)

[ghstack-poisoned]
… single threaded server"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

This ensures that RPC profiling works in single-threaded server scenarios and that we don't assume multiple threads are available when working on this code. For example, this assumption resulted in a bug in the previous diff (which was fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
…tion execution over RPC."

server
* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC, as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this one in the stack enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call, as was done previously). By the time that future completes, the async function has been kicked off and its own future has completed, so we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a future immediately; the add itself runs on worker `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
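The deferral described above can be sketched in plain Python with `concurrent.futures`. This is only an illustration of the idea, not the server code: `handle_request`, the string markers in `events`, and the local `slow_async_add` stand-in are all hypothetical, and no real profiler or RPC layer is involved.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def handle_request(async_fn, events):
    """Start async_fn and 'disable profiling' only once its future is done."""
    events.append("profiler_enabled")
    result_future = Future()  # what the hypothetical server hands back to the caller

    inner = async_fn(events)  # returns immediately with a future

    def on_done(fut):
        # May run on a different thread than the one that enabled profiling.
        events.append("profiler_disabled")
        exc = fut.exception()
        if exc is not None:
            result_future.set_exception(exc)
        else:
            result_future.set_result(fut.result())

    inner.add_done_callback(on_done)
    return result_future


def run_demo():
    events = []
    with ThreadPoolExecutor(max_workers=1) as pool:

        def slow_async_add(ev):
            def work():
                time.sleep(0.05)  # stands in for the remote work
                ev.append("add_executed")
                return 1 + 2

            return pool.submit(work)

        value = handle_request(slow_async_add, events).result(timeout=5)
    return value, events
```

Because the "disable" step runs in the completion callback rather than right after the function returns, the work done after the handler returned is still inside the profiled window.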
…ion over RPC.

Pull Request resolved: #44664

Closes #39971.

ghstack-source-id: 112403438

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)
…filing"

server
* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when enabling the profiler on the server, if the server is a different machine it may not have CUDA while the caller does. Previously we would crash in this case; now we fall back to CPU and log a warning.

For testing, I forced it to return the CUDA profiler state and validated that it falls back. Not sure how to add a unit test, given that we have single-machine tests and the machine either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
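The fallback described above amounts to a small decision rule, sketched here in Python. The `ProfilerState` enum and `resolve_server_state` function are hypothetical names chosen for illustration; the actual change lives in the C++ RPC code and may look quite different.

```python
import warnings
from enum import Enum


class ProfilerState(Enum):
    # Hypothetical mirror of the states discussed in this PR.
    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_server_state(requested, cuda_available):
    """Pick the state to enable on the server for the caller's requested state."""
    if requested is ProfilerState.CUDA and not cuda_available:
        # The caller profiled with CUDA but this node cannot: degrade, don't crash.
        warnings.warn(
            "CUDA profiling was requested but CUDA is not available on this node; "
            "falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested
```

Taking availability as a parameter is one way both branches could be exercised on a single machine; whether that fits the real code path is not something this sketch establishes.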
Test failure existed before this diff and should be resolved by #45162.
torch/csrc/autograd/profiler.h
Outdated
// });
// Code to profile
// }

unintentional new line? the comment above is for this struct.
)
try:
    return self._rpc_backend_options
except AttributeError:
when will this error be triggered?
This error actually happens in the default flow, when we do not run under the single-threaded PG agent. When running in single-threaded mode, we set `_rpc_backend_options` (see the setter method below); when we don't, we call into `construct_rpc_backend_options` below, as before.
Not sure if there's a better way to write this, but something like the following would also work if we don't want to use try/except:
if hasattr(self, '_single_threaded_options'):
    return self._single_threaded_options  # means we want to run in single-threaded mode
else:
    return construct_rpc_backend_options(...)
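For readers following the thread, here is a minimal self-contained sketch of the property-plus-setter pattern being discussed. The `Fixture` class, `_construct_default_options`, and the dictionary values are hypothetical placeholders, not the actual test fixture code.

```python
class Fixture:
    """Hypothetical test fixture; attribute names mirror the discussion above."""

    def _construct_default_options(self):
        # Placeholder for the default construction path.
        return {"mode": "default"}

    @property
    def rpc_backend_options(self):
        try:
            # Only set when a test opted into single-threaded mode via the setter.
            return self._rpc_backend_options
        except AttributeError:
            # Default flow: nothing was set, so build the options as before.
            return self._construct_default_options()

    @rpc_backend_options.setter
    def rpc_backend_options(self, options):
        self._rpc_backend_options = options
```

With this shape, `Fixture().rpc_backend_options` takes the default path until something assigns to the property, after which the assigned value is returned. The `hasattr` variant behaves the same way and differs only in style.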
…tion execution over RPC." server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. 
server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. server * **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.** * #44655 [RPC profiling] Don't wrap toHere() calls with profiling * #44653 [RPC profiling] Allow disableProfiler() to be called from another thread. * #44646 Remove thread_local RecordFunctionGuard from profiler. Closes #39971. This PR adds support for functions decorated with `@rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run. To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node. 
For example, if the following async function is ran on a server over RPC: ``` def slow_add(x, y): time.sleep(1) return torch.add(x, y) @rpc.functions.async_execution def slow_async_add(to, x, y): return rpc.rpc_async(to, slow_add, args=(x, y)) ``` we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output: ``` ------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- -------- ------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls Node ID ------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- -------- ------- --------------- --------------- --------------- rpc_async#slow_async_add(worker1 -> worker2) 0.00% 0.000us 0 1.012s 1.012s 1 1 aten::empty 7.02% 11.519us 7.02% 11.519us 11.519us 1 1 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3) 0.00% 0.000us 0 1.006s 1.006s 1 2 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty 7.21% 11.843us 7.21% 11.843us 11.843us 1 2 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add 71.94% 118.107us 85.77% 140.802us 140.802us 1 3 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty 13.82% 22.695us 13.82% 22.695us 22.695us 1 3 ------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- -------- ------- --------------- --------------- --------------- Self CPU time total: 164.164us ``` This 
PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code. Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/) [ghstack-poisoned]
…pport async function execution over RPC."

* **#44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.**
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as builtins, JIT functions, and blocking Python UDFs already can be. The goal is complete RPC profiling coverage across the kinds of functions users can run.

To enable this, the PR below this one in the stack makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only until the blocking `processRPC` call returns, as was done previously). By the time that future completes, the async function has been kicked off and the future it returned has completed, so we capture any RPCs the function issued and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc


def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)


@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a future; the nested RPC does the actual work on `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves much of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/)

[ghstack-poisoned]
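The deferred-disable idea described above can be sketched in plain Python. This is only an illustrative model under stated assumptions, not the PyTorch implementation: `ToyProfiler` and `handle_request` are hypothetical names, and a `concurrent.futures` thread pool stands in for the RPC agent. It shows why disabling the profiler when the handler returns misses work that finishes later, while disabling it from the completion callback of the future (possibly on another thread) captures it.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyProfiler:
    """Hypothetical stand-in for a server-side profiler (not a PyTorch API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = False
        self._events = []

    def enable(self):
        with self._lock:
            self._enabled = True

    def record(self, name):
        # Events are only kept while the profiler is enabled.
        with self._lock:
            if self._enabled:
                self._events.append(name)

    def disable(self):
        # Guarded by a lock so it can be called from any thread.
        with self._lock:
            self._enabled = False
            return list(self._events)


def handle_request(profiler, pool, defer_disable):
    """Model of a server handling one profiled request that spawns async work."""
    profiler.enable()
    profiler.record("process_request")
    gate = threading.Event()

    def nested_work():
        gate.wait()  # runs only after the handler has made its disable decision
        profiler.record("nested_work")

    inner = pool.submit(nested_work)
    done = Future()
    if defer_disable:
        # Disable once the async work completes; runs on the worker thread.
        inner.add_done_callback(lambda _: done.set_result(profiler.disable()))
    else:
        # Disable as soon as the blocking part of the handler is finished.
        done.set_result(profiler.disable())
    gate.set()
    return done
```

Calling `handle_request(ToyProfiler(), pool, defer_disable=False).result()` yields only the `process_request` event, whereas `defer_disable=True` also yields `nested_work`. The sketch works with a single-worker pool as well, since nothing in it blocks a worker waiting on another worker.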
… single threaded server"

* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

This ensures that RPC profiling works in single-threaded server scenarios and that we don't assume multiple threads are available when working on this code. For example, this assumption caused a bug in the previous diff (since fixed).

Differential Revision: [D23691304](https://our.internmc.facebook.com/intern/diff/D23691304/)

[ghstack-poisoned]
…filing"

* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that does not have CUDA even though the caller does. Previously we would crash in this case; now we fall back to CPU profiling and log a warning. For testing, I forced it to return the CUDA profiler state and validated that it falls back. Not sure how to add a unit test, given that we have single-machine tests and the machine either has or doesn't have CUDA.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/)

[ghstack-poisoned]
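The fallback described above amounts to a small decision rule, sketched here in Python for illustration. The names `ProfilerState` and `resolve_server_profiler_state` are hypothetical and do not come from the PR (the real change lives in the C++ RPC code); CUDA availability is passed in as a plain boolean so the rule can be exercised on any machine, which is one possible way around the single-machine testing concern.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class ProfilerState(Enum):
    """Hypothetical mirror of the profiler states discussed in this PR."""

    DISABLED = 0
    CPU = 1
    CUDA = 2


def resolve_server_profiler_state(requested, cuda_available):
    """Pick the state the server should profile with.

    If the caller requested CUDA profiling but this machine has no CUDA,
    fall back to CPU profiling and log a warning instead of failing.
    """
    if requested is ProfilerState.CUDA and not cuda_available:
        logger.warning(
            "Caller requested CUDA profiling, but CUDA is not available "
            "on this node; falling back to CPU profiling."
        )
        return ProfilerState.CPU
    return requested
```

In real code the boolean would presumably come from something like `torch.cuda.is_available()`; injecting it as a parameter keeps the rule testable without a GPU.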
…filing"

A comment from @mrshenli on #44664 raised the following concern: when the profiler is enabled on the server, the server may be a different machine that does not have CUDA even though the caller does. Previously this would crash; now we fall back to CPU profiling and log a warning.

For testing, I forced it to return the CUDA profiler state and validated that it falls back. I'm not sure how to add a unit test, given that our tests run on a single machine and that machine either has CUDA or doesn't.

Differential Revision: [D23790729](https://our.internmc.facebook.com/intern/diff/D23790729/) [ghstack-poisoned]
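The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not the server-side code from the PR: `resolve_profiler_state` and the `"CPU"`/`"CUDA"` string states are hypothetical names, and CUDA availability is passed in as a plain argument so the sketch runs on any machine.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_profiler_state(requested: str, cuda_available: bool) -> str:
    """Pick the profiler state a server node should actually use.

    `requested` is the state the caller asked for ("CPU" or "CUDA").
    Hypothetical helper for illustration; not part of the PyTorch API.
    """
    if requested == "CUDA" and not cuda_available:
        # The caller has CUDA but this node does not: degrade instead of crashing.
        logger.warning(
            "CUDA profiling was requested but CUDA is not available on this "
            "node; falling back to CPU profiling."
        )
        return "CPU"
    return requested
```

In a real process, `cuda_available` would presumably come from something like `torch.cuda.is_available()`. Taking it as a parameter is also one way to exercise both branches on a single test machine.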
…tion execution over RPC."

Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as builtins, jit functions, and blocking python UDFs currently can be. The reasoning is to provide complete feature support for RPC profiling across the various types of functions users can run.

To enable this, the PR below this one enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call, as was done previously). By the time that future completes, the async function has been kicked off and its own future has completed, so we are able to capture any RPCs the function called and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    # Simulate a slow operation so the RPC shows up clearly in the profile.
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    # Returns a future; the nested RPC does the actual work on `to`.
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                           0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                            7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                         0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                    7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add    71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: [D23638387](https://our.internmc.facebook.com/intern/diff/D23638387/) [ghstack-poisoned]
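The key idea, deferring the profiler teardown until the request's future completes rather than when the blocking call returns, can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: `ToyProfiler` and `process_request` are hypothetical stand-ins, not the C++ code in `request_callback` or any PyTorch API, and `concurrent.futures.Future` stands in for the RPC future.

```python
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler (not a PyTorch API)."""

    def __init__(self):
        self.enabled = False
        self.events = []

    def enable(self):
        self.enabled = True

    def record(self, name):
        # Events are only captured while the profiler is enabled.
        if self.enabled:
            self.events.append(name)

    def disable(self):
        self.enabled = False
        return list(self.events)


def process_request(profiler, run_async_fn):
    """Start an async function and disable the profiler only when it finishes."""
    profiler.enable()
    profiler.record("request_received")
    work = run_async_fn()  # returns a Future, like an async_execution function
    response = Future()

    def on_done(fut):
        # Runs when the async work completes, possibly on another thread.
        profiler.record("async_work_done")
        events = profiler.disable()
        response.set_result((fut.result(), events))

    work.add_done_callback(on_done)
    return response


profiler = ToyProfiler()
pending = Future()
response = process_request(profiler, lambda: pending)

# The blocking part has returned, but the profiler is still enabled,
# so work triggered by the async function is still captured.
profiler.record("nested_rpc")
pending.set_result(3)
value, events = response.result()
print(value, events)
```

Had the profiler been disabled as soon as `process_request` returned, the `nested_rpc` event would have been lost, which is the gap this PR closes for async functions.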
This pull request has been merged in 27ab9bc.
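Stack from ghstack:

* #44664 [RPC profiling] Extend RPC profiling to support async function execution over RPC.
* #44655 [RPC profiling] Don't wrap toHere() calls with profiling
* #44653 [RPC profiling] Allow disableProfiler() to be called from another thread.
* #44646 Remove thread_local RecordFunctionGuard from profiler.
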
Closes #39971. This PR adds support for profiling functions decorated with `@rpc.functions.async_execution` over RPC, as builtins, JIT functions, and blocking Python UDFs already can be. The goal is complete RPC profiling coverage across the types of functions users can run.

To enable this, the PR below this one enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only until the blocking `processRPC` call returns, as was done previously). By the time that future completes, the async function has been kicked off and the future it returned has also completed, so we are able to capture any RPCs the function issued and the actual work done on the other node.

For example, suppose an async function `slow_async_add`, which returns `rpc.rpc_async(to, slow_add, args=(x, y))`, is run on a server over RPC, where `slow_add` sleeps for one second and returns `torch.add(x, y)`. We expect to see the original RPC profiled, the nested RPC profiled, and the actual `torch.add()` work. All of these events should be recorded with the correct node id. In the example profiling output, `rpc_async#slow_async_add(worker1 -> worker2)` is recorded with node id 1, the nested `rpc_async#slow_add(worker2 -> worker3)` with node id 2, and the remote `aten::add` with node id 3.

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter the `request_callback` code.

Differential Revision: D23638387
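
To illustrate why the profiler is disabled only when the request's future completes, here is a minimal sketch using the Python standard library only. It is not the PR's C++ implementation: `ToyProfiler`, `process_request`, `run`, and the event names are hypothetical, and a `threading.Event` stands in for the nested RPC finishing after the blocking part of the handler has returned.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyProfiler:
    """Hypothetical stand-in for a profiler (not the torch profiler)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = False
        self._events = []

    def enable(self):
        with self._lock:
            self._enabled = True

    def record(self, name):
        with self._lock:
            if self._enabled:
                self._events.append(name)

    def disable(self):
        # The lock makes this safe to call from a thread other than the
        # one that called enable().
        with self._lock:
            self._enabled = False
            return list(self._events)


def process_request(profiler, pool, release, defer_disable):
    """Hypothetical handler: starts async work and returns a future."""
    profiler.enable()
    profiler.record("request_received")

    def nested_work():
        release.wait()  # completes only after the handler has returned
        profiler.record("nested_work")
        return 3

    inner = pool.submit(nested_work)
    response = Future()

    if defer_disable:
        # Deferred: disable from the callback, once the inner future is done.
        def on_done(fut):
            response.set_result((fut.result(), profiler.disable()))

        inner.add_done_callback(on_done)
    else:
        # Eager: disable as soon as the blocking part of the handler ends.
        events = profiler.disable()
        inner.add_done_callback(
            lambda fut: response.set_result((fut.result(), events))
        )
    return response


def run(defer_disable):
    profiler = ToyProfiler()
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        response = process_request(profiler, pool, release, defer_disable)
        release.set()  # let the nested work finish after the handler returned
        return response.result(timeout=5)
```

With these definitions, `run(defer_disable=False)` returns only the `request_received` event, while `run(defer_disable=True)` also captures `nested_work`, because the profiler is still enabled when the asynchronous work runs.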