[SymmMem] op to get remote tensors#167779
[SymmMem] op to get remote tensors#167779kwen2501 wants to merge 1 commit intogh/kwen2501/285/basefrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167779
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (7 Unrelated Failures)As of commit 28333cb with merge base 7d7ae4d ( FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
The label |
|
The label |
|
The label |
|
The label |
|
The label |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
To support use case in pytorch/helion#1122, i.e. ``` @helion.kernel def foo( x: Tensor, group_name: str ): x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name) for t in x_remotes: ... ```` Helion uses fake tensor to trace a program, thus we cannot use the following code in a Helion function: ``` hdl = rendezvous(tensor) remote_tensors = tuple( hdl.get_remote_tensor(peer, ...) for peer in range(world_size) ) ``` The reason is that when `tensor` is fake, the returned `hdl` is None, thus any subsequent call on it will fail. This PR wraps the above functionality as an op: ``` lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]") ``` so that things like `hdl` is not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous. Pull Request resolved: pytorch#167779 Approved by: https://github.com/yf225
Stack from ghstack (oldest at bottom):
To support use case in pytorch/helion#1122, i.e.
Helion uses fake tensor to trace a program, thus we cannot use the following code in a Helion function:
The reason is that when
tensoris fake, the returnedhdlis None, thus any subsequent call on it will fail.This PR wraps the above functionality as an op:
so that things like
hdlis not exposed to Helion. The op also provides ametaimplementation so that Helion can trace it without actually running the rendezvous.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci