Virtualize <type>Storage classes
#66970
---

Note that `TypedStorage` and `UntypedStorage` are not documented yet. I will post a PR soon to add that.
---

I've added documentation updates to this PR.
---

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
---

A lot of internal tests are failing. There are also other failures where the process doesn't exit with exit code 0, but I haven't tracked down what's causing them.
---

@ngimel, maybe we could consider making `_new_shared` dtype-aware when it is called on an instance, since calling it through the class loses the instance's dtype:

```python
s = torch.IntStorage()
s._new_shared(size)        # This returns a correct TypedStorage with dtype=torch.int
type(s)._new_shared(size)  # This would return an incorrect TypedStorage with the default dtype=torch.float
```
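The dtype problem above can be shown with a minimal stdlib-only sketch; `ToyTypedStorage` and both `_new_shared` variants here are hypothetical stand-ins for illustration, not the real torch API:

```python
class ToyTypedStorage:
    """Hypothetical stand-in for TypedStorage (not the real torch API)."""
    def __init__(self, dtype='float'):
        self.dtype = dtype

    @classmethod
    def _new_shared_from_class(cls, size):
        # Called on the class: no instance is available, so the
        # dtype must fall back to a default
        return cls()

    def _new_shared(self, size):
        # Called on an instance: the instance's dtype can be propagated
        return type(self)(dtype=self.dtype)

s = ToyTypedStorage(dtype='int')
kept = s._new_shared(10)                   # dtype preserved: 'int'
lost = type(s)._new_shared_from_class(10)  # dtype lost: default 'float'
```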
---

Based on discussion in the linked issue, I regret to inform you that you will have to update the RPC pickler/unpickler logic as well. You can probably jump straight to the untyped storage format, as there is no BC/FC need for the on-the-wire format.
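As a hedged sketch of what "jumping straight to the untyped format" could look like on the wire, here is a stdlib-only analogue using `array` in place of storages: the typed object is reduced to raw bytes plus a type tag, and rebuilt from them on the receiving side. All names here are illustrative, not the actual RPC pickler code:

```python
import array
import copyreg
import pickle

def rebuild_typed(typecode, raw_bytes):
    # Receiving side: reconstruct the typed buffer from untyped bytes
    a = array.array(typecode)
    a.frombytes(raw_bytes)
    return a

def reduce_typed(a):
    # Sending side: ship untyped bytes plus a type tag
    return rebuild_typed, (a.typecode, a.tobytes())

# Register the reduction globally (a sketch; a real pickler would use
# its own per-instance dispatch_table instead)
copyreg.pickle(array.array, reduce_typed)

wire = pickle.dumps(array.array('i', [1, 2, 3]))
restored = pickle.loads(wire)
```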
---

@ezyang, I'm not exactly sure how to reproduce the RPC error you posted. I tried a few different things, and I only get an error for the last one, sending an `_UntypedStorage`. Here are my scripts:

worker0.py:

```python
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc('worker0', rank=0, world_size=2)

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    return a.storage()

def passthrough(*args):
    return args

ret = rpc.rpc_sync(
    'worker1',
    get_storage,
    args=(torch.arange(10),))
print(f'0: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    get_untyped_storage,
    args=(torch.arange(10),))
print(f'1: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(torch.arange(10),))
print(f'2: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(torch.arange(10).storage(),))
print(f'3: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(torch._UntypedStorage(10),))
print(f'4: {ret}')

rpc.shutdown()
```

worker1.py:

```python
import torch.distributed.rpc as rpc

def passthrough(*args):
    return args

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    return a.storage()

rpc.init_rpc('worker1', rank=1, world_size=2)
rpc.shutdown()
```

We should fix this error, but I assume the error you posted is more urgent, since it's introduced by this PR. Can you help me reproduce it?

EDIT: Whoops, there was a typo in my script.
---

Sorry about sending you on a goose chase; it looks like the internal test is defining yet another pickler.
---

This is their pickler implementation, and in their test they set it up as the RPC pickler.
---

@ezyang, thanks for the extra info. I'm still having trouble catching the goose though. I have the following files:

utils.py:

```python
import copyreg

import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler


def passthrough(*args):
    return args


class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor
```

worker0.py:

```python
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler

from utils import ShareMemoryRPCPickler, passthrough

with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc(
        'worker0',
        rank=0,
        world_size=2,
    )
    rref = rpc.remote(
        'worker1',
        passthrough,
        args=(torch.arange(10),)
    )
    print(rref.to_here())
```

worker1.py (NOTE: whether I use `ShareMemoryRPCPickler` or not in this script, I get the same behavior, so I guess it probably only needs to be used for the `rpc.remote()` call in worker0.py):

```python
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler

from utils import ShareMemoryRPCPickler, passthrough

# Not sure if this `_use_rpc_pickler` call is actually doing anything
# in this script
with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc('worker1', rank=1, world_size=2)
    rpc.shutdown()
```

When I run these, I get an error. I get this same error whether I try to send a tensor, typed storage, or untyped storage. Also, I get the same error on master. I am able to send a native Python serializable type. I also tried setting the world size to 1 and sending to the same worker; it works if I send an untyped storage. And again, the behavior is basically the same on master. It seems like these are probably genuine issues, but again, they don't seem to be introduced by this PR, and I still haven't been able to reproduce the one you posted. Can you see anything obvious that I'm doing wrong? Or any extra details that might help?
---

`digest received was wrong` sounds like there's a problem in the overall multiprocessing setup (but I'm not terribly familiar with this API, so I'm not sure how to fix it).
---

It looks like there was a missing ingredient: without it, indeed the test fails. Here is now a complete test, and I also fixed your auth problem (you have to spawn the processes from the same parent process or auth will fail). Running it, I get this on master, and I'm guessing it fails on this PR.
---

Confirmed it fails on the PR and not on master.
---

Awesome, thanks @ezyang! I'll get started fixing it.
---

It's possible I need to patch the FB code; it just wasn't obvious to me who was in the wrong. It's probably the RPC pickler's fault though, haha.
---

There's definitely a problem with the multiprocessing pickler when the sharing strategy is "file_system".
---

This has been in the works for a while; what's a little more time :o)
---

Ok, I think I've fixed the issue; at least the test above passes now. The other error is still present.
---

I filed a tracking issue for it at #74016. We don't have to fix it here.
---

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
---

Bruh, internal tests are passing. Let's GOOOO
---

Hey @kurtamohler.
Fixes #66228
cc @ezyang @bhosmer @smessmer @ljk53 @bdhirsh