
Conversation

@kurtamohler
Collaborator

@kurtamohler kurtamohler commented Oct 20, 2021

@pytorch-probot

pytorch-probot bot commented Oct 20, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/kurtamohler/pytorch/blob/8857e4f2333b6ea13b4bf192a9530f175c51a1f4/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflow | Labels | Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Oct 20, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 938229e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@kurtamohler kurtamohler added the 'module: internals' (Related to internal abstractions in c10 and ATen) label Oct 20, 2021
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from 3cedc1b to abc4253 Compare October 21, 2021 23:14
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch 3 times, most recently from b902133 to 160718f Compare November 2, 2021 21:09
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch 6 times, most recently from df89096 to 5bcae84 Compare November 8, 2021 20:03
@kurtamohler kurtamohler marked this pull request as ready for review November 9, 2021 02:49
@kurtamohler kurtamohler requested a review from ngimel November 9, 2021 02:49
@kurtamohler
Collaborator Author

Note that TypedStorage and UntypedStorage are not documented yet. I will post a PR soon to add that documentation.

@zou3519 zou3519 added the 'triaged' (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Nov 9, 2021
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from 5bcae84 to d17dcf8 Compare November 24, 2021 22:26
@kurtamohler
Collaborator Author

I've added documentation updates to this PR

@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from d17dcf8 to e7c4cbf Compare November 24, 2021 22:33
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from e7c4cbf to 8857e4f Compare November 25, 2021 18:34
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Dec 21, 2021

A lot of internal tests are failing with

TypeError: _new_shared() missing 1 required positional argument: 'size'

seems like it's caused by

in allocate_shared_tensor
    storage = tensor.storage_type()._new_shared(size.numel())

There are also other failures where the process doesn't exit with exit code 0, but I haven't tracked down what's causing them.

@kurtamohler
Collaborator Author

kurtamohler commented Dec 21, 2021

@ngimel, the _new_shared error comes from the fact that TypedStorage._new_shared can currently only be used as an instance method, because it needs to access the dtype and device of the instantiated TypedStorage object. For the legacy storage types, <type>Storage._new_shared is a classmethod; the dtype and device are constant for a given legacy storage class.

Maybe we could consider making TypedStorage._new_shared work as both a class method and an instance method (although I'm not sure how to do that at the moment). If it's called as a class method, we could use the default dtype and device. However, I'm not sure if this is a good idea. I have a feeling that it would be better to avoid calling _new_shared as a classmethod entirely. For instance, consider this:

s = torch.IntStorage()

s._new_shared(size)             # This returns a correct TypedStorage with dtype=torch.int
type(s)._new_shared(size)       # This would return an incorrect TypedStorage with the default dtype=torch.float

Before this PR, the type(s)._new_shared(size) call would have returned the proper <type>Storage object, with matching dtype and device. But if we decided to allow TypedStorage._new_shared to be called as a classmethod, we would have no way to know what the correct dtype and device were.
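
For reference, here is a rough sketch (purely illustrative, not part of this PR) of the kind of descriptor that would let a method be called on either the class or an instance. The class name and the defaulting behavior are assumptions, and it demonstrates exactly the ambiguity above: when called on the class, the dtype and device have to come from defaults.

# Purely a sketch; class names and defaults here are hypothetical.
import functools
import torch

class class_or_instance_method:
    """Descriptor that passes either the class or the instance as the
    first argument, depending on how the method was looked up."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        owner = obj if obj is not None else objtype
        return functools.partial(self.func, owner)

class TypedStorageSketch:
    def __init__(self, dtype=None, device=None):
        self.dtype = dtype if dtype is not None else torch.get_default_dtype()
        self.device = torch.device(device) if device is not None else torch.device('cpu')

    @class_or_instance_method
    def _new_shared(cls_or_self, size):
        if isinstance(cls_or_self, TypedStorageSketch):
            # Instance call: dtype and device are known.
            dtype, device = cls_or_self.dtype, cls_or_self.device
        else:
            # Class call: no instance to consult, so fall back to defaults.
            # This is the loss of information described above.
            dtype, device = torch.get_default_dtype(), torch.device('cpu')
        return TypedStorageSketch(dtype=dtype, device=device)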

@ezyang
Contributor

ezyang commented Mar 4, 2022

Based on discussion in the linked issue, I regret to inform you that you will have to update the rpc pickler/unpickler logic as well. You can probably jump straight to the untyped storage format as there is no BC/FC need for the on-the-wire format.

@kurtamohler
Collaborator Author

kurtamohler commented Mar 8, 2022

@ezyang, I'm not exactly sure how to reproduce the RPC error you posted.

I tried a few different things:

  • Send a tensor to a worker, the worker sends back the tensor's _TypedStorage
  • Send a tensor, send back _UntypedStorage
  • Send a tensor, send back a tensor
  • Send a _TypedStorage, send it back
  • Send an _UntypedStorage, send it back

I only get an error for the last one, sending an _UntypedStorage back and forth. The error I get for this case is different from the one you posted above; it happens during serialization, so it never even gets to the _rebuild_tensor call during the deserialization step, where your error happens. Furthermore, I get the same result whether I run these cases on this branch or on master.

Here are my scripts:

worker0.py:
import torch
import torch.distributed.rpc as rpc 

rpc.init_rpc('worker0', rank=0, world_size=2)

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    # NOTE: typo acknowledged in the EDIT below; this was meant to return the
    # untyped storage rather than the typed a.storage()
    return a.storage()

def passthrough(*args):
    return args

ret = rpc.rpc_sync(
    'worker1',
    get_storage,
    args=(
        torch.arange(10),
    ))  

print(f'0: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    get_untyped_storage,
    args=(
        torch.arange(10),
    ))  

print(f'1: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch.arange(10),
    ))  

print(f'2: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch.arange(10).storage(),
    ))

print(f'3: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch._UntypedStorage(10),
    ))

print(f'4: {ret}')

rpc.shutdown()
worker1.py:
import torch.distributed.rpc as rpc 


def passthrough(*args):
    return args

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    return a.storage()

rpc.init_rpc('worker1', rank=1, world_size=2)

rpc.shutdown()

And here's what I see when I run them:

Output of worker0.py:
0:  0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10]
1:  0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10]
2: (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),)
3: ( 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10],)
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 51, in <module>
    ret = rpc.rpc_sync(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 766, in rpc_sync
    fut = _invoke_rpc(to, func, RPCExecMode.SYNC, args, kwargs, timeout)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 675, in _invoke_rpc
    (pickled_python_udf, tensors) = _default_pickler.serialize(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 133, in serialize
    p.dump(obj)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/storage.py", line 79, in __reduce__
    torch.save(self, b, _use_new_zipfile_serialization=False)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 382, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 516, in _legacy_save
    pickler.dump(obj)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 429, in persistent_id
    storage_dtype = storage.dtype
AttributeError: 'torch._UntypedStorage' object has no attribute 'dtype'
Output of worker1.py:
[W tensorpipe_agent.cpp:682] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

We should fix this error, but I assume the error you posted is more urgent since it's introduced by this PR. Can you help me reproduce it?

EDIT: Whoops, there was a typo in my script for the tensor --> untyped storage case. After fixing that, I do get an error for that case, but it's also different from the one you posted, and I get the same behavior on master:

Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 27, in <module>
    ret = rpc.rpc_sync(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 767, in rpc_sync
    return fut.wait()
RuntimeError: AttributeError: 'torch._UntypedStorage' object has no attribute 'dtype'

At:
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(429): persistent_id
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(516): _legacy_save
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(382): save
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/storage.py(79): __reduce__
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(133): serialize
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(186): serialize

@ezyang
Contributor

ezyang commented Mar 9, 2022

Sorry about sending you on a goose chase; it looks like the internal test is defining yet another pickler.

@ezyang
Contributor

ezyang commented Mar 9, 2022

This is their pickler implementation:

import copyreg
            
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler


class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

and in their test they set it up as

from torch.distributed.rpc.api import _use_rpc_pickler

...


    with _use_rpc_pickler(ShareMemoryRPCPickler()):
        for idx, worker in enumerate(workers):
            # pyre-fixme[16]: Module `rpc` has no attribute `remote`.
            rref_to_remote_param[worker] = rpc.remote(
                worker,
                mock_worker_function,
                kwargs={
                    "module": module,
                    # !!Important!! Only set parameter in ONE worker.
                    "parameter_val": parameter_val if idx == 0 else None,
                },
            )

@kurtamohler
Collaborator Author

kurtamohler commented Mar 10, 2022

@ezyang, thanks for the extra info. I'm still having trouble catching the goose though.

I have the following files:

utils.py
import copyreg
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler

class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

# The passthrough helper imported by worker0.py and worker1.py below
# (same definition as in the earlier scripts).
def passthrough(*args):
    return args
worker0.py
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
from utils import ShareMemoryRPCPickler, passthrough

with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc(
        'worker0',
        rank=0,
        world_size=2,
    )
    rref = rpc.remote(
        'worker1',
        passthrough,
        args=(
            torch.arange(10),
        )
    )
    print(rref.to_here())
worker1.py (NOTE: whether I use `ShareMemoryRPCPickler` or not in this script, I get the same behavior, so I guess it probably only needs to be used for the `rpc.remote()` call in `worker0.py`)
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
from utils import ShareMemoryRPCPickler, passthrough

# Not sure if this `_use_rpc_pickler` call is actually doing anything
# in this script
with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc('worker1', rank=1, world_size=2)
    rpc.shutdown()

When I run worker0.py and worker1.py at the same time, I still don't get the same error as the one you posted. I get this instead:

Traceback (most recent call last):
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py", line 750, in deliver_challenge
    raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong
[E thread_pool.cpp:113] Exception in thread pool task: Error on Node 1: AuthenticationError: digest sent was rejected

At:
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py(764): answer_challenge
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py(513): Client
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py(86): get_connection
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py(57): detach
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/multiprocessing/reductions.py(295): rebuild_storage_fd
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(169): deserialize
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(190): deserialize

Exception raised from handleException at /work2/kurtamohler/development/pytorch-storage-virtualization/torch/csrc/distributed/rpc/rref_context.cpp:113 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f6e3bb95125 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xdb (0x7f6e3bb74979 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #2: torch::distributed::rpc::RRefContext::handleException(c10::ivalue::Future const&) + 0xd9 (0x7f6e480a89c9 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #3: torch::distributed::rpc::RRef::handleError(torch::distributed::rpc::RPCErrorType, c10::ivalue::Future const&) + 0xb4 (0x7f6e480b3764 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #4: torch::distributed::rpc::callback::confirmPendingUser(c10::ivalue::Future const&, torch::distributed::rpc::GloballyUniqueId const&) + 0x10f (0x7f6e480abcdf in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x977ffa (0x7f6e4c03fffa in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_python.so)
frame #6: void c10::ivalue::Future::invokeCallback<std::function<void (c10::ivalue::Future&)> >(std::function<void (c10::ivalue::Future&)>) + 0x192 (0x7f6e4bd03a92 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xf506b3 (0x7f6e4586c6b3 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x37a7728 (0x7f6e480c3728 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x285 (0x7f6e3bb86215 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #10: <unknown function> + 0xcc9d4 (0x7f6e57c1b9d4 in /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/libstdc++.so.6)
frame #11: <unknown function> + 0x76db (0x7f6e777e56db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f6e76b6161f in /lib/x86_64-linux-gnu/libc.so.6)

I get this same error whether I try to send a tensor, a typed storage, or an untyped storage. I also get the same error on master. I am able to send a natively serializable Python type, like a list, without error.

I also tried setting the world size to 1, sending to worker0 instead, and running only the worker0.py script. I get this error if I try to send either a tensor or a typed storage:

On WorkerInfo(id=0, name=worker0):
AttributeError("type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize\n            function code. Ensure that UDFs are defined on both caller and\n            callee modules.")
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 203, in _run_function
    raise python_udf
AttributeError: type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.

Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 19, in <module>
    print(rref.to_here())
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 218, in _handle_exception
    raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
AttributeError: On WorkerInfo(id=0, name=worker0):
AttributeError("type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.")
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 203, in _run_function
    raise python_udf
AttributeError: type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.

But it works if I send an untyped storage. And again, the behavior is basically the same on master.

It seems like these are probably genuine issues, but again, they don't seem to be introduced by this PR, and I still haven't been able to reproduce the one you posted.

Can you see anything obvious that I'm doing wrong? Or any extra details that might help?

@ezyang
Contributor

ezyang commented Mar 10, 2022

"digest received was wrong" sounds like there's a problem in the overall multiprocessing setup (but I'm not terribly familiar with this API, so I'm not sure how to fix it)

@ezyang
Contributor

ezyang commented Mar 10, 2022

It looks like the missing ingredient was:

torch.multiprocessing.set_sharing_strategy("file_system")

Without it, the test indeed fails. Here is now a complete test, and I also fixed your auth problem (you have to spawn the processes from the same parent process or auth will fail):

import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
import torch.multiprocessing as mp
import copyreg
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler
from torch.nn import Linear
from torch import multiprocessing, nn


multiprocessing.set_sharing_strategy("file_system")

class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

def worker_loop(a):
    rpc.init_rpc('worker1', rank=1, world_size=2)
    rpc.shutdown()

def _flatten_parameter(parameters):
    r"""
    Flatten the module parameters into a 1D tensor 
    """
    return torch.cat([p.reshape(-1) for p in parameters])

def worker_fn(m):
    return _flatten_parameter(m.parameters())

if __name__ == '__main__':
    r = mp.spawn(
        worker_loop,
        join=False
        )

    try:
        with _use_rpc_pickler(ShareMemoryRPCPickler()):
            rpc.init_rpc(
                'worker0',
                rank=0,
                world_size=2,
            )
            m = Linear(1, 2)
            m.share_memory()
            rref = rpc.remote(
                'worker1',
                worker_fn,
                args=(
                    m,
                )
            )
            print(rref.to_here())
    finally:
        rpc.shutdown()
        r.join()

run as

MASTER_ADDR=localhost MASTER_PORT=29500 python worker0.py

I get this on master

(/home/ezyang/local/pytorch-tmp-env) [ezyang@devvm066.ash0 ~/local/labs] MASTER_ADDR=localhost MASTER_PORT=29500 python worker0.py  
tensor([-0.6383,  0.2935,  0.6883, -0.9899], requires_grad=True)
Exception ignored in: <function StorageWeakRef.__del__ at 0x7f15538914c0>
Traceback (most recent call last):
  File "/data/users/ezyang/pytorch-tmp/torch/multiprocessing/reductions.py", line 36, in __del__
  File "/data/users/ezyang/pytorch-tmp/torch/storage.py", line 520, in _free_weak_ref
AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
Exception ignored in: <function StorageWeakRef.__del__ at 0x7f15538914c0>
Traceback (most recent call last):
  File "/data/users/ezyang/pytorch-tmp/torch/multiprocessing/reductions.py", line 36, in __del__
  File "/data/users/ezyang/pytorch-tmp/torch/storage.py", line 520, in _free_weak_ref
AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
(/home/ezyang/local/pytorch-tmp-env) [ezyang@devvm066.ash0 ~/local/labs] echo $?
0

and I'm guessing it fails on this PR

@ezyang
Contributor

ezyang commented Mar 10, 2022

Confirmed it fails on the PR and not on master.

@kurtamohler
Collaborator Author

Awesome, thanks @ezyang! I'll get started fixing it

@ezyang
Contributor

ezyang commented Mar 10, 2022

It's possible I need to patch the FB code; it just wasn't obvious to me who was in the wrong. It's probably the RPC pickler's fault though, haha

@kurtamohler
Collaborator Author

kurtamohler commented Mar 11, 2022

There's definitely a problem with the multiprocessing pickler when the sharing strategy is "file_system". When a _TypedStorage is serialized, the deserialization step always creates an _UntypedStorage. So when that's passed to the rebuild_tensor function, we get an AttributeError, because that function expects a _TypedStorage and tries to access the dtype attribute. I'll try to fix it today or this weekend, but if I can't get it working by then, just FYI, I'm on PTO next week
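
To illustrate the general shape of a fix (a minimal sketch only, with hypothetical helper names; not necessarily what this PR ends up doing), the typed storage can be reduced to its untyped buffer plus its dtype, and the dtype reattached on rebuild so rebuild_tensor never sees a bare _UntypedStorage:

# Minimal sketch; reduce_typed_storage/rebuild_typed_storage are hypothetical
# names, and the _TypedStorage(wrap_storage=..., dtype=...) constructor and
# _untyped() method are assumed to exist on this branch.
import torch

def rebuild_typed_storage(untyped_storage, dtype):
    # Re-wrap the shared untyped bytes in a typed view with the original dtype.
    return torch.storage._TypedStorage(wrap_storage=untyped_storage, dtype=dtype)

def reduce_typed_storage(storage):
    # Ship only the untyped buffer (which has its own sharing-aware reducer),
    # but remember the dtype so the receiving process can rebuild a typed view.
    return (rebuild_typed_storage, (storage._untyped(), storage.dtype))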

@ezyang
Contributor

ezyang commented Mar 12, 2022

This has been in the works for a while, what's a little more time :o)

@kurtamohler
Collaborator Author

OK, I think I've fixed the issue. At least the AttributeError for storage.dtype within rebuild_tensor doesn't happen anymore.

The other error (AttributeError: 'NoneType' object has no attribute '_free_weak_ref'), which was also showing up on master, is still happening. I don't fully understand what's going on with that. It looks like _TypedStorage._free_weak_ref() is getting called after the torch module is deleted at the end of the script, and I'm not sure why.
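
As an aside on that shutdown-ordering symptom (a generic sketch, not the actual fix; see the tracking issue mentioned below), the usual defensive pattern is to capture the cleanup callable at class-definition time and guard it in __del__, so the hook never has to resolve names through a module that may already have been torn down:

# Generic sketch of a teardown-safe __del__; _release_weak_ref is a
# hypothetical stand-in for the real C-level cleanup call.
def _release_weak_ref(cdata):
    pass  # placeholder

class WeakRefHolder:
    # Captured once here, so __del__ never reads module globals at shutdown.
    _release = staticmethod(_release_weak_ref)

    def __init__(self, cdata):
        self.cdata = cdata

    def __del__(self):
        release = getattr(type(self), "_release", None)
        if release is not None:
            release(self.cdata)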

@ezyang
Contributor

ezyang commented Mar 12, 2022

I filed a tracking issue for it at #74016. We don't have to fix it here.

@facebook-github-bot
Contributor

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Mar 22, 2022

Bruh internal tests are passing. Let's GOOOO

facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Summary:
Fixes #66228

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: #66970

Reviewed By: bdhirsh

Differential Revision: D33245612

Pulled By: ezyang

fbshipit-source-id: 4c61c2cb029e2b94b0e68927c377d3e1c358dd7c
@github-actions
Contributor

Hey @kurtamohler.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

shahofblah pushed a commit that referenced this pull request Mar 25, 2022
Summary:
Fixes #66228

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: #66970

Reviewed By: bdhirsh

Differential Revision: D33245612

Pulled By: ezyang

fbshipit-source-id: 4c61c2cb029e2b94b0e68927c377d3e1c358dd7c
(cherry picked from commit d29fcdf)

Labels

cla signed · module: internals (Related to internal abstractions in c10 and ATen) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Virtualize FloatStorage and other <type>Storage classes

7 participants