
Conversation

@kurtamohler
Collaborator

@kurtamohler kurtamohler commented Oct 20, 2021

@pytorch-probot

pytorch-probot bot commented Oct 20, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/kurtamohler/pytorch/blob/8857e4f2333b6ea13b4bf192a9530f175c51a1f4/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflow | Labels | Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Oct 20, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 938229e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@kurtamohler kurtamohler added the 'module: internals' (Related to internal abstractions in c10 and ATen) label Oct 20, 2021
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from 3cedc1b to abc4253 Compare October 21, 2021 23:14
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch 3 times, most recently from b902133 to 160718f Compare November 2, 2021 21:09
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch 6 times, most recently from df89096 to 5bcae84 Compare November 8, 2021 20:03
@kurtamohler kurtamohler marked this pull request as ready for review November 9, 2021 02:49
@kurtamohler kurtamohler requested a review from ngimel November 9, 2021 02:49
@kurtamohler
Collaborator Author

Note that TypedStorage and UntypedStorage are not documented yet. I will post a PR soon to add that documentation.

@zou3519 zou3519 added the 'triaged' (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Nov 9, 2021
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from 5bcae84 to d17dcf8 Compare November 24, 2021 22:26
@kurtamohler
Collaborator Author

I've added documentation updates to this PR

@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from d17dcf8 to e7c4cbf Compare November 24, 2021 22:33
@kurtamohler kurtamohler force-pushed the storage-virtualize-66228 branch from e7c4cbf to 8857e4f Compare November 25, 2021 18:34
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Dec 21, 2021

A lot of internal tests are failing with

TypeError: _new_shared() missing 1 required positional argument: 'size'

seems like it's caused by

in allocate_shared_tensor
    storage = tensor.storage_type()._new_shared(size.numel())

There are also other failures where the process doesn't exit with exit code 0, but I haven't tracked down what's causing them.

@kurtamohler
Collaborator Author

kurtamohler commented Dec 21, 2021

@ngimel, the _new_shared error comes from the fact that TypedStorage._new_shared can currently only be used as an instance method, because it needs to access the dtype and device of the instantiated TypedStorage object. For the legacy storage types, <type>Storage._new_shared is a classmethod; the dtype and device are constant for a given legacy storage class.

Maybe we could consider making TypedStorage._new_shared work as both a class method and an instance method (although I'm not sure how to do that at the moment). If it's called as a class method, we could use the default dtype and device. However, I'm not sure if this is a good idea. I have a feeling that it would be better to avoid calling _new_shared as a classmethod entirely. For instance, consider this:

s = torch.IntStorage()

s._new_shared(size)             # This returns a correct TypedStorage with dtype=torch.int
type(s)._new_shared(size)       # This would return an incorrect TypedStorage with the default dtype=torch.float

Before this PR, the type(s)._new_shared(size) call would have returned the proper <type>Storage object, with matching dtype and device. But if we decided to allow TypedStorage._new_shared to be called as a classmethod, we would have no way to know what the correct dtype and device were.
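
For reference, here is a rough sketch (purely illustrative, not part of this PR) of the kind of descriptor that would let a method be called on either the class or an instance. The class name and the defaulting behavior are assumptions, and it demonstrates exactly the ambiguity above: when called on the class, the dtype and device have to come from defaults.

# Purely a sketch; class names and defaults here are hypothetical.
import functools
import torch

class class_or_instance_method:
    """Descriptor that passes either the class or the instance as the
    first argument, depending on how the method was looked up."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        owner = obj if obj is not None else objtype
        return functools.partial(self.func, owner)

class TypedStorageSketch:
    def __init__(self, dtype=None, device=None):
        self.dtype = dtype if dtype is not None else torch.get_default_dtype()
        self.device = torch.device(device) if device is not None else torch.device('cpu')

    @class_or_instance_method
    def _new_shared(cls_or_self, size):
        if isinstance(cls_or_self, TypedStorageSketch):
            # Instance call: dtype and device are known.
            dtype, device = cls_or_self.dtype, cls_or_self.device
        else:
            # Class call: no instance to consult, so fall back to defaults.
            # This is the loss of information described above.
            dtype, device = torch.get_default_dtype(), torch.device('cpu')
        return TypedStorageSketch(dtype=dtype, device=device)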

@ezyang
Contributor

ezyang commented Mar 4, 2022

Based on discussion in the linked issue, I regret to inform you that you will have to update the rpc pickler/unpickler logic as well. You can probably jump straight to the untyped storage format as there is no BC/FC need for the on-the-wire format.

@kurtamohler
Collaborator Author

kurtamohler commented Mar 8, 2022

@ezyang, I'm not exactly sure how to reproduce the RPC error you posted.

I tried a few different things:

  • Send a tensor to a worker, the worker sends back the tensor's _TypedStorage
  • Send a tensor, send back _UntypedStorage
  • Send a tensor, send back a tensor
  • Send a _TypedStorage, send it back
  • Send an _UntypedStorage, send it back

I only get an error for the last one, sending an _UntypedStorage back and forth. The error I get for this case is different from the one you posted above; it happens during serialization, so it never even gets to the _rebuild_tensor call during the deserialization step, where your error happens. Furthermore, I get the same result whether I run these cases on this branch or on master.

Here are my scripts:

worker0.py:
import torch
import torch.distributed.rpc as rpc 

rpc.init_rpc('worker0', rank=0, world_size=2)

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    # NOTE: typo acknowledged in the EDIT below; this was meant to return the
    # untyped storage rather than the typed a.storage()
    return a.storage()

def passthrough(*args):
    return args

ret = rpc.rpc_sync(
    'worker1',
    get_storage,
    args=(
        torch.arange(10),
    ))  

print(f'0: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    get_untyped_storage,
    args=(
        torch.arange(10),
    ))  

print(f'1: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch.arange(10),
    ))  

print(f'2: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch.arange(10).storage(),
    ))

print(f'3: {ret}')

ret = rpc.rpc_sync(
    'worker1',
    passthrough,
    args=(
        torch._UntypedStorage(10),
    ))

print(f'4: {ret}')

rpc.shutdown()
worker1.py:
import torch.distributed.rpc as rpc 


def passthrough(*args):
    return args

def get_storage(a):
    return a.storage()

def get_untyped_storage(a):
    return a.storage()

rpc.init_rpc('worker1', rank=1, world_size=2)

rpc.shutdown()

And here's what I see when I run them:

Output of worker0.py:
0:  0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10]
1:  0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10]
2: (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),)
3: ( 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 10],)
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 51, in <module>
    ret = rpc.rpc_sync(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 766, in rpc_sync
    fut = _invoke_rpc(to, func, RPCExecMode.SYNC, args, kwargs, timeout)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 675, in _invoke_rpc
    (pickled_python_udf, tensors) = _default_pickler.serialize(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 133, in serialize
    p.dump(obj)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/storage.py", line 79, in __reduce__
    torch.save(self, b, _use_new_zipfile_serialization=False)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 382, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 516, in _legacy_save
    pickler.dump(obj)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py", line 429, in persistent_id
    storage_dtype = storage.dtype
AttributeError: 'torch._UntypedStorage' object has no attribute 'dtype'
Output of worker1.py:
[W tensorpipe_agent.cpp:682] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

We should fix this error, but I assume the error you posted is more urgent since it's introduced by this PR. Can you help me reproduce it?

EDIT: Whoops, there was a typo in my script for the tensor --> untyped storage case. After fixing that, I do get an error for that case, but it's also different from the one you posted, and I get the same behavior on master:

Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 27, in <module>
    ret = rpc.rpc_sync(
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/api.py", line 767, in rpc_sync
    return fut.wait()
RuntimeError: AttributeError: 'torch._UntypedStorage' object has no attribute 'dtype'

At:
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(429): persistent_id
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(516): _legacy_save
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/serialization.py(382): save
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/storage.py(79): __reduce__
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(133): serialize
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(186): serialize

@ezyang
Contributor

ezyang commented Mar 9, 2022

Sorry about sending you on a goose chase; it looks like the internal test is defining yet another pickler.

@ezyang
Contributor

ezyang commented Mar 9, 2022

This is their pickler implementation:

import copyreg
            
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler


class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

and in their test they set it up as

from torch.distributed.rpc.api import _use_rpc_pickler

...


    with _use_rpc_pickler(ShareMemoryRPCPickler()):
        for idx, worker in enumerate(workers):
            # pyre-fixme[16]: Module `rpc` has no attribute `remote`.
            rref_to_remote_param[worker] = rpc.remote(
                worker,
                mock_worker_function,
                kwargs={
                    "module": module,
                    # !!Important!! Only set parameter in ONE worker.
                    "parameter_val": parameter_val if idx == 0 else None,
                },
            )

@kurtamohler
Collaborator Author

kurtamohler commented Mar 10, 2022

@ezyang, thanks for the extra info. I'm still having trouble catching the goose though.

I have the following files:

utils.py
import copyreg
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler

class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

# The passthrough helper imported by worker0.py and worker1.py below
# (same definition as in the earlier scripts).
def passthrough(*args):
    return args
worker0.py
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
from utils import ShareMemoryRPCPickler, passthrough

with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc(
        'worker0',
        rank=0,
        world_size=2,
    )
    rref = rpc.remote(
        'worker1',
        passthrough,
        args=(
            torch.arange(10),
        )
    )
    print(rref.to_here())
worker1.py (NOTE: whether I use `ShareMemoryRPCPickler` or not in this script, I get the same behavior, so I guess it probably only needs to be used for the `rpc.remote()` call in `worker0.py`)
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
from utils import ShareMemoryRPCPickler, passthrough

# Not sure if this `_use_rpc_pickler` call is actually doing anything
# in this script
with _use_rpc_pickler(ShareMemoryRPCPickler()):
    rpc.init_rpc('worker1', rank=1, world_size=2)
    rpc.shutdown()

When I run worker0.py and worker1.py at the same time, I still don't get the same error as the one you posted. I get this instead:

Traceback (most recent call last):
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py", line 750, in deliver_challenge
    raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong
[E thread_pool.cpp:113] Exception in thread pool task: Error on Node 1: AuthenticationError: digest sent was rejected

At:
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py(764): answer_challenge
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/connection.py(513): Client
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py(86): get_connection
  /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/python3.9/multiprocessing/resource_sharer.py(57): detach
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/multiprocessing/reductions.py(295): rebuild_storage_fd
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(169): deserialize
  /work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py(190): deserialize

Exception raised from handleException at /work2/kurtamohler/development/pytorch-storage-virtualization/torch/csrc/distributed/rpc/rref_context.cpp:113 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f6e3bb95125 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xdb (0x7f6e3bb74979 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #2: torch::distributed::rpc::RRefContext::handleException(c10::ivalue::Future const&) + 0xd9 (0x7f6e480a89c9 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #3: torch::distributed::rpc::RRef::handleError(torch::distributed::rpc::RPCErrorType, c10::ivalue::Future const&) + 0xb4 (0x7f6e480b3764 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #4: torch::distributed::rpc::callback::confirmPendingUser(c10::ivalue::Future const&, torch::distributed::rpc::GloballyUniqueId const&) + 0x10f (0x7f6e480abcdf in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x977ffa (0x7f6e4c03fffa in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_python.so)
frame #6: void c10::ivalue::Future::invokeCallback<std::function<void (c10::ivalue::Future&)> >(std::function<void (c10::ivalue::Future&)>) + 0x192 (0x7f6e4bd03a92 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xf506b3 (0x7f6e4586c6b3 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x37a7728 (0x7f6e480c3728 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libtorch_cpu.so)
frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x285 (0x7f6e3bb86215 in /work2/kurtamohler/development/pytorch-storage-virtualization/torch/lib/libc10.so)
frame #10: <unknown function> + 0xcc9d4 (0x7f6e57c1b9d4 in /home/kurtamohler/.conda/envs/pytorch-storage-virt/lib/libstdc++.so.6)
frame #11: <unknown function> + 0x76db (0x7f6e777e56db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f6e76b6161f in /lib/x86_64-linux-gnu/libc.so.6)

I get this same error whether I try to send a tensor, a typed storage, or an untyped storage. I also get the same error on master. I am able to send a natively serializable Python type, like a list, without error.

I also tried setting the world size to 1, sending to worker0 instead, and running only the worker0.py script. I get this error if I try to send either a tensor or a typed storage:

On WorkerInfo(id=0, name=worker0):
AttributeError("type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize\n            function code. Ensure that UDFs are defined on both caller and\n            callee modules.")
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 203, in _run_function
    raise python_udf
AttributeError: type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.

Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-perf-test-scripts/rpc/worker0.py", line 19, in <module>
    print(rref.to_here())
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 218, in _handle_exception
    raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
AttributeError: On WorkerInfo(id=0, name=worker0):
AttributeError("type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.")
Traceback (most recent call last):
  File "/work2/kurtamohler/development/pytorch-storage-virtualization/torch/distributed/rpc/internal.py", line 203, in _run_function
    raise python_udf
AttributeError: type object 'torch.storage._TypedStorage' has no attribute '_new_shared_fd' Default RPC pickler does not serialize
            function code. Ensure that UDFs are defined on both caller and
            callee modules.

But it works if I send an untyped storage. And again, the behavior is basically the same on master.

It seems like these are probably genuine issues, but again, they don't seem to be introduced by this PR, and I still haven't been able to reproduce the one you posted.

Can you see anything obvious that I'm doing wrong? Or any extra details that might help?

@ezyang
Contributor

ezyang commented Mar 10, 2022

"digest received was wrong" sounds like there's a problem in the overall multiprocessing setup (but I'm not terribly familiar with this API, so I'm not sure how to fix it)

@ezyang
Contributor

ezyang commented Mar 10, 2022

It looks like the missing ingredient was:

torch.multiprocessing.set_sharing_strategy("file_system")

Without it, the test indeed fails. Here is now a complete test, and I also fixed your auth problem (you have to spawn the processes from the same parent process or auth will fail):

import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc.api import _use_rpc_pickler
import torch.multiprocessing as mp
import copyreg
import torch
import torch.multiprocessing.reductions as TorchMpReductions
from torch.distributed.rpc.internal import _InternalRPCPickler
from torch.nn import Linear
from torch import multiprocessing, nn


multiprocessing.set_sharing_strategy("file_system")

class ShareMemoryRPCPickler(_InternalRPCPickler):
    def __init__(self) -> None:
        super().__init__()
        self._dispatch_table
        # pyre-fixme[4]: Attribute must be annotated.
        self._dispatch_table = copyreg.dispatch_table.copy()

        for t in torch._storage_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_storage

        for t in torch._tensor_classes:
            self._dispatch_table[t] = TorchMpReductions.reduce_tensor
        self._dispatch_table[torch.Tensor] = TorchMpReductions.reduce_tensor
        self._dispatch_table[
            torch.nn.parameter.Parameter
        ] = TorchMpReductions.reduce_tensor

def worker_loop(a):
    rpc.init_rpc('worker1', rank=1, world_size=2)
    rpc.shutdown()

def _flatten_parameter(parameters):
    r"""
    Flatten the module parameters into a 1D tensor 
    """
    return torch.cat([p.reshape(-1) for p in parameters])

def worker_fn(m):
    return _flatten_parameter(m.parameters())

if __name__ == '__main__':
    r = mp.spawn(
        worker_loop,
        join=False
        )

    try:
        with _use_rpc_pickler(ShareMemoryRPCPickler()):
            rpc.init_rpc(
                'worker0',
                rank=0,
                world_size=2,
            )
            m = Linear(1, 2)
            m.share_memory()
            rref = rpc.remote(
                'worker1',
                worker_fn,
                args=(
                    m,
                )
            )
            print(rref.to_here())
    finally:
        rpc.shutdown()
        r.join()

run as

MASTER_ADDR=localhost MASTER_PORT=29500 python worker0.py

I get this on master

(/home/ezyang/local/pytorch-tmp-env) [ezyang@devvm066.ash0 ~/local/labs] MASTER_ADDR=localhost MASTER_PORT=29500 python worker0.py  
tensor([-0.6383,  0.2935,  0.6883, -0.9899], requires_grad=True)
Exception ignored in: <function StorageWeakRef.__del__ at 0x7f15538914c0>
Traceback (most recent call last):
  File "/data/users/ezyang/pytorch-tmp/torch/multiprocessing/reductions.py", line 36, in __del__
  File "/data/users/ezyang/pytorch-tmp/torch/storage.py", line 520, in _free_weak_ref
AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
Exception ignored in: <function StorageWeakRef.__del__ at 0x7f15538914c0>
Traceback (most recent call last):
  File "/data/users/ezyang/pytorch-tmp/torch/multiprocessing/reductions.py", line 36, in __del__
  File "/data/users/ezyang/pytorch-tmp/torch/storage.py", line 520, in _free_weak_ref
AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
(/home/ezyang/local/pytorch-tmp-env) [ezyang@devvm066.ash0 ~/local/labs] echo $?
0

and I'm guessing it fails on this PR

@ezyang
Contributor

ezyang commented Mar 10, 2022

Confirmed it fails on the PR and not on master.

@kurtamohler
Collaborator Author

Awesome, thanks @ezyang! I'll get started fixing it

@ezyang
Contributor

ezyang commented Mar 10, 2022

It's possible I need to patch the FB code; it just wasn't obvious to me who was in the wrong. It's probably the RPC pickler's fault though, haha

@kurtamohler
Collaborator Author

kurtamohler commented Mar 11, 2022

There's definitely a problem with the multiprocessing pickler when the sharing strategy is "file_system". When a _TypedStorage is serialized, the deserialization step always creates an _UntypedStorage. So when that's passed to the rebuild_tensor function, we get an AttributeError, because that function expects a _TypedStorage and tries to access the dtype attribute. I'll try to fix it today or this weekend, but if I can't get it working by then, just FYI, I'm on PTO next week
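
To illustrate the general shape of a fix (a minimal sketch only, with hypothetical helper names; not necessarily what this PR ends up doing), the typed storage can be reduced to its untyped buffer plus its dtype, and the dtype reattached on rebuild so rebuild_tensor never sees a bare _UntypedStorage:

# Minimal sketch; reduce_typed_storage/rebuild_typed_storage are hypothetical
# names, and the _TypedStorage(wrap_storage=..., dtype=...) constructor and
# _untyped() method are assumed to exist on this branch.
import torch

def rebuild_typed_storage(untyped_storage, dtype):
    # Re-wrap the shared untyped bytes in a typed view with the original dtype.
    return torch.storage._TypedStorage(wrap_storage=untyped_storage, dtype=dtype)

def reduce_typed_storage(storage):
    # Ship only the untyped buffer (which has its own sharing-aware reducer),
    # but remember the dtype so the receiving process can rebuild a typed view.
    return (rebuild_typed_storage, (storage._untyped(), storage.dtype))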

@ezyang
Contributor

ezyang commented Mar 12, 2022

This has been in the works for a while, what's a little more time :o)

@kurtamohler
Collaborator Author

OK, I think I've fixed the issue. At least the AttributeError for storage.dtype within rebuild_tensor doesn't happen anymore.

The other error (AttributeError: 'NoneType' object has no attribute '_free_weak_ref'), which was also showing up on master, is still happening. I don't fully understand what's going on with that. It looks like _TypedStorage._free_weak_ref() is getting called after the torch module is deleted at the end of the script, and I'm not sure why.
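
As an aside on that shutdown-ordering symptom (a generic sketch, not the actual fix; see the tracking issue mentioned below), the usual defensive pattern is to capture the cleanup callable at class-definition time and guard it in __del__, so the hook never has to resolve names through a module that may already have been torn down:

# Generic sketch of a teardown-safe __del__; _release_weak_ref is a
# hypothetical stand-in for the real C-level cleanup call.
def _release_weak_ref(cdata):
    pass  # placeholder

class WeakRefHolder:
    # Captured once here, so __del__ never reads module globals at shutdown.
    _release = staticmethod(_release_weak_ref)

    def __init__(self, cdata):
        self.cdata = cdata

    def __del__(self):
        release = getattr(type(self), "_release", None)
        if release is not None:
            release(self.cdata)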

@ezyang
Contributor

ezyang commented Mar 12, 2022

I filed a tracking issue for it at #74016. We don't have to fix it here.

@facebook-github-bot
Contributor

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Mar 22, 2022

Bruh internal tests are passing. Let's GOOOO

facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Summary:
Fixes #66228

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: #66970

Reviewed By: bdhirsh

Differential Revision: D33245612

Pulled By: ezyang

fbshipit-source-id: 4c61c2cb029e2b94b0e68927c377d3e1c358dd7c
@github-actions
Contributor

Hey @kurtamohler.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

shahofblah pushed a commit that referenced this pull request Mar 25, 2022
Summary:
Fixes #66228

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: #66970

Reviewed By: bdhirsh

Differential Revision: D33245612

Pulled By: ezyang

fbshipit-source-id: 4c61c2cb029e2b94b0e68927c377d3e1c358dd7c
(cherry picked from commit d29fcdf)

Labels

cla signed · module: internals (Related to internal abstractions in c10 and ATen) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Virtualize FloatStorage and other <type>Storage classes

7 participants