Interrupted system call error appearing after updating install today #5363

@meder411

I updated my PyTorch installation this afternoon in order to use LayerNorm, and code that worked fine before now fails with a data loader error after 2 epochs of training. Training currently runs on a single GPU.

Traceback (most recent call last):
  File "train.py", line 54, in <module>
    main()
  File "train.py", line 48, in main
    solver.run_wasserstein(num_epochs)
  File "/playpen/meder/projects/point-gan/point_wgan_solver.py", line 551, in run_wasserstein
    sample = data_iter.next()
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 275, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 254, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
    new_handle = recv_handle(conn)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
    return _multiprocessing.recvfd(conn.fileno())
OSError: [Errno 4] Interrupted system call

The only difference in my code between before and after the update is the addition of 3 LayerNorm layers. Perhaps the update introduced a bug?

I'm currently checking whether the same issue occurs without LayerNorm, though I'm not sure why LayerNorm would be the source of the problem.
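
For context on the error itself: Errno 4 (EINTR) means a blocking system call was interrupted by a signal before it completed. On Python 2.7 this surfaces as an OSError or IOError, whereas Python 3.5 and later retry most interrupted calls automatically (PEP 475). The following is only a sketch of the generic retry pattern, not a fix confirmed for this issue. The names `retry_on_eintr` and `max_retries` are hypothetical, and whether it is safe to retry at any particular point inside the data loader is an assumption that has not been verified here.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying while it fails with EINTR.

    Hypothetical helper for illustration only. `max_retries` is an
    assumed keyword argument that bounds the number of retries; it is
    removed from kwargs before func is called.
    """
    max_retries = kwargs.pop("max_retries", 5)
    attempt = 0
    while True:
        try:
            return func(*args, **kwargs)
        except (OSError, IOError) as exc:
            # Re-raise anything that is not an interrupted system call,
            # and give up once the retry budget is exhausted.
            if exc.errno != errno.EINTR or attempt >= max_retries:
                raise
            attempt += 1
```

A helper like this only makes sense around a call that is safe to repeat. It hides the symptom and does not explain why a signal is arriving during the receive in the first place.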

  • OS: Ubuntu 16.04
  • PyTorch version: GitHub source (as of ~6pm EST)
  • How you installed PyTorch (conda, pip, source): source
  • Python version: 2.7
  • CUDA/cuDNN version: 9.0
  • GPU models and configuration: Titan X & 2 Titan Z
  • GCC version (if compiling from source): 5.4.0
