Skip to content

Conversation

@peterjc123
Copy link
Collaborator

While the other missing components are optional, this one is of much importance. The DataLoader in Windows is not so useful so we have to fix this.

self.assertEqual(storage_size, 5)

@unittest.skipIf(IS_WINDOWS, 'NYI: not supported on Windows')
# @unittest.skipIf(IS_WINDOWS, 'NYI: not supported on Windows')

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@peterjc123
Copy link
Collaborator Author

I'm not so sure that it works now. @yf225, Do you know when the CI can be put into work? If Windows CI can be setup quickly, then I'd rather have it tested on CI. Otherwise, I'll test it on my own PC.

@yf225
Copy link
Contributor

yf225 commented Jan 4, 2018

@peterjc123 Windows CI will be enabled again after this fix #4469. We can test it on CI.

@yf225
Copy link
Contributor

yf225 commented Jan 4, 2018

@pytorchbot retest this please

@peterjc123
Copy link
Collaborator Author

peterjc123 commented Jan 4, 2018

@apaszke It seems that the timeout options in dataloader.py is too small. The creation of processes takes longer time in Windows. Shall we modify them one by one or set a base timeout and use the multiplied timeout?

@apaszke
Copy link
Contributor

apaszke commented Jan 4, 2018

Let’s multiply the timeout for Windows only

@peterjc123 peterjc123 changed the title [WIP]Fix multiprocessing and dataloader tests on Windows Fix multiprocessing and dataloader tests on Windows Jan 6, 2018
@apaszke
Copy link
Contributor

apaszke commented Jan 6, 2018

@pytorchbot test this please

@apaszke apaszke merged commit 2dd7039 into pytorch:master Jan 6, 2018
@apaszke
Copy link
Contributor

apaszke commented Jan 6, 2018

Thank you @peterjc123!

@peterjc123 peterjc123 deleted the mp_fix branch January 7, 2018 04:30
@ssnl
Copy link
Collaborator

ssnl commented Jan 11, 2018

I'm curious. How does Windows build pass the test_segfault test? Wouldn't the main process get stuck waiting for data at data_queue.get (like the POSIX builds before SIGCHLD handler)?

@peterjc123
Copy link
Collaborator Author

@ssnl No, it won't. Because there's no fork in Windows. Although you initialized the CUDA context of one tensor, the one in the child process is a brand new one due to spawn, so the segfault won't be triggered.

@ssnl
Copy link
Collaborator

ssnl commented Jan 11, 2018

@peterjc123 I'm still a bit confused. Are you saying that ctype.string(0) doesn't segfault on windows because child processes get new CUDA contexts due to spawn?

@peterjc123
Copy link
Collaborator Author

peterjc123 commented Jan 12, 2018

@ssnl Sorry, I messup that up with test_cuda_bad_call. As for test_segfault,the process should be test_segfault starting a new process to process function _test_segfault, the new process called the DataLoader and the processed was finally closed because there's an exception in the DataLoader. I guess you're asking about the timeout of the data_queue.get() right?
Here is full traceback:

Process Process-1:
Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\pytorch\mp_dl_test.py", line 107, in _test_segfault
    _ = next(iter(dataloader))
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 273, in __next__
    return self._process_next_batch(batch)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 293, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
OSError: Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 56, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 56, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "D:\pytorch\mp_dl_test.py", line 48, in __getitem__
    return ctypes.string_at(0)
  File "C:\Anaconda2\envs\test_new\lib\ctypes\__init__.py", line 492, in string_at
    return _string_at(ptr, size)
OSError: exception: access violation reading 0x0000000000000000

The data_queue in the child process is wrapped in the try..catch block, as commented below:

def _worker_loop(dataset, index_queue, data_queue, collate_fn, seed, init_fn, worker_id):
    global _use_shared_memory
    _use_shared_memory = True

    # Intialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
    # module's handlers are executed after Python returns from C low-level
    # handlers, likely when the same fatal signal happened again already.
    # https://docs.python.org/3/library/signal.html Sec. 18.8.1.1
    _set_worker_signal_handlers()

    torch.set_num_threads(1)
    torch.manual_seed(seed)

    if init_fn is not None:
        init_fn(worker_id)

    while True:
        r = index_queue.get()
        if r is None:
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])   <- throw exception
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info()))) <- exception returns here
        else:
            data_queue.put((idx, samples))

After the exception was thrown, the data_queue.get() receives this and after that, the function process_next_batch checks this and stops the new process. So it should be passing without problem.

@peterjc123
Copy link
Collaborator Author

@ssnl However, if the index_queue.get() throws exception in the previous example(worker_loop), then it will cause the deadlock.

@ssnl
Copy link
Collaborator

ssnl commented Jan 12, 2018

@peterjc123 I see. Interesting that windows can throw an OSError upon invalid memory access. In posix, we had to do some signal handling to manually capture such issues. :)

ssnl pushed a commit to ssnl/pytorch that referenced this pull request Feb 8, 2018
soumith pushed a commit that referenced this pull request Feb 9, 2018
* cherry pick Fix multiprocessing and dataloader tests on Windows (#4453)

* cherry pick Dataloader issues #4643

* fix common IS_WINDOWS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants