
Timeout option for parallel DataLoader #2474


Description

@vadimkantorov

At least with pin_memory=False, torch uses Python's multiprocessing.SimpleQueue to implement the producer-consumer pattern: the main process samples example indices, and worker processes load/preprocess the corresponding examples. A problem with SimpleQueue is that it is backed by a native OS pipe, which has a small buffer (64 KB on Linux), blocks when that buffer is full, and provides no way to fail on a timeout.
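For illustration, here is a minimal standalone sketch of the underlying problem (not DataLoader code; the exact buffer size is OS-dependent):

```python
import multiprocessing as mp

# SimpleQueue is backed by an OS pipe, so a put() larger than the pipe
# buffer (~64 KB on Linux) blocks until a reader drains it. With no
# consumer attached, this put() hangs forever -- there is no timeout
# parameter to make it fail instead.
q = mp.SimpleQueue()
payload = b"x" * (1 << 20)  # 1 MiB pickled payload, far over the pipe buffer
q.put(payload)              # blocks indefinitely
print("never reached")
```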

When the worker processes block or die for some reason (e.g. due to failed unpickling), this leads to a permanent hang at the indices_queue.put(...) call in _put_indices in DataLoaderIter (for me it happens even in the constructor, during batch pre-fetching).

If we could specify a timeout for SimpleQueue.put, this would eliminate a class of opaque hangs, or at least turn them into exceptions. The put calls are in _put_indices and in _worker_loop.
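A hedged sketch of what this could look like (the names num_workers and put_indices_with_timeout and the 30 s default are illustrative assumptions, not PyTorch's actual implementation): a bounded multiprocessing.Queue already accepts a timeout on put() and raises queue.Full, so a stuck or dead worker would surface as an exception instead of a silent hang.

```python
import multiprocessing as mp
import queue

num_workers = 4
# Bounded queue: put() can time out instead of blocking forever.
indices_queue = mp.Queue(maxsize=2 * num_workers)

def put_indices_with_timeout(batch_indices, timeout=30.0):
    try:
        indices_queue.put(batch_indices, timeout=timeout)
    except queue.Full:
        raise RuntimeError(
            "could not enqueue indices within {:.0f}s; "
            "worker processes may be stuck or dead".format(timeout))
```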

A particularly nasty deadlock could in theory occur if all worker processes block because data_queue has exceeded its pipe buffer, while the main process, instead of draining data_queue, keeps filling indices_queue until it blocks too, because indices_queue's pipe is also full. Given large enough batches (say, with large lists in them transferred by value), this could happen during initial batch pre-fetching/priming.
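A hypothetical repro sketch of that two-queue deadlock (simplified, not DataLoader's actual code; blocking thresholds depend on the OS pipe buffer):

```python
import multiprocessing as mp

def worker(indices_q, data_q):
    while True:
        indices = indices_q.get()
        data_q.put([i * 2 for i in indices])  # results are never consumed

if __name__ == "__main__":
    indices_q, data_q = mp.SimpleQueue(), mp.SimpleQueue()
    mp.Process(target=worker, args=(indices_q, data_q), daemon=True).start()
    big_batch = list(range(200_000))  # pickles to far more than 64 KB
    while True:
        # "Priming": keep pushing indices without ever reading data_q.
        # The worker blocks in data_q.put() once its pipe fills; the main
        # process then blocks here once indices_q's pipe fills. Deadlock.
        indices_q.put(big_batch)
```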
