At least with `pin_memory=False`, torch uses Python's `multiprocessing.SimpleQueue` to implement a producer-consumer pattern: the main process samples example indices and worker processes load/preprocess the corresponding examples. A problem with `SimpleQueue` is that it is backed by native OS pipes, which have small buffer sizes (64 KB on Linux), block when the buffer is full, and don't provide a way to fail on a timeout.
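To make that concrete, here is a minimal standalone sketch (not taken from the DataLoader code) showing that `SimpleQueue.put` blocks once the pipe buffer fills and nothing reads the other end; running it is expected to hang partway through the loop:

```python
import multiprocessing as mp

# Illustration only: with no reader on the other end of the pipe,
# SimpleQueue.put blocks indefinitely once the OS pipe buffer
# (~64 KB on Linux) fills. This script is expected to hang.
if __name__ == "__main__":
    q = mp.SimpleQueue()
    payload = b"x" * 8192            # 8 KB per message
    for i in range(32):              # ~256 KB total, well past the buffer
        print("putting message", i, flush=True)
        q.put(payload)               # hangs here once the buffer is full
```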
When worker processes block or die for some reason (e.g. due to failed unpickling), this leads to a permanent hang in `_put_indices` (the `indices_queue.put(...)` call) in `DataLoaderIter`; for me it happens even in the constructor, during batch pre-fetching.
If there were a way to specify a timeout for `SimpleQueue.put`, it could eliminate a class of hard-to-diagnose hangs, or at least turn them into exceptions. The `put` calls are in `_put_indices` and in `_worker_loop`.
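As a sketch of what that could look like (the wrapper name and parameters below are hypothetical, not part of torch): a bounded `multiprocessing.Queue` already accepts a `timeout` argument and raises `queue.Full` instead of hanging, because its pipe writes happen on a background feeder thread and `put` only blocks on the `maxsize` semaphore:

```python
import multiprocessing as mp
import queue

# Hypothetical helper, not part of torch: turn a silent hang into an
# exception by using a bounded mp.Queue, whose put() accepts a timeout
# and raises queue.Full when the queue stays full for too long.
def put_or_fail(q, item, timeout=5.0):
    try:
        q.put(item, timeout=timeout)
    except queue.Full:
        raise RuntimeError(
            "queue.put timed out after %.1fs; a worker is likely stuck or dead"
            % timeout
        )

if __name__ == "__main__":
    indices_queue = mp.Queue(maxsize=4)  # bounded, so the timeout is meaningful
    for batch_indices in ([0, 1], [2, 3], [4, 5]):
        put_or_fail(indices_queue, batch_indices)
```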
A particularly nasty deadlock could in theory happen if all worker processes are blocked because `data_queue` has exceeded its pipe buffer size, while the main process, instead of draining results from `data_queue`, keeps filling `indices_queue` until it blocks too because `indices_queue`'s buffer is also full. With large enough batches (say, containing large lists transferred by value), this could happen during initial batch pre-fetching / priming.
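For illustration, here is a standalone sketch of that scenario (the names are mine, not the DataLoader's); it deadlocks by design, so expect to kill it manually:

```python
import multiprocessing as mp

def _worker(indices_queue, data_queue):
    # Each result is far larger than the pipe buffer, so the worker's
    # data_queue.put blocks until the main process drains data_queue,
    # which it never does here.
    while True:
        idx = indices_queue.get()
        data_queue.put(b"x" * (1 << 20))  # 1 MB "batch"

if __name__ == "__main__":
    indices_queue = mp.SimpleQueue()
    data_queue = mp.SimpleQueue()
    w = mp.Process(target=_worker, args=(indices_queue, data_queue),
                   daemon=True)
    w.start()
    # Keep priming index batches without ever reading data_queue, as
    # during pre-fetching. The worker blocks on data_queue.put; once
    # indices_queue's pipe buffer also fills, the main process blocks
    # here as well, and both sides hang forever.
    for i in range(10_000):
        indices_queue.put(list(range(i, i + 64)))
```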