At least with `pin_memory=False`, torch uses Python's `multiprocessing.SimpleQueue` to implement a producer-consumer pattern: the main process samples example indices and worker processes load/preprocess the corresponding examples. A problem with `SimpleQueue` is that it is backed by native OS pipes, which have small buffer sizes (64 KB on Linux), block when the buffer is full, and don't provide a way to fail on a timeout.
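To make that concrete, here is a minimal standalone sketch (not taken from the DataLoader code) showing that `SimpleQueue.put` blocks once the pipe buffer fills and nothing reads the other end; running it is expected to hang partway through the loop:

```python
import multiprocessing as mp

# Illustration only: with no reader on the other end of the pipe,
# SimpleQueue.put blocks indefinitely once the OS pipe buffer
# (~64 KB on Linux) fills. This script is expected to hang.
if __name__ == "__main__":
    q = mp.SimpleQueue()
    payload = b"x" * 8192            # 8 KB per message
    for i in range(32):              # ~256 KB total, well past the buffer
        print("putting message", i, flush=True)
        q.put(payload)               # hangs here once the buffer is full
```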
When worker processes block or die for some reason (e.g. due to failed unpickling), this leads to a permanent hang in `_put_indices` (the `indices_queue.put(...)` call) in `DataLoaderIter`; for me it happens even in the constructor, during batch pre-fetching.
If there were a way to specify a timeout for `SimpleQueue.put`, it could eliminate a class of hard-to-diagnose hangs, or at least turn them into exceptions. The `put` calls are in `_put_indices` and in `_worker_loop`.
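As a sketch of what that could look like (the wrapper name and parameters below are hypothetical, not part of torch): a bounded `multiprocessing.Queue` already accepts a `timeout` argument and raises `queue.Full` instead of hanging, because its pipe writes happen on a background feeder thread and `put` only blocks on the `maxsize` semaphore:

```python
import multiprocessing as mp
import queue

# Hypothetical helper, not part of torch: turn a silent hang into an
# exception by using a bounded mp.Queue, whose put() accepts a timeout
# and raises queue.Full when the queue stays full for too long.
def put_or_fail(q, item, timeout=5.0):
    try:
        q.put(item, timeout=timeout)
    except queue.Full:
        raise RuntimeError(
            "queue.put timed out after %.1fs; a worker is likely stuck or dead"
            % timeout
        )

if __name__ == "__main__":
    indices_queue = mp.Queue(maxsize=4)  # bounded, so the timeout is meaningful
    for batch_indices in ([0, 1], [2, 3], [4, 5]):
        put_or_fail(indices_queue, batch_indices)
```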
A particularly nasty deadlock could in theory happen if all worker processes are blocked because `data_queue` has exceeded its pipe buffer size, while the main process, instead of draining results from `data_queue`, keeps filling `indices_queue` until it blocks too because `indices_queue`'s buffer is also full. With large enough batches (say, containing large lists transferred by value), this could happen during initial batch pre-fetching / priming.
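For illustration, here is a standalone sketch of that scenario (the names are mine, not the DataLoader's); it deadlocks by design, so expect to kill it manually:

```python
import multiprocessing as mp

def _worker(indices_queue, data_queue):
    # Each result is far larger than the pipe buffer, so the worker's
    # data_queue.put blocks until the main process drains data_queue,
    # which it never does here.
    while True:
        idx = indices_queue.get()
        data_queue.put(b"x" * (1 << 20))  # 1 MB "batch"

if __name__ == "__main__":
    indices_queue = mp.SimpleQueue()
    data_queue = mp.SimpleQueue()
    w = mp.Process(target=_worker, args=(indices_queue, data_queue),
                   daemon=True)
    w.start()
    # Keep priming index batches without ever reading data_queue, as
    # during pre-fetching. The worker blocks on data_queue.put; once
    # indices_queue's pipe buffer also fills, the main process blocks
    # here as well, and both sides hang forever.
    for i in range(10_000):
        indices_queue.put(list(range(i, i + 64)))
```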