Race condition in multi‑GPU allocation when multiple processes/containers share device visibility #5757
wirlessBrain started this conversation in General
Replies: 0 comments
When multiple GPUs are connected to a host, our algorithm selects a list of devices to run a network that requires multiple GPUs. These devices are then used for inference.
The problem arises when two processes run simultaneously (either directly on the host or inside separate containers) with visibility to all GPUs. Currently, there is no mechanism for one process to know that certain GPUs have already been picked and reserved by another process.
This leads to a race condition:
1. Process 1 selects a set of GPUs and begins inference.
2. Process 2, unaware of Process 1's allocation, may also select overlapping GPUs.
3. Both processes attempt to use the same devices, causing conflicts, degraded performance, or failures.
If we expose visibility of all GPUs to more than one container, what mechanisms exist to prevent race conditions in GPU allocation?
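One common pattern for this situation is cross-process reservation via advisory file locks: each GPU gets a lock file in a directory shared by all processes (bind-mounted into each container), and a process only uses a GPU whose lock it holds. The sketch below is illustrative, not an existing API in any particular framework; the lock directory path and function name are assumptions, and `flock`-based locks require the lock directory to be on a shared local filesystem.

```python
# Hedged sketch: cross-process GPU reservation using advisory file locks
# (fcntl.flock). Assumes LOCK_DIR is on a filesystem shared by all
# competing processes/containers (e.g. a bind-mounted host path).
# Locks are released automatically when the files close or the process exits.
import fcntl
import os

LOCK_DIR = "/tmp/gpu_locks"  # illustrative path; must be shared across processes

def try_reserve_gpus(candidate_ids, needed):
    """Try to exclusively lock `needed` GPUs out of `candidate_ids`.

    Returns (reserved_ids, lock_file_handles). Keep the handles open for
    as long as the GPUs are in use; closing them releases the locks.
    Returns ([], []) if not enough free GPUs were found.
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    reserved, handles = [], []
    for gpu_id in candidate_ids:
        f = open(os.path.join(LOCK_DIR, f"gpu{gpu_id}.lock"), "w")
        try:
            # Non-blocking exclusive lock: fails fast if another
            # process already reserved this GPU.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            f.close()  # held by another process; try the next GPU
            continue
        reserved.append(gpu_id)
        handles.append(f)
        if len(reserved) == needed:
            return reserved, handles
    # Not enough free GPUs: release everything we grabbed
    for f in handles:
        f.close()
    return [], []
```

Because `flock` locks are tied to the open file description and vanish when the holding process dies, a crashed process cannot leave a GPU permanently "reserved". Each process would then restrict itself (e.g. via `CUDA_VISIBLE_DEVICES`) to the IDs it successfully locked.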