Race condition in multi‑GPU allocation when multiple processes/containers share device visibility #5757
wirlessBrain started this conversation in General
Replies: 0 comments
When multiple GPUs are connected to a host, our algorithm selects a list of devices to run a network that requires multiple GPUs. These devices are then used for inference.
The problem arises when two processes run simultaneously (either directly on the host or inside separate containers) with visibility to all GPUs. Currently, there is no mechanism for one process to know that certain GPUs have already been picked and reserved by another process.
This leads to a race condition:
1. Process 1 selects a set of GPUs and begins inference.
2. Process 2, unaware of Process 1's allocation, may also select overlapping GPUs.
3. Both processes attempt to use the same devices, causing conflicts, degraded performance, or failures.
If we expose visibility of all GPUs to more than one container, what mechanisms exist to prevent race conditions in GPU allocation?
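One common pattern for this situation is cross-process reservation via advisory file locks: each GPU gets a lock file in a directory shared by all processes (bind-mounted into each container), and a process only uses a GPU whose lock it holds. The sketch below is illustrative, not an existing API in any particular framework; the lock directory path and function name are assumptions, and `flock`-based locks require the lock directory to be on a shared local filesystem.

```python
# Hedged sketch: cross-process GPU reservation using advisory file locks
# (fcntl.flock). Assumes LOCK_DIR is on a filesystem shared by all
# competing processes/containers (e.g. a bind-mounted host path).
# Locks are released automatically when the files close or the process exits.
import fcntl
import os

LOCK_DIR = "/tmp/gpu_locks"  # illustrative path; must be shared across processes

def try_reserve_gpus(candidate_ids, needed):
    """Try to exclusively lock `needed` GPUs out of `candidate_ids`.

    Returns (reserved_ids, lock_file_handles). Keep the handles open for
    as long as the GPUs are in use; closing them releases the locks.
    Returns ([], []) if not enough free GPUs were found.
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    reserved, handles = [], []
    for gpu_id in candidate_ids:
        f = open(os.path.join(LOCK_DIR, f"gpu{gpu_id}.lock"), "w")
        try:
            # Non-blocking exclusive lock: fails fast if another
            # process already reserved this GPU.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            f.close()  # held by another process; try the next GPU
            continue
        reserved.append(gpu_id)
        handles.append(f)
        if len(reserved) == needed:
            return reserved, handles
    # Not enough free GPUs: release everything we grabbed
    for f in handles:
        f.close()
    return [], []
```

Because `flock` locks are tied to the open file description and vanish when the holding process dies, a crashed process cannot leave a GPU permanently "reserved". Each process would then restrict itself (e.g. via `CUDA_VISIBLE_DEVICES`) to the IDs it successfully locked.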