[CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync #74261
Conversation
💊 CI failures summary (Dr. CI): As of commit a94589f, 💚 no failures yet.
This pull request was exported from Phabricator. Differential Revision: D34482514
cc @mcarilli |
cc @mwootton
…GPU sync (pytorch#74261)

Summary: Pull Request resolved: pytorch#74261

### Goal

Implement a cheap way to reclaim GPU memory (garbage collection) without incurring a GPU sync.

### Why do we need this?

Currently, there are only two ways to reclaim GPU memory blocks already assigned to a particular stream.

- `release_available_cached_blocks(params)`: Frees blocks exceeding `CachingAllocatorConfig::max_split_size()` until the request can be satisfied. Issue: if `max_split_size` is unset (the default), this function is a no-op. Even when it is set, reclamation is quite conservative (e.g., blocks under `max_split_size` are never freed).
- `release_cached_blocks()`: Waits for all in-flight events and then reclaims blocks. Issue: waiting for all events is very expensive, as it will likely stall all GPU operations. Many GPU applications without proper handling of potential GPU throttling would suffer or crash.

### Proposed idea

- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. This is safe, as `release_available_cached_blocks` essentially does the same thing (just less aggressively).
- GC is triggered only when a `malloc` request cannot be served from the block pool. There is no need to free blocks while the block pool is functioning just fine.
- Prioritize reclaiming blocks that have not been reused for a long time. Reclamation stops once the used memory capacity drops below the threshold.
- This code path is entirely optional; by default it is not invoked.

Test Plan:
- Unit tests
- Manually checked that GPU memory usage stays at the level indicated by the garbage collector. If not, the caching allocator at least keeps trying to free blocks.

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: 1d29b589752d489ecbf9bd7d27fb624f980e5666
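The reclamation pass described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual PyTorch allocator code: `Block`, `garbage_collect`, and the field names are stand-ins, and zeroing `size` stands in for `cudaFree`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Block {
  int64_t size = 0;
  bool in_use = false;   // still assigned to a live allocation
  int32_t gc_count = 0;  // GC passes survived since last reuse
};

// Free the "oldest" idle blocks (highest gc_count first) until the
// cached total drops below threshold * total_capacity. No GPU sync:
// only blocks not currently in use are touched.
int64_t garbage_collect(std::vector<Block>& pool,
                        double threshold, int64_t total_capacity) {
  int64_t cached = 0;
  for (const Block& b : pool) cached += b.size;
  const int64_t target = static_cast<int64_t>(threshold * total_capacity);
  int64_t freed = 0;
  while (cached - freed > target) {
    // Pick the block that has gone unreused the longest.
    Block* victim = nullptr;
    for (Block& b : pool) {
      if (!b.in_use && b.size > 0 &&
          (victim == nullptr || b.gc_count > victim->gc_count)) {
        victim = &b;
      }
    }
    if (victim == nullptr) break;  // nothing left to reclaim
    freed += victim->size;
    victim->size = 0;  // stand-in for cudaFree
  }
  return freed;
}
```

Note how blocks that are still in use are never candidates, and reclamation stops as soon as usage falls under the threshold, matching the "less aggressive but sync-free" intent above.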
Hey @jaewonlee-fb.
…GPU sync (#74261) (cherry picked from commit 05780f1; summary identical to the commit message above)
```cpp
// cudaMalloc)..
struct BlockInfo {
  int64_t size = 0;
  int32_t gc_counter = 0;
```
Seems like a typo? `gc_counter` -> `gc_count`
Oh, it's not actually used.
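Although the `BlockInfo` field above turned out to be unused, the per-block age counter the summary relies on ("blocks that have not been reused for a long time") could behave roughly like this sketch. Names (`CachedBlock`, `on_gc_pass`, `on_reuse`) are hypothetical, not taken from the allocator.

```cpp
#include <cassert>
#include <cstdint>

struct CachedBlock {
  bool in_use = false;
  int32_t gc_count = 0;  // ages while the block sits idle in the pool
};

// Each GC pass bumps the counter of cached-but-idle blocks.
void on_gc_pass(CachedBlock& b) {
  if (!b.in_use) ++b.gc_count;
}

// Reusing a block resets its age, so a high gc_count means
// "not reused for a long time" and marks a reclamation candidate.
void on_reuse(CachedBlock& b) {
  b.in_use = true;
  b.gc_count = 0;
}
```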