20x less memory use and 37.25% speedup in `min_cut_rematerialization_partition` when using the new dp knapsack solver, compared to the existing default one (dp) #160914
Conversation
…ew alternative to 0/1 knapsack with a full dp table implementation. Unit testing in progress; benchmarks not started. The current, first implementation uses recursion and can be improved (explicit stack, vectorization, and more). That's a starting point. Some unit tests are failing, showing minor differences between the new and previous implementations
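For context, the existing default solver is based on the classic 0/1 knapsack with a full dp table, which is what this PR's alternative replaces. A minimal pure-Python sketch of that baseline (illustrative names, not the actual partitioner code):

```python
def knapsack_full_table(weights, values, capacity):
    """Classic 0/1 knapsack with a full (n+1) x (capacity+1) dp table.

    Returns the best achievable value and the chosen item indices.
    Memory use is O(n * capacity); the sliding-window variant in this PR
    brings that down to O(capacity).
    """
    n = len(weights)
    # dp[i][w] = best value using the first i items within budget w
    dp = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wi, vi = weights[i - 1], values[i - 1]
        for w in range(capacity + 1):
            dp[i][w] = dp[i - 1][w]  # skip item i-1
            if wi <= w:
                dp[i][w] = max(dp[i][w], dp[i - 1][w - wi] + vi)
    # Backtracking needs the *full* table, which is why the naive
    # 2-row memory optimization loses the ability to recover the items.
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            chosen.append(i - 1)
            w -= weights[i - 1]
    return dp[n][capacity], chosen[::-1]

best, chosen = knapsack_full_table([2, 3, 4], [3, 4, 6], 6)
# best value 9 is achieved by items 0 and 2 (weights 2 + 4 = 6)
```

The backtracking step is the reason Hirschberg's trick is needed: dropping the table to two rows keeps the optimal value but discards the information required to reconstruct the chosen set.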
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160914
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure. As of commit 2b3706b with merge base 226850c.
NEW FAILURES - The following jobs have failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot label "module: functorch"
…lit: and adjust the right split accordingly. Remove the `quantized_max_memory <= 0` condition, add a condition for 0 memory length
Didn't find the following labels among repository labels: topic: user facing

@pytorchbot label "release notes: torch.func" Not sure about it; please adjust if needed.
I have some happy benchmarks to share!
Ran all solvers 1000 times on as low-noise a machine as possible (Ubuntu 24.04.2, AMD Ryzen 7 9800X3D, 64 GB RAM, no background activity except a terminal with the benchmark script).

How to reproduce: `python userbenchmark/functorch/model_benchmark.py && python to_csv.py`

It will produce JSON and CSV files with the results. You can steer the number of runs for each test and modify the problem size you want to test (I used …). You can copy the content of the CSV file into a copy of the spreadsheet below, split it into columns using the method from the top context menu, and you will get neatly organized benchmarks.

Results:

The code for the benchmarks is still kind of messy; I'll improve it before merging this PR. I think it's a good moment to move it from 'draft' to 'ready for review'. @zou3519 @Chillee @samdow @kshitij12345
Hi @bdhirsh, a gentle ping if you are available for a review. @Chillee I read your post saying that you are stepping away from PyTorch, but perhaps you would like to take a look at this particular change? You might be interested in this one, because it builds on your idea for the optimization (Hirschberg algorithm + sliding window). If you're not interested, then sorry for bothering you!
Who should I ping for the review? @ezyang @janeyx99 @anijain2305
This PR needs review @ezyang @janeyx99 @anijain2305 @bdhirsh @Chillee
@zou3519 review ping, this one is important because it yields very solid memory savings (up to 20x less RAM used!) compared to the current default
This is in my review box but it requires a very heavy review that I haven't had time for. It would be easier to justify reviewing this if there were a clearer description of when the CPU memory use/speed of the min cut partitioner is a limiting factor for compile time; my impression is that this is typically not the big problem.
@ezyang The main reason for this change is the RAM usage of the DP; the CPU speedup is a nice bonus. Current
oh. we probably did not notice this because our machines have a ginormous amount of cpu ram lol
`best_split = int(torch.argmax(left_dp_local).item())`
…
`left_capacity = best_split`
`right_capacity = capacity - best_split`

Should this be `capacity - 1 - best_split`?

I think it should not, because it's not index arithmetic; I really compute the memory left (capacity) here.
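To illustrate why no `-1` is needed: the split point is a memory budget, not an array index, so the two halves' budgets must sum to the full capacity. A small pure-Python sketch (names like `best_value` are illustrative, not the PR's actual code):

```python
def best_value(weights, values, capacity):
    """1-dim dp profile for 0/1 knapsack: best value for every budget 0..capacity."""
    dp = [0] * (capacity + 1)
    for wi, vi in zip(weights, values):
        # iterate budgets backwards so each item is used at most once
        for w in range(capacity, wi - 1, -1):
            dp[w] = max(dp[w], dp[w - wi] + vi)
    return dp

weights, values, capacity = [2, 3, 4, 1], [3, 4, 6, 2], 7
left_w, left_v = weights[:2], values[:2]
right_w, right_v = weights[2:], values[2:]

left_dp = best_value(left_w, left_v, capacity)
right_dp = best_value(right_w, right_v, capacity)

# Giving b units of memory to the left half leaves capacity - b for the
# right half; budgets sum to capacity, with no off-by-one adjustment.
combined = [left_dp[b] + right_dp[capacity - b] for b in range(capacity + 1)]
best_split = max(range(capacity + 1), key=lambda b: combined[b])
left_capacity = best_split
right_capacity = capacity - best_split

assert left_capacity + right_capacity == capacity
# enumerating all budget splits recovers the optimum of the whole problem
assert max(combined) == best_value(weights, values, capacity)[capacity]
```

With `capacity - 1 - best_split` the right half would be denied one unit of memory that the left half never used, so the combined optimum could be missed.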
Ping @ezyang again
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.
I pushed a lint fix and merged main; let's try again @ezyang
@ezyang Let's try the merge again, please
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.
Seems like flaky tests @ezyang
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-docs / build-docs-python-false, pull / linux-jammy-py3.10-gcc11 / test (docs_test, 1, 1, lf.linux.2xlarge), trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge, unstable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…artition when using the new dp knapsack solver, compared to existing default one (dp) (#160914)

The goal: **Reduce memory usage** in [min_cut_rematerialization_partition](https://github.com/pytorch/pytorch/blob/1f1900369435933a013df9a7d5e07c75c1cebb5d/torch/_functorch/partitioners.py#L2601) with a new knapsack implementation.

Rationale: @Chillee's comment in the original dp_knapsack suggests improving the code with the [Hirschberg algorithm and sliding window](https://codeforces.com/blog/entry/47247?#comment-316200). In this PR, I create a new knapsack implementation which is based on the original implementation and uses Hirschberg + sliding window.

The existing `dp_knapsack` implementation instantiates a full dp table of shape (n, W), where n is the number of memory elements and W is the quantized memory length. The new `dp_knapsack_sliding_hirschberg` uses only (2, W) memory: it needs only 2 dp profiles (1-dim tensors) to compute the same output as the original `dp_knapsack`.

This optimization is possible because at each step we use only the current and previous row ("dp profile"). We split the indices of items (each item is a pair of memory and runtime) into two halves, compute the dp profile for each half, find the split at which the sum of runtimes from both halves is highest, use this split to decide how much budget we give to the left half and the right half, and recurse on each half with its new memory budget.

Based on benchmarks, consider whether we should keep two dp knapsack implementations or one, which one should be the default, whether we want to make it easier to use the new one, etc. In general, think about the next steps.

Thanks in advance for all comments.

Pull Request resolved: #160914
Approved by: https://github.com/ezyang
Co-authored-by: Edward Yang <ezyang@meta.com>
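The divide-and-conquer scheme described above can be sketched in pure Python. This is a simplified illustration under my own naming (`dp_profile`, `hirschberg_knapsack`), not the PR's actual `dp_knapsack_sliding_hirschberg`: each level keeps only 1-dim dp profiles, picks the budget split that maximizes the combined value, and recurses on each half with its share of the budget.

```python
def dp_profile(weights, values, capacity):
    """Best achievable value for every budget 0..capacity, in O(capacity) memory."""
    dp = [0] * (capacity + 1)
    for wi, vi in zip(weights, values):
        # iterate budgets backwards so each item is taken at most once
        for w in range(capacity, wi - 1, -1):
            dp[w] = max(dp[w], dp[w - wi] + vi)
    return dp

def hirschberg_knapsack(items, capacity):
    """Return the indices of chosen items without ever building the full table.

    items: list of (index, weight, value) triples.
    """
    if not items or capacity <= 0:
        return []
    if len(items) == 1:
        idx, w, _v = items[0]
        return [idx] if w <= capacity else []
    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    left_dp = dp_profile([w for _, w, _ in left], [v for _, _, v in left], capacity)
    right_dp = dp_profile([w for _, w, _ in right], [v for _, _, v in right], capacity)
    # The split is a memory budget: best_split units go to the left half,
    # capacity - best_split to the right half.
    best_split = max(range(capacity + 1),
                     key=lambda b: left_dp[b] + right_dp[capacity - b])
    return (hirschberg_knapsack(left, best_split)
            + hirschberg_knapsack(right, capacity - best_split))

# items: (index, weight, value); e.g. runtime saved as value, memory as weight
items = list(zip(range(4), [2, 3, 4, 1], [3, 4, 6, 2]))
chosen = hirschberg_knapsack(items, 7)
```

The recursion only ever holds two 1-dim profiles per level, so peak memory is O(W) rather than O(n·W), at the cost of roughly doubling the dp work across recursion levels.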
cc @zou3519 @Chillee @samdow @kshitij12345