[Inductor-FX] Generalize FloorDiv conversion to handle more complex launch grids. Remove python_slow grid mode. #163828
blaine-rister wants to merge 8 commits into main
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163828
✅ No failures as of commit 1175db1 with merge base 5f90e8c. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@blaine-rister has imported this pull request. If you are a Meta employee, you can view this in D83209451.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Problem
Inductor's FX backend receives sympy expressions for Triton launch grids and passes them to a tracer to generate equivalent FX IR. However, the tracer does not support all possible sympy expressions. In particular, it can't handle ops like `floor` and `Pow`, which would be found in an expression like `floor(x / y)`. Instead, it expects `FloorDiv(x, y)`, which has the advantage that all intermediate values are integers, unlike `x / y`.
Inductor's Python backend uses a trick where `ceil(x / y)` is computed in Python as `-(x // -y)`, which is faster when evaluating Python launch grids at runtime. However, this trick generates more complex sympy expressions, so the FX backend introduced a `"python_slow"` mode using a more familiar form of ceil division. That mode is slower to evaluate, which increased production CPU usage. (Internal reviewers see T237853632.)
Solution
To get the best of both worlds, this PR removes `"python_slow"` mode and generalizes the `replace_floor_div` function to handle the more complex expressions resulting from the `"python"` grid mode. The new algorithm is conceptually similar to the existing one, except that instead of analyzing only the first argument to a `sympy.Mul` op, it checks all factors, so it can handle expressions containing both `Rational` and `Pow` ops, among other cases. It also uses `Mul.make_args` to handle the case when the argument to `floor` is not a `Mul`. Finally, it uses `expr.is_positive` to check the sign of symbolic exponents.
This new algorithm is guaranteed to convert all `floor` ops to an equivalent expression using `FloorDiv`. (To see this, consider that `floor(x) == FloorDiv(x, 1)`.) Note it may not remove all `Pow` ops, with a counterexample being `floor(x / (2 + z ** y))`, but it covers everything we've seen in practice for symbolic launch grids. In particular, it covers the typical case where `Pow` is a factor of the argument to `floor`, and the exponent is `-1`. In this situation, we move the `Pow` to the denominator of `FloorDiv` and the exponent becomes `1`, eliminating the `Pow` op.
Test plan
This PR adds an end-to-end test for static padding with dynamic outer dimensions, which creates a difficult sympy expression that the existing algorithm would not be able to handle.
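To give a sense of why expressions from the `"python"` grid mode are harder to convert, here is a minimal sketch. It assumes the grid expression is built by applying Python's `//` operator to sympy symbols, which sympy turns into a `floor` op; the actual expression produced by the end-to-end test may look different. The names `ceil_div`, `grid`, and `factors` are illustrative and do not come from Inductor.

```python
import math
import sympy


def ceil_div(n: int, d: int) -> int:
    """Ceiling division on Python ints via the negated floor-division trick."""
    return -(n // -d)


# The trick agrees with math.ceil for non-negative n and positive d.
assert all(
    ceil_div(n, d) == math.ceil(n / d)
    for n in range(50)
    for d in range(1, 9)
)

# Apply the same trick to sympy symbols. sympy implements `//` on
# expressions as floor(a / b), so the result contains a `floor` op.
x, y = sympy.symbols("x y", integer=True, positive=True)
grid = -(x // -y)

# Pull out the single floor op and split its argument into factors.
# The factors include a numeric coefficient and a Pow with exponent -1,
# so inspecting only the first factor of the Mul is not enough.
(floor_op,) = grid.atoms(sympy.floor)
factors = sympy.Mul.make_args(floor_op.args[0])

print(grid)
print(factors)
```

`Mul.make_args` is used here because it also returns a one-element tuple when the argument is not a `Mul`, which matches the case mentioned in the Solution section.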
This PR also adds some unit tests for the `replace_floor_div` function. It can be difficult to construct end-to-end tests that expose all the trickiest expressions, as those tests have to pass through a number of other systems handling dynamic shapes. Therefore, it's easier to expose the edge cases with these new unit tests. The tests check that we can replace all `floor` ops in the input expression with `FloorDiv`, then they expand `FloorDiv` back to `floor` and check equality with the original expression.
Note this PR also requires some MTIA changes to pass internal tests. Those will be stacked onto the imported diff.
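The round-trip check described in the test plan can be sketched roughly as follows. This is not the PR's `replace_floor_div` implementation or its tests: `FloorDivSketch`, `split_fraction`, `to_floor_div`, and `to_floor` are hypothetical names, and `FloorDivSketch` is only an unevaluated placeholder standing in for Inductor's `FloorDiv`. The sketch follows the idea from the Solution section of checking every factor of the argument to `floor`.

```python
import sympy


class FloorDivSketch(sympy.Function):
    """Hypothetical stand-in for FloorDiv: an unevaluated two-argument node."""


def split_fraction(arg):
    """Split a product into (numerator, denominator) by checking every factor."""
    numer, denom = [], []
    # make_args also handles the case where `arg` is not a Mul.
    for factor in sympy.Mul.make_args(arg):
        if factor.is_Rational:
            # Covers Integer too, where q == 1.
            numer.append(sympy.Integer(factor.p))
            denom.append(sympy.Integer(factor.q))
        elif factor.is_Pow and (-factor.exp).is_positive:
            # Negative exponent: move the power to the denominator.
            denom.append(factor.base ** (-factor.exp))
        else:
            numer.append(factor)
    return sympy.Mul(*numer), sympy.Mul(*denom)


def to_floor_div(expr):
    """Rewrite every floor(arg) as FloorDivSketch(numerator, denominator)."""
    return expr.replace(
        sympy.floor, lambda arg: FloorDivSketch(*split_fraction(arg))
    )


def to_floor(expr):
    """Expand FloorDivSketch(n, d) back to floor(n / d)."""
    return expr.replace(FloorDivSketch, lambda n, d: sympy.floor(n / d))


x, y, k = sympy.symbols("x y k", integer=True, positive=True)
examples = [
    -(x // -y),                        # ceil-division trick
    sympy.floor((x + 3) / (4 * y)),    # Rational and Pow factors together
    sympy.floor(x * y ** (-k)),        # symbolic negative exponent
]

for expr in examples:
    converted = to_floor_div(expr)
    assert not converted.has(sympy.floor)  # every floor op was replaced
    assert to_floor(converted) == expr     # round trip restores the input
    print(expr, "->", converted)
```

The sketch relies on sympy's structural equality for the final comparison, so it only confirms that the rewrite is reversible for these inputs; it does not establish numerical equivalence for arbitrary expressions.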
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben