[nvFuser] Latency improvements for pointwise + reduction fusion #45218
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Fix csarofeen#306 * Reenable smem block gemm cache test.
Fixes pytorch#230: removing the WAR on the contig flag for broadcasting; removing the tests that were only needed for that WAR.
Add an lstm cell c++ test for convenience.
removing graph copy from critical code path; cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, ex. cloning) Also fixing isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes pytorch#305: adds sys env variables to disable fma and to specify the optimization level for JIT compilation.
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView This is considered a temporary workaround as IRPrinter is meant to be exclusive to the fusion IR. * Add a comment
* Initial dynamic shared memory support: check that shared memory usage is within limits for the current GPU; gather buffers in a single pass; use a single dynamic shared memory allocation for the reduction/broadcast workspace; align dynamic shared memory by data type. Co-authored-by: Ryan Spring <rspring@nvidia.com>
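As a rough illustration of the "align dynamic shared memory by data type" point above, here is a hedged, self-contained sketch of laying out several workspace buffers inside one dynamic shared memory allocation; `SmemBuffer` and `layoutDynamicSmem` are illustrative names, not the actual nvFuser code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative descriptor: buffer size in bytes plus its required alignment
// (typically the element type's size, e.g. 4 for float, 8 for double).
struct SmemBuffer {
  std::size_t bytes;
  std::size_t align;
};

// Place the buffers back-to-back in a single dynamic shared memory block,
// rounding each offset up to that buffer's alignment. The return value is
// the total number of dynamic shared memory bytes to request at launch,
// which can then be checked against the device's shared memory limit.
std::size_t layoutDynamicSmem(const std::vector<SmemBuffer>& buffers,
                              std::vector<std::size_t>& offsets) {
  std::size_t offset = 0;
  offsets.clear();
  for (const auto& buf : buffers) {
    offset = (offset + buf.align - 1) / buf.align * buf.align;  // align up
    offsets.push_back(offset);
    offset += buf.bytes;
  }
  return offset;
}
```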
An example of this error happens with tv4 of testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup * Print names of IterDomains for better debugging experience
(pytorch#333) Add an Executor method to compile from a string for debug usage. Fix the reduction scheduler to have TI-level perf for FP16 inner-dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
(pytorch#339) Move IterVisitor-derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGridReduction/hasBlockReduction, and cache the results of these functions in the executor during compilation. Improves average latency on LSTMCell from 77.5us to 20.5us.
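A minimal sketch of the caching idea, assuming a hypothetical Executor with a separate compile step; the real query and executor are more involved, but the point is that the expensive IR traversal happens once at compile time rather than on every launch.

```cpp
// Hedged sketch: cache an expensive IR query at compile time so the hot
// run path only reads a flag. Names are illustrative placeholders.
class Executor {
 public:
  void compile(/* const Fusion& fusion */) {
    // Pay for the traversal once, when the kernel is compiled...
    has_block_broadcast_ = computeHasBlockBroadcast();
  }

  bool hasBlockBroadcast() const {
    // ...so repeated launches just read the cached result.
    return has_block_broadcast_;
  }

 private:
  bool computeHasBlockBroadcast() const {
    // Placeholder for the real traversal over the fusion's expressions.
    return false;
  }

  bool has_block_broadcast_ = false;
};
```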
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes. The encoding from the input array to a unique_id is done at the `GraphCache` level, where we record and encode every seen input set. We plumb the unique_id through the `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
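A hedged sketch of the general pattern described above (encode the seen input sizes into a unique id, then memoize launch parameters on that id); the types and helpers below are illustrative stand-ins, not the actual GraphCache / FusionExecutorCache API.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for inferred launch parameters.
struct LaunchParams {
  int gdimx = 1;
  int bdimx = 1;
};

// Encode rank and sizes of every input into a string id; identical input
// shapes therefore map to the same id. Real code would also fold in dtypes
// and anything else the launch parameters depend on.
std::string encodeInputs(const std::vector<std::vector<std::int64_t>>& shapes) {
  std::ostringstream ss;
  for (const auto& shape : shapes) {
    ss << shape.size() << ':';
    for (auto s : shape) ss << s << ',';
    ss << ';';
  }
  return ss.str();
}

class LaunchParamsCache {
 public:
  template <typename InferFn>
  const LaunchParams& getOrInfer(const std::string& unique_id, InferFn infer) {
    auto it = cache_.find(unique_id);
    if (it == cache_.end()) {
      // Only pay the inference cost the first time a shape set is seen.
      it = cache_.emplace(unique_id, infer()).first;
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, LaunchParams> cache_;
};
```

Passing the precomputed unique_id down the call chain, rather than re-encoding the inputs at each layer, is what keeps the repeated inference off the hot path.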
Adding environment variables to:
1. disable fma
2. lower the JIT optimization level for robust Python end-to-end tests
3. disable the fallback path
(see the sketch below for how such switches might be read)
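Below is a hedged sketch of how such environment switches might be read on the C++ side; the variable names in the usage comment are hypothetical placeholders, not the exact names added by this change.

```cpp
#include <cstdlib>
#include <string>

// Treat any value other than unset/"0" as "enabled".
bool envFlagSet(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && std::string(value) != "0";
}

// Read an integer setting, falling back to a default when unset.
int envInt(const char* name, int fallback) {
  const char* value = std::getenv(name);
  return value ? std::atoi(value) : fallback;
}

// Hypothetical usage (placeholder variable names):
//   const bool disable_fma      = envFlagSet("FUSER_DISABLE_FMA");
//   const bool disable_fallback = envFlagSet("FUSER_DISABLE_FALLBACK");
//   const int  jit_opt_level    = envInt("FUSER_JIT_OPT_LEVEL", 3);
// The fma switch would then be forwarded to the device compiler, e.g. as
// NVRTC's "--fmad=false" option.
```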
* Enable Global Intermediate Buffers * Set the default MemoryType to Local * Merge Sync_Allocations into Global_Allocations * Check that all inputs/outputs are in global memory Co-authored-by: Ryan Spring <rspring@nvidia.com>
Split the TV binary expressions across multiple lines
A lightweight built-in instrumentation.
1. To enable tracing, set PYTORCH_CUDA_FUSER_TRACE to the name of the trace file (it will be created, or will overwrite an existing one), e.g. traces\experiment1.trace.
2. Trace files can be viewed in Chrome/Chromium (open a new tab and type chrome://tracing in the address bar).
3. There are other options for viewing traces (Qt Creator or https://ui.perfetto.dev).
4. Since the trace files are in a simple JSON format, it's easy to postprocess or parse them (format defined here: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview).
5. To record a new operation, just add a FUSER_PERF_SCOPE macro at the top of the scope (function scope or an inner block); see the sketch below.
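A hedged, self-contained sketch of the kind of RAII scope that a macro like FUSER_PERF_SCOPE could expand to, emitting Chrome-trace "complete" events into the JSON trace file; the TraceScope class and the trace-file handling here are illustrative only.

```cpp
#include <chrono>
#include <cstdio>

// Writes one Chrome-trace "X" (complete) event per scope: name, start
// timestamp, and duration in microseconds, which chrome://tracing and
// ui.perfetto.dev can render on a timeline.
class TraceScope {
 public:
  TraceScope(const char* name, std::FILE* out)
      : name_(name), out_(out), start_(nowUs()) {}

  ~TraceScope() {
    std::fprintf(
        out_,
        "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,"
        "\"pid\":0,\"tid\":0},\n",
        name_, start_, nowUs() - start_);
  }

 private:
  static long long nowUs() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch())
        .count();
  }

  const char* name_;
  std::FILE* out_;
  long long start_;
};

// A FUSER_PERF_SCOPE-style macro could then expand to a scoped object,
// assuming some traceFile() accessor for the open trace file:
//   #define FUSER_PERF_SCOPE(name) TraceScope _fuser_perf_scope_(name, traceFile())
```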
Adding support for multi-process and multi-threaded tracing.
* Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop * Enable Tiled GEMM example * Check that IterDomain iterates from zero to some positive integer Co-authored-by: Ryan Spring <rspring@nvidia.com>
* Get a crazy test example working. * Change problem size and tile size, still an issue with N > 32. * Add sync threads in loops that read from smem, to make sure we finish reading before writing. * Predicate off threads bound to a broadcast dim of an output when it's in shared memory. * Predicate smem tiling writes based on broadcasted dims in the consumer. * Cleanup example a bit. * Revert "Add sync threads in loops that read from smem, to make sure we finish reading before writing." This reverts commit dffaa76. Revert this in favor of pytorch#383 * Add __syncthreads for Write-After-Read race (pytorch#383) * Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop * Enable Tiled GEMM example * Check that IterDomain iterates from zero to some positive integer Co-authored-by: Ryan Spring <rspring@nvidia.com> * Refactor thread predication for writes to smem Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com> Co-authored-by: Ryan Spring <rdspring1@gmail.com> Co-authored-by: Ryan Spring <rspring@nvidia.com>
Fix reduction heuristics so we don't recompile and we use the correct launch params. Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
This PR introduces a new interface for creating Kernel IR nodes: kir::IrBuilder. This is the only way to create new Kernel IR nodes (so it's easy to track them), and it makes the connection between the IR nodes and the target kernel more explicit. If the Kernel object is readily available, an IrBuilder can be "wrapped" around it directly: kir::IrBuilder ir_builder(kernel); During lowering, another option is to create an IrBuilder for the kernel that is being created: kir::IrBuilder ir_builder(GpuLower::current()->kernel()); Once we have an IR builder instance, creating nodes looks like this: auto new_node = ir_builder.create<kir::Int>(1); auto result = ir_builder.mulExpr(lhs, rhs);
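A hedged, self-contained toy showing the builder pattern described above, where nodes are created only through the builder and are owned by the kernel; the real kir::IrBuilder and kir::Kernel have a much richer node hierarchy and API, so the classes below are only stand-ins.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Node { virtual ~Node() = default; };
struct Int : Node { explicit Int(int v) : value(v) {} int value; };
struct Mul : Node { Mul(Node* l, Node* r) : lhs(l), rhs(r) {} Node* lhs; Node* rhs; };

// The kernel owns every node it contains, so routing all creation through
// it keeps the nodes trackable.
class Kernel {
 public:
  template <typename NodeT, typename... Args>
  NodeT* registerNode(Args&&... args) {
    nodes_.push_back(std::make_unique<NodeT>(std::forward<Args>(args)...));
    return static_cast<NodeT*>(nodes_.back().get());
  }

 private:
  std::vector<std::unique_ptr<Node>> nodes_;
};

// The builder is the single entry point for node creation, tying each new
// node to its target kernel, plus convenience helpers like mulExpr().
class IrBuilder {
 public:
  explicit IrBuilder(Kernel* kernel) : kernel_(kernel) {}

  template <typename NodeT, typename... Args>
  NodeT* create(Args&&... args) {
    return kernel_->registerNode<NodeT>(std::forward<Args>(args)...);
  }

  Node* mulExpr(Node* lhs, Node* rhs) { return create<Mul>(lhs, rhs); }

 private:
  Kernel* kernel_;
};

int main() {
  Kernel kernel;
  IrBuilder ir_builder(&kernel);
  auto* one = ir_builder.create<Int>(1);        // mirrors ir_builder.create<kir::Int>(1)
  auto* result = ir_builder.mulExpr(one, one);  // mirrors ir_builder.mulExpr(lhs, rhs)
  (void)result;
}
```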
* add concretize pass * add proveEqual pass add concretize and equal utility clang-format * clang-tidy and variable naming * variable naming * style fix, doc string, additional tests * update const model for disjointset * variable naming * refactor test, restructure class, re-fromat comments * rename functions and add variable defaults * re-format test, rename functions * style fix * more comment fix * style fix * style fix Co-authored-by: Shiming Song <shimings@nvidia.com>
Force-pushed from 3628926 to deb71c2.
Codecov Report
@@           Coverage Diff           @@
##           master   #45218   +/-   ##
=======================================
  Coverage   68.01%   68.01%
=======================================
  Files         393      393
  Lines       50847    50847
=======================================
  Hits        34583    34583
  Misses      16264    16264

Continue to review the full report at Codecov.
facebook-github-bot left a comment:
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Got a critical lint issue, possibly from already committed code.
@suo sorry, I didn't see your GitHub comments. I ended up clicking the land button a while ago, and it landed in the night.
Summary: A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner-dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, with an eviction policy
- Added reduction fusions with multiple outputs (still a single reduction stage)
- Fixed code generation bugs for the symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved __syncthreads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where the output would come back as FP32

Pull Request resolved: pytorch/pytorch#45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79