[nvFuser] Latency improvements for pointwise + reduction fusion #45218
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Fix csarofeen#306 * Reenable smem block gemm cache test.
Fixes pytorch#230: removing the WAR on the contig flag for broadcasting; removing the tests that were only needed for that WAR.
Add an lstm cell c++ test for convenience.
removing graph copy from critical code path; cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, ex. cloning) Also fixing isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes pytorch#305: adds sys env variables to disable fma and to specify the optimization level for JIT compilation.
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView This is considered a temporary workaround as IRPrinter is meant to be exclusive to the fusion IR. * Add a comment
* Initial dynamic shared memory support: check that shared memory usage is within limits for the current GPU; gather buffers in a single pass; use a single dynamic shared memory allocation for the reduction/broadcast workspace; align dynamic shared memory by data type. Co-authored-by: Ryan Spring <rspring@nvidia.com>
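As a rough illustration of the "align dynamic shared memory by data type" point above, here is a hedged, self-contained sketch of laying out several workspace buffers inside one dynamic shared memory allocation; `SmemBuffer` and `layoutDynamicSmem` are illustrative names, not the actual nvFuser code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative descriptor: buffer size in bytes plus its required alignment
// (typically the element type's size, e.g. 4 for float, 8 for double).
struct SmemBuffer {
  std::size_t bytes;
  std::size_t align;
};

// Place the buffers back-to-back in a single dynamic shared memory block,
// rounding each offset up to that buffer's alignment. The return value is
// the total number of dynamic shared memory bytes to request at launch,
// which can then be checked against the device's shared memory limit.
std::size_t layoutDynamicSmem(const std::vector<SmemBuffer>& buffers,
                              std::vector<std::size_t>& offsets) {
  std::size_t offset = 0;
  offsets.clear();
  for (const auto& buf : buffers) {
    offset = (offset + buf.align - 1) / buf.align * buf.align;  // align up
    offsets.push_back(offset);
    offset += buf.bytes;
  }
  return offset;
}
```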
An example of this error happens with tv4 of testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup * Print names of IterDomains for better debugging experience
(pytorch#333) Add an Executor method to compile from a string for debug usage. Fix the reduction scheduler to have TI-level perf for FP16 inner-dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
(pytorch#339) Move IterVisitor-derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGridReduction/hasBlockReduction, and cache the results of these functions in the executor during compilation. Improves average latency on LSTMCell from 77.5us to 20.5us.
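A minimal sketch of the caching idea, assuming a hypothetical Executor with a separate compile step; the real query and executor are more involved, but the point is that the expensive IR traversal happens once at compile time rather than on every launch.

```cpp
// Hedged sketch: cache an expensive IR query at compile time so the hot
// run path only reads a flag. Names are illustrative placeholders.
class Executor {
 public:
  void compile(/* const Fusion& fusion */) {
    // Pay for the traversal once, when the kernel is compiled...
    has_block_broadcast_ = computeHasBlockBroadcast();
  }

  bool hasBlockBroadcast() const {
    // ...so repeated launches just read the cached result.
    return has_block_broadcast_;
  }

 private:
  bool computeHasBlockBroadcast() const {
    // Placeholder for the real traversal over the fusion's expressions.
    return false;
  }

  bool has_block_broadcast_ = false;
};
```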
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes. The encoding from the input array to a unique_id is done at the `GraphCache` level, where we record and encode every seen input set. We plumb the unique_id through the `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
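A hedged sketch of the general pattern described above (encode the seen input sizes into a unique id, then memoize launch parameters on that id); the types and helpers below are illustrative stand-ins, not the actual GraphCache / FusionExecutorCache API.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for inferred launch parameters.
struct LaunchParams {
  int gdimx = 1;
  int bdimx = 1;
};

// Encode rank and sizes of every input into a string id; identical input
// shapes therefore map to the same id. Real code would also fold in dtypes
// and anything else the launch parameters depend on.
std::string encodeInputs(const std::vector<std::vector<std::int64_t>>& shapes) {
  std::ostringstream ss;
  for (const auto& shape : shapes) {
    ss << shape.size() << ':';
    for (auto s : shape) ss << s << ',';
    ss << ';';
  }
  return ss.str();
}

class LaunchParamsCache {
 public:
  template <typename InferFn>
  const LaunchParams& getOrInfer(const std::string& unique_id, InferFn infer) {
    auto it = cache_.find(unique_id);
    if (it == cache_.end()) {
      // Only pay the inference cost the first time a shape set is seen.
      it = cache_.emplace(unique_id, infer()).first;
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, LaunchParams> cache_;
};
```

Passing the precomputed unique_id down the call chain, rather than re-encoding the inputs at each layer, is what keeps the repeated inference off the hot path.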
Adding environment variables to:
1. disable fma
2. lower the JIT optimization level for robust Python end-to-end tests
3. disable the fallback path
(see the sketch below for how such switches might be read)
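Below is a hedged sketch of how such environment switches might be read on the C++ side; the variable names in the usage comment are hypothetical placeholders, not the exact names added by this change.

```cpp
#include <cstdlib>
#include <string>

// Treat any value other than unset/"0" as "enabled".
bool envFlagSet(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && std::string(value) != "0";
}

// Read an integer setting, falling back to a default when unset.
int envInt(const char* name, int fallback) {
  const char* value = std::getenv(name);
  return value ? std::atoi(value) : fallback;
}

// Hypothetical usage (placeholder variable names):
//   const bool disable_fma      = envFlagSet("FUSER_DISABLE_FMA");
//   const bool disable_fallback = envFlagSet("FUSER_DISABLE_FALLBACK");
//   const int  jit_opt_level    = envInt("FUSER_JIT_OPT_LEVEL", 3);
// The fma switch would then be forwarded to the device compiler, e.g. as
// NVRTC's "--fmad=false" option.
```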
* Enable Global Intermediate Buffers * Set the default MemoryType to Local * Merge Sync_Allocations into Global_Allocations * Check that all inputs/outputs are in global memory Co-authored-by: Ryan Spring <rspring@nvidia.com>
Split the TV binary expressions across multiple lines
A lightweight built-in instrumentation.
1. To enable tracing, set PYTORCH_CUDA_FUSER_TRACE to the name of the trace file (it will be created, or will overwrite an existing one), e.g. traces\experiment1.trace.
2. Trace files can be viewed in Chrome/Chromium (open a new tab and type chrome://tracing in the address bar).
3. There are other options for viewing traces (Qt Creator or https://ui.perfetto.dev).
4. Since the trace files are in a simple JSON format, it's easy to postprocess or parse them (format defined here: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview).
5. To record a new operation, just add a FUSER_PERF_SCOPE macro at the top of the scope (function scope or an inner block); see the sketch below.
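A hedged, self-contained sketch of the kind of RAII scope that a macro like FUSER_PERF_SCOPE could expand to, emitting Chrome-trace "complete" events into the JSON trace file; the TraceScope class and the trace-file handling here are illustrative only.

```cpp
#include <chrono>
#include <cstdio>

// Writes one Chrome-trace "X" (complete) event per scope: name, start
// timestamp, and duration in microseconds, which chrome://tracing and
// ui.perfetto.dev can render on a timeline.
class TraceScope {
 public:
  TraceScope(const char* name, std::FILE* out)
      : name_(name), out_(out), start_(nowUs()) {}

  ~TraceScope() {
    std::fprintf(
        out_,
        "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,"
        "\"pid\":0,\"tid\":0},\n",
        name_, start_, nowUs() - start_);
  }

 private:
  static long long nowUs() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch())
        .count();
  }

  const char* name_;
  std::FILE* out_;
  long long start_;
};

// A FUSER_PERF_SCOPE-style macro could then expand to a scoped object,
// assuming some traceFile() accessor for the open trace file:
//   #define FUSER_PERF_SCOPE(name) TraceScope _fuser_perf_scope_(name, traceFile())
```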
Adding support for multi-process and multi-threaded tracing.
* Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop * Enable Tiled GEMM example * Check that IterDomain iterates from zero to some positive integer Co-authored-by: Ryan Spring <rspring@nvidia.com>
* Get a crazy test example working. * Change problem size and tile size, still an issue with N > 32. * Add sync threads in loops that read from smem, to make sure we finish reading before writing. * Predicate off threads bound to a broadcast dim of an output when it's in shared memory. * Predicate smem tiling writes based on broadcasted dims in the consumer. * Cleanup example a bit. * Revert "Add sync threads in loops that read from smem, to make sure we finish reading before writing." This reverts commit dffaa76. Revert this in favor of pytorch#383 * Add __syncthreads for Write-After-Read race (pytorch#383) * Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop * Enable Tiled GEMM example * Check that IterDomain iterates from zero to some positive integer Co-authored-by: Ryan Spring <rspring@nvidia.com> * Refactor thread predication for writes to smem Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com> Co-authored-by: Ryan Spring <rdspring1@gmail.com> Co-authored-by: Ryan Spring <rspring@nvidia.com>
Fix reduction heuristics so we don't recompile and we use the correct launch params. Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
This PR introduces a new interface for creating Kernel IR nodes: kir::IrBuilder. This is the only way to create new Kernel IR nodes (so it's easy to track them), and it makes the connection between the IR nodes and the target kernel more explicit. If the Kernel object is readily available, an IrBuilder can be "wrapped" around it directly: kir::IrBuilder ir_builder(kernel); During lowering, another option is to create an IrBuilder for the kernel that is being created: kir::IrBuilder ir_builder(GpuLower::current()->kernel()); Once we have an IR builder instance, creating nodes looks like this: auto new_node = ir_builder.create<kir::Int>(1); auto result = ir_builder.mulExpr(lhs, rhs);
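A hedged, self-contained toy showing the builder pattern described above, where nodes are created only through the builder and are owned by the kernel; the real kir::IrBuilder and kir::Kernel have a much richer node hierarchy and API, so the classes below are only stand-ins.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Node { virtual ~Node() = default; };
struct Int : Node { explicit Int(int v) : value(v) {} int value; };
struct Mul : Node { Mul(Node* l, Node* r) : lhs(l), rhs(r) {} Node* lhs; Node* rhs; };

// The kernel owns every node it contains, so routing all creation through
// it keeps the nodes trackable.
class Kernel {
 public:
  template <typename NodeT, typename... Args>
  NodeT* registerNode(Args&&... args) {
    nodes_.push_back(std::make_unique<NodeT>(std::forward<Args>(args)...));
    return static_cast<NodeT*>(nodes_.back().get());
  }

 private:
  std::vector<std::unique_ptr<Node>> nodes_;
};

// The builder is the single entry point for node creation, tying each new
// node to its target kernel, plus convenience helpers like mulExpr().
class IrBuilder {
 public:
  explicit IrBuilder(Kernel* kernel) : kernel_(kernel) {}

  template <typename NodeT, typename... Args>
  NodeT* create(Args&&... args) {
    return kernel_->registerNode<NodeT>(std::forward<Args>(args)...);
  }

  Node* mulExpr(Node* lhs, Node* rhs) { return create<Mul>(lhs, rhs); }

 private:
  Kernel* kernel_;
};

int main() {
  Kernel kernel;
  IrBuilder ir_builder(&kernel);
  auto* one = ir_builder.create<Int>(1);        // mirrors ir_builder.create<kir::Int>(1)
  auto* result = ir_builder.mulExpr(one, one);  // mirrors ir_builder.mulExpr(lhs, rhs)
  (void)result;
}
```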
* add concretize pass * add proveEqual pass add concretize and equal utility clang-format * clang-tidy and variable naming * variable naming * style fix, doc string, additional tests * update const model for disjointset * variable naming * refactor test, restructure class, re-fromat comments * rename functions and add variable defaults * re-format test, rename functions * style fix * more comment fix * style fix * style fix Co-authored-by: Shiming Song <shimings@nvidia.com>
Force-pushed from 3628926 to deb71c2.
Codecov Report
@@           Coverage Diff           @@
##           master   #45218   +/-   ##
=======================================
  Coverage   68.01%   68.01%
=======================================
  Files         393      393
  Lines       50847    50847
=======================================
  Hits        34583    34583
  Misses      16264    16264

Continue to review the full report at Codecov.
facebook-github-bot left a comment:
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Got a critical lint issue, possibly from already committed code.
@suo sorry, I didn't see your GitHub comments. I ended up clicking the land button a while ago, and it landed in the night.
Summary: A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner-dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, with an eviction policy
- Added reduction fusions with multiple outputs (still a single reduction stage)
- Fixed code generation bugs for the symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved __syncthreads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where the output would come back as FP32

Pull Request resolved: pytorch/pytorch#45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79