
Conversation

Collaborator

@jjsjann123 jjsjann123 commented Sep 23, 2020

A lot of changes are in this update, some highlights:

  • Added Doxygen config file
  • Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
  • Improved latency of dynamic shape handling in the fusion logic
  • Prevented recompilation for pointwise + reduction fusions when not needed
  • Improved inner-dimension reduction performance
  • Added an input -> (kernel + kernel launch parameters) cache with an eviction policy
  • Added support for reduction fusions with multiple outputs (still a single reduction stage)
  • Fixed code generation bugs in the symbolic tiled GEMM example
  • Added thread predicates to prevent shared memory from being loaded multiple times
  • Improved __syncthreads placement with shared memory and removed a read-before-write race
  • Fixed FP16 reduction fusions where the output would come back as FP32

@jjsjann123 jjsjann123 requested a review from apaszke as a code owner September 23, 2020 19:04
@facebook-github-bot facebook-github-bot added the oncall: jit label Sep 23, 2020

dr-ci bot commented Sep 23, 2020

💊 CI failures summary and remediations

As of commit 2ecd5fe (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



jjsjann123 and others added 25 commits September 23, 2020 21:45
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Fix csarofeen#306

* Reenable smem block gemm cache test.
Fixes pytorch#230
removing WAR of contig flag for broadcasting
removing unnecessary tests for the WAR
Add an lstm cell c++ test for convenience.
removing graph copy from critical code path;
cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, ex. cloning)

Also fixing isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes pytorch#305
Add a sys env to disable fma and specify the optimization level for jit compilation
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView

This is considered a temporary workaround as IRPrinter is meant to be
exclusive to the fusion IR.

* Add a comment
* Initial Dynamic Shared Memory

Check if shared memory usage is within limits for current GPU

Gather buffers in a single pass

Use single dynamic shared memory for reduction/broadcast workspace

Align dynamic shared memory by data type (see the alignment sketch after this commit message)

Co-authored-by: Ryan Spring <rspring@nvidia.com>
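
The alignment sketch referenced above is a minimal host-side illustration of the bookkeeping this implies; alignOffset and reserveSmem are hypothetical names, not code from this PR:

#include <cstddef>

// Round offset up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t alignOffset(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: reserve `count` elements of type T inside the single
// dynamic shared-memory pool, returning the aligned byte offset and advancing
// the running offset for the next reservation (e.g. reduction then broadcast workspace).
template <typename T>
std::size_t reserveSmem(std::size_t& offset, std::size_t count) {
  offset = alignOffset(offset, alignof(T));
  const std::size_t start = offset;
  offset += count * sizeof(T);
  return start;
}
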
An example of this error happens with tv4 of
testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup

* Print names of IterDomains for better debugging experience
(pytorch#333)

Add Executor method to compile from a string for debug usage.  Fix Reduction Scheduler to have TI level perf for FP16 inner dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
(pytorch#339)

Move IterVisitor derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGrid/BlockReduction, cache results of these functions in executor during compilation. Improves average latency on LSTMCell 77.5us -> 20.5us.
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes.

The encoding from the input array to a unique_id is done at the `GraphCache` level, where we record and encode every seen input. We plumb the unique_id through `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
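
A minimal sketch of the idea, using hypothetical names (InputsIdLookup, lookupId) rather than nvFuser's actual classes: the inputs' static sizes are encoded into a key, and repeated keys map to a previously assigned id, so launch parameters and cache entries are only inferred once per distinct shape.

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct InputShape {
  std::vector<int64_t> sizes;  // only the static sizes participate in the key
};

class InputsIdLookup {  // hypothetical name, for illustration only
 public:
  // Returns the same id for a previously seen combination of input sizes
  // (cache hit), or assigns a fresh id on a miss.
  std::size_t lookupId(const std::vector<InputShape>& inputs) {
    std::string key;
    for (const InputShape& in : inputs) {
      for (int64_t s : in.sizes) {
        key += std::to_string(s);
        key += ',';
      }
      key += ';';
    }
    auto it = encoding_.find(key);
    if (it != encoding_.end()) {
      return it->second;
    }
    const std::size_t id = next_id_++;
    encoding_.emplace(std::move(key), id);
    return id;
  }

 private:
  std::unordered_map<std::string, std::size_t> encoding_;
  std::size_t next_id_ = 0;
};
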
Adding environment variables to (a minimal reading sketch follows the list):

1. disable fma
2. lower the jit optimization level (for robust python end-to-end tests)
3. disable the fallback path
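
A minimal sketch of reading such toggles; the variable names below are hypothetical placeholders, not necessarily the ones this commit introduces:

#include <cstdlib>

// Treats an unset, empty, or "0" value as "off"; anything else as "on".
bool envFlagSet(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && value[0] != '\0' && value[0] != '0';
}

// Usage (names illustrative only):
//   if (envFlagSet("PYTORCH_CUDA_FUSER_DISABLE_FMA"))      { /* compile without fma */ }
//   if (envFlagSet("PYTORCH_CUDA_FUSER_DISABLE_FALLBACK")) { /* surface errors instead */ }
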
* Enable Global Intermediate Buffers

* Set the default MemoryType to Local

* Merge Sync_Allocations into Global_Allocations

* Check that all inputs/outputs are in global memory

Co-authored-by: Ryan Spring <rspring@nvidia.com>
tlemo and others added 11 commits September 23, 2020 21:45
Split the TV binary expressions across multiple lines
A lightweight built-in instrumentation.

1. To enable tracing, set PYTORCH_CUDA_FUSER_TRACE to the name of the trace file (newly created, or an existing one that will be overwritten). Ex. traces\experiment1.trace

2. Trace files can be viewed in Chrome/Chromium (open a new tab and type chrome://tracing in the address bar)

3. There are other options for viewing traces (Qt Creator or https://ui.perfetto.dev).

4. Since the trace files are in a simple json format, it's easy to postprocess or parse them (the format is defined here: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview)

5. To record a new operation, just add a FUSER_PERF_SCOPE macro at the top of the scope (function scope or an inner block); a minimal usage sketch follows.
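
For example (the function below is made up for illustration; only the FUSER_PERF_SCOPE macro comes from the change itself, and the header that defines it is assumed to be included):

void compileFusionExample() {  // hypothetical function, not existing nvFuser code
  FUSER_PERF_SCOPE("compileFusionExample");  // emits a trace event spanning this scope
  // ... work attributed to the "compileFusionExample" event ...
}
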
Adding support for multi-process and multi-threaded tracing.
* Basic Write-After-Read (WAR) check to add __syncthreads to the end of the for-loop (the hazard is illustrated in the sketch after this commit message)

* Enable Tiled GEMM example

* Check that IterDomain iterates from zero to some positive integer

Co-authored-by: Ryan Spring <rspring@nvidia.com>
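
The hazard referenced above, illustrated with a hand-written kernel (not nvFuser-generated code): without the second __syncthreads(), a fast thread could overwrite the shared tile on the next loop iteration while a slower thread is still reading the previous contents.

__global__ void shiftedSum(const float* in, float* out, int n) {
  __shared__ float tile[256];  // assumes a single block of <= 256 threads
  float acc = 0.f;
  for (int base = 0; base < n; base += blockDim.x) {
    const int idx = base + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.f;
    __syncthreads();  // make every thread's write to the tile visible
    // Each thread reads a neighbour's element, i.e. another thread's write.
    acc += tile[(threadIdx.x + 1) % blockDim.x];
    __syncthreads();  // WAR guard: all reads finish before the next iteration overwrites the tile
  }
  out[threadIdx.x] = acc;
}
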
* Get a crazy test example working.

* Change problem size and tile size, still an issue with N > 32.

* Add sync threads in loops that read from smem, to make sure we finish reading before writing.

* Predicate off threads bound to a broadcast dim of an output when it's in shared memory.

* Predicate smem tiling writing based on broadcasted dims in consumer.

* Cleanup example a bit.

* Revert "Add sync threads in loops that read from smem, to make sure we finish reading before writing."

This reverts commit dffaa76.

Revert this in favor of pytorch#383

* Add __syncthreads for Write-After-Read Race (pytorch#383)

* Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop

* Enable Tiled GEMM example

* Check that IterDomain iterates from zero to some positive integer

Co-authored-by: Ryan Spring <rspring@nvidia.com>

* Refactor thread predication for writes to smem

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Co-authored-by: Ryan Spring <rdspring1@gmail.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Fix reduction heuristics so we don't recompile and we use the correct launch params.
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
This PR introduces a new interface for creating Kernel IR nodes: kir::IrBuilder. This is the only way to create new Kernel IR nodes (so it's easy to track them), and it makes the connection between the IR nodes and the target kernel more explicit.

If the Kernel object is readily available, an IrBuilder can be "wrapped" around it directly:

kir::IrBuilder ir_builder(kernel);

During lowering, another option is to create an IrBuilder for the kernel that is being created:

kir::IrBuilder ir_builder(GpuLower::current()->kernel());

Once we have an IR builder instance, creating nodes looks like this:

auto new_node = ir_builder.create<kir::Int>(1);
auto result = ir_builder.mulExpr(lhs, rhs);
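
Putting the quoted calls together, a minimal sketch (only the functions shown above are used; the surrounding lowering context is assumed):

// Wrap a builder around the kernel currently being lowered.
kir::IrBuilder ir_builder(GpuLower::current()->kernel());

// Build the Kernel IR for 2 * 3: two constant nodes plus a multiply expression.
auto lhs = ir_builder.create<kir::Int>(2);
auto rhs = ir_builder.create<kir::Int>(3);
auto product = ir_builder.mulExpr(lhs, rhs);
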
* add concretize pass

* add proveEqual pass

add concretize and equal utility

clang-format

* clang-tidy and variable naming

* variable naming

* style fix, doc string, additional tests

* update const model for disjointset

* variable naming

* refactor test, restructure class, re-format comments

* rename functions and add variable defaults

* re-format test, rename functions

* style fix

* more comment fix

* style fix

* style fix

Co-authored-by: Shiming Song <shimings@nvidia.com>

codecov bot commented Sep 24, 2020

Codecov Report

Merging #45218 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #45218   +/-   ##
=======================================
  Coverage   68.01%   68.01%           
=======================================
  Files         393      393           
  Lines       50847    50847           
=======================================
  Hits        34583    34583           
  Misses      16264    16264           

Continue to review full report at Codecov.

Powered by Codecov. Last update 99242ec...2ecd5fe. Read the comment docs.

@csarofeen csarofeen changed the title from "[WIP] Nvfuser 9 23 pr" to "[nvFuser] Latency improvements for pointwise + reduction fusion" Sep 24, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

soumith commented Sep 24, 2020

Got a critical lint issue, possibly from already-committed code:

This file should only include printable US-ASCII bytes.
Use Unicode escapes like `\u` to represent disallowed values.

torch/csrc/jit/codegen/cuda/lower_unroll.h

Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Member

suo commented Sep 24, 2020

Heads up that this conflicts with #45264. My selfish preference is to land #45264 first, because I have to rebase it every time someone changes a JIT cpp test 😛. But if there is some urgency around landing this, I can rebase my PR as well.

@facebook-github-bot
Contributor

@soumith merged this pull request in 99e0a87.

Contributor

soumith commented Sep 25, 2020

@suo sorry, I didn't see your GitHub comments. I ended up clicking the land button a while ago, and it landed in the night.

jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
- Improved latency of dynamic shape handling in the fusion logic
- Prevented recompilation for pointwise + reduction fusions when not needed
- Improved inner-dimension reduction performance
- Added an input -> (kernel + kernel launch parameters) cache with an eviction policy
- Added support for reduction fusions with multiple outputs (still a single reduction stage)
- Fixed code generation bugs in the symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved __syncthreads placement with shared memory and removed a read-before-write race
- Fixed FP16 reduction fusions where the output would come back as FP32

Pull Request resolved: pytorch/pytorch#45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
