[DO NOT REVIEW] [CI Debug] #40921
Closed
Conversation
Fixing the bugs that Kevin finds.
* Small alloc fix.
* Add another reduction example; change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x.
* Rework allocation and buffer initialization, as init could be placed before alloc; add lots of comments.
* Fix bug in index compute when replaying reduction transformations for buffer initialization.
* RFactor fix when root domain is a reduction but has no rfactor axis.
* Val isConst fix.
* Update remote repo to local repo.

Co-authored-by: jiej <jiej@nvidia.com>
On behalf of @csarofeen: Working on breaking up the lowering logic to be more incremental and easier to follow. This is in preparation for fixing predicates after reductions and in combination with following broadcasts. This replaces #65 for the updated base branch.
* Start commenting lowering better; break the indexing pass into its own class.
* Refactor lowering to break up passes and make the logic more incremental.
* Remove commented-out code.

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
On behalf of @naoyam: blockReduce has a bug when X_THREAD=true, Y_THREAD=false, Z_THREAD=true. This PR adds a test case that exposes the bug, as well as a fix.
* Add a new test case that hits a bug in blockReduce.
* Fix a bug in blockReduce.
* clang-format.
* Rename test function to avoid a conflict.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Rewrite ExpressionEvaluator to use IterVisitor.
* Rename variables per review comments.

Co-authored-by: jiej <jiej@nvidia.com>
Start working on the issue of not predicating based on threads that were used to parallelize a reduction.
* Add eq to arith.h.
* Initial pass at thread predicates for ops after parallelized reductions.
* Remove erroneous print statement.
* Update variable name in reference code in cpp tests.
* clang-tidy.
* Fix typo.
* clang-format.
* clang-tidy again.

Co-authored-by: jiej <jiej@nvidia.com>
Fix CI configuration
Support parallel reductions across thread blocks.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* [nvfuser] Debug flag via env var: allow setting the environment variable PYTORCH_CUDA_FUSER_DEBUG to print the codegen'd CUDA kernel as a string.
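A minimal sketch of how such an env-var debug flag is typically read. The helper names here are hypothetical, not the actual nvfuser functions, and the real implementation may cache the lookup:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not the real nvfuser API): returns true when
// PYTORCH_CUDA_FUSER_DEBUG is set in the environment.
bool isDebugDumpEnabled() {
  return std::getenv("PYTORCH_CUDA_FUSER_DEBUG") != nullptr;
}

// Dump the generated CUDA kernel source only when the flag is set.
void maybeDumpKernel(const char* kernel_src) {
  if (isDebugDumpEnabled()) {
    std::printf("%s\n", kernel_src);
  }
}
```

Gating on the environment at runtime keeps the dump out of normal runs while avoiding a rebuild to enable it.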
* Remove extra newlines in fusion printMath.
* Fix bug in predicate generation.
* Add fusion.printKernel().
* Fix potential segfault in unroll pass.
* Simplify a few test cases: replace custom exception checks with ASSERT_THROW macros.
* ExpressionEvaluator.
* Stricter EvaluationContext binding rules: (1) don't allow overwriting concrete values; (2) don't allow binding values to expression results.
* Fix clang-format errors.
* Switch to Int::ScalarType: the expression evaluator now uses Int::ScalarType instead of plain int.
* Avoid a fight with clang-tidy.
* Check the numbers of kernel input and output parameters.
* Add an optional arc from TensorView to its root domain, generated for detail_level >= DetailLevel::Explicit.
* Check kernel arguments.
* Prefer pointers over references.
* Bug fix.
* Fix accidental construction of IValue.
* Use noReduction.
* Add const to const pointer.
* Make an integer tensor an error, as it is not yet supported.
* clang-tidy.
* Incorporate review feedback.
* Add lerp support in parser.
* Add missing addcmul parser and tests.
* clang-format.
* Return TensorView* from binary/compound/ternary ops.
* clang-format.
* Use TensorView* param in reductionOp and sum.
* Prefer as instead of static_cast.
* Transform replay refactor (#53). The goal of this work is to make transformation history specific to IterDomains instead of TensorDomains. This should make it much easier to match up IterDomains during replay, which can be complicated when taking reduction axes, rfactors, and broadcast axes into consideration. Co-authored-by: Jie <jiej@nvidia.com>, Kevin Stephano <kevin.stephano@gmail.com>
* Python test fixes (#52). Fix python test failures: (1) put Fusion inside CudaKernel to facilitate runtime arg checks; (2) relax the rank check for broadcast support in integration; (3) add shape propagation for the newly added operations addcmul and lerp; (4) add a utility function to create a FusionGuard from a CudaKernel directly.
* [nvFuser] Add torch.jit.fuser context manager (pytorch#38993) (#54). Summary: (1) the `torch.jit.fuser(str)` context manager facilitates switching between backend fusers: 'fuser0' enables only the legacy fuser, 'fuser1' enables only NNC, 'fuser2' enables only nvFuser; (2) clean up the updated python tests. Pull Request resolved: pytorch#38993. Reviewed By: nairbv, pbelevich. Differential Revision: D21800620. Pulled By: soumith. fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85
* Add another reduction example; change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x.
* Minor cleanup.
* Clean up some noexcepts.
* More cleanup.
* Refactor computeAt; get first broadcast example working.
* Validate first non-trivial broadcast kernel.
* Fix replay when broadcast is merged with a non-broadcast dim.
* Add constness in replay and index compute.
* Add another broadcast test. Rework index computation for producers, based on consumer-computed indices.
* Val isConst fix.
* Add dot product GEMM example.
* Clang.
* Minor bug fixes.
* Format and add comments to GEMM test.
* WIP: Fix for enabling broadcast after reduction plus a Softmax test (#66): cleaner way of fixing checks for matching non-broadcast dims to non-reduction dims; Clang. Co-authored-by: Kevin Stephano <kstephano@nvidia.com>, Christian Sarofeen <csarofeen@nvidia.com>
* Back out bad merge conflict resolutions.
* More post-rebase cleanup.
* Re-fix a few tests, some from a bad rebase.
* Address comments.
* Missed some review comments.
* tmp

Co-authored-by: Lemo <lemo1234@gmail.com>
Co-authored-by: Naoya Maruyama <maruyama3@llnl.gov>
Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
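The stricter EvaluationContext binding rules above can be illustrated with a toy context keyed by name. This is a hedged sketch, not the real EvaluationContext (which binds to IR nodes): a concrete value, once bound, may not be overwritten, and expression results may not be bound at all.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Toy stand-in for the binding rules described above; all names hypothetical.
class ToyEvaluationContext {
 public:
  // Rule 2: refuse to bind values that are results of expressions.
  // Rule 1: refuse to overwrite an already-bound concrete value.
  void bind(const std::string& name, int64_t value, bool is_expr_result) {
    if (is_expr_result)
      throw std::invalid_argument("cannot bind a value to an expression result");
    if (bindings_.count(name))
      throw std::invalid_argument("cannot overwrite a concrete value");
    bindings_[name] = value;
  }

  std::optional<int64_t> lookup(const std::string& name) const {
    auto it = bindings_.find(name);
    if (it == bindings_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<std::string, int64_t> bindings_;
};
```

Rejecting rebinds keeps an evaluation internally consistent: once a symbolic extent has been given a concrete value, nothing downstream can silently change it.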
* Fix replace size when a reduction dim is not innermost.
* clang-tidy.
* Remove print statement in test.
This fixes #85:
1. Fix Fusion::validateInputs().
2. Clean up Fusion::~Fusion().
Fix for issue #88
* Fix MSVC (Windows) build

```
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '1': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
```
* Fix newForReduction()
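The C2398 errors above come from C++ list-initialization rules: brace initialization forbids narrowing conversions, and `size_t` to `int64_t` narrows on MSVC. A minimal sketch of the problem and the usual fix (the function name here is hypothetical, not from test_gpu.cpp):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration: building a vector of int64_t extents from
// size_t inputs. The commented-out line is what triggers C2398 on MSVC;
// casting each element to the target type fixes the build.
std::vector<int64_t> toSizes(std::size_t m, std::size_t n) {
  // std::vector<int64_t> bad{m, n};  // error C2398: narrowing conversion
  return {static_cast<int64_t>(m), static_cast<int64_t>(n)};
}
```

GCC and Clang typically only warn here, which is why the break surfaced on the Windows CI build first.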
Allow partition registration to exclude nodes from fusion.
Building block for implementing Fusion copy/move semantics.
* [Graph Partition] Allow partition registration to exclude nodes from fusion.
* [WIP] Adding sum into integration.
* [WIP] Builds now but fails at codegen/scheduling.
* Debugging.
* [WIP] Concept works now in the cpp tests.
* First buggy kernel worked.
* Non-FCD reduction compiled.
* Fix rebasing issue.
* Bug fixes.
* Fix segfault.
* Test passed.
* Revert test modification for bug repro.
* Prototype working!
* Broadcast prior to reduction added.
* Code cleaning to remove hardcoded reduction list.
* Seems to be functionally correct.
* [reduction] Integration broadcast test case added.
* Revert cpp tests.
* Leave unrelated files untouched.
* Remove dead code and debug prints.
* Remove printf.
* autopep8 and clang-format.
* clang-tidy.
* Remove debug print.
* clang-tidy.
* Address review comments.
* Update integration code.
* Revert int64_t changes.
Implementing copy and move operations for Fusion objects. The intention is to provide a generic container view of Fusion objects, i.e. allow them to be copied and moved similarly to std::vector or std::unordered_map containers:
1. Copies are supported, but relatively expensive.
2. Move operations are cheap (and noexcept).
3. Fusion::clear() can be used to reset the IR to a "blank state" (also noexcept).
4. Fusion supports swap() as well.

The cloning machinery is implemented as non-intrusively as possible with the help of the new IrCloner class plus a new "cloning" constructor in each IR node type (e.g. Statement::Statement(const Statement* src, IrCloner* ir_cloner)).
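The container semantics listed above can be sketched with a toy Fusion that owns its nodes through unique_ptr. This is a simplified illustration under stated assumptions, not the real implementation: IrCloner and the actual IR node hierarchy are elided, and `Node` is hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Node { int value; };  // hypothetical stand-in for an IR node

class Fusion {
 public:
  Fusion() = default;

  // 1. Copies are supported but relatively expensive: every node is cloned.
  Fusion(const Fusion& other) {
    nodes_.reserve(other.nodes_.size());
    for (const auto& n : other.nodes_)
      nodes_.push_back(std::make_unique<Node>(*n));
  }

  // 2. Moves are cheap and noexcept: only vector internals are transferred.
  Fusion(Fusion&& other) noexcept = default;

  // Unified copy-and-swap assignment covers both copy- and move-assignment.
  Fusion& operator=(Fusion other) noexcept {
    swap(other);
    return *this;
  }

  // 4. swap() is supported and noexcept.
  void swap(Fusion& other) noexcept { nodes_.swap(other.nodes_); }

  // 3. clear() resets to a "blank state", also noexcept.
  void clear() noexcept { nodes_.clear(); }

  void add(int v) { nodes_.push_back(std::make_unique<Node>(Node{v})); }
  std::size_t size() const { return nodes_.size(); }

 private:
  std::vector<std::unique_ptr<Node>> nodes_;
};
```

The copy-and-swap form gives the strong exception guarantee for assignment while keeping the move path allocation-free, mirroring the cost profile described in points 1 and 2.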
This fixes #105.
This change replaces most instances of struct with class (plus fixing a build break*). Why? This is mostly a stylistic convention, but an important one: it allows us to distinguish plain aggregates (bundles of data) from objects (encapsulation/polymorphism). An easy rule of thumb: if it has a virtual method or access specifiers, it should be a class. For more details:
https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-struct

(*) Incidentally, this also fixes a build break on Windows, where MSVC fails to link if the declarations and definitions don't agree on class vs. struct (e.g. declared as struct Foo but defined as class Foo). This is technically a known MSVC bug, although it is convenient as a consistency check.
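The rule of thumb above can be shown with two toy types. These are hypothetical illustrations, not the actual nvfuser IR declarations:

```cpp
// A plain bundle of data: public members, no invariants, no virtuals.
// Under the rule of thumb, this stays a struct.
struct Extent {
  int start;
  int stop;
};

// Encapsulated, polymorphic type: virtual method, access specifiers,
// private state. Under the rule of thumb, this must be a class.
class Statement {
 public:
  virtual ~Statement() = default;
  virtual const char* name() const { return "Statement"; }
  int id() const { return id_; }
 protected:
  explicit Statement(int id) : id_(id) {}
 private:
  int id_;
};

class Val : public Statement {
 public:
  explicit Val(int id) : Statement(id) {}
  const char* name() const override { return "Val"; }
};
```

Keeping declarations and definitions on the same keyword also sidesteps the MSVC link failure mentioned above.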
Move computeAt logic to a separate file/class. In-progress rework of the computeAt logic.
* Fix Clang++ warnings.
* Test fix.
💊 CI failures summary: as of commit 3053356, there are no failures yet. (This comment was automatically generated by Dr. CI and has been revised 3 times.)
Trying to debug #40864.