[DO NOT REVIEW] [CI Debug] #40921
Closed
Conversation
Fixing the bugs that Kevin finds.
* Small alloc fix.
* Add another reduction example; change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x.
* Rework allocation and buffer initialization, as init could be placed before alloc; add lots of comments.
* Fix bug in index compute when replaying reduction transformations for buffer initialization.
* RFactor fix when root domain is a reduction but has no rfactor axis.
* Val isConst fix.
* Update remote repo to local repo.

Co-authored-by: jiej <jiej@nvidia.com>
On behalf of @csarofeen: Working on breaking up the lowering logic to be more incremental and easier to follow. This is in preparation for fixing predicates after reductions and in combination with following broadcasts. This replaces #65 for the updated base branch.
* Start commenting lowering better; break the indexing pass into its own class.
* Refactor lowering to break up passes and make the logic more incremental.
* Remove commented-out code.

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
On behalf of @naoyam: blockReduce has a bug when X_THREAD=true, Y_THREAD=false, Z_THREAD=true. This PR adds a test case that exposes the bug, as well as a fix.
* Add a new test case that hits a bug in blockReduce.
* Fix a bug in blockReduce.
* clang-format.
* Rename test function to avoid a conflict.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Rewrite ExpressionEvaluator to use IterVisitor.
* Rename variables per review comments.

Co-authored-by: jiej <jiej@nvidia.com>
Start working on the issue of not predicating based on threads that were used to parallelize a reduction.
* Add eq to arith.h.
* Initial pass at thread predicates for ops after parallelized reductions.
* Remove erroneous print statement.
* Update variable name in reference code in cpp tests.
* clang-tidy.
* Fix typo.
* clang-format.
* clang-tidy again.

Co-authored-by: jiej <jiej@nvidia.com>
Fix CI configuration
Support parallel reductions across thread blocks.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* [nvfuser] Debug flag via env var: allow setting the environment variable PYTORCH_CUDA_FUSER_DEBUG to print the codegen'd CUDA kernel as a string.
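A minimal sketch of how such an env-var debug flag is typically read. The helper names here are hypothetical, not the actual nvfuser functions, and the real implementation may cache the lookup:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not the real nvfuser API): returns true when
// PYTORCH_CUDA_FUSER_DEBUG is set in the environment.
bool isDebugDumpEnabled() {
  return std::getenv("PYTORCH_CUDA_FUSER_DEBUG") != nullptr;
}

// Dump the generated CUDA kernel source only when the flag is set.
void maybeDumpKernel(const char* kernel_src) {
  if (isDebugDumpEnabled()) {
    std::printf("%s\n", kernel_src);
  }
}
```

Gating on the environment at runtime keeps the dump out of normal runs while avoiding a rebuild to enable it.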
* Remove extra newlines in fusion printMath.
* Fix bug in predicate generation.
* Add fusion.printKernel().
* Fix potential segfault in unroll pass.
* Simplify a few test cases: replace custom exception checks with ASSERT_THROW macros.
* ExpressionEvaluator.
* Stricter EvaluationContext binding rules: (1) don't allow overwriting concrete values; (2) don't allow binding values to expression results.
* Fix clang-format errors.
* Switch to Int::ScalarType: the expression evaluator now uses Int::ScalarType instead of plain int.
* Avoid a fight with clang-tidy.
* Check the numbers of kernel input and output parameters.
* Add an optional arc from TensorView to its root domain, generated for detail_level >= DetailLevel::Explicit.
* Check kernel arguments.
* Prefer pointers over references.
* Bug fix.
* Fix accidental construction of IValue.
* Use noReduction.
* Add const to const pointer.
* Make an integer tensor an error, as it is not yet supported.
* clang-tidy.
* Incorporate review feedback.
* Add lerp support in parser.
* Add missing addcmul parser and tests.
* clang-format.
* Return TensorView* from binary/compound/ternary ops.
* clang-format.
* Use TensorView* param in reductionOp and sum.
* Prefer as instead of static_cast.
* Transform replay refactor (#53). The goal of this work is to make transformation history specific to IterDomains instead of TensorDomains. This should make it much easier to match up IterDomains during replay, which can be complicated when taking reduction axes, rfactors, and broadcast axes into consideration. Co-authored-by: Jie <jiej@nvidia.com>, Kevin Stephano <kevin.stephano@gmail.com>
* Python test fixes (#52). Fix python test failures: (1) put Fusion inside CudaKernel to facilitate runtime arg checks; (2) relax the rank check for broadcast support in integration; (3) add shape propagation for the newly added operations addcmul and lerp; (4) add a utility function to create a FusionGuard from a CudaKernel directly.
* [nvFuser] Add torch.jit.fuser context manager (pytorch#38993) (#54). Summary: (1) the `torch.jit.fuser(str)` context manager facilitates switching between backend fusers: 'fuser0' enables only the legacy fuser, 'fuser1' enables only NNC, 'fuser2' enables only nvFuser; (2) clean up the updated python tests. Pull Request resolved: pytorch#38993. Reviewed By: nairbv, pbelevich. Differential Revision: D21800620. Pulled By: soumith. fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85
* Add another reduction example; change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x.
* Minor cleanup.
* Clean up some noexcepts.
* More cleanup.
* Refactor computeAt; get first broadcast example working.
* Validate first non-trivial broadcast kernel.
* Fix replay when broadcast is merged with a non-broadcast dim.
* Add constness in replay and index compute.
* Add another broadcast test. Rework index computation for producers, based on consumer-computed indices.
* Val isConst fix.
* Add dot product GEMM example.
* Clang.
* Minor bug fixes.
* Format and add comments to GEMM test.
* WIP: Fix for enabling broadcast after reduction plus a Softmax test (#66): cleaner way of fixing checks for matching non-broadcast dims to non-reduction dims; Clang. Co-authored-by: Kevin Stephano <kstephano@nvidia.com>, Christian Sarofeen <csarofeen@nvidia.com>
* Back out bad merge conflict resolutions.
* More post-rebase cleanup.
* Re-fix a few tests, some from a bad rebase.
* Address comments.
* Missed some review comments.
* tmp

Co-authored-by: Lemo <lemo1234@gmail.com>
Co-authored-by: Naoya Maruyama <maruyama3@llnl.gov>
Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
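The stricter EvaluationContext binding rules above can be illustrated with a toy context keyed by name. This is a hedged sketch, not the real EvaluationContext (which binds to IR nodes): a concrete value, once bound, may not be overwritten, and expression results may not be bound at all.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Toy stand-in for the binding rules described above; all names hypothetical.
class ToyEvaluationContext {
 public:
  // Rule 2: refuse to bind values that are results of expressions.
  // Rule 1: refuse to overwrite an already-bound concrete value.
  void bind(const std::string& name, int64_t value, bool is_expr_result) {
    if (is_expr_result)
      throw std::invalid_argument("cannot bind a value to an expression result");
    if (bindings_.count(name))
      throw std::invalid_argument("cannot overwrite a concrete value");
    bindings_[name] = value;
  }

  std::optional<int64_t> lookup(const std::string& name) const {
    auto it = bindings_.find(name);
    if (it == bindings_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<std::string, int64_t> bindings_;
};
```

Rejecting rebinds keeps an evaluation internally consistent: once a symbolic extent has been given a concrete value, nothing downstream can silently change it.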
* Fix replace size when a reduction dim is not innermost.
* clang-tidy.
* Remove print statement in test.
This fixes #85:
1. Fix Fusion::validateInputs().
2. Clean up Fusion::~Fusion().
Fix for issue #88
* Fix MSVC (Windows) build

```
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '1': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
```
* Fix newForReduction()
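The C2398 errors above come from C++ list-initialization rules: brace initialization forbids narrowing conversions, and `size_t` to `int64_t` narrows on MSVC. A minimal sketch of the problem and the usual fix (the function name here is hypothetical, not from test_gpu.cpp):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration: building a vector of int64_t extents from
// size_t inputs. The commented-out line is what triggers C2398 on MSVC;
// casting each element to the target type fixes the build.
std::vector<int64_t> toSizes(std::size_t m, std::size_t n) {
  // std::vector<int64_t> bad{m, n};  // error C2398: narrowing conversion
  return {static_cast<int64_t>(m), static_cast<int64_t>(n)};
}
```

GCC and Clang typically only warn here, which is why the break surfaced on the Windows CI build first.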
Allow partition registration to exclude nodes from fusion.
Building block for implementing Fusion copy/move semantics.
* [Graph Partition] Allow partition registration to exclude nodes from fusion.
* [WIP] Adding sum into integration.
* [WIP] Builds now but fails at codegen/scheduling.
* Debugging.
* [WIP] Concept works now in the cpp tests.
* First buggy kernel worked.
* Non-FCD reduction compiled.
* Fix rebasing issue.
* Bug fixes.
* Fix segfault.
* Test passed.
* Revert test modification for bug repro.
* Prototype working!
* Broadcast prior to reduction added.
* Code cleaning to remove hardcoded reduction list.
* Seems to be functionally correct.
* [reduction] Integration broadcast test case added.
* Revert cpp tests.
* Leave unrelated files untouched.
* Remove dead code and debug prints.
* Remove printf.
* autopep8 and clang-format.
* clang-tidy.
* Remove debug print.
* clang-tidy.
* Address review comments.
* Update integration code.
* Revert int64_t changes.
Implementing copy and move operations for Fusion objects. The intention is to provide a generic container view of Fusion objects, i.e. allow them to be copied and moved similarly to std::vector or std::unordered_map containers:
1. Copies are supported, but relatively expensive.
2. Move operations are cheap (and noexcept).
3. Fusion::clear() can be used to reset the IR to a "blank state" (also noexcept).
4. Fusion supports swap() as well.

The cloning machinery is implemented as non-intrusively as possible with the help of the new IrCloner class plus a new "cloning" constructor in each IR node type (e.g. Statement::Statement(const Statement* src, IrCloner* ir_cloner)).
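The container semantics listed above can be sketched with a toy Fusion that owns its nodes through unique_ptr. This is a simplified illustration under stated assumptions, not the real implementation: IrCloner and the actual IR node hierarchy are elided, and `Node` is hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Node { int value; };  // hypothetical stand-in for an IR node

class Fusion {
 public:
  Fusion() = default;

  // 1. Copies are supported but relatively expensive: every node is cloned.
  Fusion(const Fusion& other) {
    nodes_.reserve(other.nodes_.size());
    for (const auto& n : other.nodes_)
      nodes_.push_back(std::make_unique<Node>(*n));
  }

  // 2. Moves are cheap and noexcept: only vector internals are transferred.
  Fusion(Fusion&& other) noexcept = default;

  // Unified copy-and-swap assignment covers both copy- and move-assignment.
  Fusion& operator=(Fusion other) noexcept {
    swap(other);
    return *this;
  }

  // 4. swap() is supported and noexcept.
  void swap(Fusion& other) noexcept { nodes_.swap(other.nodes_); }

  // 3. clear() resets to a "blank state", also noexcept.
  void clear() noexcept { nodes_.clear(); }

  void add(int v) { nodes_.push_back(std::make_unique<Node>(Node{v})); }
  std::size_t size() const { return nodes_.size(); }

 private:
  std::vector<std::unique_ptr<Node>> nodes_;
};
```

The copy-and-swap form gives the strong exception guarantee for assignment while keeping the move path allocation-free, mirroring the cost profile described in points 1 and 2.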
This fixes #105.
This change replaces most instances of struct with class (plus fixing a build break*). Why? This is mostly a stylistic convention, but an important one: it allows us to distinguish plain aggregates (bundles of data) from objects (encapsulation/polymorphism). An easy rule of thumb: if it has a virtual method or access specifiers, it should be a class. For more details:
https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-struct

(*) Incidentally, this also fixes a build break on Windows, where MSVC fails to link if the declarations and definitions don't agree on class vs. struct (e.g. declared as struct Foo but defined as class Foo). This is technically a known MSVC bug, although it is convenient as a consistency check.
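The rule of thumb above can be shown with two toy types. These are hypothetical illustrations, not the actual nvfuser IR declarations:

```cpp
// A plain bundle of data: public members, no invariants, no virtuals.
// Under the rule of thumb, this stays a struct.
struct Extent {
  int start;
  int stop;
};

// Encapsulated, polymorphic type: virtual method, access specifiers,
// private state. Under the rule of thumb, this must be a class.
class Statement {
 public:
  virtual ~Statement() = default;
  virtual const char* name() const { return "Statement"; }
  int id() const { return id_; }
 protected:
  explicit Statement(int id) : id_(id) {}
 private:
  int id_;
};

class Val : public Statement {
 public:
  explicit Val(int id) : Statement(id) {}
  const char* name() const override { return "Val"; }
};
```

Keeping declarations and definitions on the same keyword also sidesteps the MSVC link failure mentioned above.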
Move computeAt logic to a separate file/class. In-progress rework of the computeAt logic.
* Fix Clang++ warnings.
* Test fix.
💊 CI failures summary: as of commit 3053356, there are no failures yet. (This comment was automatically generated by Dr. CI and has been revised 3 times.)
Trying to debug #40864.