Conversation

@csarofeen (Contributor) commented Aug 16, 2020

Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.

Overall:

  • Enables pointwise fusion; single (but N-D) broadcast -- pointwise fusion; and single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion (a minimal eager-mode sketch of the last pattern follows these lists).

Integration:

  • Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
  • Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
  • 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

Code Generation:

  • More generic support in code generation for computeAt
  • Full rework of loop nest generation and indexing to more generically handle broadcast operations
  • Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
  • Symbolic (runtime) tiling on grid/block dimensions is supported
  • Simplified index generation based on user-defined input contiguity
  • Automatic broadcast support (similar to numpy/pytorch semantics)
  • Support for compile time constant shared memory buffers
  • Parallelized broadcast support (i.e. block reduction -> block broadcast support)
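
As a rough illustration of the broadcast -- pointwise -- reduction chain referenced above, here is a minimal eager-mode sketch (the function, shapes, and names are illustrative and not taken from this PR; whether the fuser actually picks up a given scripted function depends on how the fusion pass is enabled):

```python
import torch

@torch.jit.script
def bcast_pwise_reduce(t0, bias):
    # t0: [N, M], bias: [M]
    a = t0 + 1.0          # pointwise
    b = a + bias          # single (but N-D) broadcast of bias against a
    c = torch.relu(b)     # pointwise
    return c.sum(dim=-1)  # single (but N-D) reduction

out = bcast_pwise_reduce(torch.randn(128, 64), torch.randn(64))
```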

peterbell10 and others added 30 commits July 7, 2020 09:07
Summary:
Closes pytorch#40560

This adds the equation for the weighted mean to `CrossEntropyLoss`'s docs, and the `reduction` argument docs for `CrossEntropyLoss` and `NLLLoss` no longer describe a non-weighted mean of the outputs.
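
For reference, the reduction being documented is the weighted mean of the per-sample losses (a sketch, where `w_{y_n}` is the weight of target class `y_n` and `l_n` the unweighted per-sample loss; the exact notation in the rendered docs may differ):

```
\ell = \frac{\sum_{n=1}^{N} w_{y_n}\, l_n}{\sum_{n=1}^{N} w_{y_n}}
```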

Pull Request resolved: pytorch#40991

Differential Revision: D22395805

Pulled By: ezyang

fbshipit-source-id: a623b6dd2aab17220fe0bf706bd9b62d6ba531fd
…ction methods. (pytorch#40962)

Summary:
Follow-up to pytorch#36447. Update for pytorch#33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: pytorch#40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8
Summary:
This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
Summary:
Pull Request resolved: pytorch#40856

Add a new activation function, Mish: "A Self Regularized Non-Monotonic Neural Activation Function" (https://arxiv.org/abs/1908.08681).
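
For context, Mish is defined as x * tanh(softplus(x)); a minimal PyTorch sketch of the math (not the Caffe2 operator added in this diff) is:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + exp(x))
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-1.0, 0.0, 1.0])))
```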

Test Plan:
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test -- 'test_mish'

{F242275183}

Differential Revision: D22158035

fbshipit-source-id: 459c1dd0ac5b515913fc09b5f4cd13dcf095af31
Summary: Pull Request resolved: pytorch#40795

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22314215

Pulled By: jamesr66a

fbshipit-source-id: a2fb5c6804d4014f8e437c6858a7be8cd3efb380
Summary:
Fixes pytorch#24557

ASV benchmark:

```
import torch

sizes = [
    (10**6,),
    (1000, 1000),
    (10, 10),
    (1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
]

class EqualTrue:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = self.a.clone()

    def time_equal(self, n):
        torch.equal(self.a, self.b)

class EqualFalse:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = torch.rand(dims, device='cuda')

    def time_equal(self, n):
        torch.equal(self.a, self.b)
```

Old results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0       67.7±7μs
                 1       74.0±2μs
                 2      24.4±0.1μs
                 3      135±0.2μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      59.8±0.2μs
                 1      59.9±0.3μs
                 2      25.0±0.5μs
                 3      136±0.2μs
              ======== ============
```

New results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.4±0.2μs
                 1      44.5±0.4μs
                 2      31.3±0.3μs
                 3      96.6±0.5μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.2±0.2μs
                 1      44.6±0.2μs
                 2      30.8±0.3μs
                 3      97.3±0.2μs
              ======== ============
```
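
As a reminder of the semantics being benchmarked, `torch.equal` returns True only when both tensors have the same size and identical elements (a tiny sanity check, independent of this change):

```python
import torch

a = torch.arange(6.).reshape(2, 3)
print(torch.equal(a, a.clone()))        # True: same shape, same values
print(torch.equal(a, a.reshape(3, 2)))  # False: same values, different shape
```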

Pull Request resolved: pytorch#36483

Differential Revision: D21451829

Pulled By: VitalyFedyunin

fbshipit-source-id: 033e8060192c54f139310aeafe8ba784bab94ded
Summary:
Original commit changeset: 46c59d849fa8

The original commit is breaking the DPER3 release pipeline with the following failures:
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344413239&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202599639  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: feature_preproc/feature_sparse_to_dense/default_float_value
```
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344855973&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202629391  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: tum_preproc/inductive/feature_sparse_to_dense/default_float_value
```

Related UBN tasks: T69529846, T68986110

Test Plan: Build a DPER3 package on top of this commit, and check that DPER3 release test `model_deliverability_test` is passing.

Differential Revision: D22396317

fbshipit-source-id: 92d5b30cc146c005d6159a8d5bfe8973e2c546dd
Summary:
Pull Request resolved: pytorch#40938

already accepted in pytorch#40645

Test Plan: Imported from OSS

Reviewed By: jamesr66a, Krovatkin

Differential Revision: D22394675

Pulled By: eellison

fbshipit-source-id: 1e9dbb24a4cb564d9a68280d2166329ca9fb0425
Summary:
Pull Request resolved: pytorch#40939

Previously, when we did shape analysis by running the op with representative inputs, we always set the grad property to false. This led to incorrect static analysis when we created differentiable subgraphs, propagated shapes without also propagating requires_grad, and then uninlined them.

Test Plan: Imported from OSS

Differential Revision: D22394676

Pulled By: eellison

fbshipit-source-id: 254e6e9f964b40d160befe0e125abe1b7aa2bd5e
Summary:
The most time-consuming tests in test_nn (taking about half the total time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of the Conv3d tests by an order of magnitude and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.

Pull Request resolved: pytorch#40999

Differential Revision: D22396896

Pulled By: ngimel

fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
Summary:
Pull Request resolved: pytorch#40717

`in_dims` specifies which dimension of the input tensors should be
vmapped over. One can also specify `None` as an `in_dim` for a particular
input to indicate that we do not map over said input.

We implement `in_dims` by creating a BatchedTensor with BatchDim equal
to said `in_dim`. Most of this PR is error checking. `in_dims` must
satisfy the following:
- `in_dims` can be either an int or a Tuple[Optional[int]]. If it is an
int, we use it as the `in_dim` for every input.
- If `in_dims` is not None at some index `idx`, then the input at index
`idx` MUST be a tensor (vmap can only map over tensors).

JAX supports something more general: its `in_dims` can match the
structure of the `inputs` to the function (i.e., it is a nested Python
data structure matching the structure of `inputs`, specifying where
in `inputs` the tensors to be mapped are and what their map dims should
be). We don't have that infrastructure yet, so we only support `int` or a
flat tuple for `in_dims`.
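
A minimal usage sketch of the semantics described above (assuming the `torch.vmap` front end; the exact entry point has moved across releases, e.g. `torch.func.vmap` in newer versions):

```python
import torch

def weighted_sum(x, w):
    return (x * w).sum(-1)

xs = torch.randn(5, 3)  # batched along dim 0
w = torch.randn(3)      # shared across the batch

# Map over dim 0 of `xs`; `None` means "do not map over `w`".
out = torch.vmap(weighted_sum, in_dims=(0, None))(xs, w)
assert out.shape == (5,)
```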

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22397914

Pulled By: zou3519

fbshipit-source-id: 56d2e14be8b6024e4cde2729eff384da305b4ea3
Summary:
Closes pytorch#40784

Pull Request resolved: pytorch#41038

Differential Revision: D22404273

Pulled By: malfet

fbshipit-source-id: 8df05f948f069ac95591d523222faa1327429e71
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.

Pull Request resolved: pytorch#40878

Differential Revision: D22404647

Pulled By: ngimel

fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
Summary:
In issue pytorch#36997, the user encountered an unhelpful error message when trying to export the model to ONNX. The Pad operator in opset 9 requires the list of paddings to be constant. This PR tries to improve the error message given to the user when this is not the case.

Pull Request resolved: pytorch#39651

Reviewed By: hl475

Differential Revision: D21992262

Pulled By: houseroad

fbshipit-source-id: b817111c2a40deba85e4c6cdb874c1713312dba1
Summary:
Fix export of full_like when fill_value is of type torch._C.Value.

This PR fixes a bug when exporting GPT2DoubleHeadsModel huggingface/transformers#4950

Pull Request resolved: pytorch#40063

Reviewed By: hl475

Differential Revision: D22398353

Pulled By: houseroad

fbshipit-source-id: 6980a61211fe571c2e4a57716970f474851d811e
Summary:
This PR adds support for the torch `view_as` operator.
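
For reference, `tensor.view_as(other)` is equivalent to `tensor.view(other.size())`; a tiny example of the op being exported:

```python
import torch

x = torch.arange(6.)
template = torch.empty(2, 3)
y = x.view_as(template)  # same as x.view(template.size())
print(y.shape)           # torch.Size([2, 3])
```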

Pull Request resolved: pytorch#40496

Reviewed By: hl475

Differential Revision: D22398318

Pulled By: houseroad

fbshipit-source-id: f92057f9067a201b707aa9b8fc4ad34643dd5fa3
Summary:
It's a known gcc 5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash for the type.

Should fix regression caused by pytorch#40864

Pull Request resolved: pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Summary:
Forgot to add this to pytorch#41055

Pull Request resolved: pytorch#41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
Summary:
Pull Request resolved: pytorch#41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872
Separate scheduling from integration.
Uses ExpressionEvaluator to infer run-time launch configuration
To make it easier to use `runKernel` in cpp tests.
* Add mechanism to compare two domains to find first axis where they aren't the same.

* Minor computeAt refactor.

* Add a test (even though it's broken); minor refactoring of compute_at, minor fixes to compute_at.

* Another computeAt re-write.

* Update computeAt test for softmax.

* Change computeAtData to a class.

* Remove duplicated tests.

* Re-add missing test.

* Cleanup.
This is intended to support cases where a reduced tensor is an input to a parallelized tensor. So, broadcasts after reductions like the one below should now work within a thread block:

t1 = sum(t0, {1});
t2 = broadcast(t1, {false, true});
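
In eager-mode terms, the fuser snippet above corresponds roughly to the following (illustrative only; inside the code generator, `broadcast(t1, {false, true})` re-introduces the reduced dimension as a broadcast dimension):

```python
import torch

t0 = torch.randn(4, 8)
t1 = t0.sum(dim=1)    # sum(t0, {1}): reduce the second dimension
t2 = t1.unsqueeze(1)  # broadcast(t1, {false, true}): re-add dim 1 as broadcastable
out = t0 - t2         # a pointwise consumer, e.g. part of a softmax
```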

The major changes include:

- Add blockBroadcast device function, which is used for broadcasting to dimensions parallelized with TIDx/y/z.
- Update the softmax test. It now matches the ATen output (within a relaxed threshold).
- Add a simplified softmax test, which does not normalize the input by its max.
- Refactor thread predicate computation. Thread predicate information is necessary for both lowering and printing, so I extracted it from the lowering and made it a more independent class.

Limitations and concerns:

- Broadcasting to BID-parallelized dimensions is not supported
- Thread predicates are computed twice, which might be a performance concern, but the cost should still be trivially small compared to, e.g., the computeAt implementation.
* Checking in ported Scheduler Code

* Fixed the ReductionScheduler test. The shared memory launch configuration needs to be set, and I forgot to set a couple of params in a new struct.

* Fixed the scheduler test to work properly with the runKernel() function. Cleaned up the scheduler code to get rid of magic numbers.

* Get rid of some tabs.

* Remove another tab.

* Fix clang formatting.

* Silence clang-tidy about magic numbers that are appropriate.

* Fix code style issues from review.

* Fix more clang-format issues.

* Added a launch compatibility flag to the reduction scheduler.

* Fix up booleans to be explicitly set on a branch.  Added a break to a loop.
@csarofeen csarofeen changed the title [WIP] [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions Aug 17, 2020
@facebook-github-bot (Contributor) left a comment

@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@csarofeen csarofeen changed the title [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions [NVFuser] Enable E2E BCast-PWise-Reduction fusions Aug 17, 2020
@facebook-github-bot (Contributor) left a comment

@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator)

Is it still failing the internal build? It looks like it's stuck there.

@soumith (Contributor) commented Aug 18, 2020

No, it was landing; it's done now.

@facebook-github-bot (Contributor)

@soumith merged this pull request in b3bda94.

@ezyang (Contributor) commented Aug 18, 2020

@jjsjann123 (Collaborator)

Looks like it's the hash on the enum class.
I can try to patch it. How can I verify a fix, as I don't have a repro on my machine (nor in the CI)?

@jjsjann123 (Collaborator)

PR issued. #43222

malfet added a commit to malfet/pytorch that referenced this pull request Aug 18, 2020
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch#43129
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2020
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by #43129

Pull Request resolved: #43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
@csarofeen csarofeen deleted the 20_7_6_devel_rework branch June 9, 2021 13:38
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eager mode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: pytorch/pytorch#43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch/pytorch#43129

Pull Request resolved: pytorch/pytorch#43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eager mode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: pytorch/pytorch#43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch/pytorch#43129

Pull Request resolved: pytorch/pytorch#43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456

Labels: Merged, oncall: jit, open source