Conversation

@csarofeen (Contributor) commented Aug 16, 2020

Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.

Overall:

  • Enables pointwise fusion; single (but N-D) broadcast -- pointwise fusion; and single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion (a minimal eager-mode sketch of the last pattern follows these lists).

Integration:

  • Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
  • Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
  • 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

Code Generation:

  • More generic support in code generation for computeAt
  • Full rework of loop nest generation and indexing to more generically handle broadcast operations
  • Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
  • Symbolic (runtime) tiling on grid/block dimensions is supported
  • Simplified index generation based on user-defined input contiguity
  • Automatic broadcast support (similar to numpy/pytorch semantics)
  • Support for compile time constant shared memory buffers
  • Parallelized broadcast support (i.e. block reduction -> block broadcast support)
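
As a rough illustration of the broadcast -- pointwise -- reduction chain referenced above, here is a minimal eager-mode sketch (the function, shapes, and names are illustrative and not taken from this PR; whether the fuser actually picks up a given scripted function depends on how the fusion pass is enabled):

```python
import torch

@torch.jit.script
def bcast_pwise_reduce(t0, bias):
    # t0: [N, M], bias: [M]
    a = t0 + 1.0          # pointwise
    b = a + bias          # single (but N-D) broadcast of bias against a
    c = torch.relu(b)     # pointwise
    return c.sum(dim=-1)  # single (but N-D) reduction

out = bcast_pwise_reduce(torch.randn(128, 64), torch.randn(64))
```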

peterbell10 and others added 30 commits July 7, 2020 09:07
Summary:
Closes pytorch#40560

This adds the equation for the weighted mean to `CrossEntropyLoss`'s docs, and the `reduction` argument docs for `CrossEntropyLoss` and `NLLLoss` no longer describe a non-weighted mean of the outputs.
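
For reference, the reduction being documented is the weighted mean of the per-sample losses (a sketch, where `w_{y_n}` is the weight of target class `y_n` and `l_n` the unweighted per-sample loss; the exact notation in the rendered docs may differ):

```
\ell = \frac{\sum_{n=1}^{N} w_{y_n}\, l_n}{\sum_{n=1}^{N} w_{y_n}}
```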

Pull Request resolved: pytorch#40991

Differential Revision: D22395805

Pulled By: ezyang

fbshipit-source-id: a623b6dd2aab17220fe0bf706bd9b62d6ba531fd
…ction methods. (pytorch#40962)

Summary:
Follow-up to pytorch#36447. Update for pytorch#33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: pytorch#40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8
Summary:
This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
Summary:
Pull Request resolved: pytorch#40856

Add a new activation function, Mish: "A Self Regularized Non-Monotonic Neural Activation Function" (https://arxiv.org/abs/1908.08681).
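
For context, Mish is defined as x * tanh(softplus(x)); a minimal PyTorch sketch of the math (not the Caffe2 operator added in this diff) is:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + exp(x))
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-1.0, 0.0, 1.0])))
```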

Test Plan:
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test -- 'test_mish'

{F242275183}

Differential Revision: D22158035

fbshipit-source-id: 459c1dd0ac5b515913fc09b5f4cd13dcf095af31
Summary: Pull Request resolved: pytorch#40795

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22314215

Pulled By: jamesr66a

fbshipit-source-id: a2fb5c6804d4014f8e437c6858a7be8cd3efb380
Summary:
Fixes pytorch#24557

ASV benchmark:

```
import torch

sizes = [
    (10**6,),
    (1000, 1000),
    (10, 10),
    (1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
]

class EqualTrue:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = self.a.clone()

    def time_equal(self, n):
        torch.equal(self.a, self.b)

class EqualFalse:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = torch.rand(dims, device='cuda')

    def time_equal(self, n):
        torch.equal(self.a, self.b)
```

Old results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0       67.7±7μs
                 1       74.0±2μs
                 2      24.4±0.1μs
                 3      135±0.2μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      59.8±0.2μs
                 1      59.9±0.3μs
                 2      25.0±0.5μs
                 3      136±0.2μs
              ======== ============
```

New results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.4±0.2μs
                 1      44.5±0.4μs
                 2      31.3±0.3μs
                 3      96.6±0.5μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.2±0.2μs
                 1      44.6±0.2μs
                 2      30.8±0.3μs
                 3      97.3±0.2μs
              ======== ============
```
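
As a reminder of the semantics being benchmarked, `torch.equal` returns True only when both tensors have the same size and identical elements (a tiny sanity check, independent of this change):

```python
import torch

a = torch.arange(6.).reshape(2, 3)
print(torch.equal(a, a.clone()))        # True: same shape, same values
print(torch.equal(a, a.reshape(3, 2)))  # False: same values, different shape
```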

Pull Request resolved: pytorch#36483

Differential Revision: D21451829

Pulled By: VitalyFedyunin

fbshipit-source-id: 033e8060192c54f139310aeafe8ba784bab94ded
Summary:
Original commit changeset: 46c59d849fa8

The original commit is breaking the DPER3 release pipeline with the following failures:
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344413239&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202599639  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: feature_preproc/feature_sparse_to_dense/default_float_value
```
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344855973&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202629391  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: tum_preproc/inductive/feature_sparse_to_dense/default_float_value
```

Related UBN tasks: T69529846, T68986110

Test Plan: Build a DPER3 package on top of this commit, and check that DPER3 release test `model_deliverability_test` is passing.

Differential Revision: D22396317

fbshipit-source-id: 92d5b30cc146c005d6159a8d5bfe8973e2c546dd
Summary:
Pull Request resolved: pytorch#40938

already accepted in pytorch#40645

Test Plan: Imported from OSS

Reviewed By: jamesr66a, Krovatkin

Differential Revision: D22394675

Pulled By: eellison

fbshipit-source-id: 1e9dbb24a4cb564d9a68280d2166329ca9fb0425
Summary:
Pull Request resolved: pytorch#40939

Previously, when we did shape analysis by running the op with representative inputs, we always set the grad property to false. This led to incorrect static analysis when we created differentiable subgraphs, propagated shapes without also propagating requires_grad, and then uninlined them.

Test Plan: Imported from OSS

Differential Revision: D22394676

Pulled By: eellison

fbshipit-source-id: 254e6e9f964b40d160befe0e125abe1b7aa2bd5e
Summary:
The most time-consuming tests in test_nn (taking about half the total time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of the Conv3d tests by an order of magnitude and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.

Pull Request resolved: pytorch#40999

Differential Revision: D22396896

Pulled By: ngimel

fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
Summary:
Pull Request resolved: pytorch#40717

`in_dims` specifies which dimension of the input tensors should be
vmapped over. One can also specify `None` as an `in_dim` for a particular
input to indicate that we do not map over said input.

We implement `in_dims` by creating a BatchedTensor with BatchDim equal
to said `in_dim`. Most of this PR is error checking. `in_dims` must
satisfy the following:
- `in_dims` can be either an int or a Tuple[Optional[int]]. If it is an
int, we use it as the `in_dim` for every input.
- If `in_dims` is not None at some index `idx`, then the input at index
`idx` MUST be a tensor (vmap can only map over tensors).

JAX supports something more general: its `in_dims` can match the
structure of the `inputs` to the function (i.e., it is a nested Python
data structure matching the structure of `inputs`, specifying where
in `inputs` the tensors to be mapped are and what their map dims should
be). We don't have that infrastructure yet, so we only support `int` or a
flat tuple for `in_dims`.
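
A minimal usage sketch of the semantics described above (assuming the `torch.vmap` front end; the exact entry point has moved across releases, e.g. `torch.func.vmap` in newer versions):

```python
import torch

def weighted_sum(x, w):
    return (x * w).sum(-1)

xs = torch.randn(5, 3)  # batched along dim 0
w = torch.randn(3)      # shared across the batch

# Map over dim 0 of `xs`; `None` means "do not map over `w`".
out = torch.vmap(weighted_sum, in_dims=(0, None))(xs, w)
assert out.shape == (5,)
```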

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22397914

Pulled By: zou3519

fbshipit-source-id: 56d2e14be8b6024e4cde2729eff384da305b4ea3
Summary:
Closes pytorch#40784

Pull Request resolved: pytorch#41038

Differential Revision: D22404273

Pulled By: malfet

fbshipit-source-id: 8df05f948f069ac95591d523222faa1327429e71
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.

Pull Request resolved: pytorch#40878

Differential Revision: D22404647

Pulled By: ngimel

fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
Summary:
In issue pytorch#36997, the user encountered an unhelpful error message when trying to export the model to ONNX. The Pad operator in opset 9 requires the list of paddings to be constant. This PR tries to improve the error message given to the user when this is not the case.

Pull Request resolved: pytorch#39651

Reviewed By: hl475

Differential Revision: D21992262

Pulled By: houseroad

fbshipit-source-id: b817111c2a40deba85e4c6cdb874c1713312dba1
Summary:
Fix export of full_like when fill_value is of type torch._C.Value.

This PR fixes a bug when exporting GPT2DoubleHeadsModel huggingface/transformers#4950

Pull Request resolved: pytorch#40063

Reviewed By: hl475

Differential Revision: D22398353

Pulled By: houseroad

fbshipit-source-id: 6980a61211fe571c2e4a57716970f474851d811e
Summary:
This PR adds support for the torch `view_as` operator.
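
For reference, `tensor.view_as(other)` is equivalent to `tensor.view(other.size())`; a tiny example of the op being exported:

```python
import torch

x = torch.arange(6.)
template = torch.empty(2, 3)
y = x.view_as(template)  # same as x.view(template.size())
print(y.shape)           # torch.Size([2, 3])
```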

Pull Request resolved: pytorch#40496

Reviewed By: hl475

Differential Revision: D22398318

Pulled By: houseroad

fbshipit-source-id: f92057f9067a201b707aa9b8fc4ad34643dd5fa3
Summary:
It's a known gcc 5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash for the type.

Should fix regression caused by pytorch#40864

Pull Request resolved: pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Summary:
Forgot to add this to pytorch#41055

Pull Request resolved: pytorch#41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
Summary:
Pull Request resolved: pytorch#41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872
Separate scheduling from integration.
Uses ExpressionEvaluator to infer run-time launch configuration
To make it easier to use `runKernel` in cpp tests.
* Add mechanism to compare two domains to find first axis where they aren't the same.

* Minor computeAt refactor.

* Add a test (even though it's broken); minor refactoring of compute_at, minor fixes to compute_at.

* Another computeAt re-write.

* Update computeAt test for softmax.

* Change computeAtData to a class.

* Remove duplicated tests.

* Re-add missing test.

* Cleanup.
This is intended to support cases where a reduced tensor is an input to a parallelized tensor. So, broadcasts after reductions like the one below should now work within a thread block:

t1 = sum(t0, {1});
t2 = broadcast(t1, {false, true});
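
In eager-mode terms, the fuser snippet above corresponds roughly to the following (illustrative only; inside the code generator, `broadcast(t1, {false, true})` re-introduces the reduced dimension as a broadcast dimension):

```python
import torch

t0 = torch.randn(4, 8)
t1 = t0.sum(dim=1)    # sum(t0, {1}): reduce the second dimension
t2 = t1.unsqueeze(1)  # broadcast(t1, {false, true}): re-add dim 1 as broadcastable
out = t0 - t2         # a pointwise consumer, e.g. part of a softmax
```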

The major changes include:

- Add blockBroadcast device function, which is used for broadcasting to dimensions parallelized with TIDx/y/z.
- Update the softmax test. It now matches the ATen output (within a relaxed threshold).
- Add a simplified softmax test, which does not normalize the input by its max.
- Refactor thread predicate computation. Thread predicate information is necessary for both lowering and printing, so I extracted it from the lowering and made it a more independent class.

Limitations and concerns:

- Broadcasting to BID-parallelized dimensions is not supported
- Thread predicates are computed twice, which might be a performance concern, but the cost should still be trivially small compared to, e.g., the computeAt implementation.
* Checking in ported Scheduler Code

* Fixed the ReductionScheduler test. The shared memory launch configuration needs to be set, and I forgot to set a couple of params in a new struct.

* Fixed the scheduler test to work properly with the runKernel() function. Cleaned up the scheduler code to get rid of magic numbers.

* Get rid of some tabs.

* Remove another tab.

* Fix clang formatting.

* Silence clang-tidy about magic numbers that are appropriate.

* Fix code style issues from review.

* Fix more clang-format issues.

* Added a launch compatibility flag to the reduction scheduler.

* Fix up booleans to be explicitly set on a branch.  Added a break to a loop.
@csarofeen csarofeen changed the title [WIP] [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions Aug 17, 2020
@facebook-github-bot (Contributor) left a comment

@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@csarofeen csarofeen changed the title [NVFuser] Enable E2E arbitrary BCast-PWise-Reduction fusions [NVFuser] Enable E2E BCast-PWise-Reduction fusions Aug 17, 2020
@facebook-github-bot (Contributor) left a comment

@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator)

Is it still failing the internal build? It looks like it's stuck there.

@soumith (Contributor) commented Aug 18, 2020

No, it was landing; it's done now.

@facebook-github-bot (Contributor)

@soumith merged this pull request in b3bda94.

@ezyang (Contributor) commented Aug 18, 2020

@jjsjann123 (Collaborator)

Looks like it's the hash on the enum class.
I can try to patch it. How can I verify a fix, as I don't have a repro on my machine (nor in the CI)?

@jjsjann123 (Collaborator)

PR issued. #43222

malfet added a commit to malfet/pytorch that referenced this pull request Aug 18, 2020
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch#43129
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2020
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by #43129

Pull Request resolved: #43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
@csarofeen csarofeen deleted the 20_7_6_devel_rework branch June 9, 2021 13:38
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eager mode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: pytorch/pytorch#43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch/pytorch#43129

Pull Request resolved: pytorch/pytorch#43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eager mode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: pytorch/pytorch#43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by a missed copy elision.
This regression was introduced by pytorch/pytorch#43129

Pull Request resolved: pytorch/pytorch#43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456

Labels: Merged, oncall: jit, open source