Conversation

@VitalyFedyunin (Contributor) commented Jul 15, 2019

Move the CPU implementation of the addcmul operator to ATen (#22797).

### before

```python
In [11]: timeit x.addcmul(a, b)
1.31 ms ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

### after

```python
In [9]: timeit x.addcmul(a, b)
588 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Adding custom code for the case when `value == 1` doesn't provide a significant performance gain.
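For reference, `addcmul` computes the elementwise result `input + value * tensor1 * tensor2`. The PR does not state the tensor shapes used for the timings above, so the snippet below is only a hedged sketch of how such a measurement could be reproduced outside IPython; the shapes and run count are assumptions.

```python
import timeit

import torch

# Hypothetical shapes -- the PR does not say what sizes were benchmarked.
x = torch.randn(1000, 1000)
a = torch.randn(1000, 1000)
b = torch.randn(1000, 1000)

# Rough stand-in for IPython's `timeit x.addcmul(a, b)`.
runs = 1000
elapsed = timeit.timeit(lambda: x.addcmul(a, b), number=runs)
print(f"{elapsed / runs * 1e6:.1f} µs per call (mean over {runs} calls)")
```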

@pytorchbot added the module: cpu, module: internals, and module: operators labels on Jul 15, 2019
@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ifedan (Contributor) left a comment

Do you have tests?

@VitalyFedyunin requested review from ifedan and xuhdev on July 22, 2019 19:49
@VitalyFedyunin (Contributor, Author):

@pytorchbot retest this please

@VitalyFedyunin (Contributor, Author):

> Do you have tests?

Already tested by common_methods_invocations.py
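As a rough illustration only (not the actual test driven by common_methods_invocations.py), a correctness check for the ported kernel boils down to comparing `addcmul` against its elementwise definition:

```python
import torch

# Minimal sanity check, not the real test suite: the CPU kernel should
# match the definition input + value * tensor1 * tensor2.
x = torch.randn(64, 64, dtype=torch.float64)
a = torch.randn(64, 64, dtype=torch.float64)
b = torch.randn(64, 64, dtype=torch.float64)
value = 0.5

out = x.addcmul(a, b, value=value)
ref = x + value * a * b
assert torch.allclose(out, ref)
```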

@gchanan (Contributor) left a comment

I'm curious -- wouldn't it only be a few more lines to move the CUDA implementation as well?

@VitalyFedyunin (Contributor, Author):

> I'm curious -- wouldn't it only be a few more lines to move the CUDA implementation as well?

Only to show how the migration can be split into separate steps, and to have this PR serve as an example of that.

@VitalyFedyunin merged this pull request in 3b5daef.

zdevito pushed a commit to zdevito/ATen that referenced this pull request on Aug 1, 2019:
Summary:
Move the CPU implementation of the `addcmul` operator to ATen (pytorch/pytorch#22797).

### before

```python
In [11]: timeit x.addcmul(a, b)
1.31 ms ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

### after

```python
In [9]: timeit x.addcmul(a, b)
588 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Adding custom code for the case when `value == 1` doesn't provide a significant performance gain.
Pull Request resolved: pytorch/pytorch#22874

Differential Revision: D16359348

Pulled By: VitalyFedyunin

fbshipit-source-id: 941ead835672fca78a1fcc762da052e64308b111

Labels: Merged, module: cpu (CPU specific problem, e.g., perf, algorithm), module: internals (related to internal abstractions in c10 and ATen)

7 participants