Conversation


@CaoE CaoE commented Jul 29, 2022

Description

  • Add BFloat16 support for mish and hardtanh backward on CPU.
  • Optimize the performance of silu.

Testing

  • Optimized silu performance for bfloat16 (see the timing sketch after these numbers):

single socket (28 cores):

before: 1x128x1024   forward 0.090 s  backward 0.218 s
        10x128x1024  forward 0.146 s  backward 0.314 s

after:  1x128x1024   forward 0.064 s  backward 0.100 s
        10x128x1024  forward 0.085 s  backward 0.133 s

single core:

before: 1x128x1024   forward 0.300 s  backward  0.606 s
        10x128x1024  forward 2.825 s  backward  5.834 s

after:  1x128x1024   forward 0.156 s backward   0.239 s
        10x128x1024  forward 1.447 s backward   2.165 s
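
For context, a rough timing sketch along these lines can be run from Python. The shapes match the cases above; the iteration count, warm-up, and timing method are illustrative and not the exact harness used for the numbers.

```python
import time
import torch
import torch.nn.functional as F

def bench_silu(shape, dtype=torch.bfloat16, iters=100):
    x = torch.randn(shape, dtype=dtype, requires_grad=True)
    grad = torch.ones(shape, dtype=dtype)

    # Warm-up so first-call overhead is not measured.
    for _ in range(10):
        F.silu(x)

    start = time.time()
    for _ in range(iters):
        F.silu(x)
    fwd = (time.time() - start) / iters

    start = time.time()
    for _ in range(iters):
        y = F.silu(x)
        y.backward(grad)
        x.grad = None
    # The backward loop also runs forward, so subtract the forward time.
    bwd = (time.time() - start) / iters - fwd
    return fwd, bwd

for shape in [(1, 128, 1024), (10, 128, 1024)]:
    fwd, bwd = bench_silu(shape)
    print(f"{shape}: forward {fwd:.4f} s  backward {bwd:.4f} s")
```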
  • Added BFloat16 support for mish and hardtanh backward on CPU (see the bfloat16-vs-fp32 check sketch after the tables below).

single socket (20 cores):

| op       | shape             | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s |
| -------- | ----------------- | ---------------- | ----------------- | ---------------- | ----------------- |
| silu     | [10, 128, 10, 10] | 4.41E-05         | 7.67E-05          | 5.32E-05         | 9.38E-05          |
| silu     | [10, 128, 80, 80] | 0.0008           | 0.001788          | 0.00067          | 0.001031          |
| mish     | [10, 128, 10, 10] | 0.000356         | 0.000427          | 0.000367         | 0.000436          |
| mish     | [10, 128, 80, 80] | 0.004527         | 0.005807          | 0.004757         | 0.005393          |
| hardtanh | [10, 128, 10, 10] | /                | 3.97E-05          | /                | 4.45E-05          |
| hardtanh | [10, 128, 80, 80] | /                | 0.001748          | /                | 0.000645          |

single core:

| op       | shape             | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s |
| -------- | ----------------- | ---------------- | ----------------- | ---------------- | ----------------- |
| silu     | [10, 128, 10, 10] | 1.17E-04         | 1.91E-04          | 1.35E-04         | 2.23E-04          |
| silu     | [10, 128, 80, 80] | 0.007434         | 0.013141          | 0.008464         | 0.013044          |
| mish     | [10, 128, 10, 10] | 0.00103          | 0.00122           | 0.00106          | 0.001227          |
| mish     | [10, 128, 80, 80] | 0.065629         | 0.078418          | 0.067779         | 0.077214          |
| hardtanh | [10, 128, 10, 10] | /                | 1.18E-04          | /                | 9.30E-05          |
| hardtanh | [10, 128, 80, 80] | /                | 0.010773          | /                | 0.005834          |
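
As a rough sketch of what the added bfloat16 coverage means at the Python level (not the actual tests in this PR; shapes and tolerances are illustrative), bfloat16 forward and backward results can be compared against an fp32 reference:

```python
import torch
import torch.nn.functional as F

# Compare bfloat16 forward/backward results against an fp32 reference.
# Shapes and tolerances are illustrative only.
def check_bf16_against_fp32(fn, shape=(10, 128, 10, 10)):
    x_fp32 = torch.randn(shape, requires_grad=True)
    x_bf16 = x_fp32.detach().to(torch.bfloat16).requires_grad_()

    y_fp32 = fn(x_fp32)
    y_bf16 = fn(x_bf16)
    torch.testing.assert_close(y_bf16.float(), y_fp32, rtol=1e-2, atol=1e-2)

    y_fp32.sum().backward()
    y_bf16.sum().backward()
    torch.testing.assert_close(x_bf16.grad.float(), x_fp32.grad, rtol=1e-2, atol=1e-2)

check_bf16_against_fp32(F.mish)
check_bf16_against_fp32(lambda t: F.hardtanh(t, min_val=-1.0, max_val=1.0))
```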

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


facebook-github-bot commented Jul 29, 2022

🔗 Helpful links

✅ No Failures (1 Pending)

As of commit 4bede7f549 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@CaoE CaoE changed the title from "Add bf16 support for activation on CPU" to "Add BFloat16 support for activation on CPU" Aug 1, 2022
@CaoE CaoE force-pushed the ecao/bf16_ops2 branch 3 times, most recently from 7c70c00 to 0ee7dd6, on August 2, 2022 04:49
@CaoE CaoE marked this pull request as ready for review August 3, 2022 00:58
@CaoE CaoE requested review from mruberry and ngimel as code owners August 3, 2022 00:58
@yanbing-j yanbing-j added the "intel priority" (matters to Intel architecture performance-wise) and "intel" (tag for PRs from Intel) labels Aug 3, 2022
@albanD albanD added the "triaged" (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 3, 2022
@ngimel ngimel requested review from frank-wei and removed request for mruberry and ngimel August 5, 2022 16:52
@pytorch-bot pytorch-bot bot added the "release notes: nn" (release notes category) label Sep 26, 2022

pytorch-bot bot commented Sep 27, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82460

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 534c73b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@CaoE CaoE force-pushed the ecao/bf16_ops2 branch 4 times, most recently from bca91b0 to 3e2ba41, on September 29, 2022 01:23

CaoE commented Sep 29, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: PR #82460 has not been reviewed yet (Rule superuser)

Details for Dev Infra team: raised by workflow job


CaoE commented Sep 29, 2022

Hi @frank-wei, could you please review this PR? Thank you.

@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.


linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: CaoE / name: Cao E (74dceca12e3877175c80d9ccaa9c84529d0ee150)


jgong5 commented Oct 20, 2022

@CaoE Do you mind making the PR title more specific about the changes you made? The word "activation" sounds too general.

@CaoE CaoE changed the title from "Add BFloat16 support for activation on CPU" to "Add BFloat16 support and optimization for mish, hardtanh, and silu on CPU" Oct 21, 2022
@CaoE CaoE marked this pull request as draft October 21, 2022 01:06

CaoE commented Oct 21, 2022

Changed the title, and I will provide more performance numbers later.

@github-actions github-actions bot added the "module: cpu" (CPU-specific problem, e.g., perf, algorithm) label Nov 7, 2022
@CaoE CaoE marked this pull request as ready for review November 8, 2022 00:58
@CaoE CaoE requested a review from kit1980 November 9, 2022 08:06

CaoE commented Nov 9, 2022

Hi @kit1980, could you please review this PR? Thank you.

@CaoE CaoE changed the title from "Add BFloat16 support and optimization for mish, hardtanh, and silu on CPU" to "Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU" Nov 11, 2022

@malfet malfet left a comment


LGTM, but see nits


Nit

Suggested change
return (float(self_val) <= min_val || float(self_val) >= max_val) ? float(0) : float(grad_val);
return (float(self_val) <= min_val || float(self_val) >= max_val) ? BFloat16(0) : grad_val;
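
For reference, the rule this backward kernel implements, gradient passed through only strictly inside [min_val, max_val] and zero elsewhere, can be observed from Python (the values below are illustrative):

```python
import torch
import torch.nn.functional as F

# hardtanh backward: gradient is grad_output inside (min_val, max_val), zero outside.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.bfloat16, requires_grad=True)
y = F.hardtanh(x, min_val=-1.0, max_val=1.0)
y.backward(torch.ones_like(y))
print(x.grad)  # expected: [0., 1., 1., 0.] in bfloat16
```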


Nit

Suggested change
const Vectorized<float> kOneVec(float(1));
const Vectorized<float> kOneVec(1.0f);


Nit

Suggested change
return float(x) / (float(1) + std::exp(-float(x)));
return float(x) / (1.0f + std::exp(-float(x)));


Nit

Suggested change
const Vectorized<float> kOneVec(float(1));
const Vectorized<float> kOneVec(1.0f);

Comment on lines 1094 to 1095

Suggested change
float(1) / (float(1) + std::exp(-float(x)));
return dy * sigmoid * (float(1) + x * (float(1) - sigmoid));
1.0f / (1.0f + std::exp(-float(x)));
return dy * sigmoid * (1.0f + x * (1.0f - sigmoid));
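
This is the standard silu gradient, d/dx[x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x))); a quick autograd cross-check (illustrative, not part of this PR) looks like:

```python
import torch
import torch.nn.functional as F

# Cross-check the closed-form silu gradient against autograd (illustrative).
x = torch.randn(1000, requires_grad=True)
F.silu(x).sum().backward()

s = torch.sigmoid(x.detach())
manual = s * (1.0 + x.detach() * (1.0 - s))
torch.testing.assert_close(x.grad, manual, rtol=1e-5, atol=1e-6)
```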


Suggested change
const Vec kOneVec(float(1));
const Vec kOneVec(1.0f);


Suggested change
float(1) / (float(1) + std::exp(-float(x)));
1.0f / (1.0f + std::exp(-float(x)));


Suggested change
return dy * (tanh_softplus + x * sigmoid * (float(1) - tanh_softplus * tanh_softplus));
return dy * (tanh_softplus + x * sigmoid * (1.0f - tanh_softplus * tanh_softplus));
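
Likewise for mish, with mish(x) = x * tanh(softplus(x)), the gradient is tanh(softplus(x)) + x * sigmoid(x) * (1 - tanh(softplus(x))^2); an illustrative autograd cross-check:

```python
import torch
import torch.nn.functional as F

# Cross-check the closed-form mish gradient against autograd (illustrative).
x = torch.randn(1000, requires_grad=True)
F.mish(x).sum().backward()

xd = x.detach()
tanh_sp = torch.tanh(F.softplus(xd))
manual = tanh_sp + xd * torch.sigmoid(xd) * (1.0 - tanh_sp * tanh_sp)
torch.testing.assert_close(x.grad, manual, rtol=1e-5, atol=1e-6)
```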


CaoE commented Nov 17, 2022

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the "ciflow/trunk" (trigger trunk jobs on your pull request) label Nov 17, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…d silu on CPU (pytorch#82460)

Pull Request resolved: pytorch#82460
Approved by: https://github.com/mingfeima, https://github.com/malfet