
Conversation

@CaoE (Collaborator) commented Aug 2, 2021

Add BFloat16 support for smooth_l1_loss on CPU.
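
For context, this enables BFloat16 inputs for smooth_l1_loss on CPU in both the forward and backward pass. A minimal usage sketch (the shape is illustrative, not from the PR):

```python
import torch
import torch.nn.functional as F

# BFloat16 inputs on CPU; the shape is chosen for illustration.
x = torch.randn(1, 3, 128, 128, dtype=torch.bfloat16, requires_grad=True)
y = torch.randn(1, 3, 128, 128, dtype=torch.bfloat16)

loss = F.smooth_l1_loss(x, y)    # forward dispatches the BFloat16 CPU kernel
loss.backward()                  # backward as well

print(loss.dtype, x.grad.dtype)  # torch.bfloat16 torch.bfloat16
```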

@facebook-github-bot (Contributor) commented Aug 2, 2021

💊 CI failures summary and remediations

As of commit 9f09c88 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

@iramazanli iramazanli added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Aug 2, 2021
@VitalyFedyunin (Contributor) commented

Hi! Can you please clarify why we can't use https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h#L154 here?

@CaoE (Collaborator, Author) commented Aug 3, 2021

@VitalyFedyunin Hi! This is because Vectorized<BFloat16> is cast to Vectorized<float> at the beginning, which reduces the type-conversion overhead of the intermediate operations.
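
The trade-off can be sketched at the Python tensor level. This is a toy emulation of the two strategies using the standard smooth L1 formula with beta = 1, not the ATen vectorized kernel itself: per-op BFloat16 arithmetic rounds every intermediate result, while upcasting once keeps all intermediates in float and rounds only the final store.

```python
import torch

beta = 1.0  # default beta of smooth_l1_loss

def smooth_l1(a, b):
    # Elementwise smooth L1: 0.5 * d^2 / beta if |d| < beta, else |d| - 0.5 * beta
    d = (a - b).abs()
    return torch.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta)

x = torch.randn(1 << 16, dtype=torch.bfloat16)
t = torch.randn(1 << 16, dtype=torch.bfloat16)

per_op = smooth_l1(x, t)                             # every intermediate rounded to bf16
upcast = smooth_l1(x.float(), t.float()).bfloat16()  # compute in fp32, round once at the end

ref = smooth_l1(x.double(), t.double())              # high-precision reference
print((per_op.double() - ref).abs().max().item())
print((upcast.double() - ref).abs().max().item())
```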

@VitalyFedyunin (Contributor) commented

Overall looks good; I would like to see benchmark numbers for different input sizes.

@CaoE CaoE force-pushed the bf16_loss branch 2 times, most recently from a880982 to 45f51e0 Compare August 11, 2021 07:09
@CaoE CaoE closed this Aug 11, 2021
@CaoE CaoE reopened this Aug 12, 2021
@CaoE (Collaborator, Author) commented Aug 12, 2021

Hi~
Single-core performance was tested on a Xeon(R) Platinum 8180 @ 2.5 GHz.

input: 1x3x1x6:
    Using Vectorized<BFloat16> directly:      forward: 0.185 ms,  backward: 0.179 ms
    Cast to Vectorized<float> at beginning:   forward: 0.183 ms,  backward: 0.175 ms

input: 1x3x1x128:
    Using Vectorized<BFloat16> directly:      forward: 0.198 ms,  backward: 0.194 ms
    Cast to Vectorized<float> at beginning:   forward: 0.201 ms,  backward: 0.197 ms

input: 1x3x128x128:
    Using Vectorized<BFloat16> directly:      forward: 2.118 ms,  backward: 2.453 ms
    Cast to Vectorized<float> at beginning:   forward: 2.004 ms,  backward: 2.992 ms

Rounding error (the maximum difference compared with float) was tested on an Intel(R) Core(TM) i7-10700K CPU.

input: 1x3x1x6:
    Using Vectorized<BFloat16> directly:      forward: 0.0029454827308654785,  backward: 0.000244140625
    Cast to Vectorized<float> at beginning:   forward: 0.0009607672691345215,  backward: 0.00010850653052330017

input: 1x3x1x128:
    Using Vectorized<BFloat16> directly:      forward: 0.0005249381065368652,  backward: 1.52587890625e-05
    Cast to Vectorized<float> at beginning:   forward: 0.0005249381065368652,  backward: 7.62939453125e-06

input: 1x3x128x128:
    Using Vectorized<BFloat16> directly:      forward: 0.002125561237335205,  backward: 1.1920928955078125e-07
    Cast to Vectorized<float> at beginning:   forward: 0.002125561237335205,  backward: 5.960464477539063e-08

For these sizes, casting Vectorized<BFloat16> to Vectorized<float> at the beginning gives no performance advantage, but it does reduce rounding error.
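
The thread does not show the measurement script; one plausible way to obtain maximum-difference-versus-float numbers of this kind (the helper below is hypothetical) is to run the same BFloat16-representable data through both dtypes and compare:

```python
import torch
import torch.nn.functional as F

def max_diff_vs_float(shape):
    # Hypothetical harness: start from BFloat16-representable values so the
    # comparison isolates kernel rounding rather than input rounding.
    xb = torch.randn(shape).bfloat16().requires_grad_()
    yb = torch.randn(shape).bfloat16()
    xf = xb.detach().float().requires_grad_()
    yf = yb.float()

    lb = F.smooth_l1_loss(xb, yb)
    lb.backward()
    lf = F.smooth_l1_loss(xf, yf)
    lf.backward()

    fwd = (lb.float() - lf).abs().item()
    bwd = (xb.grad.float() - xf.grad).abs().max().item()
    return fwd, bwd

print(max_diff_vs_float((1, 3, 128, 128)))
```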

@codecov (bot) commented Aug 26, 2021

Codecov Report

Merging #62558 (4a451c6) into master (feefc94) will increase coverage by 0.38%.
The diff coverage is n/a.

❗ Current head 4a451c6 differs from the pull request's most recent head c280eb9. Consider uploading reports for commit c280eb9 to get more accurate results.

@@            Coverage Diff             @@
##           master   #62558      +/-   ##
==========================================
+ Coverage   66.37%   66.76%   +0.38%     
==========================================
  Files         739      695      -44     
  Lines       94299    90736    -3563     
==========================================
- Hits        62595    60580    -2015     
+ Misses      31704    30156    -1548     

@CaoE (Collaborator, Author) commented Sep 2, 2021

When the input becomes larger, the second method (casting to Vectorized<float>) performs better.
Here is another single-core performance test on a Xeon(R) Platinum 8180 @ 2.5 GHz (AVX2).

input: 1x3x256x256:
    Using Vectorized<BFloat16> directly:      forward: 0.404 ms,    backward: 0.387 ms
    Cast to Vectorized<float> at beginning:   forward: 0.134 ms,    backward: 0.196 ms

input: 4x3x256x1024:
    Using Vectorized<BFloat16> directly:      forward: 5.886 ms,    backward: 5.913 ms
    Cast to Vectorized<float> at beginning:   forward: 1.746 ms,    backward: 2.800 ms

input: 10x10x1024x1024:
    Using Vectorized<BFloat16> directly:      forward: 209.155 ms,  backward: 220.582 ms
    Cast to Vectorized<float> at beginning:   forward: 77.022 ms,   backward: 118.064 ms
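
The thread does not include the timing script either; single-core numbers of this kind could be collected with torch.utils.benchmark (the size below is taken from the table above, and torch.set_num_threads(1) pins the run to one core):

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

torch.set_num_threads(1)  # single-core measurement, matching the numbers above

x = torch.randn(4, 3, 256, 1024, dtype=torch.bfloat16, requires_grad=True)
y = torch.randn(4, 3, 256, 1024, dtype=torch.bfloat16)

fwd = benchmark.Timer(
    stmt="F.smooth_l1_loss(x, y)",
    setup="import torch.nn.functional as F",
    globals={"x": x, "y": y},
)
print(fwd.timeit(100))

loss = F.smooth_l1_loss(x, y)
bwd = benchmark.Timer(
    stmt="loss.backward(retain_graph=True)",  # retain the graph so the stmt can repeat
    globals={"loss": loss},
)
print(bwd.timeit(100))
```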

@pytorch-probot (bot) commented Oct 8, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/CaoE/pytorch/blob/042a8d5cc54662551e1773afc47fda25792f4a7a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows — Labels — Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@CaoE (Collaborator, Author) commented Oct 8, 2021

Rebased @VitalyFedyunin

@CaoE (Collaborator, Author) commented Dec 28, 2021

Hi @VitalyFedyunin, could you please review it? Thank you.

@CaoE CaoE force-pushed the bf16_loss branch 2 times, most recently from 042a8d5 to c280eb9 Compare January 24, 2022 03:04
@facebook-github-bot (Contributor) commented

@frank-wei has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@frank-wei (Contributor) left a comment

The bf16 is cast to float to reduce rounding error, and perf looks good.

@facebook-github-bot (Contributor) commented

@frank-wei has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Apr 4, 2022
Summary:
Add BFloat16 support for smooth_l1_loss on CPU.

Pull Request resolved: #62558

Reviewed By: H-Huang

Differential Revision: D34897859

Pulled By: frank-wei

fbshipit-source-id: a52138c89852642db78f5f3083d05873f3cdec3a
@github-actions (bot) commented Apr 4, 2022

Hey @CaoE.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.


Labels

cla signed · intel (This tag is for PRs from Intel open source) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
