
Conversation

@CaoE (Collaborator) commented Aug 2, 2021

Add BFloat16 support for smooth_l1_loss on CPU.
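
For context, this enables BFloat16 inputs for smooth_l1_loss on CPU in both the forward and backward pass. A minimal usage sketch (the shape is illustrative, not from the PR):

```python
import torch
import torch.nn.functional as F

# BFloat16 inputs on CPU; the shape is chosen for illustration.
x = torch.randn(1, 3, 128, 128, dtype=torch.bfloat16, requires_grad=True)
y = torch.randn(1, 3, 128, 128, dtype=torch.bfloat16)

loss = F.smooth_l1_loss(x, y)    # forward dispatches the BFloat16 CPU kernel
loss.backward()                  # backward as well

print(loss.dtype, x.grad.dtype)  # torch.bfloat16 torch.bfloat16
```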

@facebook-github-bot (Contributor) commented Aug 2, 2021

💊 CI failures summary and remediations

As of commit 9f09c88 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

@iramazanli iramazanli added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Aug 2, 2021
@VitalyFedyunin (Contributor) commented

Hi! Can you please clarify why we can't use https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h#L154 here?

@CaoE (Collaborator, Author) commented Aug 3, 2021

@VitalyFedyunin Hi! This is because Vectorized<BFloat16> is cast to Vectorized<float> at the beginning, which reduces the type-conversion overhead of the intermediate operations.
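
The trade-off can be sketched at the Python tensor level. This is a toy emulation of the two strategies using the standard smooth L1 formula with beta = 1, not the ATen vectorized kernel itself: per-op BFloat16 arithmetic rounds every intermediate result, while upcasting once keeps all intermediates in float and rounds only the final store.

```python
import torch

beta = 1.0  # default beta of smooth_l1_loss

def smooth_l1(a, b):
    # Elementwise smooth L1: 0.5 * d^2 / beta if |d| < beta, else |d| - 0.5 * beta
    d = (a - b).abs()
    return torch.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta)

x = torch.randn(1 << 16, dtype=torch.bfloat16)
t = torch.randn(1 << 16, dtype=torch.bfloat16)

per_op = smooth_l1(x, t)                             # every intermediate rounded to bf16
upcast = smooth_l1(x.float(), t.float()).bfloat16()  # compute in fp32, round once at the end

ref = smooth_l1(x.double(), t.double())              # high-precision reference
print((per_op.double() - ref).abs().max().item())
print((upcast.double() - ref).abs().max().item())
```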

@VitalyFedyunin (Contributor) commented

Overall looks good; I would like to see benchmark numbers for different input sizes.

@CaoE CaoE force-pushed the bf16_loss branch 2 times, most recently from a880982 to 45f51e0 Compare August 11, 2021 07:09
@CaoE CaoE closed this Aug 11, 2021
@CaoE CaoE reopened this Aug 12, 2021
@CaoE (Collaborator, Author) commented Aug 12, 2021

Hi~
Single-core performance was tested on a Xeon(R) Platinum 8180 @ 2.5 GHz.

input: 1x3x1x6:
    Using Vectorized<BFloat16> directly:      forward: 0.185 ms,  backward: 0.179 ms
    Cast to Vectorized<float> at beginning:   forward: 0.183 ms,  backward: 0.175 ms

input: 1x3x1x128:
    Using Vectorized<BFloat16> directly:      forward: 0.198 ms,  backward: 0.194 ms
    Cast to Vectorized<float> at beginning:   forward: 0.201 ms,  backward: 0.197 ms

input: 1x3x128x128:
    Using Vectorized<BFloat16> directly:      forward: 2.118 ms,  backward: 2.453 ms
    Cast to Vectorized<float> at beginning:   forward: 2.004 ms,  backward: 2.992 ms

Rounding error (the maximum difference compared with float) was tested on an Intel(R) Core(TM) i7-10700K CPU.

input: 1x3x1x6:
    Using Vectorized<BFloat16> directly:      forward: 0.0029454827308654785,  backward: 0.000244140625
    Cast to Vectorized<float> at beginning:   forward: 0.0009607672691345215,  backward: 0.00010850653052330017

input: 1x3x1x128:
    Using Vectorized<BFloat16> directly:      forward: 0.0005249381065368652,  backward: 1.52587890625e-05
    Cast to Vectorized<float> at beginning:   forward: 0.0005249381065368652,  backward: 7.62939453125e-06

input: 1x3x128x128:
    Using Vectorized<BFloat16> directly:      forward: 0.002125561237335205,  backward: 1.1920928955078125e-07
    Cast to Vectorized<float> at beginning:   forward: 0.002125561237335205,  backward: 5.960464477539063e-08

For these sizes, casting Vectorized<BFloat16> to Vectorized<float> at the beginning gives no performance advantage, but it does reduce rounding error.
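
The thread does not show the measurement script; one plausible way to obtain maximum-difference-versus-float numbers of this kind (the helper below is hypothetical) is to run the same BFloat16-representable data through both dtypes and compare:

```python
import torch
import torch.nn.functional as F

def max_diff_vs_float(shape):
    # Hypothetical harness: start from BFloat16-representable values so the
    # comparison isolates kernel rounding rather than input rounding.
    xb = torch.randn(shape).bfloat16().requires_grad_()
    yb = torch.randn(shape).bfloat16()
    xf = xb.detach().float().requires_grad_()
    yf = yb.float()

    lb = F.smooth_l1_loss(xb, yb)
    lb.backward()
    lf = F.smooth_l1_loss(xf, yf)
    lf.backward()

    fwd = (lb.float() - lf).abs().item()
    bwd = (xb.grad.float() - xf.grad).abs().max().item()
    return fwd, bwd

print(max_diff_vs_float((1, 3, 128, 128)))
```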

@codecov (bot) commented Aug 26, 2021

Codecov Report

Merging #62558 (4a451c6) into master (feefc94) will increase coverage by 0.38%.
The diff coverage is n/a.

❗ Current head 4a451c6 differs from the pull request's most recent head c280eb9. Consider uploading reports for commit c280eb9 to get more accurate results.

@@            Coverage Diff             @@
##           master   #62558      +/-   ##
==========================================
+ Coverage   66.37%   66.76%   +0.38%     
==========================================
  Files         739      695      -44     
  Lines       94299    90736    -3563     
==========================================
- Hits        62595    60580    -2015     
+ Misses      31704    30156    -1548     

@CaoE (Collaborator, Author) commented Sep 2, 2021

When the input becomes larger, the second method (casting to Vectorized<float>) performs better.
Here is another single-core performance test on a Xeon(R) Platinum 8180 @ 2.5 GHz (AVX2).

input: 1x3x256x256:
    Using Vectorized<BFloat16> directly:      forward: 0.404 ms,    backward: 0.387 ms
    Cast to Vectorized<float> at beginning:   forward: 0.134 ms,    backward: 0.196 ms

input: 4x3x256x1024:
    Using Vectorized<BFloat16> directly:      forward: 5.886 ms,    backward: 5.913 ms
    Cast to Vectorized<float> at beginning:   forward: 1.746 ms,    backward: 2.800 ms

input: 10x10x1024x1024:
    Using Vectorized<BFloat16> directly:      forward: 209.155 ms,  backward: 220.582 ms
    Cast to Vectorized<float> at beginning:   forward: 77.022 ms,   backward: 118.064 ms
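
The thread does not include the timing script either; single-core numbers of this kind could be collected with torch.utils.benchmark (the size below is taken from the table above, and torch.set_num_threads(1) pins the run to one core):

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

torch.set_num_threads(1)  # single-core measurement, matching the numbers above

x = torch.randn(4, 3, 256, 1024, dtype=torch.bfloat16, requires_grad=True)
y = torch.randn(4, 3, 256, 1024, dtype=torch.bfloat16)

fwd = benchmark.Timer(
    stmt="F.smooth_l1_loss(x, y)",
    setup="import torch.nn.functional as F",
    globals={"x": x, "y": y},
)
print(fwd.timeit(100))

loss = F.smooth_l1_loss(x, y)
bwd = benchmark.Timer(
    stmt="loss.backward(retain_graph=True)",  # retain the graph so the stmt can repeat
    globals={"loss": loss},
)
print(bwd.timeit(100))
```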

@pytorch-probot (bot) commented Oct 8, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/CaoE/pytorch/blob/042a8d5cc54662551e1773afc47fda25792f4a7a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows — Labels — Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@CaoE (Collaborator, Author) commented Oct 8, 2021

Rebased @VitalyFedyunin

@CaoE (Collaborator, Author) commented Dec 28, 2021

Hi @VitalyFedyunin, could you please review it? Thank you.

@CaoE CaoE force-pushed the bf16_loss branch 2 times, most recently from 042a8d5 to c280eb9 Compare January 24, 2022 03:04
@facebook-github-bot (Contributor) commented

@frank-wei has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@frank-wei (Contributor) left a comment

The bf16 is cast to float to reduce rounding error, and perf looks good.

@facebook-github-bot (Contributor) commented

@frank-wei has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Apr 4, 2022
Summary:
Add BFloat16 support for smooth_l1_loss on CPU.

Pull Request resolved: #62558

Reviewed By: H-Huang

Differential Revision: D34897859

Pulled By: frank-wei

fbshipit-source-id: a52138c89852642db78f5f3083d05873f3cdec3a
@github-actions (bot) commented Apr 4, 2022

Hey @CaoE.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.


Labels

cla signed · intel (This tag is for PRs from Intel open source) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
