Performance improvement for sgemm column major TN (transposeA = T, transposeB = N) case #54
Merged
TimmyLiu merged 1 commit into clMathLibraries:develop from TimmyLiu:develop on Nov 6, 2014
…N kernel by doing transpose separately
TimmyLiu pushed a commit that referenced this pull request Nov 6, 2014
Performance improvement for sgemm column major TN (transposeA = T, transposeB = N) case
Even with the tuning tool, the current sgemm TN still performs poorly compared with sgemm NN, sgemm NT, and sgemm TT. This pull request proposes a wrapper from sgemm TN to sgemm NN that transposes A in a separate kernel, so that sgemm TN can benefit from the performance of sgemm NN.
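To illustrate the idea, here is a minimal CPU-side sketch in C of the identity the wrapper relies on: a column-major TN product equals an out-of-place transpose of A followed by an NN product. This is not the OpenCL code from the commit. The function names (`transpose_oop`, `sgemm_nn_ref`, `sgemm_tn_ref`, `sgemm_tn_via_nn`) are hypothetical, and the plain loops stand in for the tuned kernels.

```c
#include <stddef.h>
#include <stdlib.h>

/* Out-of-place transpose of a column-major rows x cols matrix.
 * dst receives the cols x rows result (leading dimension ld_dst). */
void transpose_oop(size_t rows, size_t cols,
                   const float *src, size_t ld_src,
                   float *dst, size_t ld_dst)
{
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            dst[c + r * ld_dst] = src[r + c * ld_src];
}

/* Reference NN: C = alpha * A * B + beta * C, with A (m x k), B (k x n). */
void sgemm_nn_ref(size_t m, size_t n, size_t k, float alpha,
                  const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
}

/* Reference TN: C = alpha * A^T * B + beta * C, with A stored as k x m. */
void sgemm_tn_ref(size_t m, size_t n, size_t k, float alpha,
                  const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[p + i * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
}

/* Wrapper: transpose A into a temporary m x k buffer, then call NN.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int sgemm_tn_via_nn(size_t m, size_t n, size_t k, float alpha,
                    const float *a, size_t lda,
                    const float *b, size_t ldb,
                    float beta, float *c, size_t ldc)
{
    size_t count = m * k;
    float *at = malloc((count ? count : 1) * sizeof *at);
    if (at == NULL)
        return -1;
    transpose_oop(k, m, a, lda, at, m);   /* A is k x m, at becomes m x k */
    sgemm_nn_ref(m, n, k, alpha, at, m, b, ldb, beta, c, ldc);
    free(at);
    return 0;
}
```

In this sketch the temporary buffer holds `m * k` floats, which is the CPU analogue of the extra buffer the wrapper allocates on the device.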
Note that because the transposition is implemented out of place, the wrapper creates an extra OpenCL buffer. This might be an issue for very large matrices.
To enable this wrapper, set the environment variable CLBLAS_FAST_SGEMM_TN=1. The code has only been tested on "Spectre", "Tahiti", and "Hawaii" devices. Thus, for now, if the environment variable is not set, or if the device is anything other than "Spectre", "Tahiti", or "Hawaii", the "old" kernel without transposition is called.
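The selection rule described above can be sketched as a small C predicate. This is an illustration, not the code in the commit. The function names are hypothetical, and the exact string comparison against `"1"` and against the device name (as it might be reported by `clGetDeviceInfo` with `CL_DEVICE_NAME`) is an assumption; the library may match these differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the transposing wrapper should be
 * used, 0 when the "old" TN kernel should be called.
 * env_value is the value of CLBLAS_FAST_SGEMM_TN, or NULL if unset. */
int fast_sgemm_tn_enabled(const char *env_value, const char *device_name)
{
    /* Devices the pull request reports as tested. */
    static const char *const tested[] = { "Spectre", "Tahiti", "Hawaii" };

    /* Assumption: only the exact value "1" turns the wrapper on. */
    if (env_value == NULL || strcmp(env_value, "1") != 0)
        return 0;
    if (device_name == NULL)
        return 0;

    for (size_t i = 0; i < sizeof tested / sizeof tested[0]; ++i)
        if (strcmp(device_name, tested[i]) == 0)
            return 1;
    return 0;
}

/* Convenience form that reads the environment variable itself. */
int fast_sgemm_tn_enabled_from_env(const char *device_name)
{
    return fast_sgemm_tn_enabled(getenv("CLBLAS_FAST_SGEMM_TN"), device_name);
}
```

Keeping the environment lookup separate from the decision makes the rule easy to check without changing the process environment.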