Improve the performance of CPU join and transpose #2849
Merged
This PR improves the performance of the transpose and join kernels in the CPU backend. The previous approach was naive and was not optimized for data locality. The new version uses a tile-based approach to speed up the operation.
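To illustrate the tiling idea, here is a minimal sketch of a tile-based transpose. It is not the kernel from this PR: the function name `transposeTiled`, the tile size `kTile`, and the row-major `std::vector` layout are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile edge length; a real kernel would tune this to the cache.
constexpr std::size_t kTile = 32;

// Transposes a rows x cols row-major matrix into a cols x rows row-major
// matrix, walking the input one kTile x kTile block at a time so that the
// reads and the writes of a block both stay within a small memory region.
template <typename T>
std::vector<T> transposeTiled(const std::vector<T>& in, std::size_t rows,
                              std::size_t cols) {
    std::vector<T> out(in.size());
    for (std::size_t ti = 0; ti < rows; ti += kTile) {
        for (std::size_t tj = 0; tj < cols; tj += kTile) {
            // Clamp the block so partial tiles at the edges are handled.
            const std::size_t iEnd = std::min(ti + kTile, rows);
            const std::size_t jEnd = std::min(tj + kTile, cols);
            for (std::size_t i = ti; i < iEnd; ++i) {
                for (std::size_t j = tj; j < jEnd; ++j) {
                    out[j * rows + i] = in[i * cols + j];
                }
            }
        }
    }
    return out;
}
```

The naive version is the same pair of inner loops run over the whole matrix, which strides through the output by `rows` elements on every write. Tiling bounds that stride to one block at a time.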
The join kernel is improved by using a memcpy call instead of a for loop to perform the copy to the output matrix.
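As a rough sketch of the memcpy idea, the example below joins two row-major matrices side by side by copying each contiguous row in one call instead of element by element. The function `joinColumns`, its parameters, and the row-major layout are hypothetical and are not taken from the PR.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Joins a (rows x aCols) and b (rows x bCols), both row-major, into a
// rows x (aCols + bCols) row-major matrix. Each source row is contiguous,
// so it can be copied with a single memcpy.
template <typename T>
std::vector<T> joinColumns(const std::vector<T>& a, std::size_t aCols,
                           const std::vector<T>& b, std::size_t bCols,
                           std::size_t rows) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires trivially copyable elements");
    const std::size_t outCols = aCols + bCols;
    std::vector<T> out(rows * outCols);
    for (std::size_t r = 0; r < rows; ++r) {
        T* dst = out.data() + r * outCols;
        // Skip empty inputs so memcpy is never handed a null pointer.
        if (aCols > 0) {
            std::memcpy(dst, a.data() + r * aCols, aCols * sizeof(T));
        }
        if (bCols > 0) {
            std::memcpy(dst + aCols, b.data() + r * bCols, bCols * sizeof(T));
        }
    }
    return out;
}
```

The `static_assert` matters here: `memcpy` is only valid for trivially copyable element types, so a loop or `std::copy` would still be needed for anything else.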
Fixed several warnings reported by GCC's -Wall flag and enabled the flag by default in CMake.
Fixed a matrix multiplication test where the output matrix was not being tested.
Fixed a potential issue with mean where the optimizer could remove some operations, which could reduce the accuracy of the result.
Use double to calculate the mean for the random engine uniform tests to avoid overflow issues with larger arrays.
Fixed a bug in the DefaultMemoryManager introduced in f211253 where I was dereferencing an iterator before checking if the find function returned an actual value.
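The iterator bug in the last item follows a common pattern. The sketch below shows the safe ordering with a `std::map`; the `lookupSize` helper and the table type are hypothetical and do not reflect the actual DefaultMemoryManager code.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Looks up `key` and writes its value to `out` only if the key exists.
// The iterator is compared against end() before it is dereferenced;
// dereferencing the end iterator is undefined behavior.
bool lookupSize(const std::map<std::string, std::size_t>& table,
                const std::string& key, std::size_t& out) {
    const auto it = table.find(key);
    if (it == table.end()) {
        return false;  // not found: leave `out` untouched
    }
    out = it->second;
    return true;
}
```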