Backport changes to 3.7 #2949

umar456 · 2020-06-27T03:50:31Z

This PR backports bug fixes and some minor features to the 3.7 branch for the 3.7.2 release.

Description

Improvements

Cache CUDA kernels to disk to improve load times (thanks to @cschreib-ibex): Automatically cache compiled CUDA kernels on disk to speed up kernel compilation #2848
Statically link against CUDA libraries: Use static CUDA upstream libs on Unix #2785
Make cuDNN an optional build dependency: Make cudnn dependency optional #2836
Improve support for different compilers and OS: Fix error with GCC 6.1. Fixed warnings and Serialize test #2876; Fixed several errors on OSX with Apple Clang #2945; Check and use constexpr for fns and constructors #2925; Remove obsolete OSX specific patch in CLBlast external project #2942; Workaround for bug in Apple's OpenCL, a missing definition #2943
Improve performance of join and transpose on CPU: Improve the performance of CPU join and transpose #2849
Improve documentation: Added mouse manipulations #2816; Fix documentation to mem step size and clean up memory manager test #2821; Fix install doc #2846; Update OpenCL interop page so they discuss deleting of memory #2918; Update issue templates #2928; Cautionary notes about default random engine handle management #2947
Reduce binary size using NVRTC and by reducing template instantiations: Improve the performance of CPU join and transpose #2849; Cleanup fftconvolve related code #2861; Refactor kernel wrappers to use new caching API #2890
Improve reduceByKey performance on OpenCL by using builtin functions: Fix work group function name conflict in OpenCL 2.0 #2851
Improve support for Intel OpenCL GPUs: Adjust JIT heuristics for the Intel GPU on the OpenCL backend #2855
Allow statically linking against MKL: CMake support to link against static Intel MKL #2877 (sponsored by SDL)
Better support for older CUDA toolkits: Fix errors and warnings with CUDA 9.0 builds #2923
Add support for CUDA 11: Support to build CUDA backend with CUDA toolkit 11.0 #2939
Add support for ccache for faster builds: Fix ccache launch scripts to use sh compatible syntax #2931
Add support for the conan package manager on Linux: Add an ArrayFire conanfile.py that pulls from the linux binary installer #2875
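
The on-disk kernel caching item above can be pictured with a minimal C++17 sketch. This is not ArrayFire's caching code: `cachePath`, `loadCached`, `storeCached`, and `getKernelBinary` are hypothetical names, and `std::hash` is an assumption made for brevity (its value is not guaranteed to be stable across standard libraries, so a real cache would need a fixed hash).

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: derive a cache file name from the kernel source and
// the build options, so a change to either one produces a different file.
inline fs::path cachePath(const fs::path& dir, const std::string& source,
                          const std::string& options) {
    const std::size_t key = std::hash<std::string>{}(source + '\n' + options);
    return dir / ("kernel_" + std::to_string(key) + ".bin");
}

// Return the cached binary if the file exists, otherwise std::nullopt.
inline std::optional<std::string> loadCached(const fs::path& file) {
    std::ifstream in(file, std::ios::binary);
    if (!in) { return std::nullopt; }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Store a compiled binary; returns false if the file could not be written.
inline bool storeCached(const fs::path& file, const std::string& binary) {
    std::error_code ec;
    fs::create_directories(file.parent_path(), ec);
    std::ofstream out(file, std::ios::binary | std::ios::trunc);
    out.write(binary.data(), static_cast<std::streamsize>(binary.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Load from disk when possible; otherwise compile and cache the result.
// `compile` stands in for the expensive runtime compilation step.
template <typename Compile>
std::string getKernelBinary(const fs::path& dir, const std::string& source,
                            const std::string& options, Compile compile) {
    assert(!dir.empty() && "a cache directory must be provided");
    const fs::path file = cachePath(dir, source, options);
    if (auto hit = loadCached(file)) { return *hit; }
    std::string binary = compile(source, options);
    storeCached(file, binary);  // failing to cache is not fatal
    return binary;
}
```

With this shape, the second request for the same source and options reads the file back instead of invoking the compiler again, which is where the load-time saving would come from.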

Fixes

Fix crash when allocating large arrays: Fix byteToString where the byte value is greater than 1 petabyte #2827
Fix various compiler warnings: Fix byteToString where the byte value is greater than 1 petabyte #2827; Improve the performance of CPU join and transpose #2849; Fix error in GCC 8.3: cl_float* -> float* using static_cast is invalid #2872; Fix error with GCC 6.1. Fixed warnings and Serialize test #2876
Fix minor leaks in OpenCL functions: Fix errors found with Address Sanitizer in the OpenCL backend #2913
Various continuous integration related fixes: Win cpu build #2819
Fix zero padding with convolve2NN: fix zero padding in convolve2NN #2820
Fix af_get_memory_pressure_threshold return value: Fix the af_get_memory_pressure_threshold by assigning value parameter #2831
Increased the max filter length for morph
Handle empty array inputs for LU, QR, and Rank functions: Fix empty array handling in lu, rank, and qr. Other minor refactoring #2838
Fix FindMKL.cmake script for sequential threading library: Fix pinnedMemManager check. Fix MKL Sequential layer #2840
Various internal refactoring: Apply clang-tidy suggestions to all backends #2839; Cleanup fftconvolve related code #2861; Refactor padArray to reshape; Refactor qr,solve to use padArrayBorders #2864; Merge kernel caching logic for CUDA and OpenCL backends #2873; Refactor kernel wrappers to use new caching API #2890; Use same Node class for all backends #2891; Fix errors found with Address Sanitizer in the OpenCL backend #2913
Fix OpenCL 2.0 builtin function name conflict: Fix work group function name conflict in OpenCL 2.0 #2851
Fix error caused when releasing memory with multiple devices: Fix premature deallocation when deallocating with multiple devices. #2867
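
To illustrate the kind of failure behind the byteToString fix, here is a hedged sketch of a byte-count formatter that cannot index past its unit table. It is a hypothetical re-implementation for illustration, not ArrayFire's function: the unit list, the output format, and the assumption that the original problem was an out-of-range unit index are all mine.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical formatter (illustration only, not ArrayFire's byteToString).
inline std::string byteToString(std::size_t bytes) {
    static constexpr std::array<const char*, 7> units = {
        "B", "KB", "MB", "GB", "TB", "PB", "EB"};

    double value     = static_cast<double>(bytes);
    std::size_t unit = 0;

    // Stop at the last entry of the table, so very large inputs keep the
    // largest unit instead of stepping past the end of the array.
    while (value >= 1024.0 && unit + 1 < units.size()) {
        value /= 1024.0;
        ++unit;
    }
    assert(unit < units.size());

    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.1f %s", value, units[unit]);
    return buf;
}
```

The loop bound is the relevant design choice: the unit index is clamped by the table size rather than by the magnitude of the input.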

Checklist

~~[ ] Rebased on latest master~~
Code compiles
Tests pass
~~[ ] Functions added to unified API~~
~~[ ] Functions documented~~

Instead of creating a static library out of all the separate instantiations of the thrust_sort_by_key sources, we now embed the sources generated with CMake's configure_file command directly into the afcuda target. This also fixed separable compilation. Prior to this change, separable compilation failed with undefined references related to CUDA device linking. I tried to fix that problem but couldn't get a breakthrough. However, I realized that using the generated sources directly with the afcuda target does the job without any additional static library.

thrust::stable_sort_by_key has a known issue with device linking: the code crashes with cudaInvalidValueError. Otherwise, it works as expected without any changes, with or without separable compilation. https://github.com/thrust/thrust/wiki/Debugging#known-issues https://github.com/thrust/thrust/blob/master/doc/changelog.md#known-issues-2 The above documents mention a known issue with device linking and Thrust. Although the documents say it happens in debug mode (with the -G flag), I noticed similar crashes in the release configuration of ArrayFire too. Because of this issue, I have separated the source files that require device linking (fft, blas, sparse and solver) into a separate static library. With those files in their own static library, sort_by_key and all the other unit tests that use it run as expected without any crashes.

Removed a special neighborhood iterator which isn't necessary

Added mouse manipulations

Change ninja to 1.10.0

pinverse_cpu test is excluded as lapacke dependency is not taken care of yet

* Fix constant mem declaration in CUDA morph kernel. The global constant for the max filter length was not updated when filter support was originally increased from 17 to 19.

* adds fallback for convolveNN functions
* adds cudnn option, runtime fallback
* Noexcept and const many Dependency module functions
* Refactor cuDNN code in CMake
* Fix fallback logic. refactor cuDNN util functions. Fix f16 wrap

Co-authored-by: Umar Arshad <umar@arrayfire.com>

* Add clang-tidy configuration file
* Cleanup some exception code
* Add additional upstream directories to .gitignore
* Remove unused parameters from wrap and transform implementations
* Fix warnings and removed unused calls

* Removed constexpr not supported by VS2015
* Fixed formatting

* enqueueWriteBuffer asynchronously in vision kernels. A few locations that initialize flags or buffers previously used synchronous copies to GPU memory, which is unnecessary because kernel execution is in-order. These copies are now asynchronous.
* Fix formatting
* Correct the scope of h_desc_lvl on orb

* Improve documentation of the alloc and free function
* Add tests for memory operations

* Created snippets for examples in the document

* The rdc and dlink flags are not required because they are added by CMake for separable compilation and static linking respectively
* Add guards around libs that are not included in the CUDA 9.0 Toolkit
* Only link with OpenMP when linking with cuSOLVER dynamically
* Fix error message when CUDNN is not found

* Address casts from double to __half which are missing in 9.0
* Thrust return_temporary_buffer function can accept void* pointers in older versions of Thrust. Use raw_pointer_cast to pass the pointer to memFree
* cublasGemmEx doesn't exist in CUDA 9.0. Add ifdefs to guard against older builds
* __float2half is not a host function so it needs to be removed from mean
* Add template instantiation for memFree to accept void* pointers

CUSOLVER_CHECK error message printed "CUBLAS Error" instead of CUSOLVER Error
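
One way to keep such messages from drifting is to build every check macro from a single helper that takes the library name as an argument. The sketch below shows that idea only. `ExampleStatus`, `formatError`, and the `EXAMPLE_*` macros are hypothetical stand-ins, not ArrayFire's or CUDA's actual definitions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library status code such as a cuSOLVER status.
enum class ExampleStatus { Success = 0, Failure = 1 };

// Build the message in one place so the library name is always a parameter.
inline std::string formatError(const char* lib, int code, const char* file,
                               int line) {
    return std::string(lib) + " Error (" + std::to_string(code) + ") at " +
           file + ":" + std::to_string(line);
}

// Generic check: evaluates the call once and throws on a non-success status.
#define EXAMPLE_LIB_CHECK(lib, call)                                       \
    do {                                                                   \
        const ExampleStatus status_ = (call);                              \
        if (status_ != ExampleStatus::Success) {                           \
            throw std::runtime_error(formatError(                          \
                lib, static_cast<int>(status_), __FILE__, __LINE__));      \
        }                                                                  \
    } while (0)

// Per-library wrappers differ only in the name they pass along, so a
// copy-pasted macro cannot report the wrong library.
#define EXAMPLE_CUBLAS_CHECK(call) EXAMPLE_LIB_CHECK("CUBLAS", call)
#define EXAMPLE_CUSOLVER_CHECK(call) EXAMPLE_LIB_CHECK("CUSOLVER", call)
```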

Prior to this change, I had added bash-specific syntax, which won't work with /bin/sh or dash shells. /usr/sh is available on most systems that use init.d scripts, so it is safe to assume its availability on the majority of Linux distributions.

* Adds PR template
  **Short description of change** Adds a GitHub PR template for the ArrayFire project. Developers will now see a short suggested checklist when creating a new PR on GitHub.
  **Motivation** Adding a PR template will make it easier to reference old issues when generating reports and to link future issues in historical context.
  **Future considerations** The wiki might need to be updated with additional development guidelines. The current guidelines could be more comprehensive.
* Updated pull request template
* Added additional detail.
* Use comments instead of text to communicate with the reader.
* Create a simple checklist
* Grammar + Future changes in the description section

Co-authored-by: Umar Arshad <umar@arrayfire.com>

* AF_CONSTEXPR expands to nothing if constexpr support is not available.
* Replace CONSTEXPR_DH with AF_CONSTEXPR and __DH__ in `src/backend/common/half.hpp`
* Removed AF_CONSTEXPR where it is invalid in half.hpp
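
The "expands to nothing" pattern can be sketched as follows. This is an assumption-laden illustration, not ArrayFire's definition: the commit title says AF_CONSTEXPR is defined from CMake's cxx_relaxed_constexpr check, whereas this sketch substitutes the standard `__cpp_constexpr` feature-test macro and uses a hypothetical `EXAMPLE_CONSTEXPR` name.

```cpp
#include <cassert>

// Sketch only: the standard feature-test macro stands in for the CMake
// compiler-feature check. 201304 is the value for relaxed (C++14) constexpr.
#if defined(__cpp_constexpr) && __cpp_constexpr >= 201304L
#define EXAMPLE_CONSTEXPR constexpr
#else
#define EXAMPLE_CONSTEXPR
#endif

// A loop and a mutable local need relaxed constexpr rules. On compilers
// without them the macro is empty and this is an ordinary inline function.
EXAMPLE_CONSTEXPR inline int sumTo(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) { total += i; }
    return total;
}

// Usage: the run-time call works either way; only compile-time evaluation
// depends on the macro expanding to constexpr.
inline void checkSumTo() { assert(sumTo(4) == 10); }
```

The same source then builds on older compilers at the cost of losing compile-time evaluation there.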

* Adds the Zc:__cplusplus flag to CUDA builds for MSVC if the flag is available. The cuda_fp16 header does not define the default constructor for __half as "= default", which prevents the __half struct from being used in a constexpr expression
* For older versions of MSVC we define the __cplusplus macro before and after the inclusion of the cuda_fp16.h header.
* Define the AF_CONSTEXPR macro for NVRTC compilation

Adds several classes of issues with proposed additional information that would be helpful when debugging. Co-authored-by: pradeep <pradeep@arrayfire.com> Co-authored-by: Umar Arshad <umar@arrayfire.com>

The cusparseSpMV/cusparseSpMM functions take sparse and dense matrix/vector descriptor objects as arguments. This API was introduced in CUDA 10.1; the old API has been deprecated and is removed in CUDA 11.

Also, updates CUB version from 1.8.0 to 1.9.10

9prady9 and others added 30 commits June 26, 2020 15:15

Remove unused header from wrap cpu kernel

6e2fa13

Refactor cpu confidence cc to use ParamIterator

a932788

Fix byteToString where the byte value is > a petabyte

d978cec

Fix warning in boost stacktrace on newer gcc compilers

fd4bf32

Update forge_visualization.md

627d16b

Use boost env var on linux github ci jobs

dd60c3d

Windows github action ci job for CPU backend

9f5c57a

Avoid print_info as ctest post command for non-ninja win generators

b1fb0ec

Fix the af_get_memory_pressure_threshold by assigning value parameter

6e1ee89

Fix documentation to mem step size and clean up memory manager test

7fa7607

Fix constant mem declaration in CUDA morph kernel (arrayfire#2835)

6da3045

Fix lu, rank and qr handling of empty arrays and check for nullptr

65d4f53

Renamed a few variables in the default alloc and unlock functions

2dfaa37

Minor refactor in median. Add one and two element tests

22b3524

Fixed formatting issue in test/memory.cpp

bbab0ca

Remove MKL_ThreadingLibrary from required var. Sequential doesn't have one

549ed65

Fix pinned memory manager check, was testing the function pointer

56a976c

Apply clang-tidy suggestions to all backends (arrayfire#2839)

034cd5d

Escape % character in windows install instructions

c6c8656

Correct lib path suffix for linux install instructions

285820d

Remove constexpr not supported by VS2015 (arrayfire#2850)

0565724

Fix dereference of memory_info iterator before check

e4b1b22

Use double to calculate mean in random engine uniform tests if available

d9bf968

Prevent the optimizations in the MeanOp on cpu.

047798b

Fix the MatrixMultiplyBatch test so that we are testing the result

6cacbbb

Remove unnecessary tile from var. Use arith output parameter instead

83695dc

Address all warnings with -Wall flags in GCC 9.3

dcdf795

9prady9 and others added 26 commits June 26, 2020 18:10

Add out of memory test using custom memory manager

42feefb

Return cl_mem instead of cl::Buffer from nativeAlloc

9f142e1

Fix leak in OpenCL Indexing

b67b92b

Update OpenCL interop page so they discuss deleting of memory

a69bb96

Add missing ndims arg check in indexing fns

d493ad1

Fix CUSOLVER_CHECK error message

5c28176

Fix several warnings with older compilers

52aa2ef

adds missing WITH_CUDNN guard for cudnn.hpp

d18a9c1

Split pack expansion to work around a possible bug in VS 2015

2ba510c

Use cxx_relaxed_constexpr check to define AF_CONSTEXPR

d68c7e7

Create issue templates (arrayfire#2928)

8fe1ad0

Remove obsolete OSX specific patch in CLBlast external project

298c207

Workaround for bug in Apple's OpenCL, a missing definition

0b97f43

Increase minimum required CUDA toolkit version to build

c886688

Fix several errors when compiling on OSX

a55933d

Add static asserts and move constructors for several classes

4dc7b7c

Use descriptor based cusparse API for sparse blas fns

e5d1542

Changes to support build with CUDA 11

1fd0e0b

Cautionary notes about default random engine handle management

45410db

Update version to 3.7.2 and add release notes

a6c72ff

umar456 added this to the 3.7.2 milestone Jun 27, 2020

umar456 added the backport label Jun 27, 2020

9prady9 approved these changes Jun 27, 2020

9prady9 merged commit 2b929a8 into arrayfire:v3.7 Jun 27, 2020
