FEAT: Add support for batched matrix multiply #1898
Conversation
Force-pushed from 636b265 to 0c6df07
Build finished. No test results found.

Build finished. 116 tests run, 0 skipped, 3 failed.
I just realized we can support batch for a single input as well. I'll update this PR.
build arrayfire ci
@pavanky Can you please rebase your changes so that the conflicts are resolved?
build arrayfire ci
OSX batch matrix multiplication failed
Hi @pavanky Any reason why this one is blocked? Cheers, WT.
@BookmanHan you can always build this PR yourself.
@BookmanHan to be clear, the feature is fully implemented. The failing tests are not dependent on the changes. Also, when using an open source / free library, "demand" may be a bit strong. I don't work for the company anymore, but if you want to contact the developers at ArrayFire, @umar456 can give you further info.
@pavanky Thanks for the response. Thanks again.
@pavanky It indeed should be 'need', which is interchangeable with 'demand' in Chinese. Thanks again for your response.
The memAlloc change broke this pull, but you can fix it by changing the following lines in cuda/blas.cpp: switch the allocation to auto, add .get() where a raw pointer is needed, and remove the memFree call. The appended patch can be applied to current master (it includes the complete pull and the fix for memAlloc).
@georgh Ah thanks. I'll fix it soon.
Thank you guys.
@WilliamTambelliniv Yes, that would be a good idea. PRs welcome.
Cool, just tested and the speed report makes sense: peak 3683.66 GFLOPS
This uses batchedGemm for the CUDA backend, but uses a for loop for the CPU and OpenCL backends.
These can be improved (in the future) by: