FEAT: Add support for batched matrix multiply#1898

Merged
umar456 merged 1 commit into arrayfire:master from pavanky:batched_gemm
Dec 1, 2017

Conversation

@pavanky
Member

@pavanky pavanky commented Aug 3, 2017

This uses batchedGemm for CUDA backend, but uses a for loop for CPU and OpenCL backend.

These can be improved (in the future) by:

  • Using batched gemm for CPU when using MKL 11.3 or greater
  • Using CLBLAST as the default BLAS implementation for OpenCL (or copying the batched gemm code into arrayfire).
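
For reference, the for-loop fallback amounts to slicing along the batch dimension and issuing one GEMM per slice. A minimal standalone sketch of that pattern (plain C++, illustrative only — not the actual CPU/OpenCL backend code):

    #include <cstddef>

    // Naive single-slice GEMM: C = A * B, all row-major n x n (illustration only).
    static void gemmSlice(const float* A, const float* B, float* C, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }

    // Batched fallback: one GEMM per slice, looping over the batch dimension,
    // as the CPU and OpenCL backends do here.
    void batchedGemm(const float* A, const float* B, float* C,
                     std::size_t n, std::size_t batch) {
        const std::size_t stride = n * n;  // elements per slice
        for (std::size_t b = 0; b < batch; ++b)
            gemmSlice(A + b * stride, B + b * stride, C + b * stride, n);
    }

The CUDA path replaces this host-side loop with a single batched cuBLAS call, which is why it was wired up first.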

@pavanky pavanky added this to the v3.6.0 milestone Aug 3, 2017
@pavanky pavanky force-pushed the batched_gemm branch 4 times, most recently from 636b265 to 0c6df07 on August 3, 2017 07:19
@9prady9 9prady9 previously approved these changes Aug 3, 2017
@arrayfire-ci

Build finished. No test results found.

@arrayfire-ci

Build finished. 116 tests run, 0 skipped, 3 failed.

@pavanky pavanky changed the title FEAT: Add support for batched matrix multiply [WIP] FEAT: Add support for batched matrix multiply Aug 4, 2017
@pavanky
Member Author

pavanky commented Aug 4, 2017

I just realized we can support batch for a single input as well. I'll update this PR.
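
The single-input batching can be pictured as a stride-0 broadcast: when one operand has only one slice, that slice is reused against every slice of the other operand. A hypothetical sketch extending the loop fallback (the names and broadcast rule here are illustrative, not ArrayFire internals):

    #include <cstddef>

    // Hypothetical broadcast rule: an operand with batch size 1 advances with
    // stride 0, so its single slice is reused for every output slice.
    void batchedGemmBroadcast(const float* A, std::size_t batchA,
                              const float* B, std::size_t batchB,
                              float* C, std::size_t n) {
        const std::size_t batch   = (batchA > batchB) ? batchA : batchB;
        const std::size_t strideA = (batchA == 1) ? 0 : n * n;
        const std::size_t strideB = (batchB == 1) ? 0 : n * n;
        for (std::size_t b = 0; b < batch; ++b) {
            const float* a = A + b * strideA;
            const float* bb = B + b * strideB;
            float* c = C + b * n * n;
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < n; ++k)
                        acc += a[i * n + k] * bb[k * n + j];
                    c[i * n + j] = acc;
                }
        }
    }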

@pavanky pavanky changed the title [WIP] FEAT: Add support for batched matrix multiply FEAT: Add support for batched matrix multiply Aug 6, 2017
@pavanky
Member Author

pavanky commented Aug 11, 2017

build arrayfire ci

@pavanky pavanky changed the base branch from devel to master August 23, 2017 15:21
@9prady9
Member

9prady9 commented Sep 26, 2017

@pavanky Can you please rebase your changes so that the conflicts are resolved?

@pavanky
Member Author

pavanky commented Oct 15, 2017

@umar456 @9prady9 rebased. Can one of you approve this?

@pavanky
Member Author

pavanky commented Oct 17, 2017

build arrayfire ci

@9prady9
Member

9prady9 commented Oct 17, 2017

The OSX batched matrix multiply test failed:


    [----------] 1 test from MatrixMultiply
    [ RUN      ] MatrixMultiply.Batched
    unknown file: Failure
    C++ exception with description "ArrayFire Exception (Invalid input size:203):
    In function dim_t af::calcDim(const af_seq &, const dim_t &)
    In file src/backend/common/dim4.cpp:205
    Invalid dimension for argument 1
    Expected: seq.begin >= -DBL_MIN && seq.begin < parentDim


@WilliamTambellini
Contributor

Hi @pavanky Any reason why this one is blocked? Cheers, WT.

@pavanky
Member Author

pavanky commented Nov 27, 2017

@BookmanHan you can always build this PR yourself.

@pavanky
Member Author

pavanky commented Nov 27, 2017

@BookmanHan to be clear the feature is fully implemented. The failing tests are not dependent on the changes.

Also, when using an open source / free library, "demand" may be a bit strong.

I don't work for the company anymore but if you want to contact the developers at arrayfire @umar456 can give you further info.

@BookmanHan

@pavanky
I am very sorry for using the wrong word.
It should be "need".

Thanks for the response.
I will build ArrayFire from the source.

Thanks, again.

@BookmanHan

BookmanHan commented Nov 27, 2017

@pavanky
I really don't want to be rude.
This is truly a misunderstanding or mis-phrasing.

It indeed should be 'need', which in Chinese is interchangeable with 'demand'.
Actually, I am totally friendly, just noting that.

Thanks again for your response.

@georgh

georgh commented Nov 27, 2017

The memAlloc change broke this pull request, but you can fix it by changing the following lines in cuda/blas.cpp:

switch to auto:

        auto d_lptrs = memAlloc<void *>(batchSize);
        auto d_rptrs = memAlloc<void *>(batchSize);
        auto d_optrs = memAlloc<void *>(batchSize);

adding .get():

        CUDA_CHECK(cudaMemcpyAsync(d_lptrs.get(), lptrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));
        CUDA_CHECK(cudaMemcpyAsync(d_rptrs.get(), rptrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));
        CUDA_CHECK(cudaMemcpyAsync(d_optrs.get(), optrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));

switch to get and remove memfree

        CUBLAS_CHECK(gemmBatched_func<T>()(
                          blasHandle(),
                          lOpts,
                          rOpts,
                          M, N, K,
                          &alpha,
                         (const T **)d_lptrs.get(), lStrides[1],
                         (const T **)d_rptrs.get(), rStrides[1],
                          &beta,
                         (T **)d_optrs.get(),
                         oStrides[1],
                         batchSize));

The appended patch can be applied to the current master (it includes the complete pull request and the memAlloc fix):

0001-fix-memalloc.patch.txt
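
The pattern behind this fix: memAlloc was changed to return an owning, unique_ptr-style handle, so the raw device pointer now comes from `.get()` and the explicit memFree disappears with the handle's destructor. A generic standalone analog using std::unique_ptr with a custom deleter (host malloc/free stand in for ArrayFire's device memory manager; the names are illustrative):

    #include <cstddef>
    #include <cstdlib>
    #include <memory>

    // Deleter that replaces the explicit memFree call the old code needed.
    struct FreeDeleter {
        void operator()(void* p) const { std::free(p); }
    };

    // Stand-in for a memAlloc that returns an owning handle, not a raw pointer.
    template <typename T>
    std::unique_ptr<T[], FreeDeleter> memAllocLike(std::size_t count) {
        return std::unique_ptr<T[], FreeDeleter>(
            static_cast<T*>(std::malloc(count * sizeof(T))));
    }

With this shape, `auto d_lptrs = memAllocLike<void*>(batchSize);` compiles, `d_lptrs.get()` yields the raw pointer for the cudaMemcpyAsync and cuBLAS calls, and the buffer is released automatically at scope exit.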

@pavanky
Member Author

pavanky commented Nov 27, 2017

@georgh Ah thanks. I'll fix it soon.

@umar456 umar456 merged commit 9447258 into arrayfire:master Dec 1, 2017
@pavanky pavanky deleted the batched_gemm branch December 1, 2017 19:03
@WilliamTambellini
Contributor

WilliamTambellini commented Dec 2, 2017

Thank you guys.
Would it make sense to add code like this to examples/benchmarks/blas.cpp?

    printf("Benchmark N-by-N-by-N matrix multiply\n");
    for (int n = 128; n <= 640; n += 128) {
        printf("%4d x %4d x %4d: ", n, n, n);
        A = constant(1, n, n, n);
        double time = timeit(fn); // time in seconds
        // n batched n x n multiplies at 2*n^3 flops each => 2 * n^4 flops total
        double gflops = 2.0 * pow(n, 4) / (time * 1e9);
        if (gflops > peak)
            peak = gflops;
        printf(" %4.0f Gflops\n", gflops);
        fflush(stdout);
    }
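
For context on the `2.0 * pow(n, 4)` factor: one n-by-n matrix multiply performs n*n*n multiply-add pairs, i.e. about 2*n^3 flops, and the batched benchmark runs n such multiplies per iteration. A tiny helper making that count explicit (hypothetical, not part of the example file):

    #include <cstdint>

    // Flop count for `batch` multiplies of n x n matrices: 2 * n^3 flops each
    // (one multiply and one add per inner-product term).
    std::uint64_t batchedGemmFlops(std::uint64_t n, std::uint64_t batch) {
        return 2ULL * n * n * n * batch;
    }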

@umar456
Member

umar456 commented Dec 2, 2017

@WilliamTambellini Yes, that would be a good idea. PRs welcome.

@WilliamTambellini
Contributor

Cool, just tested and the speed report makes sense:

    $ examples/benchmarks/blas_cuda
    ArrayFire v3.x.x (CUDA, 64-bit Linux, build b7bd543)
    Platform: CUDA Toolkit 8, Driver: 375.74
    [0] GeForce GTX 1060, 6073 MB, CUDA Compute 6.1
    Benchmark N-by-N matrix multiply
     128 x  128:  155 Gflops
     256 x  256: 1039 Gflops
     384 x  384: 2239 Gflops
     512 x  512: 2368 Gflops
     640 x  640: 2898 Gflops
     768 x  768: 3248 Gflops
     896 x  896: 3173 Gflops
    1024 x 1024: 3288 Gflops
    1152 x 1152: 3312 Gflops
    1280 x 1280: 3665 Gflops
    1408 x 1408: 3411 Gflops
    1536 x 1536: 3528 Gflops
    1664 x 1664: 3538 Gflops
    1792 x 1792: 3614 Gflops
    1920 x 1920: 3552 Gflops
    2048 x 2048: 3684 Gflops
    Benchmark N-by-N-by-N matrix multiply
     128 x  128 x  128: 2168 Gflops
     256 x  256 x  256: 2955 Gflops
     384 x  384 x  384: 3175 Gflops
     512 x  512 x  512: 3268 Gflops

    peak 3683.66 GFLOPS
