FEAT: Add support for batched matrix multiply#1898

Merged
umar456 merged 1 commit into arrayfire:master from pavanky:batched_gemm
Dec 1, 2017

Conversation

@pavanky
Member

@pavanky pavanky commented Aug 3, 2017

This uses batchedGemm for CUDA backend, but uses a for loop for CPU and OpenCL backend.

These can be improved (in the future) by:

  • Using batched gemm for CPU when using MKL 11.3 or greater
  • Using CLBLAST as the default BLAS implementation for OpenCL (or copying the batched gemm code into arrayfire).
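
For reference, the for-loop fallback amounts to slicing along the batch dimension and issuing one GEMM per slice. A minimal standalone sketch of that pattern (plain C++, illustrative only — not the actual CPU/OpenCL backend code):

    #include <cstddef>

    // Naive single-slice GEMM: C = A * B, all row-major n x n (illustration only).
    static void gemmSlice(const float* A, const float* B, float* C, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }

    // Batched fallback: one GEMM per slice, looping over the batch dimension,
    // as the CPU and OpenCL backends do here.
    void batchedGemm(const float* A, const float* B, float* C,
                     std::size_t n, std::size_t batch) {
        const std::size_t stride = n * n;  // elements per slice
        for (std::size_t b = 0; b < batch; ++b)
            gemmSlice(A + b * stride, B + b * stride, C + b * stride, n);
    }

The CUDA path replaces this host-side loop with a single batched cuBLAS call, which is why it was wired up first.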

@pavanky pavanky added this to the v3.6.0 milestone Aug 3, 2017
@pavanky pavanky force-pushed the batched_gemm branch 4 times, most recently from 636b265 to 0c6df07 on August 3, 2017 07:19
@9prady9 9prady9 previously approved these changes Aug 3, 2017
@arrayfire-ci

Build finished. No test results found.

@arrayfire-ci

Build finished. 116 tests run, 0 skipped, 3 failed.

@pavanky pavanky changed the title FEAT: Add support for batched matrix multiply [WIP] FEAT: Add support for batched matrix multiply Aug 4, 2017
@pavanky
Member Author

pavanky commented Aug 4, 2017

I just realized we can support batch for a single input as well. I'll update this PR.
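
The single-input batching can be pictured as a stride-0 broadcast: when one operand has only one slice, that slice is reused against every slice of the other operand. A hypothetical sketch extending the loop fallback (the names and broadcast rule here are illustrative, not ArrayFire internals):

    #include <cstddef>

    // Hypothetical broadcast rule: an operand with batch size 1 advances with
    // stride 0, so its single slice is reused for every output slice.
    void batchedGemmBroadcast(const float* A, std::size_t batchA,
                              const float* B, std::size_t batchB,
                              float* C, std::size_t n) {
        const std::size_t batch   = (batchA > batchB) ? batchA : batchB;
        const std::size_t strideA = (batchA == 1) ? 0 : n * n;
        const std::size_t strideB = (batchB == 1) ? 0 : n * n;
        for (std::size_t b = 0; b < batch; ++b) {
            const float* a = A + b * strideA;
            const float* bb = B + b * strideB;
            float* c = C + b * n * n;
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < n; ++k)
                        acc += a[i * n + k] * bb[k * n + j];
                    c[i * n + j] = acc;
                }
        }
    }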

@pavanky pavanky changed the title [WIP] FEAT: Add support for batched matrix multiply FEAT: Add support for batched matrix multiply Aug 6, 2017
@pavanky
Member Author

pavanky commented Aug 11, 2017

build arrayfire ci

@pavanky pavanky changed the base branch from devel to master August 23, 2017 15:21
@9prady9
Member

9prady9 commented Sep 26, 2017

@pavanky Can you please rebase your changes so that the conflicts are resolved?

@pavanky
Member Author

pavanky commented Oct 15, 2017

@umar456 @9prady9 rebased. Can one of you approve this?

@pavanky
Member Author

pavanky commented Oct 17, 2017

build arrayfire ci

@9prady9
Member

9prady9 commented Oct 17, 2017

The OSX batched matrix multiply test failed:


    [----------] 1 test from MatrixMultiply
    [ RUN      ] MatrixMultiply.Batched
    unknown file: Failure
    C++ exception with description "ArrayFire Exception (Invalid input size:203):
    In function dim_t af::calcDim(const af_seq &, const dim_t &)
    In file src/backend/common/dim4.cpp:205
    Invalid dimension for argument 1
    Expected: seq.begin >= -DBL_MIN && seq.begin < parentDim


@WilliamTambellini
Contributor

Hi @pavanky Any reason why this one is blocked? Cheers, WT.

@pavanky
Member Author

pavanky commented Nov 27, 2017

@BookmanHan you can always build this PR yourself.

@pavanky
Member Author

pavanky commented Nov 27, 2017

@BookmanHan to be clear the feature is fully implemented. The failing tests are not dependent on the changes.

Also, when using an open source / free library, "demand" may be a bit strong.

I don't work for the company anymore but if you want to contact the developers at arrayfire @umar456 can give you further info.

@BookmanHan

@pavanky
I am very sorry for using the wrong word.
It should be "need".

Thanks for the response.
I will build ArrayFire from the source.

Thanks, again.

@BookmanHan

BookmanHan commented Nov 27, 2017

@pavanky
I really don't want to be rude.
This is truly a misunderstanding or mis-phrasing.

It indeed should be 'need', which in Chinese is interchangeable with 'demand'.
Actually, I am totally friendly, just noting that.

Thanks again for your response.

@georgh

georgh commented Nov 27, 2017

The memAlloc change broke this pull request, but you can fix it by changing the following lines in cuda/blas.cpp:

switch to auto:

        auto d_lptrs = memAlloc<void *>(batchSize);
        auto d_rptrs = memAlloc<void *>(batchSize);
        auto d_optrs = memAlloc<void *>(batchSize);

adding .get():

        CUDA_CHECK(cudaMemcpyAsync(d_lptrs.get(), lptrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));
        CUDA_CHECK(cudaMemcpyAsync(d_rptrs.get(), rptrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));
        CUDA_CHECK(cudaMemcpyAsync(d_optrs.get(), optrs.data(), bytes,
                                   cudaMemcpyHostToDevice,
                                   getActiveStream()));

switch to get and remove memfree

        CUBLAS_CHECK(gemmBatched_func<T>()(
                          blasHandle(),
                          lOpts,
                          rOpts,
                          M, N, K,
                          &alpha,
                         (const T **)d_lptrs.get(), lStrides[1],
                         (const T **)d_rptrs.get(), rStrides[1],
                          &beta,
                         (T **)d_optrs.get(),
                         oStrides[1],
                         batchSize));

The appended patch can be applied to the current master (it includes the complete pull request and the memAlloc fix):

0001-fix-memalloc.patch.txt
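
The pattern behind this fix: memAlloc was changed to return an owning, unique_ptr-style handle, so the raw device pointer now comes from `.get()` and the explicit memFree disappears with the handle's destructor. A generic standalone analog using std::unique_ptr with a custom deleter (host malloc/free stand in for ArrayFire's device memory manager; the names are illustrative):

    #include <cstddef>
    #include <cstdlib>
    #include <memory>

    // Deleter that replaces the explicit memFree call the old code needed.
    struct FreeDeleter {
        void operator()(void* p) const { std::free(p); }
    };

    // Stand-in for a memAlloc that returns an owning handle, not a raw pointer.
    template <typename T>
    std::unique_ptr<T[], FreeDeleter> memAllocLike(std::size_t count) {
        return std::unique_ptr<T[], FreeDeleter>(
            static_cast<T*>(std::malloc(count * sizeof(T))));
    }

With this shape, `auto d_lptrs = memAllocLike<void*>(batchSize);` compiles, `d_lptrs.get()` yields the raw pointer for the cudaMemcpyAsync and cuBLAS calls, and the buffer is released automatically at scope exit.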

@pavanky
Member Author

pavanky commented Nov 27, 2017

@georgh Ah thanks. I'll fix it soon.

@umar456 umar456 merged commit 9447258 into arrayfire:master Dec 1, 2017
@pavanky pavanky deleted the batched_gemm branch December 1, 2017 19:03
@WilliamTambellini
Contributor

WilliamTambellini commented Dec 2, 2017

Thank you guys.
Would it make sense to add code like this to examples/benchmarks/blas.cpp?

    printf("Benchmark N-by-N-by-N matrix multiply\n");
    for (int n = 128; n <= 640; n += 128) {
        printf("%4d x %4d x %4d: ", n, n, n);
        A = constant(1, n, n, n);
        double time = timeit(fn); // time in seconds
        // n batched n x n multiplies at 2*n^3 flops each => 2 * n^4 flops total
        double gflops = 2.0 * pow(n, 4) / (time * 1e9);
        if (gflops > peak)
            peak = gflops;
        printf(" %4.0f Gflops\n", gflops);
        fflush(stdout);
    }
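
For context on the `2.0 * pow(n, 4)` factor: one n-by-n matrix multiply performs n*n*n multiply-add pairs, i.e. about 2*n^3 flops, and the batched benchmark runs n such multiplies per iteration. A tiny helper making that count explicit (hypothetical, not part of the example file):

    #include <cstdint>

    // Flop count for `batch` multiplies of n x n matrices: 2 * n^3 flops each
    // (one multiply and one add per inner-product term).
    std::uint64_t batchedGemmFlops(std::uint64_t n, std::uint64_t batch) {
        return 2ULL * n * n * n * batch;
    }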

@umar456
Member

umar456 commented Dec 2, 2017

@WilliamTambellini Yes, that would be a good idea. PRs welcome.

@WilliamTambellini
Contributor

Cool, just tested and the speed report makes sense:

    $ examples/benchmarks/blas_cuda
    ArrayFire v3.x.x (CUDA, 64-bit Linux, build b7bd543)
    Platform: CUDA Toolkit 8, Driver: 375.74
    [0] GeForce GTX 1060, 6073 MB, CUDA Compute 6.1
    Benchmark N-by-N matrix multiply
     128 x  128:  155 Gflops
     256 x  256: 1039 Gflops
     384 x  384: 2239 Gflops
     512 x  512: 2368 Gflops
     640 x  640: 2898 Gflops
     768 x  768: 3248 Gflops
     896 x  896: 3173 Gflops
    1024 x 1024: 3288 Gflops
    1152 x 1152: 3312 Gflops
    1280 x 1280: 3665 Gflops
    1408 x 1408: 3411 Gflops
    1536 x 1536: 3528 Gflops
    1664 x 1664: 3538 Gflops
    1792 x 1792: 3614 Gflops
    1920 x 1920: 3552 Gflops
    2048 x 2048: 3684 Gflops
    Benchmark N-by-N-by-N matrix multiply
     128 x  128 x  128: 2168 Gflops
     256 x  256 x  256: 2955 Gflops
     384 x  384 x  384: 3175 Gflops
     512 x  512 x  512: 3268 Gflops

    peak 3683.66 GFLOPS
