Conversation

@christophe-murphy
Contributor

@christophe-murphy christophe-murphy commented Sep 30, 2024

Alternative OpenCL kernel for performing the CSC matrix vector multiply using atomic operations.

Benchmarking so far has shown it to run about 9x slower than the CUDA backend on my Nvidia RTX 4060 GPU. Note that support has been included for the BLAS-style matrix vector multiply with alpha and beta parameters; however, this does not appear to be supported elsewhere in the code for sparse matrices, so it has not been tested. Existing sparse matrix vector multiply tests are all passing for single and double precision as well as complex.

Description

The transpose of the sparse matrix is carried out in the OpenCL backend by treating the CSR matrix as a CSC matrix. In the CSC format, the compressed data is ordered by column rather than by row. The original OpenCL kernel performed a block matrix vector multiply. This is straightforward with the CSR format but more complicated with the CSC format: to determine which compressed indices fall in which block for a given column, a binary search was performed. As the matrix dimensions grew larger, the binary search quickly blew up the kernel run time.

As far as I can see, the only way to avoid the binary search is to not perform the multiply in blocks. I have created a new OpenCL kernel for the CSC matrix vector multiply which uses atomic addition operations instead. While this may run slowly on certain devices, I have found the runtime of this kernel to be about 9x that of the CUDA backend, which is not ideal but much faster than the previous implementation.

It is likely difficult to achieve performance comparable to the CUDA backend on an Nvidia GPU, since that backend makes use of the cuSPARSE library.

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass

Contributor

@edwinsolisf edwinsolisf left a comment


Tested on Windows with an RTX 3070 Ti Mobile; passes all tests. The improvement is about a 43.1x speedup on the OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.

Raw Measurements:
CUDA: 5.735 ms
Old OpenCL: 57,901 ms
New OpenCL: 1,344 ms

@christophe-murphy
Contributor Author

Tested on windows with RTX 3070 Ti Mobile, passes all tests. Improvement is about 43.1x speedup on OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.

Raw Measurements: CUDA: 5.735 ms Old OpenCL: 57,901 ms New OpenCL: 1,344 ms

How are you measuring the time? Could you send me the exact code you used?

Using af::timer with the example code in #2294, I am getting < 1 ms for both CUDA and OpenCL

@edwinsolisf
Contributor

Tested on windows with RTX 3070 Ti Mobile, passes all tests. Improvement is about 43.1x speedup on OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.
Raw Measurements: CUDA: 5.735 ms Old OpenCL: 57,901 ms New OpenCL: 1,344 ms

How are you measuring the time? Could you send me the exact code you used.

Using the af::timer with the example code in #2294 , I am getting < 1ms for both CUDA and OpenCL

This is the code I used:

#include <iostream>
#include <chrono>
#include <vector>
#include <random>

#include <arrayfire.h>

void ind(std::vector<int>& rows, std::vector<int>& columns, std::vector<float>& values, int maksimi, int Nx, int Ny) {
    rows.emplace_back(0);
    std::default_random_engine generator;
    std::normal_distribution<float> distribution(0.f, 1.0f);

    for (int kk = 0; kk < Nx; kk++) {
        int N_row = std::rand() % maksimi + 1;
        int temp = 0;
        for (int ll = 0; ll < N_row; ll++) {
            int apu = std::rand() % (Ny - N_row + ll - 1);
            if (apu <= temp)
                temp++;
            else
                temp = apu;
            columns.emplace_back(temp);
            values.emplace_back(distribution(generator));
        }
        rows.emplace_back(rows[kk] + N_row);
    }
}

int main() {
    // std::cout << "---------- OPENCL ------------\n";
    // af::setBackend(AF_BACKEND_OPENCL);
    // af::setDevice(1);

    std::cout << "---------- CUDA ------------\n";
    af::setBackend(AF_BACKEND_CUDA);
    af::setDevice(0);

    af::info();

    int Nx = 128 * 128 * 109 * 2;
    int maksimi = 128;
    int Ny = maksimi * maksimi * 109;

    std::vector<int> columns;
    std::vector<int> rows;
    std::vector<float> values;

    for (int ii = 0; ii < 3; ii++) {
        ind(rows, columns, values, maksimi, Nx, Ny);

        af::array element_ar(values.size(), values.data());
        af::array indices_ar(columns.size(), columns.data());
        af::array lor_ar(rows.size(), rows.data());
        af::array H = af::sparse(lor_ar.dims(0) - 1, Ny, element_ar, lor_ar, indices_ar);

        // Calculate the result once so that JIT compilation is excluded from the timing
        af::array Summ = af::matmul(H, af::constant(1, H.dims(0), 1), AF_MAT_TRANS);
        Summ.eval();
        Summ = af::array{};

        // Prepare inputs
        H.eval();
        af::sync();

        auto begin = std::chrono::high_resolution_clock::now();

        Summ = af::matmul(H, af::constant(1, H.dims(0), 1), AF_MAT_TRANS);

        // Force evaluation
        Summ.eval();
        af::sync();

        auto end = std::chrono::high_resolution_clock::now();

        std::cout << "iteration " << ii << ", "
                  << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
                  << " us\n";
        rows.clear();
        columns.clear();
        values.clear();
    }

    std::cout << std::endl;
    return 0;
}

@christophe-murphy
Contributor Author

Looks like you are using a different device for the OpenCL back end.

@edwinsolisf
Contributor

Looks like you are using a different device for the OpenCL back end.

Yes, the devices are in a different order on the OpenCL backend. af::info produces this on my system:

-0- AMD: gfx1035, 12297 MB
[1] NVIDIA: NVIDIA GeForce RTX 3070 Ti Laptop GPU, 8191 MB
-2- OpenCLOn12: AMD Radeon(TM) Graphics, 15970 MB
-3- OpenCLOn12: NVIDIA GeForce RTX 3070 Ti Laptop GPU, 8018 MB
-4- INTEL: AMD Ryzen 7 6800H with Radeon Graphics, 31940 MB
-5- OpenCLOn12: Microsoft Basic Render Driver, 15970 MB

OpenCLOn12 is a DirectX 12 emulation layer.

@christophe-murphy
Contributor Author

OK, it looks like I needed to add an af::sync(). When I do this, I get ~10 ms for CUDA and ~90 ms for OpenCL. I am not sure why you are seeing a much bigger difference.

@edwinsolisf edwinsolisf self-requested a review January 10, 2025 00:25
Contributor

@edwinsolisf edwinsolisf left a comment


Passed all tests. Tested on Ubuntu 22.04 with Nvidia Tesla P4. Measurements are:
CUDA: 38.1 ms
Old OpenCL: 282 s
New OpenCL: 53.7 ms

The improvement in performance from this change is very clear, and comparable to the CUDA backend. My previous measurements on Windows suggest an improvement of a smaller magnitude, so it might be good to verify on other systems. Regardless, the improvement is significant and needed.

@christophe-murphy
Contributor Author

Passed all tests. Tested on Ubuntu 22.04 with Nvidia Tesla P4. Measurements are: CUDA: 38.1 ms Old OpenCL: 282 s New OpenCL: 53.7 ms

The improvement in performance from this change is very clear and comparable to the CUDA backend. The performance seen in my previous measurements in Windows seem to suggest an improvement but not of the same magnitude so it might be good to verify in other systems. Regardless, the improvement is significant and needed.

Thanks for doing these checks. I suspect that some devices/drivers are better at doing atomic operations than others. So far I can't think of a better way of doing this without atomic operations but I'm happy to hear if anyone has ideas.

@christophe-murphy christophe-murphy merged commit e770c88 into master Jan 13, 2025


Development

Successfully merging this pull request may close these issues.

Sparse-dense matmul with AF_MAT_TRANS very slow in OpenCL on Nvidia card

3 participants