Conversation

@christophe-murphy
Contributor

@christophe-murphy christophe-murphy commented Sep 30, 2024

Alternative OpenCL kernel for performing the CSC matrix vector multiply using atomic operations.

Benchmarking so far has shown it to run about 9x slower than the CUDA backend on my Nvidia RTX 4060 GPU. Note that support has been included for the BLAS-style matrix vector multiply with alpha and beta parameters; however, this does not appear to be supported elsewhere in the code for sparse matrices, so it has not been tested. Existing sparse matrix vector multiply tests are all passing for single and double precision as well as complex.

Description

The transpose of the sparse matrix is carried out in the OpenCL backend by treating the CSR matrix as a CSC matrix. In the CSC format, the compressed data is ordered by column rather than by row. The original OpenCL kernel performed a block matrix vector multiply. This is straightforward with the CSR format but more complicated with the CSC format: to determine which compressed indices fall in which block for a given column, a binary search was performed. As the matrix dimensions grew larger, the binary search quickly blew up the kernel run time.

As far as I can see, the only way to avoid the binary search is to not perform the multiply in blocks. I have created a new OpenCL kernel for the CSC matrix vector multiply which uses atomic addition operations instead. While this may run slowly on certain devices, I have found the runtime of this kernel to be about 9x that of the CUDA backend, which is not ideal but much faster than the previous implementation.

It is likely difficult to achieve performance comparable to the CUDA backend on an Nvidia GPU, since that backend makes use of the cuSPARSE library.

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass

Contributor

@edwinsolisf edwinsolisf left a comment


Tested on Windows with an RTX 3070 Ti Mobile; passes all tests. The improvement is about a 43.1x speedup on the OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.

Raw Measurements:
CUDA: 5.735 ms
Old OpenCL: 57,901 ms
New OpenCL: 1,344 ms

@christophe-murphy
Contributor Author

Tested on windows with RTX 3070 Ti Mobile, passes all tests. Improvement is about 43.1x speedup on OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.

Raw Measurements: CUDA: 5.735 ms Old OpenCL: 57,901 ms New OpenCL: 1,344 ms

How are you measuring the time? Could you send me the exact code you used?

Using af::timer with the example code in #2294, I am getting < 1 ms for both CUDA and OpenCL

@edwinsolisf
Contributor

Tested on windows with RTX 3070 Ti Mobile, passes all tests. Improvement is about 43.1x speedup on OpenCL backend; however, CUDA is still 234x faster using the test code provided in issue #2294.
Raw Measurements: CUDA: 5.735 ms Old OpenCL: 57,901 ms New OpenCL: 1,344 ms

How are you measuring the time? Could you send me the exact code you used.

Using the af::timer with the example code in #2294 , I am getting < 1ms for both CUDA and OpenCL

This is the code I used:

#include <iostream>
#include <chrono>
#include <vector>
#include <random>

#include <arrayfire.h>

void ind(std::vector<int>& rows, std::vector<int>& columns, std::vector<float>& values, int maksimi, int Nx, int Ny) {
    rows.emplace_back(0);
    std::default_random_engine generator;
    std::normal_distribution<float> distribution(0.f, 1.0f);

    for (int kk = 0; kk < Nx; kk++) {
        int N_row = std::rand() % maksimi + 1;
        int temp = 0;
        for (int ll = 0; ll < N_row; ll++) {
            int apu = std::rand() % (Ny - N_row + ll - 1);
            if (apu <= temp)
                temp++;
            else
                temp = apu;
            columns.emplace_back(temp);
            values.emplace_back(distribution(generator));
        }
        rows.emplace_back(rows[kk] + N_row);
    }
}

int main() {
    // std::cout << "---------- OPENCL ------------\n";
    // af::setBackend(AF_BACKEND_OPENCL);
    // af::setDevice(1);

    std::cout << "---------- CUDA ------------\n";
    af::setBackend(AF_BACKEND_CUDA);
    af::setDevice(0);

    af::info();

    int Nx = 128 * 128 * 109 * 2;
    int maksimi = 128;
    int Ny = maksimi * maksimi * 109;

    std::vector<int> columns;
    std::vector<int> rows;
    std::vector<float> values;

    for (int ii = 0; ii < 3; ii++) {
        ind(rows, columns, values, maksimi, Nx, Ny);

        af::array element_ar(values.size(), values.data());
        af::array indices_ar(columns.size(), columns.data());
        af::array lor_ar(rows.size(), rows.data());
        af::array H = af::sparse(lor_ar.dims(0) - 1, Ny, element_ar, lor_ar, indices_ar);

        // Calculate the result once so that JIT compilation is excluded from the timing
        af::array Summ = af::matmul(H, af::constant(1, H.dims(0), 1), AF_MAT_TRANS);
        Summ.eval();
        Summ = af::array{};

        // Prepare inputs
        H.eval();
        af::sync();

        auto begin = std::chrono::high_resolution_clock::now();

        Summ = af::matmul(H, af::constant(1, H.dims(0), 1), AF_MAT_TRANS);

        // Force evaluation
        Summ.eval();
        af::sync();

        auto end = std::chrono::high_resolution_clock::now();

        std::cout << "iteration " << ii << ", "
                  << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
                  << " us\n";
        rows.clear();
        columns.clear();
        values.clear();
    }

    std::cout << std::endl;
    return 0;
}

@christophe-murphy
Contributor Author

Looks like you are using a different device for the OpenCL back end.

@edwinsolisf
Contributor

Looks like you are using a different device for the OpenCL back end.

Yes, the devices are in a different order on the OpenCL backend. af::info produces this on my system:

-0- AMD: gfx1035, 12297 MB
[1] NVIDIA: NVIDIA GeForce RTX 3070 Ti Laptop GPU, 8191 MB
-2- OpenCLOn12: AMD Radeon(TM) Graphics, 15970 MB
-3- OpenCLOn12: NVIDIA GeForce RTX 3070 Ti Laptop GPU, 8018 MB
-4- INTEL: AMD Ryzen 7 6800H with Radeon Graphics, 31940 MB
-5- OpenCLOn12: Microsoft Basic Render Driver, 15970 MB

OpenCLOn12 is a DirectX 12 emulation layer.

@christophe-murphy
Contributor Author

OK, it looks like I needed to add an af::sync(). When I do this, I get ~10 ms for CUDA and ~90 ms for OpenCL. I am not sure why you are seeing a much bigger difference.

@edwinsolisf edwinsolisf self-requested a review January 10, 2025 00:25
Contributor

@edwinsolisf edwinsolisf left a comment


Passed all tests. Tested on Ubuntu 22.04 with Nvidia Tesla P4. Measurements are:
CUDA: 38.1 ms
Old OpenCL: 282 s
New OpenCL: 53.7 ms

The improvement in performance from this change is very clear, and comparable to the CUDA backend. My previous measurements on Windows suggest an improvement of a smaller magnitude, so it might be good to verify on other systems. Regardless, the improvement is significant and needed.

@christophe-murphy
Contributor Author

Passed all tests. Tested on Ubuntu 22.04 with Nvidia Tesla P4. Measurements are: CUDA: 38.1 ms Old OpenCL: 282 s New OpenCL: 53.7 ms

The improvement in performance from this change is very clear and comparable to the CUDA backend. The performance seen in my previous measurements in Windows seem to suggest an improvement but not of the same magnitude so it might be good to verify in other systems. Regardless, the improvement is significant and needed.

Thanks for doing these checks. I suspect that some devices/drivers are better at doing atomic operations than others. So far I can't think of a better way of doing this without atomic operations but I'm happy to hear if anyone has ideas.

@christophe-murphy christophe-murphy merged commit e770c88 into master Jan 13, 2025


Development

Successfully merging this pull request may close these issues.

Sparse-dense matmul with AF_MAT_TRANS very slow in OpenCL on Nvidia card

3 participants