add support for hip fft #1864

Conversation
Can someone trigger the CUDA build?

It's great that you are looking into accelerating the FFTs! In the past we never saw great speedups because the MPI communication went through the host. Is this still the case? Also, it seems you rewrote 95% of the code, so this might be an opportunity to adopt the BSD license like we did for the grid code.
Unfortunately, I do not think we will gain much with the current state of the code, because everything goes through the CPU, so we always pay the price of transferring data back and forth. A proper solution would be to keep the data on the GPU, but that requires more in-depth knowledge of where these routines are called before attempting anything.

I also wanted to simplify the code, because it was clearly old, mysterious, and over-complicated. There is technically no new code since it is pretty much the same logic; I just removed unneeded events and used one stream for the calculations. The specialization boils down to one header file and one source file for each GPU backend. All the rest is generic.

I saw that the CI failed due to missing functions. I may have forgotten to push one modification in the toolchain or CI, since I had to add new compilation flags.

NB: we may need some common header files for the functions that are not really GPU-vendor dependent. The HIP API matches the CUDA API to a very large extent, so we could simply have something like gpuMemcpyAsync(...) instead of hipMemcpyAsync (or its CUDA version).
Force-pushed 3d83a66 to 949da0c
To my knowledge this is the earliest CUDA code in CP2K. Having it in good working order and with HIP support is already great progress and should buy us some time.
Yes, eventually we'll have to bite the bullet and rewrite the plane-wave code from scratch, but it's not yet top of the list IMHO.
Agreed, I'll add this to offload_library.c shortly. Which CUDA/HIP functions do you need?
It is indeed in my plan to rewrite the fft stuff with SpFFT. Simon designed it very cleanly and we have used it in Sirius for quite some time now. I did this PR because it would be helpful for some people.
I can also do it, since I have the header files already written; basically something like this is enough for most cases. HIP does this already, but I think it is a bad idea to make HIP (as opposed to ROCm) a hard dependency for the GPU support. Most of the time we use a limited set of functions, and these are used in grid, dbm, and pw for instance. I can prepare a PR just for this; it would not take me long at all.

Also, we have this variable OFFLOAD_TARGET that we use in the CI (I think, correct me if I am wrong). I was thinking to use it in the code as well, so we do not have to deal with flags like __PW_CUDA or __PW_HIP. We could probably just write __PW_GPU (as we want the GPU support) and use OFFLOAD_TARGET inside the code, since the toolchain defines it.
Looks good, please go ahead. I've just two comments:

Generally, I'd like to stick with [...]. Regarding the flags, I was planning to change the entire scheme once our two PRs are merged: as we get more accelerated code parts it becomes cumbersome to enable them each individually. Instead there should be a single [...]
Already working on the offload idea. I am checking that things work before opening the PR. Will replace gpu with offload; it is just a matter of search and replace. I will do it incrementally though, so that we can check everything is fine.

I was thinking along those lines as well.

There is something weird going on in the CI. I'll try to add some safe-guards to the [...]
Force-pushed 949da0c to 8503213
src/pw/gpu/pw_gpu_internal.cc (outdated)

    blasSetStream(handle_, streams_);
    is_configured = true;
    }
    printf("hip fft : Initialization done!!!");
It might be that this print statement interferes with the communication between the cp2k shell and the do_regtest.py script.
Hmm, I forgot to remove this statement. It is removed now, since it is pointless.
Thanks for spotting it.
Oh, I use the offload interface to the GPU functions now.
Force-pushed c635710 to ade7404
@mtaillefumier CUDA PASCAL is not passing. HIP PASCAL still has some problems... (forget HIP MI100, the CI is broken there)

The CUDA backend works (fortunately). hip-cuda does not pass compilation because warnings about deprecated functions are treated as errors. I am going to deactivate only this warning for the hip-cuda backend (which is the hip-rocm backend as well).
Force-pushed accd509 to e93d5ef
@mtaillefumier Is an update of ROCm a better solution? I think we can migrate to 4.5...

I updated the Docker file for ROCm, but the problems are multiple.

@mtaillefumier I think the HIP Pascal build is still on ROCm 4.2.
Force-pushed b62359a to 0db40e3
Just wondering, why do you need those casts?

cp2k/src/offload/offload_cuda_internal.h, line 78 in 2113ae8
cp2k/src/offload/offload_hip_internal.h, line 89 in 2113ae8
To avoid a warning, and because cudaMalloc (hipMalloc) requires a void **. I will change the interface to reflect this better if you think it is useful.
Yes, I think you should remove those casts and their inverse - the types [...]. My guess is that you forgot to apply the inverse somewhere, which leads to an invalid pointer that then causes the crashes in [...]
The table ghatmap is garbage when passed to the function. So the question is: why does it work with the NVIDIA backend?
Force-pushed bc9808b to 8184644
Force-pushed 8184644 to 6d5c959
I've just noticed that as a result of this PR, the HIP Pascal benchmark now runs much slower: https://dashboard.cp2k.org/archive/hip-pascal/index.html It was 36-40 min, now it is >80 min. https://dashboard.cp2k.org/archive/cuda-pascal/index.html I cannot see any evident slowdown in the performance tests (https://dashboard.cp2k.org/archive/perf-cuda-volta/index.html).

Looking at a single test, for example:
https://dashboard.cp2k.org/archive/cuda-pascal/commit_2113ae84166846f21ac23000eb6c0415c77e7e05.txt
https://dashboard.cp2k.org/archive/cuda-pascal/commit_17efb48911a60e2c4c27ff53d9e1c09127373232.txt

The PizDaint test is also failing with a compilation error (https://dashboard.cp2k.org/archive/cray-xc50-gnu-psmp/index.html).

@mtaillefumier is this expected?
For the record, on Daint there is an error message just at the beginning of the compilation: [...]

Hmm. Not really expected indeed. This can be improved for sure. I am surprised by the performance penalties, but I can turn off PW_GPU by default and let people choose to turn it on.
We should fix this performance regression, as it not only eats into our cloud budget but also slows down future development. The culprit is probably the allocation of these device resources. Maybe this could be done lazily?
Good catch, @oschuett. Indeed, the previous code was using cuBLAS v1...
Allocating these resources should cost nothing, because cuBLAS is initialized by other dependencies such as DBCSR, so allocating a handle does not explain a 2x performance difference. It costs some time the first time it is initialized, but nothing afterwards. And the handle is allocated only at the beginning, so I do not think it is the culprit. I can replace the dcopy with stride two by a hand-written loop (and do the memset by hand as well) if you really think that is the problem. I think the FFT plan creation is the issue. Let me explore this tomorrow.
Unfortunately, I have to add another problem here. I'm trying to build the current master with HIP support, but no PW stuff. I get errors like [...] and tons of others from the same file.
It might be that DBCSR uses either lazy initialization or keeps the handle even after the library has been finalized. The GPU tests are run in keep-alive mode (previously we used farming), which means that the same CP2K process is re-used to execute multiple input files.
You are right. Looking at the timings, it's a 2x slowdown across the board. Maybe [...]
About performance, with HIP I see literally tons of [...] in my output (this is the H2O-128-RI-dRPA-TZ.inp benchmark).
A specific slowdown of the SCCS test cases indicates a problem with the FFT performance, since SCCS uses FFTs far more heavily than other tests.
It seems that the slowdown is very pronounced for FFT-specific tests. blasCreate is called only once, and I do not think creating a handle, even with so many processes in parallel, should be an issue. @alazzaro it is a rocFFT-specific issue, see ROCm/rocFFT#343; one way to work around it is to do out-of-place transforms. I can modify the code in that direction.
It is an attempt to make FFT calls on the GPU more generic than we had before. It is now possible to run FFTs on both NVIDIA and AMD GPUs more transparently, with the specialization occurring at the lowest level.