
Conversation

@mtaillefumier
Contributor

This is an attempt to make the FFT calls on GPU more generic than what we had before. It is now possible to run FFTs on both NVIDIA and AMD GPUs more transparently, with the specialization occurring at the lowest level.

  • Adding, for instance, Intel support for their next-gen HPC GPUs reduces to one header file and writing the gather and scatter kernels (a rough sketch of such kernels follows this list).
  • It would also be possible to use the same header file for the NVIDIA and AMD kernels since they are actually the same. For now they are separated.
  • NB: hipFFT supports both the rocFFT and cuFFT libraries.
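
For illustration only, a rough sketch of the shape of such gather/scatter kernels (the kernel names, the double2 coefficient type and the launch layout here are purely indicative, not the actual kernels of this PR; ghatmap is the index map between plane-wave coefficients and grid points):

__global__ void pw_gather_z(double2 *pwcc, const double2 *c,
                            const int *ghatmap, const int ngpts) {
  /* one thread per plane-wave coefficient: pull the mapped grid point */
  const int igpt = blockIdx.x * blockDim.x + threadIdx.x;
  if (igpt < ngpts) {
    pwcc[igpt] = c[ghatmap[igpt]];
  }
}

__global__ void pw_scatter_z(double2 *c, const double2 *pwcc,
                             const int *ghatmap, const int ngpts) {
  /* inverse operation: push each coefficient back to its grid point */
  const int igpt = blockIdx.x * blockDim.x + threadIdx.x;
  if (igpt < ngpts) {
    c[ghatmap[igpt]] = pwcc[igpt];
  }
}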

@mtaillefumier
Contributor Author

Can someone trigger the CUDA build?

@oschuett
Member

It's great that you are looking into accelerating the FFTs!

In the past we never saw great speedups because the MPI communication went through the host. Is this still the case?

Also, it seems you rewrote 95% of the code. So, this might be an opportunity to adopt the BSD license like we did for the grid code.

@mtaillefumier
Contributor Author

Unfortunately, I do not think we will gain much with the current state of the code because everything goes through the CPU, so we always pay the price of transferring data back and forth. A proper solution would be to keep the data on the GPU, but that requires more in-depth knowledge of where these routines are called before attempting anything.

I wanted to simplify the code as well because it was clearly old, mysterious and over-complicated. There is technically no new code since it is pretty much the same logic; I just removed unneeded events and used one stream for the calculations. The specialization boils down to one header file and one source file for each GPU backend. All the rest is generic.

I saw that the CI failed due to missing functions. I may have forgotten to push one modification in the toolchain or CI since I had to add new compilation flags.

NB: we may need to have some common header files for the functions that are not really GPU-vendor dependent. The HIP API matches the CUDA API to a very large extent, so we could simply have something like gpuMemcpyAsync(...) instead of hipMemcpyAsync (or its CUDA version).
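
As a rough illustration of the idea (the __OFFLOAD_CUDA/__OFFLOAD_HIP guard names below are only placeholders, not flags that exist today):

#if defined(__OFFLOAD_CUDA)
#include <cuda_runtime.h>
typedef cudaStream_t gpuStream_t;
/* one common name, dispatched to the vendor call at compile time */
static inline void gpuMemcpyAsyncHtoD(void *dst, const void *src, size_t n,
                                      gpuStream_t stream) {
  cudaMemcpyAsync(dst, src, n, cudaMemcpyHostToDevice, stream);
}
#elif defined(__OFFLOAD_HIP)
#include <hip/hip_runtime.h>
typedef hipStream_t gpuStream_t;
static inline void gpuMemcpyAsyncHtoD(void *dst, const void *src, size_t n,
                                      gpuStream_t stream) {
  hipMemcpyAsync(dst, src, n, hipMemcpyHostToDevice, stream);
}
#endif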

@mtaillefumier mtaillefumier force-pushed the hip_fft branch 2 times, most recently from 3d83a66 to 949da0c Compare January 11, 2022 10:03
@oschuett
Member

I wanted to simplify the code as well because it was clearly old, mysterious and over-complicated.

To my knowledge this is the earliest CUDA code in CP2K. Having it in good working order and with HIP support is already great progress and should buy us some time.

A proper solution would be to keep the data on the GPU, but that requires more in-depth knowledge of where these routines are called before attempting anything.

Yes, eventually we'll have to bite the bullet and rewrite the plane-wave code from scratch, but it's not yet top of the list IMHO.
Also, I'm hoping that SpFFT will do most of the heavy lifting ;-)

NB: we may need to have some common header files for the functions that are not really GPU-vendor dependent.

Agreed, I'll add this to offload_library.c shortly. Which CUDA/HIP functions do you need?

@mtaillefumier
Contributor Author

mtaillefumier commented Jan 11, 2022

I wanted to simplify the code as well because it was clearly old, mysterious and over-complicated.

To my knowledge this is the earliest CUDA code in CP2K. Having it in good working order and with HIP support is already great progress and should buy us some time.

A proper solution would be to keep the data on the GPU, but that requires more in-depth knowledge of where these routines are called before attempting anything.

Yes, eventually we'll have to bite the bullet and rewrite the plane-wave code from scratch, but it's not yet top of the list IMHO. Also, I'm hoping that SpFFT will do most of the heavy lifting ;-)

It is indeed my plan to rewrite the FFT code with SpFFT. Simon designed it very cleanly and we have used it in Sirius for quite some time now. I did this PR because it would be helpful for some people.

NB: we may need to have some common header files for the functions that are not really GPU-vendor dependent.

Agreed, I'll add this to offload_library.c shortly. Which CUDA/HIP functions do you need?

I can also do it since I have the header files already written. But basically

#include <stdio.h>
#include <stdlib.h>
#include <hip/hip_runtime.h>

#define gpuStream_t hipStream_t
#define gpuEvent_t hipEvent_t

/*******************************************************************************
 * \brief Check given Hip status and upon failure abort with a nice message.
 * \author Ole Schuett
 ******************************************************************************/
#define CHECK(status)                                                         \
  if (status != hipSuccess) {                                                 \
    fprintf(stderr, "ERROR: %s %s %d\n", hipGetErrorString(status), __FILE__, \
            __LINE__);                                                        \
    abort();                                                                  \
  }

static inline void gpuMemsetAsync(void *ptr__, int val__, size_t size__,
                                  gpuStream_t stream__) {
  CHECK(hipMemsetAsync(ptr__, val__, size__, stream__));
}

static inline void gpuMemcpyAsyncHtoD(void *ptr1__, const void *ptr2__,
                                      size_t size__, gpuStream_t stream__) {
  CHECK(hipMemcpyAsync(ptr1__, ptr2__, size__, hipMemcpyHostToDevice, stream__));
}

static inline void gpuMemcpyAsyncDtoH(void *ptr1__, const void *ptr2__,
                                      size_t size__, gpuStream_t stream__) {
  CHECK(hipMemcpyAsync(ptr1__, ptr2__, size__, hipMemcpyDeviceToHost, stream__));
}

static inline void gpuEventCreate(gpuEvent_t *event__) {
  CHECK(hipEventCreate(event__));
}

static inline void gpuEventDestroy(gpuEvent_t event__) {
  CHECK(hipEventDestroy(event__));
}

static inline void gpuStreamCreate(gpuStream_t *stream__) {
  CHECK(hipStreamCreate(stream__));
}

static inline void gpuStreamDestroy(gpuStream_t stream__) {
  CHECK(hipStreamDestroy(stream__));
}

static inline void gpuEventSynchronize(gpuEvent_t event__) {
  CHECK(hipEventSynchronize(event__));
}

static inline void gpuStreamSynchronize(gpuStream_t stream__) {
  CHECK(hipStreamSynchronize(stream__));
}

static inline void gpuEventRecord(gpuEvent_t event__, gpuStream_t stream__) {
  CHECK(hipEventRecord(event__, stream__));
}

static inline void gpuMalloc(void **ptr__, size_t size__) {
  CHECK(hipMalloc(ptr__, size__));
}

static inline void gpuFree(void *ptr__) {
  CHECK(hipFree(ptr__));
}

static inline void gpuStreamWaitEvent(gpuStream_t stream__, gpuEvent_t event__,
                                      const unsigned int flags__) {
  CHECK(hipStreamWaitEvent(stream__, event__, flags__));
}


Something like this is enough for most cases. The thing is that HIP already provides this, but I think it is a bad idea to make HIP (not ROCm) a hard dependency for the GPU support. Most of the time we use a limited set of functions, and these are the ones used in grid, dbm and pw, for instance.

I can prepare a PR just for this; it would not take me long at all. Also, we have this variable OFFLOAD_TARGET that we use in the CI (I think, correct me if I am wrong). I was thinking of using it in the code as well so we do not have to deal with things like __PW_CUDA or __PW_HIP. We could probably just write __PW_GPU (as we want the GPU support) and use OFFLOAD_TARGET inside the code since the toolchain defines it.

@oschuett
Member

oschuett commented Jan 11, 2022

I can also do it since I have the header files already written. But basically....

Looks good, please go ahead. I've just two comments:

  • I'd prefer offload instead of gpu as namespace prefix. The header file could e.g. be called offload_operations.h.
  • I think typedefs produce nicer error messages, i.e. typedef hipStream_t offloadStream_t;

We could probably just write __PW_GPU (as we want the GPU support) and use the OFFLOAD_TARGET inside the code since the toolchain defines it.

Generally, I'd like to stick with OFFLOAD instead of GPU as it's more distinct.

Regarding the flags, I was planning to change the entire scheme once our two PRs are merged: as we get more accelerated code parts it becomes cumbersome to enable them each individually. Instead there should be a single __OFFLOAD flag to enable them all and then individual __NO_OFFLOAD_FOO flags to disable some parts for debugging.
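
As a sketch of how that could look in the preprocessor (the derived __OFFLOAD_PW/__OFFLOAD_GRID/__OFFLOAD_DBM names are only placeholders for whichever parts end up being offloaded):

/* one master switch, with per-component opt-outs for debugging */
#if defined(__OFFLOAD) && !defined(__NO_OFFLOAD_PW)
#define __OFFLOAD_PW
#endif

#if defined(__OFFLOAD) && !defined(__NO_OFFLOAD_GRID)
#define __OFFLOAD_GRID
#endif

#if defined(__OFFLOAD) && !defined(__NO_OFFLOAD_DBM)
#define __OFFLOAD_DBM
#endif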

@mtaillefumier
Contributor Author

I can also do it since I have the header files already written. But basically....

Looks good, please go ahead. I've just two comments:

  • I'd prefer offload instead of gpu as namespace prefix. The header file could e.g. be called offload_operations.h.

  • I think typedefs produce nicer error messages, i.e. typedef hipStream_t offloadStream_t;

We could probably just write __PW_GPU (as we want the GPU support) and use the OFFLOAD_TARGET inside the code since the toolchain defines it.

Generally, I'd like to stick with OFFLOAD instead of GPU as it's more distinct.

Regarding the flags, I was planning to change the entire scheme once our two PRs are merged: as we get more accelerated code parts it becomes cumbersome to enable them each individually. Instead there should be a single __OFFLOAD flag to enable them all and then individual __NO_OFFLOAD_FOO flags to disable some parts for debugging.

I am already working on the offload idea. I am checking that things work before opening the PR. I will replace gpu with offload; it is just a matter of search and replace.

I will do it incrementally though so that we can check everything is fine.

@mtaillefumier
Contributor Author

Generally, I'd like to stick with OFFLOAD instead of GPU as it's more distinct.

Regarding the flags, I was planning to change the entire scheme once our two PRs are merged: as we get more accelerated code parts it becomes cumbersome to enable them each individually. Instead there should be a single __OFFLOAD flag to enable them all and then individual __NO_OFFLOAD_FOO flags to disable some parts for debugging.

I was thinking along those lines as well.

@oschuett
Member

There is something weird going on in the CI. I'll try to add some safe-guards to the do_regtest.py script to debug this.

blasSetStream(handle_, streams_);
is_configured = true;
}
printf("hip fft : Initialization done!!!");
Member

It might be that this print statement interferes with the communication between the cp2k shell and the do_regtest.py script.

Contributor Author

Hmm, I forgot to remove this statement. I have removed it now since it is pointless.

Contributor Author

Thanks for spotting it.

Contributor Author

Oh, and I use the offload interface to the GPU functions now.

@mtaillefumier mtaillefumier force-pushed the hip_fft branch 9 times, most recently from c635710 to ade7404 Compare January 14, 2022 16:21
@alazzaro
Member

@mtaillefumier CUDA Pascal is not passing. HIP Pascal still has some problems... (forget HIP MI100, the CI is broken there)

@mtaillefumier
Contributor Author

The CUDA backend works (fortunately). hip-cuda does not pass compilation because warnings about deprecated functions are treated as errors. I am going to deactivate only this warning for the hip-cuda backend (which is also the hip-rocm backend).

@mtaillefumier mtaillefumier force-pushed the hip_fft branch 2 times, most recently from accd509 to e93d5ef Compare January 17, 2022 10:09
@alazzaro
Member

@mtaillefumier Is an update of ROCm a better solution? I think we can migrate to 4.5...

@mtaillefumier
Contributor Author

I updated the Docker file for ROCm, but there are multiple problems.

  • Compilation issues are one thing. I mitigated some of them (because I did things wrong), but others are really ROCm/HIP related.
  • The second issue I have is actually more annoying: the hipFFT and hipBLAS packages provided by the ROCm distros do not support the CUDA backend by default. I have to compile hipFFT and hipBLAS by hand in order to test the code with the CI. The resources I have access to for testing on AMD hardware are fluctuating as well, but I can run the regtests on MI100 today at least.

@alazzaro
Member

@mtaillefumier I think the HIP Pascal runner is still on ROCm 4.2.

@mtaillefumier mtaillefumier force-pushed the hip_fft branch 8 times, most recently from b62359a to 0db40e3 Compare January 20, 2022 16:18
@oschuett
Member

Just wondering, why do you need those casts?

OFFLOAD_CHECK(cudaMalloc((void **)ptr__, size__));

OFFLOAD_CHECK(hipMalloc((void **)ptr__, size__));

@mtaillefumier
Contributor Author

mtaillefumier commented Jan 21, 2022

To avoid a warning, and because cudaMalloc (hipMalloc) requires void **. I will change the interface to reflect this better if you think it is useful.
I cannot figure out why the exact same code works with CUDA but fails on both rocm and hip-cuda. I tracked down all API errors but there are none. So far, the same kernel code compiled directly with nvcc works, while it fails when compiled through hip -> nvcc.

@oschuett
Member

oschuett commented Jan 21, 2022

To avoid a warning, and because cudaMalloc (hipMalloc) requires void **. I will change the interface to reflect this better if you think it is useful.

Yes, I think you should remove those casts and their inverse - the types void* and void** are simply not the same.

My guess is that you forgot to apply the inverse somewhere, which leads to an invalid pointer that then causes the crashes in offloadMemcpyAsyncDtoH.
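
To make this concrete, here is a minimal sketch of the pattern I have in mind, assuming the wrapper is changed to take void ** directly (offloadMalloc here and the variable names are only illustrative):

/* the wrapper takes void **, so the caller passes the address of its pointer */
static inline void offloadMalloc(void **ptr, const size_t size) {
  OFFLOAD_CHECK(cudaMalloc(ptr, size));
}

int *ghatmap_dev = NULL;
offloadMalloc((void **)&ghatmap_dev, ngpts * sizeof(int)); /* &ptr, not ptr */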

@mtaillefumier
Contributor Author

My guess is that you forgot to apply the inverse somewhere, which leads to an invalid pointer that then causes the crashes in offloadMemcpyAsyncDtoH.

The ghatmap table is garbage when passed to the function. So the question is: why does it work with the NVIDIA backend?

@mtaillefumier mtaillefumier force-pushed the hip_fft branch 2 times, most recently from bc9808b to 8184644 Compare January 21, 2022 14:55
@oschuett oschuett merged commit 7c9ba87 into cp2k:master Jan 21, 2022
@alazzaro
Member

I've just noticed that, as a result of this PR, the HIP Pascal benchmark now runs much slower:

https://dashboard.cp2k.org/archive/hip-pascal/index.html

It was 36-40 min, now it is >80 min.
A similar slowdown shows up on the CUDA Pascal tests too:

https://dashboard.cp2k.org/archive/cuda-pascal/index.html

I cannot see any evident slowdown in the performance tests (https://dashboard.cp2k.org/archive/perf-cuda-volta/index.html).

Looking at a single test, for example

https://dashboard.cp2k.org/archive/cuda-pascal/commit_2113ae84166846f21ac23000eb6c0415c77e7e05.txt

>>> /workspace/cp2k/regtesting/TEST-local_cuda-psmp-2022-01-20_19-08-01/QS/regtest-sccs-1
    H2O_sccs_td_fft.inp                                                            OK (   7.52 sec)
    H2O_sccs_td_cd5_fg.inp                                                         OK (   9.16 sec)
<<< /workspace/cp2k/regtesting/TEST-local_cuda-psmp-2022-01-20_19-08-01/QS/regtest-sccs-1 (180 of 300) done in 16.68 sec

https://dashboard.cp2k.org/archive/cuda-pascal/commit_17efb48911a60e2c4c27ff53d9e1c09127373232.txt

>>> /workspace/cp2k/regtesting/TEST-local_cuda-psmp-2022-01-24_11-07-40/QS/regtest-sccs-1
    H2O_sccs_td_fft.inp                                                            OK (  16.79 sec)
    H2O_sccs_td_cd5_fg.inp                                                         OK (  19.91 sec)
<<< /workspace/cp2k/regtesting/TEST-local_cuda-psmp-2022-01-24_11-07-40/QS/regtest-sccs-1 (180 of 300) done in 36.71 sec

The Piz Daint test is also failing with a compilation error (https://dashboard.cp2k.org/archive/cray-xc50-gnu-psmp/index.html).

@mtaillefumier is this expected?

@alazzaro
Member

For the record on Daint, there is an error message just at the beginning of the compilation:

make -C /scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/exts/dbcsr -f /scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/exts/build_dbcsr/Makefile \
   ARCHFILE=/scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/arch/CRAY-XC50-gnu.psmp \
   LIBDIR=/scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/lib/CRAY-XC50-gnu/psmp/exts/dbcsr \
   OBJDIR=/scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/obj/CRAY-XC50-gnu/psmp/exts/dbcsr \
   USE_ACCEL="cuda" \
   ACC="nvcc" \
   ACCFLAGS="-D__parallel -D__SCALAPACK -D__FFTW3 -D__MPI_VERSION=3 -D__MAX_CONTR=4 -D__CHECK_DIAG -D__DBCSR_ACC -D__GRID_CUDA -D__PLUMED2 -D__ELPA -D__ELPA_NVIDIA_GPU -D__LIBVORI -D__LIBXC -D__LIBINT -D__SPGLIB -D__LIBXSMM -D__SPFFT -D__SPLA -D__HDF5 -D__LIBVDWXC -D__SIRIUS -D__COSMA -D__GSL -O3 -Xcompiler="-fopenmp" -arch sm_60 --std=c++11"
Removing stale archives for psmp ... 
Removing stale archives ... 
Resolving dependencies ... 
Resolving dependencies for psmp ... 
makedep error: Multiple source files with the same basename: pw_cuda_z
make[2]: *** [/scratch/snx3000/mkrack/rt/CRAY-XC50-gnu/cp2k/Makefile:482: makedep] Error 1
make[2]: *** Waiting for unfinished jobs....

@mtaillefumier
Contributor Author

Hmm, not really expected indeed. This can be improved for sure. I am surprised by the performance penalties, but I can turn off PW_GPU by default and let people choose to turn it on.

  • The compilation error is really strange because pw_cuda_z.cu is only included in the file list when CUDA is activated. One of the rules for find must be incorrect; it is the only possibility.

@oschuett
Member

We should fix this performance regression as it not only eats into our cloud budget but also slows down future development. The culprit is probably the allocation of these device resources. Maybe this could be done lazily?
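
A minimal sketch of what lazy creation could look like, reusing the generic blasHandle_t/blasCreate wrappers from this PR (the helper name and the ready flag are only illustrative):

static blasHandle_t handle_;
static bool handle_ready_ = false;

static blasHandle_t get_blas_handle() {
  /* create the handle on first use instead of at initialization time */
  if (!handle_ready_) {
    blasCreate(&handle_);
    handle_ready_ = true;
  }
  return handle_;
}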

@alazzaro
Member

We should fix this performance regression as it not only eats into our cloud budget but also slows down future development. The culprit is probably the allocation of these device resources. Maybe this could be done lazily?

Good catch, @oschuett. Indeed, the previous code was using cuBLAS v1...

@mtaillefumier
Contributor Author

We should fix this performance regression as it not only eats into our cloud budget but also slows down future development. The culprit is probably the allocation of these device resources. Maybe this could be done lazily?

Allocating these resources should cost nothing because cuBLAS is initialized by other dependencies such as DBCSR, so allocating a handle does not explain a 2x performance difference. It costs some time the first time it is initialized, but nothing afterwards. And the handle is allocated only at the beginning, so I do not think it is the culprit.

I can replace the dcopy with a stride of two by doing it by hand (and doing the memset by hand as well) if you really think that is the problem. I think the FFT plan creation is the issue. Let me explore this tomorrow.

@alazzaro
Member

Unfortunately, I have to add another problem here. I'm trying to build the current master with HIP support, but no PW stuff.
This is my arch file:

LMAX=5

CC       = cc
CXX      = CC
FC       = ftn
LD       = ftn
AR       = ar -r
DFLAGS   = -D__FFTW3 -D__parallel -D__SCALAPACK
DFLAGS  += -D__LIBXSMM
DFLAGS  += -D__LIBINT -D__MAX_CONTR=4
DFLAGS  += -D__LIBXC
DFLAGS  += -D__COSMA
DFLAGS  += -D__GRID_HIP -D__HIP_PLATFORM_AMD__
CFLAGS   = $(DFLAGS) -O3 -fopenmp -fopenmp-simd -mavx2 -mfma -fomit-frame-pointer
CXXFLAGS = $(CFLAGS) -std=c++11 -O3 -fopenmp -fopenmp-simd -mavx2 -mfma -fomit-frame-pointer
FCFLAGS  = $(DFLAGS) -O3 -fopenmp -fopenmp-simd -ftree-vectorize -fomit-frame-pointer -funroll-loops -mavx2 -mfma -g -fallow-argument-mismatch
CFLAGS  += -I${CP2K_LIBS}/libxsmm/include
HIPINC  = -I/opt/rocm-${ROCM_COMPILER_VERSION}/hip/include -I/opt/rocm-${ROCM_COMPILER_VERSION}/rocblas/include
CFLAGS  += $(HIPINC)
CXXFLAGS += $(HIPINC)
FCFLAGS += -I${CP2K_LIBS}/libxsmm/include
FCFLAGS += -I${CP2K_LIBS}/libint-lmax-$(LMAX)/include
FCFLAGS += -I${CP2K_LIBS}/libxc/include
LDFLAGS  = $(FCFLAGS)
LIBS    += -L${CP2K_LIBS}/libxsmm/lib -lxsmmf -lxsmm -ldl
LIBS    += -L${CP2K_LIBS}/libint-lmax-$(LMAX)/lib -lint2 -lstdc++
LIBS    += -L${CP2K_LIBS}/libxc/lib -lxcf03 -lxc
LIBS    += -lfftw3 -lfftw3_omp
LIBS    += -L${CP2K_LIBS}/cosma_rccl_install/lib64 -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lsemiprof

GPUVER         = Mi100
OFFLOAD_CC     = hipcc
OFFLOAD_FLAGS  = -fPIE -D__HIP_PLATFORM_AMD__ --offload-arch=gfx90a -O3 --std=c++11 $(DFLAGS)
OFFLOAD_TARGET = hip
LIBS          += -L/opt/rocm-${ROCM_COMPILER_VERSION}/rocblas/lib -lrocblas -L/opt/rocm-${ROCM_COMPILER_VERSION}/hip/lib -lamdhip64
LIBS          += -L${CP2K_LIBS}/rccl/rccl/build/install/lib -lrccl

I get errors like:

/home/users/alazzaro/cp2k/cp2k_hipfft/src/pw/gpu/pw_gpu_internal.cc:28:8: error: 'blasHandle_t' does not name a type
   28 | static blasHandle_t handle_;
      |        ^~~~~~~~~~~~

and tons of others from the same file pw_gpu_internal.cc. From what I can see, this file gets compiled whenever we ask for HIP, even if PW is not requested...

@oschuett
Member

Allocating these resources should cost nothing because cublas is initialized by others dependencies such as dbcsr.

It might be that DBCSR uses either lazy initialization or keeps the handle even after the library has been finalized. The GPU tests are run in keep-alive mode (previously we used farming), which means that the same CP2K process is re-used to execute multiple input files.

So allocating an handler does not explain 2x performance difference.

You are right. Looking at the timings it's a 2x slowdown across the board. Maybe blasCreate can slow down the workload of another process? After all, the device is shared among 12 processes executing 6 different input files.

@alazzaro
Member

alazzaro commented Jan 24, 2022

About performance, with HIP I see "literally" tons of

Can't find valid buffer assignment with current buffers.
Can't find valid buffer assignment with current buffers.
Can't find valid buffer assignment with current buffers.
Can't find valid buffer assignment with current buffers.

in my output (this is the H2O-128-RI-dRPA-TZ.inp benchmark).
This can be due to a hipFFT problem with ROCm 4.5.0 though...

@mkrack
Member

mkrack commented Jan 24, 2022

A specific slowdown of the SCCS test cases indicates a problem with the FFT performance, since SCCS makes extraordinarily heavy use of FFTs compared to other tests.

@mtaillefumier
Contributor Author

It seems that the slowdown is very pronounced for FFT-specific tests. blasCreate is called only once, and I do not think creating a handle, even with so many processes in parallel, should be an issue.
I am revising the code: adding the functionality to reuse FFT plans, using more streams to copy data in parallel, and avoiding offloadMalloc and offloadFree as much as possible. The code is almost done, but I need to run the regtests; then I can open a PR that will hopefully fix this issue.
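
For the plan reuse, something along these lines is what I have in mind (a rough sketch only; the cache size, key and function name are arbitrary here):

#include <hipfft.h>

#define PLAN_CACHE_SIZE 8

static struct {
  int n[3];          /* grid dimensions used as the cache key */
  hipfftHandle plan;
  int valid;
} plan_cache_[PLAN_CACHE_SIZE];

static hipfftHandle get_plan_z3d(const int n[3]) {
  /* reuse an existing 3D Z2Z plan for the same grid instead of re-creating it */
  for (int i = 0; i < PLAN_CACHE_SIZE; i++) {
    if (plan_cache_[i].valid && plan_cache_[i].n[0] == n[0] &&
        plan_cache_[i].n[1] == n[1] && plan_cache_[i].n[2] == n[2])
      return plan_cache_[i].plan;
  }
  for (int i = 0; i < PLAN_CACHE_SIZE; i++) {
    if (!plan_cache_[i].valid) {
      hipfftPlan3d(&plan_cache_[i].plan, n[0], n[1], n[2], HIPFFT_Z2Z);
      plan_cache_[i].n[0] = n[0];
      plan_cache_[i].n[1] = n[1];
      plan_cache_[i].n[2] = n[2];
      plan_cache_[i].valid = 1;
      return plan_cache_[i].plan;
    }
  }
  /* cache full: create a one-off plan (a real implementation would evict) */
  hipfftHandle plan;
  hipfftPlan3d(&plan, n[0], n[1], n[2], HIPFFT_Z2Z);
  return plan;
}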

@alazzaro it is a rocFFT-specific issue. See here: ROCm/rocFFT#343

One way to work around it is to do out-of-place transforms. I can modify the code in that direction.
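
Concretely, the execution call would simply be given distinct input and output device buffers, e.g. (sketch only; the buffer names are illustrative):

/* in-place:     hipfftExecZ2Z(plan, data_dev, data_dev, HIPFFT_FORWARD);   */
/* out-of-place: transform into a separate scratch buffer on the device     */
hipfftExecZ2Z(plan, data_dev, scratch_dev, HIPFFT_FORWARD);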

@mtaillefumier mtaillefumier deleted the hip_fft branch August 18, 2022 10:14