
Conversation

@oschuett (Member) commented on Jan 9, 2022:

No description provided.

@mtaillefumier (Contributor) commented:

Interesting work. I would suggest one improvement that does not involve much effort: would it be possible to have a more generic interface for the GPU side? You make direct calls to CUDA, while CP2K will most probably run on AMD hardware very soon.

What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory, as is done for collocate and integrate.

@oschuett (Member, Author) commented:

> Interesting work.

Thank you :-)

> run on AMD hardware very soon.

Yes, it would be great if you could help with the HIP backend. Hopefully, Nvidia will step up for the CUDA backend. The current kernel is really just a placeholder.

> Would it be possible to have a more generic interface for the GPU side?

Yes, the code is already laid out to support multiple backends. One only has to hook into these four routines.

> What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory

Yes, we should definitely create a sub-directory once a backend outgrows a single file.

@mtaillefumier (Contributor) commented:

> > Interesting work.
>
> Thank you :-)

I think anything that makes the code cleaner is welcome. Added bonus if it speeds things up ;)

> > run on AMD hardware very soon.
>
> Yes, it would be great if you could help with the HIP backend. Hopefully, Nvidia will step up for the CUDA backend. The current kernel is really just a placeholder.

I think I can get that running quickly, at least to put things in place for HIP. Kernel optimizations will come later, but for now I can just use the kernels you wrote, since the HIP and CUDA syntaxes are almost the same.

> > Would it be possible to have a more generic interface for the GPU side?
>
> Yes, the code is already laid out to support multiple backends. One only has to hook into these four routines.

> > What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory
>
> Yes, we should definitely create a sub-directory once a backend outgrows a single file.

Then we had better split as soon as possible. I can take care of the split and make a PR on your branch.

@oschuett (Member, Author) commented:

> I think anything that makes the code cleaner is welcome.

Yes, I was able to reduce the code size significantly (>10x), which hopefully allows us to keep up with future hardware changes.

> Added bonus if it speeds things up ;)

According to my measurements, there is already a clear speedup with OpenMP; otherwise it's currently a draw. Since I haven't spent much time on optimization, there should still be plenty of headroom.

> I think I can get that running quickly, at least to put things in place for HIP.

Awesome!

Btw, the code is also pretty much ready for CUDA-aware MPI, but I'm not using it just yet.

@oschuett force-pushed the dbm branch 2 times, most recently from 705c653 to b962a41 on January 13, 2022.
@oschuett (Member, Author) commented:

Timings for the GW_PBE_4benzene.inp benchmark are looking good: a 3x speedup with CUDA despite the naïve kernel.

| Code Version | OpenMP | MPI | CUDA |
| --- | --- | --- | --- |
| Master (48211d0) | 355.334 | 53.639 | 508.766 |
| This PR (b962a41) | 325.382 | 54.764 | 168.443 |

The corresponding logs:

| Code Version | OpenMP / MPI | CUDA |
| --- | --- | --- |
| Master (48211d0) | log | log |
| This PR (b962a41) | log | log |

@oschuett marked this pull request as ready for review on January 14, 2022.
@mtaillefumier (Contributor) commented:

That's impressive. It means a lot of time is spent not doing calculations.

@mtaillefumier (Contributor) commented:

Also, just using shared memory in the kernel would reduce the runtime even more. I think the basic kernel from the CUDA documentation can be adapted without too much effort (see the sketch after this list).

  • I already have the HIP support done; I just need to make a PR to your repo so you can look at it. It is basically the same code (not duplicated). I will try to find time to write the dgemm kernel ASAP; I'm just blocked by hip_fft, which misbehaves.
  • Question: why do you use atomicAdd in your kernel?
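
To illustrate the tiling idea mentioned above: each thread block stages TILE x TILE sub-blocks of the operands in shared memory, so every global element is read n/TILE times instead of n times. A minimal sketch with made-up names, assuming square n x n matrices with n a multiple of TILE; it is written in CUDA Fortran (the CUDA C version in NVIDIA's programming guide is analogous) and is not the actual DBM kernel:

```fortran
module tiled_mm
  implicit none
  integer, parameter :: TILE = 16
contains
  ! Launch with dim3(n/TILE, n/TILE) blocks of dim3(TILE, TILE) threads.
  attributes(global) subroutine dgemm_tiled(a, b, c, n)
    integer, value :: n
    real(8), intent(in) :: a(n, n), b(n, n)
    real(8), intent(inout) :: c(n, n)
    real(8), shared :: asub(TILE, TILE), bsub(TILE, TILE)
    real(8) :: acc
    integer :: row, col, t, k

    row = (blockidx%y - 1) * TILE + threadidx%y
    col = (blockidx%x - 1) * TILE + threadidx%x
    acc = 0.0d0

    do t = 0, n - 1, TILE
      ! Each thread loads one element of the A and B tiles.
      asub(threadidx%y, threadidx%x) = a(row, t + threadidx%x)
      bsub(threadidx%y, threadidx%x) = b(t + threadidx%y, col)
      call syncthreads()  ! tiles fully loaded before use
      do k = 1, TILE
        acc = acc + asub(threadidx%y, k) * bsub(k, threadidx%x)
      end do
      call syncthreads()  ! done with tiles before the next iteration
    end do

    ! Each thread owns exactly one element of C in this simple scheme,
    ! so a plain update (no atomics) suffices here.
    c(row, col) = c(row, col) + acc
  end subroutine dgemm_tiled
end module tiled_mm
```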

@abussy (Contributor) commented on Jan 14, 2022:

I did some tests of my own on 16 Piz Daint GPU nodes. The system is 64 water molecules (HF, cc-TZ, ADMM-cpFIT3) with the newly implemented RI-HFX method, which relies heavily on tensor contractions. I tested both the MO flavor (which tends to have good GPU performance) and the RHO flavor (where, in the best case, GPU and non-GPU runs are equally fast). Here are some timings:

MO flavor, 1 MPI rank and 12 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 507 s | 375 s | 494 s |
| forces | 177 s | 180 s | 148 s |

MO flavor, 6 MPI ranks and 2 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 250 s | 138 s | 401 s |
| forces | 72 s | 94 s | 70 s |

RHO flavor, 1 MPI rank and 12 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 582 s | 630 s | 533 s |
| forces | 746 s | 744 s | 801 s |

RHO flavor, 6 MPI ranks and 2 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 278 s | 1115 s | 333 s |
| forces | 517 s | 432 s | 688 s |

Generally, the new library performs similarly to CPU-only runs. In the cases where the current DBCSR tensor backend performs well (MO flavor, with rather dense and large blocks), the new library is not quite there yet. However, it seems to be more robust, in the sense that there are no occurrences where the timings explode (contrary to the GPU SCF in the last table). This looks quite good, and I am very excited to see what a more optimized kernel will yield! I'd be happy to repeat those tests at each new update.

@mtaillefumier (Contributor) commented:

The kernel needs some attention. I understand perfectly that @oschuett did not want to spend time on this until the full code is working. I will give it a try as well.

@oschuett (Member, Author) commented:

Thanks a lot @abussy for running those additional benchmarks!

So, unless there are any objections, I'll merge this PR later today.

Btw, DBCSR's performance fluctuates because of optimizations like auto-tuned kernels and calls to cuBLAS. These optimizations are "binary" in nature, i.e. they either hit or miss. For linear-scaling DFT this was usually not a problem. However, for tensors we apparently need more "continuous" behavior.

Once this PR is merged, I'll open some issues to start the discussions on possible followup improvements.

@mtaillefumier (Contributor) commented:

No objections on my side. I can make a PR specifically for the hip backend afterwards.

@oschuett merged commit be4ecb1 into cp2k:master on Jan 17, 2022.
@oschuett (Member, Author) commented:

The GCC 5 test is failing because that compiler lacks support for ALLOCATE(var, source=...) (see the sketch below).

While we could adopt DBCSR's workaround, I'm wondering if we should keep the clean code and drop GCC 5 support instead?

We dropped GCC 6 over a year ago. Our next release is still many months away. By then GCC 5 will be over 7 years old.
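
For reference, ALLOCATE with SOURCE= takes both the shape and the value from the source expression (Fortran 2008). A minimal sketch of the clean form next to the kind of explicit allocate-and-assign fallback older compilers need (variable names are illustrative, and whether this matches DBCSR's actual workaround is not shown in this thread):

```fortran
program sourced_alloc_demo
  implicit none
  integer, allocatable :: a(:), b(:), c(:)

  a = [10, 20, 30]

  ! F2008 sourced allocation: B takes its shape and values from A.
  ! Per the discussion above, this is the form GCC 5 trips over.
  allocate(b, source=a)

  ! Pre-F2008 style fallback: spell out the shape, then copy.
  allocate(c(size(a)))
  c(:) = a(:)

  print *, b
  print *, c
end program sourced_alloc_demo
```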

@alazzaro (Member) commented:

I vote to drop GCC 5...

@alazzaro (Member) commented:

BTW, CCE is reporting the following error:

```
ftn-1904 crayftn: ERROR DBM_CREATE_FROM_TEMPLATE, File = ../../../../../home/users/alazzaro/cp2k/CP2K_AUTO_REGTEST/CRAY-XC40-cce/psmp/cp2k/src/dbm/dbm_api.F, Line = 271, Column = 39
  Dummy argument "ROW_BLOCK_SIZES" has the CONTIGUOUS and POINTER attributes.  It requires an actual argument with the CONTIGUOUS and POINTER attributes.

ftn-1904 crayftn: ERROR DBM_CREATE_FROM_TEMPLATE, File = ../../../../../home/users/alazzaro/cp2k/CP2K_AUTO_REGTEST/CRAY-XC40-cce/psmp/cp2k/src/dbm/dbm_api.F, Line = 272, Column = 39
  Dummy argument "COL_BLOCK_SIZES" has the CONTIGUOUS and POINTER attributes.  It requires an actual argument with the CONTIGUOUS and POINTER attributes.
```

@oschuett (Member, Author) commented on Jan 18, 2022:

The CCE error appears to be a false alarm. As a workaround, I'll try storing the pointers in intermediate variables to make it easier for the compiler to track the CONTIGUOUS attribute.
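
That workaround pattern would look roughly like this, as a self-contained sketch (the type and component names are invented for illustration and are not the actual dbm_api.F code):

```fortran
module dbm_demo
  implicit none

  ! Invented stand-in for the template type in dbm_api.F.
  type :: template_t
    integer, contiguous, pointer :: row_block_sizes(:) => null()
  end type template_t

contains

  ! A dummy with CONTIGUOUS + POINTER, like the one CCE complained about.
  subroutine consume(sizes)
    integer, contiguous, pointer, intent(in) :: sizes(:)
    print *, "sum of block sizes:", sum(sizes)
  end subroutine consume

end module dbm_demo

program main
  use dbm_demo
  implicit none
  type(template_t) :: template
  integer, contiguous, pointer :: tmp(:)

  allocate(template%row_block_sizes(3))
  template%row_block_sizes = [4, 8, 16]

  ! Passing template%row_block_sizes directly is what CCE flagged.
  ! Staging it in an intermediate pointer makes the CONTIGUOUS and
  ! POINTER attributes explicit at the call site.
  tmp => template%row_block_sizes
  call consume(tmp)
end program main
```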
