
Conversation

@oschuett (Member) commented on Jan 9, 2022:

No description provided.

@mtaillefumier (Contributor) commented:

Interesting work. I would suggest one improvement that does not involve much effort: would it be possible to have a more generic interface for the GPU side? You make direct calls to CUDA, while CP2K will most probably run on AMD hardware very soon.

What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory, as is done for collocate and integrate.

@oschuett (Member, Author) commented:

> Interesting work.

Thank you :-)

> run on AMD hardware very soon.

Yes, it would be great if you could help with the HIP backend. Hopefully, Nvidia will step up for the CUDA backend. The current kernel is really just a placeholder.

> Would it be possible to have a more generic interface for the GPU side?

Yes, the code is already laid out to support multiple backends. One only has to hook into these four routines.

> What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory

Yes, we should definitely create a sub-directory once a backend outgrows a single file.

@mtaillefumier (Contributor) commented:

> > Interesting work.
>
> Thank you :-)

I think anything that makes the code cleaner is welcome. Added bonus if it speeds things up ;)

> > run on AMD hardware very soon.
>
> Yes, it would be great if you could help with the HIP backend. Hopefully, Nvidia will step up for the CUDA backend. The current kernel is really just a placeholder.

I think I can get that running quickly, at least to put things in place for HIP. Kernel optimizations will come later, but for now I can just use the kernels you wrote, since the HIP and CUDA syntaxes are almost the same.

> > Would it be possible to have a more generic interface for the GPU side?
>
> Yes, the code is already laid out to support multiple backends. One only has to hook into these four routines.

> > What I am thinking of is to isolate the CUDA-dependent part into its own dedicated directory
>
> Yes, we should definitely create a sub-directory once a backend outgrows a single file.

Then we had better split as soon as possible. I can take care of the split and make a PR on your branch.

@oschuett (Member, Author) commented:

> I think anything that makes the code cleaner is welcome.

Yes, I was able to reduce the code size significantly (>10x), which hopefully allows us to keep up with future hardware changes.

> Added bonus if it speeds things up ;)

According to my measurements, there is already a clear speedup with OpenMP; otherwise it's currently a draw. Since I haven't spent much time on optimization, there should still be plenty of headroom.

> I think I can get that running quickly, at least to put things in place for HIP.

Awesome!

Btw, the code is also pretty much ready for CUDA-aware MPI, but I'm not using it just yet.

@oschuett force-pushed the dbm branch 2 times, most recently from 705c653 to b962a41 on January 13, 2022.
@oschuett (Member, Author) commented:

Timings for the GW_PBE_4benzene.inp benchmark are looking good: a 3x speedup with CUDA despite the naïve kernel.

| Code Version | OpenMP | MPI | CUDA |
| --- | --- | --- | --- |
| Master (48211d0) | 355.334 | 53.639 | 508.766 |
| This PR (b962a41) | 325.382 | 54.764 | 168.443 |

The corresponding logs:

| Code Version | OpenMP / MPI | CUDA |
| --- | --- | --- |
| Master (48211d0) | log | log |
| This PR (b962a41) | log | log |

@oschuett marked this pull request as ready for review on January 14, 2022.
@mtaillefumier (Contributor) commented:

That's impressive. It means a lot of time is spent not doing calculations.

@mtaillefumier (Contributor) commented:

Also, just using shared memory in the kernel would reduce the runtime even more. I think the basic kernel from the CUDA documentation can be adapted without too much effort (see the sketch after this list).

  • I already have the HIP support done; I just need to make a PR to your repo so you can look at it. It is basically the same code (not duplicated). I will try to find time to write the dgemm kernel ASAP; I'm just blocked by hip_fft, which misbehaves.
  • Question: why do you use atomicAdd in your kernel?
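
To illustrate the tiling idea mentioned above: each thread block stages TILE x TILE sub-blocks of the operands in shared memory, so every global element is read n/TILE times instead of n times. A minimal sketch with made-up names, assuming square n x n matrices with n a multiple of TILE; it is written in CUDA Fortran (the CUDA C version in NVIDIA's programming guide is analogous) and is not the actual DBM kernel:

```fortran
module tiled_mm
  implicit none
  integer, parameter :: TILE = 16
contains
  ! Launch with dim3(n/TILE, n/TILE) blocks of dim3(TILE, TILE) threads.
  attributes(global) subroutine dgemm_tiled(a, b, c, n)
    integer, value :: n
    real(8), intent(in) :: a(n, n), b(n, n)
    real(8), intent(inout) :: c(n, n)
    real(8), shared :: asub(TILE, TILE), bsub(TILE, TILE)
    real(8) :: acc
    integer :: row, col, t, k

    row = (blockidx%y - 1) * TILE + threadidx%y
    col = (blockidx%x - 1) * TILE + threadidx%x
    acc = 0.0d0

    do t = 0, n - 1, TILE
      ! Each thread loads one element of the A and B tiles.
      asub(threadidx%y, threadidx%x) = a(row, t + threadidx%x)
      bsub(threadidx%y, threadidx%x) = b(t + threadidx%y, col)
      call syncthreads()  ! tiles fully loaded before use
      do k = 1, TILE
        acc = acc + asub(threadidx%y, k) * bsub(k, threadidx%x)
      end do
      call syncthreads()  ! done with tiles before the next iteration
    end do

    ! Each thread owns exactly one element of C in this simple scheme,
    ! so a plain update (no atomics) suffices here.
    c(row, col) = c(row, col) + acc
  end subroutine dgemm_tiled
end module tiled_mm
```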

@abussy (Contributor) commented on Jan 14, 2022:

I did some tests of my own on 16 Piz Daint GPU nodes. The system is 64 water molecules (HF, cc-TZ, ADMM-cpFIT3) with the newly implemented RI-HFX method, which relies heavily on tensor contractions. I tested both the MO flavor (which tends to have good GPU performance) and the RHO flavor (where, in the best case, GPU and non-GPU runs are equally fast). Here are some timings:

MO flavor, 1 MPI rank and 12 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 507 s | 375 s | 494 s |
| forces | 177 s | 180 s | 148 s |

MO flavor, 6 MPI ranks and 2 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 250 s | 138 s | 401 s |
| forces | 72 s | 94 s | 70 s |

RHO flavor, 1 MPI rank and 12 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 582 s | 630 s | 533 s |
| forces | 746 s | 744 s | 801 s |

RHO flavor, 6 MPI ranks and 2 OMP threads per node:

| | No GPU | GPU | This PR (with GPU) |
| --- | --- | --- | --- |
| SCF (10 steps) | 278 s | 1115 s | 333 s |
| forces | 517 s | 432 s | 688 s |

Generally, the new library performs similarly to CPU-only runs. In the cases where the current DBCSR tensor backend performs well (MO flavor, with rather dense and large blocks), the new library is not quite there yet. However, it seems to be more robust, in the sense that there are no occurrences where the timings explode (contrary to the GPU SCF in the last table). This looks quite good, and I am very excited to see what a more optimized kernel will yield! I'd be happy to repeat those tests at each new update.

@mtaillefumier (Contributor) commented:

The kernel needs some attention. I understand perfectly that @oschuett did not want to spend time on this until the full code is working. I will give it a try as well.

@oschuett (Member, Author) commented:

Thanks a lot @abussy for running those additional benchmarks!

So, unless there are any objections, I'll merge this PR later today.

Btw, DBCSR's performance fluctuates because of optimizations like auto-tuned kernels and calls to cuBLAS. These optimizations are "binary" in nature, i.e. they either hit or miss. For linear-scaling DFT this was usually not a problem. However, for tensors we apparently need more "continuous" behavior.

Once this PR is merged, I'll open some issues to start the discussions on possible followup improvements.

@mtaillefumier (Contributor) commented:

No objections on my side. I can make a PR specifically for the hip backend afterwards.

@oschuett merged commit be4ecb1 into cp2k:master on Jan 17, 2022.
@oschuett (Member, Author) commented:

The GCC 5 test is failing because that compiler lacks support for ALLOCATE(var, source=...) (see the sketch below).

While we could adopt DBCSR's workaround, I'm wondering if we should keep the clean code and drop GCC 5 support instead?

We dropped GCC 6 over a year ago. Our next release is still many months away. By then GCC 5 will be over 7 years old.
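
For reference, ALLOCATE with SOURCE= takes both the shape and the value from the source expression (Fortran 2008). A minimal sketch of the clean form next to the kind of explicit allocate-and-assign fallback older compilers need (variable names are illustrative, and whether this matches DBCSR's actual workaround is not shown in this thread):

```fortran
program sourced_alloc_demo
  implicit none
  integer, allocatable :: a(:), b(:), c(:)

  a = [10, 20, 30]

  ! F2008 sourced allocation: B takes its shape and values from A.
  ! Per the discussion above, this is the form GCC 5 trips over.
  allocate(b, source=a)

  ! Pre-F2008 style fallback: spell out the shape, then copy.
  allocate(c(size(a)))
  c(:) = a(:)

  print *, b
  print *, c
end program sourced_alloc_demo
```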

@alazzaro (Member) commented:

I vote to drop GCC 5...

@alazzaro (Member) commented:

BTW, CCE is reporting the following error:

```
ftn-1904 crayftn: ERROR DBM_CREATE_FROM_TEMPLATE, File = ../../../../../home/users/alazzaro/cp2k/CP2K_AUTO_REGTEST/CRAY-XC40-cce/psmp/cp2k/src/dbm/dbm_api.F, Line = 271, Column = 39
  Dummy argument "ROW_BLOCK_SIZES" has the CONTIGUOUS and POINTER attributes.  It requires an actual argument with the CONTIGUOUS and POINTER attributes.

ftn-1904 crayftn: ERROR DBM_CREATE_FROM_TEMPLATE, File = ../../../../../home/users/alazzaro/cp2k/CP2K_AUTO_REGTEST/CRAY-XC40-cce/psmp/cp2k/src/dbm/dbm_api.F, Line = 272, Column = 39
  Dummy argument "COL_BLOCK_SIZES" has the CONTIGUOUS and POINTER attributes.  It requires an actual argument with the CONTIGUOUS and POINTER attributes.
```

@oschuett (Member, Author) commented on Jan 18, 2022:

The CCE error appears to be a false alarm. As a workaround, I'll try storing the pointers in intermediate variables to make it easier for the compiler to track the CONTIGUOUS attribute.
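
That workaround pattern would look roughly like this, as a self-contained sketch (the type and component names are invented for illustration and are not the actual dbm_api.F code):

```fortran
module dbm_demo
  implicit none

  ! Invented stand-in for the template type in dbm_api.F.
  type :: template_t
    integer, contiguous, pointer :: row_block_sizes(:) => null()
  end type template_t

contains

  ! A dummy with CONTIGUOUS + POINTER, like the one CCE complained about.
  subroutine consume(sizes)
    integer, contiguous, pointer, intent(in) :: sizes(:)
    print *, "sum of block sizes:", sum(sizes)
  end subroutine consume

end module dbm_demo

program main
  use dbm_demo
  implicit none
  type(template_t) :: template
  integer, contiguous, pointer :: tmp(:)

  allocate(template%row_block_sizes(3))
  template%row_block_sizes = [4, 8, 16]

  ! Passing template%row_block_sizes directly is what CCE flagged.
  ! Staging it in an intermediate pointer makes the CONTIGUOUS and
  ! POINTER attributes explicit at the call site.
  tmp => template%row_block_sizes
  call consume(tmp)
end program main
```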
