Add new sparse matrix library (DBM) and lightweight fork of DBCSR tensor (DBT) #1863
Conversation
Interesting work. I would suggest one improvement that does not involve much work. Would it be possible to have a more generic interface for the GPU side? You make direct calls to CUDA, while CP2K will most probably run on AMD hardware very soon. What I am thinking of is to isolate the CUDA-dependent part in its own dedicated directory, as is done with collocate and integrate.
Thank you :-)
Yes, it would be great if you could help with the HIP backend. Hopefully, Nvidia will step up for the CUDA backend. The current kernel is really just a placeholder.
Yes, the code is already laid out to support multiple backends. One only has to hook into these four routines.
Yes, we should definitely create a sub-directory once a backend outgrows a single file.
I think anything that makes the code cleaner is welcome. Added bonus if it speeds things up ;)
I think I can get that running fast, at least to put things in place for HIP. Kernel optimization will come later, but for now I can just reuse the kernels you wrote, since HIP and CUDA syntax are almost identical.
Yes, I was able to reduce the code size significantly (> 10x) which hopefully allows us to keep up with future hardware changes.
According to my measurements there is already a clear speedup with OpenMP, otherwise it's currently a draw. Since I didn't spend much time on optimization there should still be plenty of headroom.
Awesome! Btw, the code is also pretty much ready for CUDA-aware MPI, but I'm not using it just yet.
Force-pushed from 705c653 to b962a41.
Timings for the GW_PBE_4benzene.inp benchmark are looking good, 3x speedup with CUDA despite the naïve kernel:
That's impressive. It means a lot of time is spent not doing calculations.
Also, just using shared memory in the kernel would reduce the runtime even more. I think the basic kernel from the CUDA documentation can be adapted without too much effort.
I did some tests of my own, on 16 Piz Daint GPU nodes. The system is 64 water molecules (HF, cc-TZ, ADMM-cpFIT3) with the newly implemented RI-HFX method, which relies heavily on tensor contractions. I tested both the MO flavor (which tends to have good GPU performance) and the RHO flavor (in the best case, GPU and non-GPU runs are equally fast). Here are some timings:
- MO flavor, 1 MPI rank and 12 OMP threads per node
- MO flavor, 6 MPI ranks and 2 OMP threads per node
- RHO flavor, 1 MPI rank and 12 OMP threads per node
- RHO flavor, 6 MPI ranks and 2 OMP threads per node
Generally, the new library performs similarly to CPU-only runs. In the cases where the current DBCSR tensor backend performs well (MO flavor, rather dense and large blocks), the new library is not quite there yet. However, it seems to be more robust, in the sense that there are no occurrences where the timings explode (contrary to the GPU SCF in the last table). This looks quite good, and I am very excited to see what a more optimized kernel will yield! I'd be happy to repeat those tests at each new update.
The kernel needs some attention. I understand perfectly that @oschuett did not want to spend time on this until the full code is working. I will give it a try as well. |
Thanks a lot @abussy for running those additional benchmarks! So, unless there are any objections, I'll merge this PR later today. Btw, DBCSR's performance fluctuates because of optimizations like auto-tuned kernels and calls to cuBLAS. These optimizations are "binary" in nature, i.e. it's hit or miss. For linear-scaling DFT this was usually not a problem. However, for tensors we apparently need more "continuous" behavior. Once this PR is merged, I'll open some issues to start the discussion on possible follow-up improvements.
No objections on my side. I can make a PR specifically for the HIP backend afterwards.
The GCC 5 test is failing because it lacks support for While we could adopt DBCSR's workaround, I'm wondering if we should keep the clean code and drop GCC 5 support instead? We dropped GCC 6 over a year ago. Our next release is still many months away. By then GCC 5 will be over 7 years old. |
I vote to drop GCC 5... |
BTW, CCE is reporting the following error: |
The CCE error appears to be a false positive. As a workaround, I'll try storing the pointers in intermediate variables to make it easier for the compiler to track the