
Conversation

@JWilhelm
Member

  • low memory footprint of the tensors is achieved by repeated calculation of the 3-center integrals; the memory can be brought down to almost zero by recomputing the 3-center integrals more often (a minimal sketch of this trade-off follows at the end of this comment)

  • scalability should be excellent because every costly operation can be executed in small subgroups; only very little communication between subgroups is needed at the end of tensor operations

  • suggestion for an easy, user-friendly input:

    &PROPERTIES
      &BANDSTRUCTURE
        &GW
          NUM_TIME_FREQ_POINTS 6
        &END GW
      &END BANDSTRUCTURE
    &END PROPERTIES

Inside the &BANDSTRUCTURE section, one could also include SOC shifts of the electronic levels (in collaboration with Anna Hehn), the local density of states, the projected density of states, semi-empirical GW, etc.

Any ideas for improvement will be welcome!

  • still many to-dos: flexible subgroup sizes, subgroups for the k-point diagonalization, scaling tests on very large unit cells, and tests of the code on GPUs
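To illustrate the memory-vs-recomputation trade-off from the first bullet, here is a minimal Python sketch (the actual implementation is the Fortran code in this PR; the function compute_3c_batch and all dimensions are hypothetical placeholders):

    import numpy as np

    # Toy dimensions; in the real calculations n_ao and n_ri are tens of thousands.
    n_ao, n_ri, batch = 80, 240, 40

    def compute_3c_batch(p0, p1):
        """Hypothetical stand-in for (re)computing the 3-center integrals
        (mu nu | P) for the RI functions P in [p0, p1) on the fly."""
        rng = np.random.default_rng(p0)
        return rng.standard_normal((n_ao, n_ao, p1 - p0))

    G = np.random.default_rng(1).standard_normal((n_ao, n_ao))  # Green's-function-like matrix

    # Accumulate a polarizability-like 2-index quantity X_PQ without ever storing
    # the full (n_ao, n_ao, n_ri) tensor of 3-center integrals: only one RI batch
    # is held in memory at a time, and batches are recomputed whenever needed.
    X = np.zeros((n_ri, n_ri))
    for p0 in range(0, n_ri, batch):
        p1 = min(p0 + batch, n_ri)
        T_p = compute_3c_batch(p0, p1)                 # recomputed, not stored
        A_p = np.einsum('mnp,ml->lnp', T_p, G)         # contract with G
        A_p = np.einsum('lnp,ns->lsp', A_p, G)         # and again with G
        for q0 in range(0, n_ri, batch):
            q1 = min(q0 + batch, n_ri)
            T_q = compute_3c_batch(q0, q1)             # recomputed once more here
            X[p0:p1, q0:q1] = np.einsum('lsp,lsq->pq', A_p, T_q)

Smaller batches lower the peak memory further at the price of recomputing the 3-center integrals more often, which is exactly the knob described above.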

@abussy
Contributor

abussy commented Aug 11, 2023

Do these changes only apply to periodic (I guess k-point) GW? Or do they also affect Gamma-point low-scaling RPA and SOS-MP2? I would guess that you get memory issues because of the nimg^2 separate tensors you need to store the 3-center integrals (i^0 j^a | k^b), with a, b being cell indices?

@JWilhelm
Member Author

Do these changes only apply to periodic (I guess k-point) GW? Or do they also affect Gamma-point low-scaling RPA and SOS-MP2? I would guess that you get memory issues because of the nimg^2 separate tensors you need to store the 3-center integrals (i^0 j^a | k^b), with a, b being cell indices?

The PR is a rewrite of the periodic GW algorithm reported in

https://arxiv.org/abs/2306.16066

The full periodic GW algorithm is given in the SI; the 3-center integrals (µν|P) are summed over the cell index, i.e. (µν|P) do not carry the cell indices a, b, see

[image: equation from the SI]

The memory-saving strategy can probably also be applied to the low-scaling RPA and SOS-MP2 energies and forces.

The memory-saving algorithm is contained in the files

gw_methods.F
gw_types.F
gw_utils.F
gw_communication.F

So, the new code does not affect the existing low-scaling RPA, SOS-MP2 energy and force algorithms.

I have not yet benchmarked the low-memory GW algorithm for large-scale applications (the ultimate goal is to tackle a 2D moiré structure with 10,000 atoms in a TZVP-MOLOPT basis; let's see whether this will be realistic...).

@abussy
Contributor

abussy commented Aug 11, 2023

Thanks for the input! I'll make sure to read your paper. I was wondering because, in my experience with low-scaling RPA and SOS-MP2 forces, storing the integrals never seemed to be an issue since they tend to be very sparse.

@JWilhelm
Member Author

The memory issue arises for calculations with > 500 atoms using TZVP-MOLOPT basis sets. I agree that the memory for storing the 3c integrals is not too high: for the largest calculation on 984 atoms from https://arxiv.org/abs/2306.16066, the memory for the 3c integrals is

occupancy * N_AO^2 * N_RI * 8 bytes = 0.657% * 22632^2 * 49441 * 8 B ≈ 1.3 TB

(input and output here: https://github.com/JWilhelm/Inputs_outputs_low_scaling_GW_TMDC/tree/main/Figure_4/Twist_angle_19.4_degree_984_atoms)

The occupancy of the 3-index tensors increases to 11.5% when multiplying with the matrix D_mu nu(it), giving 23 TB (which can be reduced again using the memory cut). In my calculations on cells with more than 500 atoms in a TZVP-MOLOPT basis, I constantly had memory issues, which I attribute to inefficient and imbalanced memory usage of the dbcsr/dbt library for large-scale calculations.
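For reference, a quick sketch to reproduce the two estimates above from the occupancies and tensor dimensions of the 984-atom calculation:

    # Memory of the block-sparse 3-index tensor: occupancy * N_AO^2 * N_RI * 8 bytes
    n_ao, n_ri = 22632, 49441

    def tensor_memory_tb(occupancy):
        return occupancy * n_ao**2 * n_ri * 8 / 1e12   # terabytes, double precision

    print(tensor_memory_tb(0.00657))   # ~1.3 TB for the bare 3c integrals (mu nu | P)
    print(tensor_memory_tb(0.115))     # ~23 TB after contraction with D_mu nu(it)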

Probably, the memory issues are less important for molecular dynamics with MP2/RPA because one is restricted to unit cells with fewer than 500 atoms anyway. For the 2D moiré structures, the interesting effects such as atomic reconstruction happen for structures with 1,000 to 100,000 atoms in the unit cell...

@oschuett
Member

oschuett commented Aug 15, 2023

There seem to be two opposing trends in hardware:

  • GPUs provide an abundance of FLOPs. So, if we eventually manage to offload the ERIs to the GPU, we may no longer need to store them.
  • Machines with 1.5 TB of main memory are becoming available.

Furthermore, I've been wondering for a while if we could take advantage of local SSDs, which could provide over 10 TB of high-bandwidth storage?

@JWilhelm
Member Author

JWilhelm commented Aug 15, 2023

  • Concerning GPUs: Do you think we could use LibintX for the ERIs on GPUs (https://pubs.acs.org/doi/10.1021/acs.jctc.2c00995)? Together with Christopher Dahnken, Hans Pabst (both Intel), Gerald Mathias and Momme Allalen (both LRZ Munich), we plan to optimize the new scalable GW code on the new SuperMUC GPU partition "Ponte Vecchio", starting this November. We also need to benchmark whether the computation of the ERIs is a bottleneck in the algorithm (the number of ERIs scales linearly with system size, while a diagonalization in GW scales cubically).

  • Ole, do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes? On SuperMUC, there are currently 144 fat nodes with 768 GB RAM, but one waits a month for a job to start... So, I am in favor of not relying on high-memory nodes at present. For tackling 2D materials with 10,000 atoms in the unit cell, we need a very large number of FLOPs, which hopefully can be provided by GPUs...

Furthermore, I've been wondering for a while if we could take advantage of local SSDs, which could provide over 10 TB of high-bandwidth storage?

  • The memory bottleneck in the "old" low-scaling RPA/SOS-MP2/GW code is the intermediate tensor M_λνP,

M_λνP = sum_µ (µν|P) D_µλ .

Here, I am not sure whether M_λνP can be stored on and accessed from SSDs efficiently. Still, as Augustin already mentioned, it might be that the memory issues only arise for large-scale GW calculations with > 500 atoms and large basis sets. For molecular dynamics with low-scaling RPA/SOS-MP2, one can anyway only afford unit cells with a few hundred atoms. For very-large-scale GW calculations, it is probably more advisable to use the new, scalable periodic GW implementation.
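For concreteness, the contraction above looks as follows in a dense toy model (a minimal numpy sketch; in CP2K the tensors are block-sparse dbt/dbcsr objects, and the dimensions here are made up):

    import numpy as np

    n_ao, n_ri = 60, 180                                                    # toy dimensions
    three_c = np.random.default_rng(0).standard_normal((n_ao, n_ao, n_ri))  # (mu nu | P)
    D = np.random.default_rng(1).standard_normal((n_ao, n_ao))              # D_mu lambda

    # M_lambda nu P = sum_mu (mu nu | P) D_mu lambda
    M = np.einsum('mnp,ml->lnp', three_c, D)

In the block-sparse case, this contraction with D fills blocks that are empty in (µν|P), which is why the occupancy of M_λνP (11.5% in the example above) is so much higher than that of the bare integrals (0.657%).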

@hfp
Member

hfp commented Aug 15, 2023

do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes?

If we look at the number of memory channels of contemporary systems, they currently top out at 2x12 channels on dual-socket systems (12 channels per socket). There is also a physical/space limit for the number of memory channels per server. Assuming 24 channels as an upper limit, 64 GB modules yield an aggregate memory capacity of 1.5 TB. This also assumes systems are not underpopulated (which would sacrifice memory bandwidth). Even if 64 GB modules had the lowest price per GB, not too many centers will buy them because the total price for this capacity is still a concern. If the core count per socket/CPU goes up towards ~384 cores, i.e. 768 cores for both sockets, then centers may want to maintain 2 GB per core, i.e. 1.5 TB of aggregate memory capacity.
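In numbers, the two routes to ~1.5 TB per dual-socket node mentioned above (a trivial sketch):

    # 24 fully populated memory channels with 64 GB modules vs. 2 GB per core at 768 cores
    channels, module_gb = 2 * 12, 64
    cores, gb_per_core = 2 * 384, 2

    print(channels * module_gb)    # 1536 GB ~ 1.5 TB of aggregate capacity
    print(cores * gb_per_core)     # 1536 GB ~ 1.5 TB from the 2 GB/core rule of thumb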

I guess implementing some service (in C) that memory-maps a file, or actually backs some address space with a file, is a more viable bet.
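Purely to illustrate the idea of file-backed memory (the actual service would be written in C, as suggested above), here is a Python sketch using a hypothetical node-local SSD path:

    import mmap
    import numpy as np

    n_bytes = 8 * 10**9                      # e.g. an 8 GB scratch buffer
    path = "/local_ssd/cp2k_scratch.bin"     # hypothetical path to node-local SSD storage

    # Create a (sparse) file of the desired size and map it into the address space;
    # the OS pages data in and out, so the working set can exceed the node's RAM.
    with open(path, "w+b") as f:
        f.truncate(n_bytes)
        mm = mmap.mmap(f.fileno(), n_bytes)
        buf = np.frombuffer(mm, dtype=np.float64)   # view the mapping as a double array
        buf[:10] = 1.0                              # writes go through the page cache to the SSD
        mm.flush()
        del buf                                     # release the buffer before closing the map
        mm.close()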

@oschuett
Member

oschuett commented Aug 15, 2023

Concerning GPUs: Do you think we could use LibintX for ERIs on GPUs?

It would certainly be great to have LibintX integrated into CP2K. However, it will probably require a bit of work. Maybe you could team up with Matt Watkins who is also working on accelerating ERIs? Getting it to work on Intel GPUs might pose an additional challenge because LibintX seems to only support CUDA at the moment. Let's discuss at the developers meeting on Friday.

Ole, do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes?

I don't know about computing centers, but cloud providers are now offering such machines for like $1.12 / hour.

@oschuett
Member

...while a diagonalization in GW scales cubically).

For diagonalizing large matrices you can now use cuSOLVERMp.

@oschuett oschuett merged commit 8ce3b69 into cp2k:master Aug 15, 2023
