
Conversation

@JWilhelm
Member

  • low memory footprint of the tensors is achieved by repeated calculation of the 3-center integrals; the memory can be brought down to almost zero by recomputing the 3-center integrals more often (a minimal sketch of this trade-off follows at the end of this comment)

  • scalability should be excellent because every costly operation can be executed in small subgroups; only very little communication between subgroups is needed at the end of tensor operations

  • suggestion for an easy, user-friendly input:

    &PROPERTIES
      &BANDSTRUCTURE
        &GW
          NUM_TIME_FREQ_POINTS 6
        &END GW
      &END BANDSTRUCTURE
    &END PROPERTIES

Inside the &BANDSTRUCTURE section, one could also include SOC shifts of the electronic levels (in collaboration with Anna Hehn), the local density of states, the projected density of states, semi-empirical GW, etc.

Any ideas for improvement will be welcome!

  • still many to-dos: flexible subgroup sizes, subgroups for the k-point diagonalization, scaling tests on very large unit cells, and tests of the code on GPUs
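To illustrate the memory-vs-recomputation trade-off from the first bullet, here is a minimal Python sketch (the actual implementation is the Fortran code in this PR; the function compute_3c_batch and all dimensions are hypothetical placeholders):

    import numpy as np

    # Toy dimensions; in the real calculations n_ao and n_ri are tens of thousands.
    n_ao, n_ri, batch = 80, 240, 40

    def compute_3c_batch(p0, p1):
        """Hypothetical stand-in for (re)computing the 3-center integrals
        (mu nu | P) for the RI functions P in [p0, p1) on the fly."""
        rng = np.random.default_rng(p0)
        return rng.standard_normal((n_ao, n_ao, p1 - p0))

    G = np.random.default_rng(1).standard_normal((n_ao, n_ao))  # Green's-function-like matrix

    # Accumulate a polarizability-like 2-index quantity X_PQ without ever storing
    # the full (n_ao, n_ao, n_ri) tensor of 3-center integrals: only one RI batch
    # is held in memory at a time, and batches are recomputed whenever needed.
    X = np.zeros((n_ri, n_ri))
    for p0 in range(0, n_ri, batch):
        p1 = min(p0 + batch, n_ri)
        T_p = compute_3c_batch(p0, p1)                 # recomputed, not stored
        A_p = np.einsum('mnp,ml->lnp', T_p, G)         # contract with G
        A_p = np.einsum('lnp,ns->lsp', A_p, G)         # and again with G
        for q0 in range(0, n_ri, batch):
            q1 = min(q0 + batch, n_ri)
            T_q = compute_3c_batch(q0, q1)             # recomputed once more here
            X[p0:p1, q0:q1] = np.einsum('lsp,lsq->pq', A_p, T_q)

Smaller batches lower the peak memory further at the price of recomputing the 3-center integrals more often, which is exactly the knob described above.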

@abussy
Contributor

abussy commented Aug 11, 2023

Do these changes only apply to periodic (I guess k-point) GW? Or do they also affect Gamma-point low-scaling RPA and SOS-MP2? I would guess that you get memory issues because of the nimg^2 separate tensors you need to store the 3-center integrals (i^0 j^a | k^b), with a, b being cell indices?

@JWilhelm
Member Author

Do these changes only apply to periodic (I guess k-point) GW? Or do they also affect Gamma-point low-scaling RPA and SOS-MP2? I would guess that you get memory issues because of the nimg^2 separate tensors you need to store the 3-center integrals (i^0 j^a | k^b), with a, b being cell indices?

The PR is a rewrite of the periodic GW algorithm reported in

https://arxiv.org/abs/2306.16066

The full periodic GW algorithm is given in the SI; the 3-center integrals (µν|P) are summed over the cell index, i.e. (µν|P) do not carry the cell indices a, b, see

[image: equation from the SI]

The memory-saving strategy can probably also be applied to the low-scaling RPA and SOS-MP2 energies and forces.

The memory-saving algorithm is contained in the files

gw_methods.F
gw_types.F
gw_utils.F
gw_communication.F

So, the new code does not affect the existing low-scaling RPA, SOS-MP2 energy and force algorithms.

I have not yet benchmarked the low-memory GW algorithm for large-scale applications (the ultimate goal is to tackle a 2D moiré structure with 10,000 atoms in a TZVP-MOLOPT basis; let's see whether this will be realistic...).

@abussy
Contributor

abussy commented Aug 11, 2023

Thanks for the input! I'll make sure to read your paper. I was wondering because, in my experience with low-scaling RPA and SOS-MP2 forces, storing the integrals never seemed to be an issue since they tend to be very sparse.

@JWilhelm
Member Author

The memory issue arises for calculations with > 500 atoms using TZVP-MOLOPT basis sets. I agree that the memory for storing the 3c integrals is not too high: for the largest calculation on 984 atoms from https://arxiv.org/abs/2306.16066, the memory for the 3c integrals is

occupancy * N_AO^2 * N_RI * 8 bytes = 0.657% * 22632^2 * 49441 * 8 B ≈ 1.3 TB

(input and output here: https://github.com/JWilhelm/Inputs_outputs_low_scaling_GW_TMDC/tree/main/Figure_4/Twist_angle_19.4_degree_984_atoms)

The occupancy of the 3-index tensors increases to 11.5% when multiplying with the matrix D_mu nu(it), giving 23 TB (which can be reduced again using the memory cut). In my calculations on cells with more than 500 atoms in a TZVP-MOLOPT basis, I constantly had memory issues, which I attribute to inefficient and imbalanced memory usage of the dbcsr/dbt library for large-scale calculations.
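For reference, a quick sketch to reproduce the two estimates above from the occupancies and tensor dimensions of the 984-atom calculation:

    # Memory of the block-sparse 3-index tensor: occupancy * N_AO^2 * N_RI * 8 bytes
    n_ao, n_ri = 22632, 49441

    def tensor_memory_tb(occupancy):
        return occupancy * n_ao**2 * n_ri * 8 / 1e12   # terabytes, double precision

    print(tensor_memory_tb(0.00657))   # ~1.3 TB for the bare 3c integrals (mu nu | P)
    print(tensor_memory_tb(0.115))     # ~23 TB after contraction with D_mu nu(it)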

Probably, the memory issues are less important for molecular dynamics with MP2/RPA because one is restricted to unit cells with fewer than 500 atoms anyway. For the 2D moiré structures, the interesting effects such as atomic reconstruction happen for structures with 1,000 to 100,000 atoms in the unit cell...

@oschuett
Member

oschuett commented Aug 15, 2023

There seem to be two opposing trends in hardware:

  • GPUs provide an abundance of FLOPs. So, if we eventually manage to offload the ERIs to the GPU, we may no longer need to store them.
  • Machines with 1.5 TB of main memory are becoming available.

Furthermore, I've been wondering for a while if we could take advantage of local SSDs, which could provide over 10 TB of high-bandwidth storage?

@JWilhelm
Member Author

JWilhelm commented Aug 15, 2023

  • Concerning GPUs: Do you think we could use LibintX for the ERIs on GPUs (https://pubs.acs.org/doi/10.1021/acs.jctc.2c00995)? Together with Christopher Dahnken, Hans Pabst (both Intel), Gerald Mathias and Momme Allalen (both LRZ Munich), we plan to optimize the new scalable GW code on the new SuperMUC GPU partition "Ponte Vecchio", starting this November. We also need to benchmark whether the computation of the ERIs is a bottleneck in the algorithm (the number of ERIs scales linearly with system size, while a diagonalization in GW scales cubically).

  • Ole, do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes? On SuperMUC, there are currently 144 fat nodes with 768 GB RAM, but one waits a month for a job to start... So, I am in favor of not relying on high-memory nodes at present. For tackling 2D materials with 10,000 atoms in the unit cell, we need a very large number of FLOPs, which hopefully can be provided by GPUs...

Furthermore, I've been wondering for a while if we could take advantage of local SSDs, which could provide over 10 TB of high-bandwidth storage?

  • The memory bottleneck in the "old" low-scaling RPA/SOS-MP2/GW code is the intermediate tensor M_λνP,

M_λνP = sum_µ (µν|P) D_µλ .

Here, I am not sure whether M_λνP can be stored on and accessed from SSDs efficiently. Still, as Augustin already mentioned, it might be that the memory issues only arise for large-scale GW calculations with > 500 atoms and large basis sets. For molecular dynamics with low-scaling RPA/SOS-MP2, one can anyway only afford unit cells with a few hundred atoms. For very-large-scale GW calculations, it is probably more advisable to use the new, scalable periodic GW implementation.
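For concreteness, the contraction above looks as follows in a dense toy model (a minimal numpy sketch; in CP2K the tensors are block-sparse dbt/dbcsr objects, and the dimensions here are made up):

    import numpy as np

    n_ao, n_ri = 60, 180                                                    # toy dimensions
    three_c = np.random.default_rng(0).standard_normal((n_ao, n_ao, n_ri))  # (mu nu | P)
    D = np.random.default_rng(1).standard_normal((n_ao, n_ao))              # D_mu lambda

    # M_lambda nu P = sum_mu (mu nu | P) D_mu lambda
    M = np.einsum('mnp,ml->lnp', three_c, D)

In the block-sparse case, this contraction with D fills blocks that are empty in (µν|P), which is why the occupancy of M_λνP (11.5% in the example above) is so much higher than that of the bare integrals (0.657%).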

@hfp
Member

hfp commented Aug 15, 2023

do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes?

If we look at the number of memory channels of contemporary systems, they currently top out at 2x12 channels on dual-socket systems (12 channels per socket). There is also a physical/space limit for the number of memory channels per server. Assuming 24 channels as an upper limit, 64 GB modules yield an aggregate memory capacity of 1.5 TB. This also assumes systems are not underpopulated (which would sacrifice memory bandwidth). Even if 64 GB modules had the lowest price per GB, not too many centers will buy them because the total price for this capacity is still a concern. If the core count per socket/CPU goes up towards ~384 cores, i.e. 768 cores for both sockets, then centers may want to maintain 2 GB per core, i.e. 1.5 TB of aggregate memory capacity.
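In numbers, the two routes to ~1.5 TB per dual-socket node mentioned above (a trivial sketch):

    # 24 fully populated memory channels with 64 GB modules vs. 2 GB per core at 768 cores
    channels, module_gb = 2 * 12, 64
    cores, gb_per_core = 2 * 384, 2

    print(channels * module_gb)    # 1536 GB ~ 1.5 TB of aggregate capacity
    print(cores * gb_per_core)     # 1536 GB ~ 1.5 TB from the 2 GB/core rule of thumb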

I guess implementing some service (in C) that memory-maps a file, or actually backs some address space with a file, is a more viable bet.
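Purely to illustrate the idea of file-backed memory (the actual service would be written in C, as suggested above), here is a Python sketch using a hypothetical node-local SSD path:

    import mmap
    import numpy as np

    n_bytes = 8 * 10**9                      # e.g. an 8 GB scratch buffer
    path = "/local_ssd/cp2k_scratch.bin"     # hypothetical path to node-local SSD storage

    # Create a (sparse) file of the desired size and map it into the address space;
    # the OS pages data in and out, so the working set can exceed the node's RAM.
    with open(path, "w+b") as f:
        f.truncate(n_bytes)
        mm = mmap.mmap(f.fileno(), n_bytes)
        buf = np.frombuffer(mm, dtype=np.float64)   # view the mapping as a double array
        buf[:10] = 1.0                              # writes go through the page cache to the SSD
        mm.flush()
        del buf                                     # release the buffer before closing the map
        mm.close()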

@oschuett
Member

oschuett commented Aug 15, 2023

Concerning GPUs: Do you think we could use LibintX for ERIs on GPUs?

It would certainly be great to have LibintX integrated into CP2K. However, it will probably require a bit of work. Maybe you could team up with Matt Watkins who is also working on accelerating ERIs? Getting it to work on Intel GPUs might pose an additional challenge because LibintX seems to only support CUDA at the moment. Let's discuss at the developers meeting on Friday.

Ole, do you know whether computing centers will buy a large number of such 1.5 TB high-memory nodes?

I don't know about computing centers, but cloud providers are now offering such machines for like $1.12 / hour.

@oschuett
Member

...while a diagonalization in GW scales cubically).

For diagonalizing large matrices you can now use cuSOLVERMp.

@oschuett oschuett merged commit 8ce3b69 into cp2k:master Aug 15, 2023
