Initial commit of scalable, low-memory periodic GW #2920
Conversation
Do these changes only apply to periodic (I guess k-point) GW? Or does it also affect Gamma-point low-scaling RPA and SOS-MP2? I would guess that you get memory issues because of the 3-center integrals?
The PR is a rewrite of the periodic GW algorithm reported in https://arxiv.org/abs/2306.16066. The full periodic GW algorithm is given in the SI; the 3-center integrals (mu nu P) are summed over the cell index, so (mu nu P) do not carry cell indices 'a, b'.

Probably, the memory-saving strategy can also be applied to low-scaling RPA and SOS-MP2 energies and forces. The memory-saving algorithm is contained in gw_methods.F, so the new code does not affect the existing low-scaling RPA and SOS-MP2 energy and force algorithms.

I have not yet benchmarked the low-memory GW algorithm for large-scale applications (the ultimate goal is to tackle a 2D moiré structure with 10,000 atoms in a TZVP-MOLOPT basis; let's see whether this will be realistic...).
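For orientation, the cell summation might be written as follows (a sketch only; the placement of the cell indices here is my assumption, the precise convention is defined in the SI of the paper):

$$(\mu\nu|P) \;=\; \sum_{\mathbf{a},\,\mathbf{b}} \big(\mu^{\mathbf{0}}\,\nu^{\mathbf{a}}\,\big|\,P^{\mathbf{b}}\big)$$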
Thanks for the input! I'll make sure to read your paper. I was wondering because, in my experience with low-scaling RPA and SOS-MP2 forces, storing the integrals never seemed to be an issue since they tend to be very sparse.
The memory issue arises for calculations with > 500 atoms using TZVP-MOLOPT basis sets. I agree that the memory for storing the 3c integrals themselves is not too high: for the largest calculation on 984 atoms from https://arxiv.org/abs/2306.16066, the memory for the 3c integrals is

occupancy * N_AO^2 * N_RI * 8 bytes = 0.657 % * 22632^2 * 49441 * 8 B = 1.3 TB

(input and output here: https://github.com/JWilhelm/Inputs_outputs_low_scaling_GW_TMDC/tree/main/Figure_4/Twist_angle_19.4_degree_984_atoms). However, the occupancy of the 3-index tensors increases to 11.5 % when multiplying with the matrix D_µν(iτ), making 23 TB (which can be reduced again using the memory cut).

In my calculations on cells with more than 500 atoms in a TZVP-MOLOPT basis, I constantly had memory issues, which I attribute to inefficient and imbalanced memory usage of the DBCSR/DBT libraries for large-scale calculations. The memory issues are probably less important for molecular dynamics with MP2/RPA, because there one is restricted to unit cells with fewer than 500 atoms anyway. For the 2D moiré structures, the interesting physics, such as atomic reconstruction, happens for structures with 1,000 to 100,000 atoms in the unit cell...
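As a quick cross-check of these numbers, a minimal Python sketch of the same estimate (the sizes are the ones quoted above):

```python
# Back-of-the-envelope memory estimate for the block-sparse 3-center tensor
# of the 984-atom calculation quoted above.
n_ao = 22632    # number of atomic-orbital basis functions
n_ri = 49441    # number of RI basis functions
dense_bytes = n_ao**2 * n_ri * 8    # dense (mu nu | P) in double precision

for occupancy, label in [(0.00657, "(mu nu | P)"),
                         (0.115, "after contraction with D")]:
    print(f"{label}: {occupancy * dense_bytes / 1e12:.1f} TB")
# -> (mu nu | P): 1.3 TB
# -> after contraction with D: 23.3 TB
```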
There seem to be two opposing trends in hardware: core counts per node keep growing, while memory capacity per core stays roughly flat.
Furthermore, I've been wondering for a while whether we could take advantage of local SSDs, which could provide over 10 TB of high-bandwidth storage?
The memory-critical object is the contracted tensor M_λνP = Σ_µ (µν|P) D_µλ. Here, I am not sure whether M_λνP can be efficiently stored and accessed on SSDs. Still, as Augustin already mentioned, it might be that the memory issues only arise for large-scale GW calculations (> 500 atoms) with large basis sets. For molecular dynamics with low-scaling RPA/SOS-MP2, one can anyway only afford unit cells with a few hundred atoms, and for very-large-scale GW calculations, it is probably more advisable to use the new, scalable periodic GW implementation.
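As a toy illustration of the index pattern of this contraction (a dense NumPy stand-in for the block-sparse DBT tensors; the dimensions are made up):

```python
import numpy as np

n_ao, n_ri = 20, 30                    # toy dimensions, not realistic
T = np.random.rand(n_ao, n_ao, n_ri)   # 3-center integrals (mu nu | P); sparse in practice
D = np.random.rand(n_ao, n_ao)         # density-type matrix D_{mu lambda}

# M_{lambda nu P} = sum_mu (mu nu | P) D_{mu lambda};
# the contraction fills in blocks, which is why the occupancy
# grows from 0.657 % to 11.5 % in the numbers above.
M = np.einsum('mnP,ml->lnP', T, D)
```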
If we look at the number of memory channels of contemporary systems, they currently top out at 2 x 12 channels on dual-socket systems (12 channels per socket), and there is a physical/space limit on the number of memory channels per server. Assuming 24 channels as an upper limit, 64 GB modules yield an aggregated memory capacity of 1.5 TB. This also assumes systems are not underpopulated (which would sacrifice memory bandwidth). Even if 64 GB modules had the lowest price per GB, not many centers will buy them, because the total price for this capacity is still a concern. If the core count per socket goes up towards ~384 cores, i.e., 768 cores for both sockets, then centers may want to maintain 2 GB per core, i.e., 1.5 TB of aggregated memory capacity.

I guess implementing some service (in C) that memory-maps a file, or actually backs some address space with a file, is a more viable bet.
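For prototyping the file-backed idea without writing a C service, numpy.memmap already gives a file-backed array (a sketch under the assumption that the access pattern is reasonably streaming; path and sizes are placeholders):

```python
import numpy as np

n_ao, n_ri = 1000, 2000   # placeholder sizes

# Back the tensor with a file on a local SSD instead of RAM; the OS pages
# the 8-byte floats in and out on demand via mmap.
M = np.memmap('/scratch/m_tensor.dat', dtype=np.float64, mode='w+',
              shape=(n_ao, n_ao, n_ri))
M[0, :, :] = 1.0   # writes go through the page cache to the file
M.flush()          # force dirty pages to disk
```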
It would certainly be great to have LibintX integrated into CP2K. However, it will probably require a bit of work. Maybe you could team up with Matt Watkins, who is also working on accelerating ERIs? Getting it to work on Intel GPUs might pose an additional challenge, because LibintX seems to support only CUDA at the moment. Let's discuss at the developers meeting on Friday.
I don't know about computing centers, but cloud providers are now offering such machines for about $1.12/hour.
For diagonalizing large matrices, you can now use cuSOLVERMp.

- Low memory footprint of the tensors is achieved by repeated calculation of the 3-center integrals; the memory can be brought down almost to zero by recomputing the 3-center integrals more often (see the sketch after this list).
- Scalability should be excellent because every costly operation can be executed in small subgroups; only very little communication between subgroups is needed at the end of the tensor operations.
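A schematic of the recomputation trade-off from the first point (function names are placeholders, not the actual CP2K routines):

```python
def contract_low_memory(n_ri, batch_size, compute_3c_batch, consume_batch):
    """Recompute 3-center integrals batch by batch instead of storing them all.

    Smaller batch_size -> lower peak memory, but more integral recomputation.
    compute_3c_batch and consume_batch are placeholder callbacks.
    """
    result = 0.0
    for p0 in range(0, n_ri, batch_size):
        t_batch = compute_3c_batch(p0, min(p0 + batch_size, n_ri))  # recomputed on the fly
        result += consume_batch(t_batch)  # contract, then let the batch be freed
    return result
```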
Suggestion for an easy, user-friendly input:
```
&PROPERTIES
  &BANDSTRUCTURE
    &GW
      NUM_TIME_FREQ_POINTS 6
    &END
  &END
&END
```
Inside the &BANDSTRUCTURE section, one could also include SOC shifts of the electronic levels (in collaboration with Anna Hehn), local density of states, projected density of states, semi-empirical GW, etc.
Any ideas for improvement are welcome!