[grid] reduce shared memory usage #2793
Conversation
mtaillefumier commented on May 23, 2023
- the coefficients are now read directly from global memory when lp_max > 4 instead of being stored in shared memory. This reduces shared memory usage for large l.
- grid_miniapp seems unhappy and triggers a race condition when the hab coefficients are calculated, so an atomicAdd was added (see the sketch after this list).
- minor changes in the CMake build system.
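A minimal sketch of the pattern described above, under stated assumptions: the kernel name, `cxyz_global`, `LP_MAX_SHARED`, and the loop bodies are all hypothetical stand-ins, not the actual CP2K kernels. It shows the two access paths (staging coefficients in shared memory for small lp_max, reading straight from global memory otherwise) and the atomicAdd used to avoid the hab race.

```cuda
#include <cuda_runtime.h>

#define LP_MAX_SHARED 4  // hypothetical cutoff matching "lp_max > 4"

__host__ __device__ static int ncoset(int l) {
  // Number of Cartesian Gaussian components up to angular momentum l.
  return (l + 1) * (l + 2) * (l + 3) / 6;
}

__global__ void collocate_sketch(const double *cxyz_global, double *hab,
                                 int lp_max, int npoints) {
  extern __shared__ double cxyz_shared[];
  const int ncoefs = ncoset(lp_max);

  // Small lp_max: stage the coefficients once per block in shared memory.
  if (lp_max <= LP_MAX_SHARED) {
    for (int i = threadIdx.x; i < ncoefs; i += blockDim.x)
      cxyz_shared[i] = cxyz_global[i];
    __syncthreads();
  }
  // Large lp_max: read straight from global memory instead.
  const double *cxyz = (lp_max <= LP_MAX_SHARED) ? cxyz_shared : cxyz_global;

  for (int p = threadIdx.x + blockIdx.x * blockDim.x; p < npoints;
       p += gridDim.x * blockDim.x) {
    double contrib = 0.0;
    for (int c = 0; c < ncoefs; c++)
      contrib += cxyz[c];  // stand-in for the real collocation work
    // Several threads may update the same hab entry, hence the atomicAdd
    // (atomicAdd on double requires compute capability >= 6.0).
    atomicAdd(&hab[p % ncoefs], contrib);
  }
}
```

The dynamic shared-memory size is only needed on the small-lp_max path, e.g. `collocate_sketch<<<grid, block, ncoset(lp_max) * sizeof(double)>>>(...)`.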
Oh this is nice! I also contemplated storing the [...] Btw, this exception in the unittest should then no longer be needed. Linking #1785 for posterity.
I only removed the cxyz coefficients from shared memory, which means that we still have to store cab and alpha there. alpha isn't an issue, but cab might be: for instance, ncoset(l = 10) x ncoset(l = 10) is 286^2 doubles, which is more than the shared memory available. One solution might be a hybrid scheme where we sort tasks into low-l and high-l bins and then use a different algorithm for each. I can certainly do this on my side (cab in global memory), since integrate/collocate are separated from the calculation of the coefficients (at the price of more global memory used). I will lift the exception asap (tomorrow). From experience, collocate/integrate will always dominate the timers, so if we lose a little with cab being in global memory, then so be it. The overall gain of treating large l on the GPU instead of the CPU is worth the effort. It would be worth triggering the HIP Pascal tests; the grid_miniapp passes but still [...]
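For reference, a back-of-the-envelope check of the arithmetic in that comment, with ncoset(l) = (l+1)(l+2)(l+3)/6 as used above (the helper is illustrative, not CP2K code):

```cuda
#include <cstdio>

static int ncoset(int l) { return (l + 1) * (l + 2) * (l + 3) / 6; }

int main() {
  for (int l = 4; l <= 10; l += 2) {
    const int n = ncoset(l);
    const double kib = (double)n * n * sizeof(double) / 1024.0;
    printf("l = %2d  ncoset = %3d  cab = %8.1f KiB\n", l, n, kib);
  }
  // l = 10: ncoset = 11*12*13/6 = 286, so cab is 286^2 doubles,
  // ~639 KiB -- far above the roughly 48-228 KiB of shared memory
  // per SM on recent NVIDIA GPUs, so cab cannot stay in shared memory.
  return 0;
}
```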
I actually did the opposite and partitioned
Indeed. I removed cab from shared memory entirely. Something is still puzzling me: I get this when I run grid_unittest (same hardware in both cases).
[GPU backend output and HIP backend output elided]
That looks strangely identical; there should be at least some numerical noise. What do the statistics at the end say?
[GPU backend and HIP backend statistics elided]
They are different, and that is not where my trouble comes from. It is more about the difference ref/GPU vs. ref/HIP.
Those differences should be OK because they are below our tolerances of 1e-12 for matrix elements and 1e-8 for forces. These "warning lines" are already printed when the diffs surpass 0.01 of our thresholds. While this was useful during development, it's now confusing; hence I've opened #2797 to fix this inconsistency.
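A sketch of the reporting behavior being described, under stated assumptions: the function and message strings are illustrative, not the actual grid_unittest code. Diffs within tolerance but above 1% of it produce the "warning lines" seen in the output.

```cuda
#include <cstdio>

static void report_diff(const char *what, double diff, double tolerance) {
  if (diff > tolerance) {
    printf("ERROR:   %s diff %.3e exceeds tolerance %.1e\n", what, diff, tolerance);
  } else if (diff > 0.01 * tolerance) {
    // Still passes, but above 1% of the threshold: print a warning line.
    printf("WARNING: %s diff %.3e (tolerance %.1e)\n", what, diff, tolerance);
  }
}

int main() {
  report_diff("matrix element", 3.0e-14, 1.0e-12);  // warning line, passes
  report_diff("force",          2.0e-8,  1.0e-8);   // error, fails
  return 0;
}
```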
Perfect, thanks for the clarification; I was searching for something wrong in the code. #2797 can be merged first and I will then update this PR; or, if there is no conflict, you can merge both. You can squash all commits into one if you wish.