GPU implementation of ocean baroclinic velocity tendencies #572
philipwjones wants to merge 1 commit into MPAS-Dev:ocean/develop from philipwjones:ocean/GPUvel
Conversation
This commit is a substantial modification to all the ocean baroclinic velocity tendencies and includes:
- a complete GPU implementation in which all tendencies are computed on the accelerator (using OpenACC) and all data is transferred at the top-level driver (ocn_tend_vel); see the sketch after this list
- a number of CPU optimizations performed along the way
- elimination of meshPool, configPool
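
For context, here is a minimal, self-contained sketch (not MPAS-Ocean source; array names, sizes, and the coefficient are illustrative) of the pattern described above: a single data region at the top-level driver, with each tendency term computed in an OpenACC parallel loop on the device.

```fortran
! Minimal sketch of the pattern described above (not MPAS-Ocean source).
! Data is moved once at the "driver" level; each tendency term is an
! OpenACC parallel loop executed on the device. Names are illustrative.
program tend_vel_sketch
   implicit none
   integer, parameter :: nEdges = 1024, nVertLevels = 60
   real(8), allocatable :: normalVelocity(:,:), tend(:,:)
   real(8), parameter :: coef = 0.1d0
   integer :: iEdge, k

   allocate(normalVelocity(nVertLevels, nEdges), tend(nVertLevels, nEdges))
   normalVelocity = 1.0d0
   tend = 0.0d0

   ! Single host<->device transfer, analogous to the top-level driver
   !$acc data copyin(normalVelocity) copy(tend)

   ! One tendency term; the real driver would call several such loops
   !$acc parallel loop collapse(2) present(normalVelocity, tend)
   do iEdge = 1, nEdges
      do k = 1, nVertLevels
         tend(k, iEdge) = tend(k, iEdge) - coef*normalVelocity(k, iEdge)
      end do
   end do

   !$acc end data

   print *, 'tend(1,1) =', tend(1, 1)
   deallocate(normalVelocity, tend)
end program tend_vel_sketch
```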
Quick questions @philipwjones:
cc @sbrus89 and Nairita
I was mistaken.
@pwolfram If you are running on GPUs (OPENACC enabled) and with tidal forcing, this will exit with an error. It still works for CPU-only runs. This is temporary - I am trying to get a lot of GPU code integrated before the end of the month and an ECP deliverable. But the way the tidal forcing modifies zMid and moves pointers back and forth interferes with the copies on the GPU, and it was going to take some thought on how to manage that in an efficient way. Sorry - I will get back to it once I finish integrating other stuff.
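
A hedged sketch of the kind of conflict described here, under the assumption that zMid gets a device copy at the top-level driver (names and sizes are hypothetical, not the actual MPAS routines): once host-side code such as the tidal forcing modifies the array between device kernels, the device copy is stale and has to be refreshed with an explicit update, which adds exactly the transfer traffic the driver-level copy was meant to avoid.

```fortran
! Illustrative sketch (not MPAS code) of why host-side changes to zMid
! conflict with a device copy made at the top-level driver. Names are
! hypothetical.
program zmid_sync_sketch
   implicit none
   integer, parameter :: n = 16
   real(8), allocatable :: zMid(:)
   integer :: i

   allocate(zMid(n))
   zMid = 1.0d0

   ! Device copy created once, as in the driver-level strategy of this PR
   !$acc enter data copyin(zMid)

   ! Host-only code (e.g. the tidal forcing) modifies zMid between kernels...
   zMid = zMid + 0.5d0

   ! ...so the device copy is now stale; keeping things consistent needs an
   ! explicit transfer, which is the extra traffic the PR tries to avoid
   !$acc update device(zMid)

   !$acc parallel loop present(zMid)
   do i = 1, n
      zMid(i) = 2.0d0*zMid(i)
   end do

   !$acc exit data copyout(zMid)
   print *, 'zMid(1) =', zMid(1)
   deallocate(zMid)
end program zmid_sync_sketch
```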
@philipwjones, we should have removed changes to
Tried to rebase this yesterday, but there has been some reorganization of code since this was first submitted, and the rebase got a little ugly and I don't have high confidence in it. So... I will probably do a fresh checkout of a more recent version and re-implement the GPU mods, maybe incorporating some new ideas from Az first. Will update the branch when this is done.
That's probably a good idea. We just had an issue where the merge conflict resolution in one of my PRs reverted prior changes (bugfix in #672), so starting over with a fresh checkout in a few subroutines might be better. |
Replaced by #772 and another future PR. |
This PR contains changes similar to those in at least #513, #536, and #569, so it will need to be modified/rebased once those are merged.
Performance speedup for this part of the code was 2.8x using 2 GPUs on Summit compared with an 8-rank MPI-only case; details of performance will depend on configuration. CPU performance improved by ~20% in the same 8-rank QU240 test. More speedup is expected as we migrate more data to the device elsewhere. The computational part alone, excluding data transfer, showed a 10x speedup.
This is not quite bit-for-bit (b4b) due to changes in the order of operations in a couple of routines, and it is not b4b on the accelerator (different chip architecture); in both cases the differences are at roundoff level. Tested most of the options (e.g. for hmix, pgrad), though since it was tested in standalone QU240, the forcing routines really didn't get much of a workout.
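
As a small illustration of why a change in summation order produces only roundoff-level differences, here is a standalone snippet (not from the PR) showing single-precision non-associativity:

```fortran
! Standalone demonstration (not from the PR) that reordering a sum
! changes a single-precision result only at roundoff level.
program roundoff_demo
   implicit none
   real :: a, b, c
   a = 1.0e8
   b = -1.0e8
   c = 1.0
   ! (a + b) + c keeps c, while a + (b + c) loses c because |b| >> c in
   ! single precision; the results differ by 1.0, which is roundoff
   ! relative to the 1.0e8 operands
   print *, '(a + b) + c =', (a + b) + c
   print *, 'a + (b + c) =', a + (b + c)
end program roundoff_demo
```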