GPU implementation of ocean baroclinic velocity tendencies - phase 1 #772
mattdturner merged 5 commits into MPAS-Dev:ocean/develop from philipwjones:ocean/gpuvel1
Conversation
Tested in a QU240 case on Summit with the PGI compiler. CPU runs were bit-for-bit. I also compared CPU- and GPU-computed tendencies at every step and they differed only in the last digit (abs err < 10^-16), so GPU runs had only roundoff changes. With this bit-for-bit version, speedups were only modest on the GPU. A second round of non-b4b changes should regain the full performance improvement of the original PR #572.
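For reference, a step-by-step comparison of that kind can be done with a small check along the lines of the sketch below; the array names, extents, and calling convention are assumptions for illustration, not the actual MPAS-Ocean diagnostics.

```fortran
! Sketch only: report the maximum absolute difference between CPU- and
! GPU-computed velocity tendencies after a step. The array names, extents,
! and calling convention are hypothetical, not the MPAS-Ocean diagnostics.
subroutine compare_tendencies(nVertLevels, nEdges, tendVelCPU, tendVelGPU)
   use, intrinsic :: iso_fortran_env, only : real64
   implicit none
   integer, intent(in) :: nVertLevels, nEdges
   real (kind=real64), dimension(nVertLevels, nEdges), intent(in)    :: tendVelCPU
   real (kind=real64), dimension(nVertLevels, nEdges), intent(inout) :: tendVelGPU
   real (kind=real64) :: maxErr

   ! Bring the device copy back to the host before comparing; this assumes
   ! tendVelGPU is already present in an enclosing OpenACC data region.
   !$acc update host(tendVelGPU)
   maxErr = maxval(abs(tendVelGPU - tendVelCPU))
   write(*,*) 'max |GPU - CPU| tendency difference: ', maxErr
end subroutine compare_tendencies
```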
Visual inspection looks good. I'll start testing sometime before the holiday break to confirm BFB w/ additional compilers, as well as running a few E3SM tests.
@philipwjones on Grizzly with GNU 6 (these modules) I get compile errors: with both
@mark-petersen Intel and GNU seem to require that the |
Since I mentioned a runtime error in a previous comment, here are some details. I'm trying to run w/ Intel v19 on Grizzly, and I get the following error: This occurs with both OpenMP and no OpenMP. In the same subroutine, prior to the line where the error occurs is It seems that
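As background, the general failure pattern being described (clarified in the next comment) looks roughly like the self-contained sketch below, assuming a package-gated field such as topDrag whose array is never allocated; the program and routine names are hypothetical, and only the pattern itself is taken from this discussion.

```fortran
! Self-contained sketch of the failure mode (hypothetical names, not the
! actual MPAS code): a pointer that is never allocated, because the
! Registry package controlling it is inactive, is passed to a routine
! that declares it as an ordinary array dummy. Intel's runtime checks
! flag this even if the routine never touches the array.
program inactive_array_demo
   implicit none
   real, dimension(:,:), pointer :: topDrag => null()

   ! In MPAS this pointer would come from mpas_pool_get_array on the
   ! forcing pool and stays unassociated when its package is off.
   call apply_forcing(topDrag)

contains

   subroutine apply_forcing(topDrag)
      real, dimension(:,:), intent(in) :: topDrag
      ! The body would be guarded by a config flag, so topDrag is never
      ! read here, but the call itself already breaks the argument rules.
   end subroutine apply_forcing

end program inactive_array_demo
```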
I figured out this particular issue. The namelist for the test that I am running has So when the @philipwjones and @mark-petersen Do either of you have a preference on how I resolve this? The 2 options that come to mind immediately are:
@mattdturner I was afraid of that - a third option is to allocate the array at a minimal size if not needed (e.g. size of 1 in each dim) so it has a valid object to pass, but that's probably too involved for this PR. I guess we should reinstate the forcingPool as an argument - that might be closer to where we will be eventually for this pool (and other smaller pools where we can pass an instance/struct around). That will catch the meridionalWindStress too.
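A rough sketch of the pool-as-argument option is below. The routine name, argument list, and guard are illustrative assumptions; mpas_pool_type, mpas_pool_get_array, and the meridionalWindStress field mentioned above are the existing pieces being leaned on.

```fortran
! Sketch of the pool-as-argument approach: the tendency routine takes the
! forcing pool itself and pulls out whatever members it needs, so inactive
! fields never have to be passed as bare array arguments.
subroutine ocn_vel_forcing_tend(forcingPool, tendVel, err)
   use mpas_kind_types, only : RKIND
   use mpas_derived_types, only : mpas_pool_type
   use mpas_pool_routines, only : mpas_pool_get_array
   implicit none
   type (mpas_pool_type), intent(in) :: forcingPool
   real (kind=RKIND), dimension(:,:), intent(inout) :: tendVel
   integer, intent(out) :: err

   real (kind=RKIND), dimension(:), pointer :: meridionalWindStress

   err = 0
   ! The pointer stays unassociated when the owning package is inactive,
   ! which is harmless as long as it is only used inside a guarded block.
   call mpas_pool_get_array(forcingPool, 'meridionalWindStress', &
                            meridionalWindStress)
   if (associated(meridionalWindStress)) then
      ! ... add the wind-stress contribution to tendVel here ...
   end if
end subroutine ocn_vel_forcing_tend
```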
Sounds good. I'll go ahead and re-implement the passing of the forcingPool. I'd be happy to modify the framework to allocate inactive arrays to have an extent of 1 in each dimension. But I agree that this is probably best done as a separate PR if that is what we want to do.
Does this only involve That would be a simple way to avoid this, right?
Alright, I was able to get simulations to run w/ Intel v19 + Debug on Grizzly. All but 2 of the tests in the nightly test suite pass. The FAILs occur when performing a bfb comparison against a baseline (the current head of ocean/develop); both of the tests pass the bfb check between threads or number of ranks. Oddly enough, the partition and thread tests for the other timestepping schemes are BFB when compared to baseline. Whatever is causing the non-bfb behavior appears to affect only the RK4 timestepping. I'll push my updates soon.
My general comment is that packages were made to avoid egregious outlays of memory for variables that are not used. If variables are always used for E3SM (our main concern), then allocating some extra memory for unused variables in idealized cases is OK, especially if it makes other things easier. But maybe I'm missing the point. It looks like you are having other issues as well.
@mark-petersen The issue also occurred with
I just pushed a few commits that resolve the issue discussed above and move the preprocessor lines so that we are all working with the same code.
@mark-petersen Yeah, I was surprised that the windStress arrays weren't always there, and it is probably worth always allocating them. The topDrag arrays maybe not so much. It's not a huge issue to pass the forcingPool right now, and as I mentioned previously, we can still pass a similar structure in the future even in a redesign. We were hoping to do that with the mesh too, but that structure is so big there's a performance hit for traversing it to get the fields you want. Forcing doesn't have that problem. The downside currently is an extra data transfer, but I think I can get around that eventually.
@philipwjones Do you want me to investigate the non-BFB behavior with RK4, or do you think the non-BFB is expected behavior? I don't see why the other timestepping schemes would be BFB, though...
@mattdturner - yes, we should probably look into RK4. There are a couple of instances where tendencies are treated differently for RK4, so there is a possibility that a bug was introduced. You might look first at the hadv_coriolis routine, where I handled that case a little differently to eliminate an extra mask multiplication.
@philipwjones I replaced the new hadv_coriolis code with the old and the RK4 tests pass as BFB. So it looks like the non-BFB behavior w/ Intel is a roundoff error from FMA. Assuming that the non-BFB with RK4 is acceptable, I will go ahead and start testing in E3SM while I also run the MPAS-O nightly suite w/ GNU.
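For anyone puzzling over how eliminating a mask multiplication can be non-BFB, here is a small, purely illustrative example (not the hadv_coriolis code): two algebraically identical groupings of a multiply-add can round differently once the compiler is free to reorder or contract them into FMA instructions.

```fortran
! Illustrative only -- not the actual hadv_coriolis code. The two updates
! are algebraically identical, but the different grouping, plus the way a
! compiler may contract each multiply-add into a fused FMA instruction,
! can change the stored result in its last bit.
program fma_roundoff_demo
   use, intrinsic :: iso_fortran_env, only : real64
   implicit none
   real (kind=real64) :: tend1, tend2, qVal1, qVal2, normalVel, edgeMask

   tend1 = 1.0e-3_real64;  tend2 = tend1
   qVal1 = 0.1_real64;     qVal2 = qVal1
   normalVel = 0.3_real64
   edgeMask  = 0.7_real64

   ! Form 1: mask applied as an extra multiplication in the update
   tend1 = tend1 + edgeMask * (qVal1 * normalVel)

   ! Form 2: mask folded into one factor ahead of time, eliminating the
   ! extra multiplication from the update itself
   qVal2 = edgeMask * qVal2
   tend2 = tend2 + qVal2 * normalVel

   ! Depending on optimization and FMA contraction, this difference can be
   ! nonzero at the level of the last bit.
   write(*,*) 'difference = ', tend1 - tend2
end program fma_roundoff_demo
```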
I made (and deleted) a comment about the GNU testing that prematurely claimed all tests pass w/ GNU + Debug without the edits in #772 (comment). With GNU v5.3.0 + Debug on Grizzly, the same RK4 tests FAIL without the changes to hadv_coriolis. All other tests PASS.
I ran a few tests and compared to a baseline; the process that I used is recorded here for my reference. All tests PASSed.
Unfortunately, Cori is still offline so I can't run tests there. @philipwjones @mark-petersen If you both are OK with the non-BFB behavior in the RK4 timestepping (which I have shown to be roundoff), then I believe this PR is good from a CPU standpoint. I'll run a few quick tests on Summit to confirm that I didn't break the OpenACC runs in the process of my updates.
Yep, I'm fine with the RK4 diffs. And the Cori issue was the reason why I hadn't tested a bit more, so thanks for picking up the slack on my part.
This also passes a BFB test when compared to the baseline.
I'm still working on getting a GPU test running on Summit.
I was able to get the GPU test running on Summit. So this PR should be ready to merge.
- added OpenACC loop and data directives
- removed unnecessary pool and pointer accesses
- did a lot of general cleanup
- retained current loop structure so CPU is bit for bit
- will optimize the loops for CPU and GPU in a separate branch
Rebased. Passes bfb test for MPAS-O w/ Intel v19 + DEBUG on Grizzly. Testing in E3SM.
While testing in E3SM, most tests PASS but a few FAIL. I have no idea yet what is causing the failures.
Testing this PR today in E3SM, it passes It still fails the bit-for-bit check for So this should be ready to go. @mark-petersen could you review and approve?
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU.
mark-petersen left a comment:
Reviewed visually. Looks good. I ran a quick test with stand-alone. Passes nightly regression with opt and debug. @philipwjones thanks for your work on this.
Update mpas-source: Port ocean velocity to GPU
This PR brings in a new mpas-source submodule with changes only to the ocean core. This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. See MPAS-Dev/MPAS-Model#772 [non-BFB]
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU. In particular, this PR has:

- added OpenACC loop and data directives
- removed unnecessary pool and pointer accesses
- done a lot of general cleanup
- retained the current loop structure so CPU results remain bit-for-bit
A phase 2 PR will contain the loop structure changes for optimal GPU performance that result in roundoff-level changes for the CPU code. This PR and the phase 2 PR are meant to replace PR #572.
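As a rough picture of what "OpenACC loop and data directives" means in practice, a sketch in the spirit of the phase-1 changes is below; the routine, array names, coefficient, and loop body are placeholders rather than an excerpt from the ocean core.

```fortran
! Rough sketch of the pattern: data directives keep arrays resident on the
! device, and a parallel loop directive offloads the tendency loop while
! keeping the same arithmetic and loop order as the CPU code. Routine,
! array names, and the loop body are placeholders.
subroutine vel_tend_sketch(nEdgesOwned, nVertLevels, coef, normalVelocity, tendVel)
   use mpas_kind_types, only : RKIND
   implicit none
   integer, intent(in) :: nEdgesOwned, nVertLevels
   real (kind=RKIND), intent(in) :: coef
   real (kind=RKIND), dimension(:,:), intent(in)    :: normalVelocity
   real (kind=RKIND), dimension(:,:), intent(inout) :: tendVel
   integer :: iEdge, k

   !$acc enter data copyin(normalVelocity, tendVel)

   !$acc parallel loop gang vector collapse(2) &
   !$acc    present(normalVelocity, tendVel)
   do iEdge = 1, nEdgesOwned
      do k = 1, nVertLevels
         ! same update as the CPU loop, so CPU results stay bit-for-bit
         tendVel(k, iEdge) = tendVel(k, iEdge) + coef*normalVelocity(k, iEdge)
      end do
   end do

   !$acc exit data copyout(tendVel) delete(normalVelocity)
end subroutine vel_tend_sketch
```

The phase-2 follow-on would then restructure the loops themselves for better GPU performance, which is what introduces the roundoff-level changes on the CPU mentioned above.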