GPU implementation of ocean baroclinic velocity tendencies - phase 1 #772
mattdturner merged 5 commits into MPAS-Dev:ocean/develop from philipwjones:ocean/gpuvel1
Conversation
Tested in a QU240 case on Summit with the PGI compiler. CPU runs were bit-for-bit. I also compared CPU- and GPU-computed tendencies at every step and they differed only in the last digit (abs err < 10^-16), so GPU runs had only roundoff changes. With this bit-for-bit version, speedups were only modest on the GPU. A second round of non-b4b changes should regain the full performance improvement of the original PR #572.
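For reference, a step-by-step comparison of that kind can be done with a small check along the lines of the sketch below; the array names, extents, and calling convention are assumptions for illustration, not the actual MPAS-Ocean diagnostics.

```fortran
! Sketch only: report the maximum absolute difference between CPU- and
! GPU-computed velocity tendencies after a step. The array names, extents,
! and calling convention are hypothetical, not the MPAS-Ocean diagnostics.
subroutine compare_tendencies(nVertLevels, nEdges, tendVelCPU, tendVelGPU)
   use, intrinsic :: iso_fortran_env, only : real64
   implicit none
   integer, intent(in) :: nVertLevels, nEdges
   real (kind=real64), dimension(nVertLevels, nEdges), intent(in)    :: tendVelCPU
   real (kind=real64), dimension(nVertLevels, nEdges), intent(inout) :: tendVelGPU
   real (kind=real64) :: maxErr

   ! Bring the device copy back to the host before comparing; this assumes
   ! tendVelGPU is already present in an enclosing OpenACC data region.
   !$acc update host(tendVelGPU)
   maxErr = maxval(abs(tendVelGPU - tendVelCPU))
   write(*,*) 'max |GPU - CPU| tendency difference: ', maxErr
end subroutine compare_tendencies
```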
Visual inspection looks good. I'll start testing sometime before the holiday break to confirm BFB w/ additional compilers, as well as running a few E3SM tests.
@philipwjones on Grizzly with GNU 6 (these modules) I get compile errors: with both
@mark-petersen Intel and GNU seem to require that the |
Since I mentioned a runtime error in a previous comment, here are some details. I'm trying to run w/ Intel v19 on Grizzly, and I get the following error: This occurs with both OpenMP and no OpenMP. In the same subroutine, prior to the line where the error occurs is It seems that
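As background, the general failure pattern being described (clarified in the next comment) looks roughly like the self-contained sketch below, assuming a package-gated field such as topDrag whose array is never allocated; the program and routine names are hypothetical, and only the pattern itself is taken from this discussion.

```fortran
! Self-contained sketch of the failure mode (hypothetical names, not the
! actual MPAS code): a pointer that is never allocated, because the
! Registry package controlling it is inactive, is passed to a routine
! that declares it as an ordinary array dummy. Intel's runtime checks
! flag this even if the routine never touches the array.
program inactive_array_demo
   implicit none
   real, dimension(:,:), pointer :: topDrag => null()

   ! In MPAS this pointer would come from mpas_pool_get_array on the
   ! forcing pool and stays unassociated when its package is off.
   call apply_forcing(topDrag)

contains

   subroutine apply_forcing(topDrag)
      real, dimension(:,:), intent(in) :: topDrag
      ! The body would be guarded by a config flag, so topDrag is never
      ! read here, but the call itself already breaks the argument rules.
   end subroutine apply_forcing

end program inactive_array_demo
```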
I figured out this particular issue. The namelist for the test that I am running has So when the @philipwjones and @mark-petersen Do either of you have a preference on how I resolve this? The 2 options that come to mind immediately are:
@mattdturner I was afraid of that - a third option is to allocate the array at a minimal size if not needed (e.g. size of 1 in each dim) so it has a valid object to pass, but that's probably too involved for this PR. I guess we should reinstate the forcingPool as an argument - that might be closer to where we will be eventually for this pool (and other smaller pools where we can pass an instance/struct around). That will catch the meridionalWindStress too.
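A rough sketch of the pool-as-argument option is below. The routine name, argument list, and guard are illustrative assumptions; mpas_pool_type, mpas_pool_get_array, and the meridionalWindStress field mentioned above are the existing pieces being leaned on.

```fortran
! Sketch of the pool-as-argument approach: the tendency routine takes the
! forcing pool itself and pulls out whatever members it needs, so inactive
! fields never have to be passed as bare array arguments.
subroutine ocn_vel_forcing_tend(forcingPool, tendVel, err)
   use mpas_kind_types, only : RKIND
   use mpas_derived_types, only : mpas_pool_type
   use mpas_pool_routines, only : mpas_pool_get_array
   implicit none
   type (mpas_pool_type), intent(in) :: forcingPool
   real (kind=RKIND), dimension(:,:), intent(inout) :: tendVel
   integer, intent(out) :: err

   real (kind=RKIND), dimension(:), pointer :: meridionalWindStress

   err = 0
   ! The pointer stays unassociated when the owning package is inactive,
   ! which is harmless as long as it is only used inside a guarded block.
   call mpas_pool_get_array(forcingPool, 'meridionalWindStress', &
                            meridionalWindStress)
   if (associated(meridionalWindStress)) then
      ! ... add the wind-stress contribution to tendVel here ...
   end if
end subroutine ocn_vel_forcing_tend
```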
Sounds good. I'll go ahead and re-implement the passing of the forcingPool. I'd be happy to modify the framework to allocate inactive arrays to have an extent of 1 in each dimension. But I agree that this is probably best done as a separate PR if that is what we want to do.
Does this only involve That would be a simple way to avoid this, right?
Alright, I was able to get simulations to run w/ Intel v19 + Debug on Grizzly. All but 2 of the tests in the nightly test suite pass. The FAILs occur when performing a bfb comparison against a baseline (the current head of ocean/develop); both of the tests pass the bfb check between threads or number of ranks. Oddly enough, the partition and thread tests for the other timestepping schemes are BFB when compared to baseline. Whatever is causing the non-bfb behavior appears to affect only the RK4 timestepping. I'll push my updates soon.
My general comment is that packages were made to avoid egregious outlays of memory for variables that are not used. If variables are always used for E3SM (our main concern), then allocating some extra memory for unused variables in idealized cases is OK, especially if it makes other things easier. But maybe I'm missing the point. It looks like you are having other issues as well.
@mark-petersen The issue also occurred with
I just pushed a few commits that resolve the issue discussed above and move the preprocessor lines so that we are all working with the same code.
@mark-petersen Yeah, I was surprised that the windStress arrays weren't always there, and it is probably worth always allocating them. The topDrag arrays maybe not so much. It's not a huge issue to pass the forcingPool right now, and as I mentioned previously, we can still pass a similar structure in the future even in a redesign. We were hoping to do that with the mesh too, but that structure is so big there's a performance hit for traversing it to get the fields you want. Forcing doesn't have that problem. The downside currently is an extra data transfer, but I think I can get around that eventually.
@philipwjones Do you want me to investigate the non-BFB behavior with RK4, or do you think the non-BFB is expected behavior? I don't see why the other timestepping schemes would be BFB, though...
@mattdturner - yes, we should probably look into RK4. There are a couple of instances where tendencies are treated differently for RK4, so there is a possibility that a bug was introduced. You might look first at the hadv_coriolis routine, where I handled that case a little differently to eliminate an extra mask multiplication.
@philipwjones I replaced the new hadv_coriolis code with the old and the RK4 tests pass as BFB. So it looks like the non-BFB behavior w/ Intel is a roundoff error from FMA. Assuming that the non-BFB with RK4 is acceptable, I will go ahead and start testing in E3SM while I also run the MPAS-O nightly suite w/ GNU.
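For anyone puzzling over how eliminating a mask multiplication can be non-BFB, here is a small, purely illustrative example (not the hadv_coriolis code): two algebraically identical groupings of a multiply-add can round differently once the compiler is free to reorder or contract them into FMA instructions.

```fortran
! Illustrative only -- not the actual hadv_coriolis code. The two updates
! are algebraically identical, but the different grouping, plus the way a
! compiler may contract each multiply-add into a fused FMA instruction,
! can change the stored result in its last bit.
program fma_roundoff_demo
   use, intrinsic :: iso_fortran_env, only : real64
   implicit none
   real (kind=real64) :: tend1, tend2, qVal1, qVal2, normalVel, edgeMask

   tend1 = 1.0e-3_real64;  tend2 = tend1
   qVal1 = 0.1_real64;     qVal2 = qVal1
   normalVel = 0.3_real64
   edgeMask  = 0.7_real64

   ! Form 1: mask applied as an extra multiplication in the update
   tend1 = tend1 + edgeMask * (qVal1 * normalVel)

   ! Form 2: mask folded into one factor ahead of time, eliminating the
   ! extra multiplication from the update itself
   qVal2 = edgeMask * qVal2
   tend2 = tend2 + qVal2 * normalVel

   ! Depending on optimization and FMA contraction, this difference can be
   ! nonzero at the level of the last bit.
   write(*,*) 'difference = ', tend1 - tend2
end program fma_roundoff_demo
```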
I made (and deleted) a comment about the GNU testing that prematurely claimed all tests pass w/ GNU + Debug without the edits in #772 (comment). With GNU v5.3.0 + Debug on Grizzly, the same RK4 tests FAIL without the changes to hadv_coriolis. All other tests PASS.
I ran a few tests and compared to a baseline; the process that I used is recorded here for my reference. All tests PASSed.
Unfortunately, Cori is still offline so I can't run tests there. @philipwjones @mark-petersen If you both are OK with the non-BFB behavior in the RK4 timestepping (which I have shown to be roundoff), then I believe this PR is good from a CPU standpoint. I'll run a few quick tests on Summit to confirm that I didn't break the OpenACC runs in the process of my updates.
Yep, I'm fine with the RK4 diffs. And the Cori issue was the reason why I hadn't tested a bit more, so thanks for picking up the slack on my part.
This also passes a BFB test when compared to the baseline.
I'm still working on getting a GPU test running on Summit.
I was able to get the GPU test running on Summit. So this PR should be ready to merge.
- added OpenACC loop and data directives
- removed unnecessary pool and pointer accesses
- did a lot of general cleanup
- retained current loop structure so CPU is bit for bit
- will optimize the loops for CPU and GPU in a separate branch
Rebased. Passes bfb test for MPAS-O w/ Intel v19 + DEBUG on Grizzly. Testing in E3SM.
While testing in E3SM, most tests PASS but a few FAIL. I have no idea yet what is causing the failures.
Testing this PR today in E3SM, it passes It still fails the bit-for-bit check for So this should be ready to go. @mark-petersen could you review and approve?
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU.
mark-petersen left a comment:
Reviewed visually. Looks good. I ran a quick test with stand-alone. Passes nightly regression with opt and debug. @philipwjones thanks for your work on this.
Update mpas-source: Port ocean velocity to GPU
This PR brings in a new mpas-source submodule with changes only to the ocean core. This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. See MPAS-Dev/MPAS-Model#772 [non-BFB]
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU. In particular, this PR has:

- added OpenACC loop and data directives
- removed unnecessary pool and pointer accesses
- done a lot of general cleanup
- retained the current loop structure so CPU results remain bit-for-bit
A phase 2 PR will contain the loop structure changes for optimal GPU performance that result in roundoff-level changes for the CPU code. This PR and the phase 2 PR are meant to replace PR #572.
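As a rough picture of what "OpenACC loop and data directives" means in practice, a sketch in the spirit of the phase-1 changes is below; the routine, array names, coefficient, and loop body are placeholders rather than an excerpt from the ocean core.

```fortran
! Rough sketch of the pattern: data directives keep arrays resident on the
! device, and a parallel loop directive offloads the tendency loop while
! keeping the same arithmetic and loop order as the CPU code. Routine,
! array names, and the loop body are placeholders.
subroutine vel_tend_sketch(nEdgesOwned, nVertLevels, coef, normalVelocity, tendVel)
   use mpas_kind_types, only : RKIND
   implicit none
   integer, intent(in) :: nEdgesOwned, nVertLevels
   real (kind=RKIND), intent(in) :: coef
   real (kind=RKIND), dimension(:,:), intent(in)    :: normalVelocity
   real (kind=RKIND), dimension(:,:), intent(inout) :: tendVel
   integer :: iEdge, k

   !$acc enter data copyin(normalVelocity, tendVel)

   !$acc parallel loop gang vector collapse(2) &
   !$acc    present(normalVelocity, tendVel)
   do iEdge = 1, nEdgesOwned
      do k = 1, nVertLevels
         ! same update as the CPU loop, so CPU results stay bit-for-bit
         tendVel(k, iEdge) = tendVel(k, iEdge) + coef*normalVelocity(k, iEdge)
      end do
   end do

   !$acc exit data copyout(tendVel) delete(normalVelocity)
end subroutine vel_tend_sketch
```

The phase-2 follow-on would then restructure the loops themselves for better GPU performance, which is what introduces the roundoff-level changes on the CPU mentioned above.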