GPU implementation of ocean baroclinic velocity tendencies - phase 1 #772

Merged
mattdturner merged 5 commits into MPAS-Dev:ocean/develop from philipwjones:ocean/gpuvel1 on Jan 21, 2021

Conversation

@philipwjones
Contributor

This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the changes needed for running on the GPU, while retaining bit-for-bit accuracy with the current version when running on the CPU. In particular, this PR has:

  • added OpenACC loop and data directives for running on the GPU (illustrated in the sketch below)
  • removed unnecessary pool and pointer accesses
  • done a lot of general cleanup
  • retained the current loop structure, so the current version remains bit-for-bit when run in CPU mode (the default)

A phase 2 PR will contain the loop-structure changes needed for optimal GPU performance, which result in roundoff-level changes in the CPU code. This PR and the phase 2 PR are meant to replace PR #572.
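
For reference, the directives follow the usual pattern of guarding device code with the MPAS_OPENACC preprocessor macro so that CPU builds compile the identical loop with no directives. The sketch below is illustrative only, with hypothetical names, not code taken from this PR:

! Illustrative sketch only -- hypothetical names (tendVel, normalVelocity,
! edgeMask, dragCoeff, nEdgesOwned, nVertLevels), not code from this PR.
! OpenACC directives are guarded by MPAS_OPENACC so the CPU build compiles
! the identical loop with no directives.
#ifdef MPAS_OPENACC
   !$acc parallel loop gang vector collapse(2) &
   !$acc    present(tendVel, normalVelocity, edgeMask)
#endif
   do iEdge = 1, nEdgesOwned
      do k = 1, nVertLevels
         ! hypothetical drag-like contribution to the velocity tendency
         tendVel(k,iEdge) = tendVel(k,iEdge) - &
                            dragCoeff*edgeMask(k,iEdge)*normalVelocity(k,iEdge)
      end do
   end do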

@philipwjones
Contributor Author

Tested the QU240 case on Summit with the PGI compiler. CPU runs were bit-for-bit. I also compared the CPU- and GPU-computed tendencies at every step and they differed only in the last digit (abs err < 10^-16), so GPU runs show only roundoff-level changes. With this bit-for-bit version, speedups on the GPU were only modest; a second round of non-BFB changes should regain the full performance improvement of the original PR #572.
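
In case it helps anyone repeat this check, the comparison was along these lines; this is only a sketch, and tendVelGPU, tendVelCPU, nEdgesOwned, and nVertLevels are placeholder names rather than variables from the PR:

! Rough sketch with placeholder names, not code from this PR: copy the
! device result back to the host and take the maximum absolute difference
! against the CPU-computed tendency.
#ifdef MPAS_OPENACC
   !$acc update host(tendVelGPU)
#endif
   maxAbsErr = 0.0_RKIND
   do iEdge = 1, nEdgesOwned
      do k = 1, nVertLevels
         maxAbsErr = max(maxAbsErr, &
                         abs(tendVelGPU(k,iEdge) - tendVelCPU(k,iEdge)))
      end do
   end do
   ! roundoff-only agreement shows up as maxAbsErr on the order of 1.0e-16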

@mattdturner
Collaborator

Visual inspection looks good. I'll start testing sometime before the holiday break to confirm BFB with additional compilers, as well as run a few E3SM tests.

@mark-petersen
Contributor

@philipwjones On Grizzly with GNU 6 (these modules) I get compile errors:

mpas_ocn_surface_land_ice_fluxes.f90:207:7:

       #ifdef MPAS_OPENACC
       1
Error: Invalid character in name at (1)

with

make gfortran CORE=ocean OPENMP=true DEBUG=true

with both OPENMP true and false. Is something missing?

@mattdturner
Collaborator

@mark-petersen Intel and GNU seem to require that the # of a preprocessor directive occur in the first column. In the local copy that I am working on, simply removing all of the whitespace before the preprocessor statements allows it to compile. I'm running into some runtime issues, though.
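
To illustrate: an indented #ifdef is not recognized as a preprocessor directive by these compilers and is instead parsed as Fortran source, which is what produces the "Invalid character in name" message above. The directive body below is hypothetical, not taken from the file in question:

! Fails to compile -- the # is indented, so the line is treated as source:
      #ifdef MPAS_OPENACC
      !$acc update host(someArray)
      #endif

! Compiles -- the # starts in column 1; the directive body can stay indented:
#ifdef MPAS_OPENACC
      !$acc update host(someArray)
#endif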

@mattdturner
Collaborator

Since I mentioned a runtime error in a previous comment, here are some details. I'm trying to run w/ Intel v19 on Grizzly, and I get the following error:

forrtl: severe (408): fort: (7): Attempt to use pointer WINDSTRESSZONAL when it is not associated with a target

Image              PC                Routine            Line        Source             
ocean_model        00000000026E2DC6  Unknown               Unknown  Unknown
ocean_model        00000000013B72CD  ocn_tendency_mp_o         426  mpas_ocn_tendency.F
ocean_model        0000000001518D08  ocn_time_integrat         537  mpas_ocn_time_integration_split.F
ocean_model        00000000015694DD  ocn_time_integrat         111  mpas_ocn_time_integration.F
ocean_model        000000000105DD1A  ocn_forward_mode_         588  mpas_ocn_forward_mode.F
ocean_model        000000000140CEF4  ocn_core_mp_ocn_c         107  mpas_ocn_core.F
ocean_model        00000000004181E6  mpas_subdriver_mp         347  mpas_subdriver.F
ocean_model        000000000041329A  MAIN__                     16  mpas.F
ocean_model        0000000000413212  Unknown               Unknown  Unknown
libc-2.17.so       00002B0F1FA38555  __libc_start_main     Unknown  Unknown
ocean_model        0000000000413129  Unknown               Unknown  Unknown

This occurs both with and without OpenMP. In the same subroutine, just before the line where the error occurs, there is

      call mpas_pool_get_array(forcingPool, 'windStressZonal', &                                                   
                                             windStressZonal)

It seems that windStressZonal should already be associated with a target (it is defined in ../inc/structs_and_variables.inc), so I'm not sure why I'm getting the crash. I'm still working on it, though.

@mattdturner
Collaborator

mattdturner commented Dec 17, 2020

I figured out this particular issue. The namelist for the test that I am running sets config_use_bulk_wind_stress = .false., which means windStressZonal is defined as an inactive array and no memory is allocated for it.

So when the ocn_vel_tend subroutine tries to pass windStressZonal as an argument to ocn_surface_bulk_forcing_vel, Intel complains that it is not associated. Previously this was avoided by not touching windStressZonal until inside the ocn_surface_bulk_forcing_vel routine, which returned before accessing the array because bulkWindStressOff was .true.

@philipwjones and @mark-petersen Does either of you have a preference for how I resolve this? The two options that come to mind immediately are:

  1. Make the bulkWindStressOff module variable in mpas_ocn_surface_bulk_forcing.F a public variable, and add a conditional around the call to ocn_surface_bulk_forcing_vel.
  2. Move the mpas_pool_get_array(forcingPool, 'windStressZonal', windStressZonal) call back into ocn_surface_bulk_forcing_vel, and remove windStressZonal from the argument list. This would necessitate moving the copyin for this array (and others?) into the ocn_surface_bulk_forcing_vel routine as well (a rough sketch of this option follows the list).
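
To make the trade-off concrete, here is a minimal sketch of what option 2 might look like. It is simplified and partly assumed: the argument list, the tendency array name tendVel, and the placement of the OpenACC data directives are illustrative rather than the final implementation; mpas_pool_get_array and the bulkWindStressOff module flag are the pieces named above.

! Minimal sketch of option 2 (simplified and partly assumed; not the final code).
subroutine ocn_surface_bulk_forcing_vel(forcingPool, tendVel, err)

   type (mpas_pool_type), pointer :: forcingPool
   real (kind=RKIND), dimension(:,:), intent(inout) :: tendVel
   integer, intent(out) :: err

   real (kind=RKIND), dimension(:), pointer :: windStressZonal

   err = 0
   if (bulkWindStressOff) return   ! inactive arrays are never retrieved

   ! the pool lookup moves back inside this routine, so it only happens
   ! when bulk wind stress is actually enabled
   call mpas_pool_get_array(forcingPool, 'windStressZonal', windStressZonal)

#ifdef MPAS_OPENACC
   !$acc enter data copyin(windStressZonal)
#endif
   ! ... add the surface wind-stress contribution to tendVel ...
#ifdef MPAS_OPENACC
   !$acc exit data delete(windStressZonal)
#endif

end subroutine ocn_surface_bulk_forcing_vel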

@philipwjones
Contributor Author

philipwjones commented Dec 17, 2020

@mattdturner I was afraid of that. A third option is to allocate the array at a minimal size when it isn't needed (e.g., size 1 in each dimension) so there is a valid object to pass, but that's probably too involved for this PR. I guess we should reinstate the forcingPool as an argument; that might be closer to where we will eventually be for this pool (and other smaller pools where we can pass an instance/struct around). That will catch the meridional wind stress too.

@mattdturner
Collaborator

Sounds good. I'll go ahead and re-implement the passing of forcingPool and move the !$acc data statements.

I'd be happy to modify the framework to allocate inactive arrays to have an extent of 1 in each dimension. But I agree that this is probably best done as a separate PR if that is what we want to do.

@mark-petersen
Contributor

Does this only involve windStressZonal and windStressMeridional? If so, let's just take them out of any package (no package entry in Registry.xml). I see these are the only two variables in windStressBulkPKG, so we could remove that package completely. It is always on for E3SM and for all test cases with wind, and for idealized test cases without wind forcing we can afford those two 2D fields in memory.

That would be a simple way to avoid this, right?

@mattdturner
Collaborator

Alright, I was able to get simulations to run w/ Intel v19 + Debug on Grizzly. All but 2 of the tests in the nightly test suite pass:

 ** Running case Global Ocean 240km - RK4 Partition Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_RK4_Partition_Test for more information)
 ** Running case Global Ocean 240km - RK4 Thread Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_RK4_Thread_Test for more information)

The FAILs occur when performing a BFB comparison against a baseline (the current head of ocean/develop). Both tests pass the BFB check across thread counts and numbers of ranks. For example,

Beginning variable comparisons for all time levels of field 'temperature'. Note any time levels reported are 0-based.                                                                                                                 
    Pass thresholds are:                                                                                           
       L1: 0.00000000000000e+00                                                                                    
       L2: 0.00000000000000e+00                                                                                    
       L_Infinity: 0.00000000000000e+00                                                                            
0:  l1: 3.90764838774253e-19  l2: 1.99840144432528e-15  linf: 1.77635683940025e-15                                 
1:  l1: 4.47876622902798e-17  l2: 4.43800457988786e-14  linf: 1.42108547152020e-14                                 
 ** FAIL Comparison of temperature between 1thread_run/output.nc and                                               
    /lustre/scratch3/turquoise/mturner/runs/nightly/gpuvel1_20201217_base/ocean/global_ocean/QU240/rk4_thread_test/1thread_run/output.nc 

Oddly enough, the partition and thread tests for the other timestepping schemes are BFB when compared to the baseline. Whatever is causing the non-BFB behavior appears to affect only the RK4 timestepping.

I'll push my updates soon.

@mark-petersen
Contributor

mark-petersen commented Dec 17, 2020

My general comment is that packages were made to avoid egregious outlays of memory for variables that are not used. If variables are always used for E3SM (our main concern), then allocating some extra memory for unused variables in idealized cases is OK, especially if it makes other things easier.

But maybe I'm missing the point. It looks like you are having other issues as well.

@mattdturner
Collaborator

@mark-petersen The issue also occurred with topDrag and topDragMag. But I think in total, only 4 variables were causing a problem.

@mattdturner
Collaborator

I just pushed a few commits that resolve the issue discussed above and move the preprocessor lines, so that we are all working with the same code.

@philipwjones
Contributor Author

@mark-petersen Yeah, I was surprised that the windStress arrays weren't always there, and it is probably worth always allocating them. The topDrag arrays, maybe not so much. It's not a huge issue to pass the forcingPool right now, and as I mentioned previously, we can still pass a similar structure in the future even in a redesign. We were hoping to do that with the mesh too, but that structure is so big that there's a performance hit for traversing it to get the fields you want. Forcing doesn't have that problem. The downside currently is an extra data transfer, but I think I can get around that eventually.

@mattdturner
Collaborator

@philipwjones Do you want me to investigate the non-BFB behavior with RK4, or do you think the non-BFB is expected behavior? I don't see why the other timestepping schemes would be BFB, though...

@philipwjones
Contributor Author

@mattdturner - yes, we should probably look into RK4. There are a couple of instances where tendencies are treated differently for RK4 so there is a possibility of a bug introduced. You might look first at the hadv_coriolis routine where I handled that case a little bit differently to eliminate an extra mask multiplication

@mattdturner
Collaborator

@mattdturner - yes, we should probably look into RK4. There are a couple of instances where tendencies are treated differently for RK4 so there is a possibility of a bug introduced. You might look first at the hadv_coriolis routine where I handled that case a little bit differently to eliminate an extra mask multiplication

@philipwjones I replaced the new

               avgVorticity = 0.5_RKIND* &                                                                        
                              (tmpVorticity(k,iEdge) + &                                                          
                               tmpVorticity(k,eoe))

with the old

               avgVorticity = 0.5_RKIND * &                                                                        
                              (normRelVortEdge(k,iEdge) + normPlanetVortEdge(k, iEdge) + &                         
                               normRelVortEdge(k,eoe) + normPlanetVortEdge(k,eoe))

and the RK4 tests pass as BFB. So it looks like the non-BFB behavior w/ Intel is roundoff from FMA: precomputing tmpVorticity regroups the additions, so the compiler rounds (and contracts into fused multiply-adds) the operations differently, changing only the last bit.

Assuming that the non-BFB with RK4 is acceptable, I will go ahead and start testing in E3SM while I also run the MPAS-O nightly suite w/ GNU.

@mattdturner
Collaborator

I made (and deleted) a comment about the GNU testing that prematurely claimed all tests pass w/ GNU + Debug without the edits in #772 (comment).

With GNU v5.3.0 + Debug on Grizzly, the same RK4 tests FAIL without the changes to hadv_coriolis. All other tests PASS.

@mattdturner
Collaborator

I ran a few tests, and compared to a baseline. Here is the process that I used (for my reference):

git checkout master
git fetch origin
git reset --hard origin/master   # Current hash is 3b965ab2b
git submodule update --init --recursive
cd cime/scripts
# Generate the baseline datasets
./create_test -p e3sm -q slurm --walltime 001:00:00 SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_intel PET_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi PET_Ln9.T62_oQU240.GMPAS-IAF.compy_intel PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_intel  --baseline-root $WORKDIR/e3sm_baselines -g -b master_20201218

cd ../../components/mpas-source
git fetch pwjones
git merge pwjones/ocean/gpuvel1
cd ../../cime/scripts
# Compare to the baseline datasets
./create_test -p e3sm -q slurm --walltime 001:00:00 SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_intel PET_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi PET_Ln9.T62_oQU240.GMPAS-IAF.compy_intel PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_intel  --baseline-root $WORKDIR/e3sm_baselines -c -b master_20201218

All tests PASSed:

  • SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_intel
  • PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_intel
  • SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi
  • PET_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi
  • PET_Ln9.T62_oQU240.GMPAS-IAF.compy_intel
  • PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi

Unfortunately, Cori is still offline so I can't run tests there.


@philipwjones @mark-petersen If you both are OK with the non-BFB behavior in the RK4 timestepping (which I have shown to be roundoff), then I believe this PR is good from a CPU standpoint. I'll run a few quick tests on Summit to confirm that I didn't break the OpenACC runs in the process of my updates.

@philipwjones
Contributor Author

Yep, I'm fine with the RK4 diffs. And the Cori issue was the reason why I hadn't tested a bit more, so thanks for picking up the slack on my part.

@mattdturner
Collaborator

This also passes a BFB test when compared to efb184d52 for the following 2 tests:

  • ERP_Ld11.ne4_oQU240.A_WCYCL2000.summit_pgi
  • ERP_Ld11.ne4_oQU240.A_WCYCL2000.summit_gnu

I'm still working on getting a GPU test running on Summit.

@mattdturner
Collaborator

I was able to get SMS_P32.T62_oQU120_ais20.MPAS_LISIO_TEST.summit_pgigpu to compile and run on Summit. The test runs to completion and FAILs the baseline comparison, which is expected since the GPU results differ from the CPU baseline at roundoff level.

So this PR should be ready to merge.

@mattdturner
Collaborator

Rebased. Passes the BFB test for MPAS-O w/ Intel v19 + DEBUG on Grizzly. Now testing in E3SM.

@mattdturner
Collaborator

While testing in E3SM, most tests PASS but a few FAIL:

PASS

SMS.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_gnu
SMS.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_intel
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_intel
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_gnu
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_intel
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_gnu
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_intel
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_gnu
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_intel
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_intel
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_gnu
  - PASS SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_gnu BASELINE next_20210114
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_intel
  - PASS SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_intel BASELINE next_20210114
PET_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi

FAIL:

FAIL PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi COMPARE_base_modpes
FAIL SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi BASELINE /compyfs/e3sm_baselines/pgi/master: DIFF

I have no idea yet what is causing the PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi failure. But I'm going to undo the merge of this PR onto e3sm/develop until we can figure it out.

@mattdturner self-assigned this Jan 15, 2021
@mattdturner
Collaborator

Testing this PR today in E3SM, it passes PEM_Ln9.T62_oQU240.GMPAS-IAF.compy_pgi.

It still fails the bit-for-bit check for SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi. However, if I generate a new baseline for SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.compy_pgi on master and then re-test this PR with debugging enabled, it passes the BFB check.

So this should be ready to go.

@mark-petersen could you review and approve?

mattdturner added a commit that referenced this pull request Jan 19, 2021
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU
Contributor

@mark-petersen left a comment

Reviewed visually. Looks good. I ran a quick test with stand-alone MPAS-Ocean; it passes the nightly regression suite with both opt and debug. @philipwjones thanks for your work on this.

jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Jan 20, 2021
Update mpas-source: Port ocean velocity to GPU

This PR brings in a new mpas-source submodule with changes only to the
ocean core. This is the first phase of a GPU implementation of all the
baroclinic velocity tendencies.

See MPAS-Dev/MPAS-Model#772

[non-BFB]
jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Jan 21, 2021
Update mpas-source: Port ocean velocity to GPU

This PR brings in a new mpas-source submodule with changes only to the
ocean core. This is the first phase of a GPU implementation of all the
baroclinic velocity tendencies.

See MPAS-Dev/MPAS-Model#772

[non-BFB]
@mattdturner merged commit 27ce43c into MPAS-Dev:ocean/develop on Jan 21, 2021
mark-petersen pushed a commit to mark-petersen/MPAS-Model that referenced this pull request Mar 16, 2021
This is the first phase of a GPU implementation of all the baroclinic velocity tendencies. It contains a subset of the necessary changes for running on the GPU that retain bit-for-bit accuracy with the current version when running on the CPU
mark-petersen pushed a commit to mark-petersen/MPAS-Model that referenced this pull request Mar 16, 2021
This is the first phase of a GPU implementation of all the baroclinic
velocity tendencies. It contains a subset of the necessary changes for
running on the GPU that retain bit-for-bit accuracy with the current
version when running on the CPU.
vanroekel pushed a commit to vanroekel/MPAS-Model that referenced this pull request Sep 20, 2021
This is the first phase of a GPU implementation of all the baroclinic
velocity tendencies. It contains a subset of the necessary changes for
running on the GPU that retain bit-for-bit accuracy with the current
version when running on the CPU.