
Localized OpenMP Threading #513

Merged
mattdturner merged 16 commits into MPAS-Dev:ocean/develop from mattdturner:ocean/local_openmp_threading on Aug 18, 2020

Conversation

@mattdturner
Collaborator

This PR replaces the OpenMP threading implementation, in which a single parallel region was initiated before the timestep loop, with a new implementation that creates parallel regions around each OpenMP loop (i.e., more localized threading regions instead of one global region).
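
Roughly, the change looks like this (a minimal sketch with illustrative loop and variable names, not actual MPAS-Ocean code):

! Old pattern: one parallel region opened before the timestep loop; each loop
! carries only a work-sharing directive, and serial sections inside the region
! need master/single constructs and barriers.
!$omp parallel
! ... timestep loop, executed by all threads ...
!$omp do schedule(runtime) private(k)
do iCell = 1, nCellsOwned
   do k = 1, maxLevelCell(iCell)
      layerThickNew(k, iCell) = layerThickCur(k, iCell) + dt * tend(k, iCell)
   end do
end do
!$omp end do
! ...
!$omp end parallel

! New pattern: a localized parallel region wrapped around each threaded loop.
!$omp parallel
!$omp do schedule(runtime) private(k)
do iCell = 1, nCellsOwned
   do k = 1, maxLevelCell(iCell)
      layerThickNew(k, iCell) = layerThickCur(k, iCell) + dt * tend(k, iCell)
   end do
end do
!$omp end do
!$omp end parallel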

@mattdturner
Collaborator Author

Testing was done with Intel v19 on Cori for a QU240 performance_test case. The results are bit-for-bit when compiling with DEBUG.

Comment on lines 632 to 635
Collaborator Author

This loop has the OpenMP directives commented out because there is a race condition that has yet to be resolved. It will be resolved in a future PR (the work in this PR needs to be merged before future work can build on it).

Comment on lines 360 to 370
Collaborator Author

This loop has the OpenMP directives commented out because there is a race condition that has yet to be resolved. It will be resolved in a future PR (the work in this PR needs to be merged before future work can build on it). I'm also putting off resolving this race condition until the other CVMIX work is merged.

Contributor

@philipwjones left a comment

Sorry for the delay - was going to test on Summit but PGI with threading is still broken. I have looked it over and based on Matt's testing will go ahead and approve.

@mark-petersen self-assigned this May 19, 2020
@mark-petersen force-pushed the ocean/local_openmp_threading branch from 23aea8f to 0e573e1 on May 22, 2020 03:37
@mark-petersen
Contributor

Rebased. @mattdturner, these two files had so many previous changes that I thought it was too risky to try to resolve the conflicts myself:

src/core_ocean/shared/mpas_ocn_gm.F
src/core_ocean/shared/mpas_ocn_tracer_hmix_redi.F

Would you mind redoing the process on those two?

Currently, intel with debug and OpenMP=true produces an error in an ALE routine. With OpenMP=false everything passes. That error could be from my rebasing, and I can look tomorrow.

@mattdturner
Collaborator Author

Yeah, I'll redo the process on those files. I'll also look into the ALE routines while I'm at it to see what might be causing the problem.

@mattdturner
Collaborator Author

@mark-petersen I just pushed 2 commits to re-do the localized threading in mpas_ocn_tracer_hmix_redi.F and mpas_ocn_gm.F. The additional changes were bfb on Intel with debug flags enabled for the QU240 performance test. Note that they are only bfb if I use -O0 as a compile option.

What problems were you seeing with the ALE routines? I didn't encounter any crashes because of them, but I'm not sure if my test case actually ran those routines.

@mark-petersen
Contributor

mark-petersen commented May 26, 2020

@mattdturner Based on your last comment, we should expect this to be non-BFB for the E3SM nightly regression suite, where optimization is enabled.

@mattdturner
Collaborator Author

In my QU240 testing, this wasn't even BFB with DEBUG=true. I had to modify the Makefile to add -O0 to the debug flags in order to get BFB. So yes, it would probably be non-BFB for the E3SM regression suite.

Contributor

This section is producing an error on grizzly using

module load intel/17.0.1 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make ifort CORE=ocean DEBUG=true OPENMP=true GEN_F90=true

It looks like iCell is missing in the private list. I added it, and still get this same error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
...
ocean_model        0000000001B89B2B  ocn_thick_ale_mp_         172  mpas_ocn_thick_ale.f90

which is this line, just below:

160                SSH_ALE_Thickness(k) = SSH(iCell) * vertCoordMovementWeights(k) * restingThickness(k, iCell)

I have not found the root problem yet. @mattdturner, if your tests on Cori don't reproduce it, it might be an Intel version thing. Could you double check that the allocatable array SSH_ALE_Thickness is declared correctly? It does not need to be at module level with these changes, does it?

I can confirm that this error does not occur with optimization on, or with OPENMP=false.

Collaborator Author

As the code currently stands, I should be seeing the error that you are seeing. I'm not sure why it didn't show up in my test on Cori, but the SSH_ALE_Thickness array is uninitialized within the OpenMP region. I'll look into this now.

Collaborator Author

I just pushed a commit that changes SSH_ALE_Thickness to firstprivate, and I was able to successfully run a test on Grizzly.
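
For reference, the directive change is roughly the following (a minimal sketch: the assignment is the line quoted above, but the loop bounds and the placement of the clause are illustrative, not the actual mpas_ocn_thick_ale.F source):

! firstprivate gives each thread a copy of the scratch array that is
! initialized (allocation status and values) from the original before the
! region starts, so it is defined inside the parallel region.
!$omp parallel firstprivate(SSH_ALE_Thickness)
!$omp do schedule(runtime) private(k)
do iCell = 1, nCells
   do k = 1, maxLevelCell(iCell)
      SSH_ALE_Thickness(k) = SSH(iCell) * vertCoordMovementWeights(k) * restingThickness(k, iCell)
   end do
   ! ... SSH_ALE_Thickness is then used as per-thread scratch for this iCell ...
end do
!$omp end do
!$omp end parallel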

@mark-petersen
Contributor

@mattdturner could you do this:

cd src/core_ocean
git grep -in 'private(' |grep omp | grep -v iCell|grep -v iEdge|grep -v iVertex

It produces 86 lines where the partitioned index is not in the private pragma. Could you check those? You can ignore the modules removed in #569. I was not sure if the others were oversights.

@mattdturner
Collaborator Author

@mark-petersen I'm not sure what you want me to look into. Is the issue that the index over which a loop is iterating isn't declared private? In an !$omp do directive, the index of the associated do loop is implicitly private. E.g.,

!$omp do
do i = 1, 10

In this example, OpenMP automatically makes i private. If that is what you were asking, I can explicitly list the loop indices as private for completeness' sake if you would prefer.
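
For example (an illustrative loop, not from the MPAS source), only the inner index needs an explicit private clause; the work-shared index iCell is already predetermined private:

!$omp parallel
!$omp do schedule(runtime) private(k)   ! iCell is implicitly private
do iCell = 1, nCells
   do k = 1, nVertLevels
      work(k, iCell) = 2.0 * work(k, iCell)
   end do
end do
!$omp end do
!$omp end parallel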

Contributor

@mattdturner when I looked through this PR, it appeared that you were intentionally adding the main loop index to the private list, so I thought that went with the addition of the parallel line before it. Are you saying that we don't need to put iCell on the private list here, or in all the other places in this PR?

Collaborator Author

Correct, iCell does not need to be in the private list here. I thought that I had removed all of the main loop indices from the private lists, but it looks like there are quite a few that I missed.

Contributor

Sorry for the extra work, but could you remove all those? Otherwise it leads to confusion. I think you could use sed within a shell script to automate it on all the files.

Collaborator Author

Sure. I'll take care of that today and push the changes once I'm done.

Collaborator Author

I just pushed changes that remove the main loop index from the !$omp private directives. The thread test PASSED when compiled with DEBUG using Intel v19 on Cori.

@mark-petersen
Contributor

@mattdturner thanks for the update. I can confirm that this PR now passes the nightly regression suite using intel and debug, with OPENMP=true.

@mark-petersen
Contributor

@mattdturner thanks for the update. This fails the 1 vs 2 threads bfb comparison on grizzly with both gnu/optimized and intel/debug. Passes all other tests. I'll see if I can find it.

@mark-petersen
Contributor

@mattdturner I believe I see two problems - one was on this branch before, and one introduced by 268275878. I added an RK4 1 vs 2 thread test to provide more information. Here is what I see:

  1. ocean/develop (80d2f30) passes bfb thread test with all: gnu optimized, gnu debug, intel debug
  2. this branch, previous commit (70f782913): fails gnu debug, gnu opt, but passes intel debug
  3. this branch, last commit (268275878): fails gnu debug, gnu opt. Passes intel debug with RK4 but fails for split explicit

Let's just concentrate on the gnu failures in (2) for now. Let's see if you can reproduce my error. Please start with 480252a on grizzly, and try the following:

# standard gcc modules:
module unload python; 
source /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/load_latest_compass.sh
module use /usr/projects/climate/SHARED_CLIMATE/modulefiles/all/
module load gcc/5.3.0 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make gfortran CORE=ocean OPENMP=true GEN_F90=true
cd testing_and_setup/compass/
./manage_regression_suite.py -m runtime_definitions/mpirun.xml -t ocean/regression_suites/nightly.xml -f config.ocean_turq -s --work_dir $WORKDIR

with your own config.ocean_turq file. I'm sure you know all that, but wanted to be clear how I got the error. Then on a compute node reload the modules and run, and you should see this:

cd $WORKDIR
./nightly_ocean_test_suite.py
 ** Running case Global Ocean 240km - Init Test
      PASS
 ** Running case Global Ocean 240km - Performance Test
      PASS
 ** Running case Global Ocean 240km - Restart Test
      PASS
 ** Running case Global Ocean 240km - Thread Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_Thread_Test for more information)
 ** Running case Global Ocean 240km - RK4 Thread Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_RK4_Thread_Test for more information)

where the first uses split_explicit (the default) and the second is the newly added RK4 test. We can look at the individual error. I like to use one core for simplicity now:

cd ocean/global_ocean/QU240/thread_test/2thread_run/
export OMP_NUM_THREADS=2
mpirun -n 1 ./ocean_model
and we get the error:

mpirun -n 1 ./ocean_model
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Backtrace for this error:
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2AADD1FF2C57
#0  0x2AADD1FF2C57
#1  0x2AADD1FF1E50
#1  0x2AADD1FF1E50
#2  0x2AADD2EC33EF
#2  0x2AADD2C8062F
#3  0x2AADD4F41B4B
#4  0x2AADD4F3F7B3
#5  0x2AADD4C34533
#6  0x2AADD1BABD51
#7  0x2AADD2F528CB
#8  0x2AADD1FF2B42
#9  0x2AADD1FF1E50
#10  0x2AADD2EC33EF
#3  0xFC49C0 in __ocn_time_integration_split_MOD_ocn_time_integrator_split._omp_fn.10 at mpas_ocn_time_integration_split.F:966
#4  0x2AADD260808D
#5  0x2AADD2C78EA4
#6  0x2AADD2F8B8CC
#7  0xFFFFFFFFFFFFFFFF
#11  0xFC4B90 in __ocn_time_integration_split_MOD_ocn_time_integrator_split._omp_fn.10 at mpas_ocn_time_integration_split.F:968 (discriminator 20)
#12  0x2AADD26049DE
#13  0xFB9B6D in __ocn_time_integration_split_MOD_ocn_time_integrator_split at mpas_ocn_time_integration_split.F:940 (discriminator 4)
#14  0xFB3818 in __ocn_time_integration_MOD_ocn_timestep at mpas_ocn_time_integration.F:110
#15  0xE28CDF in __ocn_forward_mode_MOD_ocn_forward_mode_run at mpas_ocn_forward_mode.F:621
#16  0xF65AF9 in __ocn_core_MOD_ocn_core_run at mpas_ocn_core.F:111
#17  0x409AC1 in __mpas_subdriver_MOD_mpas_run at mpas_subdriver.F:347
#18  0x408893 in MAIN__ at mpas.F:16

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16467 RUNNING AT gr0162
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

so we get an invalid memory reference at mpas_ocn_time_integration_split.F:966. I looked:

 940                 !$omp parallel
 941                 !$omp do schedule(runtime) &
 942                 !$omp private(i, iEdge, cell1, cell2, sshEdge, thicknessSum, flux)
...
 964                 flux = ((1.0-config_btr_gam1_velWt1) * normalBarotropicVelocitySubcycleCur(iEdge) &
 965                         + config_btr_gam1_velWt1 * normalBarotropicVelocitySubcycleNew(iEdge)) &
 966                         * thicknessSum

and the error is not obvious. But I find gnu debug messages uninformative. Maybe it's a race condition, or related to an allocated array like the last few problems.

@mattdturner could you look at this?

@mark-petersen
Contributor

@mattdturner There are some loops without OpenMP directives:

src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F
2205                do iEdge = 1, nEdges
2223                do iEdge = 1, nEdges

These are in the init subroutine, so performance doesn't matter. After this PR, what happens to these? Are they run by one thread or all? Do they potentially cause an error with multiple threads, or just not use all threads effectively?

The answers also apply to the init mode files in here (init mode is only used to generate the initial conditions):

src/core_ocean/mode_init/

where we don't care about performance. Can we just go without OpenMP directives in init mode? If we do, and someone runs init mode with multiple threads, will it cause any problems?
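
As a general OpenMP point (illustrative loop, not the MPAS source): a loop that is not enclosed in any parallel region is ordinary serial code, so it runs once on the calling thread regardless of OMP_NUM_THREADS; it is not parallelized, but by itself it cannot introduce a threading error.

! No enclosing !$omp parallel: executed once, by the calling thread, even when
! OMP_NUM_THREADS > 1. Not parallelized, and no work-sharing semantics apply.
do iEdge = 1, nEdges
   initField(iEdge) = 0.0   ! initField is an illustrative array name
end do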

@mattdturner
Collaborator Author

Rebased.

After rebasing, the MPAS-O nightly regression suite PASSes all tests if compiling with DEBUG=true and adding the -O0 compilation flag. If compiling with just DEBUG=true but not including -O0, there are DIFFs of around O(1e-15).

I'll start running E3SM tests w/ these changes this afternoon...

@mattdturner
Collaborator Author

The following E3SM tests PASS:

SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_gnu
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_gnu
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_intel
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_gnu
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_gnu
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_intel

The following tests FAIL:

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu
PEM_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel

The PET_Ln9 run fails before it even enters the Ocean code. It fails with a "Program received signal SIGILL: Illegal instruction" error. I'm not entirely sure what is going on there.

The PEM_Ln9 run fails the COMPARE_base_modpes stage.

I'll start trying to debug these 2 P1024.ne30 tests.


Note that I tested this by doing the following in E3SM (partially here for my reference):

git checkout master
git submodule update --init --recursive
cd components/mpas-source
git fetch mturner     # I previously setup my MPAS fork as a remote
git merge mturner/ocean/local_openmp_threading

I had to also remove the !$omp parallel region from components/mpas-ocean/driver/ocn_comp_mct.F

I haven't done any comparison to baseline datasets, but I do expect some DIFFs in the E3SM tests based on what I saw in the MPAS-O standalone tests.

@mattdturner
Collaborator Author

Oddly enough, while PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu FAILs, PET_Ln9_D_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel passes.

@philipwjones
Contributor

@mattdturner I sometimes get that illegal instruction error when the compiler has built for the wrong target processor. It looks like this was the only combination of gnu/threading/knl. You might try rebuilding just to see if it's a bad build, or try building an older version to see if this particular configuration is messed up.

@mattdturner
Collaborator Author

@philipwjones I reran the PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu test, and it failed with the same illegal instruction error. I'll have to checkout master and re-run to confirm that the fail isn't a result of my changes.

@mattdturner
Collaborator Author

mattdturner commented Aug 17, 2020

I ran PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu and PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel w/ E3SM master.

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu FAILs with the same illegal instruction error.

I'm assuming that I can ignore this failure for this PR, since master has an identical failure and it is not related to (or solvable by) threading changes.

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel also FAILs the comparison test.

@philipwjones and/or @mark-petersen , do either of you have any insights on which files I should focus on to try and determine the source of the comparison failure? The lower resolution simulations (QU120, QU240, EC60to30) are all bit-for-bit between thread counts, but the ne30_oECv3_ICG resolution is not. Even though master has the same failure, since it is likely a threading error I think it should be fixed in this PR (unless either of you disagrees).

@mattdturner
Collaborator Author

Quick update on the PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel failure. I commented out all OpenMP directives in the ocean code, and re-ran the test. It still fails the comparison step. So I am assuming that the error is not in the ocean code.

However, the same test (PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1850S.sandiatoss3_intel.allactive-mach-pet) appears to PASS on sandiatoss3, so I am not sure why the test is failing for me on Cori...

@mattdturner
Collaborator Author

Since the two failures above also occur on master, even with all OpenMP disabled in the ocean model, I've decided that the errors should not be investigated or resolved in this PR.

I ran PET_Ln9_P4096.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel w/ both master and master updated with the changes in this branch. The test FAILs on master, but PASSes with this branch.

So, based on all of the testing for both lower and higher resolutions, I think this branch is ready to be merged.

@mark-petersen self-requested a review August 18, 2020 19:25
@mattdturner merged commit 1550912 into MPAS-Dev:ocean/develop Aug 18, 2020
mattdturner added a commit that referenced this pull request Aug 18, 2020
* origin/ocean/develop:
 Localized OpenMP Threading #513
jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Aug 20, 2020
(#3596)

Localized OpenMP Threading in MPAS-Ocean

This PR brings in a new mpas-source submodule with changes only to the ocean
core. It replaces the OpenMP threading implementation, in which a single
parallel region was initiated before the timestep loop, with a new
implementation that creates parallel regions around each OpenMP loop (i.e.,
more localized threading regions instead of one global region). See more
details at MPAS-Dev/MPAS-Model#513

This PR is BFB in debug or intel -O0 compile settings, non-BFB otherwise.

[non-BFB]
mark-petersen added a commit that referenced this pull request Aug 21, 2020
…evelop

Localized OpenMP Threading #513

jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Aug 21, 2020
Localized OpenMP Threading in MPAS-Ocean

mattdturner added a commit to mattdturner/MPAS-Model that referenced this pull request Sep 15, 2020
Update mixed_layer_depth file to include changes necessary
for the localized OpenMP threading (PR MPAS-Dev#513).  Also remove
interp_local from private clause.
@mark-petersen mentioned this pull request Sep 29, 2020
mark-petersen pushed a commit to mark-petersen/MPAS-Model that referenced this pull request Jan 11, 2021
caozd999 pushed a commit to caozd999/MPAS-Model that referenced this pull request Jan 14, 2021
… ocean/develop

Localized OpenMP Threading MPAS-Dev#513

caozd999 pushed a commit to caozd999/MPAS-Model that referenced this pull request Jan 14, 2021
…an/develop

Localized OpenMP Threading MPAS-Dev#513

@mattdturner deleted the ocean/local_openmp_threading branch March 16, 2021 16:43