
Localized OpenMP Threading #513

Merged
mattdturner merged 16 commits into MPAS-Dev:ocean/develop from mattdturner:ocean/local_openmp_threading on Aug 18, 2020

Conversation

@mattdturner
Collaborator

This PR replaces the OpenMP threading implementation, in which a single parallel region was initiated before the timestep loop, with a new implementation that creates parallel regions around each OpenMP loop (i.e., more localized threading regions instead of one global region).
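
Roughly, the change looks like this (a minimal sketch with illustrative loop and variable names, not actual MPAS-Ocean code):

! Old pattern: one parallel region opened before the timestep loop; each loop
! carries only a work-sharing directive, and serial sections inside the region
! need master/single constructs and barriers.
!$omp parallel
! ... timestep loop, executed by all threads ...
!$omp do schedule(runtime) private(k)
do iCell = 1, nCellsOwned
   do k = 1, maxLevelCell(iCell)
      layerThickNew(k, iCell) = layerThickCur(k, iCell) + dt * tend(k, iCell)
   end do
end do
!$omp end do
! ...
!$omp end parallel

! New pattern: a localized parallel region wrapped around each threaded loop.
!$omp parallel
!$omp do schedule(runtime) private(k)
do iCell = 1, nCellsOwned
   do k = 1, maxLevelCell(iCell)
      layerThickNew(k, iCell) = layerThickCur(k, iCell) + dt * tend(k, iCell)
   end do
end do
!$omp end do
!$omp end parallel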

@mattdturner
Collaborator Author

Testing was done with Intel v19 on Cori for a QU240 performance_test case. The results are bit-for-bit when compiling with DEBUG.

Comment on lines 632 to 635
Collaborator Author

This loop has the OpenMP directives commented out because there is a race condition that has yet to be resolved. It will be resolved in a future PR (the work in this PR needs to be merged before future work can build on it).

Comment on lines 360 to 370
Collaborator Author

This loop has the OpenMP directives commented out because there is a race condition that has yet to be resolved. It will be resolved in a future PR (the work in this PR needs to be merged before future work can build on it). I'm also putting off resolving this race condition until the other CVMIX work is merged.

Contributor

@philipwjones left a comment

Sorry for the delay - was going to test on Summit but PGI with threading is still broken. I have looked it over and based on Matt's testing will go ahead and approve.

@mark-petersen self-assigned this May 19, 2020
@mark-petersen force-pushed the ocean/local_openmp_threading branch from 23aea8f to 0e573e1 on May 22, 2020 03:37
@mark-petersen
Contributor

Rebased. @mattdturner, these two files had so many previous changes that I thought it was too risky to try to resolve the conflicts myself:

src/core_ocean/shared/mpas_ocn_gm.F
src/core_ocean/shared/mpas_ocn_tracer_hmix_redi.F

Would you mind redoing the process on those two?

Currently, intel with debug and OpenMP=true produces an error in an ALE routine. With OpenMP=false everything passes. That error could be from my rebasing, and I can look tomorrow.

@mattdturner
Collaborator Author

Yeah, I'll redo the process on those files. I'll also look into the ALE routines while I'm at it to see what might be causing the problem.

@mattdturner
Collaborator Author

@mark-petersen I just pushed 2 commits to re-do the localized threading in mpas_ocn_tracer_hmix_redi.F and mpas_ocn_gm.F. The additional changes were bfb on Intel with debug flags enabled for the QU240 performance test. Note that they are only bfb if I use -O0 as a compile option.

What problems were you seeing with the ALE routines? I didn't encounter any crashes because of them, but I'm not sure if my test case actually ran those routines.

@mark-petersen
Contributor

mark-petersen commented May 26, 2020

@mattdturner Based on your last comment, we should expect this to be non-BFB for the E3SM nightly regression suite, where optimization is enabled.

@mattdturner
Collaborator Author

In my QU240 testing, this wasn't even BFB with DEBUG=true. I had to modify the Makefile to add -O0 to the debug flags in order to get BFB. So yes, it would probably be non-BFB for the E3SM regression suite.

Contributor

This section is producing an error on grizzly using

module load intel/17.0.1 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make ifort CORE=ocean DEBUG=true OPENMP=true GEN_F90=true

It looks like iCell is missing in the private list. I added it, and still get this same error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
...
ocean_model        0000000001B89B2B  ocn_thick_ale_mp_         172  mpas_ocn_thick_ale.f90

which is this line, just below:

160                SSH_ALE_Thickness(k) = SSH(iCell) * vertCoordMovementWeights(k) * restingThickness(k, iCell)

I have not found the root problem yet. @mattdturner, if your tests on Cori don't reproduce it, it might be an Intel version thing. Could you double check that the allocatable array SSH_ALE_Thickness is declared correctly? It does not need to be at module level with these changes, does it?

I can confirm that this error does not occur with optimization on, or with OPENMP=false.

Collaborator Author

As the code currently stands, I should be seeing the error that you are seeing. I'm not sure why it didn't show up in my test on Cori, but the SSH_ALE_Thickness array is uninitialized within the OpenMP region. I'll look into this now.

Collaborator Author

I just pushed a commit that changes SSH_ALE_Thickness to firstprivate, and I was able to successfully run a test on Grizzly.
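
For reference, the directive change is roughly the following (a minimal sketch: the assignment is the line quoted above, but the loop bounds and the placement of the clause are illustrative, not the actual mpas_ocn_thick_ale.F source):

! firstprivate gives each thread a copy of the scratch array that is
! initialized (allocation status and values) from the original before the
! region starts, so it is defined inside the parallel region.
!$omp parallel firstprivate(SSH_ALE_Thickness)
!$omp do schedule(runtime) private(k)
do iCell = 1, nCells
   do k = 1, maxLevelCell(iCell)
      SSH_ALE_Thickness(k) = SSH(iCell) * vertCoordMovementWeights(k) * restingThickness(k, iCell)
   end do
   ! ... SSH_ALE_Thickness is then used as per-thread scratch for this iCell ...
end do
!$omp end do
!$omp end parallel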

@mark-petersen
Contributor

@mattdturner could you do this:

cd src/core_ocean
git grep -in 'private(' |grep omp | grep -v iCell|grep -v iEdge|grep -v iVertex

It produces 86 lines where the partitioned index is not in the private pragma. Could you check those? You can ignore the modules removed in #569. I was not sure if the others were oversights.

@mattdturner
Collaborator Author

@mark-petersen I'm not sure what you want me to look into. Is the issue that the index over which a loop is iterating isn't declared private? In an !$omp do directive, the index of the associated do loop is implicitly private. E.g.,

!$omp do
do i = 1, 10

In this example, OpenMP automatically makes i private. If that is what you were asking, I can explicitly list the loop indices as private for completeness' sake if you would prefer.
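
For example (an illustrative loop, not from the MPAS source), only the inner index needs an explicit private clause; the work-shared index iCell is already predetermined private:

!$omp parallel
!$omp do schedule(runtime) private(k)   ! iCell is implicitly private
do iCell = 1, nCells
   do k = 1, nVertLevels
      work(k, iCell) = 2.0 * work(k, iCell)
   end do
end do
!$omp end do
!$omp end parallel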

Contributor

@mattdturner when I looked through this PR, it appeared that you were intentionally adding the main loop index to the private list, so I thought that went with the addition of the parallel line before it. Are you saying that we don't need to put iCell on the private list here, or in all the other places in this PR?

Collaborator Author

Correct, iCell does not need to be in the private list here. I thought that I had removed all of the main loop indices from the private lists, but it looks like there are quite a few that I missed.

Contributor

Sorry for the extra work, but could you remove all those? Otherwise it leads to confusion. I think you could use sed within a shell script to automate it on all the files.

Collaborator Author

Sure. I'll take care of that today and push the changes once I'm done.

Collaborator Author

I just pushed changes that remove the main loop index from the !$omp private directives. The thread test PASSED when compiled with DEBUG using Intel v19 on Cori.

@mark-petersen
Contributor

@mattdturner thanks for the update. I can confirm that this PR now passes the nightly regression suite using intel and debug, with OPENMP=true.

@mark-petersen
Contributor

@mattdturner thanks for the update. This fails the 1 vs 2 threads bfb comparison on grizzly with both gnu/optimized and intel/debug. Passes all other tests. I'll see if I can find it.

@mark-petersen
Contributor

@mattdturner I believe I see two problems - one was on this branch before, and one introduced by 268275878. I added an RK4 1 vs 2 thread test to provide more information. Here is what I see:

  1. ocean/develop (80d2f30) passes bfb thread test with all: gnu optimized, gnu debug, intel debug
  2. this branch, previous commit (70f782913): fails gnu debug, gnu opt, but passes intel debug
  3. this branch, last commit (268275878): fails gnu debug, gnu opt. Passes intel debug with RK4 but fails for split explicit

Let's just concentrate on the gnu failures in (2) for now. Let's see if you can reproduce my error. Please start with 480252a on grizzly, and try the following:

# standard gcc modules:
module unload python; 
source /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/load_latest_compass.sh
module use /usr/projects/climate/SHARED_CLIMATE/modulefiles/all/
module load gcc/5.3.0 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make gfortran CORE=ocean OPENMP=true GEN_F90=true
cd testing_and_setup/compass/
./manage_regression_suite.py -m runtime_definitions/mpirun.xml -t ocean/regression_suites/nightly.xml -f config.ocean_turq -s --work_dir $WORKDIR

with your own config.ocean_turq file. I'm sure you know all that, but wanted to be clear how I got the error. Then on a compute node reload the modules and run, and you should see this:

cd $WORKDIR
./nightly_ocean_test_suite.py
 ** Running case Global Ocean 240km - Init Test
      PASS
 ** Running case Global Ocean 240km - Performance Test
      PASS
 ** Running case Global Ocean 240km - Restart Test
      PASS
 ** Running case Global Ocean 240km - Thread Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_Thread_Test for more information)
 ** Running case Global Ocean 240km - RK4 Thread Test
   ** FAIL (See case_outputs/Global_Ocean_240km_-_RK4_Thread_Test for more information)

where the first uses split_explicit (the default) and the second is the newly added RK4 test. We can look at the individual error. I like to use one core for simplicity now:

cd ocean/global_ocean/QU240/thread_test/2thread_run/
export OMP_NUM_THREADS=2
mpirun -n 1 ./ocean_model
and we get the error:

mpirun -n 1 ./ocean_model
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Backtrace for this error:
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2AADD1FF2C57
#0  0x2AADD1FF2C57
#1  0x2AADD1FF1E50
#1  0x2AADD1FF1E50
#2  0x2AADD2EC33EF
#2  0x2AADD2C8062F
#3  0x2AADD4F41B4B
#4  0x2AADD4F3F7B3
#5  0x2AADD4C34533
#6  0x2AADD1BABD51
#7  0x2AADD2F528CB
#8  0x2AADD1FF2B42
#9  0x2AADD1FF1E50
#10  0x2AADD2EC33EF
#3  0xFC49C0 in __ocn_time_integration_split_MOD_ocn_time_integrator_split._omp_fn.10 at mpas_ocn_time_integration_split.F:966
#4  0x2AADD260808D
#5  0x2AADD2C78EA4
#6  0x2AADD2F8B8CC
#7  0xFFFFFFFFFFFFFFFF
#11  0xFC4B90 in __ocn_time_integration_split_MOD_ocn_time_integrator_split._omp_fn.10 at mpas_ocn_time_integration_split.F:968 (discriminator 20)
#12  0x2AADD26049DE
#13  0xFB9B6D in __ocn_time_integration_split_MOD_ocn_time_integrator_split at mpas_ocn_time_integration_split.F:940 (discriminator 4)
#14  0xFB3818 in __ocn_time_integration_MOD_ocn_timestep at mpas_ocn_time_integration.F:110
#15  0xE28CDF in __ocn_forward_mode_MOD_ocn_forward_mode_run at mpas_ocn_forward_mode.F:621
#16  0xF65AF9 in __ocn_core_MOD_ocn_core_run at mpas_ocn_core.F:111
#17  0x409AC1 in __mpas_subdriver_MOD_mpas_run at mpas_subdriver.F:347
#18  0x408893 in MAIN__ at mpas.F:16

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16467 RUNNING AT gr0162
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

so we get an invalid memory reference at mpas_ocn_time_integration_split.F:966. I looked:

 940                 !$omp parallel
 941                 !$omp do schedule(runtime) &
 942                 !$omp private(i, iEdge, cell1, cell2, sshEdge, thicknessSum, flux)
...
 964                 flux = ((1.0-config_btr_gam1_velWt1) * normalBarotropicVelocitySubcycleCur(iEdge) &
 965                         + config_btr_gam1_velWt1 * normalBarotropicVelocitySubcycleNew(iEdge)) &
 966                         * thicknessSum

and the error is not obvious. But I find gnu debug messages uninformative. Maybe it's a race condition, or related to an allocated array like the last few problems.

@mattdturner could you look at this?

@mark-petersen
Contributor

@mattdturner There are some loops without OpenMP directives:

src/core_ocean/mode_forward/mpas_ocn_time_integration_split.F
2205                do iEdge = 1, nEdges
2223                do iEdge = 1, nEdges

These are in the init subroutine, so performance doesn't matter. After this PR, what happens to these? Are they run by one thread or all? Do they potentially cause an error with multiple threads, or just not use all threads effectively?

The answers also apply to the init mode files in here (init mode is only used to generate the initial conditions):

src/core_ocean/mode_init/

where we don't care about performance. Can we just go without OpenMP directives in init mode? If we do, and someone runs init mode with multiple threads, will it cause any problems?
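
As a general OpenMP point (illustrative loop, not the MPAS source): a loop that is not enclosed in any parallel region is ordinary serial code, so it runs once on the calling thread regardless of OMP_NUM_THREADS; it is not parallelized, but by itself it cannot introduce a threading error.

! No enclosing !$omp parallel: executed once, by the calling thread, even when
! OMP_NUM_THREADS > 1. Not parallelized, and no work-sharing semantics apply.
do iEdge = 1, nEdges
   initField(iEdge) = 0.0   ! initField is an illustrative array name
end do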

@mattdturner
Collaborator Author

Rebased.

After rebasing, the MPAS-O nightly regression suite PASSes all tests if compiling with DEBUG=true and adding the -O0 compilation flag. If compiling with just DEBUG=true but not including -O0, there are DIFFs of around O(1e-15).

I'll start running E3SM tests w/ these changes this afternoon...

@mattdturner
Collaborator Author

The following E3SM tests PASS:

SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-haswell_gnu
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_gnu
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_intel
PET_Ln3.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.cori-haswell_gnu
PEM_Ln9.T62_oQU240.GMPAS-IAF.cori-haswell_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_intel
SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_gnu
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_intel

The following tests FAIL:

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu
PEM_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel

The PET_Ln9 run fails before it even enters the Ocean code. It fails with a "Program received signal SIGILL: Illegal instruction" error. I'm not entirely sure what is going on there.

The PEM_Ln9 run fails the COMPARE_base_modpes stage.

I'll start trying to debug these 2 P1024.ne30 tests.


Note that I tested this by doing the following in E3SM (partially here for my reference):

git checkout master
git submodule update --init --recursive
cd components/mpas-source
git fetch mturner     # I previously setup my MPAS fork as a remote
git merge mturner/ocean/local_openmp_threading

I had to also remove the !$omp parallel region from components/mpas-ocean/driver/ocn_comp_mct.F

I haven't done any comparison to baseline datasets, but I do expect some DIFFs in the E3SM tests based on what I saw in the MPAS-O standalone tests.

@mattdturner
Collaborator Author

Oddly enough, while PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu FAILs, PET_Ln9_D_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel passes.

@philipwjones
Contributor

@mattdturner I sometimes get that illegal instruction error when the compiler has built for the wrong target processor. It looks like this was the only combination of gnu/threading/knl. You might try rebuilding just to see if it's a bad build, or try building an older version to see if this particular configuration is messed up.

@mattdturner
Collaborator Author

@philipwjones I reran the PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu test, and it failed with the same illegal instruction error. I'll have to checkout master and re-run to confirm that the fail isn't a result of my changes.

@mattdturner
Collaborator Author

mattdturner commented Aug 17, 2020

I ran PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu and PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel w/ E3SM master.

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_gnu FAILs with the same illegal instruction error.

I'm assuming that I can ignore this failure for this PR, since master has an identical failure and it is not related to (or solvable by) threading changes.

PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel also FAILs the comparison test.

@philipwjones and/or @mark-petersen , do either of you have any insights on which files I should focus on to try and determine the source of the comparison failure? The lower resolution simulations (QU120, QU240, EC60to30) are all bit-for-bit between thread counts, but the ne30_oECv3_ICG resolution is not. Even though master has the same failure, since it is likely a threading error I think it should be fixed in this PR (unless either of you disagrees).

@mattdturner
Collaborator Author

Quick update on the PET_Ln9_P1024.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel failure. I commented out all OpenMP directives in the ocean code, and re-ran the test. It still fails the comparison step. So I am assuming that the error is not in the ocean code.

However, the same test (PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1850S.sandiatoss3_intel.allactive-mach-pet) appears to PASS on sandiatoss3, so I am not sure why the test is failing for me on Cori...

@mattdturner
Collaborator Author

Since the two failures above also occur on master, even with all OpenMP disabled in the ocean model, I've decided that the errors should not be investigated or resolved in this PR.

I ran PET_Ln9_P4096.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel w/ both master and master updated with the changes in this branch. The test FAILs on master, but PASSes with this branch.

So, based on all of the testing for both lower and higher resolutions, I think this branch is ready to be merged.

@mark-petersen self-requested a review August 18, 2020 19:25
@mattdturner merged commit 1550912 into MPAS-Dev:ocean/develop Aug 18, 2020
mattdturner added a commit that referenced this pull request Aug 18, 2020
* origin/ocean/develop:
 Localized OpenMP Threading #513
jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Aug 20, 2020
(#3596)

Localized OpenMP Threading in MPAS-Ocean

This PR brings in a new mpas-source submodule with changes only to the ocean
core. It replaces the OpenMP threading implementation, in which a single
parallel region was initiated before the timestep loop, with a new
implementation that creates parallel regions around each OpenMP loop (i.e.,
more localized threading regions instead of one global region). See more
details at MPAS-Dev/MPAS-Model#513

This PR is BFB in debug or intel -O0 compile settings, non-BFB otherwise.

[non-BFB]
mark-petersen added a commit that referenced this pull request Aug 21, 2020
…evelop

Localized OpenMP Threading #513

jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Aug 21, 2020
Localized OpenMP Threading in MPAS-Ocean

mattdturner added a commit to mattdturner/MPAS-Model that referenced this pull request Sep 15, 2020
Update mixed_layer_depth file to include changes necessary
for the localized OpenMP threading (PR MPAS-Dev#513).  Also remove
interp_local from private clause.
@mark-petersen mentioned this pull request Sep 29, 2020
mark-petersen pushed a commit to mark-petersen/MPAS-Model that referenced this pull request Jan 11, 2021
caozd999 pushed a commit to caozd999/MPAS-Model that referenced this pull request Jan 14, 2021
… ocean/develop

Localized OpenMP Threading MPAS-Dev#513

caozd999 pushed a commit to caozd999/MPAS-Model that referenced this pull request Jan 14, 2021
…an/develop

Localized OpenMP Threading MPAS-Dev#513

@mattdturner deleted the ocean/local_openmp_threading branch March 16, 2021 16:43