Localized OpenMP Threading #513
Conversation
Testing was done with Intel v19 on Cori on a QU240 mesh.
src/core_ocean/shared/mpas_ocn_gm.F
This loop has the OpenMP directives commented out since there is a race condition that has yet to be resolved. This will be addressed in a future PR (the work in this PR needs to be merged for future work to build upon). I'm also putting off resolving this race condition until the other work on CVMIX is merged.
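For illustration, here is a minimal sketch of the pattern being described, with invented names (accumulate_edges, cellOnEdge, cellAccum); it is not the actual mpas_ocn_gm.F code. The directives are left in the source but doubly commented out, because two edge iterations can update the same cell's accumulator:

```fortran
! Hypothetical sketch of an edge loop whose directives are disabled:
! cellOnEdge maps each edge to a cell, so two iterations can write to
! the same element of cellAccum, which is a race if the loop is threaded.
subroutine accumulate_edges(nEdges, cellOnEdge, edgeValue, cellAccum)
   implicit none
   integer, intent(in) :: nEdges
   integer, intent(in) :: cellOnEdge(nEdges)
   real, intent(in) :: edgeValue(nEdges)
   real, intent(inout) :: cellAccum(:)
   integer :: iEdge, iCell

   !!$omp parallel do private(iCell)   ! disabled until the race is fixed
   do iEdge = 1, nEdges
      iCell = cellOnEdge(iEdge)
      cellAccum(iCell) = cellAccum(iCell) + edgeValue(iEdge)
   end do
   !!$omp end parallel do
end subroutine accumulate_edges
```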
philipwjones left a comment:
Sorry for the delay - was going to test on Summit but PGI with threading is still broken. I have looked it over and based on Matt's testing will go ahead and approve.
Force-pushed from 23aea8f to 0e573e1.
Rebased. @mattdturner, these two files had so many previous changes that I thought it was dangerous to try to resolve the conflicts. Would you mind redoing the process on those two? Currently, Intel with debug and OPENMP=true produces an error in an ALE routine. With OPENMP=false everything passes. That error could be from my rebasing, and I can look tomorrow.
Yeah, I'll redo the process on those files. I'll also look into the ALE routines while I'm at it to see what might be causing the problem.
@mark-petersen I just pushed 2 commits to re-do the localized threading in those two files. What problems were you seeing with the ALE routines? I didn't encounter any crashes because of them, but I'm not sure if my test case actually ran those routines.
@mattdturner Based on your last comment, we should expect this to be non-BFB for the E3SM nightly regression suite, where optimization is enabled.
In my QU240 testing, this wasn't even BFB with …
This section is producing an error on grizzly using:

```
module load intel/17.0.1 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make ifort CORE=ocean DEBUG=true OPENMP=true GEN_F90=true
```
It looks like iCell is missing in the private list. I added it, and still get this same error:
```
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line   Source
...
ocean_model        0000000001B89B2B  ocn_thick_ale_mp_  172    mpas_ocn_thick_ale.f90
```

which is this line:

```fortran
160   SSH_ALE_Thickness(k) = SSH(iCell) * vertCoordMovementWeights(k) * restingThickness(k, iCell)
```
I have not found the root problem yet. @mattdturner, if your tests on Cori don't reproduce it, it might be an Intel version thing. Could you double-check that the allocatable array SSH_ALE_Thickness is declared correctly? It does not need to be at module level with these changes, does it?
I can confirm that this error does not occur with optimization on, or with OPENMP=false.
As the code currently stands, I should be seeing the error that you are seeing. I'm not sure why it didn't show up in my test on Cori, but the SSH_ALE_Thickness array is uninitialized within the OpenMP region. I'll look into this now.
I just pushed a commit that changes SSH_ALE_Thickness to firstprivate, and I was able to successfully run a test on Grizzly.
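A rough sketch of why firstprivate matters here, with simplified, hypothetical declarations rather than the actual ocn_thick_ale code: a private copy of an allocatable scratch array has undefined contents (and, with some compilers, may not even be allocated), while firstprivate gives every thread an allocated copy initialized from the master thread's values.

```fortran
program firstprivate_demo
   implicit none
   real, allocatable :: ssh_ale_thickness(:)
   real :: column_sum(4)
   integer :: icell, k

   ! Allocated once on the master thread, before the threaded loop.
   allocate(ssh_ale_thickness(3))
   ssh_ale_thickness = 0.0

   ! firstprivate hands every thread an allocated, initialized copy of
   ! the scratch array; with private (or no clause at all) a thread
   ! could see an unallocated or shared copy and crash or race.
   !$omp parallel do firstprivate(ssh_ale_thickness) private(k)
   do icell = 1, 4
      do k = 1, 3
         ssh_ale_thickness(k) = real(icell) * real(k)   ! per-thread scratch
      end do
      column_sum(icell) = sum(ssh_ale_thickness)
   end do
   !$omp end parallel do

   print *, column_sum
end program firstprivate_demo
```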
@mattdturner could you do this: … It produces 86 lines where the partitioned index is not in the private clause. Could you check those? You can ignore the modules removed in #569. I was not sure if the others were oversights.
@mark-petersen I'm not sure what you want me to look into. Is the issue that the index over which a loop is iterating isn't declared private? In this example, OpenMP automatically assigns the loop index as private.
@mattdturner when I looked through this PR, it appeared that you were intentionally adding the main loop index to the private list, so I thought that went with the addition of the parallel line before it. Are you saying that we don't need to put iCell on the private list here, or in all the other places in this PR?
Correct, iCell does not need to be in the private list here. I thought that I had removed all of the main loop indices from private lists, but it looks like there are quite a few that I missed.
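For reference, this follows from OpenMP's rules: the iteration variable of the loop associated with a parallel do construct is predetermined private, so listing it in the private clause is redundant. A minimal illustration (not MPAS code):

```fortran
program loop_index_demo
   implicit none
   integer :: icell
   real :: a(8)

   ! icell is the associated loop index, so OpenMP predetermines it as
   ! private; adding "private(icell)" here would be redundant.
   !$omp parallel do
   do icell = 1, 8
      a(icell) = real(icell)
   end do
   !$omp end parallel do

   print *, a
end program loop_index_demo
```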
Sorry for the extra work, but could you remove all those? Otherwise it leads to confusion. I think you could use sed within a shell script to automate it on all the files.
Sure. I'll take care of that today and push the changes once I'm done.
I just pushed changes that remove the main loop index from the !$omp private directives. It PASSED the thread test when compiled with DEBUG using Intel v19 on Cori.
@mattdturner thanks for the update. I can confirm that this PR now passes the nightly regression suite using intel and debug, with OPENMP=true.
@mattdturner thanks for the update. This fails the 1 vs 2 thread BFB comparison on grizzly with both gnu/optimized and intel/debug. It passes all other tests. I'll see if I can find it.
@mattdturner I believe I see two problems: one was on this branch before, and one was introduced by 268275878. I added an RK4 1 vs 2 thread test to provide more information. Here is what I see: …

Let's just concentrate on the bold part of (2) for now, and see if you can reproduce my error. Please start with 480252a on grizzly, and try the following … with your own …, where the first uses split_explicit (the default) and the second is the newly added RK4 test. We can look at the individual error. I like to use one core for simplicity for now: … and we get the error: …

So we get an … and the error is not obvious. But I find gnu debug messages uninformative. Maybe it's a race condition, or related to an allocated array like the last few problems. @mattdturner, could you look at this?
@mattdturner There are some loops without OpenMP directives: … These are in the init subroutine, so performance doesn't matter. After this PR, what happens to these? Are they run by one thread or all? Do they potentially cause an error with multiple threads, or just not use all threads effectively? The answers also apply to the init mode files here (init mode is only used to generate the initial conditions): … We don't care about performance there. Can we just go without OpenMP directives in init mode? If we do, and someone runs init mode with multiple threads, will it cause any problems?
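For context, general OpenMP semantics answer part of this (the toy sketch below is not the MPAS code): with localized threading, a loop outside any parallel construct executes once, on the single thread that encounters it, which is safe but serial. Under the old global-region design, an undirected loop inside the region would instead be executed redundantly by every thread, which is a correctness hazard for anything non-idempotent.

```fortran
program serial_vs_redundant
   use omp_lib
   implicit none
   integer :: i
   real :: init_field(4)

   ! With localized threading there is no enclosing parallel region, so
   ! this undirected init loop runs once, on a single thread: correct,
   ! just not parallel.
   do i = 1, 4
      init_field(i) = 0.0
   end do
   print *, 'threads executing init loop: ', omp_get_num_threads()  ! prints 1

   ! Under the old design this loop would have sat inside a global
   ! parallel region; without a work-sharing directive, every thread
   ! would execute all iterations, and the writes would be redundant
   ! (and racy for anything less trivial than setting a constant).
end program serial_vs_redundant
```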
Rebased. After rebasing, the MPAS-O nightly regression suite PASSes all tests if compiling with … I'll start running E3SM tests with these changes this afternoon...
The following E3SM tests PASS: … The following tests FAIL: … I'll start trying to debug these 2 failures. Note that I tested this by doing the following in E3SM (noted partially here for my own reference): … I also had to remove the … I haven't done any comparison to baseline datasets, but I do expect some DIFFs in the E3SM tests based on what I saw in the MPAS-O standalone tests.
Oddly enough, while the other configurations pass, the gnu/threading/KNL combination fails with an illegal instruction error.
@mattdturner I sometimes get that illegal instruction error when the compiler has built for the wrong target processor. Looks like this was the only combination of gnu/threading/knl. Might try rebuilding just to see if it's a bad build, or try building an older version to see if this particular configuration is messed up?
@philipwjones I reran the …
I ran …

I'm assuming that I can ignore this failure for this PR, since …

@philipwjones and/or @mark-petersen, do either of you have any insights on which files I should focus on to try to determine the source of the comparison failure? The lower resolution simulations (…) …
Quick update on the … The same test (PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1850S.sandiatoss3_intel.allactive-mach-pet) appears to PASS on sandiatoss3, though, so I am not sure why the test is failing for me on Cori...
Since the 2 above FAILures also fail in …, I ran … So, based on all of the testing for both lower and higher resolutions, I think this branch is ready to be merged.
* origin/ocean/develop: Localized OpenMP Threading #513

Localized OpenMP Threading in MPAS-Ocean (#3596). This PR brings in a new mpas-source submodule with changes only to the ocean core. It replaces the previous OpenMP threading implementation, in which a single parallel region was initiated before the timestep loop, with a new implementation that creates parallel regions around each OpenMP loop (i.e., more localized threading regions instead of a global region). See more details at MPAS-Dev/MPAS-Model#513. This PR is BFB in debug or intel -O0 compile settings, non-BFB otherwise. [non-BFB]

Update mixed_layer_depth file to include changes necessary for the localized OpenMP threading (PR MPAS-Dev#513). Also remove interp_local from private clause.
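To make the distinction in these commit messages concrete, here is a self-contained toy contrast of the two designs (illustrative only; the names and loops are invented, not MPAS code):

```fortran
program threading_styles
   implicit none
   integer, parameter :: n = 8
   real :: a(n)
   integer :: i, step

   ! Old style: one global parallel region around the timestep loop;
   ! every statement inside must be thread-safe or guarded, and
   ! work-sharing directives divide the loops among threads.
   !$omp parallel private(step)
   do step = 1, 2
      !$omp do
      do i = 1, n
         a(i) = real(i * step)
      end do
      !$omp end do
   end do
   !$omp end parallel

   ! New (localized) style: a parallel region is opened around each
   ! threaded loop, and the code between loops runs serially.
   do step = 1, 2
      !$omp parallel do
      do i = 1, n
         a(i) = real(i * step)
      end do
      !$omp end parallel do
   end do

   print *, a
end program threading_styles
```

The localized form trades repeated fork/join overhead per loop for much simpler reasoning about the serial code between loops.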