
Conversation

@hyungyukang commented Jan 9, 2020

Adds a semi-implicit barotropic mode solver.

@hyungyukang changed the title from "PR for a semi implicit barotropic mode solver" to "A semi implicit barotropic mode solver codes" on Jan 9, 2020
@hyungyukang changed the title from "A semi implicit barotropic mode solver codes" to "Codes for a semi implicit barotropic mode solver" on Jan 9, 2020
@mark-petersen self-assigned this on Jan 9, 2020
@mark-petersen self-requested a review on January 9, 2020 15:30
@mark-petersen (Contributor)

@hyungyukang Thank you for your effort on this. Please also link or attach any design documents or testing results you already have. For example, you could link to the talk you gave at the last E3SM meeting.

@hyungyukang (Author)

> @hyungyukang Thank you for your effort on this. Please also link or attach any design documents or testing results you already have. For example, you could link to the talk you gave at the last E3SM meeting.

I'll do that soon. Thanks, @mark-petersen!

@hyungyukang (Author)

Uploading a slide file summarizing this work, slightly revised from my AGU talk last December.
Kang_A Semi-implicit Barotropic Mode Solver for the MPAS-Ocean.pdf

@mark-petersen (Contributor)

@hyungyukang I am able to work on this now. My apologies for the delay.

It looks like commit e89de64ad removed all the files in the repo, and 687f11ac3 added them back again. At the command line, git diff --stat and wc are handy for this:

git diff --stat 3979482ff e89de64ad | wc -l
1764

What were you trying to do with those two commits? We need to reset back to 3979482 and add just the changes you intended. One way to do that is to copy the files you want aside with unix cp or tar, then

git reset --hard 3979482ff

and copy them back into place, commit, and git push --force back to this branch.

@mark-petersen (Contributor) commented Jun 27, 2020

Rebased. For my own reference, I compiled lapack (https://github.com/Reference-LAPACK/lapack) on LANL IC as follows:

git clone git@github.com:Reference-LAPACK/lapack.git
cd lapack
mkdir -p build/lapack_gnu/lib
cd build
module load gcc/5.3.0 openmpi/1.10.5 cmake
cmake -DCMAKE_INSTALL_LIBDIR=/usr/projects/climate/mpeterse/repos/lapack/build/lapack_gnu/lib ..
cmake --build . -j --target install
ls lapack_gnu/lib
    cmake  libblas.a  liblapack.a  pkgconfig

Then to compile:

export LAPACK=/usr/projects/climate/mpeterse/repos/lapack/build/lapack_gnu
make $COMPILER CORE=ocean OPENMP=false DEBUG=true GEN_F90=true USE_LAPACK=true

Some test cases run using semi-implicit, but some result in a seg fault with an invalid memory reference. Looking into it now...

@mark-petersen (Contributor)

Fixed invalid memory reference. A flag name had changed. Not passing restart tests yet.

@mark-petersen (Contributor)

@hyungyukang, did you test the semi-implicit solver with restarts? Most of my tests work when config_do_restart = .false. but in restart tests with config_do_restart = .true., where the initial condition is read in from a restart file, it dies with

CRITICAL ERROR: Iteration number exceeds Max. #iteration: PROGRAM STOP

In this example, the full run goes 4 timesteps from time 0 to 8 hours (2hr/time step) and completes successfully. The restart run reads in the restart file at time step 2 (4 hours) and goes to 8 hours, and should be bfb identical to the full run.

Is there a variable that you need added to the restart file? Or some other condition that could be missing for a restart? Perhaps on restart init we have to populate a variable used for the initial guess for the implicit solver?

@hyungyukang (Author) commented Jun 29, 2020

@mark-petersen, thanks for reviewing the code. Actually, I had never tested the solver with config_do_restart = .true.; I just reproduced the same problem.
But when I run the model with config_time_integrator = 'split_explicit', the model also shows an error:
ERROR: Warning: Sea surface height is outside of acceptable physical range, i.e. abs(sum(h)-bottomDepth)>20m. iCell: 1, maxLevelCell(iCell): 55, bottomDepth(iCell): 4222.79159790825, sum(h): 0.00000000000000

I'm currently running these cases: QU240, EC60to80, and RRS30to10. Are these test cases working with config_do_restart = .true.? If not, can you please recommend some other test cases?

@mark-petersen (Contributor)

You need to run the code forward and write restart files, and then restart from there. I can send a test case. That SSH warning appears because the model is reading in all zeros, since no restart file is available.

@hyungyukang (Author)

> You need to run the code forward and write restart files, and then restart from there. I can send a test case. That SSH warning appears because the model is reading in all zeros, since no restart file is available.

@mark-petersen, got it. I'll run it today and let you know as soon as I find any problems.

@mark-petersen (Contributor)

@hyungyukang I just sent you a zip file of our nightly regression suite, with all the correct flags for your rebased PR. Unzip it, then

cd ocean/global_ocean/QU240/restart_test/

and point ocean_model to your executable in each subdirectory.

You can use run.py, or cd into full_run and then restart_run and run each of those separately. The run.py in the restart_test directory checks whether you get bit-for-bit matching solutions at the end of the full run and the restart run.

@hyungyukang (Author)

> @hyungyukang I just sent you a zip file of our nightly regression suite, with all the correct flags for your rebased PR. Unzip it, then
>
> cd ocean/global_ocean/QU240/restart_test/
>
> and point ocean_model to your executable in each subdirectory.
>
> You can use run.py, or cd into full_run and then restart_run and run each of those separately. The run.py in the restart_test directory checks whether you get bit-for-bit matching solutions at the end of the full run and the restart run.

@mark-petersen, just received it. Thanks!

@mark-petersen (Contributor)

I'll let you test it. When the restart run is not identical to the original run, the most likely cause is that you have a variable or setting in memory that is not saved for the restart. In that case, you need to add that variable to the restart stream in Registry.xml, i.e. to the list just below

1503         <stream name="restart"

Another possibility is that you are missing something in your init routine, like initializing variables for your implicit method. The original run (no restart) initializes normalVelocity to zero everywhere, so it can hide something that you might need to initialize on every simulation start, not just on restarts.
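On that second point, here is a minimal sketch of the idea, not the actual solver code; the routine and variable names (si_seed_initial_guess, sshGuess) are hypothetical placeholders. The point is to seed the iterative solver's first guess from the model state in a code path that both cold starts and restarts pass through:

! Hypothetical sketch: seed the barotropic solver's first guess from the
! current model state during init, so cold starts and restarts take the same path.
subroutine si_seed_initial_guess(nCells, ssh, sshGuess)
   implicit none
   integer, intent(in) :: nCells
   real (kind=8), dimension(nCells), intent(in)  :: ssh       ! SSH from state (zeros on a cold start, read from file on a restart)
   real (kind=8), dimension(nCells), intent(out) :: sshGuess  ! first guess handed to the iterative solver
   integer :: iCell

   ! On a cold start ssh is all zeros, which is why a missing initialization
   ! here can go unnoticed until the first restart run.
   do iCell = 1, nCells
      sshGuess(iCell) = ssh(iCell)
   end do
end subroutine si_seed_initial_guess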

@hyungyukang (Author)

@mark-petersen, I just fixed the bugs for the restart run. Can you please test it?

@hyungyukang (Author) commented Jul 1, 2020

@mark-petersen, I've made some changes to the SI code based on what I've tested in another branch.
One thing you have to change before running the model is namelist.ocean. I created a new section and changed some entries in Registry.xml.

Since the explicit-subcycling and semi-implicit schemes share these options (config_n_ts_iter, config_n_bcl_iter_beg, config_n_bcl_iter_mid, config_n_bcl_iter_end), I took them out of &split_explicit_ts and put them in &split_timestep_share, the new namelist section. Ideas on this would be welcome!

/
&split_timestep_share
    config_n_ts_iter = 2
    config_n_bcl_iter_beg = 1
    config_n_bcl_iter_mid = 2
    config_n_bcl_iter_end = 2
/
&split_explicit_ts
    config_btr_dt = '0000_00:00:15'
    config_n_btr_cor_iter = 2
    config_vel_correction = .true.
    config_btr_subcycle_loop_factor = 2
    config_btr_gam1_velWt1 = 0.5
    config_btr_gam2_SSHWt1 = 1.0
    config_btr_gam3_velWt2 = 1.0
    config_btr_solve_SSH2 = .false.
/
&semi_implicit_ts
    config_btr_si_preconditioner = 'ras'
    config_btr_si_tolerance = 1.0e-9
    config_n_btr_si_outer_iter = 2
/

And now the SI code passes the restart run test:

Beginning variable comparisons for all time levels of field 'temperature'. Note any time levels reported are 0-based.
    Pass thresholds are:
       L1: 0.00000000000000e+00
       L2: 0.00000000000000e+00
       L_Infinity: 0.00000000000000e+00
 ** PASS Comparison of temperature between full_run/output.nc and
    restart_run/output.nc
Beginning variable comparisons for all time levels of field 'salinity'. Note any time levels reported are 0-based.
    Pass thresholds are:
       L1: 0.00000000000000e+00
       L2: 0.00000000000000e+00
       L_Infinity: 0.00000000000000e+00
 ** PASS Comparison of salinity between full_run/output.nc and
    restart_run/output.nc
Beginning variable comparisons for all time levels of field 'layerThickness'. Note any time levels reported are 0-based.
    Pass thresholds are:
       L1: 0.00000000000000e+00
       L2: 0.00000000000000e+00
       L_Infinity: 0.00000000000000e+00
 ** PASS Comparison of layerThickness between full_run/output.nc and
    restart_run/output.nc
Beginning variable comparisons for all time levels of field 'normalVelocity'. Note any time levels reported are 0-based.
    Pass thresholds are:
       L1: 0.00000000000000e+00
       L2: 0.00000000000000e+00
       L_Infinity: 0.00000000000000e+00
 ** PASS Comparison of normalVelocity between full_run/output.nc and
    restart_run/output.nc

@mark-petersen (Contributor) commented Jul 1, 2020

@hyungyukang this is great progress. I just ran our nightly regression suite with gnu debug and optimized, but with openmp off. I compile with

make gfortran CORE=ocean OPENMP=false DEBUG=true

and get

./nightly_ocean_test_suite.py
 ** Running case Baroclinic Channel 10km - Default Test
      PASS
 ** Running case Global Ocean 240km - Init Test
      PASS
 ** Running case Global Ocean 240km - Performance Test
      PASS
 ** Running case Global Ocean 240km - Restart Test
      PASS
 ** Running case Global Ocean 240km - Thread Test
      PASS
 ** Running case Global Ocean 240km - Analysis Test
      PASS
 ** Running case Global Ocean 240km - BGC Ecosys Test
      PASS
 ** Running case ZISO 20km - Smoke Test
      PASS
 ** Running case ZISO 20km - Smoke Test with frazil
      PASS
 ** Running case Baroclinic Channel 10km - Thread Test
      PASS
 ** Running case Baroclinic Channel 10km - Decomp Test
   ** FAIL (See case_outputs/Baroclinic_Channel_10km_-_Decomp_Test for more information)
 ** Running case Baroclinic Channel 10km - Restart Test
      PASS
 ** Running case sub-ice-shelf 2D - restart test
   ** FAIL (See case_outputs/sub-ice-shelf_2D_-_restart_test for more information)
 ** Running case Periodic Planar 20km - LIGHT particle region reset test
      PASS
 ** Running case Periodic Planar 20km - LIGHT particle time reset test
      PASS

First, thank you for fixing the restart, it now works! That is great.
Let's not worry about the sub-ice-shelf test yet.
We do need to worry about the failure on the Baroclinic Channel 10km - Decomp Test. That means that if you run with 4 versus 8 cores (say), the two runs are not bfb with each other. There are two possible causes:

  1. There is a place where we need to run a calculation on more halo layers (nCells = nCellsArray( 2 ), nEdges = nEdgesArray( size(nEdgesArray) ), etc).
  2. A halo update needs to be added somewhere.

I will make a QU240 test that checks partitions as well and send it to you. I don't have time to debug it much this week, but you could try. The general method is: first confirm that you also see the non-bfb mismatch between 4 vs 8 cores (say) with semi-implicit, but a bfb match with split explicit.

Then in your semi-implicit code, set your nCells and nEdges variables so they are maxed out, i.e.
nCells = nCellsArray( size(nCellsArray) ), nEdges = nEdgesArray( size(nEdgesArray) ). Making them large is safer. Once we have a bfb match across partitions, we can make them smaller for efficiency. With everything maxed out you may get an error and need to reduce an index by one somewhere, if a loop references a bad cell at the very edge.

If that does not produce a bfb match, then look through and guess where you might need a halo update, and add it. Again, you can put in lots of extra halo updates and there is no harm done while debugging. Once you get a bfb match, we need to take out all the extra ones, because halo updates are communication and are very expensive.
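For reference, a minimal sketch of that debugging pattern, not the PR's actual solver code: the field name 'sshSubcycleNew' is only illustrative, and I am assuming the usual MPAS halo-exchange helper mpas_dmpar_field_halo_exch here.

! Debugging sketch inside the semi-implicit barotropic routine.
! 1. Loop over the widest halo while debugging, so no owned cell or edge
!    is updated from stale neighbor data.
nCells = nCellsArray( size(nCellsArray) )
nEdges = nEdgesArray( size(nEdgesArray) )

do iCell = 1, nCells
   ! ... barotropic update for cell iCell ...
end do

! 2. Exchange halos for any field written above and read again later.
!    Extra exchanges are harmless while debugging, but remove the unneeded
!    ones once the decomp test passes -- each one is MPI communication.
call mpas_dmpar_field_halo_exch(domain, 'sshSubcycleNew')

Once the 4-core and 8-core runs match bit for bit, the bounds can be tightened back (e.g. nCellsArray( 2 )) and the redundant exchanges removed, as described above.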

@hyungyukang (Author) commented Jul 1, 2020

Thanks @mark-petersen for the results. I'll wait for the partition test and look into the baroclinic channel test as well!

@mark-petersen (Contributor)

I just sent a tar file with a new global test case, comparing 4 vs 8 processors. To run it:

cd ocean/global_ocean/QU240/decomp_test/
./run.py
 * Running 4proc_run
  - Complete
 * Running 8proc_run
  - Complete
Beginning variable comparisons for all time levels of field 'temperature'. Note any time levels reported are 0-based.

and it shows a mismatch between the two. Run it with split explicit and it should match. Give it a try, and you can explore the halo strategies I described above.

mattdturner and others added 20 commits on January 13, 2021 at 09:39
Add a new preprocessor directive (USE_LAPACK) around the calls to LAPACK
routines (see the sketch after this commit list). Also add OpenMP parallel regions to reflect recent threading PRs.
Finally, remove calls to `mpas_ocn_get_config` and replace them with the `use ocn_config` module.
Update the log output when too many SI iterations occur. Previously,
not all messages would be written to the log, since MPAS_LOG_CRIT was
passed before all log messages had been written.
Modify time step size setting for SI code.
Update USE_LAPACK flag in Makefile to provide broader compatibility
In this mode, the Jacobi preconditioner is used.
Also, a simple single-precision allreduce is implemented to achieve a BFB match across partitions.
In a later version, this BFB allreduce could be replaced with a better algorithm.
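The USE_LAPACK guard mentioned in the commits above follows the usual preprocessor pattern; a minimal sketch (dgesv stands in for whichever LAPACK routines the solver actually calls, the error text is illustrative, and I am assuming the standard mpas_log_write interface):

#ifdef USE_LAPACK
      ! Solve the small dense system A x = b; on exit b holds the solution x.
      call dgesv(n, 1, A, n, ipiv, b, n, info)
      if (info /= 0) then
         call mpas_log_write('dgesv failed in the semi-implicit solver', MPAS_LOG_ERR)
      end if
#else
      call mpas_log_write('The semi-implicit solver requires compiling with USE_LAPACK=true', MPAS_LOG_CRIT)
#endif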
mark-petersen added a commit that referenced this pull request Jan 13, 2021
Codes for a semi implicit barotropic mode solver #422
@mark-petersen (Contributor)

Rebased. Testing...

@mark-petersen (Contributor)

Passes mpas-o nightly regression suite with gnu debug and both split explicit and semi-implicit. Note that for semi-implicit, compile on grizzly with

export LAPACK=/usr/projects/climate/mpeterse/repos/lapack/build/lapack_gnu
source /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/load_latest_compass.sh
module load gcc/5.3.0 openmpi/1.10.5 netcdf/4.4.1 parallel-netcdf/1.5.0 pio/1.7.2
make gfortran CORE=ocean USE_LAPACK=true

@mark-petersen mark-petersen merged commit 3d13880 into MPAS-Dev:ocean/develop Jan 14, 2021
jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Jan 14, 2021
Adds ocean semi-implicit barotropic mode solver

This brings in a new mpas-source submodule with changes only to the
ocean core. It adds a semi-implicit barotropic solver for ocean
time-stepping. Current default is split explicit and this new option is
off by default, so this PR should be BFB. But it does require changes to
the mpas-ocean Registry, so this PR also includes corresponding updates
to its e3sm bld scripts.

See MPAS-Dev/MPAS-Model#422

[NML]
[BFB]
jonbob added a commit to E3SM-Project/E3SM that referenced this pull request Jan 15, 2021
Adds ocean semi-implicit barotropic mode solver

mark-petersen added a commit to mark-petersen/MPAS-Model that referenced this pull request Mar 16, 2021
Codes for a semi implicit barotropic mode solver MPAS-Dev#422