added parallel processing for regression suites #593
Conversation
|
Sorry for the moving target. It wasn't clear to me that this didn't already include my commit(s) from #559 until I could actually look at the changes. If it would help to have a quick conversation over WebEx or whatever, let me know. |
|
We may have lost your changes from #559 in the rebase process. We'll make sure to rebase once that PR is merged.
|
@vanroekel, @jamilgafur and I had a quick call and I think we're on the same page. A rebase should take care of things, and I don't think my request to move the global variables in as function arguments should be too much work. I'll give it a more careful review soon.
|
I did git rebase -i xylar/compass/add_prerequisite_tag, then pushed to this branch.
|
@jamilgafur, something is still not right, since there are several commits still showing up that are not related to this PR. Remove the lines for commits that you didn't add; these should not be included in your rebase and, in fact, the purpose of the rebase is to remove them. You will also find two commits with the same name, "added parallel processing for regression suites". These are both yours, and I think they are identical (I tried to include both in my rebase and got an error about an empty commit for the second one). Please change "pick" to "squash" for the second of these two commits so it gets combined with the previous one of the same name. You should end up with two commits marked "pick" and one marked "squash". Then, edit the commit message when prompted to remove the redundancy. At the end of that process, I see:
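For reference, the rebase todo should end up looking something like this (the hashes and the first commit's message are made up for illustration):
pick 1a2b3c4 some earlier commit of yours
pick 5d6e7f8 added parallel processing for regression suites
squash 9e8d7c6 added parallel processing for regression suites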
|
I've cherry-picked all of my commits from #559 and yours from here (after my rebase above) onto a test branch. I'll continue with testing later on, but that's what I've had time for so far. And I don't mean to be all criticism here; this is some very nice work!
|
I noticed the hang as well, I will add a small exception to the code for that scenario |
|
@jamilgafur, you deleted your comment about rebasing. Did you figure that out? I'm still seeing extra commits so I'm guessing not. |
|
@xylar I did the git rebase; now when I do a git status I get: On branch parallel_processing_branch |
|
@jamilgafur, that might be expected, since |
|
@jamilgafur, what you just pushed here is definitely not correct. Maybe I can call you later today to see if I can help you get the rebase done correctly. |
|
how does 11:30 EST (3 hours) sound? |
|
I'm not sure if I can talk at 11:30 your time (MDT, not EST). That's 7:30 pm my time and it will depend on dinner plans, which I haven't made yet. Maybe send me a slack message or an email or something closer to that time and we can see what will work. |
|
Shot you an email for Monday to go over this!
|
I'm gone all of next week so today would definitely be better. |
|
Okay, so it seems like the problem is that you are somehow going back and forth between two different versions of the branch. In this way, the unrelated commits keep coming back.
|
@jamilgafur, to try to undo the damage, here's what I did. I started from the commit that was at the head of your branch 5 hours ago, when I checked it out. I still have it around on Grizzly, so that's how I know the hash. If I didn't, I'd need to do some sleuthing in the git log to figure out what orphaned commits exist and which one I wanted, so lucky for us that isn't necessary. Then, I cherry-picked your latest commit onto it; that's the only one from you I could find in the 733 commits since your current version of the branch diverged. Then, I rebased onto the prerequisite branch and fixed the commit message. Once this is done, I see the expected log:
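In commands, that recovery was roughly the following (the hashes and the new branch name are placeholders, since the real ones didn't survive in this thread):
git checkout <old-head-hash>                   # the commit from 5 hours ago
git checkout -b parallel_processing_recovered  # placeholder branch name
git cherry-pick <latest-commit-hash>           # the one commit of yours
git rebase xylar/compass/add_prerequisite_tag
git commit --amend                             # fix the commit message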
xylar
left a comment
@jamilgafur, I think maybe things didn't get fully tested after you switched to local, rather than global, variables. I wasn't able to run with the current branch, either with or without local parallelism.
For reference, I did the following to get the branch I tested on:
git reset --hard MPAS-Dev/MPAS-Model/ocean/develop
git merge MPAS-Dev/MPAS-Model/develop
# had to fix a merge conflict
vim testing_and_setup/compass/setup_testcase.py
git add testing_and_setup/compass/setup_testcase.py
git merge --continue
git merge jamilgafur/MPAS-Model/parallel_processing_branch
git merge xylar/MPAS-Model/ocean/add_prerequisite_tag
|
I'm following this PR, but I'll wait to take a look at it closely until issues from @xylar's review are resolved.
|
@matthewhoffman (and @mark-petersen), yep, once I've approved I'll ping you to have a look, too. Till then, I agree, one step at a time. |
|
When I run, all regression tests pass except ocean_sub_ice_shelf_2D_5km_restart_test. I updated the code and fixed it based on your comments above. I tested it with the following parameters:
Each of these runs without any errors except the one stated above.
|
@xylar I ran this code and it all seems to be working fine! |
|
Great! I'll do another round of reviewing as soon as I can, but that's likely not going to be until after Friday. |
xylar
left a comment
@jamilgafur, this is some great progress!
There were some troubles with missing imports and variables in the serial test that should be very easy to fix.
The parallel test isn't working for me: once the tests completed, the loop somehow didn't notice and was stuck looping infinitely. I would suggest simplifying the code in the while loop significantly. This needs some formatting fixes (4-space indentation, for example), but here is what I had in mind and what seems to work for me:
iteration = 0
completed = []
running = []
while len(completed) < len(testcase_data.keys()):
    # adding phase
    running_names = [running_proc[1] for running_proc in running]
    print('running_names {}, completed {}'.format(running_names, completed))
    for key in testcase_data.keys():
        if key not in completed and key not in running_names:
            if testcase_data[key]['prereqs'] and not testcase_data[key]['runnable']:
                prereq_tests = [prereq['name'] for prereq in
                                testcase_data[key]['prereqs']]
                prereq_completed = all(prereq in completed
                                       for prereq in prereq_tests)
                if prereq_completed:
                    testcase_data[key]['runnable'] = True
            if (number_of_procs >= testcase_data[key]['procs']
                    and testcase_data[key]['runnable']):
                number_of_procs = number_of_procs - testcase_data[key]['procs']
                process, key = run(key)
                running.append((process, key))
    # processing phase (iterate over a copy so entries can be removed safely)
    for process in list(running):
        pid = process[0].pid
        running_key = process[1]
        if not psutil.pid_exists(pid):
            running.remove(process)
            print('DONE: {}'.format(running_key))
            completed.append(running_key)
            number_of_procs = number_of_procs + testcase_data[running_key]['procs']
        else:
            try:
                process[0].wait(timeout=3)
            except subprocess.TimeoutExpired:
                continue
    iteration = iteration + 1

Also, I think you had added code to check which parallel tests had failed and to report that, but that seems to have been lost. Tests that fail now simply report that they are "DONE".
Along those lines, it seems like no attempt is made to check whether prerequisites ran successfully, just that they completed. It doesn't make sense to run a dependency if its prerequisite test failed. Most dependencies should fail if their prerequisite didn't finish successfully, but I can imagine circumstances where the dependency runs and produces bad results instead of failing outright. I would keep some sort of status ("successful" or "failed") for the tasks that ran, and mark dependencies as "failed" and add them to "completed" if one of their prerequisites failed.
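A minimal sketch of that bookkeeping (the 'status' field, the poll()-based exit check, and the SKIPPED message are my assumptions, not code from this PR):
# record success or failure as each test finishes
for process, running_key in list(running):
    returncode = process.poll()
    if returncode is None:
        continue  # still running
    running.remove((process, running_key))
    completed.append(running_key)
    testcase_data[running_key]['status'] = \
        'successful' if returncode == 0 else 'failed'
    number_of_procs += testcase_data[running_key]['procs']
# mark dependencies of failed tests as failed without running them
running_names = [key for _, key in running]
for key in testcase_data:
    if key in completed or key in running_names:
        continue
    prereqs = [prereq['name'] for prereq in testcase_data[key]['prereqs']]
    if any(testcase_data[p].get('status') == 'failed' for p in prereqs):
        testcase_data[key]['status'] = 'failed'
        completed.append(key)
        print('SKIPPED (prerequisite failed): {}'.format(key))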
|
@xylar, I set the code up so that it runs as follows:
|
local_parallel_code += "    pid = process[0].pid\n"
local_parallel_code += "    running_key = process[1]\n"
local_parallel_code += "    if not psutil.pid_exists(pid):\n"
local_parallel_code += "        error_out = subprocess.call(['grep', 'FAIL', running_key.replace(' ', '_')], stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)\n"
|
This checks to see whether the run failed and decides what to do with it.
|
I don't think this grep call is working the way you want. And why not use Python directly rather than a subprocess call?
|
Following up a bit:
filename = running_key.replace(' ', '_')
with open(filename) as f:
    error_out = 'FAIL' in f.read()
should accomplish the same thing without a subprocess call.
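Since this lives in generated code, the same check would be appended as string lines, something like (the indentation inside the strings is a guess at the surrounding block level):
local_parallel_code += "    filename = running_key.replace(' ', '_')\n"
local_parallel_code += "    with open(filename) as f:\n"
local_parallel_code += "        error_out = 'FAIL' in f.read()\n"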
|
@jamilgafur, I was kind of waiting for you to ping me to review this again. Is it ready for me to test? |
|
@xylar sorry, you are right; I think it is ready for review.
|
@xylar any comments? |
|
@jamilgafur, I haven't had time to get to this and I probably won't for a few more days still. Sorry about that. |
|
@xylar that's fine!
xylar
left a comment
I'm seeing output that indicates that all tests failed (see below). In reality, the logs from the tests say that they passed. Stdout:
$ ./parallel_nightly_ocean_test_suite.py
Number of usable Processors: 144
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 143, Baroclinic Channel 10km - Default Test
** Running case Global Ocean 240km - Init Test **
['time', '-p', './run.py']
we have enough to add: 4, 139, Global Ocean 240km - Init Test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 135, Global Ocean 240km - RK4 Thread Test
** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 134, ZISO 20km - Smoke Test
** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 133, ZISO 20km - Smoke Test with frazil
** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 132, Baroclinic Channel 10km - Thread Test
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 131, Baroclinic Channel 10km - Decomp Test
** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 127, Baroclinic Channel 10km - Restart Test
** Running case sub-ice-shelf 2D - restart test **
['time', '-p', './run_test.py']
we have enough to add: 4, 123, sub-ice-shelf 2D - restart test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Periodic Planar 20km - LIGHT particle region reset test
** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 121 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Restart Test
DONE: 1 125 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 129 Periodic Planar 20km - LIGHT particle region reset test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 126, Global Ocean 240km - RK4 Thread Test
** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 122, Baroclinic Channel 10km - Restart Test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 121 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test with frazil
DONE: 1 122 ZISO 20km - Smoke Test with frazil
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 123 Baroclinic Channel 10km - Decomp Test
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, Baroclinic Channel 10km - Default Test
** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, ZISO 20km - Smoke Test with frazil
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Thread Test
DONE: 1 121 Baroclinic Channel 10km - Thread Test
Failed: Periodic Planar 20km - LIGHT particle time reset test
DONE: 1 122 Periodic Planar 20km - LIGHT particle time reset test
** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Thread Test
** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: ZISO 20km - Smoke Test
DONE: 1 121 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 122 Global Ocean 240km - RK4 Thread Test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 126 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 127 Baroclinic Channel 10km - Decomp Test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 124, Global Ocean 240km - RK4 Thread Test
** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, ZISO 20km - Smoke Test
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Decomp Test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 121 Baroclinic Channel 10km - Default Test
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Default Test
Failed: sub-ice-shelf 2D - restart test
DONE: 1 121 sub-ice-shelf 2D - restart test
Failed: ZISO 20km - Smoke Test with frazil
DONE: 1 125 ZISO 20km - Smoke Test with frazil
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 126 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 130 Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Restart Test
DONE: 1 131 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle time reset test
DONE: 1 135 Periodic Planar 20km - LIGHT particle time reset test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 136 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Thread Test
DONE: 1 137 Baroclinic Channel 10km - Thread Test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 138 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test
DONE: 1 139 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - Init Test
DONE: 1 140 Global Ocean 240km - Init Test
runtime: 135.74175310134888
Wall clock time: 02:16
TEST RUNTIMES & ERROR:
00:11 Baroclinic_Channel_10km_-_Decomp_Test
00:07 Baroclinic_Channel_10km_-_Default_Test
00:18 Baroclinic_Channel_10km_-_Restart_Test
00:11 Baroclinic_Channel_10km_-_Thread_Test
02:16 Global_Ocean_240km_-_Init_Test
00:01 Global_Ocean_240km_-_RK4_Thread_Test
00:14 Periodic_Planar_20km_-_LIGHT_particle_region_reset_test
00:18 Periodic_Planar_20km_-_LIGHT_particle_time_reset_test
00:27 ZISO_20km_-_Smoke_Test
00:12 ZISO_20km_-_Smoke_Test_with_frazil
00:35 sub-ice-shelf_2D_-_restart_test
Total runtime 04:50
Example output from ZISO_20km_-_Smoke_Test:
* Running init_step1
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
Complete
* Running init_step2
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
Complete
* Running forward
Complete
real 26.97
user 0.85
sys 0.98
The only task that genuinely seems to have failed is Global_Ocean_240km_-_RK4_Thread_Test:
* Running 1thread_run
File graph.info does not exist!
Traceback (most recent call last):
File "./run.py", line 16, in <module>
subprocess.check_call(['gpmetis', 'graph.info', '4'])
File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gpmetis', 'graph.info', '4']' returned non-zero exit status 254.
Traceback (most recent call last):
File "./run.py", line 50, in <module>
File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['./run.py']' returned non-zero exit status 1.
real 0.17
user 0.07
sys 0.04
This appears to be because it ran before its prerequisite, and that seems to be my fault: this test didn't get included in #560. I will fix this now.
But several tests did not run at all, possibly because they think their prerequisite wasn't successful, and there was no indication at all that they didn't run or why. These include:
- Global Ocean 240km - Performance Test
- Global Ocean 240km - Restart Test
- Global Ocean 240km - Thread Test
- Global Ocean 240km - Analysis Test
Could you try to figure out:
- Why are tests saying they failed when they didn't? (See the sketch after this list for one possible check.)
- Why is there no output when a test fails because its prerequisite failed?
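On the first question: since the loop already holds the Popen object for each test, the exit status is a more direct signal than grepping log files, and grep's own convention (exit 0 on a match, 1 on no match, 2 on errors such as a missing file) makes the current check easy to invert by accident. A sketch of the direct check, assuming the test scripts exit nonzero on failure (the traceback above shows run.py does):
# report based on the exit status instead of grepping the log
returncode = process[0].poll()
if returncode is not None:
    status = 'Failed' if returncode != 0 else 'DONE'
    print('{}: {}'.format(status, running_key))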
for child in suite_tag:
    for script in child:
        run_scripts.append(script.attrib['name'])
print(run_scripts)
|
This print statement still needs to be removed eventually.
regression_script_code += 'import numpy as np\n'
regression_script_code += 'import time'
regression_script_code += '\n\n'
regression_script_code += 'start_time = time.time()'
|
This line is missing a \n
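Presumably the intended line is (my guess at the fix):
regression_script_code += 'start_time = time.time()\n'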
local_parallel_code += "\tprint('\t\t** Running case {} **'.format(key))\n"
local_parallel_code += "\tos.chdir('./{}'.format(testcase_data[key]['path']))\n"
local_parallel_code += "\tcommand = ['time', '-p', './{}'.format(testcase_data[key]['filename'])]\n"
local_parallel_code += "\tprint('{}'.format(command))\n"
|
Suggested change:
- local_parallel_code += "\tprint('{}'.format(command))\n"
+ local_parallel_code += "\tprint('{}'.format(' '.join(command)))\n"
If you want to keep this at all, it should be formatted just as a command, not as a list of strings. I don't think it's useful, though, and I would just remove this line.
|
@jamilgafur, I think we're getting close to having this working! To test, I did the following (much simpler than before!) in a test branch that I will throw away once I'm done. Then, I built the code as normal and set up both serial and 4-node versions of the nightly regression test suite. Let me know if you have trouble following the same process. I added the prerequisite tag to the test case.
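The exact commands were not preserved here, but judging from the remote naming used earlier in the thread, the simpler process was presumably something like (the branch name test_parallel is a placeholder):
git fetch jamilgafur/MPAS-Model
git checkout -b test_parallel jamilgafur/MPAS-Model/parallel_processing_branch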
|
This PR will not be merged here. I have made a branch on my COMPASS fork with a version of this branch appropriate for that repo. When the time is right, we can revive this work on that repo: |
Added parallel capability to manage_regression_suite for regression tests.