Conversation

@jamilgafur

Added parallel capability to the manage_regression_suite script for regression tests.

@xylar
Collaborator

xylar commented Jun 3, 2020

This probably needs to get rebased onto #559. I don't know how many incompatibilities there will be but I had intended #559 in part as a starting point for this work. For example, I would like to avoid the use of global variables in the script if at all possible.

@xylar
Collaborator

xylar commented Jun 3, 2020

Sorry for the moving target. It wasn't clear to me that this didn't already include my commit(s) from #559 until I could actually look at the changes. If it would help to have a quick conversation over WebEx or whatever, let me know.

@vanroekel
Contributor

we may have lost your changes in #559 in the rebase process. We'll make sure to rebase once that PR is merged.

@xylar
Collaborator

xylar commented Jun 3, 2020

@vanroekel, @jamilgafur and I had a quick call and I think we're on the same page. A rebase should take care of things, and I don't think my request to pass the global variables in as function arguments should be too much work. I'll give it a more careful review soon.

@jamilgafur
Author

did: git rebase -i xylar/compass/add_prerequisite_tag

then pushed to this branch

@xylar
Collaborator

xylar commented Jun 5, 2020

@jamilgafur, something is still not right, since there are several commits still showing up that are not related to this PR.

Since xylar/MPAS-Model/compass/add_prerequisite_tag has been merged (#559), you can now rebase onto develop. Could you do the following?

git rebase -i MPAS-Dev/MPAS-Model/develop

Remove lines for commits that you didn't add. These should not be included in your rebase and, in fact, the purpose of the rebase is to remove them.

You will also find two commits with the same name: "added parallel processing for regression suites". These are both yours and I think they are identical. (I think this because I tried to include both in my rebase and I got an error about an empty commit for the second one.) Please change "pick" to "squash" for the second of these two commits so it gets combined with the previous one with the same name. You should have 2 commits with "pick" and one with "squash". Then, edit the commit message when prompted to remove the redundancy.

At the end of that process, I see:

$ git logg
* ac0e2a73 (HEAD -> parallel_processing_branch) removed global variables and passed them as local parameters
* 9d6d065c added parallel processing for regression suites
*   0fda9ab5 (MPAS-Dev/MPAS-Model/develop, develop) Merge branch 'xylar/compass/add_prerequisite_tag' into develop
|\  
| * 95abf0c6 (compass/add_prerequisite_tag) Move main to function
...

@xylar
Collaborator

xylar commented Jun 5, 2020

I've cherry-picked all of my commits from #559 and yours from here (after my rebase above) onto ocean/develop. I didn't use #560, which I should have, but it was instructive. I discovered that the code hangs when there are no prerequisites in the test suite. I added a print statement at the top of stream_queue and I can see that command = [], which doesn't seem to be a case that this code is ready to handle.

I'll continue with testing later on but that's what I've had time for so far.

And I don't mean to be all criticism here. This is some very nice work!

@jamilgafur
Author

I noticed the hang as well; I'll add a small exception to the code for that scenario.

@xylar
Collaborator

xylar commented Jun 5, 2020

@jamilgafur, you deleted your comment about rebasing. Did you figure that out? I'm still seeing extra commits so I'm guessing not.

@jamilgafur
Author

jamilgafur commented Jun 5, 2020

@xylar I did the git rebase; now when I do a git status I get:

On branch parallel_processing_branch
Your branch is ahead of 'origin/master' by 733 commits.
(use "git push" to publish your local commits)

@xylar
Collaborator

xylar commented Jun 5, 2020

@jamilgafur, that might be expected, since develop might be ahead of master by 733 commits. The important thing is whether you see only your expected commits (5 of 6 at this point, I guess) beyond MPAS-Dev/MPAS-Model/develop.

@xylar
Collaborator

xylar commented Jun 5, 2020

@jamilgafur, what you just pushed here is definitely not correct. Maybe I can call you later today to see if I can help you get the rebase done correctly.

@jamilgafur
Author

jamilgafur commented Jun 5, 2020

How does 11:30 EST (3 hours from now) sound?

@xylar
Collaborator

xylar commented Jun 5, 2020

I'm not sure if I can talk at 11:30 your time (MDT, not EST). That's 7:30 pm my time and it will depend on dinner plans, which I haven't made yet. Maybe send me a slack message or an email or something closer to that time and we can see what will work.

@jamilgafur
Author

Shot you an email about Monday to go over this!

@xylar
Collaborator

xylar commented Jun 5, 2020

I'm gone all of next week so today would definitely be better.

@xylar
Collaborator

xylar commented Jun 5, 2020

Okay, so it seems like the problem is that you are somehow going back and forth between develop and ocean/develop. Really, the only safe thing to do is:

  1. Put the commits you want to have in this PR on a branch that is based off of develop. Your commits should be the only thing beyond MPAS-Dev/MPAS-Model/develop. There should be no merges or anything like that.
  2. When you want to test how these changes work with ocean/develop, make a new branch (let's call it test_parallel_regression) that points to MPAS-Dev/MPAS-Model/ocean/develop, then do a test merge of this branch (parallel_processing_branch) into test_parallel_regression. Then, test whether things work as expected. You may also need to test-merge my changes from #560 (Add prerequisite tags to all necessary test cases) into this test branch.
  3. Any changes you find you have to make, you need to make on this branch (parallel_processing_branch). You could temporarily make them on test_parallel_regression just to try them out but then you would need to copy or cherry-pick them onto parallel_processing_branch. Then, you would need to start your test branch over by repeating 2.

In this way, parallel_processing_branch should never have anything to do with ocean/develop. Currently, it has somehow picked up hundreds of commits from ocean/develop.

@xylar
Collaborator

xylar commented Jun 5, 2020

@jamilgafur, to try to undo the damage, here's what I did.

I started on the parallel_processing_branch branch.

$ git reset --hard 0369e17d

This was the commit hash 5 hours ago when I checked out your branch. I still have it checked out on Grizzly, so that's how I know. If I didn't, I'd need to do some sleuthing around in the git log to figure out what orphaned commits exist and which one I wanted, so it's lucky for us that that isn't necessary.

Then, I cherry-picked your latest commit onto it. That's the only one from you I could find in the 733 commits since your current version of the branch diverged from develop, but it's a long history to search through. If there are others, find them and cherry-pick them in the order they were originally committed:

$ git cherry-pick 7dec021ac

Then, I rebased onto develop, keeping only the commits that are yours and squashing the redundant ones:

$ git rebase --interactive MPAS-Dev/MPAS-Model/develop

pick 188c7d948 added parallel processing for regression suites
squash 6e9c0d8e6 added parallel processing for regression suites
pick 0369e17d4 removed global variables and passed them as local parameters
pick ef97f400e added check in stream processing to see if command is empty

Then, I fixed the commit message:

added parallel processing for regression suites

# Please enter the commit message for your changes. Lines starting
...

Once this is done, I see the expected log:

$ git logg
* b7e3df626 (HEAD -> parallel_processing_branch) added check in stream processing to see if command is empty
* 0c1b9bc5c removed global variables and passed them as local parameters
* c0cca0d94 added parallel processing for regression suites
*   0fda9ab5b (MPAS-Dev/MPAS-Model/develop, develop) Merge branch 'xylar/compass/add_prerequisite_tag' into develop
|\  
| * 95abf0c62 (compass/add_prerequisite_tag) Move main to function
| * 118c336de Detect prerequisite tests
| * 748a7c35d Pull test-case info into a dict
| * e7862e43a Clean-up unneeded function argument
| * b592dcb2e Fix case setup output with --verbose
* |   b6cb09736 Merge branch 'atmosphere/fix_timekeeping_imports' into develop (PR #582)
|\ \  
| * | 6c58629aa (MiCurry/MPAS-Model/atmosphere/fix_timekeeping_imports) Fix mpas_timekeeping dependencies
* | |   1294a6557 Merge pull request #575 from xylar/fix_docs_ci
|\ \ \  
| * | | 7a7aa09ac (fix_docs_ci) Update GITHUB_TOKEN
| * | | b9d718282 Fix bad docs path in Travis CI
|/ / /  
* | |   51828a158 Merge PR #472 'xylar/add_compass_docs' into develop
...

Collaborator

@xylar xylar left a comment

@jamilgafur, I think maybe things didn't get fully tested after you switched to local, rather than global, variables. I wasn't able to run with the current branch, either with or without local parallelism.

For reference, I did the following to get the branch I tested on:

git reset --hard MPAS-Dev/MPAS-Model/ocean/develop
git merge MPAS-Dev/MPAS-Model/develop
# had to fix a merge conflict
vim testing_and_setup/compass/setup_testcase.py
git add testing_and_setup/compass/setup_testcase.py 
git merge --continue
git merge jamilgafur/MPAS-Model/parallel_processing_branch
git merge xylar/MPAS-Model/ocean/add_prerequisite_tag

@matthewhoffman
Member

I'm following this PR, but I'll wait to take a look at it closely until issues from @xylar 's review are resolved.

@xylar
Collaborator

xylar commented Jun 18, 2020

@matthewhoffman (and @mark-petersen), yep, once I've approved I'll ping you to have a look, too. Till then, I agree, one step at a time.

@jamilgafur
Author

jamilgafur commented Jun 19, 2020

When I run it, all regression tests pass except ocean_sub_ice_shelf_2D_5km_restart_test.

I updated the code and fixed it based on your comments above. I tested it with the following parameters:

  • Normal Run no parallelism / default
  • With default local parallelism (one node)
  • With local parallelism and two nodes

Each of these runs without any errors except the one stated above.

@jamilgafur
Author

@xylar I ran this code and it all seems to be working fine!

@xylar
Collaborator

xylar commented Jul 8, 2020

Great! I'll do another round of reviewing as soon as I can, but that's likely not going to be until after Friday.

Collaborator

@xylar xylar left a comment

@jamilgafur, this is some great progress!

There were some troubles with missing imports and variables in the serial test that should be very easy to fix.

The parallel test isn't working for me. Once the tests completed, the loop somehow didn't notice and kept looping forever. I would suggest simplifying the code in the while loop significantly. It still needs some formatting polish, but this is what I had in mind and what seems to work for me:

iteration = 0
completed = []
running = []

while len(completed) < len(testcase_data):

    # adding phase: launch any test whose prerequisites have completed
    # and for which enough processors are free
    running_names = [running_key for _, running_key in running]
    print('running_names {}, completed {}'.format(running_names, completed))
    for key in testcase_data:
        if key in completed or key in running_names:
            continue
        if testcase_data[key]['prereqs'] and not testcase_data[key]['runnable']:
            prereq_tests = [prereq['name']
                            for prereq in testcase_data[key]['prereqs']]
            if all(prereq in completed for prereq in prereq_tests):
                testcase_data[key]['runnable'] = True
        if testcase_data[key]['runnable'] and \
                number_of_procs >= testcase_data[key]['procs']:
            number_of_procs -= testcase_data[key]['procs']
            process, key = run(key)
            running.append((process, key))

    # processing phase: reap finished tests and free their processors;
    # iterate over a copy so removing entries doesn't skip any
    for process, running_key in list(running):
        if not psutil.pid_exists(process.pid):
            running.remove((process, running_key))
            print('DONE: {}'.format(running_key))
            completed.append(running_key)
            number_of_procs += testcase_data[running_key]['procs']
        else:
            try:
                process.wait(timeout=3)
            except subprocess.TimeoutExpired:
                continue

    iteration += 1

Also, I think you had added code to check which parallel tests had failed and to report that, but that seems to have been lost. Tests that fail now simply report that they are "DONE".

Along those lines, it seems like no attempt is made to check whether prerequisites ran successfully, just that they completed. It doesn't make sense to run a dependency if its prerequisite test failed. Most dependencies should fail if their prerequisite didn't finish successfully, but I can imagine circumstances where a dependency runs and produces bad results instead of failing when its prerequisite failed. I would keep some sort of status of "successful" or "failed" for the tasks that ran, and mark dependencies as "failed" and add them to "completed" if one of their prerequisites failed.
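A minimal sketch of what I mean (illustrative names like `status` and `resolve_failed_prereqs`, not the script's actual code):

```python
def resolve_failed_prereqs(testcase_data, status, completed):
    """Mark tests whose prerequisites failed as 'failed' and completed.

    status maps test name -> 'successful' or 'failed'; tests marked this
    way are never launched but still count toward completion.
    """
    for name, data in testcase_data.items():
        if name in completed:
            continue
        prereq_names = [prereq['name'] for prereq in data.get('prereqs', [])]
        if any(status.get(prereq) == 'failed' for prereq in prereq_names):
            status[name] = 'failed'
            completed.append(name)
            # report why this test will not be run
            print('NOT RUN (failed prerequisite): {}'.format(name))
```

Calling something like this once per pass of the scheduling loop would also print why each skipped test didn't run. A single pass is shown, so long prerequisite chains would take a few passes to fully propagate.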

jamilgafur and others added 8 commits July 13, 2020 08:52
Co-authored-by: Xylar Asay-Davis <xylarstorm@gmail.com>
@jamilgafur
Author

@xylar, I set the code up so that it runs as follows:

  • if a test has no prereqs, it will run
  • if its prereqs completed, it will run the test
  • if a prereq failed, the other tests that depend on it won't even run
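In sketch form, that gate could look like this (illustrative only; `completed` and `failed` here are sets of test names, not necessarily the script's exact data structures):

```python
def can_run(test, completed, failed):
    """Decide whether a test may be launched, following the rules above."""
    prereqs = test.get('prereqs', [])
    if not prereqs:
        # no prereqs: run right away
        return True
    if any(name in failed for name in prereqs):
        # a prereq failed: tests depending on it never run
        return False
    # otherwise, run only once every prereq has completed
    return all(name in completed for name in prereqs)
```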

jamilgafur and others added 4 commits July 13, 2020 18:17
Co-authored-by: Xylar Asay-Davis <xylarstorm@gmail.com>
local_parallel_code += " pid = process[0].pid\n"
local_parallel_code += " running_key = process[1]\n"
local_parallel_code += " if not psutil.pid_exists(pid):\n"
local_parallel_code += " error_out = subprocess.call(['grep', 'FAIL', running_key.replace(' ', '_')], stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)\n"
Author

This checks whether the run failed and decides what to do with it.

Collaborator

I don't think this grep call is working as you want. And why not use python directly rather than a subprocess call?

Collaborator

Following up a bit:

filename = running_key.replace(' ', '_')
with open(filename) as f:
    error_out = 'FAIL' in f.read()

should accomplish the same thing without a subprocess call.

@xylar
Collaborator

xylar commented Jul 24, 2020

@jamilgafur, I was kind of waiting for you to ping me to review this again. Is it ready for me to test?

@jamilgafur
Author

@xylar sorry, you're right. I think it is ready for review.

@jamilgafur
Author

@xylar any comments?

@xylar
Collaborator

xylar commented Aug 10, 2020

@jamilgafur, I haven't had time to get to this and I probably won't for a few more days still. Sorry about that.

@jamilgafur
Author

@xylar that's fine!

Collaborator

@xylar xylar left a comment

I'm seeing output that indicates that all tests failed; see below. In reality, the logs from the tests say that they passed. Stdout:

$ ./parallel_nightly_ocean_test_suite.py 
Number of usable Processors: 144
		** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 143, Baroclinic Channel 10km - Default Test
		** Running case Global Ocean 240km - Init Test **
['time', '-p', './run.py']
we have enough to add: 4, 139, Global Ocean 240km - Init Test
		** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 135, Global Ocean 240km - RK4 Thread Test
		** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 134, ZISO 20km - Smoke Test
		** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 133, ZISO 20km - Smoke Test with frazil
		** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 132, Baroclinic Channel 10km - Thread Test
		** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 131, Baroclinic Channel 10km - Decomp Test
		** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 127, Baroclinic Channel 10km - Restart Test
		** Running case sub-ice-shelf 2D - restart test **
['time', '-p', './run_test.py']
we have enough to add: 4, 123, sub-ice-shelf 2D - restart test
		** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Periodic Planar 20km - LIGHT particle region reset test
		** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: Global Ocean 240km - RK4 Thread Test 
DONE: 1 121 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Restart Test 
DONE: 1 125 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle region reset test 
DONE: 1 129 Periodic Planar 20km - LIGHT particle region reset test
		** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 126, Global Ocean 240km - RK4 Thread Test
		** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 122, Baroclinic Channel 10km - Restart Test
		** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test 
DONE: 1 121 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test with frazil 
DONE: 1 122 ZISO 20km - Smoke Test with frazil
Failed: Baroclinic Channel 10km - Decomp Test 
DONE: 1 123 Baroclinic Channel 10km - Decomp Test
		** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, Baroclinic Channel 10km - Default Test
		** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, ZISO 20km - Smoke Test with frazil
		** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Thread Test 
DONE: 1 121 Baroclinic Channel 10km - Thread Test
Failed: Periodic Planar 20km - LIGHT particle time reset test 
DONE: 1 122 Periodic Planar 20km - LIGHT particle time reset test
		** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Thread Test
		** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: ZISO 20km - Smoke Test 
DONE: 1 121 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - RK4 Thread Test 
DONE: 1 122 Global Ocean 240km - RK4 Thread Test
Failed: Periodic Planar 20km - LIGHT particle region reset test 
DONE: 1 126 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Decomp Test 
DONE: 1 127 Baroclinic Channel 10km - Decomp Test
		** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 124, Global Ocean 240km - RK4 Thread Test
		** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, ZISO 20km - Smoke Test
		** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Decomp Test
		** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test 
DONE: 1 121 Baroclinic Channel 10km - Default Test
		** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Default Test
Failed: sub-ice-shelf 2D - restart test 
DONE: 1 121 sub-ice-shelf 2D - restart test
Failed: ZISO 20km - Smoke Test with frazil 
DONE: 1 125 ZISO 20km - Smoke Test with frazil
Failed: Global Ocean 240km - RK4 Thread Test 
DONE: 1 126 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Decomp Test 
DONE: 1 130 Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Restart Test 
DONE: 1 131 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle time reset test 
DONE: 1 135 Periodic Planar 20km - LIGHT particle time reset test
Failed: Periodic Planar 20km - LIGHT particle region reset test 
DONE: 1 136 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Thread Test 
DONE: 1 137 Baroclinic Channel 10km - Thread Test
Failed: Baroclinic Channel 10km - Default Test 
DONE: 1 138 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test 
DONE: 1 139 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - Init Test 
DONE: 1 140 Global Ocean 240km - Init Test
runtime: 135.74175310134888
Wall clock time: 02:16



TEST RUNTIMES & ERROR:
00:11 Baroclinic_Channel_10km_-_Decomp_Test
00:07 Baroclinic_Channel_10km_-_Default_Test
00:18 Baroclinic_Channel_10km_-_Restart_Test
00:11 Baroclinic_Channel_10km_-_Thread_Test
02:16 Global_Ocean_240km_-_Init_Test
00:01 Global_Ocean_240km_-_RK4_Thread_Test
00:14 Periodic_Planar_20km_-_LIGHT_particle_region_reset_test
00:18 Periodic_Planar_20km_-_LIGHT_particle_time_reset_test
00:27 ZISO_20km_-_Smoke_Test
00:12 ZISO_20km_-_Smoke_Test_with_frazil
00:35 sub-ice-shelf_2D_-_restart_test
Total runtime 04:50

Example output from ZISO_20km_-_Smoke_Test:

 * Running init_step1
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
     Complete
 * Running init_step2
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
     Complete
 * Running forward
     Complete
real 26.97
user 0.85
sys 0.98

The only task that genuinely seems to have failed is Global_Ocean_240km_-_RK4_Thread_Test:

 * Running 1thread_run
File graph.info does not exist!
Traceback (most recent call last):
  File "./run.py", line 16, in <module>
    subprocess.check_call(['gpmetis', 'graph.info', '4'])
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gpmetis', 'graph.info', '4']' returned non-zero exit status 254.
Traceback (most recent call last):
  File "./run.py", line 50, in <module>
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['./run.py']' returned non-zero exit status 1.
real 0.17
user 0.07
sys 0.04

This appears to be because it ran before its prerequisite, and that seems to be my fault: this test didn't get included in #560. I will fix this now.

But several tests did not run at all, possibly because they think their prerequisite wasn't successful. But there was no indication at all that they didn't run or why. These include:

  • Global Ocean 240km - Performance Test
  • Global Ocean 240km - Restart Test
  • Global Ocean 240km - Thread Test
  • Global Ocean 240km - Analysis Test

Could you try to figure out:

  1. Why are tests saying they failed when they didn't?
  2. Why is there no output when a test doesn't run because its prerequisite failed?

for child in suite_tag:
    for script in child:
        run_scripts.append(script.attrib['name'])
print(run_scripts)
Collaborator

This print statement still needs to be removed eventually.

regression_script_code += 'import numpy as np\n'
regression_script_code += 'import time'
regression_script_code += '\n\n'
regression_script_code += 'start_time = time.time()'
Collaborator

This line is missing a \n
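To see why it matters: without the trailing `\n`, the next appended statement fuses onto the same generated line (a toy example, not the script's code):

```python
# missing '\n': two statements fuse into one invalid line
broken = ''
broken += 'import time'
broken += 'start_time = time.time()'

# with '\n': each statement lands on its own line
fixed = ''
fixed += 'import time\n'
fixed += 'start_time = time.time()\n'

print(broken.splitlines())  # ['import timestart_time = time.time()']
print(fixed.splitlines())   # ['import time', 'start_time = time.time()']
```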

local_parallel_code += "\tprint('\t\t** Running case {} **'.format(key))\n"
local_parallel_code += "\tos.chdir('./{}'.format(testcase_data[key]['path']))\n"
local_parallel_code += "\tcommand = ['time', '-p', './{}'.format(testcase_data[key]['filename'])]\n"
local_parallel_code += "\tprint('{}'.format(command))\n"
Collaborator

Suggested change
local_parallel_code += "\tprint('{}'.format(command))\n"
local_parallel_code += "\tprint('{}'.format(' '.join(command)))\n"

If you want to keep this at all, it should be formatted just as a command, not as a list of strings. I don't think it's useful, though, and I would just remove this line.

@xylar
Collaborator

xylar commented Aug 20, 2020

@jamilgafur, I think we're getting close to having this working!

To test, I did the following (much simpler than before!) in a test branch that I will throw away once I'm done:

cd test_parallel_regression
git reset --hard xylar/MPAS-Model/ocean/add_prerequisite_tag
git merge jamilgafur/MPAS-Model/parallel_processing_branch

Then, I built the code as normal and then I set up both serial and 4-node versions of the nightly regression test suite. Let me know if you have trouble following the same process. I added the prerequisite tag to the test case (Global_Ocean_240km_-_RK4_Thread_Test) that was missing it in my testing above, so hopefully you should be able to test with this and see why these false fails are showing up.

@xylar
Collaborator

xylar commented Nov 6, 2020

This PR will not be merged here. I have made a branch on my COMPASS fork with a version of this branch appropriate for that repo. When the time is right, we can revive this work on that repo:
https://github.com/xylar/compass/tree/parallel_regression_suite
