added parallel processing for regression suites #593
Conversation
|
Sorry for the moving target. It wasn't clear to me that this didn't already include my commit(s) from #559 until I could actually look at the changes. If it would help to have a quick conversation over WebEx or whatever, let me know. |
|
We may have lost your changes from #559 in the rebase process. We'll make sure to rebase once that PR is merged.
|
@vanroekel, @jamilgafur and I had a quick call and I think we're on the same page. A rebase should take care of things, and I don't think my request to move the global variables in as function arguments should be too much work. I'll give it a more careful review soon.
|
I did git rebase -i xylar/compass/add_prerequisite_tag, then pushed to this branch.
|
@jamilgafur, something is still not right, since there are several commits still showing up that are not related to this PR. Remove the lines for commits that you didn't add; these should not be included in your rebase and, in fact, the purpose of the rebase is to remove them. You will also find two commits with the same name, "added parallel processing for regression suites". These are both yours, and I think they are identical (I tried to include both in my rebase and got an error about an empty commit for the second one). Please change "pick" to "squash" for the second of these two commits so it gets combined with the previous one of the same name. You should end up with two commits marked "pick" and one marked "squash". Then, edit the commit message when prompted to remove the redundancy. At the end of that process, I see:
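For reference, the rebase todo should end up looking something like this (the hashes and the first commit's message are made up for illustration):
pick 1a2b3c4 some earlier commit of yours
pick 5d6e7f8 added parallel processing for regression suites
squash 9e8d7c6 added parallel processing for regression suites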
|
I've cherry-picked all of my commits from #559 and yours from here (after my rebase above) onto a test branch. I'll continue with testing later on, but that's what I've had time for so far. And I don't mean to be all criticism here; this is some very nice work!
|
I noticed the hang as well, I will add a small exception to the code for that scenario |
|
@jamilgafur, you deleted your comment about rebasing. Did you figure that out? I'm still seeing extra commits so I'm guessing not. |
|
@xylar I did the git rebase; now when I do a git status I get: On branch parallel_processing_branch |
|
@jamilgafur, that might be expected, since |
|
@jamilgafur, what you just pushed here is definitely not correct. Maybe I can call you later today to see if I can help you get the rebase done correctly. |
|
how does 11:30 EST (3 hours) sound? |
|
I'm not sure if I can talk at 11:30 your time (MDT, not EST). That's 7:30 pm my time and it will depend on dinner plans, which I haven't made yet. Maybe send me a slack message or an email or something closer to that time and we can see what will work. |
|
Shot you an email for Monday to go over this!
|
I'm gone all of next week so today would definitely be better. |
|
Okay, so it seems like the problem is that you are somehow going back and forth between two different versions of the branch. In this way, the unrelated commits keep coming back.
|
@jamilgafur, to try to undo the damage, here's what I did. I started from the commit that was at the head of your branch 5 hours ago, when I checked it out. I still have it around on Grizzly, so that's how I know the hash. If I didn't, I'd need to do some sleuthing in the git log to figure out what orphaned commits exist and which one I wanted, so lucky for us that isn't necessary. Then, I cherry-picked your latest commit onto it; that's the only one from you I could find in the 733 commits since your current version of the branch diverged. Then, I rebased onto the prerequisite branch and fixed the commit message. Once this is done, I see the expected log:
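In commands, that recovery was roughly the following (the hashes and the new branch name are placeholders, since the real ones didn't survive in this thread):
git checkout <old-head-hash>                   # the commit from 5 hours ago
git checkout -b parallel_processing_recovered  # placeholder branch name
git cherry-pick <latest-commit-hash>           # the one commit of yours
git rebase xylar/compass/add_prerequisite_tag
git commit --amend                             # fix the commit message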
xylar
left a comment
@jamilgafur, I think maybe things didn't get fully tested after you switched to local, rather than global, variables. I wasn't able to run with the current branch, either with or without local parallelism.
For reference, I did the following to get the branch I tested on:
git reset --hard MPAS-Dev/MPAS-Model/ocean/develop
git merge MPAS-Dev/MPAS-Model/develop
# had to fix a merge conflict
vim testing_and_setup/compass/setup_testcase.py
git add testing_and_setup/compass/setup_testcase.py
git merge --continue
git merge jamilgafur/MPAS-Model/parallel_processing_branch
git merge xylar/MPAS-Model/ocean/add_prerequisite_tag
|
I'm following this PR, but I'll wait to take a look at it closely until issues from @xylar's review are resolved.
|
@matthewhoffman (and @mark-petersen), yep, once I've approved I'll ping you to have a look, too. Till then, I agree, one step at a time. |
|
When I run, all regression tests pass except ocean_sub_ice_shelf_2D_5km_restart_test. I updated the code and fixed it based on your comments above. I tested it with the following parameters:
Each of these runs without any errors except the one stated above.
|
@xylar I ran this code and it all seems to be working fine! |
|
Great! I'll do another round of reviewing as soon as I can, but that's likely not going to be until after Friday. |
xylar
left a comment
@jamilgafur, this is some great progress!
There were some troubles with missing imports and variables in the serial test that should be very easy to fix.
The parallel test isn't working for me: once the tests completed, the loop somehow didn't notice and was stuck looping infinitely. I would suggest simplifying the code in the while loop significantly. This needs some formatting fixes (4-space indentation, for example), but here is what I had in mind and what seems to work for me:
iteration = 0
completed = []
running = []
while len(completed) < len(testcase_data.keys()):
    # adding phase
    running_names = [running_proc[1] for running_proc in running]
    print('running_names {}, completed {}'.format(running_names, completed))
    for key in testcase_data.keys():
        if key not in completed and key not in running_names:
            if testcase_data[key]['prereqs'] and not testcase_data[key]['runnable']:
                prereq_tests = [prereq['name'] for prereq in
                                testcase_data[key]['prereqs']]
                prereq_completed = all(prereq in completed
                                       for prereq in prereq_tests)
                if prereq_completed:
                    testcase_data[key]['runnable'] = True
            if (number_of_procs >= testcase_data[key]['procs']
                    and testcase_data[key]['runnable']):
                number_of_procs = number_of_procs - testcase_data[key]['procs']
                process, key = run(key)
                running.append((process, key))
    # processing phase (iterate over a copy so entries can be removed safely)
    for process in list(running):
        pid = process[0].pid
        running_key = process[1]
        if not psutil.pid_exists(pid):
            running.remove(process)
            print('DONE: {}'.format(running_key))
            completed.append(running_key)
            number_of_procs = number_of_procs + testcase_data[running_key]['procs']
        else:
            try:
                process[0].wait(timeout=3)
            except subprocess.TimeoutExpired:
                continue
    iteration = iteration + 1

Also, I think you had added code to check which parallel tests had failed and to report that, but that seems to have been lost. Tests that fail now simply report that they are "DONE".
Along those lines, it seems like no attempt is made to check whether prerequisites ran successfully, just that they completed. It doesn't make sense to run a dependency if its prerequisite test failed. Most dependencies should fail if their prerequisite didn't finish successfully, but I can imagine circumstances where the dependency runs and produces bad results instead of failing outright. I would keep some sort of status ("successful" or "failed") for the tasks that ran, and mark dependencies as "failed" and add them to "completed" if one of their prerequisites failed.
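A minimal sketch of that bookkeeping (the 'status' field, the poll()-based exit check, and the SKIPPED message are my assumptions, not code from this PR):
# record success or failure as each test finishes
for process, running_key in list(running):
    returncode = process.poll()
    if returncode is None:
        continue  # still running
    running.remove((process, running_key))
    completed.append(running_key)
    testcase_data[running_key]['status'] = \
        'successful' if returncode == 0 else 'failed'
    number_of_procs += testcase_data[running_key]['procs']
# mark dependencies of failed tests as failed without running them
running_names = [key for _, key in running]
for key in testcase_data:
    if key in completed or key in running_names:
        continue
    prereqs = [prereq['name'] for prereq in testcase_data[key]['prereqs']]
    if any(testcase_data[p].get('status') == 'failed' for p in prereqs):
        testcase_data[key]['status'] = 'failed'
        completed.append(key)
        print('SKIPPED (prerequisite failed): {}'.format(key))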
|
@xylar, I set the code up so that it runs as follows:
|
local_parallel_code += "    pid = process[0].pid\n"
local_parallel_code += "    running_key = process[1]\n"
local_parallel_code += "    if not psutil.pid_exists(pid):\n"
local_parallel_code += "        error_out = subprocess.call(['grep', 'FAIL', running_key.replace(' ', '_')], stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)\n"
|
This checks to see whether the run failed and decides what to do with it.
|
I don't think this grep call is working the way you want. And why not use Python directly rather than a subprocess call?
|
Following up a bit:
filename = running_key.replace(' ', '_')
with open(filename) as f:
    error_out = 'FAIL' in f.read()
should accomplish the same thing without a subprocess call.
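Since this lives in generated code, the same check would be appended as string lines, something like (the indentation inside the strings is a guess at the surrounding block level):
local_parallel_code += "    filename = running_key.replace(' ', '_')\n"
local_parallel_code += "    with open(filename) as f:\n"
local_parallel_code += "        error_out = 'FAIL' in f.read()\n"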
|
@jamilgafur, I was kind of waiting for you to ping me to review this again. Is it ready for me to test? |
|
@xylar sorry, you are right; I think it is ready for review.
|
@xylar any comments? |
|
@jamilgafur, I haven't had time to get to this and I probably won't for a few more days still. Sorry about that. |
|
@xylar that's fine!
xylar
left a comment
I'm seeing output that indicates that all tests failed (see below). In reality, the logs from the tests say that they passed. Stdout:
$ ./parallel_nightly_ocean_test_suite.py
Number of usable Processors: 144
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 143, Baroclinic Channel 10km - Default Test
** Running case Global Ocean 240km - Init Test **
['time', '-p', './run.py']
we have enough to add: 4, 139, Global Ocean 240km - Init Test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 135, Global Ocean 240km - RK4 Thread Test
** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 134, ZISO 20km - Smoke Test
** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 133, ZISO 20km - Smoke Test with frazil
** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 132, Baroclinic Channel 10km - Thread Test
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 131, Baroclinic Channel 10km - Decomp Test
** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 127, Baroclinic Channel 10km - Restart Test
** Running case sub-ice-shelf 2D - restart test **
['time', '-p', './run_test.py']
we have enough to add: 4, 123, sub-ice-shelf 2D - restart test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Periodic Planar 20km - LIGHT particle region reset test
** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 121 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Restart Test
DONE: 1 125 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 129 Periodic Planar 20km - LIGHT particle region reset test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 126, Global Ocean 240km - RK4 Thread Test
** Running case Baroclinic Channel 10km - Restart Test **
['time', '-p', './run_test.py']
we have enough to add: 4, 122, Baroclinic Channel 10km - Restart Test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 121 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test with frazil
DONE: 1 122 ZISO 20km - Smoke Test with frazil
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 123 Baroclinic Channel 10km - Decomp Test
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, Baroclinic Channel 10km - Default Test
** Running case ZISO 20km - Smoke Test with frazil **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, ZISO 20km - Smoke Test with frazil
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Thread Test
DONE: 1 121 Baroclinic Channel 10km - Thread Test
Failed: Periodic Planar 20km - LIGHT particle time reset test
DONE: 1 122 Periodic Planar 20km - LIGHT particle time reset test
** Running case Baroclinic Channel 10km - Thread Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Thread Test
** Running case Periodic Planar 20km - LIGHT particle time reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle time reset test
Failed: ZISO 20km - Smoke Test
DONE: 1 121 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 122 Global Ocean 240km - RK4 Thread Test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 126 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 127 Baroclinic Channel 10km - Decomp Test
** Running case Global Ocean 240km - RK4 Thread Test **
['time', '-p', './run.py']
we have enough to add: 4, 124, Global Ocean 240km - RK4 Thread Test
** Running case ZISO 20km - Smoke Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 123, ZISO 20km - Smoke Test
** Running case Baroclinic Channel 10km - Decomp Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 122, Baroclinic Channel 10km - Decomp Test
** Running case Periodic Planar 20km - LIGHT particle region reset test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 121 Baroclinic Channel 10km - Default Test
** Running case Baroclinic Channel 10km - Default Test **
['time', '-p', './run_test.py']
we have enough to add: 1, 121, Baroclinic Channel 10km - Default Test
Failed: sub-ice-shelf 2D - restart test
DONE: 1 121 sub-ice-shelf 2D - restart test
Failed: ZISO 20km - Smoke Test with frazil
DONE: 1 125 ZISO 20km - Smoke Test with frazil
Failed: Global Ocean 240km - RK4 Thread Test
DONE: 1 126 Global Ocean 240km - RK4 Thread Test
Failed: Baroclinic Channel 10km - Decomp Test
DONE: 1 130 Baroclinic Channel 10km - Decomp Test
Failed: Baroclinic Channel 10km - Restart Test
DONE: 1 131 Baroclinic Channel 10km - Restart Test
Failed: Periodic Planar 20km - LIGHT particle time reset test
DONE: 1 135 Periodic Planar 20km - LIGHT particle time reset test
Failed: Periodic Planar 20km - LIGHT particle region reset test
DONE: 1 136 Periodic Planar 20km - LIGHT particle region reset test
Failed: Baroclinic Channel 10km - Thread Test
DONE: 1 137 Baroclinic Channel 10km - Thread Test
Failed: Baroclinic Channel 10km - Default Test
DONE: 1 138 Baroclinic Channel 10km - Default Test
Failed: ZISO 20km - Smoke Test
DONE: 1 139 ZISO 20km - Smoke Test
Failed: Global Ocean 240km - Init Test
DONE: 1 140 Global Ocean 240km - Init Test
runtime: 135.74175310134888
Wall clock time: 02:16
TEST RUNTIMES & ERROR:
00:11 Baroclinic_Channel_10km_-_Decomp_Test
00:07 Baroclinic_Channel_10km_-_Default_Test
00:18 Baroclinic_Channel_10km_-_Restart_Test
00:11 Baroclinic_Channel_10km_-_Thread_Test
02:16 Global_Ocean_240km_-_Init_Test
00:01 Global_Ocean_240km_-_RK4_Thread_Test
00:14 Periodic_Planar_20km_-_LIGHT_particle_region_reset_test
00:18 Periodic_Planar_20km_-_LIGHT_particle_time_reset_test
00:27 ZISO_20km_-_Smoke_Test
00:12 ZISO_20km_-_Smoke_Test_with_frazil
00:35 sub-ice-shelf_2D_-_restart_test
Total runtime 04:50
Example output from ZISO_20km_-_Smoke_Test:
* Running init_step1
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
Complete
* Running init_step2
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
Complete
* Running forward
Complete
real 26.97
user 0.85
sys 0.98
The only task that genuinely seems to have failed is Global_Ocean_240km_-_RK4_Thread_Test:
* Running 1thread_run
File graph.info does not exist!
Traceback (most recent call last):
File "./run.py", line 16, in <module>
subprocess.check_call(['gpmetis', 'graph.info', '4'])
File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gpmetis', 'graph.info', '4']' returned non-zero exit status 254.
Traceback (most recent call last):
File "./run.py", line 50, in <module>
File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.8/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['./run.py']' returned non-zero exit status 1.
real 0.17
user 0.07
sys 0.04
This appears to be because it ran before its prerequisite, and that seems to be my fault: this test didn't get included in #560. I will fix this now.
But several tests did not run at all, possibly because they think their prerequisite wasn't successful, and there was no indication at all that they didn't run or why. These include:
- Global Ocean 240km - Performance Test
- Global Ocean 240km - Restart Test
- Global Ocean 240km - Thread Test
- Global Ocean 240km - Analysis Test
Could you try to figure out:
- Why are tests saying they failed when they didn't? (See the sketch after this list for one possible check.)
- Why is there no output when a test fails because its prerequisite failed?
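On the first question: since the loop already holds the Popen object for each test, the exit status is a more direct signal than grepping log files, and grep's own convention (exit 0 on a match, 1 on no match, 2 on errors such as a missing file) makes the current check easy to invert by accident. A sketch of the direct check, assuming the test scripts exit nonzero on failure (the traceback above shows run.py does):
# report based on the exit status instead of grepping the log
returncode = process[0].poll()
if returncode is not None:
    status = 'Failed' if returncode != 0 else 'DONE'
    print('{}: {}'.format(status, running_key))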
for child in suite_tag:
    for script in child:
        run_scripts.append(script.attrib['name'])
print(run_scripts)
|
This print statement still needs to be removed eventually.
regression_script_code += 'import numpy as np\n'
regression_script_code += 'import time'
regression_script_code += '\n\n'
regression_script_code += 'start_time = time.time()'
|
This line is missing a \n
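Presumably the intended line is (my guess at the fix):
regression_script_code += 'start_time = time.time()\n'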
local_parallel_code += "\tprint('\t\t** Running case {} **'.format(key))\n"
local_parallel_code += "\tos.chdir('./{}'.format(testcase_data[key]['path']))\n"
local_parallel_code += "\tcommand = ['time', '-p', './{}'.format(testcase_data[key]['filename'])]\n"
local_parallel_code += "\tprint('{}'.format(command))\n"
|
Suggested change:
- local_parallel_code += "\tprint('{}'.format(command))\n"
+ local_parallel_code += "\tprint('{}'.format(' '.join(command)))\n"
If you want to keep this at all, it should be formatted just as a command, not as a list of strings. I don't think it's useful, though, and I would just remove this line.
|
@jamilgafur, I think we're getting close to having this working! To test, I did the following (much simpler than before!) in a test branch that I will throw away once I'm done. Then, I built the code as normal and set up both serial and 4-node versions of the nightly regression test suite. Let me know if you have trouble following the same process. I added the prerequisite tag to the test case.
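The exact commands were not preserved here, but judging from the remote naming used earlier in the thread, the simpler process was presumably something like (the branch name test_parallel is a placeholder):
git fetch jamilgafur/MPAS-Model
git checkout -b test_parallel jamilgafur/MPAS-Model/parallel_processing_branch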
|
This PR will not be merged here. I have made a branch on my COMPASS fork with a version of this branch appropriate for that repo. When the time is right, we can revive this work on that repo: |
Added parallel capability to manage_regression_suite for regression tests.