-
Notifications
You must be signed in to change notification settings - Fork 282
Refactor workflow to stateless run manifest architecture and SA input design coordination #3708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…g in read.ensemble.output to fix I/O bottleneck
…g in read.sa.output to fix I/O bottleneck
| #' @param input_design DEPRECATED. Use input_design_ens and input_design_sa instead | ||
| #' @param input_design_ens Input design matrix for ensemble analysis | ||
| #' @param input_design_sa Input design matrix for sensitivity analysis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of having this be three separate arguments. What if we kept input_design, expecting a list with entries ensemble and sensitivity and also accepting a single design object (interpreting it as being for the ensemble)?
(We discussed live that it'd be nice to encode SA and ensemble setup into the design such that run.write.configs doesn't even need to know which one it's working with, but that's still well in the future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 100% agree with @infotroph that we shouldn't have multiple input_design arguments. Where I disagree is that I don't think it makes sense to have ensemble and sensitivity elements. I think we should move to a place where a single call to run.write.configs runs a single input_design. If you want to run both an ensemble projection and a SA then you should call run.write.configs twice with different input_designs (one for SA one for ensemble). In other words, I think the SA/ensemble distinction should be deprecated at this point w.r.t. run.write.configs. For example, a Sobol SA is already actually executed via the "ensemble" pathway.
| invisible(settings) | ||
| return(settings) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please don't remove terminating newlines. These days the rule is mostly just to avoid needless extra changes, but there do exist some (mostly ancient 😅 ) tools that complain if it isn't there.
|
|
||
| manifest.file <- file.path(settings$outdir, "runs_manifest.csv") | ||
|
|
||
| if (nrow(run_manifest_df) > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it OK for the manifest to be empty? If so I'd favor writing a zero-row file so it's easy to tell that afterwards; if not OK then should we throw an error here instead of skipping?
| if (overwrite && file.exists(manifest.file)) { | ||
| unlink(manifest.file) | ||
| } | ||
|
|
||
| utils::write.table(run_manifest_df, | ||
| file = manifest.file, | ||
| sep = ",", | ||
| row.names = FALSE, | ||
| col.names = !file.exists(manifest.file), | ||
| append = file.exists(manifest.file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write.table removes existing files unless append=TRUE, so this unlink isn't needed
| if (overwrite && file.exists(manifest.file)) { | |
| unlink(manifest.file) | |
| } | |
| utils::write.table(run_manifest_df, | |
| file = manifest.file, | |
| sep = ",", | |
| row.names = FALSE, | |
| col.names = !file.exists(manifest.file), | |
| append = file.exists(manifest.file)) | |
| utils::write.table(run_manifest_df, | |
| file = manifest.file, | |
| sep = ",", | |
| row.names = FALSE, | |
| col.names = !file.exists(manifest.file), | |
| append = !overwrite) |
| invisible(settings) | ||
| return(settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your fault but fixing this while I see it -- invisible() was doing nothing here and usually only makes sense as a return value
| invisible(settings) | |
| return(settings) | |
| return(invisible(settings)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should know this, but what's actually been changed in the returned settings object. I'm wondering whether the settings is still the right thing to return at this point (e.g., are we moving to a place where the manifest might be the thing a user actually wants back?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed: at least ensemble IDs and PFT outdirs (if they weren't there in the input settings).
Return value: My gut is settings is still the right thing to return. I'm saying this without checking, but I think at the moment every runModule.*() takes a MultiSettings and returns a possibly modified MultiSettings, so changing that would be an even bigger overhaul. But maybe this is an argument for putting (a copy of) the manifest into the returned settings?
| for (i in seq_len(nrow(ens.run.ids))) { | ||
|
|
||
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | ||
| run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it work to iterate over the IDs directly instead of by index?
| for (i in seq_len(nrow(ens.run.ids))) { | |
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | |
| run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i] | |
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | |
| run_ids <- ens.run.ids$run_id %||% ens.run.ids$id | |
| for (run_id in run_ids) { |
| if (nrow(subset_df) == 1) { | ||
| run.id <- subset_df$run_id | ||
| } else if (nrow(subset_df) > 1) { | ||
| PEcAn.logger::logger.warn("Multiple runs found for", trait, quantile, "- using the last one.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this happen and should it be an error instead of a warning?
| next | ||
| } | ||
|
|
||
| pass_pft <- switch(per.pft + 1, NULL, pft.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly phrased -- I see this was existing code but I wonder if I'm missing a reason it wasn't pass_pft <- if(isTRUE(per.pft)) pft.name else NULL, which I'd find easier to read.
| pass_pft <- switch(per.pft + 1, NULL, pft.name) | ||
|
|
||
| # Pass ALL variables at once to avoid repeated file opening. And call read.output once | ||
| out.tmp <- PEcAn.utils::read.output( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be for a different PR, but since this is getting converted to data frame a few lines down it might be worth benchmarking the existing approach against passing dataframe = TRUE into read.output
| ens_input_design <- design_result$X | ||
| } | ||
|
|
||
| # handle SA design - check if already provided before generating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now turning into a fair amount of code repeated in both legs of the settings/multisettings if/else. Could we do some of this unconditionally outside the if? Or define a helper function to encapsulate it?
Description
fixes #3692
run.write.configsto generate aruns_manifest.csvfile containing run metadata (run_id, site_id, pft_name, trait, quantile, type) instead of appending to samples.Rdata.fixes #3693
read.sa.outputandread.ensemble.outputto pass the full vector of requested variables to read.output in a single call.Additionally this PR separates input design generation for SA and ensemble runs, fixing dimension mismatch errors in multisite workflows.
previously both SA and ensemble used a single
input_designmatrix, causing failures when:causes in dimension mismatch errors in input design coordination.
runModule.run.write.configsgenerates separateens_input_designandsa_input_designmatrices.run.write.configsaccepts bothinput_design_ensandinput_design_saparameters.Old
input_designparameter still works with deprecation warning.Motivation and Context
Review Time Estimate
Types of changes
Checklist: