Add functions for running code once per node #2992
Draft
We still have not added the capability for users to choose whether to run certain code (like saving) only once (main proc) vs. once per node (local_rank = 0), leading to issues like #2983.
This PR begins to solve this by adding code for running once per node, but a fuller solution requires some mechanism that lets users make this choice for things like checkpointing. However, I'm not certain of the best path forward here. One option is an environment variable like `SB_SAVE_ON_EVERY_NODE`, or something more pithy, that controls whether `run_on_main` and related functions get run on `RANK=0` vs. `LOCAL_RANK=0`. The limitation here is that this is all-or-nothing: you can't choose some functions to run once and others to run once per node without some serious jerry-rigging of the environment variable. On the other hand, a "once_per_node" flag argument could be added everywhere relevant (e.g. checkpointer) to control the behavior, but this solution is much more involved.

Any thoughts? Is the env var good enough for us for now?
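
To make the two options concrete, here is a rough sketch of how they could coexist. It is not the code in this PR: `should_run` and `run_once` are hypothetical names, the `"1"` convention for the env var is an assumption, and it only reads the `RANK` / `LOCAL_RANK` variables that launchers such as `torchrun` set. It also leaves out any barrier or synchronization between processes.

```python
import os


def should_run(once_per_node=None):
    """Decide whether the current process should execute a "run once" block.

    once_per_node=None  -> fall back to the SB_SAVE_ON_EVERY_NODE env var
                           (assumed convention: "1" means once per node).
    once_per_node=True  -> run on the process with LOCAL_RANK == 0 on each node.
    once_per_node=False -> run only on the global main process (RANK == 0).
    """
    if once_per_node is None:
        once_per_node = os.environ.get("SB_SAVE_ON_EVERY_NODE", "0") == "1"
    rank_var = "LOCAL_RANK" if once_per_node else "RANK"
    # A missing variable is treated as rank 0 (non-distributed run).
    return int(os.environ.get(rank_var, "0")) == 0


def run_once(func, *args, once_per_node=None, **kwargs):
    """Call func only on the selected process; return None elsewhere.

    Hypothetical helper: no barrier is issued, so callers that need the
    other processes to wait must synchronize separately.
    """
    if should_run(once_per_node):
        return func(*args, **kwargs)
    return None
```

With this shape, the env var acts as the global default, while an explicit `once_per_node=True` or `once_per_node=False` at a call site overrides it for that function only, which would sidestep the all-or-nothing limitation at the cost of threading the argument through callers.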