@pplantinga
Collaborator

Users still cannot choose whether certain code (such as saving) runs only once globally (on the main process) or once per node (on each process with local_rank = 0), which leads to issues like #2983.

This PR begins to address this by adding code for running once per node, but a fuller solution needs some mechanism that lets users make this choice for things like checkpointing. I'm not certain of the best path forward here. One option is an environment variable, such as SB_SAVE_ON_EVERY_NODE or something more pithy, that controls whether run_on_main and related functions run on RANK=0 or LOCAL_RANK=0. The limitation is that this is all-or-nothing: you can't choose some functions to run once and others to run once per node without some serious jury-rigging of the environment variable. Alternatively, a "once_per_node" flag argument could be added everywhere relevant (e.g. the checkpointer) to control the behavior, but that solution is much more involved.
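To make the trade-off concrete, here is a rough sketch of how the two options could coexist: the environment variable sets the default, and an optional per-call flag overrides it. The helper names `should_run` and `run_once` are hypothetical and are not part of this PR or the existing API. The sketch assumes the launcher sets RANK and LOCAL_RANK (as torchrun does), treats the value "1" as enabling SB_SAVE_ON_EVERY_NODE, and leaves out any synchronization barriers.

```python
import os


def should_run(once_per_node=None):
    """Return True if this process should execute a run-once function.

    Hypothetical helper. If `once_per_node` is None, fall back to the
    SB_SAVE_ON_EVERY_NODE environment variable (assumed "1" = enabled).
    """
    if once_per_node is None:
        once_per_node = os.environ.get("SB_SAVE_ON_EVERY_NODE", "0") == "1"
    # LOCAL_RANK == 0 selects one process per node; RANK == 0 selects one globally.
    rank_key = "LOCAL_RANK" if once_per_node else "RANK"
    # Default to 0 so a non-distributed run always executes the function.
    return int(os.environ.get(rank_key, "0")) == 0


def run_once(func, *args, once_per_node=None, **kwargs):
    """Call `func` only on the selected process(es); other processes get None."""
    if should_run(once_per_node):
        return func(*args, **kwargs)
    return None
```

With something like this, the env var alone would cover the simple case, while a component such as the checkpointer could pass `once_per_node=True` explicitly if it ever needs to differ from the global default.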

Any thoughts? Is the env var good enough for us for now?

@pplantinga pplantinga self-assigned this Nov 1, 2025