feat: distributed BFS via torchrun (NCCL), ~5x on S_13 #188
Open
TryDotAtwo wants to merge 16 commits into
Conversation
- Route `CayleyGraph.bfs` to `BfsDistributed` when torchrun sets `RANK`/`WORLD_SIZE`/`LOCAL_RANK` and `WORLD_SIZE > 1`, not only when `num_gpus > 1`, so one GPU per process can use the multi-process path.
- Add `_bfs_torchrun`: shard by `hash % world_size`, exchange with `all_to_all_single`.
- Synchronize per-layer batch count across ranks (global max) so collectives are not called a mismatched number of times per rank.
- Gather layers with `all_gather_object` when storing layers / `stop_condition` / hashes.
- Helpers: `_ensure_dist_initialized`, `_exchange_by_owner`, `_global_layer_size`, etc.

Observed ~5x speedup for S_13 vs single-process baseline in internal testing.

Made-with: Cursor
Refactor duplicate removal logic in states to improve clarity and efficiency.
Summary
Adds a multi-process BFS path using `torch.distributed` (NCCL), intended for launches with `torchrun` (one process per GPU). When the process-group environment is present (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`) and `WORLD_SIZE > 1`, `CayleyGraph.bfs()` routes to this implementation instead of only when `num_gpus > 1`.

Reported performance: ~5× speedup for S₁₃ (Coxeter / symmetric-group style workloads) vs the previous single-process baseline in our tests.
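The routing condition can be pictured as below. This is a hedged sketch of the check the description names (`_use_torchrun_backend` / `_is_torchrun_env`), not the PR's exact code:

```python
import os

def is_torchrun_env() -> bool:
    # torchrun / torch.distributed.run exports these for every worker process.
    return all(k in os.environ for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))

def use_torchrun_backend() -> bool:
    # Route to the distributed BFS only when there is more than one process.
    return is_torchrun_env() and int(os.environ["WORLD_SIZE"]) > 1
```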
What changed
cayley_graph.py

- `bfs()` calls `BfsDistributed.bfs()` when:
  - `BfsDistributed._use_torchrun_backend()` is true (`RANK`/`WORLD_SIZE`/`LOCAL_RANK` set and `WORLD_SIZE > 1`), or
  - `num_gpus > 1` (unchanged),

  unless `return_all_edges` / `disable_batching` force the classic `BfsAlgorithm` path.
- So a plain script under `torchrun` with one GPU per process no longer misses the distributed BFS entry point.

algo/bfs_distributed.py

- `_use_torchrun_backend()` / `_is_torchrun_env()`: detect the torchrun / `torch.distributed.run` environment.
- `_bfs_torchrun`: the distributed BFS (see the sketch after this list):
  - Each state is owned by the rank where `hash % world_size == rank` (after exchanges).
  - `all_to_all_single` for counts and payloads (`_exchange_by_owner`), not a dense per-rank matrix of send buffers.
  - `num_local_batches` is reduced with `all_reduce(MAX)` so every rank runs the same number of batch steps per layer; ranks with no local frontier in a batch still participate with empty tensors, which avoids mismatched collective counts / deadlocks.
  - `all_reduce` for layer sizes and boolean stop flags; `all_gather_object` to build full layers on each rank when storing or evaluating `stop_condition`.
- `_bfs_single_process`: the previous single-process multi-GPU path, kept as-is (refactored out of `bfs()`).
- `_encode_states_to_device`: encode start states on the rank's CUDA device so encoding does not depend on a broad `graph.device` when multiple GPUs are visible.
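A minimal sketch of the exchange and synchronization pattern described above, to be run under torchrun with one CUDA tensor per rank (NCCL). This is not the PR's code: the real `_exchange_by_owner` moves encoded state tensors, while here each state is a non-negative int64 that serves as its own hash; helper names and tensor layout are illustrative.

```python
import torch
import torch.distributed as dist

def exchange_by_owner(states: torch.Tensor) -> torch.Tensor:
    """Send every state to its owning rank (hash % world_size == rank)."""
    world_size = dist.get_world_size()
    owner = states % world_size               # stand-in for hash(state) % world_size
    states = states[torch.argsort(owner)]     # group the payload by destination rank
    send_counts = torch.bincount(owner, minlength=world_size)

    # Exchange counts first so each rank can size its receive buffer exactly,
    # then exchange the payload with per-rank split sizes.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    recv = torch.empty(int(recv_counts.sum().item()),
                       dtype=states.dtype, device=states.device)
    dist.all_to_all_single(
        recv, states,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv

def synced_batch_count(num_local_batches: int, device: torch.device) -> int:
    """Global max of the per-rank batch count, so all ranks enter the same
    number of collective calls per layer (empty batches still participate)."""
    t = torch.tensor([num_local_batches], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return int(t.item())
```

Sending the counts before the payload keeps receive buffers exact-sized, which is what lets the payload exchange avoid the dense per-rank matrix of send buffers mentioned above.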
How to use

Prerequisites

- PyTorch built with distributed NCCL support and one CUDA GPU available per process.

Launch

Launch with `torchrun`, e.g. `torchrun --nproc_per_node=<gpus_per_node> your_script.py`. Or multi-node (see PyTorch elastic launch).

Each process should see one logical GPU (recommended: use `torchrun` with `CUDA_VISIBLE_DEVICES` so `device_count` is 1 per process, or pass `specific_devices=[int(os.environ["LOCAL_RANK"])]` when constructing `CayleyGraph`).

Minimal pattern
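The original snippet for this section did not survive extraction; below is a hedged reconstruction. `CayleyGraph`, `bfs()`, `specific_devices`, and `LOCAL_RANK` come from this description; the import path and the `generators` argument are assumptions about the surrounding API.

```python
import os
from cayleypy import CayleyGraph  # import path assumed; adjust to this repo's layout

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process

generators = ...  # your generator set, e.g. the transpositions generating S_13

graph = CayleyGraph(
    generators,                     # constructor arguments assumed
    specific_devices=[local_rank],  # one GPU per process, as recommended above
)

# Under torchrun with WORLD_SIZE > 1, bfs() routes to the distributed path;
# init_process_group(backend="nccl") is called lazily on first use.
result = graph.bfs()
```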
`torchrun` sets `RANK`, `WORLD_SIZE`, `LOCAL_RANK`; the library calls `init_process_group(backend="nccl")` on first use inside the distributed BFS path.
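A sketch of what that lazy initialization can look like; the description names a `_ensure_dist_initialized` helper, and its actual body may differ:

```python
import torch.distributed as dist

def ensure_dist_initialized() -> None:
    """Idempotent: safe to call at the top of every distributed BFS entry."""
    if not dist.is_initialized():
        # torchrun has already exported RANK/WORLD_SIZE/LOCAL_RANK,
        # so the default env:// rendezvous needs no extra arguments.
        dist.init_process_group(backend="nccl")
```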
When the new path is not used

- `WORLD_SIZE == 1` (single process) → existing single-process BFS.
- `RANK`/`WORLD_SIZE`/`LOCAL_RANK` not set → no torchrun backend.
- `return_all_edges` / `disable_batching` in `bfs()` kwargs → same as before (classic algorithm).

Testing / review notes
- A `torchrun --nproc_per_node=2` job on a 2+ GPU machine exercises the new path.
- Full layers are gathered with `all_gather_object` (CPU pickling); for huge layers this may be a memory/latency tradeoff, and follow-up optimizations are possible.
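For reviewers, a minimal sketch of the gather step behind that tradeoff; the function name and layer representation here are illustrative, not the PR's code:

```python
import torch.distributed as dist

def gather_full_layer(local_layer: list) -> list:
    """Collect every rank's shard of the current layer on all ranks.
    all_gather_object pickles through CPU, hence the memory/latency note above."""
    shards = [None] * dist.get_world_size()
    dist.all_gather_object(shards, local_layer)
    return [state for shard in shards for state in shard]
```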
Related

main.