
feat: distributed BFS via torchrun (NCCL), ~5x on S_13 #188

Open

TryDotAtwo wants to merge 16 commits into cayleypy:main from TryDotAtwo:feature/bfs-torchrun-distributed

Conversation

@TryDotAtwo

Summary

Adds a multi-process BFS path using torch.distributed (NCCL), intended for launches with torchrun (one process per GPU). CayleyGraph.bfs() routes to this implementation whenever the process-group environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are set and WORLD_SIZE > 1, rather than only when num_gpus > 1.

Reported performance: ~5x speedup for S₁₃ (Coxeter / symmetric-group style workloads) vs the previous single-process baseline in our tests.


What changed

cayley_graph.py

  • bfs() calls BfsDistributed.bfs() when:
    • BfsDistributed._use_torchrun_backend() is true (RANK / WORLD_SIZE / LOCAL_RANK set and WORLD_SIZE > 1), or
    • legacy multi-GPU: num_gpus > 1 (unchanged);
      in either case, return_all_edges / disable_batching still force the classic BfsAlgorithm path.

So a plain script under torchrun with one GPU per process no longer misses the distributed BFS entry point.
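
For reference, a minimal sketch of the environment detection this routing relies on; the helper names come from algo/bfs_distributed.py in this PR, while the bodies shown here are illustrative assumptions:

import os

def _is_torchrun_env() -> bool:
    # torchrun / torch.distributed.run export these for every worker process.
    return all(k in os.environ for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))

def _use_torchrun_backend() -> bool:
    # Use the multi-process path only when there is more than one worker.
    return _is_torchrun_env() and int(os.environ["WORLD_SIZE"]) > 1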

algo/bfs_distributed.py

  • _use_torchrun_backend() / _is_torchrun_env() — detect the torchrun / torch.distributed.run environment.
  • _bfs_torchrun — distributed BFS:
    • Sharding: each rank owns states with hash % world_size == rank (after exchanges).
    • Communication: all_to_all_single for counts and payloads (_exchange_by_owner), not a dense per-rank matrix of send buffers; a sketch of this exchange follows this list.
    • Batch loop: num_local_batches is reduced with all_reduce(MAX) so every rank runs the same number of batch steps per layer; ranks with no local frontier in a batch still participate with empty tensors — avoids mismatched collective counts / deadlocks.
    • Global sizes / stop: all_reduce for layer sizes and boolean stop flags; all_gather_object to assemble full layers when storing them or evaluating stop_condition.
  • _bfs_single_process — previous single-process multi-GPU path kept as-is (refactored out of bfs()).
  • _encode_states_to_device — encode start states on the rank’s CUDA device so encoding does not depend on a global graph.device when multiple GPUs are visible.
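
A rough sketch of the owner exchange and the per-layer batch-count synchronization described above. This is illustrative only: _exchange_by_owner and the MAX-reduce of num_local_batches are named in this PR, but the signatures, the 2D integer state layout, and the helper name synced_num_batches below are assumptions.

import torch
import torch.distributed as dist

def exchange_by_owner(states: torch.Tensor, hashes: torch.Tensor) -> torch.Tensor:
    # Route each state to its owning rank: hash % world_size == rank.
    world_size = dist.get_world_size()
    owner = hashes % world_size
    order = torch.argsort(owner)  # group rows by destination rank
    states = states[order]
    send_counts = torch.bincount(owner, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    # Exchange row counts first so each rank can size its receive buffer.
    dist.all_to_all_single(recv_counts, send_counts)
    recv = torch.empty(
        (int(recv_counts.sum().item()), states.shape[1]),
        dtype=states.dtype, device=states.device,
    )
    # Variable-size payload exchange; no dense per-rank matrix of send buffers.
    dist.all_to_all_single(
        recv, states,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv

def synced_num_batches(num_local_batches: int, device: torch.device) -> int:
    # Every rank must run the same number of batch steps per layer, otherwise
    # collectives would be called a mismatched number of times and deadlock.
    t = torch.tensor([num_local_batches], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return int(t.item())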

How to use

Prerequisites

  • CUDA and NCCL (typical Linux + NVIDIA; Windows NCCL support varies).
  • Install cayleypy from this branch / fork.

Launch

torchrun --nproc_per_node=4 your_script.py

Or multi-node (see PyTorch elastic launch).

Each process should see one logical GPU: either launch via torchrun with CUDA_VISIBLE_DEVICES set per process so that device_count is 1, or pass specific_devices=[int(os.environ["LOCAL_RANK"])] when constructing CayleyGraph.

Minimal pattern

import os
from cayleypy import CayleyGraph, PermutationGroups

# torchrun sets LOCAL_RANK for each worker process.
local_rank = int(os.environ["LOCAL_RANK"])
n = 13  # e.g. S_13, as in the reported benchmark
graph = CayleyGraph(
    PermutationGroups.coxeter(n),
    device="cuda",
    specific_devices=[local_rank],  # pin this process to its own GPU
)
result = graph.bfs()
if int(os.environ.get("RANK", "0")) == 0:
    print(result.layer_sizes)

torchrun sets RANK, WORLD_SIZE, LOCAL_RANK; the library calls init_process_group(backend="nccl") on first use inside the distributed BFS path.
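
For context, the lazy initialization might look roughly like the following sketch (_ensure_dist_initialized is listed among this PR's helpers; the body shown here is an assumption):

import os
import torch
import torch.distributed as dist

def _ensure_dist_initialized() -> None:
    # Initialize NCCL once, using the env:// rendezvous that torchrun provides
    # (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
    if dist.is_available() and not dist.is_initialized():
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))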

When the new path is not used

  • WORLD_SIZE == 1 (single process) → existing single-process BFS.
  • Missing RANK / WORLD_SIZE / LOCAL_RANK → no torchrun backend.
  • return_all_edges / disable_batching in bfs() kwargs → same as before (classic algorithm).

Testing / review notes

  • Please run the existing GPU tests plus a short smoke test with torchrun --nproc_per_node=2 on a machine with 2+ GPUs.
  • Large layers use all_gather_object (CPU pickling); for huge layers this may be a memory/latency tradeoff — follow-up optimizations possible.

Related

  • Closes nothing automatically (no issue linked); opened for feature review and merge into main.

- Route CayleyGraph.bfs to BfsDistributed when torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
  and WORLD_SIZE > 1, not only when num_gpus > 1, so one GPU per process can use the
  multi-process path.
- Add _bfs_torchrun: shard by hash % world_size, exchange with all_to_all_single.
- Synchronize per-layer batch count across ranks (global max) so collectives are not
  called a mismatched number of times per rank.
- Gather layers with all_gather_object when storing layers / stop_condition / hashes.
- Helpers: _ensure_dist_initialized, _exchange_by_owner, _global_layer_size, etc.

Observed ~5x speedup for S_13 vs single-process baseline in internal testing.

Made-with: Cursor
Refactor duplicate removal logic in states to improve clarity and efficiency.