
feat: distributed BFS via torchrun (NCCL), ~5x on S_13 #188

Open

TryDotAtwo wants to merge 16 commits into cayleypy:main from TryDotAtwo:feature/bfs-torchrun-distributed

Conversation

@TryDotAtwo

Summary

Adds a multi-process BFS path using torch.distributed (NCCL), intended for launches with torchrun (one process per GPU). CayleyGraph.bfs() routes to this implementation whenever the process-group environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are set and WORLD_SIZE > 1, rather than only when num_gpus > 1.

Reported performance: ~5x speedup for S₁₃ (Coxeter / symmetric-group style workloads) vs the previous single-process baseline in our tests.


What changed

cayley_graph.py

  • bfs() calls BfsDistributed.bfs() when:
    • BfsDistributed._use_torchrun_backend() is true (RANK / WORLD_SIZE / LOCAL_RANK set and WORLD_SIZE > 1), or
    • legacy multi-GPU: num_gpus > 1 (unchanged);
      in either case, return_all_edges / disable_batching still force the classic BfsAlgorithm path.

So a plain script under torchrun with one GPU per process no longer misses the distributed BFS entry point.
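
For reference, a minimal sketch of the environment detection this routing relies on; the helper names come from algo/bfs_distributed.py in this PR, while the bodies shown here are illustrative assumptions:

import os

def _is_torchrun_env() -> bool:
    # torchrun / torch.distributed.run export these for every worker process.
    return all(k in os.environ for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))

def _use_torchrun_backend() -> bool:
    # Use the multi-process path only when there is more than one worker.
    return _is_torchrun_env() and int(os.environ["WORLD_SIZE"]) > 1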

algo/bfs_distributed.py

  • _use_torchrun_backend() / _is_torchrun_env() — detect the torchrun / torch.distributed.run environment.
  • _bfs_torchrun — distributed BFS:
    • Sharding: each rank owns states with hash % world_size == rank (after exchanges).
    • Communication: all_to_all_single for counts and payloads (_exchange_by_owner), not a dense per-rank matrix of send buffers; a sketch of this exchange follows this list.
    • Batch loop: num_local_batches is reduced with all_reduce(MAX) so every rank runs the same number of batch steps per layer; ranks with no local frontier in a batch still participate with empty tensors — avoids mismatched collective counts / deadlocks.
    • Global sizes / stop: all_reduce for layer sizes and boolean stop flags; all_gather_object to assemble full layers when storing them or evaluating stop_condition.
  • _bfs_single_process — previous single-process multi-GPU path kept as-is (refactored out of bfs()).
  • _encode_states_to_device — encode start states on the rank’s CUDA device so encoding does not depend on a global graph.device when multiple GPUs are visible.
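
A rough sketch of the owner exchange and the per-layer batch-count synchronization described above. This is illustrative only: _exchange_by_owner and the MAX-reduce of num_local_batches are named in this PR, but the signatures, the 2D integer state layout, and the helper name synced_num_batches below are assumptions.

import torch
import torch.distributed as dist

def exchange_by_owner(states: torch.Tensor, hashes: torch.Tensor) -> torch.Tensor:
    # Route each state to its owning rank: hash % world_size == rank.
    world_size = dist.get_world_size()
    owner = hashes % world_size
    order = torch.argsort(owner)  # group rows by destination rank
    states = states[order]
    send_counts = torch.bincount(owner, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    # Exchange row counts first so each rank can size its receive buffer.
    dist.all_to_all_single(recv_counts, send_counts)
    recv = torch.empty(
        (int(recv_counts.sum().item()), states.shape[1]),
        dtype=states.dtype, device=states.device,
    )
    # Variable-size payload exchange; no dense per-rank matrix of send buffers.
    dist.all_to_all_single(
        recv, states,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv

def synced_num_batches(num_local_batches: int, device: torch.device) -> int:
    # Every rank must run the same number of batch steps per layer, otherwise
    # collectives would be called a mismatched number of times and deadlock.
    t = torch.tensor([num_local_batches], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return int(t.item())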

How to use

Prerequisites

  • CUDA and NCCL (typical Linux + NVIDIA; Windows NCCL support varies).
  • Install cayleypy from this branch / fork.

Launch

torchrun --nproc_per_node=4 your_script.py

Or multi-node (see PyTorch elastic launch).

Each process should see one logical GPU: either launch via torchrun with CUDA_VISIBLE_DEVICES set per process so that device_count is 1, or pass specific_devices=[int(os.environ["LOCAL_RANK"])] when constructing CayleyGraph.

Minimal pattern

import os
from cayleypy import CayleyGraph, PermutationGroups

# torchrun sets LOCAL_RANK for each worker process.
local_rank = int(os.environ["LOCAL_RANK"])
n = 13  # e.g. S_13, as in the reported benchmark
graph = CayleyGraph(
    PermutationGroups.coxeter(n),
    device="cuda",
    specific_devices=[local_rank],  # pin this process to its own GPU
)
result = graph.bfs()
if int(os.environ.get("RANK", "0")) == 0:
    print(result.layer_sizes)

torchrun sets RANK, WORLD_SIZE, LOCAL_RANK; the library calls init_process_group(backend="nccl") on first use inside the distributed BFS path.
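
For context, the lazy initialization might look roughly like the following sketch (_ensure_dist_initialized is listed among this PR's helpers; the body shown here is an assumption):

import os
import torch
import torch.distributed as dist

def _ensure_dist_initialized() -> None:
    # Initialize NCCL once, using the env:// rendezvous that torchrun provides
    # (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
    if dist.is_available() and not dist.is_initialized():
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))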

When the new path is not used

  • WORLD_SIZE == 1 (single process) → existing single-process BFS.
  • Missing RANK / WORLD_SIZE / LOCAL_RANK → no torchrun backend.
  • return_all_edges / disable_batching in bfs() kwargs → same as before (classic algorithm).

Testing / review notes

  • Please run the existing GPU tests plus a short smoke test with torchrun --nproc_per_node=2 on a machine with 2+ GPUs.
  • Large layers use all_gather_object (CPU pickling); for huge layers this may be a memory/latency tradeoff — follow-up optimizations possible.

Related

  • Closes nothing automatically (no issue linked); opened for feature review and merge into main.

- Route CayleyGraph.bfs to BfsDistributed when torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
  and WORLD_SIZE > 1, not only when num_gpus > 1, so one GPU per process can use the
  multi-process path.
- Add _bfs_torchrun: shard by hash % world_size, exchange with all_to_all_single.
- Synchronize per-layer batch count across ranks (global max) so collectives are not
  called a mismatched number of times per rank.
- Gather layers with all_gather_object when storing layers / stop_condition / hashes.
- Helpers: _ensure_dist_initialized, _exchange_by_owner, _global_layer_size, etc.

Observed ~5x speedup for S_13 vs single-process baseline in internal testing.

Made-with: Cursor
Refactor duplicate removal logic in states to improve clarity and efficiency.