
test_post_localSGD_optimizer_parity_with_hierarchical_sgd error #74995


Description

@KyleCZH

A recent PR, #74668, added a few new test cases, including:
test_post_localSGD_optimizer_parity_with_hierarchical_sgd (main.TestDistBackendWithSpawn)
test_post_localSGD_optimizer_parity_with_hierarchical_sgd_grad_is_view (main.TestDistBackendWithSpawn)

which raise errors on machines that have 4 ROCm GPUs.
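
For context, the failing tests construct a hierarchical model averager along the lines of the sketch below. This is only an illustrative sketch: the period/group-size value is inferred from the log further down (groups of 2 processes averaging every 2 iterations), not copied from the test source, and it assumes the default process group has already been initialized.

    # Illustrative sketch only -- not the actual test code.
    # Assumes torch.distributed is already initialized (default process group).
    # The period/group-size pair {2: 2} is inferred from the log below.
    from collections import OrderedDict
    import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

    period_group_size_dict = OrderedDict([(2, 2)])  # groups of 2 ranks, averaging every 2 steps
    averager = hma.HierarchicalModelAverager(
        period_group_size_dict=period_group_size_dict, warmup_steps=4
    )
    # HierarchicalModelAverager.__init__ calls torch.distributed.new_subgroups(group_size=...),
    # which requires the world size to be divisible by each group size.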

Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/5758883353
Error signature:

2022-03-30T20:15:55.5332760Z   test_post_localSGD_optimizer_parity_with_hierarchical_sgd (__main__.TestDistBackendWithSpawn) ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 739
2022-03-30T20:15:55.5375975Z INFO:torch.testing._internal.common_distributed:Started process 1 with pid 740
2022-03-30T20:15:55.5412252Z INFO:torch.testing._internal.common_distributed:Started process 2 with pid 741
2022-03-30T20:15:55.5413718Z [W Module.cpp:498] Warning: Disabling benchmark mode for MIOpen is NOT supported. Overriding value to True (function operator())
2022-03-30T20:15:56.5969442Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
2022-03-30T20:15:56.5970827Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
2022-03-30T20:15:56.6345267Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
2022-03-30T20:15:56.6347506Z INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2022-03-30T20:15:56.6382563Z INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2022-03-30T20:15:56.6384701Z INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2022-03-30T20:15:56.6395266Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0
2022-03-30T20:15:56.6396504Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2
2022-03-30T20:15:56.6398505Z INFO:torch.distributed.algorithms.model_averaging.hierarchical_model_averager:Model averaging hierarchy:
2022-03-30T20:15:56.6400921Z INFO:torch.distributed.algorithms.model_averaging.hierarchical_model_averager:	Each group that has 2 processes average parameters every 2 iterations, if no higher-level averaging.
2022-03-30T20:15:56.6402727Z INFO:torch.distributed.algorithms.model_averaging.hierarchical_model_averager:Model averaging hierarchy:
2022-03-30T20:15:56.6405006Z INFO:torch.distributed.algorithms.model_averaging.hierarchical_model_averager:	Each group that has 2 processes average parameters every 2 iterations, if no higher-level averaging.
2022-03-30T20:15:56.6445181Z ERROR:torch.testing._internal.common_distributed:Caught exception: 
2022-03-30T20:15:56.6446144Z Traceback (most recent call last):
2022-03-30T20:15:56.6447777Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
2022-03-30T20:15:56.6448806Z     getattr(self, test_name)()
2022-03-30T20:15:56.6449968Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
2022-03-30T20:15:56.6450761Z     fn()
2022-03-30T20:15:56.6451862Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
2022-03-30T20:15:56.6452687Z     return func(*args, **kwargs)
2022-03-30T20:15:56.6454130Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 4864, in test_post_localSGD_optimizer_parity_with_hierarchical_sgd
2022-03-30T20:15:56.6455309Z     period_group_size_dict=period_group_size_dict, warmup_steps=4
2022-03-30T20:15:56.6457160Z   File "/opt/conda/lib/python3.7/site-packages/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py", line 124, in __init__
2022-03-30T20:15:56.6458214Z     group_size=group_size, group=self.process_group)
2022-03-30T20:15:56.6459441Z   File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 3102, in new_subgroups
2022-03-30T20:15:56.6460621Z     raise ValueError("The world size must be divisible by 'group_size'")
2022-03-30T20:15:56.6461637Z ValueError: The world size must be divisible by 'group_size'
2022-03-30T20:15:56.6462478Z  exiting process 0 with exit code: 10
2022-03-30T20:15:56.6463256Z ERROR:torch.testing._internal.common_distributed:Caught exception: 

Command to reproduce:
BACKEND=nccl WORLD_SIZE=3 python3.7 distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_post_localSGD_optimizer_parity_with_hierarchical_sgd
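
The ValueError itself comes from the divisibility check in torch.distributed.new_subgroups: with WORLD_SIZE=3 and a subgroup size of 2, the world size is not evenly divisible. The snippet below is a standalone illustration of that condition using the values from the repro command and the log, not the PyTorch source.

    # Standalone illustration of the failing condition from the traceback
    # (distributed_c10d.py, new_subgroups). Values taken from the repro
    # command (WORLD_SIZE=3) and the log (groups of 2 processes).
    world_size = 3
    group_size = 2
    if world_size % group_size != 0:
        raise ValueError("The world size must be divisible by 'group_size'")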

@jithunnair-amd @jeffdaily

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport @KyleCZH @mruberry @vincentqb @jbschlosser

    Labels

    high priority
    module: rocm (AMD GPU support for PyTorch)
    module: tests (issues related to tests, not the torch.testing module)
    oncall: distributed (add this issue/PR to the distributed oncall triage queue)
    triage review
    triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
