
[RFC] Add Windows support to torch.distributed package #42095

@gunandrose4u

Description


🚀 Feature

Enable the torch.distributed package on the Windows platform. This feature is only a first step, so a limited set of features is supported compared to the Linux platform:

  1. The broadcast and all_reduce functions are supported on CPU and GPU
  2. Only the Gloo backend is supported; for Gloo, only the libuv transport is supported at first
  3. torch.nn.parallel.DistributedDataParallel() is supported
  4. Only the shared file-system init_method is supported (see the sketch below)
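
As a rough illustration of this subset, here is a minimal sketch (not part of the proposal itself, assuming a CPU-only worker and a hypothetical shared file path) showing how a process would initialize with the Gloo backend and the file-based rendezvous, wrap a module in DistributedDataParallel, and call the two supported collectives:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def init_cpu_worker(rank, world_size, shared_file):
        # Gloo is the only backend in this first step, and the shared
        # file-system rendezvous is the only supported init_method.
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{shared_file}",  # hypothetical shared path
            rank=rank,
            world_size=world_size,
        )
        # For a CPU module, DDP is constructed without device_ids.
        ddp_model = DDP(nn.Linear(10, 10))
        # broadcast and all_reduce are the two supported collectives.
        tensor = torch.ones(3)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        dist.broadcast(tensor, src=0)
        return ddp_model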

Motivation

This RFC is a refined version of #37068.

As users are continually asking for torch.distributed support on the Windows platform, we want to enable the basic features of the distributed package on Windows to unblock them.

    import os
    import torch
    import torch.distributed as dist
    from torch.multiprocessing import Process
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run_ddp(rank, world_size):
        # create a local model; this requires one CUDA device per rank
        # (for a CPU-only run, drop .to(rank) and device_ids below)
        model = nn.Linear(10, 10).to(rank)
        # construct the DDP model
        ddp_model = DDP(model, device_ids=[rank])
        # define loss function and optimizer
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        # forward pass
        outputs = ddp_model(torch.randn(20, 10).to(rank))
        labels = torch.randn(20, 10).to(rank)
        # backward pass
        loss_fn(outputs, labels).backward()
        # update parameters
        optimizer.step()

    def all_reduce(rank, size):
        """Sum a tensor across the ranks of a new group."""
        group = dist.new_group([0, 1])
        tensor = torch.ones(3)
        print('{} : Before all_reduce: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        print('{} : After all_reduce: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))

    def broadcast(rank, size):
        """Broadcast a tensor from rank 0 to the other ranks of a new group."""
        group = dist.new_group([0, 1])
        if rank == 0:
            tensor = torch.zeros(3)
        else:
            tensor = torch.ones(3)
        print('{} : Before broadcasting: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))
        dist.broadcast(tensor, src=0, group=group)
        print('{} : After broadcasting: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))

    def init_process(rank, size, fn, backend='gloo'):
        """Initialize the distributed environment, then run fn."""
        dist.init_process_group(backend, init_method=r"file://filesharemachine/pytorchtmp/test.log", rank=rank, world_size=size)
        print(f"rank:{rank}")
        fn(rank, size)

    def test():
        size = 2
        processes = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run_ddp))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

    if __name__ == "__main__":
        print(dist.is_available())
        test()

Projects and modules that will be touched

Gloo

  1. Add MSVC detection to determine whether we are compiling for the Windows platform
  2. Force libuv as the transport layer
  3. Exclude unsupported source directories: mpi, nccl, transport/tcp, transport/ibverbs
  4. Disable compile options such as USE_REDIS_DEFAULT, USE_IBVERBS_DEFAULT, etc.
  5. Add the libuv library (built for the Windows platform)
  6. Add a new example that shows the all_reduce function with libuv as the transport, since the current example uses TCP as the transport
  7. Port unsupported Gloo source code to MSVC
  8. Port unsupported gloo/cuda* source code to MSVC
  9. Port the test cases and benchmarks related to the all_reduce and broadcast functions to MSVC (a Python-level sketch follows this list)
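
Item 9 refers to Gloo's own C++ tests and benchmarks. Purely as an illustration of what such a benchmark measures, a hypothetical Python-level counterpart (assuming a process group has already been initialized, as in the example above) could look like:

    import time
    import torch
    import torch.distributed as dist

    def time_collectives(numel=1_000_000, iters=20):
        # Assumes dist.init_process_group(...) has already been called.
        tensor = torch.ones(numel)
        # Warm up the transport before timing.
        dist.all_reduce(tensor)
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        print("all_reduce avg: {:.3f} ms".format(
            (time.perf_counter() - start) * 1000 / iters))
        start = time.perf_counter()
        for _ in range(iters):
            dist.broadcast(tensor, src=0)
        print("broadcast avg: {:.3f} ms".format(
            (time.perf_counter() - start) * 1000 / iters))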

Torch

  1. Add Windows as one of the systems supported by distributed
  2. Force Gloo when distributed is set ON, and disable USE_MPI and USE_TENSORPIPE (see the availability check after this list)
  3. Enable the third_party/gloo build when distributed is set ON and MSVC is detected
  4. Port the related test cases to MSVC
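
Under these build settings, the expectation (an assumption of this RFC, and assuming the standard availability helpers behave the same as on Linux) is that a Windows build reports only Gloo as available:

    import torch.distributed as dist

    # Expected results on a Windows build with the settings above
    # (assumed, not verified output):
    print(dist.is_available())        # True:  distributed is compiled in
    print(dist.is_gloo_available())   # True:  Gloo is the forced backend
    print(dist.is_mpi_available())    # False: USE_MPI is disabled
    print(dist.is_nccl_available())   # False: NCCL is not part of this step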

Torch\lib\c10d

  1. Add the c10d source folder when distributed is set ON
  2. Exclude the HashStore, ProcessGroupRoundRobin, and TcpStore source files from the c10d project (see the FileStore sketch after this list)
  3. Do NOT export the unsupported header files for HashStore and TcpStore
  4. Disable tcputils
  5. Enable GlooDeviceFactory for the Windows platform
  6. Port unsupported source code to MSVC
  7. Port the related test cases to MSVC
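
Since HashStore and TcpStore are excluded, FileStore is the store that backs the shared file-system rendezvous. A minimal sketch (with a hypothetical UNC path) of using it directly through the store-based init:

    import torch.distributed as dist

    def init_with_filestore(rank, world_size):
        # Hypothetical UNC path reachable by every participating process.
        store = dist.FileStore(r"\\server\share\pytorch_filestore", world_size)
        # Store-based init; blocks until all ranks have joined.
        dist.init_process_group("gloo", store=store,
                                rank=rank, world_size=world_size)
        # The store also doubles as a small key/value channel between ranks.
        store.set(f"rank{rank}", "ready")
        print(store.get("rank0"))  # b'ready' once rank 0 has checked in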

Torch\csrc\distributed

  1. Decouple TensorPipe from RPC
  2. Add the Python distributed sources to the Torch Python sources
  3. Disable HashStore, ProcessGroupRoundRobin, and TcpStore
  4. Enable the ProcessGroupGloo create-default-device function
  5. Disable the TensorPipe agent
  6. Disable TcpStore in rendezvous (see the sketch after this list)
  7. Port unsupported source code to MSVC
  8. Port the related test cases to MSVC
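
With TcpStore removed from rendezvous, only the file:// handler is expected to remain usable on Windows in this first step; tcp:// and env:// both rely on TcpStore. A small hypothetical helper illustrating the choice:

    import sys

    def pick_init_method(shared_file):
        # Hypothetical helper: on Windows, only the shared file-system
        # rendezvous is available in this first step; tcp:// and env://
        # depend on TcpStore and are disabled.
        if sys.platform == "win32":
            return f"file://{shared_file}"
        return "env://"

    # e.g. dist.init_process_group("gloo", init_method=pick_init_method(path),
    #                              rank=rank, world_size=world_size)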

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @smartcat2010 @mszhanyi
