
[RFC] Add Windows support to torch.distributed package #42095

@gunandrose4u

Description


🚀 Feature

Enable the torch.distributed package on the Windows platform. This feature is only a first step, so a limited set of features is supported compared to the Linux platform:

  1. The broadcast and all_reduce functions are supported on CPU and GPU
  2. Only the Gloo backend is supported; for Gloo, only the libuv transport is supported at first
  3. torch.nn.parallel.DistributedDataParallel() is supported
  4. Only the shared file-system init_method is supported (see the sketch below)
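
As a rough illustration of this subset, here is a minimal sketch (not part of the proposal itself, assuming a CPU-only worker and a hypothetical shared file path) showing how a process would initialize with the Gloo backend and the file-based rendezvous, wrap a module in DistributedDataParallel, and call the two supported collectives:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def init_cpu_worker(rank, world_size, shared_file):
        # Gloo is the only backend in this first step, and the shared
        # file-system rendezvous is the only supported init_method.
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{shared_file}",  # hypothetical shared path
            rank=rank,
            world_size=world_size,
        )
        # For a CPU module, DDP is constructed without device_ids.
        ddp_model = DDP(nn.Linear(10, 10))
        # broadcast and all_reduce are the two supported collectives.
        tensor = torch.ones(3)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        dist.broadcast(tensor, src=0)
        return ddp_model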

Motivation

This RFC is a refined version of #37068.

As users are continually asking for torch.distributed support on the Windows platform, we want to enable the basic features of the distributed package on Windows to unblock them.

    import os
    import torch
    import torch.distributed as dist
    from torch.multiprocessing import Process
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run_ddp(rank, world_size):
        # create a local model; this requires one CUDA device per rank
        # (for a CPU-only run, drop .to(rank) and device_ids below)
        model = nn.Linear(10, 10).to(rank)
        # construct the DDP model
        ddp_model = DDP(model, device_ids=[rank])
        # define loss function and optimizer
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        # forward pass
        outputs = ddp_model(torch.randn(20, 10).to(rank))
        labels = torch.randn(20, 10).to(rank)
        # backward pass
        loss_fn(outputs, labels).backward()
        # update parameters
        optimizer.step()

    def all_reduce(rank, size):
        """Sum a tensor across the ranks of a new group."""
        group = dist.new_group([0, 1])
        tensor = torch.ones(3)
        print('{} : Before all_reduce: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        print('{} : After all_reduce: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))

    def broadcast(rank, size):
        """Broadcast a tensor from rank 0 to the other ranks of a new group."""
        group = dist.new_group([0, 1])
        if rank == 0:
            tensor = torch.zeros(3)
        else:
            tensor = torch.ones(3)
        print('{} : Before broadcasting: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))
        dist.broadcast(tensor, src=0, group=group)
        print('{} : After broadcasting: Rank {} has data {}'.format(
            os.getpid(), rank, tensor))

    def init_process(rank, size, fn, backend='gloo'):
        """Initialize the distributed environment, then run fn."""
        dist.init_process_group(backend, init_method=r"file://filesharemachine/pytorchtmp/test.log", rank=rank, world_size=size)
        print(f"rank:{rank}")
        fn(rank, size)

    def test():
        size = 2
        processes = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run_ddp))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

    if __name__ == "__main__":
        print(dist.is_available())
        test()

Projects and modules that will be touched

Gloo

  1. Add MSVC detection to determine whether we are compiling for the Windows platform
  2. Force libuv as the transport layer
  3. Exclude unsupported source directories: mpi, nccl, transport/tcp, transport/ibverbs
  4. Disable compile options such as USE_REDIS_DEFAULT, USE_IBVERBS_DEFAULT, etc.
  5. Add the libuv library (built for the Windows platform)
  6. Add a new example that shows the all_reduce function with libuv as the transport, since the current example uses TCP as the transport
  7. Port unsupported Gloo source code to MSVC
  8. Port unsupported gloo/cuda* source code to MSVC
  9. Port the test cases and benchmarks related to the all_reduce and broadcast functions to MSVC (a Python-level sketch follows this list)
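
Item 9 refers to Gloo's own C++ tests and benchmarks. Purely as an illustration of what such a benchmark measures, a hypothetical Python-level counterpart (assuming a process group has already been initialized, as in the example above) could look like:

    import time
    import torch
    import torch.distributed as dist

    def time_collectives(numel=1_000_000, iters=20):
        # Assumes dist.init_process_group(...) has already been called.
        tensor = torch.ones(numel)
        # Warm up the transport before timing.
        dist.all_reduce(tensor)
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        print("all_reduce avg: {:.3f} ms".format(
            (time.perf_counter() - start) * 1000 / iters))
        start = time.perf_counter()
        for _ in range(iters):
            dist.broadcast(tensor, src=0)
        print("broadcast avg: {:.3f} ms".format(
            (time.perf_counter() - start) * 1000 / iters))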

Torch

  1. Add Windows as one of the systems supported by distributed
  2. Force Gloo when distributed is set ON, and disable USE_MPI and USE_TENSORPIPE (see the availability check after this list)
  3. Enable the third_party/gloo build when distributed is set ON and MSVC is detected
  4. Port the related test cases to MSVC
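
Under these build settings, the expectation (an assumption of this RFC, and assuming the standard availability helpers behave the same as on Linux) is that a Windows build reports only Gloo as available:

    import torch.distributed as dist

    # Expected results on a Windows build with the settings above
    # (assumed, not verified output):
    print(dist.is_available())        # True:  distributed is compiled in
    print(dist.is_gloo_available())   # True:  Gloo is the forced backend
    print(dist.is_mpi_available())    # False: USE_MPI is disabled
    print(dist.is_nccl_available())   # False: NCCL is not part of this step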

Torch\lib\c10d

  1. Add the c10d source folder when distributed is set ON
  2. Exclude the HashStore, ProcessGroupRoundRobin, and TcpStore source files from the c10d project (see the FileStore sketch after this list)
  3. Do NOT export the unsupported header files for HashStore and TcpStore
  4. Disable tcputils
  5. Enable GlooDeviceFactory for the Windows platform
  6. Port unsupported source code to MSVC
  7. Port the related test cases to MSVC
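
Since HashStore and TcpStore are excluded, FileStore is the store that backs the shared file-system rendezvous. A minimal sketch (with a hypothetical UNC path) of using it directly through the store-based init:

    import torch.distributed as dist

    def init_with_filestore(rank, world_size):
        # Hypothetical UNC path reachable by every participating process.
        store = dist.FileStore(r"\\server\share\pytorch_filestore", world_size)
        # Store-based init; blocks until all ranks have joined.
        dist.init_process_group("gloo", store=store,
                                rank=rank, world_size=world_size)
        # The store also doubles as a small key/value channel between ranks.
        store.set(f"rank{rank}", "ready")
        print(store.get("rank0"))  # b'ready' once rank 0 has checked in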

Torch\csrc\distributed

  1. Decouple TensorPipe from RPC
  2. Add the Python distributed sources to the Torch Python sources
  3. Disable HashStore, ProcessGroupRoundRobin, and TcpStore
  4. Enable the ProcessGroupGloo create-default-device function
  5. Disable the TensorPipe agent
  6. Disable TcpStore in rendezvous (see the sketch after this list)
  7. Port unsupported source code to MSVC
  8. Port the related test cases to MSVC
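
With TcpStore removed from rendezvous, only the file:// handler is expected to remain usable on Windows in this first step; tcp:// and env:// both rely on TcpStore. A small hypothetical helper illustrating the choice:

    import sys

    def pick_init_method(shared_file):
        # Hypothetical helper: on Windows, only the shared file-system
        # rendezvous is available in this first step; tcp:// and env://
        # depend on TcpStore and are disabled.
        if sys.platform == "win32":
            return f"file://{shared_file}"
        return "env://"

    # e.g. dist.init_process_group("gloo", init_method=pick_init_method(path),
    #                              rank=rank, world_size=world_size)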

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @smartcat2010 @mszhanyi
