## 🚀 Feature
Enable the torch.distributed package on the Windows platform. This feature is only a first step, with a limited feature set compared to the Linux platform:
- broadcast and all_reduce supported on CPU and GPU
- Gloo backend only (see the availability sketch after this list); for Gloo, only the libuv transport is supported at first
- torch.nn.parallel.DistributedDataParallel() supported
- Shared file-system init_method only
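Not part of the RFC itself, but a minimal sketch of how this limited support would surface to users: with only Gloo built in, the other backends should report as unavailable on a Windows build. The expected values in the comments are assumptions based on the feature list above.

```python
import torch.distributed as dist

# Quick check of what the distributed package offers on this build.
print(dist.is_available())        # True when PyTorch was built with distributed support
print(dist.is_gloo_available())   # expected True on a Windows build with this RFC
print(dist.is_mpi_available())    # expected False, since MPI is disabled on Windows
print(dist.is_nccl_available())   # expected False on Windows in this first step
```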
## Motivation
This RFC is a refined version of #37068.
As users have been continually asking for torch.distributed support on the Windows platform, we want to enable the basic features of the distributed package on Windows to unblock them. A simple example of the intended usage:
```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.multiprocessing import Process
from torch.nn.parallel import DistributedDataParallel as DDP


def run_ddp(rank, world_size):
    # create a local model on the device for this rank
    model = nn.Linear(10, 10).to(rank)
    # construct the DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define the loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()


def all_reduce(rank, size):
    """Sum a tensor across all ranks in the group."""
    group = dist.new_group(list(range(size)))  # ranks must exist in the world
    tensor = torch.ones(3)
    print('{} : Before allreduce: Rank {} has data {}'.format(os.getpid(), rank, tensor))
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print('{} : After allreduce: Rank {} has data {}'.format(os.getpid(), rank, tensor))


def broadcast(rank, size):
    """Broadcast rank 0's tensor to the other ranks in the group."""
    group = dist.new_group([0, 1])
    if rank == 0:
        tensor = torch.zeros(3)
    else:
        tensor = torch.ones(3)
    print('{} : Before broadcasting: Rank {} has data {}'.format(os.getpid(), rank, tensor))
    dist.broadcast(tensor, src=0, group=group)
    print('{} : After broadcasting: Rank {} has data {}'.format(os.getpid(), rank, tensor))


def init_process(rank, size, fn, backend='gloo'):
    """Initialize the distributed environment via the shared file system."""
    dist.init_process_group(backend,
                            init_method=r"file://filesharemachine/pytorchtmp/test.log",
                            rank=rank, world_size=size)
    print(f"rank:{rank}")
    fn(rank, size)


def test():
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run_ddp))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()


if __name__ == "__main__":
    print(dist.is_available())
    test()
```
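As a usage note (not part of the RFC): on Windows, Python multiprocessing always uses the spawn start method, so the `if __name__ == "__main__":` guard above is required. `torch.multiprocessing.spawn` is an equivalent way to launch the workers; the snippet below is a sketch that reuses the `init_process` and `run_ddp` helpers from the example.

```python
import torch.multiprocessing as mp


def main():
    world_size = 2
    # spawn passes the rank as the first argument to init_process,
    # followed by the entries of args.
    mp.spawn(init_process, args=(world_size, run_ddp), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```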
## Projects and modules that will be touched
### Gloo
- Add MSVC detection to determine whether Gloo is being compiled for the Windows platform
- Force libuv as the transport
- Exclude the mpi, nccl, transport/tcp, and transport/ibverbs source directories
- Disable compile options such as USE_REDIS_DEFAULT, USE_IBVERBS_DEFAULT, etc.
- Add the libuv library (built for the Windows platform)
- Add a new example that demonstrates all_reduce over the libuv transport, since the current example uses TCP
- Port Gloo source code that does not yet build with MSVC
- Port gloo/cuda* source code that does not yet build with MSVC
- Port the test cases and benchmarks for all_reduce and broadcast to MSVC
### Torch
- Add Windows to the list of systems supported by the distributed package
- Force the Gloo backend when distributed is set ON, and disable USE_MPI & USE_TENSORPIPE (see the backend-selection sketch after this list)
- Enable the third_party/gloo build when distributed is set ON and MSVC is detected
- Port the related test cases to MSVC
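Not one of the work items, but a sketch of how user code might adapt once Windows is a supported platform: since only Gloo is built there, the backend can be chosen per platform. The Linux-side preference for NCCL is an assumption added for illustration.

```python
import sys

import torch.distributed as dist


def pick_backend() -> str:
    # On Windows, this RFC only builds the Gloo backend (MPI and TensorPipe are off).
    if sys.platform == "win32":
        return "gloo"
    # Elsewhere, prefer NCCL for GPU training when it is available; fall back to Gloo.
    return "nccl" if dist.is_nccl_available() else "gloo"
```

The returned name can then be passed as the first argument to `dist.init_process_group`.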
### Torch\lib\c10d
- Add the c10d source folder when distributed is set ON
- Exclude the HashStore, ProcessGroupRoundRobin, and TcpStore source files from the c10d project (FileStore remains; see the sketch after this list)
- Do NOT export the unsupported header files for HashStore and TcpStore
- Disable tcputils
- Enable GlooDeviceFactory for the Windows platform
- Port source code that does not yet build with MSVC
- Port the related test cases to MSVC
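With HashStore and TcpStore excluded, FileStore is the store that backs the shared file-system rendezvous. Below is a minimal sketch of using it directly instead of a file:// init_method; the store path is a placeholder and must point to a location that every participating process can reach.

```python
import torch.distributed as dist


def init_with_file_store(rank: int, world_size: int, store_path: str):
    # store_path is a placeholder; on Windows this would typically live on a
    # file share reachable from every participating process.
    store = dist.FileStore(store_path, world_size)
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
```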
### Torch\csrc\distributed
- Decouple TensorPipe from RPC
- Add the Python distributed sources to the Torch Python sources
- Disable HashStore, ProcessGroupRoundRobin, and TcpStore
- Enable the ProcessGroupGloo default-device creation function
- Disable the TensorPipe agent
- Disable TcpStore in rendezvous
- Port source code that does not yet build with MSVC
- Port the related test cases to MSVC
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @smartcat2010 @mszhanyi