Add ReplicatedParameter and ReplicatedTensor to DDP. #74787
Conversation
As per the design in #72138, introduce ReplicatedParameter, replace DDP parameters with ReplicatedParameters, and tag DDP buffers as ReplicatedTensors. Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)
💊 CI failures summary (Dr. CI, as of commit b248f29): 8 new failures recognized by patterns; they do not appear to be due to upstream breakages.
```python
    return True

class ReplicatedParameter(ReplicatedTensor, torch.nn.Parameter):
```
@albanD I couldn't pass a ReplicatedTensor into torch.nn.Parameter and have isinstance() still recognize it as a ReplicatedTensor. As a result, I had to do something messy like this. Not sure if there is a better way to do it.
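For context, here is a minimal sketch (not the PR's actual implementation) of the kind of dual subclass this refers to. The ReplicatedTensor stand-in and the use of torch.Tensor._make_subclass (the same constructor nn.Parameter uses internally) are assumptions for illustration only:

```python
import torch

class ReplicatedTensor(torch.Tensor):
    # Stand-in marker subclass: the data is assumed to be identical on every rank.
    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_subclass(cls, data, data.requires_grad)

class ReplicatedParameter(ReplicatedTensor, torch.nn.Parameter):
    # Inheriting from both keeps isinstance() true for ReplicatedTensor *and*
    # nn.Parameter; wrapping a ReplicatedTensor in nn.Parameter directly would
    # lose the ReplicatedTensor type.
    @staticmethod
    def __new__(cls, data, requires_grad=True):
        return torch.Tensor._make_subclass(cls, data, requires_grad)

p = ReplicatedParameter(torch.randn(3))
assert isinstance(p, torch.nn.Parameter) and isinstance(p, ReplicatedTensor)
```

The replies below point to #73459 as the cleaner alternative to this kind of workaround.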
Yes that's the plan from #73459
cc @jbschlosser
This is a bit delayed due to my medical leave, but wanted to leave an update here for posterity: #73459 is in now - please try it out if you'd like :)
mrshenli left a comment:
Is this a test or do we plan to land this soon? If it's the latter, shall we do some micro-benchmarks to make sure there is no perf regression?
```python
        )

        # After syncing, tag them as replicated.
        from torch.distributed._shard.replicated_tensor import ReplicatedTensor, ReplicatedParameter
```
Shouldn't we check if each param or buffer is replicated before converting them?
Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore.
Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?
> Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore?

parameters_to_ignore is somewhat orthogonal to tagging something as replicated: that option controls which parameters DDP ignores completely. That is why we always need to rely on it and make sure we never tag anything in parameters_to_ignore as Replicated at all.

> Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?

It's an unfortunate name, since it filters buffers as well in other places. I'll probably fix the name in a follow-up PR.
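Putting those two points together, the tagging pass skips both ignored entries and anything already wrapped, which also keeps it idempotent. A sketch reconstructed from the diff context shown later in this thread; the helper name _tag_replicated and the standalone-function form are illustrative, not the PR's exact code, and the import path is the one this PR adds:

```python
from torch.distributed._shard.replicated_tensor import ReplicatedParameter

def _tag_replicated(ddp):
    # Walk every submodule and wrap eligible parameters as ReplicatedParameter.
    for module_name, module in ddp.module.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            fqn = f"{module_name}.{param_name}"
            if (param.requires_grad
                    and fqn not in ddp.parameters_to_ignore          # never tag ignored params
                    and not isinstance(param, ReplicatedParameter)):  # keeps the pass idempotent
                # Re-registering under the same name replaces the plain
                # Parameter entry with the replicated wrapper.
                module.register_parameter(param_name, ReplicatedParameter(param))
```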
Oops, I added self.parameters_to_ignore a couple of years ago and agree that a better name is parameters_and_buffers_to_ignore.
@mrshenli The plan is to hopefully land this change; however, there are some tricky cases to take care of first. I'll make CI happy first (I've tagged the PR as WIP) and then run a benchmark as well to ensure there are no regressions.
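For reference, a minimal sketch of the kind of micro-benchmark being discussed, assuming a single-process gloo group is enough to compare per-step overhead before and after the change; the model size and iteration counts are arbitrary:

```python
import os
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" setup; enough to measure DDP's per-step overhead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(1024, 1024))
inp = torch.randn(64, 1024)

for _ in range(10):  # warm-up
    model(inp).sum().backward()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    model(inp).sum().backward()
print(f"avg forward+backward: {(time.perf_counter() - start) / iters * 1e3:.3f} ms")

dist.destroy_process_group()
```

Running the same script on the base commit and on this PR would show whether the extra isinstance checks and parameter wrapping add measurable per-step cost.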
@pytorchbot retest this please
rohan-varma left a comment:
Are we considering enabling this change behind a flag such as DDP_USE_REPLICATED to phase the rollout? I would advocate for that unless we're very confident about the performance implications and the reliability of ReplicatedTensor, since this also has a large blast radius in OSS and we may not be fully aware of the impact until the 1.12 release.
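A minimal sketch of the kind of opt-in gate being suggested. DDP_USE_REPLICATED is the hypothetical flag name from this comment, and _replicated_tensor_enabled / _tag_params_and_buffers_as_replicated are illustrative names, not existing PyTorch APIs:

```python
import os

def _replicated_tensor_enabled() -> bool:
    # Phased rollout: only tag parameters/buffers when the (hypothetical)
    # environment variable is explicitly set by the user.
    return os.environ.get("DDP_USE_REPLICATED", "0") == "1"

# Inside DDP's __init__, the tagging step would then be guarded, e.g.:
#
#     if _replicated_tensor_enabled():
#         self._tag_params_and_buffers_as_replicated()  # illustrative helper name
```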
```python
        self.assertIsInstance(out, ReplicatedTensor)

        # Validate buffers not replicated after forward.
        self.assertNotIsInstance(ddp.module.foo, ReplicatedTensor)
```
Don't we synchronize the buffers, so shouldn't they be replicated?
```python
        params = list(model.parameters())
        self.assertEqual(2, len(params))
        self.assertEqual(model.weight, params[0])
        self.assertEqual(model.bias, params[1])
```
Should we add `self.assertTrue(all(isinstance(param, ReplicatedParameter) for param in model.parameters()))`?
```diff
         self.gradient_as_bucket_view = gradient_as_bucket_view
-        if hasattr(module, "_ddp_params_and_buffers_to_ignore"):
-            self.parameters_to_ignore = module._ddp_params_and_buffers_to_ignore
+        if hasattr(self.module, "_ddp_params_and_buffers_to_ignore"):
```
Curious why this change is needed?
```python
        # In debug mode, build a mapping of parameter index -> parameter.
        param_to_name_mapping = self._build_debug_param_to_name_mapping(parameters)
        # Sync params and buffers. Ensures all DDP models start off at the same value.
        self._sync_params_and_buffers(authoritative_rank=0)
```
Shall we only mark this after the sync completes successfully?
```python
        if not buffers_only:
            for param_name, param in module.named_parameters(recurse=False):
                if (param.requires_grad and f'{module_name}.{param_name}'
                        not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
```
Does the latter check mean it is possible to call this function on already-replicated parameters? Should we try to ensure the application doesn't do that, since it would be unnecessary?
```python
            for param_name, param in module.named_parameters(recurse=False):
                if (param.requires_grad and f'{module_name}.{param_name}'
                        not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
                    module.register_parameter(param_name, ReplicatedParameter(param))
```
Do we not de-register the original, non-replicated parameters?
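For reference (an illustration, not part of the PR): nn.Module.register_parameter under an existing name overwrites the entry in the module's parameter dict, so the plain Parameter is replaced rather than kept alongside. A small sketch with a plain nn.Parameter standing in for ReplicatedParameter:

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)
old_weight = lin.weight

# Re-registering under the same name overwrites lin._parameters["weight"];
# the original Parameter object is no longer registered on the module.
lin.register_parameter("weight", nn.Parameter(old_weight.detach().clone()))

assert lin.weight is not old_weight
assert len(list(lin.parameters())) == 2  # still just weight and bias
```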
Closing in favor of #75299.