Skip to content

Conversation

@bdhirsh
Copy link
Contributor

@bdhirsh bdhirsh commented Dec 16, 2022

This should fix hf_Longformer, AllenaiLongformerBase, and tacotron2 with dynamic shapes. Example repro:

TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 python benchmarks/dynamo/torchbench.py --accuracy --backend aot_eager --training --only hf_Longformer

used to fail with:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 1024, 12, 513]], which is output 0
 of AsStridedBackward0, is at version 6; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient,
with torch.autograd.set_detect_anomaly(True).

The problem is that:

(1) when we have a tensor from the forward, whose sizes are needed the backward, we were saving the actual tensor for backward, and directly grabbing the sizes off of it inside of the backward graph (bad for perf)

(2) If that tensor happens to be a graph input that gets mutated, we end up with the above error. Autograd yells at you if you try to save a tensor for backward, and later mutate it.

I confirmed that this problem doesn't happen for the min cut partitioner.

Stack from ghstack (oldest at bottom):

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 16, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91012

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dd9173c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh added a commit that referenced this pull request Dec 16, 2022
…en possible

ghstack-source-id: d1ffeac
Pull Request resolved: #91012
for user in backward_usages:
saved_sym_nodes.append(user)
else:
saved_values.append(node)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @wconstab / @Chillee if you have feedback around the partitioner

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah OK, seems fine lol

@bdhirsh
Copy link
Contributor Author

bdhirsh commented Dec 16, 2022

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 16, 2022
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/361/head branch June 8, 2023 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: torch.func release notes category for torch.vmap or torch.func.* APIs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants