Skip to content

Conversation

@robieta
Copy link
Contributor

@robieta robieta commented Oct 18, 2022

Stack from ghstack (oldest at bottom):

A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a c10::intrusive_ptr<c10::TensorImpl>, which in turn holds a storage. (Which is a c10::intrusive_ptr<c10::StorageImpl>) c10::intrusive_ptr has a c10::weak_intrusive_ptr class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in c10::intrusive_ptr. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls release_resources() but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an intrusive_ptr that the weak_intrusive_ptrs use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the TensorImpl* from being reused.

This PR uses a c10::weak_intrusive_ptr<c10::TensorImpl> to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a TensorImplAddress) during post processing when we no longer care about blocking address reuse.

Differential Revision: D40492848

Just unit test for now.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 18, 2022
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 18, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87244

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit cac925b:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Just unit test for now.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
robieta pushed a commit that referenced this pull request Oct 18, 2022
Pull Request resolved: #87244

Just unit test for now.
ghstack-source-id: 170823361

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
@robieta
Copy link
Contributor Author

robieta commented Oct 18, 2022

Just testing for now.

Just unit test for now.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
@robieta robieta added the release notes: profiler release notes category label Oct 19, 2022
Copy link
Contributor

@slgong-fb slgong-fb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


struct TensorMetadata : public RawTensorMetadataBase {
explicit TensorMetadata(const RawTensorMetadata& r)
: RawTensorMetadataBase(r), impl_{r.weak_self_->_unsafe_get_target()} {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is _unsage_get_target() for? Is this just getting address from weak_intrusive_ptr?

I am also not sure how we can block the address reuse from codes. Is it part of weak_intrusive_ptr implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is _unsage_get_target() for? Is this just getting address from weak_intrusive_ptr?

Yeah. weak_intrusive_ptr<T> only has T* as a data member which is what _unsage_get_target returns.

I am also not sure how we can block the address reuse from codes. Is it part of weak_intrusive_ptr implementation?

Yeah. The key part is

if (!should_delete) {
// justification for const_cast: release_resources is basically a
// destructor and a destructor always mutates the object, even for const
// objects. NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
const_cast<std::remove_const_t<TTarget>*>(target_)->release_resources();
should_delete =
detail::atomic_weakcount_decrement(target_->weakcount_) == 0;
}
if (should_delete) {
delete target_;
}

(which is called from ~intrusive_ptr)

release_resources is called when the strong refcount reaches zero, but the actual delete call which gives the address back is gated on strong and weak refcount.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 19, 2022
@robieta
Copy link
Contributor Author

robieta commented Oct 19, 2022

I'm going to add some comments before landing this one since it's kind of subtle.

@robieta
Copy link
Contributor Author

robieta commented Oct 19, 2022

Also benchmark actually shows a modest reduction in overhead, although it could be noise. This PR adds an atomic incref (for the weak count), but removes a call to delete. So not too surprising that they more or less cancel out.

Taylor Robie added 3 commits October 19, 2022 20:53
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
Taylor Robie added 2 commits October 24, 2022 10:21
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why there are classes that are almost duplicated. But I'm sure there is a good reason. So sounds good.

with profile(profile_memory=True, record_shapes=True) as p:
for _ in range(repeats):
torch.ones((1,))
gc.collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed? And will slow down this loop quite significantly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been pretty liberal with gc.collect() to reduce flakiness since my understanding is CPython gets to decide when objects are actually destroyed. (At least that's been my experience putting cleanup code in __del__) Or is that only cycle breaking GC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although as I look at it there's no need for it to be in the inner loop, and 1000 gc collect calls is not very neighborly of me...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPython will actually deallocate your object as soon as the refcount goes to 0.
If there is a cycle involved, the refcount will never get to 0 and so yes, you'll have to wait for the gc to kick in.

So unless you know you're creating cycles, you don't need gc collect in general.

strong::boolean>;

// ============================================================================
// == weak_intrusive_ptr and the ABA problem for TensorImpl* ==================
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ezyang weak_intrusive_ptr can't go away anymore :p

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tell that to @swolchok

@robieta
Copy link
Contributor Author

robieta commented Oct 26, 2022

I'm not sure why there are classes that are almost duplicated.

One is for collection (No ID to reduce space and default constructable for std::array storage), the other is what's materialized during post processing and less performance sensitive. (#87825 further differentiates them.) I should add a comment as I agree it's not obvious.

@robieta robieta added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Oct 26, 2022
…ss reuse during profiling."


A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)

[ghstack-poisoned]
@robieta
Copy link
Contributor Author

robieta commented Oct 27, 2022

@pytorchbot merge -f "XLA failure is unrelated"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions
Copy link
Contributor

Hey @robieta.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
…ring profiling. (pytorch#87244)

A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: pytorch#87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…ring profiling. (pytorch#87244)

A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: pytorch#87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
@facebook-github-bot facebook-github-bot deleted the gh/robieta/141/head branch June 8, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: profiler release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants