[Profiler] Hold weak reference to prevent TensorImpl address reuse during profiling. #87244

robieta · 2022-10-18T22:26:30Z

Stack from ghstack (oldest at bottom):

A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes, but we don't observe TensorImpl destruction, so identity assignment is not robust to the ABA problem with respect to `TensorImpl*`: a TensorImpl can be destroyed and a new one allocated at the same address, and the two are indistinguishable by pointer value. `~TensorImpl` is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)
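To make the failure mode concrete, here is a minimal sketch in Python of ID assignment keyed purely on addresses. It is not the profiler's actual algorithm (which also has to account for storage changes); `assign_ids` and the example addresses are hypothetical and only illustrate why a recycled address is a problem.

```python
from itertools import count


def assign_ids(observed_addresses):
    """Assign one ID per distinct address, in order of first appearance.

    Hypothetical helper: it only sees raw addresses, so it cannot tell a
    long-lived object from a new object allocated at a recycled address.
    """
    ids = {}
    fresh = count()
    assigned = []
    for address in observed_addresses:
        if address not in ids:
            ids[address] = next(fresh)
        assigned.append(ids[address])
    return assigned


# Tensor A lives at 0x1000 and is destroyed; tensor B lives at 0x2000;
# tensor C is later allocated at the recycled address 0x1000.
events = [0x1000, 0x2000, 0x1000]
print(assign_ids(events))  # [0, 1, 0]: A and C collapse into one ID
```

Without a signal that A was destroyed, nothing in the event stream distinguishes "A again" from "a new tensor at A's old address".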

Fortunately there is a solution. A PyTorch Tensor is a thin wrapper around a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage (itself a `c10::intrusive_ptr<c10::StorageImpl>`). `c10::intrusive_ptr` has a companion `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation keeps both a strong refcount and a weak refcount on the target object. If the strong refcount goes to zero and there are no weak references, then everything is deleted. However, if there is a weak reference, then the `intrusive_ptr` calls `release_resources()` but not `delete`.

This frees the underlying resources (so program semantics are unchanged) but leaves behind an empty shell of the pointed-to object that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl, we block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post-processing, when we no longer care about blocking address reuse.

Differential Revision: D40492848


pytorch-bot · 2022-10-18T22:26:37Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87244


❌ 1 Failures

As of commit cac925b:

The following jobs have failed:

linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)

This comment was automatically generated by Dr. CI and updates every 15 minutes.


robieta · 2022-10-18T23:29:51Z

Just testing for now.


slgong-fb

LGTM!

slgong-fb · 2022-10-19T17:36:25Z

torch/csrc/profiler/collection.h

+
+struct TensorMetadata : public RawTensorMetadataBase {
+  explicit TensorMetadata(const RawTensorMetadata& r)
+      : RawTensorMetadataBase(r), impl_{r.weak_self_->_unsafe_get_target()} {}


What is `_unsafe_get_target()` for? Is this just getting the address from the `weak_intrusive_ptr`?

I am also not sure how the code blocks address reuse. Is that part of the `weak_intrusive_ptr` implementation?

> What is `_unsafe_get_target()` for? Is this just getting the address from the `weak_intrusive_ptr`?

Yeah. `weak_intrusive_ptr<T>` only has a `T*` as a data member, which is what `_unsafe_get_target()` returns.

> I am also not sure how the code blocks address reuse. Is that part of the `weak_intrusive_ptr` implementation?

Yeah. The key part is

pytorch/c10/util/intrusive_ptr.h, lines 282 to 292 in f3cc588:

    if (!should_delete) {
      // justification for const_cast: release_resources is basically a
      // destructor and a destructor always mutates the object, even for const
      // objects. NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
      const_cast<std::remove_const_t<TTarget>*>(target_)->release_resources();
      should_delete =
          detail::atomic_weakcount_decrement(target_->weakcount_) == 0;
    }
    if (should_delete) {
      delete target_;
    }

(which is called from `~intrusive_ptr`)

`release_resources()` is called when the strong refcount reaches zero, but the actual `delete` call, which gives the address back to the allocator, is gated on both the strong and the weak refcount.
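The gating described above can be modeled with a small toy in Python. This is a sketch, not c10 code: `ToyAllocator` and `ToyTarget` are hypothetical names, the allocator's "reuse the most recently freed address" policy is an assumption chosen to make reuse visible, and the convention that the strong references together hold one unit of the weak count is an assumption consistent with the quoted snippet.

```python
class ToyAllocator:
    """Hands out integer "addresses" and reuses the most recently freed one."""

    def __init__(self):
        self._free = []
        self._next = 0x1000

    def allocate(self):
        if self._free:
            return self._free.pop()
        address = self._next
        self._next += 0x10
        return address

    def free(self, address):
        self._free.append(address)


class ToyTarget:
    """Toy stand-in for an intrusively refcounted object (not c10 code)."""

    def __init__(self, allocator):
        self._allocator = allocator
        self.address = allocator.allocate()
        self.strong = 1
        # Assumption: all strong references together hold one weak count.
        self.weak = 1
        self.resources = "payload"
        self.deleted = False

    def add_weak(self):
        self.weak += 1

    def drop_weak(self):
        self.weak -= 1
        if self.weak == 0:
            self._delete()

    def drop_strong(self):
        self.strong -= 1
        if self.strong == 0:
            should_delete = self.weak == 1  # no outside weak references
            if not should_delete:
                self.resources = None  # models release_resources()
                self.weak -= 1
                should_delete = self.weak == 0
            if should_delete:
                self._delete()

    def _delete(self):
        # Models `delete target_`: only now does the address become reusable.
        self.resources = None
        self.deleted = True
        self._allocator.free(self.address)


allocator = ToyAllocator()

plain = ToyTarget(allocator)
plain.drop_strong()                      # no weak refs: deleted immediately
reused = ToyTarget(allocator)
print(reused.address == plain.address)   # True: the address was recycled

watched = ToyTarget(allocator)
watched.add_weak()                       # the "profiler" takes a weak reference
watched.drop_strong()                    # resources released, shell kept
fresh = ToyTarget(allocator)
print(fresh.address == watched.address)  # False: the address is still pinned
```

Once the last weak reference is dropped with `drop_weak()`, the shell is deleted and its address returns to the pool, which corresponds to the point in post-processing where blocking reuse no longer matters.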

test/profiler/test_profiler.py

robieta · 2022-10-19T19:40:31Z

I'm going to add some comments before landing this one since it's kind of subtle.

robieta · 2022-10-19T21:24:55Z

Also, the benchmark actually shows a modest reduction in overhead, although it could be noise. This PR adds an atomic incref (for the weak count) but removes a call to `delete`, so it's not too surprising that they more or less cancel out.


albanD

I'm not sure why there are classes that are almost duplicated. But I'm sure there is a good reason. So sounds good.

albanD · 2022-10-26T19:14:27Z

test/profiler/test_profiler.py

+        with profile(profile_memory=True, record_shapes=True) as p:
+            for _ in range(repeats):
+                torch.ones((1,))
+                gc.collect()


This shouldn't be needed? And will slow down this loop quite significantly.

I've been pretty liberal with `gc.collect()` to reduce flakiness, since my understanding is that CPython gets to decide when objects are actually destroyed. (At least that's been my experience putting cleanup code in `__del__`.) Or is that only the cycle-breaking GC?

Although as I look at it, there's no need for it to be in the inner loop, and 1000 `gc.collect()` calls is not very neighborly of me...

CPython will actually deallocate your object as soon as the refcount goes to 0.
If there is a cycle involved, the refcount will never get to 0 and so yes, you'll have to wait for the gc to kick in.

So unless you know you're creating cycles, you don't need gc collect in general.
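The distinction can be checked with a short, CPython-specific sketch. `Node`, `make_acyclic`, `make_cycle`, and `demo` are hypothetical names used only for illustration; the automatic collector is disabled during the demo so that only the explicit `gc.collect()` call can break the cycle.

```python
import gc
import weakref


class Node:
    """Plain object that can participate in a reference cycle."""


def make_acyclic():
    obj = Node()
    ref = weakref.ref(obj)
    del obj  # refcount hits zero: CPython deallocates right here
    return ref


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a <-> b keep each other alive
    ref = weakref.ref(a)
    del a, b  # refcounts stay at one because of the cycle
    return ref


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the picture
    try:
        acyclic_alive = make_acyclic()() is not None
        cycle_ref = make_cycle()
        cycle_alive_before = cycle_ref() is not None
        gc.collect()  # the cycle collector is what frees a and b
        cycle_alive_after = cycle_ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return acyclic_alive, cycle_alive_before, cycle_alive_after


print(demo())  # (False, True, False) on CPython
```

Other Python implementations without reference counting may behave differently, which is why the sketch is labeled CPython-specific.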

albanD · 2022-10-26T19:35:14Z

torch/csrc/profiler/data_flow.h

+    strong::boolean>;
+
+// ============================================================================
+// == weak_intrusive_ptr and the ABA problem for TensorImpl* ==================


FYI @ezyang weak_intrusive_ptr can't go away anymore :p

tell that to @swolchok

robieta · 2022-10-26T23:48:01Z

> I'm not sure why there are classes that are almost duplicated.

One is for collection (no ID to reduce space, and default constructible for `std::array` storage); the other is what's materialized during post-processing and is less performance sensitive. (#87825 further differentiates them.) I should add a comment, as I agree it's not obvious.


robieta · 2022-10-27T06:36:26Z

@pytorchbot merge -f "XLA failure is unrelated"

pytorchmergebot · 2022-10-27T06:38:07Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


github-actions · 2022-10-27T06:38:45Z

Hey @robieta.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

…ring profiling. (pytorch#87244)

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/) Pull Request resolved: pytorch#87244 Approved by: https://github.com/slgong-fb, https://github.com/albanD

[Profiler] Test TensorImpl* duplication

f78ea24


pytorch-bot bot added the topic: not user facing topic category label Oct 18, 2022

robieta added the with-ssh label Oct 18, 2022

Update on "[Profiler] Test TensorImpl* duplication"

3b7d705


Update on "[Profiler] Test TensorImpl* duplication"

5aaf1cc


robieta removed with-ssh topic: not user facing topic category labels Oct 19, 2022

robieta changed the title ~~[Profiler] Test TensorImpl* duplication~~ [Profiler] Hold weak reference to prevent TensorImpl address reuse during profiling. Oct 19, 2022

robieta requested review from aaronenyeshi, chaekit and slgong-fb October 19, 2022 15:07

robieta added the release notes: profiler release notes category label Oct 19, 2022

slgong-fb approved these changes Oct 19, 2022


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 19, 2022

Taylor Robie added 3 commits October 19, 2022 20:53

This was referenced Oct 23, 2022

[Profiler] Memory profiler part 6: Mark gradients and temporary intermediates. #87566

Closed

[Profiler] Memory profiler part 7: Mark inputs #87567

Closed

[Profiler] Memory profiler part 8: Mark parameters. #87568

Closed

Taylor Robie added 2 commits October 24, 2022 10:21

This was referenced Oct 25, 2022

[Profiler][Trivial] Add hashing struct for pairs and tuples. #87668

Closed

[Profiler][Trivial] Move ID assignment code to data_flow.cpp #87670

Closed

robieta requested a review from albanD October 25, 2022 18:22

albanD approved these changes Oct 26, 2022


robieta mentioned this pull request Oct 26, 2022

[Profiler] Restructure inputs and capture TensorLists. #87825

Closed

robieta added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Oct 26, 2022

pytorchmergebot added the Merged label Oct 27, 2022

pytorchmergebot closed this in b16b5fb Oct 27, 2022

facebook-github-bot deleted the gh/robieta/141/head branch June 8, 2023 18:33
