Use ComplexHolder pointer in c10::Scalar #53247
Conversation
Summary:
Test Plan: USE_CUDA=0 BUILD_TEST=0 python setup.py install
Reviewers:
Subscribers:
Tasks:
Tags:
[ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 95fee44 (more details on the Dr. CI page):
🕵️ 11 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Confirmed the Scalar size is reduced to 16 bytes with this PR. However, the following micro-benchmark shows a ~1% regression. Will dig into it.

```python
from torch.utils.benchmark import Timer

counts = Timer(
    stmt="auto out = t.add(a, b);",
    setup="auto t = torch::ones({1}); auto a = torch::Scalar(3); auto b = torch::Scalar(1);",
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```

Use ComplexHolder in Scalar (4bd9eeb): 29,024,819 instructions
Scalar(Scalar&& rhs) noexcept : tag(rhs.tag), v(rhs.v) {}
This is likely a bug causing a double free. After this, you have two Scalar objects that will both destroy the same complex. It could be fixed by setting the moved-from Scalar to no longer store a complex, similar to what you're doing in the move assignment operator.
As a side note: having all four copy/move constructors and assignment operators implemented explicitly duplicates code, with the chance of bugs like this one. There's a nice pattern where you implement all four of those in terms of just one implementation of swap, and then all the refcounting or destruction of state lives in just one function.
@smessmer: Nice catch! The RHS complex is cleaned up in the move assignment, but I forgot to do the same cleanup in the move constructor.
Summary:
Test Plan: USE_CUDA=0 BUILD_TEST=0 python setup.py install
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D26810622](https://our.internmc.facebook.com/intern/diff/D26810622)
[ghstack-poisoned]
C10_ALWAYS_INLINE Scalar& operator=(Scalar&& rhs) & noexcept {
Any specific reason for using C10_ALWAYS_INLINE here? In almost all cases the compiler knows better, and forcing it to inline is a bad idea. Without it, the compiler can make different decisions per target, for example inlining the call in server code but not on mobile.
@smessmer: Got it. Originally, when I started writing these move/copy methods, I followed the corresponding implementation in IValue and didn't remove C10_ALWAYS_INLINE:
pytorch/aten/src/ATen/core/ivalue.h, line 197 (at 56f8379):
C10_ALWAYS_INLINE IValue& operator=(IValue&& rhs) & noexcept {
Side note: during benchmarking, I sometimes found that manually inlining a method (i.e. copying its implementation to the call site), or even a seemingly unrelated code change, can produce tiny (0.02%-0.04%) instruction-count noise. (Is this due to inlining decisions at certain call sites?) So a small potential benefit of C10_ALWAYS_INLINE is making the results more deterministic. Not a big advantage, though :)
Superseded by the pass-by-reference solution (#53583) due to the overhead of non-trivial constructors.
Stack from ghstack:
Summary:
Test Plan:
USE_CUDA=0 BUILD_TEST=0 python setup.py install
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D26810622