Skip to content

Improve how we acquire ThreadContext to eliminate null refs.#5940

Merged
headius merged 2 commits intojruby:masterfrom
headius:improved_threadcontext_acquisition
Oct 27, 2019
Merged

Improve how we acquire ThreadContext to eliminate null refs.#5940
headius merged 2 commits intojruby:masterfrom
headius:improved_threadcontext_acquisition

Conversation

@headius
Copy link
Member

@headius headius commented Oct 23, 2019

For #5936 we saw some cases where the SoftReference created here
came back as null, even after the "adopt" logic had run. While I
have not isolated an exact cause, there are thread locals, soft
references, and GC interacting here. Since the original logic did
not guarantee non-null references throughout the code, I've
modified it to always check and only proceed once it truly has
non-null references and eventually context in hand.

@headius headius added this to the JRuby 9.2.9.0 milestone Oct 23, 2019
@headius headius requested review from enebo and kares October 23, 2019 17:54
@headius
Copy link
Member Author

headius commented Oct 23, 2019

@enebo @kares Take a quick peek at this. Critical functionality, I just wanted more eyes on it before shipping.

@kares
Copy link
Member

kares commented Oct 23, 2019

previous version ended up recursing instead of always reaching a loop - otherwise seems the same?
not sure what changed here - we never ended up with null on production apps for a long time.
really lack an understanding what precisely is fixed here ...

Copy link
Member

@enebo enebo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual scenario where you adopt and it returns null is pretty mysterious to me (very low memory?). If it does that once it feels like it could happen endlessly (and if so then perhaps we will get an issue showing just that). None the less this definitely gives it more of a chance to eventually adopt a TC.

@headius
Copy link
Member Author

headius commented Oct 23, 2019

@kares In the later comments of #5936 both @ahorek and I saw NPE when attempting to get the ThreadContext out of the SoftReference we acquired from the ThreadLocal. Somehow, that SoftReference was coming out as null even after calling adoptCurrentThread. I could not suss out how that would be possible, exactly, but there's a complicated dance of references and threads and GC in play.

The new logic basically just adds in extra null checking, in case the stars align to allow such to happen, and avoids recursing to re-attempt the acquisition of the context. It eliminates the NPE we saw.

@headius
Copy link
Member Author

headius commented Oct 23, 2019

@enebo Yeah I couldn't trace a path that would result in getting a null SoftReference after adoption, but I could not exactly rule it out.

There's two situations where that ThreadLocal might contain null:

  • The current thread has never been adopted
  • The current thread was adopted but the reference went away (soft ref, memory pressure trigger)

In both cases the current thread should go into the logic that assigns a new SoftReference before proceeding.

Copy link
Member

@kares kares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor improvement: do { ... } while since we always execute the loop at least once
that is also the way this method used to be before (the loop elimination).

@headius
Copy link
Member Author

headius commented Oct 27, 2019

@kares I will make that change, thank you.

I want to understand if there was a reason you preferred recursion...something I'm missing in my version?

My only concern with the recursion was that under some heavy GC scenario it may recurse many times, failing to acquire a context. It is not likely to happen or to cause stack issues, since we're talking soft references instead of weak references, but it nagged at the back of my brain a bit. The change was not altogether related to this patch, but it made sense to me.

I don't like the overall complexity of the adoption logic, so I want to be absolutely sure I understand your version of this method.

For jruby#5936 we saw some cases where the SoftReference created here
came back as null, even after the "adopt" logic had run. While I
have not isolated an exact cause, there are thread locals, soft
references, and GC interacting here. Since the original logic did
not guarantee non-null references throughout the code, I've
modified it to always check and only proceed once it truly has
non-null references and eventually context in hand.
@headius headius force-pushed the improved_threadcontext_acquisition branch from 8d37390 to b529298 Compare October 27, 2019 07:08
@headius
Copy link
Member Author

headius commented Oct 27, 2019

I've made the change and rebased to pick up fixes from master.

@kares
Copy link
Member

kares commented Oct 27, 2019

I want to understand if there was a reason you preferred recursion...something I'm missing in my version?

think I did a micro-benchmark where it showed up that always entering a loop was 'slower'.
was looking for that benchmark but did not find (not sure what I actually did - whether trying to stress this from a known getCurrentContext .rb site or simply a JMH).

anyway, its more important to have this working (even in edge cases) rather than 'optimized' to the ms which won't matter much in the end.

@headius headius merged commit 76ceb12 into jruby:master Oct 27, 2019
@headius headius deleted the improved_threadcontext_acquisition branch October 27, 2019 07:50
@headius
Copy link
Member Author

headius commented Oct 27, 2019

@kares That seems plausible, but I think it shouldn't make much difference these days. It boils down to a branch versus a recursive invocation, and in both cases the rare path that fails to acquire a context should end up being profiled out. If you do see a difference, though, we can revisit it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants