-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Spanner: Add trace+timeout for waiting for sessions #5703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spanner: Add trace+timeout for waiting for sessions #5703
Conversation
|
|
||
| private PooledSession take() throws SpannerException { | ||
| SessionOrError s = Uninterruptibles.takeUninterruptibly(waiter); | ||
| SessionOrError s = takeUninterruptiblyWithTimeout(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how come this needs to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Uninterruptibles.takeUninterruptibly(waiter) will block indefinitely if the resource pool is exhausted because of a session leak. In order to be able to log that a session request is blocked, we need a trigger that will end the trace span and do the actual logging. That is what this timeout does.
This timeout is currently also propagated to the user as a DEADLINE_EXCEEDED SpannerException. We could choose to handle this internally and retry the session request after logging the timeout. That would retain the current behavior where the application will potentially block indefinitely in case of a session leak. My opinion is however that it would be better to propagate this error to the user in order to give the user more information on what is going on.
| this.writeSessionsFraction = builder.writeSessionsFraction; | ||
| this.actionOnExhaustion = builder.actionOnExhaustion; | ||
| this.actionOnSessionNotFound = builder.actionOnSessionNotFound; | ||
| this.waitForSessionTimeoutMillis = builder.blockForSessionTimeoutMillis; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make these variables all 'wait...' or all 'block..'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'wait...'.
Codecov Report
@@ Coverage Diff @@
## master #5703 +/- ##
============================================
+ Coverage 46.71% 47.43% +0.71%
- Complexity 24630 27180 +2550
============================================
Files 2351 2522 +171
Lines 256127 274264 +18137
Branches 29325 31355 +2030
============================================
+ Hits 119650 130089 +10439
- Misses 127558 134565 +7007
- Partials 8919 9610 +691
Continue to review full report at Codecov.
|
| throw newSpannerException(s.e); | ||
| long currentTimeout = options.getInitialWaitForSessionTimeoutMillis(); | ||
| while (true) { | ||
| Tracer tracer = Tracing.getTracer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: tracer can be made a private static field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
| <groupId>io.opencensus</groupId> | ||
| <artifactId>opencensus-impl</artifactId> | ||
| <version>${opencensus.version}</version> | ||
| <scope>test</scope> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be runtime only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to runtime.
| // End the span with status DEADLINE_EXCEEDED and retry. | ||
| tracer | ||
| .getCurrentSpan() | ||
| .end(EndSpanOptions.builder().setStatus(Status.DEADLINE_EXCEEDED).build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .end(EndSpanOptions.builder().setStatus(Status.DEADLINE_EXCEEDED).build()); | |
| .setStatus(Status.DEADLINE_EXCEEDED); |
You don't need to end the span here. When scope is closed, span will be ended automatically. See https://github.com/census-instrumentation/opencensus-java/blob/916adcbaba580cb223cb37ca34cf1f902903ede9/api/src/main/java/io/opencensus/trace/CurrentSpanUtils.java#L97-L103.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
| if (s.e != null) { | ||
| throw newSpannerException(s.e); | ||
| } | ||
| tracer.getCurrentSpan().end(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. You don't need to call this explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
| long currentTimeout = options.getInitialWaitForSessionTimeoutMillis(); | ||
| while (true) { | ||
| Tracer tracer = Tracing.getTracer(); | ||
| Scope waitScope = tracer.spanBuilder(WAIT_FOR_SESSION).startScopedSpan(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard way to create a scoped span is in a try block and then you don't explicitly have to close the scope [example].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to try-with-resources block.
| SessionOrError s = pollUninterruptiblyWithTimeout(currentTimeout); | ||
| if (s == null) { | ||
| // End the span with status DEADLINE_EXCEEDED and retry. | ||
| tracer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating a span each time pollUninterruptiblyWithTimeout() is called, an alternative is to create a single span in take() then each time around the loop, add a trace annotation to indicate that the operation timed out. You can also add attributes to the annotation and which gives you the option to record the current timeout at each failure [example].
There might be a good reason why you chose to do multiple spans and I'd like to understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that I chose to create multiple spans, is that the take() method may block indefinitely, which would mean that the span would never be ended. Does a span that never ends, but that does add a trace annotation every 30 seconds, get flushed and sent to the server each time an annotation is added? If so, then I agree that having a single span in the take() would be the better solution. But to my (limited) knowledge of OpenCensus, the span needs to be closed before it is flushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with @g-easy and you're right. The span will never be flushed if take() can block indefinitely so your implementation is the only way to get the trace information in this case.
| if (s == null) { | ||
| // End the span with status DEADLINE_EXCEEDED and retry. | ||
| tracer | ||
| .getCurrentSpan() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Here and below, instead of calling tracer.getCurrentSpan() repeatedly, you can save it in a variable above all the if/else logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the other changes after the review, the number of calls to tracer.getCurrentSpan() is now reduced to two, and both are only called if an exception occurs, so I would prefer to keep these two calls. But I agree with you that in the previous version of the code, that would have been better.
| @Override | ||
| public void export(Collection<SpanData> spanDataList) { | ||
| for (SpanData sd : spanDataList) { | ||
| if (sd.getStatus() == Status.DEADLINE_EXCEEDED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this counter only be incremented for spans that match WAIT_FOR_SESSION?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, that makes the test case less prone to false failures.
| // finish. | ||
| checkedOutSession.close(); | ||
| fut.get(); | ||
| executor.shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
close() and shutdown() should be done in a finally as we should be cleaning up regardless of whether the timeout happened or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
skuruppu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @olavloite for making the changes. This PR looks good from my perspective but I'll let @kolea2 do the final approval.
skuruppu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolea2 gave me the go ahead to approve this change.
|
@kolea2 Apparently, GitHub requires your explicit approval of this PR to allow merging :-) |
Adds a trace for waiting for a session from the session pool. The maximum wait time is default 30 seconds, and this time can be changed through SessionPoolOptions#setBlockForSessionTimeoutMillis(long).