-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Spanner: Add BatchCreateSessions rpc and use rotating channel number #6185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spanner: Add BatchCreateSessions rpc and use rotating channel number #6185
Conversation
snehashah16
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do this implementation slightly differently
- let SessionPool.initPool should call BatchCreateSessions for each channel..
- lets also validate the requested sessions were returned or then restart request with the (requested - response) session count.
- After all sessions are available, the pool should then be shuffled to scatter the sessions (so that not all sessions for a given channel are sequential in the list). This avoids passing the channel ID in all higher APIs. channel is a very a gRPC -layer concept and its best not exposed above that.
Additionally,
- U may also need to refactor some code to replenish sessions if below minSessions (this can be optionally done by batch create sessions) and
- delete unused sessions (to bring it <= minSessions) - which should be done from the tail of the list.
| @Override | ||
| public BatchReadOnlyTransaction batchReadOnlyTransaction(TimestampBound bound) { | ||
| SessionImpl session = spanner.createSession(db); | ||
| int channel = channelCounter.getAndIncrement() % spanner.getOptions().getNumChannels(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not expose the channel to such high-level methods.. these are meant to be light-weight..
| Span span = tracer.spanBuilder(BATCH_CREATE_SESSIONS).startSpan(); | ||
| try (Scope s = tracer.withSpan(span)) { | ||
| List<com.google.spanner.v1.Session> sessions = | ||
| gapicRpc.batchCreateSessions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may not return as many sessions as requested by sessionCount.
You should loop until all sessions are returned
| @Override | ||
| public BatchReadOnlyTransaction batchReadOnlyTransaction(BatchTransactionId batchTransactionId) { | ||
| SessionImpl session = spanner.sessionWithId(checkNotNull(batchTransactionId).getSessionId()); | ||
| int channel = channelCounter.getAndIncrement() % spanner.getOptions().getNumChannels(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
| } | ||
|
|
||
| SessionImpl sessionWithId(String name) { | ||
| List<SessionImpl> batchCreateSessions(final DatabaseId db, final int sessionCount, int channel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious where is this method called ?
62f27b6 to
f1bbdcb
Compare
Codecov Report
@@ Coverage Diff @@
## master #6185 +/- ##
=========================================
Coverage 47.11% 47.11%
- Complexity 27378 27393 +15
=========================================
Files 2524 2524
Lines 277617 277617
Branches 31984 31979 -5
=========================================
Hits 130786 130786
Misses 137056 137056
Partials 9775 9775Continue to review full report at Codecov.
|
|
@snehashah16 The previous version of this PR only added the rpc definition for BatchCreateSessions (my intention was to break the change into somewhat smaller pieces), but I can see how that introduces more questions than convenience. I have now updated PR to include the changes to the session pool as well. I also refactored the code quite extensively based on your suggestions. The details for spreading the calls over all channels is now encapsulated in the |
skuruppu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for promptly making the changes we discussed @olavloite. I've only managed to review SessionClient.java and SessionPool.java and I have a couple of minor comments. Hoping to review the rest later today.
...cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionClient.java
Show resolved
Hide resolved
| } | ||
|
|
||
| /** | ||
| * Callback interface to be used for BatchCreateSessions. When sessions come available or session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"... sessions become available ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| createSession(); | ||
| int sessionCount = options.getMinSessions() - (totalSessions() + numSessionsBeingCreated); | ||
| if (sessionCount > 0) { | ||
| createSessions(sessionCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all the calls to createSessions() aside from the one for initPool() I have a bit of a concern. For initPool(), we'll evenly distribute the sessions over the channels. But when sessions are replenished, in theory, we may be closing sessions in a given channel but then redistributing the replenished sessions across all channels. Over time, I can imagine that there will be an imbalance on the number of sessions in each channel. Especially if sessions are closed one at a time, then the new sessions to replace those will always end up in the first channel. I don't know if this is an issue in practice or how much more likely this is to happen with this implementation compared to the random distribution of sessions over channel.
@olavloite, do you have any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SessionClient keeps track of the last channel hint that was used for the previous sesson creation call, so also in situations where sessions are created one-by-one or in smaller batches, they will be distributed over all channels. It is also implemented in such a way that when the number of sessions that is to be created is not divisible by the number of channels, it will create 1 more session for the first remainder of the channels than for the rest. A couple of examples (I've translated the channel hint into the actual channel number in the examples for simplicity. The actual channel is the remainder of the division channelHint / numChannels):
- Current channel hint is 0 and 1 session is created: The session is created for channel 0 and the channel hint is 1 after the creation.
- Current channel hint is 1 and 3 sessions are created: The sessions are created for channels 1 to 3 and the channel hint is now 0.
- Current channel hint is 0 and 2 sessions are created: The sessions are created for channels 0 and 1 and the channel hint is 2.
- Current channel hint is 2 and 6 sessions are created: 2 sessions are created for channel 2 and 3 and 1 session is created for channel 0 and 1. The channel hint is now 2.
So the session creation is largely distributed as we would want.
Closing sessions is a little bit different, as it is a lot more random. Sessions are closed in the following situations:
- The session pool contains more idle sessions than the MaxIdleSessions and MinSessions config values and the maintainer is running: The maintainer will delete (approx) 1/10 of the superfluous sessions. These are currently polled from the head of the session pool. (Note: This should be changed to the end of the session pool, as we have changed the order of the session pool to LIFO, and it is better to close session that have less recently been used.) The channel affiliation of these sessions is completely random.
- Sessions that have been deemed to be invalid (i.e. have returned a 'Session not found' error) are also removed from the pool. The channel affiliation of these sessions are also completely random.
I.e. which sessions are closed, and which channels that are affiliated with those, is inherently random. That could in theory create an imbalance in the number of sessions per channel.
| numSessionsBeingCreated += sessionCount; | ||
| try { | ||
| // Create a batch of sessions. The actual session creation can be split into | ||
| // multiple gRPC calls and the sessions are returned by the enumerator as they come |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the sessions are returned by the enumerator" should be "the session consumer consumes the returned sessions as they become available".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| allSessions.add(pooledSession); | ||
| // Release the session to a random position in the pool to prevent that a | ||
| // batch of sessions that are affiliated with the same channel are all placed | ||
| // sequentially in the pool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to prevent the case that a batch of sessions that are affiliated ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionClient.java
Show resolved
Hide resolved
...nts/google-cloud-spanner/src/test/java/com/google/cloud/spanner/BatchCreateSessionsTest.java
Show resolved
Hide resolved
| } | ||
|
|
||
| @Test | ||
| public void testSpannerReturnsNoSessions() throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the logic in this test. I think it's trying to test for the backend only returning all the sessions available to it (maxServerSessions) rather than returning no sessions. Did I misunderstand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. The name is somewhat misleading. It tests what happens if the server first returns all the sessions available to it, and then stops returning any sessions (i.e. returns zero sessions and no error). I've renamed the test method.
| i | ||
| < Math.min( | ||
| maxTotalSessions - sessions.size(), | ||
| Math.min(maxNumSessionsInOneBatch, request.getSessionCount())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, it would be better if this for loop statement was changed to:
int maxSessionsToCreate = Math.min(maxNumSessionsInOneBatch, request.getSessionCount());
for (int i = 0; i < Math.min(maxTotalSessions - sessions.size(), maxSessionsToCreate); i++) { }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| public void onSessionCreateFailure(Throwable t, int createFailureForSessionCount) {} | ||
| }; | ||
| // We want 100 sessions, but each rpc will only return 5. We should still get 100 sessions from | ||
| // the enumerator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // the enumerator. | |
| // the consumer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // runner. | ||
| // The creation of the new transaction failed with a | ||
| // SessionNotFoundException, | ||
| // and the session was re-created before the run method was called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the line wrapping here and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // The rollback will also cause a SessionNotFoundException, but this is caught, logged | ||
| // and | ||
| // further ignored by the library, meaning that the session will not be re-created for | ||
| // retry. Hence rollback at call 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line wrapping needs to be fixed here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
snehashah16
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have provided high level comments..
| } catch (Throwable t) { | ||
| consumer.onSessionCreateFailure(t, remainingSessionsToCreate); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we not retry ?
If retries are already handled by gax layer, could u verify what error codes are retried ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gax handles the retries, and will do so based on the settings here: https://github.com/googleapis/googleapis/blob/6b2ba2ae3124c22ecb56af7102c78110b8576671/google/spanner/v1/spanner_gapic.yaml#L93
That means that the method will retry on:
- Server returns error code Unavailable
- Deadline_Exceeded (client side) after 60 seconds, and it will give up retrying after 600 seconds.
| newSpannerException(ErrorCode.UNKNOWN, "Server did not return any sessions"), | ||
| remainingSessionsToCreate); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would this happen ?
Background context: we removed session limits and therefore this seems highly unlikely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happened once while stress testing the functionality on the staging environment, so I added this as a safeguard against an infinite loop. I have not seen this happen on the production environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would still retry the request (same logic as numRequested > 0)
Exposing an unknown error is degradation/error to their subsequent read / write requests..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to retry also when the number of returned sessions is 0.
...cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionClient.java
Show resolved
Hide resolved
| synchronized (this) { | ||
| for (int i = 0; i < spanner.getOptions().getNumChannels(); i++) { | ||
| // Create one more session for the first X calls to fill up the remainder of the division. | ||
| int createCountForChannel = sessionCountPerChannel + (i < remainder ? 1 : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would simplify this
sessionCountPerChannel = sessionCount / spanner.getOptions().getNumChannels();
remainder = sessionCount % spanner.getOptions().getNumChannels();
// just add remainder to the requested count for last or first channel.
for ( channel i : channels) {
if ( i == 0 OR i == (numChannels - 1)) {
sessionCountRequested = sessionCountPerChannel + remainder
}
}
Evenly distributing is great but wont buy us much optimization. Simple code will definitely improve readability and avoid bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This change also has the advantage that requests for numSessions < numChannels will be executed in one rpc call instead of numSessions rpc calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !!
it has a be a huge config mistake if customers have "numSessions < numChannels"
Though this change guards against using all those number of channels.
| if (sessionReplacementCount > MAX_SESSION_REPLACES) { | ||
| throw SpannerExceptionFactory.newSpannerException( | ||
| e.getErrorCode(), | ||
| "Could not replace invalidated session because the maximum number of times the session may be replaced was exceeded", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a good experience for our customers since this degrades their read / writes.
When would this happen ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this shouldn't have been in this PR and has been removed.
It could happen if a large number of sessions in the pool have been invalidated by the server without the client knowing about it, and this would safeguard against a large number of retries that all fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sessions in the pool are kept alive using "GetSession() rpc" or "SELECT 1" query.
Hopefully we shouldnt have this catastrophic event
| } | ||
| } catch (SessionNotFoundException e) { | ||
| sessionReplacementCount++; | ||
| if (sessionReplacementCount > MAX_SESSION_REPLACES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.. just leave this comment open until there is a better strategy defined for prev instance
| break; | ||
| } catch (SessionNotFoundException e) { | ||
| sessionReplacementCount++; | ||
| if (sessionReplacementCount > MAX_SESSION_REPLACES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.. just leave this comment open until there is a better strategy defined for prev instance
...e-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Show resolved
Hide resolved
...e-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Show resolved
Hide resolved
...e-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Show resolved
Hide resolved
|
@skuruppu - i have reviewed the crux of the CL (mostly SessionClient, SpannerImpl and SessionPool.java) |
|
@olavloite - LGTM! feel free to check-in when @skuruppu is done with the review. QQ: When the LRU session is returned to the pool, when does the BeginTxn rpc get called ? |
36a02e6 to
d41e83c
Compare
When a session is released into the pool (be it a session that is returned from a client, a session that has just been created or a session that has just executed a keep-alive query), the session pool will check whether it needs to prepare a transaction or not. So what happens is:
So the BeginTransaction rpc is not executed as part of the write request path. However, if the write fraction setting of the session pool is too low for the number of read/write transactions that is executed, the BeginTransaction rpc will be a bottleneck, as any read/write transaction that wants a session will be placed in the wait queue and will wait for a BeginTransaction rpc to finish. The BeginTransaction rpc can also be a bottleneck in a very high load read/write transaction situation for another reason:
|
Adds the BatchCreateSessions rpc to the Spanner client and uses this for session creation. A new
SessionClientis added that encapsulates the details of how the session creation is evenly distributed over all available channels and that multiple calls might be necessary before all sessions have been created.The sessions are returned by the
SessionClientas anEnumerationthat will yield results as soon as sessions are available. This way, the session pool does not have to wait until all the requested sessions have been created, but can start to serve sessions as soon as some have come available.Updates #6169