Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
adb81f2
chore: rationalize SessionImpl field visibility
igorbernstein2 May 27, 2026
0854204
chore: migrate pool scheduling to a hashed-wheel timer
igorbernstein2 Jun 12, 2026
54c287e
chore: introduce VOperation as the head of the middleware chain
igorbernstein2 Jun 12, 2026
c88862a
chore: introduce OpExecutor; plumb VRpcCallContext.getExecutor
igorbernstein2 Jun 12, 2026
8f99778
chore: move callback dispatch from RetryingVRpc to VRpcImpl
igorbernstein2 Jun 13, 2026
2105669
fix: arm heartbeat tick only while a vRPC is in flight
igorbernstein2 Jun 17, 2026
baf7ac7
chore: apply fmt-maven-plugin
igorbernstein2 Jun 22, 2026
3d5a1f9
chore: add session SynchronizationContext alongside the existing lock
igorbernstein2 Jun 14, 2026
e519990
chore: make startRpc and cancelRpc async via sessionSyncContext
igorbernstein2 Jun 14, 2026
de8531a
chore: remove synchronized(lock) from SessionImpl
igorbernstein2 May 28, 2026
8bce7ac
chore: abort session on uncaught exception in sessionSyncContext
igorbernstein2 Jun 4, 2026
851adef
chore: assert sessionSyncContext on all stream-callback handlers
igorbernstein2 Jun 22, 2026
932ad80
chore: apply fmt-maven-plugin
igorbernstein2 Jun 22, 2026
2964e1f
chore: isolate user-callback executor on a cached thread pool
igorbernstein2 Jun 14, 2026
bcee9c4
chore: back OpExecutor with SerializingExecutor(userCallbackExecutor)
igorbernstein2 Jun 14, 2026
812931c
chore: configure gRPC session streams with DirectExecutor
igorbernstein2 May 28, 2026
5d22de9
chore: replace fully-qualified type names with imports
igorbernstein2 Jun 22, 2026
5c4f476
chore: apply fmt-maven-plugin
igorbernstein2 Jun 22, 2026
20fa479
test: add 30-second @Timeout to all session/pool integration tests
igorbernstein2 May 27, 2026
1c0153f
chore: route PendingVRpc per-op state through the op executor
igorbernstein2 May 28, 2026
4165506
chore: consolidate cancel trampolines at VOperationImpl
igorbernstein2 Jun 14, 2026
fb74f48
chore: replace OpExecutor backing with an inline-capable queue; tight…
igorbernstein2 Jun 14, 2026
1e5fbc5
chore: add @GuardedBy annotations to OpExecutor
igorbernstein2 Jun 22, 2026
03e2838
chore: hoist mutable-state-ownership comment to class level on Retryi…
igorbernstein2 Jun 22, 2026
af3a0d9
chore: drop defensive tracer.onOperationStart try/catch in RetryingVRpc
igorbernstein2 Jun 22, 2026
5605c27
chore: split SessionHandle.onVRpcFinish into completion + cancellatio…
igorbernstein2 Jun 22, 2026
1d18231
chore: apply fmt-maven-plugin
igorbernstein2 Jun 22, 2026
0dbd9c8
fix: drain SessionPools before tearing down userCallbackExecutor on C…
igorbernstein2 Jun 15, 2026
42394fa
fix: deliver terminal onClose to Scheduled retries pending at Client.…
igorbernstein2 Jun 15, 2026
a343ed6
fix: serialize open*/close so racing opens cannot create orphan pools
igorbernstein2 Jun 15, 2026
1a8ef60
fix: ShimImpl uses shutdownAndAwait for userCallbackExecutor
igorbernstein2 Jun 15, 2026
de62fa1
fix: lift shim loader throws to failed futures via SessionPoolMap
igorbernstein2 Jun 16, 2026
9607245
test: cover pool close while a vRPC is queued in pendingRpcs
igorbernstein2 Jun 23, 2026
b82a480
chore: record debug tag before throwing on unexpected drainedFuture f…
igorbernstein2 Jun 23, 2026
2286e0e
refactor: add State.onExit and localize per-attempt tracer pairing to…
igorbernstein2 Jun 16, 2026
c503a03
refactor: replace CleanupListener.closed flag with chain.isDone()
igorbernstein2 Jun 16, 2026
bd152b1
docs: capture deferred review findings and design rationale
igorbernstein2 Jun 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions java-bigtable/TODO.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# java-bigtable — deferred work

## Three fallback closeReason synthesizers in SessionImpl

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionImpl.java`
**Symptom:** the invariant "a session reaching CLOSED carries a closeReason" is enforced by belt-and-suspenders synthesizers at three sites — `abortFromUncaughtException` (overwrites unconditionally), `startGracefulClose` (logs warning + synthesizes if null), `notifyTerminalClose` (added in `bfa99cd0d9e` as last-resort guard). Each was added in response to a discovered miss. A new code path that transitions toward CLOSED must remember to set `closeReason` or rely on a downstream synthesizer firing.

**Fix sketch:** couple setter to transition. Add an overload `updateState(SessionState newState, CloseSessionRequest closeReason)` for terminal-phase transitions that requires the reason as an argument. The plain `updateState(newState)` overload stays for NEW→STARTING→READY. The three synthesizers go away.

**Risk:** a bug that misses the setter becomes a loud `IllegalArgumentException` instead of silent metric corruption. Probably better but it's a behavior shift — any "should never happen" path that hit a synthesizer today would now crash.

## drainedFuture completion duplicated across two sites

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionPoolImpl.java`
**Symptom:** `drainedFuture.complete(null)` fires from two unrelated sites — `close()` (line 308) when the pool was already empty at close time, and `onSessionClose` (line 538-540) when the last session drains while `poolState == CLOSED`. Both re-check `sessions.getAllSessions().isEmpty()`. A future code path that removes the last session by any other route (admin force-drain, AFE shutdown, per-session abort that bypasses the standard listener chain) leaves `drainedFuture` uncompleted — `Client.close` hangs for the full `POOL_DRAIN_TIMEOUT` (6 min) per pool with no log indication.

**Fix sketch:** derive the completion from a single state event — e.g., a `maybeCompleteDrain()` helper called from every site that transitions a session out of the pool, with one check `poolState == CLOSED && sessions.getAllSessions().isEmpty()`. Removes the duplicated check and gives a single chokepoint to audit.

**Risk:** low. The condition lives in one place; future session-removal paths just need to call the helper.

## Client.shutdownAndAwait is a public-static helper in the wrong place

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/api/Client.java`
**Symptom:** `shutdownAndAwait(ExecutorService)` is a public static helper on `Client` (line 394), duplicating Guava's `MoreExecutors.shutdownAndAwaitTermination` semantics. Promoted to public-static so `ShimImpl` (different package) can reuse it. The user-callback executor's lifecycle policy ("cached pool drained with 5s grace, then `shutdownNow`") belongs to the executor's owning type, not a free function on `Client` that every owner must remember to import.

**Fix sketch:** introduce a `UserCallbackExecutor` (or similar) type that wraps the cached `ExecutorService` and owns its lifecycle (`close()` does the shutdown-and-await dance). Both `Client.create` and `ShimImpl` construct one. `Client.shutdownAndAwait` goes away.

**Risk:** small ripple to construction sites. Reduces pressure to add more "shutdown subtleties" (configurable timeout, interrupt restoration variants, etc.) as more public statics on `Client`.

## RetryingVRpc.start guard relies on unenforced op-executor affinity

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/middleware/RetryingVRpc.java`
**Symptom:** `started`, `isCancelling`, `currentState` are plain non-volatile fields. The comment claims "VOperationImpl trampolines every inbound call onto opExecutor, so no synchronization is needed here" — but `start()` and `cancel()` don't assert that contract. A caller that bypasses `VOperationImpl` (a test using `VRpcCallContext.create`'s 3-arg overload with a `t -> {}` swallowing handler, or any new direct consumer) gets torn reads and silent state corruption rather than a clear precondition failure. Added importance after the `chain.isDone()` change (#8 fix) — that field is also unsynchronized and now externally observable.

**Fix sketch:** add `context.getExecutor().throwIfNotInThisExecutor()` at the top of `start()` and `cancel()` (and `isDone()` if we want to be strict). Each becomes a one-line guard. Tests that exercise these methods directly must construct a real OpExecutor and trampoline through it, matching production usage.

**Risk:** test rewrites for any test that calls `RetryingVRpc.start()` directly without going through `VOperationImpl`. Check the existing `RetryingVRpcTest` and `VRpcTracerTest` for impacted call sites.

## Long-delay Scheduled retries are cancelled on close, not awaited

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/middleware/RetryingVRpc.java` + `Client.java`
**Symptom:** Today (post `4a80a8284fb`) a `Scheduled` retry pending at `Client.close` is driven to a CANCELLED Done via the `BigtableTimer.onStop` hook. That delivers a terminal to the user — but it abandons whatever the retry was waiting for. A truly graceful shutdown would let the retry complete naturally if the deadline still has room. We explicitly chose cancel-on-close because per-op tracking at the Client level was rejected as too heavy.

**Fix sketch (if requirements change):** add a Client-level (or per-pool) registry of in-flight `VOperation`s. `Client.close` Phase 2 would extend `drainedFuture` to also wait for in-flight ops to terminate, bounded by `POOL_DRAIN_TIMEOUT`. Op registry maintained by `VOperationImpl.start` / `Done.onStart`.

**Risk:** real structural change — every `VOperation` registers/unregisters; Client gains a new responsibility. Only worth it if a customer reports the cancel-on-close behavior as wrong. Today the cancel path delivers a clean terminal, which is sufficient for shutdown correctness.

## Idle session heartbeat ticks burn CPU

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionImpl.java`
**Symptom:** `checkHeartbeat` re-arms a 100 ms wheel tick on every fire, regardless of whether a vRPC is in flight. For idle sessions, `nextHeartbeat` sits at `now + FUTURE_TIME` (30 min) but the tick still fires 10×/sec, syncs onto `sessionSyncContext`, compares, and re-arms. 100 idle sessions = ~1,000 wakeups/sec per Client of pure background noise.

**Fix sketch:** couple heartbeat arming to `currentRpc` lifecycle.
- In `startRpc` (after setting `nextHeartbeat = now + heartbeatInterval`): arm `scheduleHeartbeatCheck()` if `heartbeatTimeout == null`.
- In `handleVRpcResponse` / `handleVRpcErrorResponse` (after pushing `nextHeartbeat` to `FUTURE_TIME`): `cancelHeartbeatTimeout()`.
- In `checkHeartbeat`: re-arm only when `currentRpc != null`.
- Drop the unconditional `scheduleHeartbeatCheck()` from `handleOpenSessionResponse`.

**Risk:** heartbeat becomes per-vRPC instead of per-session. Future multiplexing or any state where `nextHeartbeat` is meaningful without `currentRpc != null` needs explicit arming. Stuck-session shutdown detection is already covered by the watchdog + `Client.POOL_DRAIN_TIMEOUT`.
Original file line number Diff line number Diff line change
Expand Up @@ -25,14 +25,15 @@
import com.google.cloud.bigtable.data.v2.internal.channels.ChannelPool;
import com.google.cloud.bigtable.data.v2.internal.csm.Metrics;
import com.google.cloud.bigtable.data.v2.internal.csm.attributes.ClientInfo;
import com.google.cloud.bigtable.data.v2.internal.session.BigtableTimer;
import com.google.cloud.bigtable.data.v2.internal.session.SessionPool;
import com.google.cloud.bigtable.data.v2.internal.session.VRpcDescriptor;
import com.google.cloud.bigtable.data.v2.internal.util.ClientConfigurationManager;
import io.grpc.CallOptions;
import io.grpc.Deadline;
import java.io.Closeable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Executor;

public class AuthorizedViewAsync implements AutoCloseable, Closeable {

Expand All @@ -48,7 +49,8 @@ static AuthorizedViewAsync createAndStart(
String viewId,
Permission permission,
Metrics metrics,
ScheduledExecutorService executorService) {
BigtableTimer timer,
Executor userCallbackExecutor) {

AuthorizedViewName viewName =
AuthorizedViewName.builder()
Expand Down Expand Up @@ -78,7 +80,8 @@ static AuthorizedViewAsync createAndStart(
callOptions,
viewName.toString(),
metrics,
executorService);
timer,
userCallbackExecutor);

return new AuthorizedViewAsync(base);
}
Expand Down
Loading
Loading