fix: session resumption reconnection loop never iterates#5007

Open
brucearctor wants to merge 5 commits into google:main from brucearctor:fix/live-session-resumption-4996

Conversation

@brucearctor

Summary

Fixes #4996 — the run_live() reconnection loop was unreachable due to unconditional re-raising of exceptions, redundant history transmission on reconnection, and ignored goAway server messages.

Changes

1. Exception handlers continue instead of raise (base_llm_flow.py)

  • ConnectionClosed / ConnectionClosedOK handler now continues the while True loop when live_session_resumption_handle is present
  • Added APIError handling (genai SDK wraps ConnectionClosed as APIError) with the same continue-on-handle logic
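The continue-on-handle pattern described above can be sketched as follows. This is a simplified stand-alone illustration, not the actual base_llm_flow.py code: `ConnectionClosedOK` and `APIError` are stand-in classes for the real websockets / genai SDK exceptions, and the context is a plain dict.

```python
class ConnectionClosedOK(Exception):
    """Stand-in for websockets.exceptions.ConnectionClosedOK."""


class APIError(Exception):
    """Stand-in for the genai SDK's APIError (which wraps ConnectionClosed)."""


def run_loop(connect, ctx):
    """Keep reconnecting while a session resumption handle is available."""
    while True:
        try:
            return connect(ctx)
        except (ConnectionClosedOK, APIError):
            if ctx.get("live_session_resumption_handle"):
                continue  # reconnect using the stored handle
            raise  # no handle: preserve the old fail-fast behavior
```

Without a handle the exception propagates exactly as before, so existing behavior is unchanged for non-resumable sessions.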

2. Skip send_history on reconnection (base_llm_flow.py)

  • Added guard: if llm_request.contents and not invocation_context.live_session_resumption_handle
  • Server already has the session context via the resumption handle — no need to resend
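The guard reduces to a one-line condition; a hypothetical sketch (the function name and the callback parameter are illustrative, not the real API):

```python
def maybe_send_history(contents, resumption_handle, send_history):
    # Only send history on a fresh connection. A resumption handle means
    # the server already has the session context, so resending is redundant.
    if contents and not resumption_handle:
        send_history(contents)
```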

3. Surface go_away messages (gemini_llm_connection.py + llm_response.py)

  • GeminiLlmConnection.receive() now detects message.go_away and yields it as LlmResponse.go_away
  • Added go_away: Optional[LiveServerGoAway] field to LlmResponse
  • Enables proactive reconnection ~60s before server terminates the connection
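The go_away surfacing can be sketched with a simplified receive() generator. `LlmResponse` here is a dataclass stand-in (the real field type is `Optional[LiveServerGoAway]`), and server messages are modeled as dicts:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LlmResponse:
    text: Optional[str] = None
    go_away: Optional[dict] = None  # real type: Optional[LiveServerGoAway]


def receive(messages):
    for message in messages:
        if message.get("go_away"):
            # Surface the warning so the caller can reconnect proactively
            # (~60s before the server terminates the connection).
            yield LlmResponse(go_away=message["go_away"])
        else:
            yield LlmResponse(text=message.get("text"))
```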

Testing

Suite                                      Tests
test_gemini_llm_connection.py              27 (incl. new test_receive_go_away)
test_run_live_reconnection.py              7 (new)
tests/unittests/flows/llm_flows/ (all)     364

New reconnection tests cover:

  • Loop continues on ConnectionClosedOK / APIError when resumption handle exists
  • Exceptions propagate without handle (preserves old behavior)
  • Non-APIError exceptions always propagate, even with handle
  • send_history skipped with handle, called without

Three fixes for the run_live() reconnection loop:

1. Exception handlers in base_llm_flow.py now continue instead of
   raise when a session resumption handle is available. This covers
   both ConnectionClosed (from websockets) and APIError (from genai
   SDK wrapping ConnectionClosed).

2. send_history is skipped on reconnection — the server already has
   the session context via the resumption handle.

3. go_away messages from the server are now surfaced through
   gemini_llm_connection.py's receive() as LlmResponse.go_away,
   enabling proactive reconnection before server termination.

Fixes: google#4996
Seven tests covering the exception handling and reconnection behavior
of the outer while-True loop in run_live():

- test_reconnects_on_connection_closed_with_handle
- test_reconnects_on_api_error_with_handle
- test_raises_connection_closed_without_handle
- test_raises_api_error_without_handle
- test_raises_non_api_error_with_handle
- test_skips_history_on_reconnect
- test_sends_history_without_handle

Uses a _LoopBreak sentinel and _make_connect_fn helper for
deterministic loop termination in tests.
@adk-bot adk-bot added the live [Component] This issue is related to live, voice and video chat label Mar 26, 2026
@rohityan rohityan self-assigned this Mar 26, 2026
    logger.info(
        'Connection closed (%s), reconnecting with session handle.', e
    )
    continue
Collaborator

@rohityan rohityan Mar 26, 2026


This is great for brief network glitches. How do you think this would behave if the server was down for a few minutes?

Author

@brucearctor brucearctor Mar 27, 2026


Good question! Right now the loop relies on the inherent connection timeout from llm.connect() to throttle reconnection attempts. For brief glitches that's sufficient, but you're right that a multi-minute outage would result in aggressive retries.

Makes me think:

  • Add exponential backoff with jitter (e.g., 1s → 2s → 4s → ... capped at ~30s) between reconnection attempts
    AND
  • Add a max retry count and raise after N failures

I'll add that to this PR. Or would you suggest a different approach?

@rohityan rohityan added the request clarification [Status] The maintainer need clarification or more information from the author label Mar 26, 2026
    )
    async with llm.connect(llm_request) as llm_connection:
    -  if llm_request.contents:
    +  if llm_request.contents and not invocation_context.live_session_resumption_handle:
Collaborator


What happens if the resumption handle is rejected by the server?

Author


Great point. Looking at the Gemini Live API behavior: if the handle is rejected, the server sends back a session_resumption_update with resumable=False (or simply doesn't echo back a new_handle). It does not silently drop the context.

However, to be safe, we could:

1. Clear the handle on rejection: when we receive a session_resumption_update where resumable is False, clear live_session_resumption_handle so the next reconnection falls back to sending full history.
2. Add a fallback: if the connection succeeds but no session_resumption_update arrives within a timeout, assume the handle was rejected and resend history.

I think option (1) is already partially covered by the existing _receive_from_model logic that updates the handle from session_resumption_update.

Want me to verify the server behavior and add an explicit guard for the rejection case?
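Option (1) could be sketched roughly like this (a hypothetical helper over a dict-based context and dict-based update message; the real types and the actual _receive_from_model logic differ):

```python
def update_resumption_handle(ctx, update):
    """Apply a session_resumption_update to the stored handle.

    If the server reports the session is not resumable, clear the handle
    so the next reconnection falls back to sending full history.
    """
    if not update.get("resumable"):
        ctx["live_session_resumption_handle"] = None
    elif update.get("new_handle"):
        ctx["live_session_resumption_handle"] = update["new_handle"]
```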

Add exponential backoff with jitter (1s base, 30s max) and a retry cap
(10 attempts) to the run_live() reconnection loop. This prevents
aggressive reconnection attempts during extended server outages.

- Backoff delay: min(1s * 2^(attempt-1), 30s) + random(0,1) jitter
- Max retries: 10 (configurable via MAX_RECONNECT_ATTEMPTS)
- Attempt counter resets on successful connection
- New tests: test_raises_after_max_retries_connection_closed,
  test_raises_after_max_retries_api_error

Addresses reviewer feedback on PR google#5007.
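The delay formula from the commit message, as a small self-contained sketch (constant names follow the commit message; the function name is illustrative):

```python
import random

BASE_DELAY = 1.0   # seconds
MAX_DELAY = 30.0   # cap for the exponential term
MAX_RECONNECT_ATTEMPTS = 10


def backoff_delay(attempt: int) -> float:
    """Delay before reconnect attempt `attempt` (1-based):
    min(1s * 2^(attempt-1), 30s) plus up to 1s of random jitter."""
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY) + random.random()
```

The jitter spreads out retries from many clients so they don't reconnect in lockstep after a shared outage.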
Instead of re-raising the raw ConnectionClosedOK/APIError when max
retries are exhausted, wrap it in a ConnectionError with a clear
message and chain the original exception via 'from e'. This lets
callers distinguish 'reconnection was attempted and exhausted' from
a single unexpected disconnect.
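The wrap-and-chain pattern can be sketched as follows (OSError stands in for ConnectionClosedOK/APIError, and the function name is illustrative):

```python
def reconnect_with_cap(connect, max_attempts=10):
    """Retry connect() up to max_attempts times; on exhaustion, raise a
    ConnectionError chained to the last underlying failure via `from`."""
    last = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as e:  # stand-in for ConnectionClosedOK / APIError
            last = e
    raise ConnectionError(
        f'Reconnection failed after {max_attempts} attempts.'
    ) from last
```

Because the original exception is chained, callers can still inspect `e.__cause__` to distinguish an exhausted retry loop from a single unexpected disconnect.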


Development

Successfully merging this pull request may close these issues.

Session resumption reconnection loop in run_live() never iterates

3 participants