Skip to content

Conversation

@seratch
Copy link
Contributor

@seratch seratch commented Jul 21, 2021

Summary

This pull request fixes #1065 by improving the internals of AsyncSocketModeClient.

As described in the bug report, the reconnection occasionally may fail to properly replace the active WebSocket connection in the connect_to_new_endpoint method.

If both the connection monitoring code and other code try to reconnect at the same time, a race condition issue can arise. In this case, the internal state self.wss_uri and self.current_session can be invalid (=stale). The invalid state is automatically repaired after a certain amount of time (say, less than 1 minute with the default setting) but the app can be unresponsive to any incoming requests until the recovery.

It's not so easy to reproduce the situation in unit tests but you can easily reproduce the situation by running the following script:

import logging
logging.basicConfig(level=logging.DEBUG)

import asyncio
import os
# pip install slack-sdk
from slack_sdk.socket_mode.response import SocketModeResponse
from slack_sdk.socket_mode.request import SocketModeRequest
# pip install "aiohttp>=3,<4"
from slack_sdk.web.async_client import AsyncWebClient
from slack_sdk.socket_mode.aiohttp import SocketModeClient

async def main():
    client = SocketModeClient(
        app_token=os.environ.get("SLACK_APP_TOKEN"),
        web_client=AsyncWebClient(token=os.environ.get("SLACK_BOT_TOKEN")),
    )
    async def process(client: SocketModeClient, req: SocketModeRequest):
        if req.type == "events_api":
            response = SocketModeResponse(envelope_id=req.envelope_id)
            await client.send_socket_mode_response(response)
            if req.payload["event"]["type"] == "message":
                await client.web_client.reactions_add(
                    name="eyes",
                    channel=req.payload["event"]["channel"],
                    timestamp=req.payload["event"]["ts"],
                )
    client.socket_mode_request_listeners.append(process)

    for _ in range(5):
        # concurrently reconnecting to new endpoints
        asyncio.ensure_future(client.connect_to_new_endpoint())
    # just for keeping this process running
    await asyncio.sleep(float("inf"))

asyncio.run(main())

To resolve this, we can do the same with the sync version of SocketModeClient. It always acquires a lock when replacing wss_uri and current_session to make the operation safe. The synchronization affects only the reconnection. The overall performance of the WeSocket message handlers is never affected.

If I remember correctly, there was no strong reason why I didn't have the lock in the async Socket Mode clients. I'm thinking that I just missed adding it during the initial development #883

Category (place an x in each of the [ ])

  • slack_sdk.web.WebClient (sync/async) (Web API client)
  • slack_sdk.webhook.WebhookClient (sync/async) (Incoming Webhook, response_url sender)
  • slack_sdk.socket_mode (Socket Mode client)
  • slack_sdk.signature (Request Signature Verifier)
  • slack_sdk.oauth (OAuth Flow Utilities)
  • slack_sdk.models (UI component builders)
  • slack_sdk.scim (SCIM API client)
  • slack_sdk.audit_logs (Audit Logs API client)
  • slack_sdk.rtm_v2 (RTM client)
  • /docs-src (Documents, have you run ./docs.sh?)
  • /docs-src-v2 (Documents, have you run ./docs-v2.sh?)
  • /tutorial (PythOnBoardingBot tutorial)
  • tests/integration_tests (Automated tests for this library)

Requirements (place an x in each [ ])

  • I've read and understood the Contributing Guidelines and have done my best effort to follow them.
  • I've read and agree to the Code of Conduct.
  • I've run python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.sh after making the changes.

@seratch seratch added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented Version: 3x socket-mode labels Jul 21, 2021
@seratch seratch added this to the 3.9.0 milestone Jul 21, 2021
@codecov
Copy link

codecov bot commented Jul 21, 2021

Codecov Report

Merging #1067 (70ae26c) into main (8bbb5d4) will increase coverage by 0.01%.
The diff coverage is 91.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1067      +/-   ##
==========================================
+ Coverage   84.14%   84.15%   +0.01%     
==========================================
  Files          99       99              
  Lines        9239     9255      +16     
==========================================
+ Hits         7774     7789      +15     
- Misses       1465     1466       +1     
Impacted Files Coverage Δ
slack_sdk/socket_mode/async_client.py 71.28% <84.61%> (+1.39%) ⬆️
slack_sdk/socket_mode/aiohttp/__init__.py 57.04% <100.00%> (+1.24%) ⬆️
slack_sdk/socket_mode/websockets/__init__.py 61.32% <100.00%> (+1.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8bbb5d4...70ae26c. Read the comment docs.

Copy link
Contributor Author

@seratch seratch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments for reviewers

channel=req.payload["event"]["channel"],
timestamp=req.payload["event"]["ts"],
)
if req.payload["event"]["type"] == "message":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the issue where this example may fail if the app subscribes to other events like reaction_added

else:
await asyncio.sleep(consecutive_error_count)

async def is_connected(self) -> bool:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For easier access to the state and better alignment with the sync SocketModeClient, I've added this flag.

await self.connect()
async def connect_to_new_endpoint(self, force: bool = False):
try:
await self.connect_operation_lock.acquire()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock acquisition and release inside this method is the main change in this PR. With this way, other concurrent method calls are synchronized. The client can safely check sef.is_connected() inside and decide if the client still needs to replace the WSS URL and the underlying WebSocket session.

try:
if message.get("type") == "disconnect":
await self.connect_to_new_endpoint()
await self.connect_to_new_endpoint(force=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sync client does the same. When we receive a disconnect type message from the Socket Mode server-side, we should immediately reconnect even if the current session is still active.

auto_reconnect_enabled: bool
default_auto_reconnect_enabled: bool
closed: bool
connect_operation_lock: Lock
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the only changes that are necessary in other client implementations. I'll have a migration guide section in the next version's release notes.

@seratch
Copy link
Contributor Author

seratch commented Jul 22, 2021

@stevengill Thanks for your review!

@seratch
Copy link
Contributor Author

seratch commented Jul 28, 2021

Just in case, I have been running my example app with this change for a few days but I haven't detected any issues so far. Let me merge this change now.

@seratch seratch merged commit bd99038 into slackapi:main Jul 28, 2021
@seratch seratch deleted the issue-1065-async-reconnection branch July 28, 2021 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented socket-mode Version: 3x

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Socket mode disconnections with AsyncApp

2 participants