Fix #887: Enable automatic retry in a handy way (#1084)
Conversation
```python
retry_response: Optional[RetryHttpResponse] = None
response_body = ""

if self.logger.level <= logging.DEBUG:
```

Moved this debug logging so that it is printed every time this client performs a retry.
```python
user_agent_prefix: Optional[str] = None,
user_agent_suffix: Optional[str] = None,
logger: Optional[logging.Logger] = None,
retry_handlers: List[RetryHandler] = async_default_handlers,
```

The default list consists of only an (Async)ConnectionErrorRetryHandler instance with its default settings.
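As a point of reference, a minimal sketch of what that default amounts to is below; the import path is an assumption and is not shown in this diff.

```python
# Minimal sketch of the default retry_handlers value; the import path is an
# assumption for illustration, not taken from this diff.
from slack_sdk.http_retry.builtin_async_handlers import AsyncConnectionErrorRetryHandler

# A single connection-error handler with its default settings.
async_default_handlers = [AsyncConnectionErrorRetryHandler()]
```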
```python
default_interval_calculator = BackoffRetryIntervalCalculator()


class RetryHandler:
```

This class is the main interface in this pull request.
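To make the review easier to follow, here is a rough sketch of the interface shape described in the summary below; the constructor arguments and keyword-only parameters are illustrative assumptions rather than the exact implementation.

```python
# Sketch of the RetryHandler interface as described in this PR's summary;
# parameter names here are assumptions for illustration.
from typing import Optional


class RetryIntervalCalculator:
    """Computes how long to wait before the next attempt."""

    def calculate_sleep_duration(self, current_attempt: int) -> float:
        raise NotImplementedError()


class RetryHandler:
    def __init__(
        self,
        max_retry_count: int = 1,
        interval_calculator: Optional[RetryIntervalCalculator] = None,
    ):
        self.max_retry_count = max_retry_count
        self.interval_calculator = interval_calculator

    def can_retry(self, *, state, request, response, error) -> bool:
        """Return True if this handler wants the client to retry the request."""
        return False

    def prepare_for_next_retry(self, *, state, request, response, error) -> None:
        """Sleep (or otherwise wait) until the next attempt should run."""
```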
```python
from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class MyRetryHandler(RetryHandler):
```

A custom retry handler used for testing.
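A minimal sketch of such a test-only handler is below, assuming the can_retry shape described in the summary; the constructor argument, counter, and retry condition are illustrative only.

```python
# Hypothetical test-only handler: retries a couple of times on any exception
# and records how often it was consulted so the test can assert on it.
from slack_sdk.http_retry.handler import RetryHandler


class MyRetryHandler(RetryHandler):
    def __init__(self, max_retry_count: int = 2):
        super().__init__(max_retry_count=max_retry_count)
        self.call_count = 0

    def can_retry(self, *, state, request, response, error) -> bool:
        self.call_count += 1
        return error is not None  # illustrative condition
```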
```diff
 if pattern == "rate_limited":
     self.send_response(429)
-    self.send_header("Retry-After", 30)
+    self.send_header("Retry-After", 1)
```

Changed the value for faster test execution (we really don't want to wait for 30 seconds).
Codecov Report

```
@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
+ Coverage   85.62%   86.09%   +0.46%
==========================================
  Files          99      110      +11
  Lines        9324     9847     +523
==========================================
+ Hits         7984     8478     +494
- Misses       1340     1369      +29
```

Continue to review the full report at Codecov.
Applied the following changes:
I plan on reviewing today - it is a big PR, so I did not end up having time yesterday.

@filmaj Thanks! No rush at all. I know this includes a lot of changes. I am thinking that this pull request could add more unit tests covering rate-limited error patterns, for safety.
seratch left a comment:

Added more comments for reviewers.
```diff
     http_verb=http_verb,
     url=url,
-    body_params=body_params,
+    body=body_params,
```

Since _perform_http_request is an internal method, we can safely rename this argument.
```python
counter_for_safety = 0
while counter_for_safety < 100:
```

We may want to remove this counter for simplicity. `while True` here should be safe enough, as `retry_state.next_attempt_requested` is usually False.
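To illustrate why `while True` is safe here, the loop only continues when a handler explicitly requests another attempt; a self-contained sketch of that shape (with stand-in names, not the client's actual internals) follows.

```python
# Stand-in sketch of the retry loop without the safety counter; names here are
# illustrative and not the client's real internals.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetryState:
    current_attempt: int = 0
    next_attempt_requested: bool = False


def perform_with_retries(
    attempt: Callable[[], object],
    handlers: List[Callable[[RetryState, object], None]],
) -> object:
    state = RetryState()
    while True:  # exits as soon as no handler asks for another attempt
        state.next_attempt_requested = False
        response = attempt()
        for handler in handlers:
            handler(state, response)  # a handler may set next_attempt_requested
        if not state.next_attempt_requested:
            return response
        state.current_attempt += 1
```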
```python
error_types: List[Exception] = [
    ServerConnectionError,
    ServerDisconnectedError,
    # ClientOSError: [Errno 104] Connection reset by peer
    ClientOSError,
],
```

These are aiohttp-specific exceptions.
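For example, a caller could narrow or extend that list when constructing the handler; the `error_types` keyword below mirrors this diff, while the module path and `max_retry_count` argument are assumptions.

```python
# Sketch: configuring the async connection-error handler with aiohttp-specific
# exception types. The import path for the handler is assumed.
from aiohttp import ClientOSError, ServerDisconnectedError

from slack_sdk.http_retry.builtin_async_handlers import AsyncConnectionErrorRetryHandler

handler = AsyncConnectionErrorRetryHandler(
    max_retry_count=2,
    error_types=[
        ServerDisconnectedError,
        ClientOSError,  # e.g. [Errno 104] Connection reset by peer
    ],
)
```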
```python
    return False


class ServerErrorRetryHandler(RetryHandler):
```

I've added this one as a reference implementation, but it's unused. We may want to remove it for now.

I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases, like a legitimate outage on Slack's side.

Yes, this is fair enough 👍

> I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases, like a legitimate outage on Slack's side.

Sorry to necro, but I think it's worth reconsidering this decision. While it may be undesirable from Slack's perspective to exponentially back off when their API is returning 5xxs, I think this is what SDK consumers will want. And I think it makes sense to do what the consumer wants here, because if we don't, the consumer is just going to implement their own exponential backoff logic that includes 5xxs. This is my plan, anyway.

Thanks for your work here!

This feels like a good time to mention the Circuit Breaker pattern.
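For SDK consumers who do want to retry on 5xx responses, a handler along these lines seems feasible to implement on the application side; the signature and status-code check below follow the shape described in the summary and are assumptions, not code from this PR.

```python
# Hypothetical consumer-side handler that retries on HTTP 5xx responses; the
# can_retry signature and response attribute are assumptions for illustration.
from slack_sdk.http_retry.handler import RetryHandler


class ServerErrorRetryHandler(RetryHandler):
    def can_retry(self, *, state, request, response, error) -> bool:
        return response is not None and 500 <= response.status_code < 600
```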
```python
    duration += random.random()
else:
    duration = (
        int(response.headers.get(retry_after_header_name)[0]) + random.random()
```

The random.random() here adds a random jitter, but it might not be necessary in this branch; this is not a backoff calculation.
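For contrast, the backoff branch presumably does something along these lines, based on the "exponential backoff and jitter" description in the summary; the exact formula and factor are assumptions.

```python
# Illustrative exponential backoff with jitter; backoff_factor and the formula
# are assumptions, not copied from BackoffRetryIntervalCalculator.
import random


def backoff_with_jitter(current_attempt: int, backoff_factor: float = 0.5) -> float:
    # 0.5s, 1s, 2s, 4s, ... plus up to one extra second of random jitter
    return backoff_factor * (2 ** current_attempt) + random.random()
```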
```python
from .interval_calculator import RetryIntervalCalculator


class FixedValueRetryIntervalCalculator(RetryIntervalCalculator):
```

Just as a reference; it is unused with the default settings. I should add some tests for this.

If it's unused, you probably don't need to worry about the tests. Unless you want to keep the coverage scores high 😆

Haha, yeah, I always like better coverage!
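If tests do get added, one could be as small as the sketch below; the import path, constructor argument, and method name are assumptions about the calculator's interface, not taken from this PR.

```python
# Hypothetical unit test; the module path, fixed_interval argument, and
# calculate_sleep_duration method name are assumptions for illustration.
from slack_sdk.http_retry.builtin_interval_calculators import (
    FixedValueRetryIntervalCalculator,
)


def test_fixed_value_interval_calculator():
    calculator = FixedValueRetryIntervalCalculator(fixed_interval=1.5)
    # A fixed calculator should ignore the attempt number entirely.
    assert calculator.calculate_sleep_duration(0) == 1.5
    assert calculator.calculate_sleep_duration(5) == 1.5
```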
filmaj left a comment:

Wow, lots of work and you did a great job! Thanks for involving me in the review.

I left a few comments, mostly for my own learning and education.
```python
duration = (
    int(response.headers.get(retry_after_header_name)[0]) + random.random()
)
time.sleep(duration)
```

Theoretically, using the synchronous client, if the API responds with a relatively large value in the Retry-After header (e.g., the docs for this header show an example value of 30), would this freeze the entire process?

> would this freeze the entire process?

When it comes to the same thread, yes. Thinking about the behavior of the whole app, it depends on how the app is implemented. In the case of Bolt for Python, all the code except ack() is executed in a background thread, so this does not result in a 3-second timeout.

By default, we don't enable retries for rate-limited errors. Developers should turn them on with a clear understanding of the potentially long pause.
```python
from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class AsyncConnectionErrorRetryHandler(RetryHandler):
```

Since this async implementation relies on the same base class that is shared with the sync implementation, and the base RetryHandler class's prepare_for_next_request uses Python's built-in sleep method, could this lead to a situation where we block the process even when using an async handler?

I am not very familiar with aiohttp, but it seems to be based on the asyncio library, which has its own async-friendly sleep implementation (or, at least, this aiohttp documentation page implies that such an async sleep exists; search for asyncio on that page for the relevant section).

I am posing this question from a place of ignorance and a desire to learn, so it is likely I am completely off. But asking dumb questions is helpful for me to learn 🤪

@filmaj Ah, this is a great point! Yes, we should use asyncio.sleep here instead; I was aware of it but somehow forgot to override the method. We can have a base RetryHandler class for the async clients that uses asyncio's sleep method, with all of its methods being async methods. I will update this part shortly.
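A rough sketch of that follow-up change, assuming an async base class that mirrors the sync one; the method and attribute names below follow this thread's description and are not the final implementation.

```python
# Hypothetical async base handler that awaits asyncio.sleep instead of blocking
# the event loop; attribute and method names are assumptions for illustration.
import asyncio

from slack_sdk.http_retry.handler import RetryHandler


class AsyncRetryHandler(RetryHandler):
    async def prepare_for_next_retry(self, *, state, request, response, error) -> None:
        duration = self.interval_calculator.calculate_sleep_duration(state.current_attempt)
        await asyncio.sleep(duration)  # yields control back to the event loop
```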
```python
header = self.headers["Authorization"]
if header is not None and "xoxp-" in header:
    pattern = str(header).split("xoxp-", 1)[1]
    if "remote_disconnected" in pattern:
```

Very nice pattern, I like this a lot!
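In other words, a test can choose the failure scenario purely through the fake token; a hypothetical usage of the local test server would look like this (the port and construction details are assumptions).

```python
# Hypothetical test setup: the token suffix tells the local test server which
# failure to simulate (here, a dropped connection). The port is an assumption.
from slack_sdk import WebClient

client = WebClient(
    token="xoxp-remote_disconnected",
    base_url="http://localhost:8888/",
)
```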
Fixed all the issues in the latest revision. Let me merge this PR now. I will release an RC version to get feedback from the community.
Summary
This pull request fixes #887 by adding the new `RetryHandler` feature to all the API clients (except the legacy ones under the `slack` package). With the default settings, the API clients do one retry, and only for connectivity issues such as the "Connection reset by peer" error. For the retry intervals, the built-in retry handlers behave in the manner of exponential backoff and jitter.

To customize the behavior, you can pass your own `retry_handlers` argument to the API client constructors (see the sketch after the client list below).

If an API client with retry handlers encounters an error, it runs each handler's `def can_retry(args) -> bool` method. If any of those method calls returns True, the client runs its `def prepare_for_next_retry(args) -> None` method to wait for the right timing. Then the same API request is performed again until the client hits the handler's `max_retry_count`.

In this pull request, I've updated the following API clients:
- `slack_sdk.web.WebClient`
- `slack_sdk.webhook.WebhookClient`
- `slack_sdk.audit_logs.AuditLogsClient`
- `slack_sdk.scim.SCIMClient`
- `slack_sdk.web.async_client.AsyncWebClient` (aiohttp/asyncio compatible)
- `slack_sdk.webhook.async_client.AsyncWebhookClient` (aiohttp/asyncio compatible)
- `slack_sdk.audit_logs.async_client.AsyncAuditLogsClient` (aiohttp/asyncio compatible)
- `slack_sdk.scim.async_client.AsyncSCIMClient` (aiohttp/asyncio compatible)

You can reuse retry handlers across the above API clients, as sketched below:
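The original code examples are not reproduced here, so the following is only a rough usage sketch; the built-in handler's module path and constructor arguments are assumptions, not taken verbatim from this pull request.

```python
# Hypothetical usage sketch: one handler list shared by multiple clients.
# The builtin_handlers module path and max_retry_count argument are assumptions.
from slack_sdk.http_retry.builtin_handlers import ConnectionErrorRetryHandler
from slack_sdk.web import WebClient
from slack_sdk.webhook import WebhookClient

retry_handlers = [ConnectionErrorRetryHandler(max_retry_count=2)]

web_client = WebClient(token="xoxb-...", retry_handlers=retry_handlers)
webhook_client = WebhookClient(
    url="https://hooks.slack.com/services/...",
    retry_handlers=retry_handlers,
)
```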
TODOs
Category (place an `x` in each of the `[ ]`)

- `/docs-src` (Documents, have you run `./docs.sh`?)
- `/docs-src-v2` (Documents, have you run `./docs-v2.sh`?)
- `/tutorial` (PythOnBoardingBot tutorial)
- `tests`/`integration_tests` (Automated tests for this library)

Requirements (place an `x` in each `[ ]`)

- `python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.sh` after making the changes.