Retry attempts that fail due to a connection timeout #24
Conversation
It doesn't like retry policy bounds == None. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Other OSErrors like `EINTR` could indicate a call was interrupted after it was received by the server, which would potentially not be idempotent. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Only make it non-null for retryable requests Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
every time. Whereas the previous approach they passed in ten seconds. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
sander-goos
left a comment
Left some comments. Could you please link in the description the resources that assure that the errors we catch are safe to retry (and are platform independent)?
tests/e2e/driver_tests.py (Outdated)

```python
with self.assertRaises(OperationalError) as cm:
```
This can be more specific and assert RequestError.
I actually removed the e2e tests for this behaviour because they were no more useful than unit tests. I'll add an e2e test for this scenario after we merge support for http proxies. Then we can simulate real timeouts.
The unit tests assert on RequestError, btw.
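Purely as an illustration of the assertion being discussed, here is a self-contained, unit-test-style sketch. The RequestError stub and the make_request_with_retry helper are hypothetical stand-ins, not the connector's real classes or test harness.

```python
# Illustrative sketch only: stand-ins for the connector's retry wrapper and exception.
import errno
import unittest
from unittest.mock import Mock


class RequestError(Exception):
    """Stand-in for databricks.sql.exc.RequestError."""


def make_request_with_retry(method, request, max_attempts=2):
    """Call method(request), retrying on OSError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return method(request)
        except OSError:
            if attempt == max_attempts:
                raise RequestError(f"Gave up after {attempt} attempts") from None


class GetOperationStatusRetryTest(unittest.TestCase):
    def test_raises_request_error_after_retries(self):
        # Every call to the mocked method raises a socket timeout (an OSError subclass)
        get_operation_status = Mock(side_effect=TimeoutError(errno.ETIMEDOUT, "timed out"))
        with self.assertRaises(RequestError):
            make_request_with_retry(get_operation_status, request=Mock(), max_attempts=2)
        # The call was retried the configured number of times before giving up
        self.assertEqual(get_operation_status.call_count, 2)


if __name__ == "__main__":
    unittest.main()
```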
Add retry_delay_default to use in this case. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Emit warnings for unexpected OSError codes Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
@benfleis @sander-goos I updated the PR description to reflect the present state after this week's reviews / updates.
benfleis
left a comment
looks good!
Which real connection tests did you perform to validate? Relatedly, it's perhaps worth adding a note to the PR (or near the retry code) explaining any manual e2e tests you used to validate behavior under "real conditions", so you (and others) won't have to think so hard about what's worth doing as a one-off manual test.
I used mitmweb to simulate connection timeouts. This process is very manual, so I'm not yet incorporating it into the test suite.

How I ran the tests

The actual integration test I ran looked like this:

```python
def test_make_request_will_retry_GetOperationStatus(self):

    import thrift, errno
    from databricks.sql.thrift_api.TCLIService.TCLIService import Client
    from databricks.sql.exc import RequestError
    from databricks.sql.utils import NoRetryReason
    from databricks.sql import client
    from databricks.sql.thrift_api.TCLIService import ttypes

    with self.cursor() as cursor:
        cursor.execute("SELECT 1")
        op_handle = cursor.active_op_handle

    req = ttypes.TGetOperationStatusReq(
        operationHandle=op_handle,
        getProgressUpdate=False,
    )

    EXPECTED_RETRIES = 2

    with self.cursor({"_socket_timeout": 10, "_retry_stop_after_attempts_count": 2}) as cursor:
        _client = cursor.connection.thrift_backend._client
        with self.assertRaises(RequestError) as cm:
            breakpoint()  # At this point I instructed mitmweb to intercept and suspend all requests
            cursor.connection.thrift_backend.make_request(_client.GetOperationStatus, req)

    self.assertEqual(NoRetryReason.OUT_OF_ATTEMPTS.value, cm.exception.context["no-retry-reason"])
    self.assertEqual(f'{EXPECTED_RETRIES}/{EXPECTED_RETRIES}', cm.exception.context["attempt"])
```

I used this Dockerfile to make this work on my local machine:

```dockerfile
FROM python:3.7-slim-buster

RUN useradd --create-home pysql

# Ubuntu packages
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      curl \
      gnupg \
      build-essential \
      pwgen \
      libffi-dev \
      sudo \
      git-core \
      # Additional packages required for data sources:
      libssl-dev \
      libsasl2-modules-gssapi-mit && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN sudo mkdir /usr/local/share/ca-certificates/extra
RUN curl mitm.it/cert/pem > mitmcert.pem
RUN openssl x509 -in mitmcert.pem -inform PEM -out mitm.crt
RUN sudo cp mitm.crt /usr/local/share/ca-certificates/extra/mitm.crt
RUN sudo update-ca-certificates

RUN pip install poetry --user

COPY --chown=pysql . /pysql
RUN chown pysql /pysql

WORKDIR /pysql

RUN python -m poetry install
```

And configured Docker using its "proxies" setting:

```json
{
  "default": {
    "httpProxy": "http://<my local network ip address>:8080",
    "httpsProxy": "http://<my local network ip address>:8080",
    "noProxy": "pypi.org"
  }
}
```
Done here: 10016ea
This reverts commit 4db4ad0. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Thanks for the fix and the detailed explanation!
* Isolate delay bounding logic
* Move error details scope up one level
* Retry GetOperationStatus if an OSError was raised during execution. Add retry_delay_default to use in this case.
* Log when a request is retried due to an OSError. Emit warnings for unexpected OSError codes
* Update docstring for make_request
* Nit: unit tests show the .warn message is deprecated. DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead

Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Signed-off-by: Sai Shree Pradhan <saishree.pradhan@databricks.com>
Note for reviewers
On 4 August 2022 I reverted all changes in this PR so I could reimplement and apply all your review feedback. This happened in fa1fd50. Every subsequent commit encapsulates one logical change to the code. Working through them one at a time should be quite easy.
GetOperationStatus is retried for OSError in a55cf9d (thanks @sander-goos and @benfleis). The final four commits are simple code cleanup and fix one warning in the test suite that is unrelated to this change.
Since our e2e tests are not enabled via Github Actions yet, I ran them locally and all passed.
Description
Currently, we only retry attempts that returned a 429 or 503 status code and included a Retry-After header. This pull request also allows GetOperationStatus requests to be retried if the request fails with an OSError exception. The configured retry policy is still honoured with regard to maximum attempts and maximum retry duration.
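For concreteness, here is a minimal sketch of what honouring the retry policy means here: bound the number of attempts and the total retry duration, and re-raise once either limit is hit. The call_with_retries helper and its parameter names are illustrative assumptions, not the connector's actual make_request implementation.

```python
# Illustrative sketch only: not the connector's actual make_request / retry policy code.
import time


def call_with_retries(fn, *, max_attempts=5, max_duration_s=60.0, delay_s=5.0):
    """Call fn(), retrying on OSError while the attempt count and total duration allow."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError:
            out_of_attempts = attempt >= max_attempts
            out_of_time = (time.monotonic() - start) + delay_s > max_duration_s
            if out_of_attempts or out_of_time:
                raise  # give up and surface the last OSError to the caller
            time.sleep(delay_s)
```

In this PR the equivalent retry path applies only to GetOperationStatus calls, since those are the only requests we know are safe to retry after an OSError.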
Background

The reason we only retry 429 and 503 responses today is that retries must be idempotent; otherwise the connector could cause unexpected or harmful consequences (data loss, excess resource utilisation, etc.).
We know that retrying after a 429/503 response is safe because the attempt was halted before the server could execute it, regardless of whether the attempted operation was itself idempotent.
We also know that GetOperationStatus requests are idempotent because they do not modify data on the server. Therefore we can add an extra case to our retry allow list: a GetOperationStatus command that fails with an OSError such as TimeoutError or ConnectionResetError.

Previously we attempted this same behaviour by retrying GetOperationStatus requests regardless of the nature of the exception. But this change could not pass our e2e tests because there are valid cases where GetOperationStatus will raise an exception from within our own library code: for example, if an operation is canceled in a separate thread, GetOperationStatus will raise a "DOES NOT EXIST" exception.

Logging Behaviour
The connector will log whenever it retries an attempt because of an OSError. It will use log level INFO if the OSError is one we consider normal, and log level WARNING if the OSError seems unusual. The codes we consider normal include errno.ETIMEDOUT, discussed below.

The full set of OSError codes is platform specific. I wrote this patch to target a specific customer scenario where GetOperationStatus requests were retried after an operating system socket timeout exception. In this customer scenario the error code was Errno 110: Connection timed out. However, that error code is specific to Linux. On a Darwin/macOS host the code would be 60 and on Windows it would be 10060.

Rather than catch these specifically, I use Python's errno built-in module to check for errno.ETIMEDOUT, which resolves to the platform-specific code at runtime. I tested this manually on Linux and macOS (but not on Windows). @benfleis helped me pick what we consider "normal".

We log all other OSError codes with WARNING because it would be pretty unusual for a request to fail because of a "FileNotFound" or a system fault.
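As a rough illustration of the rule above (assuming a hypothetical log_oserror_retry helper; the connector's real code and its full list of "normal" codes may differ):

```python
# Rough sketch of the logging rule described above, not the connector's exact code.
# errno.ETIMEDOUT resolves to the platform-specific value at runtime
# (110 on Linux, 60 on macOS), so no per-platform constants are needed.
import errno
import logging

logger = logging.getLogger(__name__)

# Codes treated as "normal" for a retried GetOperationStatus call.
# errno.ETIMEDOUT is the case this patch targets; other codes may also be considered normal.
NORMAL_OSERROR_CODES = {errno.ETIMEDOUT}


def log_oserror_retry(err: OSError, attempt: int) -> None:
    """Log a retry caused by an OSError at INFO or WARNING depending on its code."""
    level = logging.INFO if err.errno in NORMAL_OSERROR_CODES else logging.WARNING
    logger.log(level, "Retrying after OSError [Errno %s] %s (attempt %d)",
               err.errno, err.strerror, attempt)
```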
References

I found this article extremely helpful while formulating this fix. The author is solving a very similar problem across platforms.