Retry attempts that fail due to a connection timeout #24
Conversation
It doesn't like retry policy bounds == None. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Other OSErrors like `EINTR` could indicate a call was interrupted after it was received by the server, which would potentially not be idempotent. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Only make it non-null for retryable requests Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
every time. Whereas the previous approach they passed in ten seconds. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
sander-goos
left a comment
Left some comments. Could you please link in the description the resources that assure that the errors we catch are safe to retry (and are platform independent)?
tests/e2e/driver_tests.py (Outdated)

```python
with self.assertRaises(OperationalError) as cm:
```
This can be more specific and assert RequestError.
I actually removed the e2e tests for this behaviour because they were no more useful than unit tests. I'll add an e2e test for this scenario after we merge support for http proxies. Then we can simulate real timeouts.
The unit tests assert on RequestError, btw.
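Purely as an illustration of the assertion being discussed, here is a self-contained, unit-test-style sketch. The RequestError stub and the make_request_with_retry helper are hypothetical stand-ins, not the connector's real classes or test harness.

```python
# Illustrative sketch only: stand-ins for the connector's retry wrapper and exception.
import errno
import unittest
from unittest.mock import Mock


class RequestError(Exception):
    """Stand-in for databricks.sql.exc.RequestError."""


def make_request_with_retry(method, request, max_attempts=2):
    """Call method(request), retrying on OSError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return method(request)
        except OSError:
            if attempt == max_attempts:
                raise RequestError(f"Gave up after {attempt} attempts") from None


class GetOperationStatusRetryTest(unittest.TestCase):
    def test_raises_request_error_after_retries(self):
        # Every call to the mocked method raises a socket timeout (an OSError subclass)
        get_operation_status = Mock(side_effect=TimeoutError(errno.ETIMEDOUT, "timed out"))
        with self.assertRaises(RequestError):
            make_request_with_retry(get_operation_status, request=Mock(), max_attempts=2)
        # The call was retried the configured number of times before giving up
        self.assertEqual(get_operation_status.call_count, 2)


if __name__ == "__main__":
    unittest.main()
```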
Add retry_delay_default to use in this case. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Emit warnings for unexpected OSError codes Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
@benfleis @sander-goos I updated the PR description to reflect the present state after this week's reviews / updates.
benfleis
left a comment
looks good!
Which real connection tests did you perform to validate? Relatedly, it's perhaps worth adding a note to the PR (or near the retry code) explaining any manual e2e tests you used to validate behavior under "real conditions", so you (and others) won't have to think so hard about what's worth doing as a one-off manual test.
I used mitmweb to simulate connection timeouts. This process is very manual, so I'm not yet incorporating it into the test suite.

How I ran the tests

The actual integration test I ran looked like this:

```python
def test_make_request_will_retry_GetOperationStatus(self):

    import thrift, errno
    from databricks.sql.thrift_api.TCLIService.TCLIService import Client
    from databricks.sql.exc import RequestError
    from databricks.sql.utils import NoRetryReason
    from databricks.sql import client
    from databricks.sql.thrift_api.TCLIService import ttypes

    with self.cursor() as cursor:
        cursor.execute("SELECT 1")
        op_handle = cursor.active_op_handle

    req = ttypes.TGetOperationStatusReq(
        operationHandle=op_handle,
        getProgressUpdate=False,
    )

    EXPECTED_RETRIES = 2

    with self.cursor({"_socket_timeout": 10, "_retry_stop_after_attempts_count": 2}) as cursor:
        _client = cursor.connection.thrift_backend._client
        with self.assertRaises(RequestError) as cm:
            breakpoint()  # At this point I instructed mitmweb to intercept and suspend all requests
            cursor.connection.thrift_backend.make_request(_client.GetOperationStatus, req)

    self.assertEqual(NoRetryReason.OUT_OF_ATTEMPTS.value, cm.exception.context["no-retry-reason"])
    self.assertEqual(f'{EXPECTED_RETRIES}/{EXPECTED_RETRIES}', cm.exception.context["attempt"])
```

I used this Dockerfile to make this work on my local machine:

```dockerfile
FROM python:3.7-slim-buster

RUN useradd --create-home pysql

# Ubuntu packages
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      curl \
      gnupg \
      build-essential \
      pwgen \
      libffi-dev \
      sudo \
      git-core \
      # Additional packages required for data sources:
      libssl-dev \
      libsasl2-modules-gssapi-mit && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN sudo mkdir /usr/local/share/ca-certificates/extra
RUN curl mitm.it/cert/pem > mitmcert.pem
RUN openssl x509 -in mitmcert.pem -inform PEM -out mitm.crt
RUN sudo cp mitm.crt /usr/local/share/ca-certificates/extra/mitm.crt
RUN sudo update-ca-certificates

RUN pip install poetry --user

COPY --chown=pysql . /pysql
RUN chown pysql /pysql

WORKDIR /pysql

RUN python -m poetry install
```

And configured Docker using its "proxies" setting:

```json
{
  "default": {
    "httpProxy": "http://<my local network ip address>:8080",
    "httpsProxy": "http://<my local network ip address>:8080",
    "noProxy": "pypi.org"
  }
}
```
Done here: 10016ea
This reverts commit 4db4ad0. Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Thanks for the fix and the detailed explanation!
* Isolate delay bounding logic
* Move error details scope up one level
* Retry GetOperationStatus if an OSError was raised during execution. Add retry_delay_default to use in this case.
* Log when a request is retried due to an OSError. Emit warnings for unexpected OSError codes
* Update docstring for make_request
* Nit: unit tests show the .warn message is deprecated. DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead

Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>
Signed-off-by: Sai Shree Pradhan <saishree.pradhan@databricks.com>
Note for reviewers
On 4 August 2022 I reverted all changes in this PR so I could reimplement and apply all your review feedback. This happened in fa1fd50. Every subsequent commit encapsulates one logical change to the code. Working through them one at a time should be quite easy.
GetOperationStatus is retried for OSError in a55cf9d (thanks @sander-goos and @benfleis). The final four commits are simple code cleanup and fix one warning in the test suite that is unrelated to this change.
Since our e2e tests are not enabled via Github Actions yet, I ran them locally and all passed.
Description
Currently, we only retry attempts that returned a 429 or 503 status code and included a Retry-After header. This pull request also allows GetOperationStatus requests to be retried if the request fails with an OSError exception. The configured retry policy is still honoured with regard to maximum attempts and maximum retry duration.
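For concreteness, here is a minimal sketch of what honouring the retry policy means here: bound the number of attempts and the total retry duration, and re-raise once either limit is hit. The call_with_retries helper and its parameter names are illustrative assumptions, not the connector's actual make_request implementation.

```python
# Illustrative sketch only: not the connector's actual make_request / retry policy code.
import time


def call_with_retries(fn, *, max_attempts=5, max_duration_s=60.0, delay_s=5.0):
    """Call fn(), retrying on OSError while the attempt count and total duration allow."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError:
            out_of_attempts = attempt >= max_attempts
            out_of_time = (time.monotonic() - start) + delay_s > max_duration_s
            if out_of_attempts or out_of_time:
                raise  # give up and surface the last OSError to the caller
            time.sleep(delay_s)
```

In this PR the equivalent retry path applies only to GetOperationStatus calls, since those are the only requests we know are safe to retry after an OSError.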
Background

The reason we only retry 429 and 503 responses today is that retries must be idempotent; otherwise the connector could cause unexpected or harmful consequences (data loss, excess resource utilisation, etc.).
We know that retrying after a 429/503 response is safe because the attempt was halted before the server could execute it, regardless of whether the attempted operation was itself idempotent.
We also know that GetOperationStatus requests are idempotent because they do not modify data on the server. Therefore we can add an extra case to our retry allow list: a GetOperationStatus command that fails with an OSError such as TimeoutError or ConnectionResetError.

Previously we attempted this same behaviour by retrying GetOperationStatus requests regardless of the nature of the exception. But this change could not pass our e2e tests because there are valid cases where GetOperationStatus will raise an exception from within our own library code: for example, if an operation is canceled in a separate thread, GetOperationStatus will raise a "DOES NOT EXIST" exception.

Logging Behaviour
The connector will log whenever it retries an attempt because of an OSError. It will use log level INFO if the OSError is one we consider normal, and log level WARNING if the OSError seems unusual. The codes we consider normal include errno.ETIMEDOUT, discussed below.

The full set of OSError codes is platform specific. I wrote this patch to target a specific customer scenario where GetOperationStatus requests were retried after an operating system socket timeout exception. In this customer scenario the error code was Errno 110: Connection timed out. However, that error code is specific to Linux. On a Darwin/macOS host the code would be 60 and on Windows it would be 10060.

Rather than catch these specifically, I use Python's errno built-in module to check for errno.ETIMEDOUT, which resolves to the platform-specific code at runtime. I tested this manually on Linux and macOS (but not on Windows). @benfleis helped me pick what we consider "normal".

We log all other OSError codes with WARNING because it would be pretty unusual for a request to fail because of a "FileNotFound" or a system fault.
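As a rough illustration of the rule above (assuming a hypothetical log_oserror_retry helper; the connector's real code and its full list of "normal" codes may differ):

```python
# Rough sketch of the logging rule described above, not the connector's exact code.
# errno.ETIMEDOUT resolves to the platform-specific value at runtime
# (110 on Linux, 60 on macOS), so no per-platform constants are needed.
import errno
import logging

logger = logging.getLogger(__name__)

# Codes treated as "normal" for a retried GetOperationStatus call.
# errno.ETIMEDOUT is the case this patch targets; other codes may also be considered normal.
NORMAL_OSERROR_CODES = {errno.ETIMEDOUT}


def log_oserror_retry(err: OSError, attempt: int) -> None:
    """Log a retry caused by an OSError at INFO or WARNING depending on its code."""
    level = logging.INFO if err.errno in NORMAL_OSERROR_CODES else logging.WARNING
    logger.log(level, "Retrying after OSError [Errno %s] %s (attempt %d)",
               err.errno, err.strerror, attempt)
```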
References

I found this article extremely helpful while formulating this fix. The author is solving a very similar problem across platforms.