Skip to content

Upon postgres segfault, connections in async for record in conn.cursor fail to detect loss of connectivity and hang indefinitely #549

@autumnjolitz

Description

@autumnjolitz
  • asyncpg version: 0.20.1
  • PostgreSQL version: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
  • Do you use a PostgreSQL SaaS? If so, which? Can you reproduce
    the issue with a local PostgreSQL install?
    : unknown - this depends on being somewhere critically inside the iterator for conn.cursor(...)
  • Python version: 3.7.7 (default, Mar 24 2020, 03:07:37) \n[GCC 9.2.0]
  • Platform: python:3.7-alpine
  • Do you use pgbouncer?: No.
  • Did you install asyncpg with pip?: yes
  • If you built asyncpg locally, which version of Cython did you use?: N/A
  • Can the issue be reproduced under both asyncio and
    uvloop?
    : uvloop is not used in this installation.

I made use of a asyncpg.pool.Pool created via create_pool with a command_timeout of None.

Midway through heavy load, the RDS instance crashed and rebooted, losing all connections.

However, the connections actively in use jammed and as of this time have been jammed for well over an hour. Using awaitwhat and a shell, I was able to determine that all the coroutines were jammed on the following:

    async def _execute_iter(self, conn):
        async for record in conn.cursor(self.query, *self.bindings.values):
            yield record

    async def execute(self):
        conn = self.conn()
        if conn is None:
            raise ValueError("Connection has been garbage collected!")

        cursor = self._execute_iter(conn)
        if self.iterable:
            return cursor
        items = []
        async for record in cursor:  # <---- jammed in here indefinitely!
            items.append(record)
        return tuple(items)

Upon grabbing a stalled coroutine with my backdoor shell, I discovered a surprising behavior!

>>> tasks[0].get_stack()[0].f_locals['conn']
<PoolConnectionProxy <asyncpg.connection.Connection object at 0x7fe02b594d70> 0x7fe02a56df10>
>>> tasks[0].get_stack()[0].f_locals['conn'].is_closed()
False
>>> tasks[0].get_stack()[0].f_locals['conn']
<PoolConnectionProxy <asyncpg.connection.Connection object at 0x7fe02b594d70> 0x7fe02a56df10>
>>> tasks[0].get_stack()[0].f_locals['conn'].is_in_transaction()
True
>>> tasks[0].get_stack()[0].f_locals['conn']._con.get_server_pid()
22062
>>> await tasks[0].get_stack()[0].f_locals['conn']._con.fetch('SELECT 1')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/aioconsole/execute.py", line 95, in aexec
    result, new_local = yield from coro
  File "<aexec>", line 2, in __corofn
  File "/usr/local/lib/python3.7/site-packages/asyncpg/connection.py", line 420, in fetch
    return await self._execute(query, args, 0, timeout)
  File "/usr/local/lib/python3.7/site-packages/asyncpg/connection.py", line 1403, in _execute
    query, args, limit, timeout, return_status=return_status)
  File "/usr/local/lib/python3.7/site-packages/asyncpg/connection.py", line 1411, in __execute
    return await self._do_execute(query, executor, timeout)
  File "/usr/local/lib/python3.7/site-packages/asyncpg/connection.py", line 1423, in _do_execute
    stmt = await self._get_statement(query, None)
  File "/usr/local/lib/python3.7/site-packages/asyncpg/connection.py", line 328, in _get_statement
    statement = await self._protocol.prepare(stmt_name, query, timeout)
  File "asyncpg/protocol/protocol.pyx", line 163, in prepare
asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation
>>> tasks[0].get_stack()[0].f_locals['conn'].is_closed()
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/aioconsole/execute.py", line 95, in aexec
    result, new_local = yield from coro
  File "<aexec>", line 2, in __corofn
  File "/usr/local/lib/python3.7/site-packages/asyncpg/pool.py", line 56, in call_con_method
    meth_name))
asyncpg.exceptions._base.InterfaceError: cannot call Connection.is_closed(): connection has been released back to the pool
>>> 

I submit to you that in the code for iterating through a cursor, something does not properly handle the case of a sudden connection reset. Notice how when I actually attempt to use the stalled connection which is blocked in an async for record in cursor generator, it suddenly realizes the connection is dead!

If I write b”x” in the socket itself, that also is enough to wake up asyncio to being disconnected!

I will experiment with a command_timeout to hopefully add in a worst-case timer for blocked queries, however this

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions