Skip to content

Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationToken#5976

Open
vonzshik wants to merge 3 commits intomainfrom
5975-wait-cancellation-token-cancel-deadlock-fix
Open

Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationToken#5976
vonzshik wants to merge 3 commits intomainfrom
5975-wait-cancellation-token-cancel-deadlock-fix

Conversation

@vonzshik
Copy link
Copy Markdown
Contributor

Fixes #5975

@vonzshik vonzshik requested a review from roji as a code owner December 17, 2024 10:29
Comment on lines +2629 to +2631
// We should set the timeout before checking the token as otherwise we're racing with the callback registered on cancellationToken
// Which can also change the timeout
cancellationToken.ThrowIfCancellationRequested();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is guaranteed to help. cancellationToken.ThrowIfCancellationRequested() only provides a read acquire barrier, and I don't see any other barriers as a part of setting ReadBuffer.Timeout, so the runtime is free to move the call to ThrowIfCancellationRequested() before the write to the timeout.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't according to memory model acquire barrier guarantees that any reads and writes can only happen after that read?

Volatile reads have "acquire semantics" - no read or write that is later in the program order may be speculatively executed ahead of a volatile read.

And from my understanding that's relevant if and only if ThrowIfCancellationRequested is inlined, as otherwise the runtime can't guarantee that it doesn't have side effects regarding ReadBuffer, so it can't move it after ReadBuffer.Timeout =.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You won't see an x64 cpu do so but on arm I suppose it could happen...

The JIT likely won't move the code as the write to the timeout and the read of the cts reference (until the if canceled branch) should be part of the same basic block, but that's mostly an implementation detail.

When I set these multi threaded timeout values in Slon I have an Interlocked.MemoryBarrier() after it, mostly to make sure it propagates to timer threads in a timely fashion. It would be enough here too (assuming this solves the deadlock, I haven't looked too closely).

I don't see the volatile read @WizardBrony referred to in ThrowIfCancellationRequested. Writing an object reference forces a barrier and is guaranteed to be atomic but reading an object reference has no barriers in .NET (unlike many low latency GCs that require them).

Copy link
Copy Markdown

@WizardBrony WizardBrony Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think I follow. My point was that this change depends on the write to ReadBuffer.Timeout happening before the token is checked, and according to the C# memory model, it is valid for the runtime to reorder the write after the read. Please let me know if I am missing something.

Copy link
Copy Markdown
Contributor Author

@vonzshik vonzshik Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one (keepalive) is performed whenever we do not have an active query, so we're not racing there (lock over SyncObj should guarantee that).
The second one is a bit more tricky. We call that method just before we start executing a query/start an action, so it's supposed to be the only one being able to change things around. But there is still a miniscule possibility of concurrent write due to StartCancellableOperation, which registers cancellation token callback. Overall a chance of something like that ever happening is extremely miniscule since cancellation callback has to do a lot of work before even being able to try and cancel a query, but just in case I did move it after changing the timeout.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the change in this PR, I still see a possible deadlock:

Say the call to NpgsqlConnection.Wait makes it all the way to this line in NpgsqlReadBuffer.EnsureLong without cancellation:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L295-L297
That line will see buffer.Timeout == TimeSpan.Zero.

Just before the call to buffer.Cts.Reset(), let's say cancellation occurs:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlConnector.cs#L1824-L1825

Now buffer.Cts.Reset() runs. Because the cancellation callback called ReadBuffer.Cts.Cancel(), Reset() will see _cts.IsCancellationRequested == true:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Util/ResettableCancellationTokenSource.cs#L119-L123

That will create a new, uncanceled CancellationTokenSource, and the cancellation callback will already be completed: Deadlock.

Copy link
Copy Markdown
Contributor Author

@vonzshik vonzshik Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code there is indeed problematic, gonna have to think for a bit whether we can work around without adding a lock there (NpgsqlReadBuffer.Ensure is the most hot method we have which actually does IO).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @vonzshik, just a side-note. Right before you replied to me I had deleted my comment asking about the writes to ReadBuffer.Timeout because I had come to the same conclusions you did, but also for the write in DoStartUserAction I realized that even if there was a race with the cancellation callback it wouldn’t really cause an issue for Wait, so I figured it was off-topic for this PR. But I appreciate you being thorough and fixing it for the non-Wait code paths anyway. Thanks for all your work on this.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vonzshik What I would recommend doing is simply taking the CancelLock during each concurrent access to NpgsqlReadBuffer's Cts and/or Timeout. For example, changing:

https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L295-L297

to:

CancellationToken finalCt;
lock (buffer.Connector.CancelLock)
{
    finalCt = async && buffer.Timeout != TimeSpan.Zero
        ? buffer.Cts.Start()
        : buffer.Cts.Reset();
}

Then we can remove ResettableCancellationTokenSource.lockObject since there will no longer be concurrent access to NpgsqlReadBuffer.Cts. That way we're not adding a lock to NpgsqlReadBuffer.Ensure, and accesses to Cts and Timeout are still thread-safe.

Additionally, async logic like this:

https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L360-L364

needs to be:

lock (buffer.Connector.CancelLock)
{
    if (async && buffer.Timeout == NpgsqlConnector._cancelImmediatelyTimeout)
        continue;

    if (cancellationTimeout > 0)
        buffer.Timeout = TimeSpan.FromMilliseconds(cancellationTimeout);

    if (async)
        finalCt = buffer.Cts.Start();
}

Otherwise if cancellation races with timeout and cancellation wins, the _cancelImmediatelyTimeout will get stomped on.

I also have a question about that logic. The documentation states that a Cancellation Timeout of "0 means infinite wait." But if buffer.Timeout is only updated when CancellationTimeout > 0, how does it ever infinitely wait if buffer.Timeout > TimeSpan.Zero?

@WizardBrony
Copy link
Copy Markdown

I’m guessing the answer to this is probably “no,” but I just thought I’d ask….

Is there any way this fix could be back-ported to 7.0?

@vonzshik
Copy link
Copy Markdown
Contributor Author

Is there any way this fix could be back-ported to 7.0?

We did discuss our support policy just before 9.0, and we mostly agreed that it's better if we align with how .NET is supported (about 3 years for LTS releases and 1.5 for non-LTS). We really should sit down and properly document it. cc @roji

@WizardBrony
Copy link
Copy Markdown

WizardBrony commented Jan 16, 2025

I have one other note about NpgsqlConnection.WaitAsync. The documentation states "Specifying -1 also indicates an infinite time-out period." This is (currently) only true when keepalive is enabled. If keepalive is disabled, WaitAsync immediately returns false.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock while cancelling NpgsqlConnection.WaitAsync

3 participants