Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationToken#5976
Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationToken#5976
Conversation
| // We should set the timeout before checking the token as otherwise we're racing with the callback registered on cancellationToken | ||
| // Which can also change the timeout | ||
| cancellationToken.ThrowIfCancellationRequested(); |
There was a problem hiding this comment.
I don't think this is guaranteed to help. cancellationToken.ThrowIfCancellationRequested() only provides a read acquire barrier, and I don't see any other barriers as a part of setting ReadBuffer.Timeout, so the runtime is free to move the call to ThrowIfCancellationRequested() before the write to the timeout.
There was a problem hiding this comment.
Isn't according to memory model acquire barrier guarantees that any reads and writes can only happen after that read?
Volatile reads have "acquire semantics" - no read or write that is later in the program order may be speculatively executed ahead of a volatile read.
And from my understanding that's relevant if and only if ThrowIfCancellationRequested is inlined, as otherwise the runtime can't guarantee that it doesn't have side effects regarding ReadBuffer, so it can't move it after ReadBuffer.Timeout =.
There was a problem hiding this comment.
You won't see an x64 cpu do so but on arm I suppose it could happen...
The JIT likely won't move the code as the write to the timeout and the read of the cts reference (until the if canceled branch) should be part of the same basic block, but that's mostly an implementation detail.
When I set these multi threaded timeout values in Slon I have an Interlocked.MemoryBarrier() after it, mostly to make sure it propagates to timer threads in a timely fashion. It would be enough here too (assuming this solves the deadlock, I haven't looked too closely).
I don't see the volatile read @WizardBrony referred to in ThrowIfCancellationRequested. Writing an object reference forces a barrier and is guaranteed to be atomic but reading an object reference has no barriers in .NET (unlike many low latency GCs that require them).
There was a problem hiding this comment.
I don’t think I follow. My point was that this change depends on the write to ReadBuffer.Timeout happening before the token is checked, and according to the C# memory model, it is valid for the runtime to reorder the write after the read. Please let me know if I am missing something.
There was a problem hiding this comment.
I don't see the volatile read @WizardBrony referred to in ThrowIfCancellationRequested.
And _state is marked as volatile.
There was a problem hiding this comment.
The first one (keepalive) is performed whenever we do not have an active query, so we're not racing there (lock over SyncObj should guarantee that).
The second one is a bit more tricky. We call that method just before we start executing a query/start an action, so it's supposed to be the only one being able to change things around. But there is still a miniscule possibility of concurrent write due to StartCancellableOperation, which registers cancellation token callback. Overall a chance of something like that ever happening is extremely miniscule since cancellation callback has to do a lot of work before even being able to try and cancel a query, but just in case I did move it after changing the timeout.
There was a problem hiding this comment.
Even with the change in this PR, I still see a possible deadlock:
Say the call to NpgsqlConnection.Wait makes it all the way to this line in NpgsqlReadBuffer.EnsureLong without cancellation:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L295-L297
That line will see buffer.Timeout == TimeSpan.Zero.
Just before the call to buffer.Cts.Reset(), let's say cancellation occurs:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlConnector.cs#L1824-L1825
Now buffer.Cts.Reset() runs. Because the cancellation callback called ReadBuffer.Cts.Cancel(), Reset() will see _cts.IsCancellationRequested == true:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Util/ResettableCancellationTokenSource.cs#L119-L123
That will create a new, uncanceled CancellationTokenSource, and the cancellation callback will already be completed: Deadlock.
There was a problem hiding this comment.
The code there is indeed problematic, gonna have to think for a bit whether we can work around without adding a lock there (NpgsqlReadBuffer.Ensure is the most hot method we have which actually does IO).
There was a problem hiding this comment.
Hey @vonzshik, just a side-note. Right before you replied to me I had deleted my comment asking about the writes to ReadBuffer.Timeout because I had come to the same conclusions you did, but also for the write in DoStartUserAction I realized that even if there was a race with the cancellation callback it wouldn’t really cause an issue for Wait, so I figured it was off-topic for this PR. But I appreciate you being thorough and fixing it for the non-Wait code paths anyway. Thanks for all your work on this.
There was a problem hiding this comment.
@vonzshik What I would recommend doing is simply taking the CancelLock during each concurrent access to NpgsqlReadBuffer's Cts and/or Timeout. For example, changing:
to:
CancellationToken finalCt;
lock (buffer.Connector.CancelLock)
{
finalCt = async && buffer.Timeout != TimeSpan.Zero
? buffer.Cts.Start()
: buffer.Cts.Reset();
}Then we can remove ResettableCancellationTokenSource.lockObject since there will no longer be concurrent access to NpgsqlReadBuffer.Cts. That way we're not adding a lock to NpgsqlReadBuffer.Ensure, and accesses to Cts and Timeout are still thread-safe.
Additionally, async logic like this:
needs to be:
lock (buffer.Connector.CancelLock)
{
if (async && buffer.Timeout == NpgsqlConnector._cancelImmediatelyTimeout)
continue;
if (cancellationTimeout > 0)
buffer.Timeout = TimeSpan.FromMilliseconds(cancellationTimeout);
if (async)
finalCt = buffer.Cts.Start();
}Otherwise if cancellation races with timeout and cancellation wins, the _cancelImmediatelyTimeout will get stomped on.
I also have a question about that logic. The documentation states that a Cancellation Timeout of "0 means infinite wait." But if buffer.Timeout is only updated when CancellationTimeout > 0, how does it ever infinitely wait if buffer.Timeout > TimeSpan.Zero?
|
I’m guessing the answer to this is probably “no,” but I just thought I’d ask…. Is there any way this fix could be back-ported to 7.0? |
We did discuss our support policy just before 9.0, and we mostly agreed that it's better if we align with how .NET is supported (about 3 years for LTS releases and 1.5 for non-LTS). We really should sit down and properly document it. cc @roji |
|
I have one other note about |
Fixes #5975