Fix RealMutableStore write-queue data race (inverted lock polarity)#735
Conversation
6a974db to
d84ded3
Compare
|
@matt-ramotar would you be able to take a look at this when you get a chance? We're hitting this race in production — it manifests as a hard Root cause is the inverted lock polarity in |
|
@MartinStrambach Thanks for reporting! Looking |
The `build-and-test` checkout pinned `ref: ${{ github.head_ref || github.ref }}`
without a matching `repository`, so for cross-repository (fork) PRs Actions
looked for the head branch in the base repo and failed at checkout in ~7s with
"a branch or tag with the name '<head-branch>' could not be found" — before any
build or test ran. This affected all external/fork contributions (e.g. #735).
- Checkout: resolve `repository` and `ref` from the PR head when present
(`github.event.pull_request.head.*`), falling back to `github.repository` /
`github.ref` for push builds. The `ref` uses the head branch *name* (not the
head SHA) so HEAD stays attached to a branch — the KMMBridge plugin runs
`git pull --tags`, which fails on a detached HEAD ("you are not currently on a
branch"). Fork PRs now check out the contributor's head branch; same-repo PRs
and pushes to main are unchanged.
- Codecov: skip the upload on fork PRs, where `CODECOV_TOKEN` is unavailable and
`fail_ci_if_error: true` would otherwise fail the job. Coverage is still
uploaded and enforced for same-repo PRs and pushes to main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@MartinStrambach can you rebase? |
The per-key write-request queue is a non-thread-safe ArrayDeque, but withWriteRequestQueueLock guarded it with a Lightswitch — a shared/reader lock that lets multiple holders run concurrently. So addWriteRequestToQueue (add) and updateWriteRequestQueue (iterate + rebuild) could run on the same deque at once. A structural add during iteration corrupts the backing array: ConcurrentModificationException on the JVM, EXC_BAD_ACCESS on Kotlin/Native. The only exclusive (mutex) holder was a pure read (getLatestWriteRequest) — the polarity was inverted. Guard all write-queue access with the per-key exclusive mutex instead, so add/iterate/rebuild mutually exclude each other and getLatestWriteRequest. Adds MutableStoreConcurrencyTest reproducing the race (red pre-fix on both JVM and native, green after). Lightswitch is now unused in RealMutableStore; left in place to keep the diff focused (can be removed as a follow-up). Signed-off-by: Martin Strambach <martin.strambach@gmail.com>
After the previous commit nothing acquires the per-key Lightswitch: withWriteRequestQueueLock uses writeRequests.mutex directly, and readCompletions already used readCompletions.mutex. The lightswitch field on StoreThreadSafety was therefore read by nothing. Drop the field and delete the class. All three types (Lightswitch, StoreThreadSafety, ThreadSafety) are internal and absent from the binary-compatibility-validator dumps, so this changes no public API/ABI: jvmApiCheck and klibApiCheck both pass unchanged. Signed-off-by: Martin Strambach <martin.strambach@gmail.com>
d84ded3 to
d75ed05
Compare
|
Rebased onto |
|
@MartinStrambach looks like you just need to run ktlintFormat |
Signed-off-by: Martin Strambach <martin.strambach@gmail.com>
|
@matt-ramotar should be fixed |
Fix
RealMutableStorewrite-queue data race (inverted lock polarity)Summary
RealMutableStore's per-key write-request queue is a non-thread-safeArrayDeque(
internal/definition/WriteRequestQueue.kt), but the lock guarding it has inverted polarity:the mutators take a shared lock while the only exclusive holder is a pure read. Concurrent
writes to the same key therefore mutate the deque while another coroutine iterates it, corrupting
the backing array.
ConcurrentModificationException, whichRealMutableStorecatches and surfaces asStoreWriteResponse.Error.Exception.addreallocates the backing array under an activeiteration → freed-pointer dereference →
EXC_BAD_ACCESS(a hard process crash).This reproduces in production on a multithreaded native dispatcher under sustained same-key writes
(~every few seconds across concurrent streams).
Root cause
withWriteRequestQueueLockguards the queue with aLightswitch— a shared/reader lock that letsmultiple holders run concurrently:
All mutating access flows through this method:
addWriteRequestToQueue→queue.add(writeRequest)updateWriteRequestQueue→for (writeRequest in this) { ... }(iterate + rebuild)Because both take the shared lightswitch, they do not exclude each other. The only operation
taking the exclusive
writeRequests.mutexisgetLatestWriteRequest→queue.last(), a pureread. The polarity is backwards: the mutators share, the reader is exclusive.
Fix
Guard all write-queue access with the per-key exclusive mutex, so
add/ iterate / rebuildmutually exclude each other (and exclude
getLatestWriteRequest, which already useswriteRequests.mutex):Regression test
MutableStoreConcurrencyTest(commonTest) fires 64 concurrentwrites to the same key over 50rounds on
Dispatchers.Default, plus a sequential baseline. It asserts no corruption-class error(
ConcurrentModificationException/NullPointerException/IndexOutOfBoundsException); on Native,reaching the assertion at all proves the process didn't crash.
Verified:
:store:jvmTest): RED pre-fix (NullPointerException, 64/64 in round 0) → GREEN post-fix.:store:iosSimulatorArm64Test): RED pre-fix (kotlin.ConcurrentModificationException,round 19) → GREEN post-fix.
Notes for reviewers
Lightswitchremoved (2nd commit). OncewithWriteRequestQueueLockstopped using it, nothingacquired it anywhere —
readCompletionsalready locked via its ownmutex, so thelightswitchfield on
StoreThreadSafetywas read by nothing. The 2nd commit drops the field and deletes theclass.
Lightswitch/StoreThreadSafety/ThreadSafetyare allinternaland absent from thebinary-compatibility-validator dumps, so this changes no public API/ABI —
jvmApiCheckandklibApiCheckboth pass unchanged. (Happy to split this into a follow-up PR if you'd prefer thefix land in isolation.)
kotlinx.coroutinesMutexis non-reentrant. The call graph has no nestedacquisition of the same
writeRequests.mutex(getLatestWriteRequestunlocks infinallybeforetryUpdateServer→updateWriteRequestQueue;addWriteRequestToQueueruns inonEachbefore thecollectblock), so this is safe. One caveat worth a maintainer's eye:updateWriteRequestQueueinvokes user
onCompletion(s)callbacks while holding the queue lock — a callback that re-entersthe same key's write path would now deadlock instead of (unsafely) racing. Re-entrant store writes
from a completion callback are an unusual pattern, but flagging it explicitly.
Base
Branched off
main(5.1.0-alpha08). The same methods exist unchanged back through5.1.0-alpha06,so the fix backports cleanly.