Prevent alarm update races with promise chain serialization #5431

MellowYarker · 2025-10-29T17:19:04Z

alarmLaterTasks is a background task queue for updating the alarm manager when alarms are moved to a later time (including deletion). Tasks in this queue retry up to 4 times on failure.

Prior to this commit, a race condition could occur:

Alarm handler completes, queuing a deferred delete in alarmLaterTasks
User calls setAlarm() with a new time, which precommits to alarm manager
The deferred delete from step 1 retries and overwrites the new alarm

This causes the alarm manager to have no alarm while SRS has the new alarm time, leading to alarms that never fire (or fire late if the alarm time was set to run later, not never).

To fix this race, we clear alarmLaterTasks whenever a new alarm is set. This is safe because:

If we have a pending "move later" task, the alarm manager already has an earlier alarm time that will fire and reconcile via armAlarmHandler
The new alarm being set supersedes any pending "move later" operations anyway

This ensures the alarm manager always converges to the correct state without racing background updates.

MellowYarker · 2025-10-29T17:22:58Z

I can push all this to my other open PR but figured it might be better to just close it as is.

kentonv · 2025-10-29T17:29:24Z

Ah hah, that's a good catch!

This should help, but canceling probably doesn't fully resolve the race. It takes some time for the cancellation to propagate to the backend. It's still possible that a subsequent write gets to the backend first and gets applied, and then the canceled request applies after it, before realizing it has been canceled.

To fully avoid this what we really need to do is wait for all previous requests to complete before sending a new request. We might need to arrange our alarm updates into a chain rather than a TaskSet...

codspeed-hq · 2025-10-29T17:42:54Z

CodSpeed Performance Report

Merging #5431 will improve performances by 12.33%

_{Comparing milan/STOR-4521-redo (bdcc22e) with main (b0a7b2f)}

Summary

⚡ 1 improvement
✅ 33 untouched
⏩ 9 skipped¹

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
⚡	`simpleStringBody[Response]`	22.5 µs	20 µs	+12.33%

9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

justin-mp · 2025-10-29T19:04:13Z

To fully avoid this what we really need to do is wait for all previous requests to complete before sending a new request. We might need to arrange our alarm updates into a chain rather than a TaskSet...

The commitImpl has a pendingCommit that we could use to synchronize requests. It serves a slightly different purpose, but maybe we adapt that to run the commits in order?

src/workerd/io/actor-sqlite.c++

kentonv · 2025-11-03T14:33:37Z

src/workerd/io/actor-sqlite.h

+  // Promise that cleans up alarmLaterChain after all operations complete. This promise gets
+  // cancelled if we queue another "move alarm later" request, to prevent it from cancelling the
+  // new in-flight request.
+  kj::Maybe<kj::Promise<void>> alarmChainCleanup;


I think this cleanup promise is unnecessary. It seems to exist purely to set alarmLaterChain back to null when the chain completes. You can instead leave alarmLaterChain as a promise that has already resolved -- this just means we'll harmlessly wait on an already-resolved promise at the start of the next alarm update. In fact, alarmLaterChain need not be a Maybe, it can instead be initialized to kj::Promise<void>(kj::READY_NOW).

I think this would simplify the code quite a bit.

I started off by doing kj::ForkedPromise<void> alarmLaterChain = kj::Promise<void>(kj::READY_NOW).fork();, but then ran into an issue because we wait for any pending move later alarms before we call requestScheduledAlarm() in startPrecommitAlarmScheduling(), when. That breaks sync scheduling in Workerd.

It looks like:

state.schedulingPromise = alarmLaterChain.addBranch().then([this, alarmTime]() { return requestScheduledAlarm(alarmTime); });

so the request to the alarm manager happens on the next event loop turn. This caused the synchronous alarm scheduling tests in actor-sqlite-test to fail.

So after I saw that I figured it would be easier to force it to be sync by clearing the chain. But now I'm wondering if that means:

Move earlier

Move later

Move earlier

would break Workerd because (3) is forced to be async waiting on (2)...

I think what we need here is to drop the cleanup approach, and instead add a method to ActorSqliteHooks which returns a bool indicating whether we're sync (workerd) or async (production).

If we're sync, we just won't ever care about the alarmLaterChain, because everything is sync anyways. If we're async, we care about waiting for the chain + extending the chain.

Does that make sense?

Here is my attempt at fixing that: https://github.com/cloudflare/workerd/compare/5169e0066e0944c2c1194e6b6c368c0022db760c..436a5a6a24847e6f0fed36837180ae34cac922d3

I tried switching to using coroutines instead of relying on the isSynchronous() method, with the thinking being if coroutines are eagerly evaluated in KJ (we set auto initial_suspend() { return stdcoro::suspend_never(); }) then an already resolved promise shouldn't cause us to suspend, whereas .then() would still force us to fire later. However, it doesn't seem to work as I'd have expected (seems like we're still suspending) in tests.

Claude thinks ForkedPromise::addBranch() ensures that the task continues on a future turn of the event loop, which would mean we can't execute synchronously even if the promise is already resolved.

void PromiseNode::OnReadyEvent::init(Event* newEvent) { if (event == _kJ_ALREADY_READY) { // A new continuation was added to a promise that was already ready. In this case, we schedule // breadth-first, to make it difficult for applications to accidentally starve the event loop // by repeatedly waiting on immediate promises. if (newEvent) newEvent->armBreadthFirst(); } else { event = newEvent; } }

I'm not very familiar with the inner workings of async KJ, but am curious if that lines up with what you would expect @kentonv.

kentonv · 2025-11-03T14:38:34Z

src/workerd/io/actor-sqlite.c++

+    // alarm. Otherwise, we could unintentionally cancel our in-flight alarmLaterChain tasks.
+    alarmChainCleanup = kj::none;
+    auto updatePromise = [this, alarmStateForCommit]() -> kj::Promise<void> {
+      co_await requestScheduledAlarm(alarmStateForCommit).catch_([](kj::Exception&& e) {


Optimization: If we know that further alarm updates have been added to the chain after this one, we might as well skip this update.

`alarmLaterTasks` is a background task queue for updating the alarm manager when alarms are moved to a later time (including deletion). Tasks in this queue retry up to 4 times on failure. Prior to this commit, a race condition could occur: 1. Alarm handler completes, queuing a deferred delete in alarmLaterTasks 2. User calls setAlarm() with a new time, which precommits to alarm manager 3. The deferred delete from step 1 retries and overwrites the new alarm This causes the alarm manager to have no alarm while SRS has the new alarm time, leading to alarms that never fire (or fire late if the alarm time was set to run later, not never). To fix this race, we replace `alarmLaterTasks` (TaskSet) with `alarmLaterChain` (promise chain) that serializes all "move later" operations. Additionally, when moving an alarm earlier (precommit phase), we wait for any pending "move later" operations to complete first (simply canceling these promises will not guarantee that they're also canceled on the remote alarm manager in time). This ensures: - "Move later" operations execute sequentially at the alarm manager - "Move earlier" operations wait for pending updates before proceeding - No races between delete and set operations at the alarm manager We do not wait on these chained promises in the synchronous Workerd implementation, but do wait if we're in "async mode" (i.e. production).

MellowYarker · 2025-11-03T19:25:03Z

Sorry, pushed one more update https://github.com/cloudflare/workerd/compare/436a5a6a24847e6f0fed36837180ae34cac922d3..bdcc22e41eaa65e42e6314f5a26437c03678f8aa because a comment went stale while I was writing the previous change.

MellowYarker requested review from a team as code owners October 29, 2025 17:19

MellowYarker requested review from jclee, jqmmes, justin-mp and kentonv October 29, 2025 17:19

MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from c3c202f to fe8f895 Compare October 30, 2025 11:35

MellowYarker changed the title ~~Clear alarmLaterTasks when a new alarm is set~~ Prevent alarm update races with promise chain serialization Oct 31, 2025

justin-mp reviewed Oct 31, 2025

View reviewed changes

src/workerd/io/actor-sqlite.c++ Show resolved Hide resolved

src/workerd/io/actor-sqlite.c++ Outdated Show resolved Hide resolved

src/workerd/io/actor-sqlite.c++ Show resolved Hide resolved

MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from 7edec58 to a10c548 Compare November 2, 2025 19:15

MellowYarker requested a review from justin-mp November 2, 2025 19:40

kentonv reviewed Nov 3, 2025

View reviewed changes

MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from 5169e00 to 436a5a6 Compare November 3, 2025 19:14

MellowYarker force-pushed the milan/STOR-4521-redo branch from 436a5a6 to bdcc22e Compare November 3, 2025 19:23

Prevent alarm update races with promise chain serialization #5431

Are you sure you want to change the base?

Prevent alarm update races with promise chain serialization #5431

Conversation

MellowYarker commented Oct 29, 2025

Uh oh!

MellowYarker commented Oct 29, 2025

Uh oh!

kentonv commented Oct 29, 2025

Uh oh!

codspeed-hq bot commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #5431 will improve performances by 12.33%

Summary

Benchmarks breakdown

Footnotes

Uh oh!

justin-mp commented Oct 29, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kentonv Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

MellowYarker Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

MellowYarker Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

MellowYarker Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

MellowYarker Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

kentonv Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

MellowYarker commented Nov 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codspeed-hq bot commented Oct 29, 2025 •

edited

Loading