Skip to content

Conversation

@MellowYarker
Copy link
Contributor

alarmLaterTasks is a background task queue for updating the alarm manager when alarms are moved to a later time (including deletion). Tasks in this queue retry up to 4 times on failure.

Prior to this commit, a race condition could occur:

  1. Alarm handler completes, queuing a deferred delete in alarmLaterTasks
  2. User calls setAlarm() with a new time, which precommits to alarm manager
  3. The deferred delete from step 1 retries and overwrites the new alarm

This causes the alarm manager to have no alarm while SRS has the new alarm time, leading to alarms that never fire (or fire late if the alarm time was set to run later, not never).

To fix this race, we clear alarmLaterTasks whenever a new alarm is set. This is safe because:

  • If we have a pending "move later" task, the alarm manager already has an earlier alarm time that will fire and reconcile via armAlarmHandler

  • The new alarm being set supersedes any pending "move later" operations anyway

This ensures the alarm manager always converges to the correct state without racing background updates.

@MellowYarker
Copy link
Contributor Author

I can push all this to my other open PR but figured it might be better to just close it as is.

@kentonv
Copy link
Member

kentonv commented Oct 29, 2025

Ah hah, that's a good catch!

This should help, but canceling probably doesn't fully resolve the race. It takes some time for the cancellation to propagate to the backend. It's still possible that a subsequent write gets to the backend first and gets applied, and then the canceled request applies after it, before realizing it has been canceled.

To fully avoid this what we really need to do is wait for all previous requests to complete before sending a new request. We might need to arrange our alarm updates into a chain rather than a TaskSet...

@codspeed-hq
Copy link

codspeed-hq bot commented Oct 29, 2025

CodSpeed Performance Report

Merging #5431 will improve performances by 12.33%

Comparing milan/STOR-4521-redo (bdcc22e) with main (b0a7b2f)

Summary

⚡ 1 improvement
✅ 33 untouched
⏩ 9 skipped1

Benchmarks breakdown

Benchmark BASE HEAD Change
simpleStringBody[Response] 22.5 µs 20 µs +12.33%

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@justin-mp
Copy link
Contributor

To fully avoid this what we really need to do is wait for all previous requests to complete before sending a new request. We might need to arrange our alarm updates into a chain rather than a TaskSet...

The commitImpl has a pendingCommit that we could use to synchronize requests. It serves a slightly different purpose, but maybe we adapt that to run the commits in order?

@MellowYarker MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from c3c202f to fe8f895 Compare October 30, 2025 11:35
@MellowYarker MellowYarker changed the title Clear alarmLaterTasks when a new alarm is set Prevent alarm update races with promise chain serialization Oct 31, 2025
@MellowYarker MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from 7edec58 to a10c548 Compare November 2, 2025 19:15
// Promise that cleans up alarmLaterChain after all operations complete. This promise gets
// cancelled if we queue another "move alarm later" request, to prevent it from cancelling the
// new in-flight request.
kj::Maybe<kj::Promise<void>> alarmChainCleanup;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this cleanup promise is unnecessary. It seems to exist purely to set alarmLaterChain back to null when the chain completes. You can instead leave alarmLaterChain as a promise that has already resolved -- this just means we'll harmlessly wait on an already-resolved promise at the start of the next alarm update. In fact, alarmLaterChain need not be a Maybe, it can instead be initialized to kj::Promise<void>(kj::READY_NOW).

I think this would simplify the code quite a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started off by doing kj::ForkedPromise<void> alarmLaterChain = kj::Promise<void>(kj::READY_NOW).fork();, but then ran into an issue because we wait for any pending move later alarms before we call requestScheduledAlarm() in startPrecommitAlarmScheduling(), when. That breaks sync scheduling in Workerd.

It looks like:

state.schedulingPromise =
      alarmLaterChain.addBranch().then([this, alarmTime]() { return requestScheduledAlarm(alarmTime); });

so the request to the alarm manager happens on the next event loop turn. This caused the synchronous alarm scheduling tests in actor-sqlite-test to fail.

So after I saw that I figured it would be easier to force it to be sync by clearing the chain. But now I'm wondering if that means:

  1. Move earlier
  2. Move later
  3. Move earlier

would break Workerd because (3) is forced to be async waiting on (2)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we need here is to drop the cleanup approach, and instead add a method to ActorSqliteHooks which returns a bool indicating whether we're sync (workerd) or async (production).

If we're sync, we just won't ever care about the alarmLaterChain, because everything is sync anyways. If we're async, we care about waiting for the chain + extending the chain.

Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried switching to using coroutines instead of relying on the isSynchronous() method, with the thinking being if coroutines are eagerly evaluated in KJ (we set auto initial_suspend() { return stdcoro::suspend_never(); }) then an already resolved promise shouldn't cause us to suspend, whereas .then() would still force us to fire later. However, it doesn't seem to work as I'd have expected (seems like we're still suspending) in tests.

Claude thinks ForkedPromise::addBranch() ensures that the task continues on a future turn of the event loop, which would mean we can't execute synchronously even if the promise is already resolved.

void PromiseNode::OnReadyEvent::init(Event* newEvent) {
  if (event == _kJ_ALREADY_READY) {
    // A new continuation was added to a promise that was already ready.  In this case, we schedule
    // breadth-first, to make it difficult for applications to accidentally starve the event loop
    // by repeatedly waiting on immediate promises.
    if (newEvent) newEvent->armBreadthFirst();
  } else {
    event = newEvent;
  }
}

I'm not very familiar with the inner workings of async KJ, but am curious if that lines up with what you would expect @kentonv.

// alarm. Otherwise, we could unintentionally cancel our in-flight alarmLaterChain tasks.
alarmChainCleanup = kj::none;
auto updatePromise = [this, alarmStateForCommit]() -> kj::Promise<void> {
co_await requestScheduledAlarm(alarmStateForCommit).catch_([](kj::Exception&& e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimization: If we know that further alarm updates have been added to the chain after this one, we might as well skip this update.

@MellowYarker MellowYarker force-pushed the milan/STOR-4521-redo branch 2 times, most recently from 5169e00 to 436a5a6 Compare November 3, 2025 19:14
`alarmLaterTasks` is a background task queue for updating the alarm
manager when alarms are moved to a later time (including deletion).
Tasks in this queue retry up to 4 times on failure.

Prior to this commit, a race condition could occur:
1. Alarm handler completes, queuing a deferred delete in alarmLaterTasks
2. User calls setAlarm() with a new time, which precommits to alarm
   manager
3. The deferred delete from step 1 retries and overwrites the new alarm

This causes the alarm manager to have no alarm while SRS has the new
alarm time, leading to alarms that never fire (or fire late if the alarm
time was set to run later, not never).

To fix this race, we replace `alarmLaterTasks` (TaskSet) with
`alarmLaterChain` (promise chain) that serializes all "move later"
operations. Additionally, when moving an alarm earlier (precommit
phase), we wait for any pending "move later" operations to complete
first (simply canceling these promises will not guarantee that they're
also canceled on the remote alarm manager in time).

This ensures:
	- "Move later" operations execute sequentially at the alarm manager
	- "Move earlier" operations wait for pending updates before proceeding
	- No races between delete and set operations at the alarm manager

We do not wait on these chained promises in the synchronous Workerd
implementation, but do wait if we're in "async mode" (i.e. production).
@MellowYarker
Copy link
Contributor Author

Sorry, pushed one more update https://github.com/cloudflare/workerd/compare/436a5a6a24847e6f0fed36837180ae34cac922d3..bdcc22e41eaa65e42e6314f5a26437c03678f8aa because a comment went stale while I was writing the previous change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants