-
Notifications
You must be signed in to change notification settings - Fork 465
Prevent alarm update races with promise chain serialization #5431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I can push all this to my other open PR but figured it might be better to just close it as is. |
|
Ah hah, that's a good catch! This should help, but canceling probably doesn't fully resolve the race. It takes some time for the cancellation to propagate to the backend. It's still possible that a subsequent write gets to the backend first and gets applied, and then the canceled request applies after it, before realizing it has been canceled. To fully avoid this what we really need to do is wait for all previous requests to complete before sending a new request. We might need to arrange our alarm updates into a chain rather than a TaskSet... |
CodSpeed Performance ReportMerging #5431 will improve performances by 12.33%Comparing Summary
Benchmarks breakdown
Footnotes
|
The |
c3c202f to
fe8f895
Compare
alarmLaterTasks when a new alarm is set7edec58 to
a10c548
Compare
src/workerd/io/actor-sqlite.h
Outdated
| // Promise that cleans up alarmLaterChain after all operations complete. This promise gets | ||
| // cancelled if we queue another "move alarm later" request, to prevent it from cancelling the | ||
| // new in-flight request. | ||
| kj::Maybe<kj::Promise<void>> alarmChainCleanup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this cleanup promise is unnecessary. It seems to exist purely to set alarmLaterChain back to null when the chain completes. You can instead leave alarmLaterChain as a promise that has already resolved -- this just means we'll harmlessly wait on an already-resolved promise at the start of the next alarm update. In fact, alarmLaterChain need not be a Maybe, it can instead be initialized to kj::Promise<void>(kj::READY_NOW).
I think this would simplify the code quite a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started off by doing kj::ForkedPromise<void> alarmLaterChain = kj::Promise<void>(kj::READY_NOW).fork();, but then ran into an issue because we wait for any pending move later alarms before we call requestScheduledAlarm() in startPrecommitAlarmScheduling(), when. That breaks sync scheduling in Workerd.
It looks like:
state.schedulingPromise =
alarmLaterChain.addBranch().then([this, alarmTime]() { return requestScheduledAlarm(alarmTime); });
so the request to the alarm manager happens on the next event loop turn. This caused the synchronous alarm scheduling tests in actor-sqlite-test to fail.
So after I saw that I figured it would be easier to force it to be sync by clearing the chain. But now I'm wondering if that means:
- Move earlier
- Move later
- Move earlier
would break Workerd because (3) is forced to be async waiting on (2)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we need here is to drop the cleanup approach, and instead add a method to ActorSqliteHooks which returns a bool indicating whether we're sync (workerd) or async (production).
If we're sync, we just won't ever care about the alarmLaterChain, because everything is sync anyways. If we're async, we care about waiting for the chain + extending the chain.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my attempt at fixing that: https://github.com/cloudflare/workerd/compare/5169e0066e0944c2c1194e6b6c368c0022db760c..436a5a6a24847e6f0fed36837180ae34cac922d3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried switching to using coroutines instead of relying on the isSynchronous() method, with the thinking being if coroutines are eagerly evaluated in KJ (we set auto initial_suspend() { return stdcoro::suspend_never(); }) then an already resolved promise shouldn't cause us to suspend, whereas .then() would still force us to fire later. However, it doesn't seem to work as I'd have expected (seems like we're still suspending) in tests.
Claude thinks ForkedPromise::addBranch() ensures that the task continues on a future turn of the event loop, which would mean we can't execute synchronously even if the promise is already resolved.
void PromiseNode::OnReadyEvent::init(Event* newEvent) {
if (event == _kJ_ALREADY_READY) {
// A new continuation was added to a promise that was already ready. In this case, we schedule
// breadth-first, to make it difficult for applications to accidentally starve the event loop
// by repeatedly waiting on immediate promises.
if (newEvent) newEvent->armBreadthFirst();
} else {
event = newEvent;
}
}
I'm not very familiar with the inner workings of async KJ, but am curious if that lines up with what you would expect @kentonv.
src/workerd/io/actor-sqlite.c++
Outdated
| // alarm. Otherwise, we could unintentionally cancel our in-flight alarmLaterChain tasks. | ||
| alarmChainCleanup = kj::none; | ||
| auto updatePromise = [this, alarmStateForCommit]() -> kj::Promise<void> { | ||
| co_await requestScheduledAlarm(alarmStateForCommit).catch_([](kj::Exception&& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimization: If we know that further alarm updates have been added to the chain after this one, we might as well skip this update.
5169e00 to
436a5a6
Compare
`alarmLaterTasks` is a background task queue for updating the alarm manager when alarms are moved to a later time (including deletion). Tasks in this queue retry up to 4 times on failure. Prior to this commit, a race condition could occur: 1. Alarm handler completes, queuing a deferred delete in alarmLaterTasks 2. User calls setAlarm() with a new time, which precommits to alarm manager 3. The deferred delete from step 1 retries and overwrites the new alarm This causes the alarm manager to have no alarm while SRS has the new alarm time, leading to alarms that never fire (or fire late if the alarm time was set to run later, not never). To fix this race, we replace `alarmLaterTasks` (TaskSet) with `alarmLaterChain` (promise chain) that serializes all "move later" operations. Additionally, when moving an alarm earlier (precommit phase), we wait for any pending "move later" operations to complete first (simply canceling these promises will not guarantee that they're also canceled on the remote alarm manager in time). This ensures: - "Move later" operations execute sequentially at the alarm manager - "Move earlier" operations wait for pending updates before proceeding - No races between delete and set operations at the alarm manager We do not wait on these chained promises in the synchronous Workerd implementation, but do wait if we're in "async mode" (i.e. production).
436a5a6 to
bdcc22e
Compare
|
Sorry, pushed one more update https://github.com/cloudflare/workerd/compare/436a5a6a24847e6f0fed36837180ae34cac922d3..bdcc22e41eaa65e42e6314f5a26437c03678f8aa because a comment went stale while I was writing the previous change. |
alarmLaterTasksis a background task queue for updating the alarm manager when alarms are moved to a later time (including deletion). Tasks in this queue retry up to 4 times on failure.Prior to this commit, a race condition could occur:
alarmLaterTasksThis causes the alarm manager to have no alarm while SRS has the new alarm time, leading to alarms that never fire (or fire late if the alarm time was set to run later, not never).
To fix this race, we clear
alarmLaterTaskswhenever a new alarm is set. This is safe because:If we have a pending "move later" task, the alarm manager already has an earlier alarm time that will fire and reconcile via
armAlarmHandlerThe new alarm being set supersedes any pending "move later" operations anyway
This ensures the alarm manager always converges to the correct state without racing background updates.