
Conversation

@hughsw (Collaborator) commented Apr 7, 2020

[This PR is made obsolete by #237]

Changes to fix the problem reported in #189 and #78.

This fix adopts an approach mentioned in #78 of making a separate Redis call to update the id counter. Some simplifications emerge, but the extra overhead should be reviewed.
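
For readers skimming the thread, here is a minimal sketch of the two-step idea, assuming an ioredis-style promise client and bee-queue's toKey() key helper; it is illustrative only, not the PR's actual code:

```js
async function saveWithSeparateIncr(queue, jobData) {
  // Separate round trip: allocate the next job id from the counter key first.
  const id = await queue.client.incr(queue.toKey('id'));

  // Then create the job hash and push the id onto the waiting list.
  await queue.client
    .multi()
    .hset(queue.toKey('jobs'), id, JSON.stringify(jobData))
    .lpush(queue.toKey('waiting'), id)
    .exec();

  return id;
}
```

The extra overhead mentioned above is the first await: one additional Redis round trip per saved job.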

The example code in #196 shows no lost jobs when run against this branch.

@coveralls commented

Coverage decreased (-0.4%) to 99.634% when pulling bd568e9 on hughsw:fix-dropped-jobs-incr into cec1498 on bee-queue:master.

@hughsw (Collaborator, Author) commented Apr 7, 2020

I've closed #193 which had an alternate approach, and which has considerable discussion.

@hughsw (Collaborator, Author) commented Apr 7, 2020

I will have to look into the decrease in code coverage. I need to revisit whether changing the "should not cause an error if immediately removed" test is necessary.

@hughsw (Collaborator, Author) commented Apr 7, 2020

Unfortunately, the old version of "should not cause an error if immediately removed" fails when run with the new id-setting logic. It is a very hairy test that wraps Queue._waitForJob() with specialized testing logic. It smells.

It makes me wonder whether this test somehow relies upon the old race condition...

My speculation at this point is that the test used to work because the job was created in Redis early in the Job.save() promise chain, so the test's removeJob() succeeded as designed. With the new logic, the job is created in Redis late in the Job.save() promise chain, so the removeJob() call runs before the job is in Redis, fails silently, and the job gets to run anyway...

Way too much fragile asynchrony in play...

@hughsw (Collaborator, Author) commented Apr 7, 2020

With a gross but deterministic hack to Queue._waitForJob(), I can make this test pass. The success of the hack supports my contention that the test was really only working due to a fortuitous race condition rather than a deterministic flow of control.

@hughsw (Collaborator, Author) commented Apr 8, 2020

Evidence:
Without these changes, master shows over a 2% job loss rate, as per the Docker example in #196:

client_1  | {"numJobSaveSuccess":16538,"numJobSaveError":0,"numQueueSucceeded":16544,"numQueueFailed":0,"numJobSucceeded":16180,"throughput":"1838","numSucceededLost":364,"succededLossPercent":"2.2"}
client_1  | {"numJobSaveSuccess":24535,"numJobSaveError":0,"numQueueSucceeded":24532,"numQueueFailed":0,"numJobSucceeded":24017,"throughput":"2044","numSucceededLost":515,"succededLossPercent":"2.1"}

With the changes of this PR, after running overnight, I have 0 lost jobs from over 44 million jobs run (numSucceededLost: 0):

client_1  | {"numJobSaveSuccess":44798380,"numJobSaveError":0,"numQueueSucceeded":44798370,"numQueueFailed":0,"numJobSucceeded":44798370,"throughput":"1226","numSucceededLost":0,"succededLossPercent":"0.0"}
client_1  | {"numJobSaveSuccess":44801830,"numJobSaveError":0,"numQueueSucceeded":44801821,"numQueueFailed":0,"numJobSucceeded":44801821,"throughput":"1226","numSucceededLost":0,"succededLossPercent":"0.0"}

@skeggse (Member) left a comment

I'm pretty concerned about the performance impact of this - particularly on high-latency connections. I also think it'll significantly complicate the implementation of #198.

I wonder whether this could be functionality that gets enabled only when some combination of options (e.g. storeJobs) is enabled - the storeJobs option and the other events-related options can result in memory leaks and poor scaling parameters when enabled, and we don't use them internally.
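
For context, this is the kind of opt-in being floated; storeJobs and getEvents are existing bee-queue settings, while gating the extra Redis call on them (or on a new flag) is hypothetical:

```js
const Queue = require('bee-queue');

const queue = new Queue('example', {
  storeJobs: true, // keep Job instances in memory so events can be re-emitted on them
  getEvents: true, // subscribe to the queue's events pubsub channel
});
```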

Comment on lines +428 to +429
// The only reason for this callback is to support deterministic testing of
// removeJob between the time brpoplpush resolves and Job.fromId() runs
@skeggse (Member):

Instead of modifying the source to support the test, what do you think about modifying the test to either stub the Job.fromId static method or modify the brpoplpush function to provide this hook?

@hughsw (Collaborator, Author):

Yes, I had thought about wrapping Job.fromId() -- I'll revisit that approach.
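
A rough sketch of what that revisit might look like, assuming a sinon-style stub and bee-queue's public removeJob() and Job.fromId() methods (the require path and harness details are illustrative):

```js
const sinon = require('sinon');
const Job = require('bee-queue/lib/job');

// Stub the static fetch so the job is removed in exactly the window between
// brpoplpush resolving and the job data being read, making the race deterministic.
const stub = sinon.stub(Job, 'fromId').callsFake(async function (queue, jobId) {
  await queue.removeJob(jobId);
  stub.restore();
  return Job.fromId(queue, jobId);
});
```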

.then((results) => {
  const numRaised = results[0], nextOpportunity = results[1];
  if (numRaised) {
    // not documented
@skeggse (Member):

heh yeah

This seems unrelated to the overarching PR, and maybe we could document it instead of leaving it more explicitly undocumented.

    waitDone = null;
  }
- return waitJob.apply(this, args);
+ return waitJob.apply(this);
@skeggse (Member):

apply like this just always feels weird to me :\

Suggested change:
- return waitJob.apply(this);
+ return waitJob.call(this);

@hughsw (Collaborator, Author):

You're right.
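
For anyone following along, the difference is purely stylistic when no extra arguments are passed; a tiny generic example:

```js
function greet(greeting) {
  return `${greeting}, ${this.name}`;
}
const ctx = {name: 'bee-queue'};

greet.apply(ctx, ['hello']); // apply takes the arguments as an array
greet.call(ctx, 'hello');    // call takes them inline; clearer when there are none
```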

@hughsw (Collaborator, Author) commented Apr 15, 2020

Another solution I had working was for Job.save() to generate a uuid. This avoids the extra Redis call. It intrudes slightly on the semantics of newestJob from Queue.checkHealth().
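
A rough sketch of that alternative, using the uuid package for illustration (the working branch may have generated ids differently):

```js
const {v4: uuidv4} = require('uuid');

// Inside Job#save() (sketch): allocate the id locally instead of INCRing the
// Redis counter, so no extra round trip happens before the job is stored.
function allocateJobId(existingId) {
  return existingId != null ? existingId : uuidv4();
}
```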

Regarding #198 -- I imagine the present implementation of Queue.saveAll() will make it very easy to demonstrate the dropped jobs problem because so much time passes before the Queue gets to update its jobs table.

Do you think I should pursue the uuid approach with an eye towards how Queue.saveAll() should be implemented?

@skeggse (Member) commented Apr 15, 2020

Regarding #198 -- I imagine the present implementation of Queue.saveAll() will make it very easy to demonstrate the dropped jobs problem because so much time passes before the Queue gets to update its jobs table.

That's possible, though my (albeit fairly ancient) understanding of how pipelined commands work is that Redis still iterates through clients and accepts commands one at a time from each client where there are buffered commands. What that means (I think) is that the saveAll call will put a bunch of save commands in the kernel buffer on the redis-server's side, and redis will read one of them, then read a command from another client (e.g. the worker client). Worth testing, though! I could imagine adding a semi-atomic mode to saveAll that uses multi instead of batch to more or less do what you're suggesting.
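
To make the multi-versus-batch distinction concrete, here is a sketch along the lines being discussed, using a node_redis-style client; the command shapes and key names are illustrative, not saveAll's actual internals:

```js
function saveAllSketch(client, jobs, {atomic = false} = {}) {
  const commands = jobs.map((job) => [
    'lpush', 'bq:example:waiting', JSON.stringify(job),
  ]);

  // batch(): commands are pipelined on the wire, but redis-server may interleave
  // commands from other clients (e.g. a worker's brpoplpush) between them.
  // multi(): the block runs as one MULTI/EXEC transaction, with nothing interleaved.
  const pipeline = atomic ? client.multi(commands) : client.batch(commands);

  return new Promise((resolve, reject) => {
    pipeline.exec((err, results) => (err ? reject(err) : resolve(results)));
  });
}
```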


I like the approach in #193 due to its simplicity and its superior scaling properties/avoidance of additional latency in the save-job flow.

I think the key requirements that I'm looking for in a fix for #78 / #198 are (specifically when storeJobs and getEvents are false - I can imagine solutions that change the behavior in other cases):

  • Does not alter the default job ID format, at least prior to bee-queue 2.0
  • Does not introduce additional network latency by virtue of making additional Redis calls that require sequential feedback (i.e. make a redis call to get the ID, then make a redis call to save the job)

So based on those requirements, here are my thoughts (in order of preference, I think):

  • Add an extra redis call in some cases (a la this PR), either by default when necessary to support the desired features (event delivery), or by explicit opt-in
  • Change the format of the ID in some cases (a la fix(job.id): Use unique hex string for job id to avoid a race condition #193), either by default when necessary to support the desired features (event delivery), or by explicit opt-in (which could become default in bee-queue 2.0)
    • Possibly also add a new key to the data model to track completed and accepted jobs, for metrication
    • Possibly learn from MongoDB's ObjectID scheme that attempts to reduce the (already incredibly small) possibility of ID collision
  • Augment the data model to store jobs that have completed but for whom the sender has not yet acknowledged the completion - not sure what would inspire the sender to go and check this (maybe a periodic check), but it would allow eventual event delivery at the expense of additional complexity and higher latency in the exceptional case
  • Augment the data model to use redis streams (one per client/sender, for example; might use an alternate pull-messaging approach per here) for event delivery instead of pubsub (possibly as a configurable feature); this might be a breaking change, and would introduce concerns around client persistence and the delivery of the updates to non-senders
  • Augment the pubsub approach to publish using a key that includes the job ID. In the completion script, if the publish doesn't see any receivers for the event, it uses the augmented data model to store the job's final status for eventual delivery (a rough sketch follows this list)
  • Add a pluggable job ID allocation mechanism (along the lines of a plugin system)
  • Reserve blocks of IDs for a given worker, so that it can synchronously dole out IDs while mostly preserving the information in the id key for input metrication
    • This adds a lot of complexity, so not a preferred solution
  • Add an artificial delay to the publish in _finishJob if the job finishes quickly. Not a complete solution, as it only mitigates the race condition - it does not solve it - and it hurts bee-queue's core features of performance and latency
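
A rough sketch of the publish-and-check-receivers option above (names are illustrative, and in practice this would live in the completion Lua script so the publish and the fallback write stay atomic):

```js
function reportFinished(client, queueName, jobId, payload, done) {
  client.publish(`bq:${queueName}:events:${jobId}`, payload, (err, receivers) => {
    if (err) return done(err);
    if (receivers === 0) {
      // Nobody was subscribed yet; park the final status for eventual delivery.
      return client.hset(`bq:${queueName}:undelivered`, jobId, payload, done);
    }
    return done();
  });
}
```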

I suspect the solution will be one of the first two, with the thinking that the event delivery system will become a plugin at some point. Ultimately, the solution will need to be predicated on your needs, as we don't use bee-queue's event delivery due to its somewhat suboptimal scaling properties (and lack of need for its functionality to begin with).

@hughsw (Collaborator, Author) commented Apr 18, 2020

There's a lot to think about in what you've outlined. I appreciate your putting it all down. We've touched on version 2 in other PR and issue threads. How much of the API is on the table for version 2? Where is that discussion happening? I have some radical simplification ideas, prompted by your mentions of a more plugin-centric architecture.

@hughsw marked this pull request as draft April 20, 2020 12:28
@hughsw changed the title from "Fix dropped jobs issue" to "[Obsolete draft] Fix dropped jobs issue" Apr 20, 2020
@bradvogel mentioned this pull request May 26, 2020
@compwright closed this Nov 21, 2022