
Conversation

@hughsw (Collaborator) commented Apr 7, 2020

[This PR is made obsolete by #237]

Changes to fix the problem reported in #189 and #78.

This fix adopts an approach mentioned in #78 of making a separate Redis call to update the id counter. Some simplifications emerge, but the extra overhead should be reviewed.
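
For readers skimming the thread, here is a minimal sketch of the two-step idea, assuming an ioredis-style promise client and bee-queue's toKey() key helper; it is illustrative only, not the PR's actual code:

```js
async function saveWithSeparateIncr(queue, jobData) {
  // Separate round trip: allocate the next job id from the counter key first.
  const id = await queue.client.incr(queue.toKey('id'));

  // Then create the job hash and push the id onto the waiting list.
  await queue.client
    .multi()
    .hset(queue.toKey('jobs'), id, JSON.stringify(jobData))
    .lpush(queue.toKey('waiting'), id)
    .exec();

  return id;
}
```

The extra overhead mentioned above is the first await: one additional Redis round trip per saved job.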

The example code in #196 shows no lost jobs when run against this branch.

@coveralls commented

Coverage decreased (-0.4%) to 99.634% when pulling bd568e9 on hughsw:fix-dropped-jobs-incr into cec1498 on bee-queue:master.

@hughsw (Collaborator, Author) commented Apr 7, 2020

I've closed #193 which had an alternate approach, and which has considerable discussion.

@hughsw (Collaborator, Author) commented Apr 7, 2020

I will have to look into the decrease in code coverage. I need to revisit whether changing the "should not cause an error if immediately removed" test is necessary.

@hughsw (Collaborator, Author) commented Apr 7, 2020

Unfortunately, the old version of "should not cause an error if immediately removed" fails when run with the new id-setting logic. It is a very hairy test that wraps Queue._waitForJob() with specialized testing logic. It smells.

It makes me wonder whether this test somehow relies upon the old race condition...

My speculation at this point is that the test used to work because the job was created in Redis early in the Job.save() promise chain, so the test's removeJob() succeeded as designed. With the new logic, the job is created in Redis late in the Job.save() promise chain, so the removeJob() call runs before the job is in Redis, fails silently, and the job gets to run anyway...

Way too much fragile asynchrony in play...

@hughsw (Collaborator, Author) commented Apr 7, 2020

With a gross but deterministic hack to Queue._waitForJob(), I can make this test pass. The success of the hack supports my contention that the test was really only working due to a fortuitous race condition rather than a deterministic flow of control.

@hughsw (Collaborator, Author) commented Apr 8, 2020

Evidence:
Without these changes, master shows over a 2% job loss rate, as per the Docker example in #196:

client_1  | {"numJobSaveSuccess":16538,"numJobSaveError":0,"numQueueSucceeded":16544,"numQueueFailed":0,"numJobSucceeded":16180,"throughput":"1838","numSucceededLost":364,"succededLossPercent":"2.2"}
client_1  | {"numJobSaveSuccess":24535,"numJobSaveError":0,"numQueueSucceeded":24532,"numQueueFailed":0,"numJobSucceeded":24017,"throughput":"2044","numSucceededLost":515,"succededLossPercent":"2.1"}

With the changes of this PR, after running overnight, I have 0 lost jobs from over 44 million jobs run (numSucceededLost: 0):

client_1  | {"numJobSaveSuccess":44798380,"numJobSaveError":0,"numQueueSucceeded":44798370,"numQueueFailed":0,"numJobSucceeded":44798370,"throughput":"1226","numSucceededLost":0,"succededLossPercent":"0.0"}
client_1  | {"numJobSaveSuccess":44801830,"numJobSaveError":0,"numQueueSucceeded":44801821,"numQueueFailed":0,"numJobSucceeded":44801821,"throughput":"1226","numSucceededLost":0,"succededLossPercent":"0.0"}

@skeggse (Member) left a comment

I'm pretty concerned about the performance impact of this - particularly on high-latency connections. I also think it'll significantly complicate the implementation of #198.

I wonder whether this could be functionality that gets enabled only when some combination of options (e.g. storeJobs) is enabled - the storeJobs option and the other events-related options can result in memory leaks and poor scaling parameters when enabled, and we don't use them internally.
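
For context, this is the kind of opt-in being floated; storeJobs and getEvents are existing bee-queue settings, while gating the extra Redis call on them (or on a new flag) is hypothetical:

```js
const Queue = require('bee-queue');

const queue = new Queue('example', {
  storeJobs: true, // keep Job instances in memory so events can be re-emitted on them
  getEvents: true, // subscribe to the queue's events pubsub channel
});
```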

Comment on lines +428 to +429
// The only reason for this callback is to support deterministic testing of
// removeJob between the time brpoplpush resolves and Job.fromId() runs
@skeggse (Member):

Instead of modifying the source to support the test, what do you think about modifying the test to either stub the Job.fromId static method or modify the brpoplpush function to provide this hook?

@hughsw (Collaborator, Author):

Yes, I had thought about wrapping Job.fromId() -- I'll revisit that approach.
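
A rough sketch of what that revisit might look like, assuming a sinon-style stub and bee-queue's public removeJob() and Job.fromId() methods (the require path and harness details are illustrative):

```js
const sinon = require('sinon');
const Job = require('bee-queue/lib/job');

// Stub the static fetch so the job is removed in exactly the window between
// brpoplpush resolving and the job data being read, making the race deterministic.
const stub = sinon.stub(Job, 'fromId').callsFake(async function (queue, jobId) {
  await queue.removeJob(jobId);
  stub.restore();
  return Job.fromId(queue, jobId);
});
```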

.then((results) => {
  const numRaised = results[0], nextOpportunity = results[1];
  if (numRaised) {
    // not documented
@skeggse (Member):

heh yeah

This seems unrelated to the overarching PR, and maybe we could document it instead of leaving it more explicitly undocumented.

    waitDone = null;
  }
- return waitJob.apply(this, args);
+ return waitJob.apply(this);
@skeggse (Member):

apply like this just always feels weird to me :\

Suggested change:
- return waitJob.apply(this);
+ return waitJob.call(this);

@hughsw (Collaborator, Author):

You're right.
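
For anyone following along, the difference is purely stylistic when no extra arguments are passed; a tiny generic example:

```js
function greet(greeting) {
  return `${greeting}, ${this.name}`;
}
const ctx = {name: 'bee-queue'};

greet.apply(ctx, ['hello']); // apply takes the arguments as an array
greet.call(ctx, 'hello');    // call takes them inline; clearer when there are none
```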

@hughsw (Collaborator, Author) commented Apr 15, 2020

Another solution I had working was for Job.save() to generate a uuid. This avoids the extra Redis call. It intrudes slightly on the semantics of newestJob from Queue.checkHealth().
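
A rough sketch of that alternative, using the uuid package for illustration (the working branch may have generated ids differently):

```js
const {v4: uuidv4} = require('uuid');

// Inside Job#save() (sketch): allocate the id locally instead of INCRing the
// Redis counter, so no extra round trip happens before the job is stored.
function allocateJobId(existingId) {
  return existingId != null ? existingId : uuidv4();
}
```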

Regarding #198 -- I imagine the present implementation of Queue.saveAll() will make it very easy to demonstrate the dropped jobs problem because so much time passes before the Queue gets to update its jobs table.

Do you think I should pursue the uuid approach with an eye towards how Queue.saveAll() should be implemented?

@skeggse (Member) commented Apr 15, 2020

Regarding #198 -- I imagine the present implementation of Queue.saveAll() will make it very easy to demonstrate the dropped jobs problem because so much time passes before the Queue gets to update its jobs table.

That's possible, though my (albeit fairly ancient) understanding of how pipelined commands work is that Redis still iterates through clients and accepts commands one at a time from each client where there are buffered commands. What that means (I think) is that the saveAll call will put a bunch of save commands in the kernel buffer on the redis-server's side, and redis will read one of them, then read a command from another client (e.g. the worker client). Worth testing, though! I could imagine adding a semi-atomic mode to saveAll that uses multi instead of batch to more or less do what you're suggesting.
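
To make the multi-versus-batch distinction concrete, here is a sketch along the lines being discussed, using a node_redis-style client; the command shapes and key names are illustrative, not saveAll's actual internals:

```js
function saveAllSketch(client, jobs, {atomic = false} = {}) {
  const commands = jobs.map((job) => [
    'lpush', 'bq:example:waiting', JSON.stringify(job),
  ]);

  // batch(): commands are pipelined on the wire, but redis-server may interleave
  // commands from other clients (e.g. a worker's brpoplpush) between them.
  // multi(): the block runs as one MULTI/EXEC transaction, with nothing interleaved.
  const pipeline = atomic ? client.multi(commands) : client.batch(commands);

  return new Promise((resolve, reject) => {
    pipeline.exec((err, results) => (err ? reject(err) : resolve(results)));
  });
}
```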


I like the approach in #193 due to its simplicity and its superior scaling properties/avoidance of additional latency in the save-job flow.

I think the key requirements that I'm looking for in a fix for #78 / #198 are (specifically when storeJobs and getEvents are false - I can imagine solutions that change the behavior in other cases):

  • Does not alter the default job ID format, at least prior to bee-queue 2.0
  • Does not introduce additional network latency by virtue of making additional Redis calls that require sequential feedback (i.e. make a redis call to get the ID, then make a redis call to save the job)

So based on those requirements, here are my thoughts (in order of preference, I think):

  • Add an extra redis call in some cases (a la this PR), either by default when necessary to support the desired features (event delivery), or by explicit opt-in
  • Change the format of the ID in some cases (a la fix(job.id): Use unique hex string for job id to avoid a race condition #193), either by default when necessary to support the desired features (event delivery), or by explicit opt-in (which could become default in bee-queue 2.0)
    • Possibly also add a new key to the data model to track completed and accepted jobs, for metrication
    • Possibly learn from MongoDB's ObjectID scheme that attempts to reduce the (already incredibly small) possibility of ID collision
  • Augment the data model to store jobs that have completed but for whom the sender has not yet acknowledged the completion - not sure what would inspire the sender to go and check this (maybe a periodic check), but it would allow eventual event delivery at the expense of additional complexity and higher latency in the exceptional case
  • Augment the data model to use redis streams (one per client/sender, for example; might use an alternate pull-messaging approach per here) for event delivery instead of pubsub (possibly as a configurable feature); this might be a breaking change, and would introduce concerns around client persistence and the delivery of the updates to non-senders
  • Augment the pubsub approach to publish using a key that includes the job ID. In the completion script, if the publish doesn't see any receivers for the event, it uses the augmented data model to store the job's final status for eventual delivery (a rough sketch follows this list)
  • Add a pluggable job ID allocation mechanism (along the lines of a plugin system)
  • Reserve blocks of IDs for a given worker, so that it can synchronously dole out IDs while mostly preserving the information in the id key for input metrication
    • This adds a lot of complexity, so not a preferred solution
  • Add an artificial delay to the publish in _finishJob if the job finishes quickly. Not a complete solution, as it only mitigates the race condition - it does not solve it - and it hurts bee-queue's core features of performance and latency
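
A rough sketch of the publish-and-check-receivers option above (names are illustrative, and in practice this would live in the completion Lua script so the publish and the fallback write stay atomic):

```js
function reportFinished(client, queueName, jobId, payload, done) {
  client.publish(`bq:${queueName}:events:${jobId}`, payload, (err, receivers) => {
    if (err) return done(err);
    if (receivers === 0) {
      // Nobody was subscribed yet; park the final status for eventual delivery.
      return client.hset(`bq:${queueName}:undelivered`, jobId, payload, done);
    }
    return done();
  });
}
```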

I suspect the solution will be one of the first two, with the thinking that the event delivery system will become a plugin at some point. Ultimately, the solution will need to be predicated on your needs, as we don't use bee-queue's event delivery due to its somewhat suboptimal scaling properties (and lack of need for its functionality to begin with).

@hughsw (Collaborator, Author) commented Apr 18, 2020

There's a lot to think about in what you've outlined. I appreciate your putting it all down. We've touched on version 2 in other PR and issue threads. How much of the API is on the table for version 2? Where is that discussion happening? I have some radical simplification ideas, prompted by your mentions of a more plugin-centric architecture.

@hughsw marked this pull request as draft April 20, 2020 12:28
@hughsw changed the title from "Fix dropped jobs issue" to "[Obsolete draft] Fix dropped jobs issue" Apr 20, 2020
@bradvogel mentioned this pull request May 26, 2020
@compwright closed this Nov 21, 2022