I have a large number of documents to process and am using Python RQ to parallelize the task.
I would like a pipeline of work to be done, with different operations performed on each document in sequence. For example: A -> B -> C means pass the document to function A; after A is done, proceed to B; and last, C.
However, Python RQ does not seem to support pipelines very nicely.
Here is a simple but somewhat dirty way of doing this. In short, each function along the pipeline calls the next one in a nested way.
For example, for a pipeline A -> B -> C, at the top level some code is written like this:
q.enqueue(A, the_doc)
where q is the Queue instance, and in function A there is code like:
q.enqueue(B, the_doc)
And in B, there is something like this:
q.enqueue(C, the_doc)
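Put together, the nested approach looks roughly like this. This is only a sketch: it assumes a Redis server on localhost, that the module (call it tasks.py, a hypothetical name) is importable by the RQ workers, and the print calls stand in for the real per-stage processing:

# tasks.py -- sketch of the nested-call approach
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

def C(the_doc):
    print("stage C:", the_doc)   # final stage, nothing left to enqueue

def B(the_doc):
    print("stage B:", the_doc)
    q.enqueue(C, the_doc)        # B hands the document on to C

def A(the_doc):
    print("stage A:", the_doc)
    q.enqueue(B, the_doc)        # A hands the document on to B

# elsewhere, at the top level:
# from tasks import A, q
# q.enqueue(A, the_doc)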
Is there a more elegant way than this? For example, some code in ONE function:
q.enqueue(A, the_doc)
q.enqueue(B, the_doc, after=A)
q.enqueue(C, the_doc, after=B)
The depends_on parameter is the closest to my requirement; however, running something like:
A_job = q.enqueue(A, the_doc)
q.enqueue(B, depends_on=A_job)
won't work, as q.enqueue(B, depends_on=A_job) is executed immediately after A_job = q.enqueue(A, the_doc). By the time B is enqueued, the result from A might not be ready, since A takes time to process.
PS:
If Python RQ is not really good at this, what other tool in Python can I use to achieve the same purpose:
- round-robin parallelization
- pipeline processing support
Update: with q.enqueue(B, depends_on=A_job), job B will be processed after A is finished. Isn't what matters when the job is processed, rather than when it is enqueued?
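If so, the whole chain could be declared up front from one function, roughly like this. Again a sketch only: tasks is the hypothetical module holding the stage functions, and it assumes depends_on keeps each dependent job deferred until its parent finishes:

from redis import Redis
from rq import Queue

from tasks import A, B, C   # hypothetical module with the stage functions

q = Queue(connection=Redis())
the_doc = "doc-1"

# All three jobs are enqueued immediately, but (assuming the deferred
# behavior above) B and C wait until their parent job has finished.
A_job = q.enqueue(A, the_doc)
B_job = q.enqueue(B, the_doc, depends_on=A_job)
C_job = q.enqueue(C, the_doc, depends_on=B_job)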