-
Notifications
You must be signed in to change notification settings - Fork 16
Fix orchestration stuck using anyOf/allOf with retry policy #153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find. A few suggestions for improvements to this PR.
| this.currentTask = taskFactory.create(); | ||
| // to make sure the RetriableTask future is same as currentTask, | ||
| // so they complete at the same time | ||
| setFuture(this.currentTask.future); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code style consistency nit:
| setFuture(this.currentTask.future); | |
| this.setFuture(this.currentTask.future); |
| while (true) { | ||
| Task<V> currentTask = this.taskFactory.create(); | ||
| // For first try, currentTask is not null, this will be skipped. | ||
| if (this.currentTask == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability nit: Can you add a comment here explaining when this.currentTask == null will be true?
|
|
||
| this.totalRetryTime = Duration.between(startTime, this.context.getCurrentInstant()); | ||
| //empty currentTask. | ||
| this.currentTask = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than setting this.currentTask = null and then relying on the null-check at the top of the loop, wouldn't it be simpler to just reset the current task here and remove the null-check?
// Generate a new task/future pair for the next attempt
this.currentTask = this.taskFactory.create();
setFuture(this.currentTask.future);| public List<String> parallelOrchestrator( | ||
| @DurableOrchestrationTrigger(name = "ctx") TaskOrchestrationContext ctx) { | ||
| RetryPolicy retryPolicy = new RetryPolicy(2, Duration.ofSeconds(5)); | ||
| TaskOptions taskOptions = new TaskOptions(retryPolicy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test covers the base case where activities succeed, but shouldn't we also add a test for cases where at least one function call fails and is retried? Otherwise I feel like we're not really validating the correctness of this PR.
Also, would it make sense to have this test coverage in the main client library?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would it make sense to have this test coverage in the main client library?
You mean adding similar test cases for integration tests that run on durable sidecar right?
@cgillum, it turns out the current approach failed on this case. The thing is when customers using tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
return ctx.allOf(tasks).await();the durabletask-java/client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java Line 188 in b919929
So all the RetriableTask tasks we put in the list won't be retried as the task we await on is a CompletableTask.
So I am thinking let To really solve this issue, I think we should have a new GRPC type dedicate for cc: @davidmrdavid |
I think this will be problematic in large fan-out cases. If 1/100 activities fails, the customer will expect that we only retry the 1 that failed and not the other 99. I don't see this as a corner case. I think we should take the time necessary to figure it out, though I'm okay with doing a partial fix since the current problem of getting stuck is worse than the problem of too much retrying.
The gRPC layer has no notion of tasks, and this is by design. Tasks are purely an SDK concept, so there isn't any way to model it in the gRPC layer even if we wanted to. I can make time to try and help identify the correct design, but it will probably have to wait until after I get back from my vacation. |
|
Close this PR as a new one is created #157 |
Issue describing the changes in this PR
resolves #149
Pull request checklist
CHANGELOG.mdAdditional information
Additional PR information