Fix orchestration stuck using anyOf/allOf with retry policy #153

kaibocai · 2023-07-12T21:45:53Z

Issue describing the changes in this PR

resolves #149

Pull request checklist

My changes do not require documentation changes
- Otherwise: Documentation issue linked to PR
My changes are added to the CHANGELOG.md
I have added all required tests (Unit tests, E2E tests)

Additional information

Additional PR information

cgillum

Good find. A few suggestions for improvements to this PR.

cgillum · 2023-07-12T22:57:34Z

client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java

+                this.currentTask = taskFactory.create();
+                // to make sure the RetriableTask future is same as currentTask,
+                // so they complete at the same time
+                setFuture(this.currentTask.future);


Code style consistency nit:

Suggested change

-                setFuture(this.currentTask.future);
+                this.setFuture(this.currentTask.future);

cgillum · 2023-07-12T22:59:44Z

client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java

                while (true) {
-                    Task<V> currentTask = this.taskFactory.create();
+                    // For first try, currentTask is not null, this will be skipped.
+                    if (this.currentTask == null) {


Maintainability nit: Can you add a comment here explaining when this.currentTask == null will be true?

cgillum · 2023-07-12T23:03:44Z

client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java


                    this.totalRetryTime  = Duration.between(startTime, this.context.getCurrentInstant());
+                    //empty currentTask.
+                    this.currentTask = null;


Rather than setting this.currentTask = null and then relying on the null-check at the top of the loop, wouldn't it be simpler to just reset the current task here and remove the null-check?

    // Generate a new task/future pair for the next attempt
    this.currentTask = this.taskFactory.create();
    setFuture(this.currentTask.future);
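For readers following along, the loop shape suggested above can be sketched in isolation. This is only a minimal illustration, not the SDK's actual `RetriableTask` code: `RetryLoopSketch`, `TaskFactory`, and `awaitWithRetry` are hypothetical names, a plain `CompletableFuture` stands in for the SDK's task type, and the retry delay and `setFuture` plumbing are left out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class RetryLoopSketch {

    /** Hypothetical stand-in for the SDK's task factory; not the real interface. */
    interface TaskFactory<V> {
        CompletableFuture<V> create();
    }

    /**
     * Awaits the current attempt and, on failure, replaces it with a fresh
     * task/future pair at the bottom of the loop, so no null-check is needed
     * at the top.
     */
    static <V> V awaitWithRetry(TaskFactory<V> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        CompletableFuture<V> current = factory.create();
        for (int attempt = 1; ; attempt++) {
            try {
                return current.join();
            } catch (CompletionException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: surface the last failure
                }
                // Generate a new task/future pair for the next attempt
                current = factory.create();
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails on the first two attempts, then succeeds.
        TaskFactory<String> flaky = () -> {
            calls[0]++;
            if (calls[0] < 3) {
                return CompletableFuture.<String>failedFuture(
                        new IllegalStateException("transient failure"));
            }
            return CompletableFuture.completedFuture("ok");
        };
        String result = awaitWithRetry(flaky, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Resetting the task where the failure is handled keeps "which task is current" in one place, which is the maintainability point of the suggestion.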

cgillum · 2023-07-12T23:11:39Z

samples-azure-functions/src/main/java/com/functions/ParallelFunctions.java

+    public List<String> parallelOrchestrator(
+            @DurableOrchestrationTrigger(name = "ctx") TaskOrchestrationContext ctx) {
+        RetryPolicy retryPolicy = new RetryPolicy(2, Duration.ofSeconds(5));
+        TaskOptions taskOptions = new TaskOptions(retryPolicy);


This test covers the base case where activities succeed, but shouldn't we also add a test for cases where at least one function call fails and is retried? Otherwise I feel like we're not really validating the correctness of this PR.

Also, would it make sense to have this test coverage in the main client library?

Also, would it make sense to have this test coverage in the main client library?

You mean adding similar test cases to the integration tests that run against the durable sidecar, right?

kaibocai · 2023-07-13T14:11:26Z

shouldn't we also add a test for cases where at least one function call fails and is retried

@cgillum, it turns out the current approach fails in this case. Consider a customer using anyOf/allOf, for example:

        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        return ctx.allOf(tasks).await();

the allOf here returns a CompletableTask

durabletask-java/client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java

Line 188 in b919929

return new CompletableTask<>(CompletableFuture.allOf(futures)

, which is not retriable, and that is the task we await.
So none of the RetriableTask instances we put in the list will be retried, because the task we await is a CompletableTask.

So I am thinking of letting allOf return either a CompletableTask or a RetriableTask, depending on the tasks in the collection. However, that raises another issue: the collection can contain a mix of CompletableTask and RetriableTask instances. In that case, I cannot run the retry logic on only the RetriableTask instances in the collection; I can only run (or not run) it on all of the tasks together.
But maybe this is a corner case, and it doesn't really matter to customers if a few of their non-retriable tasks get retried?

To really solve this issue, I think we should have a new gRPC type dedicated to RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

cc: @davidmrdavid

cgillum · 2023-07-13T15:51:21Z

But maybe this is a corner case, and it doesn't really matter to customers if a few of their non-retriable tasks get retried?

I think this will be problematic in large fan-out cases. If 1/100 activities fails, the customer will expect that we only retry the 1 that failed and not the other 99. I don't see this as a corner case. I think we should take the time necessary to figure it out, though I'm okay with doing a partial fix since the current problem of getting stuck is worse than the problem of too much retrying.

To really solve this issue, I think we should have a new gRPC type dedicated to RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

The gRPC layer has no notion of tasks, and this is by design. Tasks are purely an SDK concept, so there isn't any way to model it in the gRPC layer even if we wanted to. I can make time to try and help identify the correct design, but it will probably have to wait until after I get back from my vacation.
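As an illustration of the fan-out expectation discussed in this thread (retry only the activity that failed, not its siblings), here is a minimal sketch built on plain `CompletableFuture`. It is not the design this thread eventually settled on and does not use the SDK's task types: `FanOutRetrySketch`, `withRetry`, and `allOfList` are hypothetical names. The idea is that each retriable operation carries its own retry wrapper, and the aggregate only waits on the already-wrapped futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class FanOutRetrySketch {

    /** Hypothetical helper: runs op, re-running it on failure, up to maxAttempts times. */
    static <V> CompletableFuture<V> withRetry(Supplier<CompletableFuture<V>> op, int maxAttempts) {
        CompletableFuture<V> result = new CompletableFuture<>();
        attempt(op, maxAttempts, 1, result);
        return result;
    }

    private static <V> void attempt(Supplier<CompletableFuture<V>> op, int maxAttempts,
                                    int attemptNumber, CompletableFuture<V> result) {
        op.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNumber >= maxAttempts) {
                result.completeExceptionally(error);
            } else {
                // Only this operation is re-run; sibling operations are untouched.
                attempt(op, maxAttempts, attemptNumber + 1, result);
            }
        });
    }

    /** Hypothetical aggregate: completes once every (already retry-wrapped) future completes. */
    static <V> CompletableFuture<List<V>> allOfList(List<CompletableFuture<V>> futures) {
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<V> values = new ArrayList<>();
                    for (CompletableFuture<V> f : futures) {
                        values.add(f.join());
                    }
                    return values;
                });
    }

    public static void main(String[] args) {
        AtomicInteger[] calls = {new AtomicInteger(), new AtomicInteger(), new AtomicInteger()};
        List<CompletableFuture<String>> tasks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            final int id = i;
            // Operation 1 fails on its first attempt only; the others always succeed.
            Supplier<CompletableFuture<String>> op = () -> {
                int n = calls[id].incrementAndGet();
                if (id == 1 && n == 1) {
                    return CompletableFuture.<String>failedFuture(
                            new IllegalStateException("transient failure"));
                }
                return CompletableFuture.completedFuture("result-" + id);
            };
            tasks.add(withRetry(op, 2));
        }
        System.out.println(allOfList(tasks).join());
        for (int i = 0; i < 3; i++) {
            System.out.println("operation " + i + " ran " + calls[i].get() + " time(s)");
        }
    }
}
```

In a mixed collection, non-retriable operations would simply be added without the `withRetry` wrapper, so the aggregate never has to decide whether to retry everything or nothing.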

kaibocai · 2023-07-25T14:10:34Z

Closing this PR because a new one has been created: #157

kaibocai added 3 commits July 12, 2023 16:44

fix orchestration stuck using anyOf/allOf with retry policy

becef0c

Update CHANGELOG.md

cab6d6d

clean up code

e50dd9a

kaibocai marked this pull request as ready for review July 12, 2023 22:21

kaibocai requested review from cgillum and shreyas-gopalakrishna as code owners July 12, 2023 22:21

kaibocai mentioned this pull request Jul 12, 2023

Fix orchestration stuck when using anyof/allof with retry policy #152

Closed

4 tasks

update end2end test

a4b7990

cgillum reviewed Jul 12, 2023

kaibocai marked this pull request as draft July 13, 2023 13:28

kaibocai closed this Jul 25, 2023

kaibocai deleted the kaibocai/issue-149-1 branch October 30, 2023 15:05
