Skip to content

Conversation

@kaibocai
Copy link
Member

@kaibocai kaibocai commented Jul 12, 2023

Issue describing the changes in this PR

resolves #149

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes are added to the CHANGELOG.md
  • I have added all required tests (Unit tests, E2E tests)

Additional information

Additional PR information

@kaibocai kaibocai marked this pull request as ready for review July 12, 2023 22:21
Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find. A few suggestions for improvements to this PR.

this.currentTask = taskFactory.create();
// to make sure the RetriableTask future is same as currentTask,
// so they complete at the same time
setFuture(this.currentTask.future);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code style consistency nit:

Suggested change
setFuture(this.currentTask.future);
this.setFuture(this.currentTask.future);

while (true) {
Task<V> currentTask = this.taskFactory.create();
// For first try, currentTask is not null, this will be skipped.
if (this.currentTask == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainability nit: Can you add a comment here explaining when this.currentTask == null will be true?


this.totalRetryTime = Duration.between(startTime, this.context.getCurrentInstant());
//empty currentTask.
this.currentTask = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than setting this.currentTask = null and then relying on the null-check at the top of the loop, wouldn't it be simpler to just reset the current task here and remove the null-check?

// Generate a new task/future pair for the next attempt
this.currentTask = this.taskFactory.create();
setFuture(this.currentTask.future);

public List<String> parallelOrchestrator(
@DurableOrchestrationTrigger(name = "ctx") TaskOrchestrationContext ctx) {
RetryPolicy retryPolicy = new RetryPolicy(2, Duration.ofSeconds(5));
TaskOptions taskOptions = new TaskOptions(retryPolicy);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test covers the base case where activities succeed, but shouldn't we also add a test for cases where at least one function call fails and is retried? Otherwise I feel like we're not really validating the correctness of this PR.

Also, would it make sense to have this test coverage in the main client library?

Copy link
Member Author

@kaibocai kaibocai Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would it make sense to have this test coverage in the main client library?

You mean adding similar test cases for integration tests that run on durable sidecar right?

@kaibocai kaibocai marked this pull request as draft July 13, 2023 13:28
@kaibocai
Copy link
Member Author

shouldn't we also add a test for cases where at least one function call fails and is retried

@cgillum, it turns out the current approach failed on this case. The thing is when customers using anyOf/allOF for example

        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        return ctx.allOf(tasks).await();

the allOf here returns a CompletableTask

return new CompletableTask<>(CompletableFuture.allOf(futures)
, which is not retriable and we are awaiting on it.
So all the RetriableTask tasks we put in the list won't be retried as the task we await on is a CompletableTask.

So I am thinking let allOf return CompletableTask or RetriableTask depending on the tasks it has in the collection. However, there is another issue comes, as the collection of tasks can be mixed with CompletableTask and RetriableTask. In that case, I cannot run retry logic only on those RetriableTask in the collection. I can only run (or not run) retry logic on all of the tasks altogether.
But maybe this is a corner case and it doesn't really matter to customers that few of their non-retriable tasks get retried?

To really solve this issue, I think we should have a new GRPC type dedicate for RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

cc: @davidmrdavid

@cgillum
Copy link
Member

cgillum commented Jul 13, 2023

But maybe this is a corner case and it doesn't really matter to customers that few of their non-retriable tasks get retried?

I think this will be problematic in large fan-out cases. If 1/100 activities fails, the customer will expect that we only retry the 1 that failed and not the other 99. I don't see this as a corner case. I think we should take the time necessary to figure it out, though I'm okay with doing a partial fix since the current problem of getting stuck is worse than the problem of too much retrying.

To really solve this issue, I think we should have a new GRPC type dedicate for RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

The gRPC layer has no notion of tasks, and this is by design. Tasks are purely an SDK concept, so there isn't any way to model it in the gRPC layer even if we wanted to. I can make time to try and help identify the correct design, but it will probably have to wait until after I get back from my vacation.

@kaibocai
Copy link
Member Author

Close this PR as a new one is created #157

@kaibocai kaibocai closed this Jul 25, 2023
@kaibocai kaibocai deleted the kaibocai/issue-149-1 branch October 30, 2023 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatic retry not working with Fan-out Fan-in pattern

3 participants