fix: job retry mechanism not triggering #4961
base: main
Conversation
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Force-pushed 584b846 to b02ebc2
Co-authored-by: Brend Smits <brend.smits@philips.com>
Force-pushed b02ebc2 to 3eee9b2
Pull request overview
This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.
Key Changes
- Re-introduced the `publishRetryMessage` call in the scale-up loop to ensure retry messages are published for queued jobs
- Fixed the condition for skipping runner creation from `missingInstanceCount === scaleUp` to `newRunners <= 0`, preventing attempts to create a negative number of runners (see the sketch after this list)
- Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
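For readers skimming the diff, here is a rough, self-contained sketch of how the two code changes fit together. It is paraphrased from the description above, not copied from `scale-up.ts`; the helper names and the exact `newRunners` computation are assumptions.

```ts
// Hypothetical sketch of the fixed control flow in the scale-up loop.
// All names besides publishRetryMessage and newRunners are illustrative.
type QueuedMessage = { id: number };

async function publishRetryMessage(message: QueuedMessage): Promise<void> {
  // stand-in for the real job-retry publisher
  console.log(`retry message published for job ${message.id}`);
}

async function scaleUpSketch(
  messages: QueuedMessage[],
  currentRunnerCount: number,
  maximumRunners: number,
): Promise<void> {
  let scaleUp = 0;
  for (const message of messages) {
    // ...job confirmed to still be queued...
    scaleUp++;
    await publishRetryMessage(message); // the re-introduced call
  }

  // Assumed shape of the computation: the result goes negative when the pool
  // already exceeds the configured maximum.
  const newRunners = Math.min(scaleUp, maximumRunners - currentRunnerCount);

  // Old guard: `missingInstanceCount === scaleUp`. The new guard also covers
  // the negative case, so no attempt is made to create a negative number of runners.
  if (newRunners <= 0) {
    return; // no runners will be created, skip the EC2 API call
  }
  // ...create `newRunners` instances...
}
```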
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Imports `publishRetryMessage`, calls it for each queued message, and fixes the skip condition to handle negative `newRunners` values |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds mock setup for `publishRetryMessage` and a new test suite with 7 tests covering retry mechanism behavior in various scenarios |
    }

    scaleUp++;
    await publishRetryMessage(message as ActionRequestMessageRetry);
Copilot AI commented on Dec 18, 2025
The type cast to ActionRequestMessageRetry is incorrect and unnecessary. The publishRetryMessage function accepts ActionRequestMessage as its parameter (line 16 in job-retry.ts), not ActionRequestMessageRetry. Additionally, ActionRequestMessageRetry requires retryCounter to be a non-optional number field, but message here is of type ActionRequestMessageSQS where retryCounter is optional. The cast should be removed since ActionRequestMessageSQS already extends ActionRequestMessage.
Suggested change:

    - await publishRetryMessage(message as ActionRequestMessageRetry);
    + await publishRetryMessage(message);
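For context, a simplified sketch of the type relationship described in the comment above; the field lists are trimmed and assumed rather than copied from the actual definitions in `job-retry.ts` and `scale-up.ts`:

```ts
// Assumed, simplified shapes; only the fields relevant to the comment above.
interface ActionRequestMessage {
  id: number;
  eventType: string;
  retryCounter?: number; // optional on the base message
}

// SQS variant: retryCounter stays optional, so it already satisfies the
// parameter type of publishRetryMessage without a cast.
interface ActionRequestMessageSQS extends ActionRequestMessage {
  messageId?: string; // assumed SQS-specific field
}

// Retry variant: retryCounter becomes required, which is why casting an SQS
// message to this type is misleading.
interface ActionRequestMessageRetry extends ActionRequestMessage {
  retryCounter: number;
}

declare function publishRetryMessage(message: ActionRequestMessage): Promise<void>;
```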
    it('calls publishRetryMessage even when maximum runners is reached', async () => {
      process.env.RUNNERS_MAXIMUM_COUNT = '0'; // No runners can be created

      const messages = createTestMessages(2);

      await scaleUpModule.scaleUp(messages);

      // publishRetryMessage should still be called even though no runners will be created
      expect(mockPublishRetryMessage).toHaveBeenCalledTimes(2);
      expect(createRunner).not.toHaveBeenCalled();
    });
Copilot AI commented on Dec 18, 2025
The test sets RUNNERS_MAXIMUM_COUNT = '0' but doesn't verify that listEC2Runners is called. According to the scale-up logic (lines 370-371 in scale-up.ts), when maximumRunners is not -1, listEC2Runners should be called to get the current runner count. The test should verify this behavior is occurring as expected.
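A minimal sketch of the extra assertion Copilot is asking for, assuming `listEC2Runners` is imported and mocked in this test file the same way the other helpers are:

```ts
// Hypothetical addition to the test shown above: with RUNNERS_MAXIMUM_COUNT
// set (i.e. not -1), the current runner count should have been queried.
expect(listEC2Runners).toHaveBeenCalled();
```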
    }

    scaleUp++;
    await publishRetryMessage(message as ActionRequestMessageRetry);
Good catch, the call got lost in the PR that introduced the batches.
I have some doubts about publishing the message here, though. At this point no runner is created; only the count is increased. On the other hand, we cannot match the created runners with the event.
I think it would be a bit safer to call publishRetryMessage at the end for each message that is not marked as invalid. The invalid ones go back to the queue and will be retried anyway, so publishing here means more retry messages get created for the same event.
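Roughly what that alternative could look like as a separate pass at the end of `scaleUp`. This is only a sketch; the `invalidMessageIds` bookkeeping and the `messageId` field are assumptions, not taken from the existing code:

```ts
// Hypothetical restructuring: mark messages as invalid during the main loop,
// then publish retries only for the remaining ones once runner creation is done.
type QueuedMessage = { messageId: string };

async function publishRetryMessage(message: QueuedMessage): Promise<void> {
  // stand-in for the real job-retry publisher
}

async function publishRetriesForValidMessages(
  messages: QueuedMessage[],
  invalidMessageIds: Set<string>, // filled earlier when a message is marked invalid
): Promise<void> {
  for (const message of messages) {
    if (!invalidMessageIds.has(message.messageId)) {
      await publishRetryMessage(message);
    }
  }
}
```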
@copilot what do you think?
I'm not sure it's worth adding all that additional logic (and an additional for-loop) to this function; it would make it considerably more complex as well.
If a message is marked as invalid, worst case it goes through the retry mechanism, ends up in the same scale-up function, and is still an invalid message, leading to minimal waste.
IMO that's a good trade-off to avoid making this function even more complex.
What do you think @npalm?
iainlane left a comment
Cheers - right now I just had one request!
    // No runners will be created, so skip calling the EC2 API.
    - if (missingInstanceCount === scaleUp) {
    + if (newRunners <= 0) {
Would you say this is a separate fix? I think it'd be easier to get in via its own dedicated PR (with tests) if that's possible 👍
This pull request adds comprehensive tests for the retry mechanism in the `scaleUp` functionality and reintroduces the `publishRetryMessage` call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and retry mechanism enhancements:
- New tests in `scale-up.test.ts` cover scenarios where `publishRetryMessage` should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.

Other code updates:
- The skip condition for runner creation now checks `newRunners <= 0` instead of comparing counts, improving clarity and correctness.

Example scenarios for the above bug
Scenario 1
Scenario 2
Scenario 3
We tested this in our staging environment and verified it's working.
Closes #4960