
Conversation

@stuartp44
Contributor

This pull request adds comprehensive tests for the retry mechanism in the scaleUp functionality and reintroduces the publishRetryMessage call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and Retry Mechanism Enhancements:

  • Added a new test suite "Retry mechanism tests" in scale-up.test.ts to cover scenarios where publishRetryMessage should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.

Other Code Updates:

  • Fixed logic to skip runner creation if no new runners are needed by checking if newRunners <= 0 instead of comparing counts, improving clarity and correctness.
Example scenarios for the above bug (a short TypeScript sketch of the guard follows the scenarios):

Scenario 1

  • Admin sets RUNNERS_MAXIMUM_COUNT=20
  • System scales up to 15 active runners
  • Admin reduces RUNNERS_MAXIMUM_COUNT=10 (cost control, policy change)
  • Before those 15 runners terminate, new jobs arrive
  • Bug triggers: newRunners = Math.min(scaleUp, 10-15) = -5
  • Code tries to call createRunners({numberOfRunners: -5}) and fails

Scenario 2

  • RUNNERS_MAXIMUM_COUNT=5
  • Someone manually launches 8 EC2 instances with runner tags
  • New jobs arrive
  • Bug triggers: newRunners = Math.min(2, 5-8) = -3
  • Code tries to call createRunners({numberOfRunners: -3}) and fails

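A minimal sketch of the corrected guard in plain TypeScript, using hypothetical names for the inputs (the real variables live in scale-up.ts and may differ in detail):

```typescript
// Hypothetical helper illustrating the guard; not the actual scale-up.ts code.
function runnersToCreate(scaleUp: number, maximumRunners: number, currentRunnerCount: number): number {
  const newRunners =
    maximumRunners === -1 ? scaleUp : Math.min(scaleUp, maximumRunners - currentRunnerCount);
  // The old guard (missingInstanceCount === scaleUp) let negative values through;
  // the new guard skips runner creation whenever newRunners <= 0.
  return newRunners <= 0 ? 0 : newRunners;
}

// Scenario 1: maximum lowered from 20 to 10 while 15 runners are still active.
console.log(runnersToCreate(5, 10, 15)); // 0 instead of attempting -5
// Scenario 2: maximum of 5 with 8 manually launched tagged instances.
console.log(runnersToCreate(2, 5, 8)); // 0 instead of attempting -3
```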

We tested this in our staging environment and verified it's working.

Closes #4960

@stuartp44 stuartp44 requested a review from a team as a code owner December 18, 2025 10:09
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

Co-authored-by: Brend Smits <brend.smits@philips.com>
@stuartp44 stuartp44 added the bug Something isn't working label Dec 18, 2025
@npalm npalm requested a review from Copilot December 18, 2025 20:20
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.

Key Changes

  • Re-introduced the publishRetryMessage call in the scale-up loop to ensure retry messages are published for queued jobs
  • Fixed the condition for skipping runner creation from missingInstanceCount === scaleUp to newRunners <= 0, preventing attempts to create negative numbers of runners
  • Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File descriptions:

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts: imports publishRetryMessage, calls it for each queued message, and fixes the skip condition to handle negative newRunners values
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts: adds mock setup for publishRetryMessage and a new test suite with 7 tests covering retry mechanism behavior in various scenarios
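
For context, a rough sketch of what the mock setup for publishRetryMessage might look like, assuming Jest-style mocking and a ./job-retry module (the repository's actual test setup and mocking library may differ):

```typescript
import { publishRetryMessage } from './job-retry';

// Replace the real job-retry module so the tests can observe the calls.
jest.mock('./job-retry', () => ({
  publishRetryMessage: jest.fn(),
}));

const mockPublishRetryMessage = publishRetryMessage as jest.Mock;

beforeEach(() => {
  mockPublishRetryMessage.mockReset();
  mockPublishRetryMessage.mockResolvedValue(undefined);
});
```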


}

scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);

Copilot AI Dec 18, 2025


The type cast to ActionRequestMessageRetry is incorrect and unnecessary. The publishRetryMessage function accepts ActionRequestMessage as its parameter (line 16 in job-retry.ts), not ActionRequestMessageRetry. Additionally, ActionRequestMessageRetry requires retryCounter to be a non-optional number field, but message here is of type ActionRequestMessageSQS where retryCounter is optional. The cast should be removed since ActionRequestMessageSQS already extends ActionRequestMessage.

Suggested change
- await publishRetryMessage(message as ActionRequestMessageRetry);
+ await publishRetryMessage(message);
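
A condensed sketch of the type relationship the comment describes; fields other than retryCounter are placeholders, and the real interfaces in the codebase are more detailed:

```typescript
interface ActionRequestMessage {
  id: number;
  eventType: string;
}

interface ActionRequestMessageRetry extends ActionRequestMessage {
  retryCounter: number; // required on the retry variant
}

interface ActionRequestMessageSQS extends ActionRequestMessage {
  retryCounter?: number; // optional on messages read from SQS
}

// publishRetryMessage only needs the base type, so an ActionRequestMessageSQS
// value can be passed directly; casting it to ActionRequestMessageRetry claims
// a required retryCounter that may not actually be present.
declare function publishRetryMessage(message: ActionRequestMessage): Promise<void>;
```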

Comment on lines +1752 to +1762
it('calls publishRetryMessage even when maximum runners is reached', async () => {
process.env.RUNNERS_MAXIMUM_COUNT = '0'; // No runners can be created

const messages = createTestMessages(2);

await scaleUpModule.scaleUp(messages);

// publishRetryMessage should still be called even though no runners will be created
expect(mockPublishRetryMessage).toHaveBeenCalledTimes(2);
expect(createRunner).not.toHaveBeenCalled();
});

Copilot AI Dec 18, 2025


The test sets RUNNERS_MAXIMUM_COUNT = '0' but doesn't verify that listEC2Runners is called. According to the scale-up logic (lines 370-371 in scale-up.ts), when maximumRunners is not -1, listEC2Runners should be called to get the current runner count. The test should verify this behavior is occurring as expected.
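
A possible way to extend the test along those lines, assuming listEC2Runners is mocked earlier in the file as the other scale-up tests suggest:

```typescript
it('calls publishRetryMessage even when maximum runners is reached', async () => {
  process.env.RUNNERS_MAXIMUM_COUNT = '0'; // No runners can be created

  const messages = createTestMessages(2);

  await scaleUpModule.scaleUp(messages);

  expect(mockPublishRetryMessage).toHaveBeenCalledTimes(2);
  expect(createRunner).not.toHaveBeenCalled();
  // Added per the review comment: the current runner count should still be looked up.
  expect(listEC2Runners).toHaveBeenCalled();
});
```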

}

scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);
Member


Good catch, the call got lost in the PR that introduced the batches.

I have some doubts about publishing the message here. At this point no runner is created; only the count is increased. On the other hand, we cannot match the created runners with the event.

I think it would be a bit safer to call publishRetryMessage at the end for each message that is not marked as invalid, since the invalid ones go back to the queue and will be retried anyway. Publishing for them here as well leads to more retry messages being created for the same event.
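
A rough sketch of that alternative, assuming the main loop records which messages were marked invalid (the names below are hypothetical):

```typescript
// After the loop that validates messages and increments scaleUp:
for (const message of messages) {
  // Invalid messages go back to the queue and are retried by SQS itself,
  // so only publish retry messages for the ones that were accepted.
  if (!invalidMessageIds.has(message.id)) {
    await publishRetryMessage(message);
  }
}
```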

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot what do you think?

Contributor


I'm not sure it's worth adding all that additional logic (and an additional for-loop) to this function. It would make it considerably more complex as well.

If a message is marked as invalid, the worst case is that it goes through the retry mechanism, ends up back in the same scale-up function, and is still an invalid message, leading to minimal waste.
IMO a good trade-off to avoid making this function even more complex.

What do you think @npalm?

Contributor

Copilot AI commented Dec 18, 2025

@npalm I've opened a new pull request, #4966, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

@iainlane iainlane left a comment


Cheers - right now I just had one request!


// No runners will be created, so skip calling the EC2 API.
- if (missingInstanceCount === scaleUp) {
+ if (newRunners <= 0) {
Contributor


Would you say this is a separate fix? I think it'd be easier to get in via its own dedicated PR (with tests) if that's possible 👍


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

job-retry mechanism broken in version 7.0.0

5 participants