
Conversation

@stuartp44
Contributor

This pull request adds comprehensive tests for the retry mechanism in the scaleUp functionality and reintroduces the publishRetryMessage call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and Retry Mechanism Enhancements:

  • Added a new test suite "Retry mechanism tests" in scale-up.test.ts to cover scenarios where publishRetryMessage should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.

Other Code Updates:

  • Fixed logic to skip runner creation if no new runners are needed by checking if newRunners <= 0 instead of comparing counts, improving clarity and correctness.
Example scenarios for the above bug (a short TypeScript sketch of the guard follows the scenarios):

Scenario 1

  • Admin sets RUNNERS_MAXIMUM_COUNT=20
  • System scales up to 15 active runners
  • Admin reduces RUNNERS_MAXIMUM_COUNT=10 (cost control, policy change)
  • Before those 15 runners terminate, new jobs arrive
  • Bug triggers: newRunners = Math.min(scaleUp, 10-15) = -5
  • Code tries to call createRunners({numberOfRunners: -5}) and fails

Scenario 2

  • RUNNERS_MAXIMUM_COUNT=5
  • Someone manually launches 8 EC2 instances with runner tags
  • New jobs arrive
  • Bug triggers: newRunners = Math.min(2, 5-8) = -3
  • Code tries to call createRunners({numberOfRunners: -3}) and fails

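A minimal sketch of the corrected guard in plain TypeScript, using hypothetical names for the inputs (the real variables live in scale-up.ts and may differ in detail):

```typescript
// Hypothetical helper illustrating the guard; not the actual scale-up.ts code.
function runnersToCreate(scaleUp: number, maximumRunners: number, currentRunnerCount: number): number {
  const newRunners =
    maximumRunners === -1 ? scaleUp : Math.min(scaleUp, maximumRunners - currentRunnerCount);
  // The old guard (missingInstanceCount === scaleUp) let negative values through;
  // the new guard skips runner creation whenever newRunners <= 0.
  return newRunners <= 0 ? 0 : newRunners;
}

// Scenario 1: maximum lowered from 20 to 10 while 15 runners are still active.
console.log(runnersToCreate(5, 10, 15)); // 0 instead of attempting -5
// Scenario 2: maximum of 5 with 8 manually launched tagged instances.
console.log(runnersToCreate(2, 5, 8)); // 0 instead of attempting -3
```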

We tested this in our staging environment and verified it's working.

Closes #4960

@stuartp44 stuartp44 requested a review from a team as a code owner December 18, 2025 10:09
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

Co-authored-by: Brend Smits <brend.smits@philips.com>
@stuartp44 stuartp44 added the bug Something isn't working label Dec 18, 2025
@npalm npalm requested a review from Copilot December 18, 2025 20:20
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.

Key Changes

  • Re-introduced the publishRetryMessage call in the scale-up loop to ensure retry messages are published for queued jobs
  • Fixed the condition for skipping runner creation from missingInstanceCount === scaleUp to newRunners <= 0, preventing attempts to create negative numbers of runners
  • Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File descriptions:

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts: imports publishRetryMessage, calls it for each queued message, and fixes the skip condition to handle negative newRunners values
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts: adds mock setup for publishRetryMessage and a new test suite with 7 tests covering retry mechanism behavior in various scenarios
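
For context, a rough sketch of what the mock setup for publishRetryMessage might look like, assuming Jest-style mocking and a ./job-retry module (the repository's actual test setup and mocking library may differ):

```typescript
import { publishRetryMessage } from './job-retry';

// Replace the real job-retry module so the tests can observe the calls.
jest.mock('./job-retry', () => ({
  publishRetryMessage: jest.fn(),
}));

const mockPublishRetryMessage = publishRetryMessage as jest.Mock;

beforeEach(() => {
  mockPublishRetryMessage.mockReset();
  mockPublishRetryMessage.mockResolvedValue(undefined);
});
```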


}

scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);

Copilot AI Dec 18, 2025


The type cast to ActionRequestMessageRetry is incorrect and unnecessary. The publishRetryMessage function accepts ActionRequestMessage as its parameter (line 16 in job-retry.ts), not ActionRequestMessageRetry. Additionally, ActionRequestMessageRetry requires retryCounter to be a non-optional number field, but message here is of type ActionRequestMessageSQS where retryCounter is optional. The cast should be removed since ActionRequestMessageSQS already extends ActionRequestMessage.

Suggested change
- await publishRetryMessage(message as ActionRequestMessageRetry);
+ await publishRetryMessage(message);
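
A condensed sketch of the type relationship the comment describes; fields other than retryCounter are placeholders, and the real interfaces in the codebase are more detailed:

```typescript
interface ActionRequestMessage {
  id: number;
  eventType: string;
}

interface ActionRequestMessageRetry extends ActionRequestMessage {
  retryCounter: number; // required on the retry variant
}

interface ActionRequestMessageSQS extends ActionRequestMessage {
  retryCounter?: number; // optional on messages read from SQS
}

// publishRetryMessage only needs the base type, so an ActionRequestMessageSQS
// value can be passed directly; casting it to ActionRequestMessageRetry claims
// a required retryCounter that may not actually be present.
declare function publishRetryMessage(message: ActionRequestMessage): Promise<void>;
```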

Comment on lines +1752 to +1762
it('calls publishRetryMessage even when maximum runners is reached', async () => {
process.env.RUNNERS_MAXIMUM_COUNT = '0'; // No runners can be created

const messages = createTestMessages(2);

await scaleUpModule.scaleUp(messages);

// publishRetryMessage should still be called even though no runners will be created
expect(mockPublishRetryMessage).toHaveBeenCalledTimes(2);
expect(createRunner).not.toHaveBeenCalled();
});

Copilot AI Dec 18, 2025


The test sets RUNNERS_MAXIMUM_COUNT = '0' but doesn't verify that listEC2Runners is called. According to the scale-up logic (lines 370-371 in scale-up.ts), when maximumRunners is not -1, listEC2Runners should be called to get the current runner count. The test should verify this behavior is occurring as expected.
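
A possible way to extend the test along those lines, assuming listEC2Runners is mocked earlier in the file as the other scale-up tests suggest:

```typescript
it('calls publishRetryMessage even when maximum runners is reached', async () => {
  process.env.RUNNERS_MAXIMUM_COUNT = '0'; // No runners can be created

  const messages = createTestMessages(2);

  await scaleUpModule.scaleUp(messages);

  expect(mockPublishRetryMessage).toHaveBeenCalledTimes(2);
  expect(createRunner).not.toHaveBeenCalled();
  // Added per the review comment: the current runner count should still be looked up.
  expect(listEC2Runners).toHaveBeenCalled();
});
```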

}

scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);
Member


Good catch, the call got lost in the PR that introduced the batches.

I have some doubts about publishing the message here. At this point no runner is created; only the count is increased. On the other hand, we cannot match the created runners with the event.

I think it would be a bit safer to call publishRetryMessage at the end for each message that is not marked as invalid, since the invalid ones go back to the queue and will be retried anyway. Publishing for them here as well leads to more retry messages being created for the same event.
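
A rough sketch of that alternative, assuming the main loop records which messages were marked invalid (the names below are hypothetical):

```typescript
// After the loop that validates messages and increments scaleUp:
for (const message of messages) {
  // Invalid messages go back to the queue and are retried by SQS itself,
  // so only publish retry messages for the ones that were accepted.
  if (!invalidMessageIds.has(message.id)) {
    await publishRetryMessage(message);
  }
}
```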

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot what do you think?

Contributor


I'm not sure it's worth adding all that additional logic (and an additional for-loop) to this function. It would make it considerably more complex as well.

If a message is marked as invalid, the worst case is that it goes through the retry mechanism, ends up back in the same scale-up function, and is still an invalid message, leading to minimal waste.
IMO a good trade-off to avoid making this function even more complex.

What do you think @npalm?

Contributor

Copilot AI commented Dec 18, 2025

@npalm I've opened a new pull request, #4966, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

@iainlane iainlane left a comment


Cheers - right now I just had one request!


// No runners will be created, so skip calling the EC2 API.
- if (missingInstanceCount === scaleUp) {
+ if (newRunners <= 0) {
Contributor


Would you say this is a separate fix? I think it'd be easier to get in via its own dedicated PR (with tests) if that's possible 👍


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

job-retry mechanism broken in version 7.0.0

5 participants