ScaleUp losing events when no capacity available #2925

@tcsc

We are using ephemeral arm64 builders and are intermittently seeing builds stall due to a lack of capacity, even with on-demand instances.

Every time this happens, the build permanently stalls and has to be manually cancelled.

From a bit of digging it looks like adding InsufficientInstanceCapacity to the list of what's considered a "Scaling Error" should fix this.
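
A minimal sketch of what that change could look like, assuming the runner code checks each fleet `ErrorCode` against a fixed list. The names `FleetError`, `SCALE_ERROR_CODES` and `isScalingError` are hypothetical and not taken from the project; only `InsufficientInstanceCapacity` comes from the log below, and the other list entry is an illustrative placeholder.

```typescript
// Hypothetical shape of one entry in the fleet response `Errors` array,
// reduced to the two fields that matter here.
interface FleetError {
  ErrorCode?: string;
  ErrorMessage?: string;
}

// Assumed list of codes treated as a "Scaling Error".
// Only InsufficientInstanceCapacity is taken from this issue;
// RequestLimitExceeded is a placeholder for whatever the list already holds.
const SCALE_ERROR_CODES: ReadonlySet<string> = new Set([
  'RequestLimitExceeded',
  'InsufficientInstanceCapacity',
]);

// Returns true if at least one fleet error carries a code from the list.
function isScalingError(errors: FleetError[]): boolean {
  return errors.some(
    (e) => e.ErrorCode !== undefined && SCALE_ERROR_CODES.has(e.ErrorCode),
  );
}
```

With a check like this, the `Errors` array in the log below would be classified as a scaling error instead of falling through to the "not recognized" branch.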

Redacted CloudWatch log:

2023-02-06T10:27:56.604+11:00	2023-02-05 23:27:56.494 WARN [runners:34ecb39a-ae35-5506-b158-efc0938129a8 index.js:120365 createRunner] No instances created by fleet request. Check configuration! Response:
2023-02-06T10:27:56.604+11:00	{
2023-02-06T10:27:56.604+11:00	FleetId: 'fleet-92368284-5b0d-44bc-0e18-af80f19be5e5',
2023-02-06T10:27:56.604+11:00	Errors: [
2023-02-06T10:27:56.604+11:00	{
2023-02-06T10:27:56.604+11:00	LaunchTemplateAndOverrides: {
2023-02-06T10:27:56.604+11:00	LaunchTemplateSpecification: {
2023-02-06T10:27:56.604+11:00	LaunchTemplateId: 'lt-REDACTED',
2023-02-06T10:27:56.604+11:00	Version: '4'
2023-02-06T10:27:56.604+11:00	},
2023-02-06T10:27:56.604+11:00	Overrides: {
2023-02-06T10:27:56.604+11:00	InstanceType: 'c6gd.8xlarge',
2023-02-06T10:27:56.604+11:00	SubnetId: 'subnet-REDACTED'
2023-02-06T10:27:56.604+11:00	}
2023-02-06T10:27:56.604+11:00	},
2023-02-06T10:27:56.604+11:00	Lifecycle: 'on-demand',
2023-02-06T10:27:56.604+11:00	ErrorCode: 'InsufficientInstanceCapacity',
2023-02-06T10:27:56.604+11:00	ErrorMessage: 'We currently do not have sufficient c6gd.8xlarge capacity in the Availability Zone you requested (REDACTED). Our system will be working on provisioning additional capacity. You can currently get c6gd.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c, us-west-2d.'
2023-02-06T10:27:56.604+11:00	}
2023-02-06T10:27:56.604+11:00	],
2023-02-06T10:27:56.604+11:00	Instances: []
2023-02-06T10:27:56.604+11:00	}
2023-02-06T10:27:56.622+11:00	2023-02-05 23:27:56.621 WARN [runners:34ecb39a-ae35-5506-b158-efc0938129a8 index.js:120384 createRunner] Create fleet failed, error not recognized as scaling error.
2023-02-06T10:27:56.622+11:00	[
2023-02-06T10:27:56.622+11:00	{
2023-02-06T10:27:56.622+11:00	LaunchTemplateAndOverrides: {
2023-02-06T10:27:56.622+11:00	LaunchTemplateSpecification: {
2023-02-06T10:27:56.622+11:00	LaunchTemplateId: 'lt-REDACTED',
2023-02-06T10:27:56.622+11:00	Version: '4'
2023-02-06T10:27:56.622+11:00	},
2023-02-06T10:27:56.622+11:00	Overrides: {
2023-02-06T10:27:56.622+11:00	InstanceType: 'c6gd.8xlarge',
2023-02-06T10:27:56.622+11:00	SubnetId: 'subnet-REDACTED'
2023-02-06T10:27:56.622+11:00	}
2023-02-06T10:27:56.622+11:00	},
2023-02-06T10:27:56.622+11:00	Lifecycle: 'on-demand',
2023-02-06T10:27:56.622+11:00	ErrorCode: 'InsufficientInstanceCapacity',
2023-02-06T10:27:56.622+11:00	ErrorMessage: 'We currently do not have sufficient c6gd.8xlarge capacity in the Availability Zone you requested (REDACTED). Our system will be working on provisioning additional capacity. You can currently get c6gd.8xlarge capacity by not specifying an Availability Zone in your request or choosing REDACTED.'
2023-02-06T10:27:56.622+11:00	}
2023-02-06T10:27:56.622+11:00	]
2023-02-06T10:27:56.622+11:00	2023-02-05 23:27:56.622 WARN [scale-runners:34ecb39a-ae35-5506-b158-efc0938129a8 index.js:120511 Runtime.handler] Ignoring error: Create fleet failed, no instance created. {"runnerType":"Org","runnerOwner":"gravitational","event":"workflow_job","id":"11123528452"}
2023-02-06T10:27:56.623+11:00	END RequestId: 34ecb39a-ae35-5506-b158-efc0938129a8
2023-02-06T10:27:56.623+11:00   REPORT RequestId: 34ecb39a-ae35-5506-b158-efc0938129a8	Duration: 2269.85 ms	Billed Duration: 2270 ms	Memory Size: 512 MB	Max Memory Used: 219 MB	
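
The last few log lines suggest why the build stalls: the error is not classified as a scaling error, the handler logs "Ignoring error", and the invocation ends normally, so nothing retries the event. The sketch below illustrates that pattern under stated assumptions. `RetryableScaleError`, `Outcome` and `handleEvent` are hypothetical names, and the real handler may be structured differently.

```typescript
// Hypothetical error type marking failures that are worth retrying.
class RetryableScaleError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'RetryableScaleError';
    // Keep `instanceof` working when compiling to older targets.
    Object.setPrototypeOf(this, RetryableScaleError.prototype);
  }
}

type Outcome = 'done' | 'retry' | 'dropped';

// Hypothetical handler wrapper; `scaleUp` stands in for whatever creates the runner.
function handleEvent(scaleUp: () => void): Outcome {
  try {
    scaleUp();
    return 'done';
  } catch (e) {
    if (e instanceof RetryableScaleError) {
      // A real handler would rethrow here so the event gets another attempt.
      return 'retry';
    }
    // Any other error is logged and swallowed, so the event is lost.
    console.warn(`Ignoring error: ${(e as Error).message}`);
    return 'dropped';
  }
}
```

In this model, an `InsufficientInstanceCapacity` failure raised as a plain `Error` ends up as `'dropped'`, which matches the stalled builds described above.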
