Add a third WorkloadManager implementation that creates sandboxes via the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which takes priority over Kubernetes and Docker providers.
The fetch() call had no timeout, causing infinite hangs when the gateway accepted requests but never returned responses. Adds AbortSignal.timeout (30s) and consolidates all logging into a single structured event per create() call with timing, status, and error context.
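A minimal sketch of the timeout described above, not the PR's actual code. `postWithTimeout`, `classifyFetchError`, `FetchLike`, and the `CreateOutcome` shape are hypothetical names; only `fetch` and `AbortSignal.timeout` are real platform APIs. The fetch implementation is injectable only so the sketch can be exercised without a running gateway.

```typescript
// Hypothetical result shape for illustration; not the gateway's real contract.
type CreateOutcome =
  | { ok: true; status: number }
  | { ok: false; reason: "timeout" | "fetch" | "http"; status?: number };

// Minimal structural type so a stub can stand in for the global fetch.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal }
) => Promise<{ ok: boolean; status: number }>;

function classifyFetchError(error: unknown): "timeout" | "fetch" {
  // AbortSignal.timeout() aborts with a DOMException whose name is "TimeoutError".
  const name =
    typeof error === "object" && error !== null ? (error as { name?: unknown }).name : undefined;
  return name === "TimeoutError" ? "timeout" : "fetch";
}

async function postWithTimeout(
  url: string,
  body: unknown,
  timeoutMs = 30_000,
  fetchImpl: FetchLike = fetch
): Promise<CreateOutcome> {
  try {
    const res = await fetchImpl(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      // Without a signal, a gateway that accepts the request but never responds hangs the caller.
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok
      ? { ok: true, status: res.status }
      : { ok: false, reason: "http", status: res.status };
  } catch (error) {
    return { ok: false, reason: classifyFetchError(error) };
  }
}
```

Returning a tagged outcome rather than throwing keeps the timeout and network-failure cases distinguishable for whatever logging the caller does.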
Emit a single canonical log line in a finally block instead of scattered log calls at each early return. Adds business context (envId, envType, orgId, projectId, deploymentVersion, machine) and instanceName to the event. Always emits at info level with ok=true/false for queryability.
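The canonical-log-line pattern from this commit message can be sketched roughly as follows. `withWideEvent` and `WideEvent` are hypothetical names, and swallowing the error to return `undefined` is an assumption made for the sketch rather than a statement about how `create()` behaves.

```typescript
type WideEvent = Record<string, unknown>;

async function withWideEvent<T>(
  base: WideEvent,
  emit: (event: WideEvent) => void,
  fn: (event: WideEvent) => Promise<T>
): Promise<T | undefined> {
  // Start pessimistic: ok only flips to true once fn() has resolved.
  const event: WideEvent = { ...base, ok: false };
  const startedAt = Date.now();
  try {
    // fn may attach more context (status, instanceName, ...) to the same event object.
    const result = await fn(event);
    event.ok = true;
    return result;
  } catch (error) {
    event.error = error instanceof Error ? error.message : String(error);
    return undefined;
  } finally {
    // Exactly one event per call, whichever path was taken above.
    event.durationMs = Date.now() - startedAt;
    emit(event);
  }
}
```

Because the emit happens in `finally`, early returns and thrown errors cannot skip it, which is what makes `ok=true/false` reliable to query on.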
Pass business context (runId, envId, orgId, projectId, machine, etc.) as metadata on CreateSandboxRequest instead of relying on env vars. This enables wide event logging in the compute stack without parsing env or leaking secrets.
Passes machine preset cpu and memory as top-level fields on the CreateSandboxRequest so the compute stack can use them for admission control and resource allocation.
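One way to picture the request shape these two commits describe is sketched below. Every type and field name here (`CreateRequestSketch`, `MachinePresetSketch`, `RunContextSketch`, `buildCreateRequest`, and the exact keys) is an assumption for illustration; the gateway's real schema is not shown in this conversation.

```typescript
// Illustrative types only; the actual CreateSandboxRequest fields may differ.
type MachinePresetSketch = { name: string; cpu: number; memory: number };
type RunContextSketch = { runId: string; envId: string; orgId: string; projectId: string };

type CreateRequestSketch = {
  image: string;
  env: Record<string, string>; // runtime configuration, may contain secrets
  cpu: number; // top-level so the compute stack can use it for admission control
  memory: number;
  metadata: Record<string, string>; // business context for wide-event logging
};

function buildCreateRequest(
  image: string,
  env: Record<string, string>,
  machine: MachinePresetSketch,
  context: RunContextSketch
): CreateRequestSketch {
  return {
    image,
    env,
    cpu: machine.cpu,
    memory: machine.memory,
    // Context travels beside env rather than inside it, so the receiver never has to parse env.
    metadata: {
      runId: context.runId,
      envId: context.envId,
      orgId: context.orgId,
      projectId: context.projectId,
      machine: machine.name,
    },
  };
}
```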
Thread timing context from queue consumer through to the compute workload manager's wide event:
- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration
All fields are optional to avoid breaking existing consumers.
- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore
…re request
Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.
runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.
Gates snapshot/restore behaviour independently of compute mode. When disabled, VMs won't receive the metadata URL and suspend/restore are no-ops. Defaults to off so compute mode can be used without snapshots.
🦋 Changeset detected
Latest commit: 9925c72
The changes in this PR will be included in the next version bump. This PR includes changesets to release 29 packages.
Walkthrough
Adds end-to-end compute support: a new internal package
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
…nabled
Remove the silent `localhost` fallback for the snapshot callback URL, which would be unreachable from external compute gateways. Add env validation and a runtime guard matching the existing metadata URL pattern.
Delay compute snapshot requests to avoid wasted work on short-lived waitpoints (e.g. triggerAndWait resolving in <5s). Configurable via COMPUTE_SNAPSHOT_DELAY_MS (default 5s).
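The delay can be pictured as a cancellable timer per runner, sketched below. `DelayedSnapshots` is a hypothetical class built on plain `setTimeout`; the supervisor's own scheduling structure may well differ, so treat this as an illustration of the idea rather than the implementation.

```typescript
class DelayedSnapshots {
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();
  private readonly delayMs: number;
  private readonly dispatch: (runnerId: string) => void;

  constructor(delayMs: number, dispatch: (runnerId: string) => void) {
    this.delayMs = delayMs;
    this.dispatch = dispatch;
  }

  // Queue a snapshot; it only fires if nothing cancels it within delayMs.
  schedule(runnerId: string): void {
    this.cancel(runnerId); // a re-schedule replaces any earlier pending timer
    const timer = setTimeout(() => {
      this.timers.delete(runnerId);
      this.dispatch(runnerId);
    }, this.delayMs);
    this.timers.set(runnerId, timer);
  }

  // Called when the waitpoint resolves early; returns true if a snapshot was avoided.
  cancel(runnerId: string): boolean {
    const timer = this.timers.get(runnerId);
    if (timer === undefined) return false;
    clearTimeout(timer);
    this.timers.delete(runnerId);
    return true;
  }

  get pending(): number {
    return this.timers.size;
  }
}
```

With a 5s delay, a waitpoint that resolves after 2s calls `cancel()` and no snapshot request is ever sent.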
♻️ Duplicate comments (1)
apps/webapp/app/v3/services/computeTemplateCreation.server.ts (1)
109-131: ⚠️ Potential issue | 🟠 Major
Gateway misconfiguration can fail open for MICROVM projects.
When `this.client` is undefined (gateway URL not configured), `resolveMode` returns `"skip"` at line 114 before checking whether the project's `workloadType` is `MICROVM`. This allows MICROVM deployments to finalize without template creation if the webapp is misconfigured.
Consider moving the client check after the workloadType check, failing explicitly for required projects:
🛠️ Proposed fix to fail closed for required mode
 async resolveMode(
   projectId: string,
   prisma: PrismaClientOrTransaction
 ): Promise<TemplateCreationMode> {
-  if (!this.client) {
-    return "skip";
-  }
-
   const project = await prisma.project.findFirst({
     where: { id: projectId },
     select: {
       defaultWorkerGroup: {
         select: { workloadType: true },
       },
       organization: {
         select: { featureFlags: true },
       },
     },
   });

   if (project?.defaultWorkerGroup?.workloadType === "MICROVM") {
+    if (!this.client) {
+      throw new Error("Compute gateway not configured but required for MICROVM workload");
+    }
     return "required";
   }

   const flag = makeFlag(prisma);
   const hasComputeAccess = await flag({
     key: FEATURE_FLAG.hasComputeAccess,
     defaultValue: false,
     overrides: (project?.organization?.featureFlags as Record<string, unknown>) ?? {},
   });

   if (hasComputeAccess) {
+    if (!this.client) {
+      throw new Error("Compute gateway not configured but required for project with compute access");
+    }
     return "required";
   }

+  if (!this.client) {
+    return "skip";
+  }
+
   const rolloutPct = Number(env.COMPUTE_TEMPLATE_SHADOW_ROLLOUT_PCT ?? "0");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/webapp/app/v3/services/computeTemplateCreation.server.ts` around lines 109 - 131, In resolveMode, the early return when this.client is falsy causes MICROVM projects to be skipped; change the logic in resolveMode so you first fetch the project and check project?.defaultWorkerGroup?.workloadType === "MICROVM" (using the existing project query) and return "required" for MICROVM regardless of this.client, then after that check handle the case where this.client is undefined by returning "skip" for non-MICROVM projects; update the resolveMode function (referencing resolveMode and the project.defaultWorkerGroup.workloadType check) so MICROVM cannot bypass template creation when the gateway client is not configured.
ℹ️ Review info
📒 Files selected for processing (5)
- apps/supervisor/src/workloadManager/compute.ts
- apps/webapp/app/v3/services/computeTemplateCreation.server.ts
- internal-packages/compute/src/client.ts
- internal-packages/compute/src/imageRef.ts
- internal-packages/compute/src/index.ts
✅ Files skipped from review due to trivial changes (1)
- internal-packages/compute/src/imageRef.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- internal-packages/compute/src/index.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
**/*.{ts,tsx}: For apps and internal packages (`apps/*`, `internal-packages/*`), use `pnpm run typecheck --filter <package>` for verification; never use `build`, as it proves almost nothing about correctness
Use testcontainers helpers (`redisTest`, `postgresTest`, `containerTest` from `@internal/testcontainers`) for integration tests with Redis and PostgreSQL instead of mocking
When writing Trigger.dev tasks, always import from `@trigger.dev/sdk` - never use `@trigger.dev/sdk/v3` or the deprecated `client.defineJob`
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Use pnpm for package management in this monorepo (version 10.23.0) with Turborepo for orchestration - run commands from root with `pnpm run`
Add crumbs as you write code for debug tracing using `// @crumbs` comments or `// #region @crumbs` blocks - they stay on the branch throughout development and are stripped via `agentcrumbs strip` before merge
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
apps/webapp/app/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: When importing from `@trigger.dev/core` in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/v3/services/**/*.server.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Organize services in the webapp following the pattern `app/v3/services/*/*.server.ts`
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
apps/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
When modifying only server components (`apps/webapp/`, `apps/supervisor/`, etc.) with no package changes, add a `.server-changes/` file instead of a changeset
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
apps/webapp/app/v3/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
In the webapp v3 directory, only modify V2 code paths when encountering V1/V2 branching in services - all new work uses Run Engine 2.0 (`@internal/run-engine`) and redis-worker, not legacy V1 engine code
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/**/*.{ts,tsx,server.ts}
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
Access environment variables via the `env` export from `app/env.server.ts`. Never use `process.env` directly
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/v3/services/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
Only modify V2 code paths when editing services that branch on `RunEngineVersion` to support both V1 and V2 (e.g., `cancelTaskRun.server.ts`, `batchTriggerV3.server.ts`)
Files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/**/*.{js,ts}
📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`
Files:
apps/supervisor/src/workloadManager/compute.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-03-02T12:43:34.140Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/cli-v3/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:34.140Z
Learning: Applies to packages/cli-v3/src/commands/deploy.ts : Implement `deploy.ts` command in `src/commands/` for production deployment
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
📚 Learning: 2026-03-26T10:02:22.373Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3254
File: apps/webapp/app/services/platformNotifications.server.ts:363-385
Timestamp: 2026-03-26T10:02:22.373Z
Learning: In `triggerdotdev/trigger.dev`, the `getNextCliNotification` fallback in `apps/webapp/app/services/platformNotifications.server.ts` intentionally uses `prisma.orgMember.findFirst` (single org) when no `projectRef` is provided. This is acceptable for v1 because the CLI (`dev` and `login` commands) always passes `projectRef` in normal usage, making the fallback a rare edge case. Do not flag the single-org fallback as a multi-org correctness bug in this file.
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-26T23:24:31.165Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
internal-packages/compute/src/client.ts
📚 Learning: 2026-03-10T17:56:20.938Z
Learnt from: samejr
Repo: triggerdotdev/trigger.dev PR: 3201
File: apps/webapp/app/v3/services/setSeatsAddOn.server.ts:25-29
Timestamp: 2026-03-10T17:56:20.938Z
Learning: Do not implement local userId-to-organizationId authorization checks inside org-scoped service classes (e.g., SetSeatsAddOnService, SetBranchesAddOnService) in the web app. Rely on route-layer authentication (requireUserId(request)) and org membership enforcement via the _app.orgs.$organizationSlug layout route. Any userId/organizationId that reaches these services from org-scoped routes has already been validated. Apply this pattern across all org-scoped services to avoid redundant auth checks and maintain consistency.
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-27T11:45:37.910Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-23T06:24:14.566Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:14.566Z
Learning: Applies to Dockerfile : When updating Docker image references, always use multiplatform/index digests, not architecture-specific digests, to prevent CI failures on different build environments
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-26T23:24:51.644Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:817-825
Timestamp: 2026-03-26T23:24:51.644Z
Learning: In `apps/supervisor/src/workloadServer/index.ts` (`WorkloadServer.stop()`), pending items returned by `this.snapshotDelayWheel?.stop()` are intentionally logged and dropped rather than dispatched. The entire supervisor is shutting down, so the snapshot callback URL would point at a dead server; dispatching snapshots during teardown would create orphaned gateway callbacks. Runners detect the supervisor is gone and reconnect to a new supervisor instance, which re-triggers the snapshot workflow. Do not flag the drop-on-shutdown behavior as a bug.
Applied to files:
apps/supervisor/src/workloadManager/compute.ts
🔇 Additional comments (6)
internal-packages/compute/src/client.ts (1)
49-70: LGTM! Type safety issue has been addressed.
The `post()` method now correctly returns `Promise<T | undefined>`, and `instances.create()` properly validates the response at lines 121-124, throwing a clear error when the gateway returns no body. This addresses the previous concern about callers silently dereferencing `undefined`.
Also applies to: 117-126
apps/supervisor/src/workloadManager/compute.ts (3)
40-151: LGTM! Clean implementation of workload lifecycle management.
The `create()` method properly:
- Builds environment variables with conditional flags for warm-start, metadata, heartbeat, and snapshot polling
- Uses `tryCatch` for clean error handling with typed error classification (timeout vs fetch)
- Emits a comprehensive wide-event log in `finally` for observability
- Optionally emits OTLP trace spans when tracing is enabled
The image digest stripping at line 85 is now consistent with `ComputeTemplateCreationService` since both use `stripImageDigest` from `@internal/compute`.
153-192: LGTM! Consistent error handling pattern.
Both `snapshot()` and `deleteInstance()` follow a clean pattern with `tryCatch`, proper error logging, and boolean return values for caller-side handling.
252-312: LGTM! Restore implementation follows architecture correctly.
The metadata object deliberately excludes `TRIGGER_METADATA_URL` because, per the established architecture, this URL is baked into the instance environment at creation time and preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Based on learnings: "TRIGGER_METADATA_URL does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore."
apps/webapp/app/v3/services/computeTemplateCreation.server.ts (2)
48-67: LGTM! Shadow mode error handling fixed.
The `.then((result) => {...}).catch(...)` pattern correctly handles both the expected `{ success: false, error }` return value and any unexpected thrown exceptions.
152-176: LGTM! Clean template creation with proper error handling.
The method catches exceptions, logs them, and returns a structured result that callers can handle appropriately. Image digest stripping is consistent with the supervisor's `ComputeWorkloadManager`.
…n, OTLP endpoint, snapshot concurrency
…d version in compute package
…nal/compute, add zod pinning rule
startTimeUnixNano: String(span.startTimeMs * 1_000_000),
endTimeUnixNano: String(span.endTimeMs * 1_000_000),
🟡 OTLP timestamp conversion overflows Number.MAX_SAFE_INTEGER for real epoch values
The buildPayload function converts millisecond timestamps to nanoseconds via String(span.startTimeMs * 1_000_000). In production, startTimeMs is a real epoch value from Date.now() (~1.7e12), so the product (~1.7e18) exceeds Number.MAX_SAFE_INTEGER (~9.0e15), causing IEEE 754 precision loss. The resulting nanosecond string will have incorrect trailing digits. The test at otlpTraceService.test.ts:82-83 only uses startTimeMs: 1000 (product 1e9, well within safe range), so it doesn't catch this. All callers pass real epoch timestamps — e.g. compute.ts:234 passes opts.dequeuedAt.getTime() - 1, and computeSnapshotService.ts:210 passes Date.now().
Fix approach
Use string concatenation or BigInt to avoid floating-point overflow:
String(BigInt(span.startTimeMs) * 1_000_000n) or
span.startTimeMs.toString() + "000000"
- startTimeUnixNano: String(span.startTimeMs * 1_000_000),
- endTimeUnixNano: String(span.endTimeMs * 1_000_000),
+ startTimeUnixNano: String(BigInt(span.startTimeMs) * 1_000_000n),
+ endTimeUnixNano: String(BigInt(span.endTimeMs) * 1_000_000n),
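As a standalone illustration of the BigInt approach suggested above, the conversion could be factored into a small helper. `msToUnixNano` is a hypothetical name, and the `Math.trunc` step is an added assumption to guard against fractional millisecond inputs, since `BigInt()` throws a `RangeError` for non-integer numbers.

```typescript
// Convert epoch milliseconds to a nanosecond string using exact integer arithmetic.
function msToUnixNano(epochMs: number): string {
  // Truncate first: BigInt(1.5) would throw a RangeError.
  return String(BigInt(Math.trunc(epochMs)) * 1_000_000n);
}

const sampleMs = 1_700_000_000_123;

// The floating-point product is far above Number.MAX_SAFE_INTEGER,
// so it cannot be relied on as an exact integer.
const floatProductIsSafe = Number.isSafeInteger(sampleMs * 1_000_000);

console.log(msToUnixNano(sampleMs), floatProductIsSafe);
```

The BigInt path never leaves exact integer arithmetic, so correctness does not depend on how the runtime rounds or prints large doubles.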
Adds the `ComputeWorkloadManager` for routing task execution through the compute gateway, including full checkpoint/restore support.
Changes
Compute workload manager (`apps/supervisor/src/workloadManager/compute.ts`)
Supervisor wiring (`apps/supervisor/src/index.ts`)
Workload server (`apps/supervisor/src/workloadServer/index.ts`)
submitSuspendCompletion
Env validation (`apps/supervisor/src/env.ts`)