feat(supervisor): compute workload manager #3114

Open
nicktrn wants to merge 68 commits into main from feat/compute-workload-manager
@nicktrn
Collaborator

@nicktrn nicktrn commented Feb 23, 2026

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

  • Routes VM create, snapshot, delete, and restore through the compute gateway API
  • Wide event logging on create with full timing and context
  • Configurable gateway timeout, auth token, image digest stripping
  • Restore sends name, env override metadata, CPU and memory so the agent can inject them before the VM resumes
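
A minimal sketch of how the timeout and wide-event bullets could fit together, assuming native fetch with AbortSignal.timeout and a single log event emitted from a finally block. The function name, the FetchLike and WideEvent types, and the event field names are illustrative assumptions rather than the PR's actual code; only the POST /api/instances path and the 30s default are taken from the commit messages.

```typescript
// Hypothetical event shape: field names are illustrative, not the PR's real schema.
type WideEvent = {
  ok: boolean;
  durationMs: number;
  status?: number;
  error?: "timeout" | "fetch";
  [key: string]: unknown;
};

// Narrow fetch-like signature so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal }
) => Promise<{ ok: boolean; status: number }>;

async function createInstanceWithWideEvent(
  fetchImpl: FetchLike,
  gatewayUrl: string,
  body: Record<string, unknown>,
  context: Record<string, unknown>,
  emit: (event: WideEvent) => void,
  timeoutMs = 30_000
): Promise<boolean> {
  const startedAt = Date.now();
  const event: WideEvent = { ...context, ok: false, durationMs: 0 };
  try {
    const res = await fetchImpl(`${gatewayUrl}/api/instances`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      // Abort if the gateway accepts the request but never responds.
      signal: AbortSignal.timeout(timeoutMs),
    });
    event.status = res.status;
    event.ok = res.ok;
    return res.ok;
  } catch (err) {
    // Classify the failure: AbortSignal.timeout rejects with a "TimeoutError".
    event.error = err instanceof Error && err.name === "TimeoutError" ? "timeout" : "fetch";
    return false;
  } finally {
    // Exactly one event per call, on every exit path.
    event.durationMs = Date.now() - startedAt;
    emit(event);
  }
}
```

Emitting from finally means early returns and thrown errors cannot skip the log line, and the ok field keeps success and failure queryable in the same event stream.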

Supervisor wiring (apps/supervisor/src/index.ts)

  • Compute mode activated when gateway URL is configured
  • Restore branch derives a unique runnerId per restore cycle, matching iceman's convention
  • Suspend/restore gated behind snapshots enabled flag

Workload server (apps/supervisor/src/workloadServer/index.ts)

  • Suspend handler triggers a compute snapshot (fire-and-forget) when in compute mode with snapshots enabled
  • Snapshot-complete callback endpoint receives the snapshot ID and calls submitSuspendCompletion

Env validation (apps/supervisor/src/env.ts)

  • Compute gateway URL, auth token, and timeout settings
  • Snapshots enabled flag (defaults off — compute mode can run without checkpoints)
  • Metadata URL required when snapshots enabled (validated at startup)
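
The startup rules above could be sketched roughly as follows. This is a plain-TypeScript illustration, not the supervisor's real env schema: COMPUTE_GATEWAY_URL appears in the commit messages, but COMPUTE_SNAPSHOTS_ENABLED is a hypothetical flag name, and using TRIGGER_METADATA_URL as the supervisor-side variable is an assumption.

```typescript
type RawEnv = Record<string, string | undefined>;

type ComputeConfig =
  | { mode: "disabled" }
  | {
      mode: "compute";
      gatewayUrl: string;
      snapshotsEnabled: boolean;
      metadataUrl: string | undefined;
    };

function resolveComputeConfig(env: RawEnv): ComputeConfig {
  // Compute mode is only active when a gateway URL is configured.
  const gatewayUrl = env.COMPUTE_GATEWAY_URL;
  if (!gatewayUrl) {
    return { mode: "disabled" };
  }

  // Hypothetical flag name; snapshots default to off.
  const snapshotsEnabled = env.COMPUTE_SNAPSHOTS_ENABLED === "true";

  // Assumed variable name for the metadata URL.
  const metadataUrl = env.TRIGGER_METADATA_URL;
  if (snapshotsEnabled && !metadataUrl) {
    // Fail at startup rather than at the first restore.
    throw new Error("metadata URL is required when snapshots are enabled");
  }

  return { mode: "compute", gatewayUrl, snapshotsEnabled, metadataUrl };
}
```

Validating the dependency between the two settings at startup surfaces a misconfiguration before any run reaches suspend or restore.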

Add a third WorkloadManager implementation that creates sandboxes via
the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch
with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which
takes priority over Kubernetes and Docker providers.

The fetch() call had no timeout, causing infinite hangs when the gateway
accepted requests but never returned responses. Adds AbortSignal.timeout
(30s) and consolidates all logging into a single structured event per
create() call with timing, status, and error context.

Emit a single canonical log line in a finally block instead of scattered
log calls at each early return. Adds business context (envId, envType,
orgId, projectId, deploymentVersion, machine) and instanceName to the
event. Always emits at info level with ok=true/false for queryability.

Pass business context (runId, envId, orgId, projectId, machine, etc.)
as metadata on CreateSandboxRequest instead of relying on env vars.
This enables wide event logging in the compute stack without parsing
env or leaking secrets.

Passes machine preset cpu and memory as top-level fields on the
CreateSandboxRequest so the compute stack can use them for admission
control and resource allocation.

Thread timing context from queue consumer through to the compute
workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.

- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore

…re request

Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.

runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.

Gates snapshot/restore behaviour independently of compute mode.
When disabled, VMs won't receive the metadata URL and suspend/restore
are no-ops. Defaults to off so compute mode can be used without snapshots.
@changeset-bot

changeset-bot bot commented Feb 23, 2026

🦋 Changeset detected

Latest commit: 9925c72

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
| Name | Type |
| --- | --- |
| trigger.dev | Patch |
| d3-chat | Patch |
| references-d3-openai-agents | Patch |
| references-nextjs-realtime | Patch |
| references-realtime-hooks-test | Patch |
| references-realtime-streams | Patch |
| references-telemetry | Patch |
| @trigger.dev/build | Patch |
| @trigger.dev/core | Patch |
| @trigger.dev/python | Patch |
| @trigger.dev/react-hooks | Patch |
| @trigger.dev/redis-worker | Patch |
| @trigger.dev/rsc | Patch |
| @trigger.dev/schema-to-json | Patch |
| @trigger.dev/sdk | Patch |
| @trigger.dev/database | Patch |
| @trigger.dev/otlp-importer | Patch |
| @internal/cache | Patch |
| @internal/clickhouse | Patch |
| @internal/llm-model-catalog | Patch |
| @internal/redis | Patch |
| @internal/replication | Patch |
| @internal/run-engine | Patch |
| @internal/schedule-engine | Patch |
| @internal/testcontainers | Patch |
| @internal/tracing | Patch |
| @internal/tsql | Patch |
| @internal/zod-worker | Patch |
| @internal/sdk-compat-tests | Patch |

@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds end-to-end compute support: a new internal package @internal/compute (client, types, imageRef), supervisor compute workload manager and wiring (create/snapshot/restore), OTLP trace payload/dispatch, timer-wheel-based delayed snapshot orchestration and HTTP callback route, environment schema extensions, webapp compute template creation service with feature-flag and rollout logic, a DB migration adding WorkloadType and WorkerInstanceGroup.workloadType, propagation of dequeue/polling timing through the run queue, a CLI local-build --load behavior fix, and new tests and logging verbosity adjustments.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(supervisor): compute workload manager' accurately and clearly describes the main change: adding a ComputeWorkloadManager to the supervisor component. |
| Description check | ✅ Passed | The PR description is comprehensive and structured, covering key changes across multiple files and components. However, it does not follow the provided template structure (Closes #, Checklist, Testing, Changelog, Screenshots sections). |

coderabbitai[bot]

This comment was marked as resolved.

…nabled

Remove the silent `localhost` fallback for the snapshot callback URL,
which would be unreachable from external compute gateways. Add env
validation and a runtime guard matching the existing metadata URL pattern.
coderabbitai[bot]

This comment was marked as resolved.

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
apps/webapp/app/v3/services/computeTemplateCreation.server.ts (1)

109-131: ⚠️ Potential issue | 🟠 Major

Gateway misconfiguration can fail open for MICROVM projects.

When this.client is undefined (gateway URL not configured), resolveMode returns "skip" at line 114 before checking if the project's workloadType is MICROVM. This allows MICROVM deployments to finalize without template creation if the webapp is misconfigured.

Consider moving the client check after the workloadType check, failing explicitly for required projects:

🛠️ Proposed fix to fail closed for required mode
   async resolveMode(
     projectId: string,
     prisma: PrismaClientOrTransaction
   ): Promise<TemplateCreationMode> {
-    if (!this.client) {
-      return "skip";
-    }
-
     const project = await prisma.project.findFirst({
       where: { id: projectId },
       select: {
         defaultWorkerGroup: {
           select: { workloadType: true },
         },
         organization: {
           select: { featureFlags: true },
         },
       },
     });

     if (project?.defaultWorkerGroup?.workloadType === "MICROVM") {
+      if (!this.client) {
+        throw new Error("Compute gateway not configured but required for MICROVM workload");
+      }
       return "required";
     }

     const flag = makeFlag(prisma);
     const hasComputeAccess = await flag({
       key: FEATURE_FLAG.hasComputeAccess,
       defaultValue: false,
       overrides: (project?.organization?.featureFlags as Record<string, unknown>) ?? {},
     });

     if (hasComputeAccess) {
+      if (!this.client) {
+        throw new Error("Compute gateway not configured but required for project with compute access");
+      }
       return "required";
     }

+    if (!this.client) {
+      return "skip";
+    }
+
     const rolloutPct = Number(env.COMPUTE_TEMPLATE_SHADOW_ROLLOUT_PCT ?? "0");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/webapp/app/v3/services/computeTemplateCreation.server.ts` around lines
109 - 131, In resolveMode, the early return when this.client is falsy causes
MICROVM projects to be skipped; change the logic in resolveMode so you first
fetch the project and check project?.defaultWorkerGroup?.workloadType ===
"MICROVM" (using the existing project query) and return "required" for MICROVM
regardless of this.client, then after that check handle the case where
this.client is undefined by returning "skip" for non-MICROVM projects; update
the resolveMode function (referencing resolveMode and the
project.defaultWorkerGroup.workloadType check) so MICROVM cannot bypass template
creation when the gateway client is not configured.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bc42b688-0ed1-4216-a21c-8b1cedba0f2a

📥 Commits

Reviewing files that changed from the base of the PR and between b7fa420 and d8e478a.

📒 Files selected for processing (5)
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • internal-packages/compute/src/client.ts
  • internal-packages/compute/src/imageRef.ts
  • internal-packages/compute/src/index.ts
✅ Files skipped from review due to trivial changes (1)
  • internal-packages/compute/src/imageRef.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal-packages/compute/src/index.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: For apps and internal packages (apps/*, internal-packages/*), use pnpm run typecheck --filter <package> for verification, never use build as it proves almost nothing about correctness
Use testcontainers helpers (redisTest, postgresTest, containerTest from @internal/testcontainers) for integration tests with Redis and PostgreSQL instead of mocking
When writing Trigger.dev tasks, always import from @trigger.dev/sdk - never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Use pnpm for package management in this monorepo (version 10.23.0) with Turborepo for orchestration - run commands from root with pnpm run
Add crumbs as you write code for debug tracing using // @crumbs comments or // #region @crumbs blocks - they stay on the branch throughout development and are stripped via agentcrumbs strip before merge

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/v3/services/**/*.server.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Organize services in the webapp following the pattern app/v3/services/*/*.server.ts

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
apps/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
apps/webapp/app/v3/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

In the webapp v3 directory, only modify V2 code paths when encountering V1/V2 branching in services - all new work uses Run Engine 2.0 (@internal/run-engine) and redis-worker, not legacy V1 engine code

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/**/*.{ts,tsx,server.ts}

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

Access environment variables via env export from app/env.server.ts. Never use process.env directly

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/webapp/app/v3/services/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

Only modify V2 code paths when editing services that branch on RunEngineVersion to support both V1 and V2 (e.g., cancelTaskRun.server.ts, batchTriggerV3.server.ts)

Files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Container orchestration abstraction (Docker or Kubernetes) should be implemented in src/workloadManager/

Files:

  • apps/supervisor/src/workloadManager/compute.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-03-02T12:43:34.140Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/cli-v3/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:34.140Z
Learning: Applies to packages/cli-v3/src/commands/deploy.ts : Implement `deploy.ts` command in `src/commands/` for production deployment

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
📚 Learning: 2026-03-26T10:02:22.373Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3254
File: apps/webapp/app/services/platformNotifications.server.ts:363-385
Timestamp: 2026-03-26T10:02:22.373Z
Learning: In `triggerdotdev/trigger.dev`, the `getNextCliNotification` fallback in `apps/webapp/app/services/platformNotifications.server.ts` intentionally uses `prisma.orgMember.findFirst` (single org) when no `projectRef` is provided. This is acceptable for v1 because the CLI (`dev` and `login` commands) always passes `projectRef` in normal usage, making the fallback a rare edge case. Do not flag the single-org fallback as a multi-org correctness bug in this file.

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-26T23:24:31.165Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • internal-packages/compute/src/client.ts
📚 Learning: 2026-03-10T17:56:20.938Z
Learnt from: samejr
Repo: triggerdotdev/trigger.dev PR: 3201
File: apps/webapp/app/v3/services/setSeatsAddOn.server.ts:25-29
Timestamp: 2026-03-10T17:56:20.938Z
Learning: Do not implement local userId-to-organizationId authorization checks inside org-scoped service classes (e.g., SetSeatsAddOnService, SetBranchesAddOnService) in the web app. Rely on route-layer authentication (requireUserId(request)) and org membership enforcement via the _app.orgs.$organizationSlug layout route. Any userId/organizationId that reaches these services from org-scoped routes has already been validated. Apply this pattern across all org-scoped services to avoid redundant auth checks and maintain consistency.

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/webapp/app/v3/services/computeTemplateCreation.server.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • internal-packages/compute/src/client.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-27T11:45:37.910Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-23T06:24:14.566Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:14.566Z
Learning: Applies to Dockerfile : When updating Docker image references, always use multiplatform/index digests, not architecture-specific digests, to prevent CI failures on different build environments

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-26T23:24:51.644Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:817-825
Timestamp: 2026-03-26T23:24:51.644Z
Learning: In `apps/supervisor/src/workloadServer/index.ts` (`WorkloadServer.stop()`), pending items returned by `this.snapshotDelayWheel?.stop()` are intentionally logged and dropped rather than dispatched. The entire supervisor is shutting down, so the snapshot callback URL would point at a dead server; dispatching snapshots during teardown would create orphaned gateway callbacks. Runners detect the supervisor is gone and reconnect to a new supervisor instance, which re-triggers the snapshot workflow. Do not flag the drop-on-shutdown behavior as a bug.

Applied to files:

  • apps/supervisor/src/workloadManager/compute.ts
🔇 Additional comments (6)
internal-packages/compute/src/client.ts (1)

49-70: LGTM! Type safety issue has been addressed.

The post() method now correctly returns Promise<T | undefined>, and instances.create() properly validates the response at lines 121-124, throwing a clear error when the gateway returns no body. This addresses the previous concern about callers silently dereferencing undefined.

Also applies to: 117-126
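
The contract described in this comment can be illustrated with a small sketch. The Post type, the Instance shape, and the function name are hypothetical stand-ins for the real @internal/compute client; the point is only that a transport returning T | undefined forces the caller to validate before use.

```typescript
// Hypothetical response shape for illustration only.
type Instance = { id: string };

// Hypothetical transport: resolves to undefined when the gateway sends no body.
type Post = <T>(path: string, body: unknown) => Promise<T | undefined>;

async function createInstanceOrThrow(
  post: Post,
  request: { name: string }
): Promise<Instance> {
  const instance = await post<Instance>("/api/instances", request);
  if (!instance) {
    // Fail loudly here so callers never dereference undefined.
    throw new Error("compute gateway returned no body for instance create");
  }
  return instance;
}
```

Because the undefined case is part of the return type, the compiler flags any caller that uses the result without a check like the one above.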

apps/supervisor/src/workloadManager/compute.ts (3)

40-151: LGTM! Clean implementation of workload lifecycle management.

The create() method properly:

  • Builds environment variables with conditional flags for warm-start, metadata, heartbeat, and snapshot polling
  • Uses tryCatch for clean error handling with typed error classification (timeout vs fetch)
  • Emits a comprehensive wide-event log in finally for observability
  • Optionally emits OTLP trace spans when tracing is enabled

The image digest stripping at line 85 is now consistent with ComputeTemplateCreationService since both use stripImageDigest from @internal/compute.


153-192: LGTM! Consistent error handling pattern.

Both snapshot() and deleteInstance() follow a clean pattern with tryCatch, proper error logging, and boolean return values for caller-side handling.


252-312: LGTM! Restore implementation follows architecture correctly.

The metadata object deliberately excludes TRIGGER_METADATA_URL because, per the established architecture, this URL is baked into the instance environment at creation time and preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Based on learnings: "TRIGGER_METADATA_URL does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore."

apps/webapp/app/v3/services/computeTemplateCreation.server.ts (2)

48-67: LGTM! Shadow mode error handling fixed.

The .then((result) => {...}).catch(...) pattern correctly handles both the expected { success: false, error } return value and any unexpected thrown exceptions.
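
A rough sketch of that pattern, under the assumption that shadow-mode creation resolves to a result object rather than throwing on ordinary failures. TemplateResult, runShadowCreation, and the warn callback are illustrative names, not the service's actual API.

```typescript
type TemplateResult = { success: true } | { success: false; error: string };

function runShadowCreation(
  create: () => Promise<TemplateResult>,
  warn: (message: string) => void
): Promise<void> {
  return create()
    .then((result) => {
      // Expected failure path: the call resolved, but with success: false.
      if (result.success === false) {
        warn(`shadow template creation failed: ${result.error}`);
      }
    })
    .catch((err: unknown) => {
      // Unexpected path: the promise (or the handler above) rejected.
      const detail = err instanceof Error ? err.message : String(err);
      warn(`shadow template creation threw: ${detail}`);
    });
}
```

The returned promise never rejects, so a caller can leave it un-awaited without risking an unhandled rejection. Note that this only covers rejections; a create function that throws synchronously before returning a promise would still escape.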


152-176: LGTM! Clean template creation with proper error handling.

The method catches exceptions, logs them, and returns a structured result that callers can handle appropriately. Image digest stripping is consistent with the supervisor's ComputeWorkloadManager.

@nicktrn nicktrn marked this pull request as ready for review March 27, 2026 12:25
devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 18 additional findings in Devin Review.

Open in Devin Review

Comment on lines +66 to +67
startTimeUnixNano: String(span.startTimeMs * 1_000_000),
endTimeUnixNano: String(span.endTimeMs * 1_000_000),

🟡 OTLP timestamp conversion overflows Number.MAX_SAFE_INTEGER for real epoch values

The buildPayload function converts millisecond timestamps to nanoseconds via String(span.startTimeMs * 1_000_000). In production, startTimeMs is a real epoch value from Date.now() (~1.7e12), so the product (~1.7e18) exceeds Number.MAX_SAFE_INTEGER (~9.0e15) and is no longer guaranteed to be exact under IEEE 754. The resulting nanosecond string may therefore carry incorrect trailing digits. The test at otlpTraceService.test.ts:82-83 only uses startTimeMs: 1000 (product 1e9, well within safe range), so it doesn't catch this. All callers pass real epoch timestamps: for example, compute.ts:234 passes opts.dequeuedAt.getTime() - 1, and computeSnapshotService.ts:210 passes Date.now().

Fix approach

Use string concatenation or BigInt to avoid floating-point precision loss:
String(BigInt(span.startTimeMs) * 1_000_000n) or
span.startTimeMs.toString() + "000000"

Suggested change
- startTimeUnixNano: String(span.startTimeMs * 1_000_000),
- endTimeUnixNano: String(span.endTimeMs * 1_000_000),
+ startTimeUnixNano: String(BigInt(span.startTimeMs) * 1_000_000n),
+ endTimeUnixNano: String(BigInt(span.endTimeMs) * 1_000_000n),
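
As a small illustration of the string-concatenation alternative mentioned above, the sketch below converts integer epoch milliseconds to a nanosecond string without ever forming a Number above Number.MAX_SAFE_INTEGER. The helper name msToUnixNanoString and the input guard are assumptions for this example, not code from the PR; the BigInt form in the suggested change should be equivalent for integer inputs.

```typescript
function msToUnixNanoString(ms: number): string {
  // Guard: both the string and the BigInt approach assume an integer timestamp.
  if (!Number.isSafeInteger(ms) || ms < 0) {
    throw new RangeError(`expected a non-negative integer millisecond timestamp, got ${ms}`);
  }
  // Appending six zeros multiplies by 1_000_000 without creating an unsafe Number.
  return ms === 0 ? "0" : `${ms}000000`;
}

const sampleMs = 1_700_000_000_123;

// The intermediate product is outside the safe integer range.
console.log(Number.isSafeInteger(sampleMs * 1_000_000)); // prints false

// The string form keeps every digit of the original timestamp.
console.log(msToUnixNanoString(sampleMs)); // prints 1700000000123000000
```

Whether String() of the unsafe product actually prints wrong digits depends on the input, so the sketch only checks that the intermediate value is outside the safe range. The guard matters for either fix: BigInt() throws a RangeError when given a non-integer Number, so fractional millisecond values would need rounding first.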