feat(supervisor): compute workload manager by nicktrn · Pull Request #3114 · triggerdotdev/trigger.dev

nicktrn · 2026-02-23T12:14:37Z

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

Routes VM create, snapshot, delete, and restore through the compute gateway API
Wide event logging on create with full timing and context
Configurable gateway timeout, auth token, image digest stripping
Restore sends name, env override metadata, CPU and memory so the agent can inject them before the VM resumes

Supervisor wiring (apps/supervisor/src/index.ts)

Compute mode activated when gateway URL is configured
Restore branch derives a unique runnerId per restore cycle, matching iceman's convention
Suspend/restore gated behind snapshots enabled flag

Workload server (apps/supervisor/src/workloadServer/index.ts)

Suspend handler triggers a compute snapshot (fire-and-forget) when in compute mode with snapshots enabled
Snapshot-complete callback endpoint receives the snapshot ID and calls submitSuspendCompletion

Env validation (apps/supervisor/src/env.ts)

Compute gateway URL, auth token, and timeout settings
Snapshots enabled flag (defaults off — compute mode can run without checkpoints)
Metadata URL required when snapshots enabled (validated at startup)
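
The cross-field rule above (a metadata URL is only required once snapshots are enabled) could be checked at startup roughly as follows. This is a minimal sketch in plain TypeScript, not the PR's actual env schema: `COMPUTE_GATEWAY_URL` is named in the PR, while `COMPUTE_GATEWAY_TIMEOUT_MS`, `COMPUTE_SNAPSHOTS_ENABLED`, and `METADATA_URL` are hypothetical placeholder names, and the 30s default is borrowed from the timeout mentioned in the commit notes.

```typescript
// Hypothetical env shape. Only COMPUTE_GATEWAY_URL is a name taken from the PR.
type RawEnv = Record<string, string | undefined>;

type ComputeEnv = {
  gatewayUrl: string | undefined;
  gatewayTimeoutMs: number;
  snapshotsEnabled: boolean;
  metadataUrl: string | undefined;
};

function parseComputeEnv(raw: RawEnv): ComputeEnv {
  // Snapshots default to off, so compute mode can run without checkpoints.
  const snapshotsEnabled = raw.COMPUTE_SNAPSHOTS_ENABLED === "true";

  const gatewayTimeoutMs = Number(raw.COMPUTE_GATEWAY_TIMEOUT_MS ?? "30000");
  if (!Number.isFinite(gatewayTimeoutMs) || gatewayTimeoutMs <= 0) {
    throw new Error("COMPUTE_GATEWAY_TIMEOUT_MS must be a positive number");
  }

  // Cross-field rule: fail at startup instead of at the first suspend request.
  if (snapshotsEnabled && !raw.METADATA_URL) {
    throw new Error("METADATA_URL is required when snapshots are enabled");
  }

  return {
    gatewayUrl: raw.COMPUTE_GATEWAY_URL,
    gatewayTimeoutMs,
    snapshotsEnabled,
    metadataUrl: raw.METADATA_URL,
  };
}
```

Failing during parsing keeps a misconfigured supervisor from starting at all, which is usually easier to diagnose than a suspend that silently never completes.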

Add a third WorkloadManager implementation that creates sandboxes via the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which takes priority over Kubernetes and Docker providers.
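
The provider priority described above can be sketched as a small selection function. The config shape is hypothetical: only `COMPUTE_GATEWAY_URL` is named in the PR, and both the `kubernetesEnabled` flag and the choice of Docker as the final fallback are assumptions made for illustration.

```typescript
type Provider = "compute" | "kubernetes" | "docker";

// Hypothetical config shape, not the supervisor's real env object.
type ProviderConfig = {
  computeGatewayUrl: string | undefined;
  kubernetesEnabled: boolean;
};

function selectProvider(config: ProviderConfig): Provider {
  // A configured gateway URL wins over every other provider.
  if (config.computeGatewayUrl) {
    return "compute";
  }
  // Assumed ordering between the two pre-existing providers.
  if (config.kubernetesEnabled) {
    return "kubernetes";
  }
  return "docker";
}
```

For example, `selectProvider({ computeGatewayUrl: "https://gateway.example", kubernetesEnabled: true })` returns `"compute"` even though Kubernetes is also enabled.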

The fetch() call had no timeout, causing infinite hangs when the gateway accepted requests but never returned responses. Adds AbortSignal.timeout (30s) and consolidates all logging into a single structured event per create() call with timing, status, and error context.

Emit a single canonical log line in a finally block instead of scattered log calls at each early return. Adds business context (envId, envType, orgId, projectId, deploymentVersion, machine) and instanceName to the event. Always emits at info level with ok=true/false for queryability.
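
The "one canonical log line in a finally block" pattern might look like the sketch below. It is synchronous for brevity (the same shape applies to an async create()), and `withWideEvent`, `WideEvent`, and the `emit` callback are hypothetical names rather than the PR's code.

```typescript
type WideEvent = {
  message: string;
  ok: boolean;
  durationMs: number;
  error?: string;
  [key: string]: unknown;
};

function withWideEvent<T>(
  message: string,
  context: Record<string, unknown>,
  emit: (event: WideEvent) => void,
  fn: (event: WideEvent) => T
): T {
  const startedAt = Date.now();
  // Context is spread first so it cannot overwrite the core fields.
  const event: WideEvent = { ...context, message, ok: false, durationMs: 0 };
  try {
    // fn may attach more fields (e.g. instanceName) as it learns them.
    const result = fn(event);
    event.ok = true;
    return result;
  } catch (error) {
    event.error = error instanceof Error ? error.message : String(error);
    throw error;
  } finally {
    // Exactly one emit per call, on success, early return, or throw.
    event.durationMs = Date.now() - startedAt;
    emit(event);
  }
}
```

Because the event is emitted at a single level with `ok` set to true or false, a log query can filter on `ok` instead of on severity.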

Pass business context (runId, envId, orgId, projectId, machine, etc.) as metadata on CreateSandboxRequest instead of relying on env vars. This enables wide event logging in the compute stack without parsing env or leaking secrets.

Passes machine preset cpu and memory as top-level fields on the CreateSandboxRequest so the compute stack can use them for admission control and resource allocation.

Thread timing context from queue consumer through to the compute workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.

…-manager

- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore

…-manager

…re request

Restore calls now send a request body with the runner name, env override metadata, cpu, and memory so the agent can inject them before the VM resumes. The runner fetches these overrides from TRIGGER_METADATA_URL at restore time. runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix}, matching iceman's pattern.
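
A per-restore runner ID in the runner-{runIdShort}-{checkpointSuffix} format could be derived as in the sketch below. Only the output format comes from the PR. How `runIdShort` and `checkpointSuffix` are actually computed is not stated, so stripping a `run_` prefix and taking the last 8 characters of the checkpoint ID are purely illustrative assumptions, as is the function name.

```typescript
// Hypothetical helper: the derivation rules are assumptions, only the
// "runner-{runIdShort}-{checkpointSuffix}" format is taken from the PR.
function deriveRestoreRunnerId(runFriendlyId: string, checkpointId: string): string {
  const runIdShort = runFriendlyId.replace(/^run_/, "");
  const checkpointSuffix = checkpointId.slice(-8);
  return `runner-${runIdShort}-${checkpointSuffix}`;
}
```

Including a checkpoint-derived suffix means two restores of the same run get different runner names, so a restored VM does not collide with the name of the instance it was snapshotted from.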

Gates snapshot/restore behaviour independently of compute mode. When disabled, VMs won't receive the metadata URL and suspend/restore are no-ops. Defaults to off so compute mode can be used without snapshots.

changeset-bot · 2026-02-23T12:14:41Z

🦋 Changeset detected

Latest commit: 9925c72

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages

Name	Type
trigger.dev	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
references-realtime-hooks-test	Patch
references-realtime-streams	Patch
references-telemetry	Patch
@trigger.dev/build	Patch
@trigger.dev/core	Patch
@trigger.dev/python	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/rsc	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
@internal/sdk-compat-tests	Patch


coderabbitai · 2026-02-23T12:14:55Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Adds end-to-end compute support: a new internal package @internal/compute (client, types, imageRef), supervisor compute workload manager and wiring (create/snapshot/restore), OTLP trace payload/dispatch, timer-wheel-based delayed snapshot orchestration and HTTP callback route, environment schema extensions, webapp compute template creation service with feature-flag and rollout logic, a DB migration adding WorkloadType and WorkerInstanceGroup.workloadType, propagation of dequeue/polling timing through the run queue, a CLI local-build --load behavior fix, and new tests and logging verbosity adjustments.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(supervisor): compute workload manager' accurately and clearly describes the main change: adding a ComputeWorkloadManager to the supervisor component.
Description check	✅ Passed	The PR description is comprehensive and structured, covering key changes across multiple files and components. However, it does not follow the provided template structure (Closes #, Checklist, Testing, Changelog, Screenshots sections).


…nabled

Remove the silent `localhost` fallback for the snapshot callback URL, which would be unreachable from external compute gateways. Add env validation and a runtime guard matching the existing metadata URL pattern.

…-manager

Delay compute snapshot requests to avoid wasted work on short-lived waitpoints (e.g. triggerAndWait resolving in <5s). Configurable via COMPUTE_SNAPSHOT_DELAY_MS (default 5s).
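
The idea of delaying a snapshot and cancelling it if the run resumes first can be sketched with a small queue driven by an explicit clock. This is a deliberately simplified stand-in, not the timer-wheel implementation the PR uses, and `DelayedSnapshotQueue` and its method names are hypothetical.

```typescript
// Simplified sketch: a map of due times instead of a timer wheel.
class DelayedSnapshotQueue {
  private readonly delayMs: number;
  private readonly dueAtByRun = new Map<string, number>();

  constructor(delayMs: number) {
    this.delayMs = delayMs;
  }

  // Called when a run asks to suspend; the snapshot is not requested yet.
  schedule(runId: string, nowMs: number): void {
    this.dueAtByRun.set(runId, nowMs + this.delayMs);
  }

  // Called when the waitpoint resolves before the delay elapses.
  cancel(runId: string): boolean {
    return this.dueAtByRun.delete(runId);
  }

  // Returns (and removes) the runs whose delay has elapsed.
  takeDue(nowMs: number): string[] {
    const due: string[] = [];
    this.dueAtByRun.forEach((dueAt, runId) => {
      if (dueAt <= nowMs) {
        due.push(runId);
      }
    });
    for (const runId of due) {
      this.dueAtByRun.delete(runId);
    }
    return due;
  }
}
```

With a 5000 ms delay, a run scheduled at t=0 and cancelled at t=3000 never reaches `takeDue`, so no snapshot request is sent for it.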

…-manager

…ompute package

coderabbitai

♻️ Duplicate comments (1)

apps/webapp/app/v3/services/computeTemplateCreation.server.ts (1)

109-131: ⚠️ Potential issue | 🟠 Major

Gateway misconfiguration can fail open for MICROVM projects.

When this.client is undefined (gateway URL not configured), resolveMode returns "skip" at line 114 before checking if the project's workloadType is MICROVM. This allows MICROVM deployments to finalize without template creation if the webapp is misconfigured.

Consider moving the client check after the workloadType check, failing explicitly for required projects:

🛠️ Proposed fix to fail closed for required mode

   async resolveMode(
     projectId: string,
     prisma: PrismaClientOrTransaction
   ): Promise<TemplateCreationMode> {
-    if (!this.client) {
-      return "skip";
-    }
-
     const project = await prisma.project.findFirst({
       where: { id: projectId },
       select: {
         defaultWorkerGroup: {
           select: { workloadType: true },
         },
         organization: {
           select: { featureFlags: true },
         },
       },
     });

     if (project?.defaultWorkerGroup?.workloadType === "MICROVM") {
+      if (!this.client) {
+        throw new Error("Compute gateway not configured but required for MICROVM workload");
+      }
       return "required";
     }

     const flag = makeFlag(prisma);
     const hasComputeAccess = await flag({
       key: FEATURE_FLAG.hasComputeAccess,
       defaultValue: false,
       overrides: (project?.organization?.featureFlags as Record<string, unknown>) ?? {},
     });

     if (hasComputeAccess) {
+      if (!this.client) {
+        throw new Error("Compute gateway not configured but required for project with compute access");
+      }
       return "required";
     }

+    if (!this.client) {
+      return "skip";
+    }
+
     const rolloutPct = Number(env.COMPUTE_TEMPLATE_SHADOW_ROLLOUT_PCT ?? "0");

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/webapp/app/v3/services/computeTemplateCreation.server.ts` around lines
109 - 131, In resolveMode, the early return when this.client is falsy causes
MICROVM projects to be skipped; change the logic in resolveMode so you first
fetch the project and check project?.defaultWorkerGroup?.workloadType ===
"MICROVM" (using the existing project query) and return "required" for MICROVM
regardless of this.client, then after that check handle the case where
this.client is undefined by returning "skip" for non-MICROVM projects; update
the resolveMode function (referencing resolveMode and the
project.defaultWorkerGroup.workloadType check) so MICROVM cannot bypass template
creation when the gateway client is not configured.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bc42b688-0ed1-4216-a21c-8b1cedba0f2a

📥 Commits

Reviewing files that changed from the base of the PR and between b7fa420 and d8e478a.

📒 Files selected for processing (5)

apps/supervisor/src/workloadManager/compute.ts
apps/webapp/app/v3/services/computeTemplateCreation.server.ts
internal-packages/compute/src/client.ts
internal-packages/compute/src/imageRef.ts
internal-packages/compute/src/index.ts

✅ Files skipped from review due to trivial changes (1)

internal-packages/compute/src/imageRef.ts

🚧 Files skipped from review as they are similar to previous changes (1)

internal-packages/compute/src/index.ts

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)

GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
GitHub Check: sdk-compat / Cloudflare Workers
GitHub Check: sdk-compat / Bun Runtime
GitHub Check: sdk-compat / Deno Runtime
GitHub Check: typecheck / typecheck
GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)

🧰 Additional context used

📓 Path-based instructions (13)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: For apps and internal packages (apps/*, internal-packages/*), use pnpm run typecheck --filter <package> for verification, never use build as it proves almost nothing about correctness
Use testcontainers helpers (redisTest, postgresTest, containerTest from @internal/testcontainers) for integration tests with Redis and PostgreSQL instead of mocking
When writing Trigger.dev tasks, always import from @trigger.dev/sdk - never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Use pnpm for package management in this monorepo (version 10.23.0) with Turborepo for orchestration - run commands from root with pnpm run
Add crumbs as you write code for debug tracing using `// @crumbs` comments or `// #region @crumbs` blocks - they stay on the branch throughout development and are stripped via `agentcrumbs strip` before merge

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

apps/webapp/app/v3/services/**/*.server.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Organize services in the webapp following the pattern app/v3/services/*/*.server.ts

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

apps/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts

apps/webapp/app/v3/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

In the webapp v3 directory, only modify V2 code paths when encountering V1/V2 branching in services - all new work uses Run Engine 2.0 (@internal/run-engine) and redis-worker, not legacy V1 engine code

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

apps/webapp/app/**/*.{ts,tsx,server.ts}

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

Access environment variables via env export from app/env.server.ts. Never use process.env directly

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

apps/webapp/app/v3/services/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

Only modify V2 code paths when editing services that branch on RunEngineVersion to support both V1 and V2 (e.g., cancelTaskRun.server.ts, batchTriggerV3.server.ts)

Files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

apps/supervisor/src/workloadManager/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Container orchestration abstraction (Docker or Kubernetes) should be implemented in src/workloadManager/

Files:

apps/supervisor/src/workloadManager/compute.ts

🧠 Learnings (13)

📓 Common learnings

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

📚 Learning: 2026-03-02T12:43:34.140Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/cli-v3/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:34.140Z
Learning: Applies to packages/cli-v3/src/commands/deploy.ts : Implement `deploy.ts` command in `src/commands/` for production deployment

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

📚 Learning: 2026-03-26T10:02:22.373Z

Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3254
File: apps/webapp/app/services/platformNotifications.server.ts:363-385
Timestamp: 2026-03-26T10:02:22.373Z
Learning: In `triggerdotdev/trigger.dev`, the `getNextCliNotification` fallback in `apps/webapp/app/services/platformNotifications.server.ts` intentionally uses `prisma.orgMember.findFirst` (single org) when no `projectRef` is provided. This is acceptable for v1 because the CLI (`dev` and `login` commands) always passes `projectRef` in normal usage, making the fallback a rare edge case. Do not flag the single-org fallback as a multi-org correctness bug in this file.

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

📚 Learning: 2026-03-26T23:24:31.165Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:494-539
Timestamp: 2026-03-26T23:24:31.165Z
Learning: In `apps/supervisor/src/workloadServer/index.ts`, the `/api/v1/compute/snapshot-complete` POST endpoint always replies `200` even when `workerClient.submitSuspendCompletion` returns `result.success === false`. This is intentional: the compute gateway's callback is fire-and-forget, it has no retry logic, and the snapshot state is already determined at the time of the callback. Returning a non-2xx would only cause the gateway to log a spurious error it cannot remediate. Failures are already logged by the supervisor, and the platform will eventually time out the suspend if it never receives a completion. Do not flag this as a bug.

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
internal-packages/compute/src/client.ts

📚 Learning: 2026-03-10T17:56:20.938Z

Learnt from: samejr
Repo: triggerdotdev/trigger.dev PR: 3201
File: apps/webapp/app/v3/services/setSeatsAddOn.server.ts:25-29
Timestamp: 2026-03-10T17:56:20.938Z
Learning: Do not implement local userId-to-organizationId authorization checks inside org-scoped service classes (e.g., SetSeatsAddOnService, SetBranchesAddOnService) in the web app. Rely on route-layer authentication (requireUserId(request)) and org membership enforcement via the _app.orgs.$organizationSlug layout route. Any userId/organizationId that reaches these services from org-scoped routes has already been validated. Apply this pattern across all org-scoped services to avoid redundant auth checks and maintain consistency.

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

apps/webapp/app/v3/services/computeTemplateCreation.server.ts
apps/supervisor/src/workloadManager/compute.ts
internal-packages/compute/src/client.ts

📚 Learning: 2026-03-02T12:42:47.652Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

📚 Learning: 2026-03-02T12:42:47.652Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

📚 Learning: 2026-02-23T12:56:51.147Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

📚 Learning: 2026-03-27T11:45:37.910Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

📚 Learning: 2026-03-23T06:24:14.566Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:14.566Z
Learning: Applies to Dockerfile : When updating Docker image references, always use multiplatform/index digests, not architecture-specific digests, to prevent CI failures on different build environments

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

📚 Learning: 2026-03-26T23:24:51.644Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:817-825
Timestamp: 2026-03-26T23:24:51.644Z
Learning: In `apps/supervisor/src/workloadServer/index.ts` (`WorkloadServer.stop()`), pending items returned by `this.snapshotDelayWheel?.stop()` are intentionally logged and dropped rather than dispatched. The entire supervisor is shutting down, so the snapshot callback URL would point at a dead server; dispatching snapshots during teardown would create orphaned gateway callbacks. Runners detect the supervisor is gone and reconnect to a new supervisor instance, which re-triggers the snapshot workflow. Do not flag the drop-on-shutdown behavior as a bug.

Applied to files:

apps/supervisor/src/workloadManager/compute.ts

🔇 Additional comments (6)

internal-packages/compute/src/client.ts (1)

49-70: LGTM! Type safety issue has been addressed.

The post() method now correctly returns Promise<T | undefined>, and instances.create() properly validates the response at lines 121-124, throwing a clear error when the gateway returns no body. This addresses the previous concern about callers silently dereferencing undefined.

Also applies to: 117-126

apps/supervisor/src/workloadManager/compute.ts (3)

40-151: LGTM! Clean implementation of workload lifecycle management.

The create() method properly:

Builds environment variables with conditional flags for warm-start, metadata, heartbeat, and snapshot polling

Uses tryCatch for clean error handling with typed error classification (timeout vs fetch)

Emits a comprehensive wide-event log in finally for observability

Optionally emits OTLP trace spans when tracing is enabled

The image digest stripping at line 85 is now consistent with ComputeTemplateCreationService since both use stripImageDigest from @internal/compute.

153-192: LGTM! Consistent error handling pattern.

Both snapshot() and deleteInstance() follow a clean pattern with tryCatch, proper error logging, and boolean return values for caller-side handling.

252-312: LGTM! Restore implementation follows architecture correctly.

The metadata object deliberately excludes TRIGGER_METADATA_URL because, per the established architecture, this URL is baked into the instance environment at creation time and preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Based on learnings: "TRIGGER_METADATA_URL does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore."

apps/webapp/app/v3/services/computeTemplateCreation.server.ts (2)

48-67: LGTM! Shadow mode error handling fixed.

The .then((result) => {...}).catch(...) pattern correctly handles both the expected { success: false, error } return value and any unexpected thrown exceptions.

152-176: LGTM! Clean template creation with proper error handling.

The method catches exceptions, logs them, and returns a structured result that callers can handle appropriately. Image digest stripping is consistent with the supervisor's ComputeWorkloadManager.

devin-ai-integration

Devin Review found 1 new potential issue.

View 18 additional findings in Devin Review.

devin-ai-integration · 2026-03-28T19:52:23Z

apps/supervisor/src/services/otlpTraceService.ts

+                startTimeUnixNano: String(span.startTimeMs * 1_000_000),
+                endTimeUnixNano: String(span.endTimeMs * 1_000_000),


🟡 OTLP timestamp conversion overflows Number.MAX_SAFE_INTEGER for real epoch values

The buildPayload function converts millisecond timestamps to nanoseconds via String(span.startTimeMs * 1_000_000). In production, startTimeMs is a real epoch value from Date.now() (~1.7e12), so the product (~1.7e18) exceeds Number.MAX_SAFE_INTEGER (~9.0e15), causing IEEE 754 precision loss. The resulting nanosecond string will have incorrect trailing digits. The test at otlpTraceService.test.ts:82-83 only uses startTimeMs: 1000 (product 1e9, well within safe range), so it doesn't catch this. All callers pass real epoch timestamps — e.g. compute.ts:234 passes opts.dequeuedAt.getTime() - 1, and computeSnapshotService.ts:210 passes Date.now().

Fix approach

Use string concatenation or BigInt to avoid floating-point overflow:
String(BigInt(span.startTimeMs) * 1_000_000n) or
span.startTimeMs.toString() + "000000"

Suggested change

- startTimeUnixNano: String(span.startTimeMs * 1_000_000),
- endTimeUnixNano: String(span.endTimeMs * 1_000_000),
+ startTimeUnixNano: String(BigInt(span.startTimeMs) * 1_000_000n),
+ endTimeUnixNano: String(BigInt(span.endTimeMs) * 1_000_000n),
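A self-contained sketch of the BigInt approach, using a hypothetical `msToUnixNano` helper that is not part of the PR. It assumes the input is an integer millisecond value (as returned by `Date.now()` or `Date.prototype.getTime()`), because `BigInt()` throws on non-integer numbers. Whether the float product actually prints wrong digits depends on the input; the sketch only shows that the product leaves the safe-integer range, which the BigInt path avoids entirely.

```typescript
// Convert an integer epoch-millisecond timestamp to a nanosecond string using
// BigInt, so the multiplication never passes through a 64-bit float.
function msToUnixNano(epochMs: number): string {
  if (!Number.isInteger(epochMs)) {
    throw new RangeError(`expected integer milliseconds, got ${epochMs}`);
  }
  return (BigInt(epochMs) * BigInt(1_000_000)).toString();
}

const sampleMs = 1_700_000_000_123; // a realistic epoch value in milliseconds
const floatProduct = sampleMs * 1_000_000; // roughly 1.7e18, computed as a double

console.log(Number.isSafeInteger(floatProduct)); // false
console.log(msToUnixNano(sampleMs)); // "1700000000123000000"
```

A regression test built on a realistic epoch value, rather than `startTimeMs: 1000`, would exercise the range the reviewer is concerned about.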

nicktrn added 17 commits February 11, 2026 09:44

chore: merge main into feat/compute-workload-manager

ccc8fe2

fix(supervisor): strip image digest in ComputeWorkloadManager

3175a10

feat: make gateway fetch timeout configurable

1bccd1e

feat(supervisor): send machine cpu/memory in compute sandbox requests

ac3dadf

Passes machine preset cpu and memory as top-level fields on the CreateSandboxRequest so the compute stack can use them for admission control and resource allocation.

Merge branch 'main' into HEAD

7e251d4

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

e4915c4

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9466a47

fix(cli): fix --load flag on local/self-hosted builds

c1511f9

feat(supervisor): add flag to enable compute snapshots

4332743

Gates snapshot/restore behaviour independently of compute mode. When disabled, VMs won't receive the metadata URL and suspend/restore are no-ops. Defaults to off so compute mode can be used without snapshots.

feat(supervisor): require metadata URL when compute snapshots enabled

5089bba
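The startup rule described by these two commits (snapshots default off, metadata URL required once they are enabled) can be sketched in plain TypeScript as follows. `COMPUTE_GATEWAY_URL` appears in the PR description; `COMPUTE_SNAPSHOTS_ENABLED`, the use of `TRIGGER_METADATA_URL` as the supervisor-side variable name, and the `parseComputeEnv` helper are assumptions for illustration. The actual validation in `apps/supervisor/src/env.ts` is not shown in this thread and may use a schema library instead.

```typescript
interface ComputeEnv {
  gatewayUrl?: string;
  snapshotsEnabled: boolean;
  metadataUrl?: string;
}

// Parse and cross-validate the compute settings once, at startup.
function parseComputeEnv(raw: Record<string, string | undefined>): ComputeEnv {
  const env: ComputeEnv = {
    gatewayUrl: raw.COMPUTE_GATEWAY_URL,
    // Hypothetical flag name; snapshots stay off unless explicitly enabled.
    snapshotsEnabled: raw.COMPUTE_SNAPSHOTS_ENABLED === "true",
    // Assumed variable name for the metadata URL.
    metadataUrl: raw.TRIGGER_METADATA_URL,
  };
  if (env.snapshotsEnabled && !env.metadataUrl) {
    throw new Error("A metadata URL is required when compute snapshots are enabled");
  }
  return env;
}
```

Failing at startup surfaces the misconfiguration before any VM is created without the URL it needs for suspend and restore.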

nicktrn added 8 commits March 2, 2026 19:35

fix(supervisor): don't destroy compute instance after snapshot

e9b5fd3

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

0531a23

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9572c7d

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

5032b7f

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

f3e0cb8

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

0edc308

feat(supervisor): add snapshot delay for compute path via timer wheel

63424fa

Delays compute snapshot requests to avoid wasted work on short-lived waitpoints (e.g. triggerAndWait resolving in <5s). Configurable via COMPUTE_SNAPSHOT_DELAY_MS (default 5s).
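To make the idea concrete, here is a minimal single-level timer wheel with a manual `tick()`. It is a sketch under stated assumptions, not the PR's implementation: the `SnapshotDelayWheel` class, its method names, and the slot and tick sizes are all hypothetical. A snapshot is scheduled when a run suspends and cancelled if the waitpoint resolves before the delay elapses.

```typescript
type WheelEntry = { rounds: number; fire: () => void };

// Minimal single-level timer wheel. Each tick advances one slot and fires the
// entries in that slot whose remaining rounds have reached zero.
class SnapshotDelayWheel {
  private readonly slots: Map<string, WheelEntry>[];
  private readonly slotOf = new Map<string, number>(); // runId -> slot index
  private readonly slotCount: number;
  private readonly tickMs: number;
  private cursor = 0;

  constructor(slotCount: number, tickMs: number) {
    this.slotCount = slotCount;
    this.tickMs = tickMs;
    this.slots = Array.from({ length: slotCount }, () => new Map<string, WheelEntry>());
  }

  // Schedule `fire` after roughly `delayMs`, replacing any pending entry.
  schedule(runId: string, delayMs: number, fire: () => void): void {
    this.cancel(runId);
    const ticks = Math.max(1, Math.ceil(delayMs / this.tickMs));
    const slot = (this.cursor + ticks) % this.slotCount;
    const rounds = Math.floor((ticks - 1) / this.slotCount);
    this.slots[slot].set(runId, { rounds, fire });
    this.slotOf.set(runId, slot);
  }

  // Cancel a pending snapshot, e.g. when the waitpoint resolves early.
  cancel(runId: string): boolean {
    const slot = this.slotOf.get(runId);
    if (slot === undefined) return false;
    this.slotOf.delete(runId);
    return this.slots[slot].delete(runId);
  }

  // Advance one tick; a real service would drive this from a periodic timer.
  tick(): void {
    this.cursor = (this.cursor + 1) % this.slotCount;
    const current = this.slots[this.cursor];
    current.forEach((entry, runId) => {
      if (entry.rounds > 0) {
        entry.rounds -= 1;
        return;
      }
      current.delete(runId);
      this.slotOf.delete(runId);
      entry.fire();
    });
  }
}
```

With 1-second ticks and a 5-second delay, a run that resumes within the first few seconds is cancelled before any snapshot request is dispatched.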

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

80b62d4

nicktrn added 2 commits March 27, 2026 11:16

fix: update consumerPool test assertion for optional timing parameter

2219d11

refactor: consolidate compute gateway clients into shared @internal/compute package

b7fa420

fix: add type-safe post return, strip image digests consistently

d8e478a

coderabbitai bot reviewed Mar 27, 2026

nicktrn marked this pull request as ready for review March 27, 2026 12:25

refactor: convert remaining compute types to zod schemas

641d6a3

fix: bound trace context map, gate on compute mode, use machine preset for templates

c1021f2

nicktrn added 2 commits March 27, 2026 16:51

fix: register trace context before restore/warm-start, sanitize logged headers

1005428

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

5ffc7d4

fix: shadow mode for org-level compute access, require only for MICROVM projects

64729bb

nicktrn added 13 commits March 27, 2026 22:47

fix: wrap writer.write in try/catch to handle client disconnect during deploy

e9bcbe4

feat: add OtlpTraceService

061c2fb

refactor: move otlp trace tests to services/

8711f5b

refactor: remove env import from compute workload manager

9d72ae2

refactor: use OtlpTraceService in workload server

18eb7bb

refactor: wire up OtlpTraceService to workload server, delete old otlpTrace module

91f9fa3

refactor: inline payload builder into trace service, extract traceparent helper

36ecdb5

fix: skip k8s integration tests by default, require K8S_INTEGRATION_TESTS=1

30df9e2

fix: review fixes - COMPUTE checkpoint type, memory_gb standardization, OTLP endpoint, snapshot concurrency

05a6721

fix: make snapshot dispatch limit configurable via COMPUTE_SNAPSHOT_DISPATCH_LIMIT

cacee1e

refactor: extract ComputeSnapshotService from workload server, fix zod version in compute package

680f156

fix: remove unnecessary re-export, import schema directly from @internal/compute, add zod pinning rule

5142954

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9925c72

devin-ai-integration bot reviewed Mar 28, 2026
