Task Execution Engine

The Task Execution Engine is the core orchestration system responsible for managing the complete lifecycle of task runs in Trigger.dev. It handles task triggering, queuing, execution attempts, state management, retries, concurrency control, and advanced features like checkpoints, waitpoints, and batch processing. The engine coordinates between the webapp API, Redis-backed queues, PostgreSQL state storage, and worker processes that execute user tasks.

For information about the queue system and concurrency management, see Queue Management. For details on how workers execute tasks, see Worker Execution. For batch processing specifics, see Batch Processing.

Architecture Overview

The Task Execution Engine is implemented as the RunEngine class, which coordinates multiple specialized subsystems to manage task execution. The engine is instantiated as a singleton in the webapp and relies on PostgreSQL for persistent state, Redis for queuing and locking, and background workers for asynchronous job processing.

RunEngine Class Structure

Sources: internal-packages/run-engine/src/engine/index.ts:76-387, internal-packages/run-engine/src/engine/systems/

Core Components

RunEngine Class

The RunEngine class is the main orchestrator that provides the public API for task execution operations. It initializes all subsystems and coordinates their interactions.

Key Initialization Parameters:

Parameter	Purpose	Default
`prisma`	Database client for state persistence	Required
`worker.redis`	Redis connection for background jobs	Required
`queue.redis`	Redis connection for task queues	Required
`runLock.redis`	Redis connection for distributed locks	Required
`machines`	Machine preset configurations	Required
`heartbeatTimeoutsMs`	Timeout durations for execution states	See defaults
`retryWarmStartThresholdMs`	Threshold for checkpoint-based retries	`30000`

Sources: internal-packages/run-engine/src/engine/types.ts:23-112, apps/webapp/app/v3/runEngine.server.ts:15-198

SystemResources

All subsystems share a common set of resources defined by the SystemResources interface:

Sources: internal-packages/run-engine/src/engine/systems/systems.ts

Execution Lifecycle

The execution lifecycle follows a well-defined flow from triggering a task to its completion or failure. Each stage involves state transitions tracked through execution snapshots.

Task Execution Flow with Code Methods

Sources: internal-packages/run-engine/src/engine/index.ts:392-733, internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:105-603, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:298-632

Triggering a Run

The RunEngine.trigger() method creates a new task run and queues it for execution:

Method Signature:

Key Steps:

Debounce Check: If a debounce parameter is provided, call debounceSystem.handleDebounce()
- If an existing run is found, return it (status: existing)
- If the maximum debounce duration is exceeded, proceed with a new run (status: max_duration_exceeded)
- If this is a new claim, store debounceClaimId (status: new)
Create TaskRun: Call prisma.taskRun.create() with:
- id: Generated via RunId.fromFriendlyId()
- status: "DELAYED" if delayUntil provided, else "PENDING"
- executionSnapshots: Initial snapshot with status "DELAYED" or "RUN_CREATED"
- associatedWaitpoint: Created via waitpointSystem.buildRunAssociatedWaitpoint()
Handle triggerAndWait: If resumeParentOnCompletion and parentTaskRunId:
- Call waitpointSystem.blockRunWithWaitpoint() to block parent run
Schedule or Enqueue:
- If delayUntil: Call delayedRunSystem.scheduleDelayedRunEnqueuing()
- Else: Call enqueueSystem.enqueueRun() and optionally ttlSystem.scheduleExpireRun()
Emit Event: Fire eventBus.emit('runCreated')
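
The branching in the steps above can be sketched as a small pure function. This is a simplified model, not the engine's code: the `DebounceResult` and `TriggerPlan` types and the `planTrigger` helper are hypothetical names introduced for illustration, and only the status strings come from the steps above.

```typescript
// Hypothetical types: a simplified model of the decisions listed above.
type DebounceResult =
  | { status: "existing"; runId: string }
  | { status: "max_duration_exceeded" }
  | { status: "new"; claimId: string };

type TriggerPlan =
  | { kind: "return_existing"; runId: string }
  | {
      kind: "create";
      runStatus: "DELAYED" | "PENDING";
      snapshotStatus: "DELAYED" | "RUN_CREATED";
      nextStep: "scheduleDelayedRunEnqueuing" | "enqueueRun";
      debounceClaimId: string | null;
    };

function planTrigger(options: { delayUntil?: Date; debounce?: DebounceResult }): TriggerPlan {
  const { delayUntil, debounce } = options;

  // An existing debounced run short-circuits creation.
  if (debounce?.status === "existing") {
    return { kind: "return_existing", runId: debounce.runId };
  }

  const delayed = delayUntil !== undefined;
  return {
    kind: "create",
    runStatus: delayed ? "DELAYED" : "PENDING",
    snapshotStatus: delayed ? "DELAYED" : "RUN_CREATED",
    nextStep: delayed ? "scheduleDelayedRunEnqueuing" : "enqueueRun",
    // Only a fresh claim carries a claim ID forward.
    debounceClaimId: debounce?.status === "new" ? debounce.claimId : null,
  };
}
```

Calling `planTrigger({})` yields a `PENDING` run with a `RUN_CREATED` snapshot, while passing `delayUntil` switches both statuses to `DELAYED`.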

Sources: internal-packages/run-engine/src/engine/index.ts:392-733, internal-packages/run-engine/src/engine/types.ts:122-194

Dequeuing and Locking

The DequeueSystem handles retrieving runs from queues and preparing them for execution:

Key Steps:

RunQueue.dequeueMessageFromWorkerQueue() - Get run from Redis queue
Acquire distributed lock via RunLocker
Validate execution snapshot state (must be QUEUED or QUEUED_EXECUTING)
Resolve BackgroundWorkerTask, BackgroundWorker, TaskQueue, WorkerDeployment
Update TaskRun with lock information (lockedAt, lockedById, lockedToVersionId)
Create snapshot with status PENDING_EXECUTING
Return DequeuedMessage with execution details

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:105-603

Starting an Attempt

The RunAttemptSystem.startRunAttempt() method transitions a dequeued run to active execution:

Key Steps:

Acquire distributed lock on run
Verify snapshot ID matches latest (prevent race conditions)
Increment attemptNumber (starting at 1)
Check against MAX_TASK_RUN_ATTEMPTS (100)
Update TaskRun status to EXECUTING
Create new snapshot with status EXECUTING
Cancel any scheduled TTL expiration
Resolve execution context (task, queue, organization, project, machine preset, deployment)
Return TaskRunExecution object for worker
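
The snapshot check and attempt counting from the steps above might look roughly like this. It is a sketch under stated assumptions: `RunState`, `StartAttemptResult`, and `startAttempt` are illustrative names, locking and database writes are omitted, and treating the limit as inclusive is a guess.

```typescript
const MAX_TASK_RUN_ATTEMPTS = 100; // limit quoted in the steps above

// Hypothetical, trimmed-down view of a run.
type RunState = { attemptNumber: number | null; latestSnapshotId: string };

type StartAttemptResult =
  | { ok: true; attemptNumber: number }
  | { ok: false; reason: "stale_snapshot" | "max_attempts_exceeded" };

function startAttempt(run: RunState, snapshotId: string): StartAttemptResult {
  // A caller holding an outdated snapshot must not start the attempt.
  if (snapshotId !== run.latestSnapshotId) {
    return { ok: false, reason: "stale_snapshot" };
  }

  // The first attempt is number 1.
  const next = (run.attemptNumber ?? 0) + 1;

  // Assumption: the limit is inclusive, so attempt 100 is still allowed.
  if (next > MAX_TASK_RUN_ATTEMPTS) {
    return { ok: false, reason: "max_attempts_exceeded" };
  }

  return { ok: true, attemptNumber: next };
}
```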

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:300-635

Completing an Attempt

The completeRunAttempt() method finalizes an execution attempt, handling both success and failure:

Success Path (attemptSucceeded):

Validate snapshot is latest and not already finished
Calculate usage totals (duration, cost)
Update TaskRun status to COMPLETED_SUCCESSFULLY
Create FINISHED snapshot
Complete associated waitpoint (for triggerAndWait)
Release concurrency tokens
Emit runAttemptCompleted event

Failure Path (attemptFailed):

Determine retry outcome via retryOutcomeFromCompletion()
If retrying:
- Calculate delay using exponential backoff
- Call EnqueueSystem.enqueueRun() with retry metadata
- Update status to WAITING_FOR_DEPLOY or PENDING
If not retrying:
- Update status to final state (COMPLETED_WITH_ERRORS, SYSTEM_FAILURE, etc.)
- Complete associated waitpoint with error
- Release concurrency tokens
Emit appropriate events

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:637-1120

State Management

The execution engine tracks run state using two primary mechanisms: TaskRunStatus (on the TaskRun table) and TaskRunExecutionStatus (on the TaskRunExecutionSnapshot table). The snapshot-based approach provides an audit trail of all state transitions.

Execution Snapshot System

The ExecutionSnapshotSystem manages the creation and retrieval of execution snapshots, which track the detailed state of a run at specific points in time.

TaskRunExecutionSnapshot Fields:

Field	Type	Purpose
`id`	String	Unique snapshot ID
`runId`	String	Associated task run
`executionStatus`	Enum	Current execution state
`runStatus`	Enum	TaskRun status at snapshot time
`attemptNumber`	Int	Attempt number (1-based)
`description`	String	Human-readable state description
`previousSnapshotId`	String	Previous snapshot (linked list)
`checkpointId`	String	Associated checkpoint if applicable
`batchId`	String	Batch context if applicable
`completedWaitpoints`	Relation	Waitpoints completed at this point
`metadata`	JSON	Additional state metadata

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:1-393

Execution States

Valid TaskRunExecutionStatus Values:

Sources: internal-packages/run-engine/src/engine/statuses.ts:1-62

State Predicates

The engine uses helper functions to classify execution states:
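
As a rough illustration, predicates of this kind could be written as below. The status list contains only the values named elsewhere on this page (the real enum may hold more), and the function names are hypothetical rather than the ones exported from statuses.ts.

```typescript
// Only statuses mentioned on this page; treat the list as incomplete.
const EXECUTION_STATUSES = [
  "RUN_CREATED",
  "DELAYED",
  "QUEUED",
  "QUEUED_EXECUTING",
  "PENDING_EXECUTING",
  "EXECUTING",
  "EXECUTING_WITH_WAITPOINTS",
  "SUSPENDED",
  "FINISHED",
] as const;

type ExecutionStatus = (typeof EXECUTION_STATUSES)[number];

// Dequeuing requires a QUEUED or QUEUED_EXECUTING snapshot.
function isDequeueable(status: ExecutionStatus): boolean {
  return status === "QUEUED" || status === "QUEUED_EXECUTING";
}

// A blocked run is either still on a worker or suspended.
function isBlockedOnWaitpoints(status: ExecutionStatus): boolean {
  return status === "EXECUTING_WITH_WAITPOINTS" || status === "SUSPENDED";
}

function isFinished(status: ExecutionStatus): boolean {
  return status === "FINISHED";
}
```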

Sources: internal-packages/run-engine/src/engine/statuses.ts:3-62

Subsystems

Execution Snapshot System

The ExecutionSnapshotSystem provides centralized snapshot management with helpers for creating snapshots and retrieving execution state.

Key Methods:

Method	Purpose
`createExecutionSnapshot()`	Creates a new snapshot with state transition
`getLatestExecutionSnapshot()`	Retrieves most recent valid snapshot
`getExecutionSnapshotsSince()`	Gets all snapshots after a given snapshot
`executionResultFromSnapshot()`	Converts snapshot to API result format
`executionDataFromSnapshot()`	Converts to full execution data with waitpoints

Snapshot Heartbeat Monitoring:

The system schedules heartbeat timeout jobs based on execution status:

When a heartbeat timeout expires, the heartbeatSnapshot worker job is executed to handle stalled runs.
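
A minimal sketch of how a per-status timeout could be turned into a deadline for the heartbeatSnapshot job. The `HeartbeatTimeoutsMs` shape, the `heartbeatDeadline` helper, and the numeric values are placeholders chosen for illustration, not the engine's defaults.

```typescript
type HeartbeatStatus =
  | "PENDING_EXECUTING"
  | "EXECUTING"
  | "EXECUTING_WITH_WAITPOINTS"
  | "SUSPENDED";

// Statuses without an entry get no heartbeat job in this sketch.
type HeartbeatTimeoutsMs = Partial<Record<HeartbeatStatus, number>>;

// Placeholder values, not the engine's defaults.
const exampleTimeouts: HeartbeatTimeoutsMs = {
  PENDING_EXECUTING: 60000,
  EXECUTING: 300000,
};

// Returns when the stall check should run, or null if none is needed.
function heartbeatDeadline(
  status: HeartbeatStatus,
  now: Date,
  timeouts: HeartbeatTimeoutsMs
): Date | null {
  const timeoutMs = timeouts[status];
  if (timeoutMs === undefined) {
    return null;
  }
  return new Date(now.getTime() + timeoutMs);
}
```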

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:226-393

Run Attempt System

The RunAttemptSystem manages the lifecycle of individual execution attempts, including starting, completing, and handling failures.

Caching Strategy:

The system uses a multi-tier cache (UnkeyCache) with memory and Redis stores to minimize database queries:

Cache Namespace	Fresh TTL	Stale TTL	Contents
`orgs`	24h	48h	Organization info
`projects`	24h	48h	Project info
`tasks`	24h	48h	Task metadata
`machinePresets`	24h	48h	Machine configurations
`deployments`	24h	48h	Deployment info
`queues`	1h	2h	Queue configurations
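
To make the two TTL columns concrete, the sketch below classifies a cached entry by age. It assumes stale-while-revalidate semantics (a stale entry is still served while it is refreshed, an expired one is not); `classifyEntry` and `CACHE_TTLS` are illustrative names, and only the durations come from the table above.

```typescript
const HOUR_MS = 60 * 60 * 1000;

type Namespace = "orgs" | "projects" | "tasks" | "machinePresets" | "deployments" | "queues";

// Durations copied from the table above.
const CACHE_TTLS: Record<Namespace, { freshMs: number; staleMs: number }> = {
  orgs: { freshMs: 24 * HOUR_MS, staleMs: 48 * HOUR_MS },
  projects: { freshMs: 24 * HOUR_MS, staleMs: 48 * HOUR_MS },
  tasks: { freshMs: 24 * HOUR_MS, staleMs: 48 * HOUR_MS },
  machinePresets: { freshMs: 24 * HOUR_MS, staleMs: 48 * HOUR_MS },
  deployments: { freshMs: 24 * HOUR_MS, staleMs: 48 * HOUR_MS },
  queues: { freshMs: 1 * HOUR_MS, staleMs: 2 * HOUR_MS },
};

type Freshness = "fresh" | "stale" | "expired";

function classifyEntry(namespace: Namespace, storedAtMs: number, nowMs: number): Freshness {
  const ageMs = nowMs - storedAtMs;
  const ttl = CACHE_TTLS[namespace];
  if (ageMs < ttl.freshMs) return "fresh"; // serve directly
  if (ageMs < ttl.staleMs) return "stale"; // serve, then refresh in the background
  return "expired"; // must reload from the database
}
```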

Retry Logic:

The retryOutcomeFromCompletion() function determines whether a failed attempt should be retried:

Check if error type is retryable (user errors vs system errors)
Verify attemptNumber < maxAttempts
Calculate exponential backoff delay
Return RetryOutcome with delay and reason
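
The four steps above can be sketched as follows. This is not the contents of retrying.ts: `sketchRetryOutcome`, the `BackoffOptions` shape, and the example backoff values are assumptions, and the delay formula is plain capped exponential backoff.

```typescript
type RetryOutcome =
  | { outcome: "retry"; delayMs: number; reason: string }
  | { outcome: "fail"; reason: string };

type BackoffOptions = { minDelayMs: number; maxDelayMs: number; factor: number };

// Example values only; real settings come from the task's retry config.
const exampleBackoff: BackoffOptions = { minDelayMs: 1000, maxDelayMs: 60000, factor: 2 };

function sketchRetryOutcome(
  attemptNumber: number, // 1-based number of the attempt that just failed
  maxAttempts: number,
  retryable: boolean,
  backoff: BackoffOptions = exampleBackoff
): RetryOutcome {
  if (!retryable) {
    return { outcome: "fail", reason: "error is not retryable" };
  }
  // Retry only while attemptNumber < maxAttempts.
  if (attemptNumber >= maxAttempts) {
    return { outcome: "fail", reason: "max attempts reached" };
  }
  // Delay grows by `factor` per attempt and is capped at maxDelayMs.
  const rawDelay = backoff.minDelayMs * Math.pow(backoff.factor, attemptNumber - 1);
  return {
    outcome: "retry",
    delayMs: Math.min(rawDelay, backoff.maxDelayMs),
    reason: "retryable error",
  };
}
```

With the example values, the first failure waits 1000 ms, the second 2000 ms, and later failures are capped at 60000 ms.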

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:110-1631, internal-packages/run-engine/src/engine/retrying.ts

Dequeue System

The DequeueSystem handles the complex process of selecting and preparing runs for execution.

Dequeue Process with Code Methods

Background Worker Resolution:

The system must resolve the following before execution:

BackgroundWorker - Specific deployed version
BackgroundWorkerTask - Task definition within that version
TaskQueue - Queue configuration
WorkerDeployment - Container image reference (for production)

If any are missing, the run enters WAITING_FOR_DEPLOY status.

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:88-630, internal-packages/run-engine/src/engine/statuses.ts:3-6

Waitpoint System

The WaitpointSystem enables runs to block and wait for external events or conditions. Waitpoints support orchestration patterns like triggerAndWait, waitForDuration, and waitUntil.

Waitpoint Types:

Type	Purpose	Completion Trigger
`RUN`	Wait for another task run to complete	Task run completes
`DATETIME`	Wait until a specific time	Worker job at scheduled time
`MANUAL`	Wait for manual approval/signal	External API call
`BATCH`	Wait for all batch items to complete	All runs in batch complete

Key Operations:

Blocking Mechanism:

When a run is blocked:

Insert rows into TaskRunWaitpoint table linking run to waitpoint(s)
Query to count pending waitpoints
If any pending, transition to EXECUTING_WITH_WAITPOINTS or SUSPENDED
When waitpoint completes, schedule continueRunIfUnblocked job
Job checks if all waitpoints completed, then either enqueues run or sends notification to worker
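
The decision made by the continueRunIfUnblocked job can be sketched as below. The `decideContinue` helper is hypothetical, and the mapping from snapshot status to action (notify a run that is still on its worker, re-enqueue a suspended one) is an assumption; the text above only says that one of the two happens.

```typescript
type WaitpointStatus = "PENDING" | "COMPLETED";
type BlockedStatus = "EXECUTING_WITH_WAITPOINTS" | "SUSPENDED";
type ContinueDecision = "still_blocked" | "notify_worker" | "enqueue_run";

function decideContinue(
  waitpoints: WaitpointStatus[],
  snapshotStatus: BlockedStatus
): ContinueDecision {
  // Any pending waitpoint keeps the run blocked.
  const pending = waitpoints.filter((status) => status === "PENDING").length;
  if (pending > 0) {
    return "still_blocked";
  }
  // Assumption: a run still attached to a worker is notified in place,
  // while a suspended run has to go back through the queue.
  return snapshotStatus === "EXECUTING_WITH_WAITPOINTS" ? "notify_worker" : "enqueue_run";
}
```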

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts:40-622

Checkpoint System

The CheckpointSystem enables runs to save execution state and resume later, supporting long-running tasks and graceful shutdowns.

Checkpoint Creation Process with Code Methods

Checkpoint Data:

Field	Type	Purpose
`type`	Enum	`DOCKER` or `KUBERNETES`
`location`	String	Storage location (S3, registry, etc.)
`imageRef`	String	Container image reference
`reason`	String	Why checkpoint was created

Resume Flow:

Worker calls continueRunExecution(runId, snapshotId)
System validates snapshot is PENDING_EXECUTING
Updates TaskRun status to EXECUTING
Creates new EXECUTING snapshot
Returns execution result with checkpoint data
Worker restores from checkpoint and continues

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts:21-250, internal-packages/run-engine/src/engine/statuses.ts:21-32

Batch System

The BatchSystem manages batch operations using a waitpoint-based approach where a batch run waits for all child runs to complete.

Batch Waitpoint Flow:

createBatchWaitpoint() creates a BATCH type waitpoint
Each child run is triggered with batchId and resumeParentOnCompletion=true
Child runs create TaskRunWaitpoint entries linking to batch waitpoint
As child runs complete, incrementCompletedBatchItems() is called
When itemCount == completedItemCount, batch waitpoint completes
Parent run unblocks and continues
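
The completion check in this flow comes down to comparing two counters. The sketch below models it in memory; `BatchState` and `sketchIncrementCompleted` are illustrative names, and the real counters are presumably persisted rather than held in an object.

```typescript
type BatchState = { itemCount: number; completedItemCount: number };

// Records one finished child run and reports whether the batch is done.
function sketchIncrementCompleted(state: BatchState): {
  state: BatchState;
  batchComplete: boolean;
} {
  // Clamp so a duplicate completion cannot overshoot itemCount.
  const completed = Math.min(state.completedItemCount + 1, state.itemCount);
  return {
    state: { itemCount: state.itemCount, completedItemCount: completed },
    // The batch waitpoint completes when itemCount == completedItemCount.
    batchComplete: completed === state.itemCount,
  };
}
```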

Key Methods:

Sources: internal-packages/run-engine/src/engine/systems/batchSystem.ts

Enqueue System

The EnqueueSystem handles adding runs to the execution queue, including retry scenarios.

Enqueue Process:

Create execution snapshot (status QUEUED or QUEUED_EXECUTING)
Call RunQueue.enqueue() with organization ID and run ID
Queue selects appropriate environment queue using fair selection algorithm
Run is added to Redis queue for that environment

Key Parameters:

Parameter	Purpose
`snapshot.status`	`QUEUED` (normal) or `QUEUED_EXECUTING` (while executing)
`snapshot.metadata`	Retry information, delay data, etc.
`previousSnapshotId`	Links to previous snapshot
`checkpointId`	If resuming from checkpoint
`completedWaitpoints`	Carry forward completed waitpoints

Sources: internal-packages/run-engine/src/engine/systems/enqueueSystem.ts:16-92

Delayed Run System

The DelayedRunSystem handles runs scheduled to start at a future time.

Delayed Run Flow:

Key Operations:

scheduleDelayedRunEnqueuing() - Schedule worker job for future time
enqueueDelayedRun() - Execute scheduled enqueue operation
rescheduleDelayedRun() - Change delay time (used by debounce)

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts:14-192

Debounce System

The DebounceSystem implements leading and trailing debounce patterns for task triggering.

Debounce Modes:

Mode	Behavior
`leading`	Execute first trigger immediately, ignore subsequent within delay
`trailing`	Schedule execution after delay, reschedule on new triggers

Implementation:

Uses Redis for coordination:

Key pattern: debounce:{environmentId}:{taskIdentifier}:{debounceKey}
Stores: {claimId, runId, delayUntil}
TTL: Maximum debounce duration (default 1 hour)

Trailing Mode Update:

When a debounced run already exists:

Update TaskRun payload/metadata with latest values
Reschedule enqueueDelayedRun to new delay time
Return existing run to caller
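
A minimal sketch of the trailing behavior, using an in-memory `Map` as a stand-in for Redis. The key format and the stored fields follow the description above; `debounceRedisKey`, `triggerTrailing`, and the return shape are hypothetical, and payload updates and the maximum-duration check are left out.

```typescript
type DebounceEntry = { claimId: string; runId: string; delayUntil: number };

function debounceRedisKey(
  environmentId: string,
  taskIdentifier: string,
  debounceKey: string
): string {
  return `debounce:${environmentId}:${taskIdentifier}:${debounceKey}`;
}

// `store` is an in-memory stand-in for Redis; times are epoch milliseconds.
function triggerTrailing(
  store: Map<string, DebounceEntry>,
  key: string,
  candidate: { claimId: string; runId: string },
  nowMs: number,
  delayMs: number
): { status: "new" | "existing"; runId: string; delayUntil: number } {
  const delayUntil = nowMs + delayMs;
  const existing = store.get(key);

  if (existing !== undefined && existing.delayUntil > nowMs) {
    // A debounced run is still waiting: push its start time back.
    store.set(key, { ...existing, delayUntil });
    return { status: "existing", runId: existing.runId, delayUntil };
  }

  // No live entry: the candidate becomes the debounced run.
  store.set(key, { ...candidate, delayUntil });
  return { status: "new", runId: candidate.runId, delayUntil };
}
```

Two triggers inside the delay window therefore resolve to the same run, with its start time moved to the later deadline.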

Sources: internal-packages/run-engine/src/engine/systems/debounceSystem.ts

TTL System

The TtlSystem handles automatic expiration of runs that exceed their time-to-live.

TTL Expiration Process:

When run is created/enqueued, schedule expireRun worker job for TTL time
When job executes, verify run hasn't already finished
If still pending/executing, fail run with EXPIRED status
Complete associated waitpoint with timeout error
Release concurrency tokens

Sources: internal-packages/run-engine/src/engine/systems/ttlSystem.ts:15-63

Distributed Locking

The RunLocker class provides distributed locking using Redlock to prevent race conditions when multiple processes access the same run.

Locking Architecture:

Configuration:

Parameter	Default	Purpose
`duration`	5000ms	Lock duration before expiry
`automaticExtensionThreshold`	1000ms	When to auto-extend lock
`retryConfig.maxAttempts`	10	Maximum acquisition attempts
`retryConfig.baseDelay`	100ms	Initial retry delay
`retryConfig.maxDelay`	3000ms	Maximum retry delay
`retryConfig.backoffMultiplier`	1.8	Exponential backoff factor
`retryConfig.jitterFactor`	0.15	Random jitter percentage
`retryConfig.maxTotalWaitTime`	15000ms	Total retry timeout
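
To show how the retry parameters interact, the sketch below computes the wait before each acquisition retry. The defaults are taken from the table; the `lockRetryDelay` helper and the symmetric jitter formula are assumptions, not the code in locking.ts.

```typescript
type LockRetryConfig = {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitterFactor: number;
  maxTotalWaitTime: number;
};

// Defaults copied from the table above (times in milliseconds).
const defaultLockRetry: LockRetryConfig = {
  maxAttempts: 10,
  baseDelay: 100,
  maxDelay: 3000,
  backoffMultiplier: 1.8,
  jitterFactor: 0.15,
  maxTotalWaitTime: 15000,
};

// `attempt` is zero-based: 0 is the wait before the first retry.
// `random` is injectable so the result can be made deterministic.
function lockRetryDelay(
  attempt: number,
  config: LockRetryConfig = defaultLockRetry,
  random: () => number = Math.random
): number {
  const exponential = config.baseDelay * Math.pow(config.backoffMultiplier, attempt);
  const capped = Math.min(exponential, config.maxDelay);
  // Assumption: jitter is symmetric, +/- jitterFactor around the capped delay.
  const jitter = capped * config.jitterFactor * (random() * 2 - 1);
  return Math.max(0, Math.round(capped + jitter));
}
```

Without jitter the delays start at 100 ms, 180 ms, 324 ms and reach the 3000 ms cap by the seventh retry. A caller would stop once `maxAttempts` is reached or the accumulated wait exceeds `maxTotalWaitTime`.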

Nested Lock Optimization:

When a lock is requested for resources already held, the system reuses the existing lock instead of acquiring a new one, preventing deadlocks and improving performance.

Manual Locking:

For long-running operations, the system supports manual lock management:

Sources: internal-packages/run-engine/src/engine/locking.ts:70-497

Queue Architecture

The RunQueue manages task queuing using a hierarchical Redis-based system with fair queue selection and concurrency control.

Queue Hierarchy:

Fair Queue Selection Algorithm

The FairQueueSelectionStrategy uses a weighted scoring algorithm to select which environment to dequeue from:

Algorithm Components:

Component	Code Reference	Default	Purpose
`concurrencyLimitBias`	`biases.concurrencyLimitBias`	0.75	Prefer environments with higher concurrency limits
`availableCapacityBias`	`biases.availableCapacityBias`	0.3	Prefer environments with more available capacity
`queueAgeRandomization`	`biases.queueAgeRandomization`	0.25	Add randomness to prevent starvation
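
One way such biases could feed a weighted random choice is sketched below. This is an illustration of the general idea only: the weight formula, the `Env` shape, and `pickEnvironment` are assumptions rather than the FairQueueSelectionStrategy implementation, and `queueAgeRandomization` is left out.

```typescript
type Env = { id: string; concurrencyLimit: number; currentConcurrency: number };
type Biases = { concurrencyLimitBias: number; availableCapacityBias: number };

// Every environment starts at weight 1; each bias adds up to its own value.
function weightFor(env: Env, maxLimit: number, biases: Biases): number {
  const limitShare = maxLimit > 0 ? env.concurrencyLimit / maxLimit : 0;
  const free = Math.max(0, env.concurrencyLimit - env.currentConcurrency);
  const capacityShare = env.concurrencyLimit > 0 ? free / env.concurrencyLimit : 0;
  return (
    1 +
    limitShare * biases.concurrencyLimitBias +
    capacityShare * biases.availableCapacityBias
  );
}

// Weighted random pick; `random` is injectable for deterministic tests.
function pickEnvironment(
  envs: Env[],
  biases: Biases,
  random: () => number = Math.random
): Env | undefined {
  if (envs.length === 0) return undefined;
  const maxLimit = Math.max(...envs.map((env) => env.concurrencyLimit));
  const weights = envs.map((env) => weightFor(env, maxLimit, biases));
  const total = weights.reduce((sum, weight) => sum + weight, 0);

  let threshold = random() * total;
  for (let i = 0; i < envs.length; i++) {
    threshold -= weights[i] ?? 0;
    if (threshold < 0) return envs[i];
  }
  return envs[envs.length - 1]; // guard against floating-point drift
}
```

Because every environment keeps a base weight of 1, a busy or small environment is picked less often but is never starved outright.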

Concurrency Management Methods:

Each environment has concurrency limits tracked in Redis:

Redis Key Structure:

Concurrency counter: runqueue:concurrency:{environmentId}
Concurrency tokens: runqueue:concurrency:tokens:{environmentId}:{runId}
Base limit (default 100) from RuntimeEnvironment.maximumConcurrencyLimit
Burst factor (default 1.0) from RuntimeEnvironment.concurrencyLimitBurstFactor
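
A rough sketch of how the base limit and burst factor could gate dequeuing, with a `Set` standing in for the Redis token keys. It assumes the burst factor simply multiplies the base limit; `EnvConcurrency`, `tryAcquire`, and `release` are illustrative names.

```typescript
type EnvConcurrency = {
  maximumConcurrencyLimit: number; // base limit
  concurrencyLimitBurstFactor: number; // multiplier applied to the base limit
  running: Set<string>; // in-memory stand-in for the per-run token keys
};

function effectiveConcurrencyLimit(env: EnvConcurrency): number {
  return Math.floor(env.maximumConcurrencyLimit * env.concurrencyLimitBurstFactor);
}

// Takes a token for runId if capacity allows; idempotent per run.
function tryAcquire(env: EnvConcurrency, runId: string): boolean {
  if (env.running.has(runId)) return true;
  if (env.running.size >= effectiveConcurrencyLimit(env)) return false;
  env.running.add(runId);
  return true;
}

// Releases the token when the run finishes, fails, or expires.
function release(env: EnvConcurrency, runId: string): void {
  env.running.delete(runId);
}
```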

Sources: internal-packages/run-engine/src/run-queue/index.ts, internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts:1-346

Background Worker Catalog

The engine uses a Worker (from @trigger.dev/redis-worker) to process background jobs. Jobs are defined in the worker catalog:

Job Type	Payload	Purpose
`finishWaitpoint`	`{waitpointId, error?}`	Complete a waitpoint
`heartbeatSnapshot`	`{runId, snapshotId, restartAttempt?}`	Handle stalled snapshot
`repairSnapshot`	`{runId, snapshotId, executionStatus}`	Repair invalid snapshot
`expireRun`	`{runId}`	Expire run on TTL
`cancelRun`	`{runId, completedAt, reason?}`	Cancel a run
`queueRunsPendingVersion`	`{backgroundWorkerId}`	Enqueue runs waiting for deployment
`tryCompleteBatch`	`{batchId}`	Attempt batch completion
`continueRunIfUnblocked`	`{runId}`	Resume run if waitpoints completed
`enqueueDelayedRun`	`{runId}`	Enqueue delayed run

Sources: internal-packages/run-engine/src/engine/workerCatalog.ts:1-66

Configuration

The RunEngine is configured via the RunEngineOptions interface. The webapp instantiates it with environment variables:

Key Configuration Groups:

Group	Environment Variables	Purpose
Worker	`RUN_ENGINE_WORKER_COUNT`, `RUN_ENGINE_TASKS_PER_WORKER`	Background job processing
Queue	`RUN_ENGINE_RUN_QUEUE_REDIS_*`, `DEFAULT_ENV_EXECUTION_CONCURRENCY_LIMIT`	Queue connections and limits
Locks	`RUN_ENGINE_RUN_LOCK_REDIS_*`, `RUN_ENGINE_RUN_LOCK_DURATION`	Distributed locking
Timeouts	`RUN_ENGINE_TIMEOUT_PENDING_EXECUTING`, `RUN_ENGINE_TIMEOUT_EXECUTING`	Heartbeat timeouts
Retries	`RUN_ENGINE_RETRY_WARM_START_THRESHOLD_MS`	Checkpoint-based retry threshold
Batch	`BATCH_QUEUE_CONSUMER_COUNT`, `BATCH_QUEUE_DRR_QUANTUM`	Batch processing via DRR

Example Initialization:

The webapp creates a singleton instance with production configuration loaded from environment variables and platform services.

Sources: apps/webapp/app/v3/runEngine.server.ts:15-198, apps/webapp/app/env.server.ts:560-694

Event Bus

The RunEngine emits events through an EventBus (Node.js EventEmitter) to notify other parts of the system about state changes:

Key Events:

Event	Payload	Purpose
`runCreated`	`{time, runId}`	New run created
`runLocked`	`{time, run, organization, project, environment}`	Run dequeued and locked
`runAttemptStarted`	`{time, run, organization, project, environment}`	Attempt started
`runAttemptCompleted`	`{time, run, organization, project, environment}`	Attempt finished
`runStatusChanged`	`{time, run, organization, project, environment}`	Status changed
`runEnqueuedAfterDelay`	`{time, run, organization, project, environment}`	Delayed run queued
`runDelayRescheduled`	`{time, run, organization, project, environment}`	Delay time changed
`cachedRunCompleted`	`{time, span, blockedRunId, hasError, cachedRunId}`	Cached run completed (debounce)
`incomingCheckpointDiscarded`	`{time, run, checkpoint, snapshot}`	Invalid checkpoint rejected

These events can be subscribed to for metrics, logging, and integration with other systems.
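
A subscription for metrics might look like the sketch below. To stay self-contained it uses a small typed stand-in rather than Node's EventEmitter, so `createEventBus` and its `on`/`emit` methods are hypothetical; the event names come from the table above and the payloads are trimmed to a few fields.

```typescript
// Trimmed payloads; the real events carry more fields.
type EngineEvents = {
  runCreated: { time: Date; runId: string };
  runAttemptCompleted: { time: Date; run: { id: string } };
};

type Handler<T> = (payload: T) => void;

// Minimal typed stand-in for an EventEmitter-style bus.
function createEventBus<Events extends Record<string, unknown>>() {
  const handlers = new Map<keyof Events, Array<Handler<any>>>();
  return {
    on<K extends keyof Events>(event: K, handler: Handler<Events[K]>): void {
      const list = handlers.get(event) ?? [];
      list.push(handler);
      handlers.set(event, list);
    },
    emit<K extends keyof Events>(event: K, payload: Events[K]): void {
      (handlers.get(event) ?? []).forEach((handler) => handler(payload));
    },
  };
}

// Example subscriber: count created runs for a metrics counter.
function countCreatedRuns(bus: ReturnType<typeof createEventBus<EngineEvents>>): () => number {
  let created = 0;
  bus.on("runCreated", () => {
    created += 1;
  });
  return () => created;
}
```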

Sources: internal-packages/run-engine/src/engine/eventBus.ts

Task Execution Engine

Relevant source files

Architecture Overview

RunEngine Class Structure

Sources: internal-packages/run-engine/src/engine/index.ts76-387 internal-packages/run-engine/src/engine/systems/

Core Components

RunEngine Class

The RunEngine class is the main orchestrator that provides the public API for task execution operations. It initializes all subsystems and coordinates their interactions.

Key Initialization Parameters:

Parameter	Purpose	Default
`prisma`	Database client for state persistence	Required
`worker.redis`	Redis connection for background jobs	Required
`queue.redis`	Redis connection for task queues	Required
`runLock.redis`	Redis connection for distributed locks	Required
`machines`	Machine preset configurations	Required
`heartbeatTimeoutsMs`	Timeout durations for execution states	See defaults
`retryWarmStartThresholdMs`	Threshold for checkpoint-based retries	`30000`

Sources: internal-packages/run-engine/src/engine/types.ts23-112 apps/webapp/app/v3/runEngine.server.ts15-198

SystemResources

All subsystems share a common set of resources defined by the SystemResources interface:

Sources: internal-packages/run-engine/src/engine/systems/systems.ts

Execution Lifecycle

The execution lifecycle follows a well-defined flow from triggering a task to its completion or failure. Each stage involves state transitions tracked through execution snapshots.

Task Execution Flow with Code Methods

Triggering a Run

The RunEngine.trigger() method creates a new task run and queues it for execution:

Method Signature:

Key Steps:

Debounce Check: If debounce param provided, call debounceSystem.handleDebounce()
- If existing run found, return it (status: existing)
- If max duration exceeded, proceed with new run (status: max_duration_exceeded)
- If new claim, store debounceClaimId (status: new)
Create TaskRun: Call prisma.taskRun.create() with:
- id: Generated via RunId.fromFriendlyId()
- status: "DELAYED" if delayUntil provided, else "PENDING"
- executionSnapshots: Initial snapshot with status "DELAYED" or "RUN_CREATED"
- associatedWaitpoint: Created via waitpointSystem.buildRunAssociatedWaitpoint()
Handle triggerAndWait: If resumeParentOnCompletion and parentTaskRunId:
- Call waitpointSystem.blockRunWithWaitpoint() to block parent run
Schedule or Enqueue:
- If delayUntil: Call delayedRunSystem.scheduleDelayedRunEnqueuing()
- Else: Call enqueueSystem.enqueueRun() and optionally ttlSystem.scheduleExpireRun()
Emit Event: Fire eventBus.emit('runCreated')

Sources: internal-packages/run-engine/src/engine/index.ts392-733 internal-packages/run-engine/src/engine/types.ts122-194

Dequeuing and Locking

The DequeueSystem handles retrieving runs from queues and preparing them for execution:

Key Steps:

RunQueue.dequeueMessageFromWorkerQueue() - Get run from Redis queue
Acquire distributed lock via RunLocker
Validate execution snapshot state (must be QUEUED or QUEUED_EXECUTING)
Resolve BackgroundWorkerTask, BackgroundWorker, TaskQueue, WorkerDeployment
Update TaskRun with lock information (lockedAt, lockedById, lockedToVersionId)
Create snapshot with status PENDING_EXECUTING
Return DequeuedMessage with execution details

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts105-603

Starting an Attempt

The RunAttemptSystem.startRunAttempt() method transitions a dequeued run to active execution:

Key Steps:

Acquire distributed lock on run
Verify snapshot ID matches latest (prevent race conditions)
Increment attemptNumber (starting at 1)
Check against MAX_TASK_RUN_ATTEMPTS (100)
Update TaskRun status to EXECUTING
Create new snapshot with status EXECUTING
Cancel any scheduled TTL expiration
Resolve execution context (task, queue, organization, project, machine preset, deployment)
Return TaskRunExecution object for worker

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts300-635

Completing an Attempt

The completeRunAttempt() method finalizes an execution attempt, handling both success and failure:

Success Path (attemptSucceeded):

Validate snapshot is latest and not already finished
Calculate usage totals (duration, cost)
Update TaskRun status to COMPLETED_SUCCESSFULLY
Create FINISHED snapshot
Complete associated waitpoint (for triggerAndWait)
Release concurrency tokens
Emit runAttemptCompleted event

Failure Path (attemptFailed):

Determine retry outcome via retryOutcomeFromCompletion()
If retrying:
- Calculate delay using exponential backoff
- Call EnqueueSystem.enqueueRun() with retry metadata
- Update status to WAITING_FOR_DEPLOY or PENDING
If not retrying:
- Update status to final state (COMPLETED_WITH_ERRORS, SYSTEM_FAILURE, etc.)
- Complete associated waitpoint with error
- Release concurrency tokens
Emit appropriate events

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts637-1120

State Management

Execution Snapshot System

The ExecutionSnapshotSystem manages the creation and retrieval of execution snapshots, which track the detailed state of a run at specific points in time.

TaskRunExecutionSnapshot Fields:

Field	Type	Purpose
`id`	String	Unique snapshot ID
`runId`	String	Associated task run
`executionStatus`	Enum	Current execution state
`runStatus`	Enum	TaskRun status at snapshot time
`attemptNumber`	Int	Attempt number (1-based)
`description`	String	Human-readable state description
`previousSnapshotId`	String	Previous snapshot (linked list)
`checkpointId`	String	Associated checkpoint if applicable
`batchId`	String	Batch context if applicable
`completedWaitpoints`	Relation	Waitpoints completed at this point
`metadata`	JSON	Additional state metadata

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts1-393

Execution States

Valid TaskRunExecutionStatus Values:

Sources: internal-packages/run-engine/src/engine/statuses.ts1-62

State Predicates

The engine uses helper functions to classify execution states:

Sources: internal-packages/run-engine/src/engine/statuses.ts3-62

Subsystems

Execution Snapshot System

The ExecutionSnapshotSystem provides centralized snapshot management with helpers for creating snapshots and retrieving execution state.

Key Methods:

Method	Purpose
`createExecutionSnapshot()`	Creates a new snapshot with state transition
`getLatestExecutionSnapshot()`	Retrieves most recent valid snapshot
`getExecutionSnapshotsSince()`	Gets all snapshots after a given snapshot
`executionResultFromSnapshot()`	Converts snapshot to API result format
`executionDataFromSnapshot()`	Converts to full execution data with waitpoints

Snapshot Heartbeat Monitoring:

The system schedules heartbeat timeout jobs based on execution status:

When a heartbeat timeout expires, the heartbeatSnapshot worker job is executed to handle stalled runs.

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts226-393

Run Attempt System

The RunAttemptSystem manages the lifecycle of individual execution attempts, including starting, completing, and handling failures.

Caching Strategy:

The system uses a multi-tier cache (UnkeyCache) with memory and Redis stores to minimize database queries:

Cache Namespace	Fresh TTL	Stale TTL	Contents
`orgs`	24h	48h	Organization info
`projects`	24h	48h	Project info
`tasks`	24h	48h	Task metadata
`machinePresets`	24h	48h	Machine configurations
`deployments`	24h	48h	Deployment info
`queues`	1h	2h	Queue configurations

Retry Logic:

The retryOutcomeFromCompletion() function determines whether a failed attempt should be retried:

Check if error type is retryable (user errors vs system errors)
Verify attemptNumber < maxAttempts
Calculate exponential backoff delay
Return RetryOutcome with delay and reason

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts110-1631 internal-packages/run-engine/src/engine/retrying.ts

Dequeue System

The DequeueSystem handles the complex process of selecting and preparing runs for execution.

Dequeue Process with Code Methods

Background Worker Resolution:

The system must resolve the following before execution:

BackgroundWorker - Specific deployed version
BackgroundWorkerTask - Task definition within that version
TaskQueue - Queue configuration
WorkerDeployment - Container image reference (for production)

If any are missing, the run enters WAITING_FOR_DEPLOY status.

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts88-630 internal-packages/run-engine/src/engine/statuses.ts3-6

Waitpoint System

The WaitpointSystem enables runs to block and wait for external events or conditions. Waitpoints support orchestration patterns like triggerAndWait, waitForDuration, and waitUntil.

Waitpoint Types:

Type	Purpose	Completion Trigger
`RUN`	Wait for another task run to complete	Task run completes
`DATETIME`	Wait until a specific time	Worker job at scheduled time
`MANUAL`	Wait for manual approval/signal	External API call
`BATCH`	Wait for all batch items to complete	All runs in batch complete

Key Operations:

Blocking Mechanism:

When a run is blocked:

Insert rows into TaskRunWaitpoint table linking run to waitpoint(s)
Query to count pending waitpoints
If any pending, transition to EXECUTING_WITH_WAITPOINTS or SUSPENDED
When waitpoint completes, schedule continueRunIfUnblocked job
Job checks if all waitpoints completed, then either enqueues run or sends notification to worker

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts40-622

Checkpoint System

The CheckpointSystem enables runs to save execution state and resume later, supporting long-running tasks and graceful shutdowns.

Checkpoint Creation Process with Code Methods

Checkpoint Data:

Field	Type	Purpose
`type`	Enum	`DOCKER` or `KUBERNETES`
`location`	String	Storage location (S3, registry, etc.)
`imageRef`	String	Container image reference
`reason`	String	Why checkpoint was created

Resume Flow:

Worker calls continueRunExecution(runId, snapshotId)
System validates snapshot is PENDING_EXECUTING
Updates TaskRun status to EXECUTING
Creates new EXECUTING snapshot
Returns execution result with checkpoint data
Worker restores from checkpoint and continues

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts21-250 internal-packages/run-engine/src/engine/statuses.ts21-32

Batch System

The BatchSystem manages batch operations using a waitpoint-based approach where a batch run waits for all child runs to complete.

Batch Waitpoint Flow:

createBatchWaitpoint() creates a BATCH type waitpoint
Each child run is triggered with batchId and resumeParentOnCompletion=true
Child runs create TaskRunWaitpoint entries linking to batch waitpoint
As child runs complete, incrementCompletedBatchItems() is called
When itemCount == completedItemCount, batch waitpoint completes
Parent run unblocks and continues

Key Methods:

Sources: internal-packages/run-engine/src/engine/systems/batchSystem.ts

Enqueue System

The EnqueueSystem handles adding runs to the execution queue, including retry scenarios.

Enqueue Process:

Create execution snapshot (status QUEUED or QUEUED_EXECUTING)
Call RunQueue.enqueue() with organization ID and run ID
Queue selects appropriate environment queue using fair selection algorithm
Run is added to Redis queue for that environment

Key Parameters:

Parameter	Purpose
`snapshot.status`	`QUEUED` (normal) or `QUEUED_EXECUTING` (while executing)
`snapshot.metadata`	Retry information, delay data, etc.
`previousSnapshotId`	Links to previous snapshot
`checkpointId`	If resuming from checkpoint
`completedWaitpoints`	Carry forward completed waitpoints

Sources: internal-packages/run-engine/src/engine/systems/enqueueSystem.ts16-92

Delayed Run System

The DelayedRunSystem handles runs scheduled to start at a future time.

Delayed Run Flow:

Key Operations:

scheduleDelayedRunEnqueuing() - Schedule worker job for future time
enqueueDelayedRun() - Execute scheduled enqueue operation
rescheduleDelayedRun() - Change delay time (used by debounce)

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts14-192

Debounce System

The DebounceSystem implements leading and trailing debounce patterns for task triggering.

Debounce Modes:

Mode	Behavior
`leading`	Execute the first trigger immediately; ignore subsequent triggers within the delay window
`trailing`	Schedule execution after the delay; each new trigger reschedules it

Implementation:

Uses Redis for coordination:

Key pattern: debounce:{environmentId}:{taskIdentifier}:{debounceKey}
Stores: {claimId, runId, delayUntil}
TTL: Maximum debounce duration (default 1 hour)

Trailing Mode Update:

When a debounced run already exists:

Update TaskRun payload/metadata with latest values
Reschedule enqueueDelayedRun to new delay time
Return existing run to caller
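
The two modes and the key pattern can be sketched as follows. A `Map` stands in for Redis, the `claimId` field is omitted, and `handleTrigger` with its return shape is a hypothetical helper, so treat this as an illustration of the decision logic rather than the DebounceSystem implementation.

```typescript
type DebounceMode = "leading" | "trailing";

interface DebounceRecord {
  runId: string;
  delayUntil: number; // epoch ms
}

// A Map stands in for Redis; a real implementation would need atomic
// set-if-absent semantics and key expiry.
const store = new Map<string, DebounceRecord>();

function debounceKey(environmentId: string, taskIdentifier: string, key: string): string {
  return `debounce:${environmentId}:${taskIdentifier}:${key}`;
}

function handleTrigger(opts: {
  redisKey: string;
  mode: DebounceMode;
  newRunId: string;
  now: number;
  delayMs: number;
}): { runId: string; action: "created" | "rescheduled" | "ignored" } {
  const existing = store.get(opts.redisKey);
  if (!existing || existing.delayUntil <= opts.now) {
    // No active debounce window: this trigger creates the run.
    store.set(opts.redisKey, { runId: opts.newRunId, delayUntil: opts.now + opts.delayMs });
    return { runId: opts.newRunId, action: "created" };
  }
  if (opts.mode === "trailing") {
    // Push the existing run's scheduled time back and return that run.
    existing.delayUntil = opts.now + opts.delayMs;
    return { runId: existing.runId, action: "rescheduled" };
  }
  // Leading mode: the first trigger already ran, so later ones are ignored.
  return { runId: existing.runId, action: "ignored" };
}
```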

Sources: internal-packages/run-engine/src/engine/systems/debounceSystem.ts

TTL System

The TtlSystem handles automatic expiration of runs that exceed their time-to-live.

TTL Expiration Process:

When run is created/enqueued, schedule expireRun worker job for TTL time
When job executes, verify run hasn't already finished
If still pending/executing, fail run with EXPIRED status
Complete associated waitpoint with timeout error
Release concurrency tokens
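
The guard in the second and third steps is what makes the `expireRun` job safe to run late. A minimal sketch, assuming a simplified `Run` record with hypothetical status names and fields:

```typescript
// Simplified statuses for illustration; the engine's real status set is larger.
type RunStatus = "PENDING" | "EXECUTING" | "COMPLETED" | "EXPIRED";

interface Run {
  id: string;
  status: RunStatus;
  waitpointError?: string; // hypothetical field: error delivered to the associated waitpoint
  concurrencyReleased: boolean; // hypothetical field: whether queue tokens were released
}

// Returns true if the run was expired, false if it had already finished.
function expireRun(run: Run): boolean {
  if (run.status !== "PENDING" && run.status !== "EXECUTING") {
    return false; // already finished: the TTL job is a no-op
  }
  run.status = "EXPIRED";
  run.waitpointError = "Run expired because its TTL elapsed";
  run.concurrencyReleased = true;
  return true;
}
```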

Sources: internal-packages/run-engine/src/engine/systems/ttlSystem.ts15-63

Distributed Locking

The RunLocker class provides distributed locking using Redlock to prevent race conditions when multiple processes access the same run.

Locking Architecture:

Configuration:

Parameter	Default	Purpose
`duration`	5000ms	Lock duration before expiry
`automaticExtensionThreshold`	1000ms	Remaining lock time at which the lock is automatically extended
`retryConfig.maxAttempts`	10	Maximum acquisition attempts
`retryConfig.baseDelay`	100ms	Initial retry delay
`retryConfig.maxDelay`	3000ms	Maximum retry delay
`retryConfig.backoffMultiplier`	1.8	Exponential backoff factor
`retryConfig.jitterFactor`	0.15	Random jitter percentage
`retryConfig.maxTotalWaitTime`	15000ms	Total retry timeout
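
To show how the retry defaults interact, here is a sketch of a capped exponential backoff with jitter. The formula, in particular applying jitter symmetrically around the capped delay, is an assumption and may differ from what `locking.ts` does; `retryDelay` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelay: number; // ms
  maxDelay: number; // ms
  backoffMultiplier: number;
  jitterFactor: number;
  maxTotalWaitTime: number; // ms
}

// Defaults taken from the configuration table above.
const defaultRetryConfig: RetryConfig = {
  maxAttempts: 10,
  baseDelay: 100,
  maxDelay: 3000,
  backoffMultiplier: 1.8,
  jitterFactor: 0.15,
  maxTotalWaitTime: 15000,
};

// Delay before retry number `attempt` (0-based). `random` is injectable so
// the jitter can be made deterministic in tests.
function retryDelay(
  attempt: number,
  config: RetryConfig = defaultRetryConfig,
  random: () => number = Math.random
): number {
  const exponential = config.baseDelay * Math.pow(config.backoffMultiplier, attempt);
  const capped = Math.min(exponential, config.maxDelay);
  // Assumed jitter model: spread the delay by up to +/- jitterFactor of its value.
  const jitter = capped * config.jitterFactor * (random() * 2 - 1);
  return Math.max(0, Math.round(capped + jitter));
}
```

Under these assumptions, and ignoring jitter, the delays grow as 100 ms, 180 ms, 324 ms, and so on until they reach the 3000 ms cap.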

Nested Lock Optimization:

When a lock is requested for resources already held, the system reuses the existing lock instead of acquiring a new one, preventing deadlocks and improving performance.
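
The reuse rule can be illustrated with a synchronous model. This is a sketch under stated assumptions: `withLock`, `heldStack`, and `acquisitions` are hypothetical names, a counter stands in for real Redlock acquisition, and asynchronous code would need proper context propagation instead of a module-level stack.

```typescript
// Resources held by the current (synchronous) call chain.
const heldStack: Array<Set<string>> = [];
let acquisitions = 0; // counts simulated round trips to the lock service

function withLock<T>(resources: string[], fn: () => T): T {
  const held = heldStack[heldStack.length - 1];
  if (held && resources.every((r) => held.has(r))) {
    return fn(); // every resource is already covered by an outer lock: reuse it
  }
  acquisitions += 1; // stands in for acquiring a real distributed lock
  const next = new Set<string>(held ?? []);
  resources.forEach((r) => next.add(r));
  heldStack.push(next);
  try {
    return fn();
  } finally {
    heldStack.pop(); // stands in for releasing the lock
  }
}
```

Nesting `withLock(["run_1"], ...)` inside another `withLock(["run_1"], ...)` performs a single acquisition, whereas requesting a different resource in the inner call acquires a second lock.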

Manual Locking:

For long-running operations, the system also supports manual lock management, where the caller controls when the lock is acquired and released.

Sources: internal-packages/run-engine/src/engine/locking.ts70-497

Queue Architecture

The RunQueue manages task queuing using a hierarchical Redis-based system with fair queue selection and concurrency control.

Queue Hierarchy:

Fair Queue Selection Algorithm

The FairQueueSelectionStrategy uses a weighted scoring algorithm to select which environment to dequeue from:

Algorithm Components:

Component	Code Reference	Default	Purpose
`concurrencyLimitBias`	`biases.concurrencyLimitBias`	0.75	Prefer environments with higher concurrency limits
`availableCapacityBias`	`biases.availableCapacityBias`	0.3	Prefer environments with more available capacity
`queueAgeRandomization`	`biases.queueAgeRandomization`	0.25	Add randomness to prevent starvation
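
The exact scoring formula is not given here, so the following is only one plausible way to combine the three biases into a weighted score. The `EnvSnapshot` shape, `scoreEnvironment`, and `selectEnvironment` are hypothetical, and the real FairQueueSelectionStrategy may apply `queueAgeRandomization` differently (for example, to queue ordering rather than to an environment score).

```typescript
interface EnvSnapshot {
  envId: string;
  concurrencyLimit: number;
  currentConcurrency: number;
}

// Default biases from the table above.
const biases = {
  concurrencyLimitBias: 0.75,
  availableCapacityBias: 0.3,
  queueAgeRandomization: 0.25,
};

// Assumed scoring: a weighted sum of normalized limit, free capacity, and noise.
function scoreEnvironment(env: EnvSnapshot, maxLimit: number, random: () => number): number {
  const limitScore = maxLimit > 0 ? env.concurrencyLimit / maxLimit : 0;
  const capacityScore =
    env.concurrencyLimit > 0
      ? (env.concurrencyLimit - env.currentConcurrency) / env.concurrencyLimit
      : 0;
  return (
    biases.concurrencyLimitBias * limitScore +
    biases.availableCapacityBias * capacityScore +
    biases.queueAgeRandomization * random()
  );
}

// Picks the highest-scoring environment that still has free capacity.
function selectEnvironment(
  envs: EnvSnapshot[],
  random: () => number = Math.random
): string | undefined {
  const eligible = envs.filter((e) => e.currentConcurrency < e.concurrencyLimit);
  if (eligible.length === 0) return undefined;
  const maxLimit = Math.max(...eligible.map((e) => e.concurrencyLimit));
  let best = eligible[0];
  let bestScore = scoreEnvironment(best, maxLimit, random);
  for (const env of eligible.slice(1)) {
    const score = scoreEnvironment(env, maxLimit, random);
    if (score > bestScore) {
      best = env;
      bestScore = score;
    }
  }
  return best.envId;
}
```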

Concurrency Management Methods:

Each environment has concurrency limits tracked in Redis:

Redis Key Structure:

Concurrency counter: runqueue:concurrency:{environmentId}
Concurrency tokens: runqueue:concurrency:tokens:{environmentId}:{runId}
Base limit (default 100) from RuntimeEnvironment.maximumConcurrencyLimit
Burst factor (default 1.0) from RuntimeEnvironment.concurrencyLimitBurstFactor
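
A small sketch of how the key patterns and limits might be used together. Multiplying the base limit by the burst factor is an assumption about how the two values combine, and the helper names are hypothetical.

```typescript
function concurrencyCounterKey(environmentId: string): string {
  return `runqueue:concurrency:${environmentId}`;
}

function concurrencyTokenKey(environmentId: string, runId: string): string {
  return `runqueue:concurrency:tokens:${environmentId}:${runId}`;
}

// Assumed combination: the burst factor scales the base limit.
function effectiveConcurrencyLimit(maximumConcurrencyLimit = 100, burstFactor = 1.0): number {
  return Math.floor(maximumConcurrencyLimit * burstFactor);
}

// A run may start only while the environment is below its effective limit.
function canStartRun(current: number, maximumConcurrencyLimit = 100, burstFactor = 1.0): boolean {
  return current < effectiveConcurrencyLimit(maximumConcurrencyLimit, burstFactor);
}
```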

Sources: internal-packages/run-engine/src/run-queue/index.ts internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts1-346

Background Worker Catalog

The engine uses a Worker (from @trigger.dev/redis-worker) to process background jobs. Jobs are defined in the worker catalog:

Job Type	Payload	Purpose
`finishWaitpoint`	`{waitpointId, error?}`	Complete a waitpoint
`heartbeatSnapshot`	`{runId, snapshotId, restartAttempt?}`	Handle stalled snapshot
`repairSnapshot`	`{runId, snapshotId, executionStatus}`	Repair invalid snapshot
`expireRun`	`{runId}`	Expire run on TTL
`cancelRun`	`{runId, completedAt, reason?}`	Cancel a run
`queueRunsPendingVersion`	`{backgroundWorkerId}`	Enqueue runs waiting for deployment
`tryCompleteBatch`	`{batchId}`	Attempt batch completion
`continueRunIfUnblocked`	`{runId}`	Resume run if waitpoints completed
`enqueueDelayedRun`	`{runId}`	Enqueue delayed run
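
One way to picture a catalog like this is as a map from job type to payload type, with a typed dispatcher on top. The sketch below covers only a few of the jobs, assumes `error` is a string, and uses hypothetical handlers that return a description instead of doing real work; the actual catalog in `workerCatalog.ts` may validate payloads at runtime.

```typescript
// Payload shapes copied from the table above (subset).
interface JobPayloads {
  finishWaitpoint: { waitpointId: string; error?: string };
  expireRun: { runId: string };
  tryCompleteBatch: { batchId: string };
  enqueueDelayedRun: { runId: string };
}

type JobType = keyof JobPayloads;
type Handlers = { [K in JobType]: (payload: JobPayloads[K]) => string };

// Hypothetical handlers: each just describes what it would do.
const handlers: Handlers = {
  finishWaitpoint: (p) => `complete waitpoint ${p.waitpointId}`,
  expireRun: (p) => `expire run ${p.runId}`,
  tryCompleteBatch: (p) => `try to complete batch ${p.batchId}`,
  enqueueDelayedRun: (p) => `enqueue delayed run ${p.runId}`,
};

// The generic parameter ties the payload type to the job type at compile time.
function dispatch<K extends JobType>(job: K, payload: JobPayloads[K]): string {
  return handlers[job](payload);
}
```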

Sources: internal-packages/run-engine/src/engine/workerCatalog.ts1-66

Configuration

The RunEngine is configured via the RunEngineOptions interface. The webapp instantiates it with environment variables:

Key Configuration Groups:

Group	Environment Variables	Purpose
Worker	`RUN_ENGINE_WORKER_COUNT`, `RUN_ENGINE_TASKS_PER_WORKER`	Background job processing
Queue	`RUN_ENGINE_RUN_QUEUE_REDIS_*`, `DEFAULT_ENV_EXECUTION_CONCURRENCY_LIMIT`	Queue connections and limits
Locks	`RUN_ENGINE_RUN_LOCK_REDIS_*`, `RUN_ENGINE_RUN_LOCK_DURATION`	Distributed locking
Timeouts	`RUN_ENGINE_TIMEOUT_PENDING_EXECUTING`, `RUN_ENGINE_TIMEOUT_EXECUTING`	Heartbeat timeouts
Retries	`RUN_ENGINE_RETRY_WARM_START_THRESHOLD_MS`	Checkpoint-based retry threshold
Batch	`BATCH_QUEUE_CONSUMER_COUNT`, `BATCH_QUEUE_DRR_QUANTUM`	Batch processing via DRR

Example Initialization:

The webapp creates a singleton instance with production configuration loaded from environment variables and platform services.

Sources: apps/webapp/app/v3/runEngine.server.ts15-198 apps/webapp/app/env.server.ts560-694

Event Bus

The RunEngine emits events through an EventBus (Node.js EventEmitter) to notify other parts of the system about state changes:

Key Events:

Event	Payload	Purpose
`runCreated`	`{time, runId}`	New run created
`runLocked`	`{time, run, organization, project, environment}`	Run dequeued and locked
`runAttemptStarted`	`{time, run, organization, project, environment}`	Attempt started
`runAttemptCompleted`	`{time, run, organization, project, environment}`	Attempt finished
`runStatusChanged`	`{time, run, organization, project, environment}`	Status changed
`runEnqueuedAfterDelay`	`{time, run, organization, project, environment}`	Delayed run queued
`runDelayRescheduled`	`{time, run, organization, project, environment}`	Delay time changed
`cachedRunCompleted`	`{time, span, blockedRunId, hasError, cachedRunId}`	Cached run completed (debounce)
`incomingCheckpointDiscarded`	`{time, run, checkpoint, snapshot}`	Invalid checkpoint rejected

These events can be subscribed to for metrics, logging, and integration with other systems.
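
A minimal subscription sketch, using a plain Node.js `EventEmitter` as a stand-in for the engine's EventBus. The `RunCreatedEvent` interface follows the `runCreated` payload in the table, with the field types assumed; how the real bus is exposed on the engine instance is not shown here.

```typescript
import { EventEmitter } from "node:events";

// Payload for runCreated, per the table above (field types are assumed).
interface RunCreatedEvent {
  time: Date;
  runId: string;
}

// Stand-in for the engine's EventBus.
const eventBus = new EventEmitter();
const createdRunIds: string[] = [];

// A metrics or logging integration would subscribe like this.
eventBus.on("runCreated", (event: RunCreatedEvent) => {
  createdRunIds.push(event.runId);
});

// EventEmitter delivers events synchronously to listeners.
eventBus.emit("runCreated", { time: new Date(), runId: "run_123" });
```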

Sources: internal-packages/run-engine/src/engine/eventBus.ts
