Execution Tracking and State Management

Purpose and Scope

This document describes the execution tracking and state management system used throughout workflow and API deployment executions. The system maintains execution status at multiple granularity levels (workflow-level, file-level, and tool-level) using a dual-layer architecture combining Redis for real-time tracking and PostgreSQL for persistent records.

For information about the overall workflow execution orchestration, see Workflow Orchestration and Management. For batch processing strategies, see File Processing Pipeline and Batching. For container orchestration, see Tool Sandbox and Container Execution.

Overview

The execution tracking system provides:

Real-time status visibility - Redis-based trackers enable WebSocket updates and active execution detection
Persistent audit trail - Database models maintain complete execution history
Crash recovery - State in Redis allows workers to resume interrupted executions
Deduplication - Prevents concurrent processing of the same file across multiple workers
Error propagation - Ensures errors at any stage are properly captured and reported

Two-Layer Architecture

Redis holds the short-lived, real-time state (FileExecutionStatusTracker and ToolExecutionTracker), while PostgreSQL holds the durable records (WorkflowExecution, WorkflowFileExecution, and FileHistory).

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py1-766 runner/src/unstract/runner/runner.py1-574 backend/workflow_manager/endpoint_v2/source.py558-661

Redis-Based Status Trackers

FileExecutionStatusTracker

The FileExecutionStatusTracker maintains real-time execution state in Redis with a 24-hour TTL. It enables crash recovery by allowing workers to query execution state and resume from the last completed stage.

Key Data Structure:

Redis Key Pattern: file_execution_status:{execution_id}:{file_execution_id}

Operations:

set_data(data) - Initialize or overwrite execution data
get_data(execution_id, file_execution_id) - Retrieve current state
update_stage_status(execution_id, file_execution_id, stage_status) - Advance stage
update_tool_container_name(execution_id, file_execution_id, tool_container_name) - Store container name for cleanup
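
The key pattern and operations above can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of Redis, and the class names (`InMemoryStore`, `FileStatusTrackerSketch`), the JSON encoding, and the payload fields are hypothetical; only the key pattern, the 24-hour TTL, and the operation names come from this page.

```python
import json
import time
from typing import Any, Dict, Optional, Tuple

FILE_EXECUTION_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL described above


class InMemoryStore:
    """Stand-in for Redis SET/GET with per-key expiry (illustration only)."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at <= time.monotonic():
            del self._items[key]  # expired entries behave as missing
            return None
        return value


def status_key(execution_id: str, file_execution_id: str) -> str:
    """Build the key following the pattern documented above."""
    return f"file_execution_status:{execution_id}:{file_execution_id}"


class FileStatusTrackerSketch:
    """Hypothetical tracker; the payload layout is an assumption."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    def set_data(self, data: Dict[str, Any]) -> None:
        key = status_key(data["execution_id"], data["file_execution_id"])
        self._store.set(key, json.dumps(data), FILE_EXECUTION_TTL_SECONDS)

    def get_data(
        self, execution_id: str, file_execution_id: str
    ) -> Optional[Dict[str, Any]]:
        raw = self._store.get(status_key(execution_id, file_execution_id))
        return json.loads(raw) if raw is not None else None

    def update_stage_status(
        self, execution_id: str, file_execution_id: str, stage_status: Dict[str, str]
    ) -> None:
        data = self.get_data(execution_id, file_execution_id)
        if data is None:
            raise KeyError("no tracked state for this file execution")
        data["stage_status"] = stage_status
        self.set_data(data)  # in this sketch, rewriting also resets the TTL
```

A worker that restarts could call `get_data()` with its execution and file execution IDs and inspect `stage_status` to decide where to resume.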

Update Locations:

UnstractRunner - Updates to TOOL_EXECUTION stage and stores container name at runner/src/unstract/runner/runner.py450-463
ToolSandboxHelper - Updates stage status throughout execution lifecycle at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py524-570

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py279-360 runner/src/unstract/runner/runner.py449-463

ToolExecutionTracker

The ToolExecutionTracker captures tool container exit status with a 1-hour TTL. Tool containers write their final status to Redis before exiting; ToolSandboxHelper then reads that status to determine success or failure.

Key Data Structure:

Redis Key Pattern: tool_execution_status:{execution_id}:{file_execution_id}

Write Flow:

Tool container completes execution
Tool writes ToolExecutionStatusData to Redis using ToolExecutionTracker.set_status()
Container exits

Read Flow:

ToolSandboxHelper.poll_tool_status() detects container in final state
ToolSandboxHelper._handle_tool_execution_status() reads status from Redis at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py708-765
If status is FAILED, error is propagated as RunnerContainerRunResponse.error

TTL Management:

Initial TTL: Configurable, default 1 hour
After read: TTL reduced to TOOL_EXECUTION_TRACKER_COMPLETED_TTL_IN_SECOND at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py762-765
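
The write flow, read flow, and TTL reduction above can be sketched as follows. This is an illustration under stated assumptions: the store is an in-memory dictionary rather than Redis, `ToolTrackerSketch`, `ToolStatus`, and `handle_tool_status` are hypothetical names, and the value of `COMPLETED_TTL_SECONDS` is a placeholder for whatever TOOL_EXECUTION_TRACKER_COMPLETED_TTL_IN_SECOND is set to.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

INITIAL_TTL_SECONDS = 60 * 60  # default of 1 hour, as stated above
COMPLETED_TTL_SECONDS = 300  # placeholder; the real value is not given on this page


@dataclass
class ToolStatus:
    status: str  # e.g. "SUCCESS" or "FAILED"
    error: Optional[str] = None


class ToolTrackerSketch:
    """In-memory stand-in for the Redis-backed tracker (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[ToolStatus, float]] = {}

    @staticmethod
    def key(execution_id: str, file_execution_id: str) -> str:
        return f"tool_execution_status:{execution_id}:{file_execution_id}"

    def set_status(
        self,
        execution_id: str,
        file_execution_id: str,
        status: ToolStatus,
        ttl_seconds: float = INITIAL_TTL_SECONDS,
    ) -> None:
        key = self.key(execution_id, file_execution_id)
        self._items[key] = (status, self._clock() + ttl_seconds)

    def get_status(
        self, execution_id: str, file_execution_id: str
    ) -> Optional[ToolStatus]:
        item = self._items.get(self.key(execution_id, file_execution_id))
        if item is None or item[1] <= self._clock():
            return None
        return item[0]

    def shorten_ttl(
        self,
        execution_id: str,
        file_execution_id: str,
        ttl_seconds: float = COMPLETED_TTL_SECONDS,
    ) -> None:
        key = self.key(execution_id, file_execution_id)
        item = self._items.get(key)
        if item is not None:
            self._items[key] = (item[0], self._clock() + ttl_seconds)


def handle_tool_status(
    tracker: ToolTrackerSketch, execution_id: str, file_execution_id: str
) -> Optional[str]:
    """Return the tool's error message if it reported FAILED, else None."""
    status = tracker.get_status(execution_id, file_execution_id)
    if status is None:
        # How the real helper treats a missing status is not covered here.
        return None
    tracker.shorten_ttl(execution_id, file_execution_id)  # entry was consumed
    return status.error if status.status == "FAILED" else None
```

Shortening the TTL after the read keeps the entry around briefly for late readers without holding it for the full hour.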

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py708-765 runner/src/unstract/runner/runner.py331-378

Database Models

WorkflowExecution Model

Represents a single workflow execution instance. Created at execution start and updated on completion/failure.

Key Fields:

workflow - Foreign key to Workflow
execution_id - UUID, indexed
status - Enum: PENDING, EXECUTING, SUCCESS, FAILURE, ERROR
execution_mode - Enum: ETL, API, INSTANT, SCHEDULED
result_acknowledged - Boolean flag for API deployments
error - Captured error message

Status Lifecycle:

Update Signal: WorkflowExecution.update_execution() triggers post-save signal that releases API rate limit slots at backend/workflow_manager/workflow_v2/models/execution.py

Sources: backend/workflow_manager/workflow_v2/dto.py30-117 backend/workflow_manager/endpoint_v2/source.py579-586

WorkflowFileExecution Model

Represents execution for a single file within a workflow execution. One WorkflowExecution has many WorkflowFileExecution records.

Key Fields:

workflow_execution - Foreign key to WorkflowExecution
file_execution_id - UUID, indexed
file_path - Source file path
file_hash - Content hash for deduplication
provider_file_uuid - Provider-specific file identifier (e.g., S3 ETag, Google Drive file ID)
status - Enum: PENDING, EXECUTING, SUCCESS, FAILURE, ERROR
error - Captured error message

Deduplication Keys: Both file_hash and provider_file_uuid are indexed for fast lookups during active execution checks.

Sources: backend/workflow_manager/endpoint_v2/source.py613-649 backend/workflow_manager/endpoint_v2/dto.py122-155

FileHistory Model

Tracks file processing completion for deduplication across executions. Prevents reprocessing of previously completed files unless file content changes.

Key Fields:

workflow - Foreign key to Workflow
file_path - Source file path
file_hash - Content hash
provider_file_uuid - Provider-specific identifier
status - Completion status
completed_at - Timestamp of completion

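Usage in SourceConnector: SourceConnector._is_new_file() checks FileHistory before a file is processed (see FileHistory-Based Deduplication).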

Sources: backend/workflow_manager/endpoint_v2/source.py805-848

Execution Stages and State Transitions

Stage Definitions

Stage Progression (in the order followed by Update Timeline): INITIALIZATION → SOURCE → TOOL_EXECUTION → FINALIZATION

Status Definitions

Stage + Status Tracking: Each stage has an associated status. ToolSandboxHelper.call_tool_handler() checks current stage and resumes appropriately at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py260-360:
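
A minimal sketch of stage-based resumption, assuming the four stages named in Update Timeline and a simple rule: a stage recorded as SUCCESS is skipped, and anything else is re-run from that stage. The function name `stages_to_run` and the rule itself are assumptions; the real call_tool_handler() logic may differ.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class Stage(IntEnum):
    # Stages as they appear in Update Timeline; the real enum may have more.
    INITIALIZATION = 1
    SOURCE = 2
    TOOL_EXECUTION = 3
    FINALIZATION = 4


def stages_to_run(tracked: Optional[Tuple[Stage, str]]) -> List[Stage]:
    """Return the stages still to run, given the last (stage, status) in Redis."""
    if tracked is None:
        return list(Stage)  # nothing tracked yet: start from the beginning
    stage, status = tracked
    if status == "SUCCESS":
        return [s for s in Stage if s > stage]  # resume after the finished stage
    return [s for s in Stage if s >= stage]  # redo the interrupted stage
```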

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py260-360

Container Execution Lifecycle

Polling Mechanism

ToolSandboxHelper.poll_tool_status() polls the runner service to check container status with configurable intervals:

Configuration:

MAX_RUNNER_POLLING_WAIT_SECONDS - Maximum polling duration (default: 10,800s / 3 hours)
RUNNER_POLLING_INTERVAL_SECONDS - Poll interval (default: 2s)

Polling Loop:

Final Statuses: Polling terminates when container reaches one of these states defined at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py41-46:

EXITED - Normal container exit
DEAD - Container terminated abnormally
ERROR - Docker error
NOT_FOUND - Container not found (with grace period)
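
The polling behaviour described above can be sketched as a bounded loop. The function name `poll_until_final`, the injected `get_status` callable, and the non-final status name "RUNNING" used in the usage note are assumptions for illustration; the NOT_FOUND grace period is deliberately left out of this sketch.

```python
import time
from typing import Callable

FINAL_STATUSES = {"EXITED", "DEAD", "ERROR", "NOT_FOUND"}


def poll_until_final(
    get_status: Callable[[], str],
    max_wait_seconds: float = 10_800,  # documented default: 3 hours
    interval_seconds: float = 2,  # documented default: 2 seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a final status is seen or the maximum wait elapses."""
    deadline = clock() + max_wait_seconds
    while True:
        status = get_status()
        if status in FINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("container did not reach a final status in time")
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable: a caller can pass a status source that yields "RUNNING" twice and then "EXITED" and observe two sleeps of the configured interval.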

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py123-258

Grace Period Handling

To prevent race conditions during container startup, the system implements a grace period for NOT_FOUND status.

Problem: When ToolSandboxHelper starts polling immediately after dispatching container creation to the runner, the container might not exist yet in Docker's registry. Without a grace period, this would be treated as immediate failure.

Solution: Grace period of 40 seconds (configurable via POLL_NOT_FOUND_GRACE_PERIOD) at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py39

Grace Period Logic:
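
As a rough sketch of the grace period, assuming the elapsed time is measured from when the container was dispatched: NOT_FOUND counts as final only once the grace period has passed, while the other final statuses count immediately. The function name `is_final` and the inclusive boundary are assumptions.

```python
POLL_NOT_FOUND_GRACE_PERIOD = 40  # seconds, the documented default


def is_final(
    status: str,
    seconds_since_dispatch: float,
    grace_period_seconds: float = POLL_NOT_FOUND_GRACE_PERIOD,
) -> bool:
    """Decide whether a polled container status should end the polling loop."""
    if status == "NOT_FOUND":
        # The container may simply not be registered yet; wait out the grace period.
        return seconds_since_dispatch >= grace_period_seconds
    return status in {"EXITED", "DEAD", "ERROR"}
```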

Duplicate Detection: During the grace period, the system checks whether another worker has already completed the file by querying FileExecutionStatusTracker at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py167-193.

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py123-258

Status Detection Flow

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py123-258 unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py708-765

Error Propagation and Handling

Error Capture Points

Errors are captured at multiple levels and propagated upward:

Level 1: Tool Container

Tool writes ToolExecutionStatusData with status=FAILED and error message to Redis
Location: Tool container's SDK writes to ToolExecutionTracker before exit

Level 2: ToolSandboxHelper

Reads tool status from Redis at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py708-765
Propagates error as RunnerContainerRunResponse.error
Updates FileExecutionStatusTracker stage status to FAILED

Level 3: Worker

Receives error response from ToolSandboxHelper.call_tool_handler()
Creates FileExecutionResult with error at backend/workflow_manager/endpoint_v2/dto.py122-155
Updates WorkflowFileExecution.error and status=FAILURE

Level 4: WorkflowExecution

Aggregates file-level results in callback
Sets WorkflowExecution.status to FAILURE if any files failed
Sets WorkflowExecution.error to summary message
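
The Level 4 aggregation can be sketched as below. `FileResult` and `aggregate` are hypothetical names, and the wording of the summary message is an assumption; only the rule that any failed file marks the execution as FAILURE comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileResult:
    file_path: str
    error: Optional[str] = None


def aggregate(results: List[FileResult]) -> Tuple[str, Optional[str]]:
    """Return (workflow status, summary error) from per-file results."""
    failed = [r for r in results if r.error]
    if not failed:
        return "SUCCESS", None
    # Summary format is illustrative, not the project's actual message.
    return "FAILURE", f"{len(failed)} of {len(results)} files failed"
```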

Error Propagation Flow: Tool container → ToolSandboxHelper → Worker → WorkflowExecution callback

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py708-765 backend/workflow_manager/endpoint_v2/dto.py122-155

Retry Mechanism

ToolsUtils.run_tool_with_retry() at unstract/workflow-execution/src/unstract/workflow_execution/tools_utils.py185-213 implements the retry logic, with the retry limit set by ToolExecution.MAXIMUM_RETRY:

Retry Strategy:

Maximum retries: 1 (total 2 attempts) by default
No backoff delay between retries
Retry on any exception except ToolNotFoundInRegistryError

Non-Retriable Errors:

ToolNotFoundInRegistryError - Tool image not in registry, retrying won't help at unstract/workflow-execution/src/unstract/workflow_execution/tools_utils.py199-202
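
A minimal sketch of this retry strategy, assuming a default of one retry and no delay between attempts. `run_with_retry` is a hypothetical helper, and `ToolNotFoundInRegistryError` is redefined locally as a stand-in for the project's exception.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolNotFoundInRegistryError(Exception):
    """Local stand-in for the non-retriable error named above."""


def run_with_retry(run: Callable[[int], T], max_retries: int = 1) -> T:
    """Call run(retry_count) until it succeeds or retries are exhausted."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for retry_count in range(max_retries + 1):
        try:
            return run(retry_count)
        except ToolNotFoundInRegistryError:
            raise  # retrying cannot help if the image is missing
        except Exception:
            if retry_count == max_retries:
                raise  # out of attempts: surface the last error
    raise RuntimeError("unreachable")
```

Passing the retry count into the callable mirrors the text above, where the counter is handed to the tool handler on each attempt.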

Retry Counter: Passed to ToolSandboxHelper.call_tool_handler() and used in container naming at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py637-650.

Sources: unstract/workflow-execution/src/unstract/workflow_execution/tools_utils.py185-213 unstract/workflow-execution/src/unstract/workflow_execution/constants.py6-11

Deduplication and Active Execution Checking

FileHistory-Based Deduplication

SourceConnector._is_new_file() checks FileHistory to prevent reprocessing completed files at backend/workflow_manager/endpoint_v2/source.py805-848:

Deduplication Logic:

Identifier Priority:

provider_file_uuid - Preferred (stable across content changes)
file_hash - Fallback (changes with content)
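
The identifier priority can be illustrated with a small lookup. This sketch assumes history is available as an in-memory list and matches on a single identifier; `HistoryEntry` and `is_new_file` are hypothetical, and the real _is_new_file() may also consider other fields such as the workflow or completion status.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    file_hash: Optional[str] = None
    provider_file_uuid: Optional[str] = None


def is_new_file(
    history: List[HistoryEntry],
    file_hash: Optional[str],
    provider_file_uuid: Optional[str],
) -> bool:
    """Return True when no history entry matches the preferred identifier."""
    if provider_file_uuid:
        # Preferred identifier: match on the provider's own ID.
        return not any(h.provider_file_uuid == provider_file_uuid for h in history)
    if file_hash:
        # Fallback identifier: match on the content hash.
        return not any(h.file_hash == file_hash for h in history)
    return True  # no identifier available: treated as new in this sketch
```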

Sources: backend/workflow_manager/endpoint_v2/source.py805-848

Active Execution Checking

To prevent multiple workers from processing the same file concurrently, SourceConnector._is_file_being_processed() queries active executions at backend/workflow_manager/endpoint_v2/source.py558-611:

Check Algorithm:

Skip-Processing Statuses: Defined in ExecutionStatus.get_skip_processing_statuses():

EXECUTING - Currently being processed
SUCCESS - Already completed successfully
(Excludes FAILURE, ERROR - allow retry)
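
A sketch of how the skip-processing statuses might be applied. The enum mirrors the status values listed for the database models; `should_skip` is a hypothetical helper, and the real check runs as a database query rather than over an in-memory list.

```python
from enum import Enum
from typing import Iterable, List


class ExecutionStatus(Enum):
    PENDING = "PENDING"
    EXECUTING = "EXECUTING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    ERROR = "ERROR"

    @classmethod
    def get_skip_processing_statuses(cls) -> List["ExecutionStatus"]:
        # FAILURE and ERROR are left out so that failed files can be retried.
        return [cls.EXECUTING, cls.SUCCESS]


def should_skip(existing_statuses: Iterable[ExecutionStatus]) -> bool:
    """Skip a file if any existing file execution is running or succeeded."""
    skip = set(ExecutionStatus.get_skip_processing_statuses())
    return any(status in skip for status in existing_statuses)
```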

The query is implemented at backend/workflow_manager/endpoint_v2/source.py613-649.

Sources: backend/workflow_manager/endpoint_v2/source.py558-661

Combined Deduplication Flow

Sources: backend/workflow_manager/endpoint_v2/source.py662-690

Status Update Flow

Who Updates What

Update Timeline

Initialization Phase:

1. Worker creates WorkflowExecution (status: PENDING)
2. Worker creates WorkflowFileExecution for each file (status: PENDING)
3. ToolSandboxHelper.call_tool_handler() creates FileExecutionData in Redis (stage: INITIALIZATION)

Source Phase:

4. Worker copies file to execution volume
5. Worker updates stage to SOURCE in Redis
6. Worker creates metadata file

Tool Execution Phase:

7. UnstractRunner.run_container() updates stage to TOOL_EXECUTION at runner/src/unstract/runner/runner.py450-463
8. UnstractRunner stores tool_container_name in Redis for cleanup
9. Tool container runs
10. Tool container writes ToolExecutionStatusData to Redis on exit
11. ToolSandboxHelper.poll_tool_status() detects container exit
12. ToolSandboxHelper reads tool status from ToolExecutionTracker
13. ToolSandboxHelper updates stage status to SUCCESS/FAILED

Finalization Phase:

14. Worker updates stage to FINALIZATION in Redis
15. Worker routes output to destination
16. Worker updates WorkflowFileExecution (status: SUCCESS/FAILURE)
17. Worker creates/updates FileHistory with completion status

Completion Phase:

18. Callback aggregates all file results
19. Callback updates WorkflowExecution (status: SUCCESS/FAILURE)
20. Post-save signal releases API rate limit slots (if API deployment)

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py260-360 runner/src/unstract/runner/runner.py404-574 backend/workflow_manager/endpoint_v2/source.py805-848

Cleanup and TTL Management

Redis TTLs:

FileExecutionStatusTracker: 24 hours (allows late queries for debugging)
ToolExecutionTracker: 1 hour initially, reduced after read at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py762-765

Container Cleanup:

Triggered by ToolSandboxHelper.cleanup_tool_container() after polling completes
Also called in finally block if exceptions occur at unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py422-431
Removes both main container and sidecar (if enabled)

File Execution Directory Cleanup:

Optional cleanup via ExecutionFileHandler.delete_file_execution_directory() at unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py267-294
Removes temporary files created during execution

Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py422-431 unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py762-765 unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py267-294
