The Workflow Execution System orchestrates the end-to-end processing of documents through configured workflows. It handles file discovery, parallel batch processing, tool container execution, status tracking, and result delivery. This system supports three execution modes: manual workflow runs from the UI, scheduled ETL pipelines, and API deployments (REST endpoints).
This document covers the complete execution lifecycle from workflow initiation to result delivery. For workflow definition and tool configuration, see System Architecture. For Prompt Studio tool execution details, see Structure Tool and Extraction Pipeline. For API deployment configuration, see API Deployment System.
The workflow execution system supports three entry points, each with different orchestration paths:
| Entry Point | Trigger | Queue | Use Case |
|---|---|---|---|
| Manual Execution | User clicks "Run" in UI | celery (default) | Testing, development, one-off processing |
| ETL Pipeline | Scheduled via Celery Beat | celery (default) | Recurring document processing |
| API Deployment | HTTP POST to /deployment/{org}/{api}/execute | api_deployments | Production REST API endpoints |
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py690-780 backend/api_v2/api_deployment_views.py50-158
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py292-380 backend/workflow_manager/workflow_v2/workflow_helper.py466-593 backend/api_v2/api_deployment_views.py63-158 backend/api_v2/deployment_helper.py147-294
The WorkflowHelper class (backend/workflow_manager/workflow_v2/workflow_helper.py72-1050) serves as the primary orchestrator for workflow execution. It coordinates source processing, file batching, task distribution, and execution monitoring.
Key Methods:
| Method | Purpose | Line Reference |
|---|---|---|
| `run_workflow()` | Main execution entry point | 292-380 |
| `get_file_batches()` | Split files into parallel batches | 89-134 |
| `process_input_files()` | Dispatch batches to workers | 136-247 |
| `execute_workflow_async()` | Queue workflow for async execution | 466-593 |
| `complete_execution()` | Handle pipeline/scheduled execution | 690-754 |
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py72-1050
Sources: backend/workflow_manager/workflow_v2/execution.py125-169 backend/workflow_manager/workflow_v2/models/execution.py104-247
Before execution begins, all tool instances in the workflow are validated to ensure their metadata and adapter configurations are correct:
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py275-289
The SourceConnector class (backend/workflow_manager/endpoint_v2/source.py56-1000) handles file discovery, deduplication, and copying files to the execution volume. It supports multiple source types: filesystem connectors, API storage, and ETL sources.
Key Operations:
- Checks `FileHistory` records to avoid reprocessing previously handled files

Sources: backend/workflow_manager/endpoint_v2/source.py56-1000
Sources: backend/workflow_manager/endpoint_v2/source.py163-453 backend/workflow_manager/workflow_v2/file_history_helper.py1-200
The FileHash dataclass (backend/workflow_manager/endpoint_v2/dto.py11-54) captures all metadata about a source file:
| Field | Type | Purpose |
|---|---|---|
| `file_path` | str | Full path to file |
| `file_name` | str | Basename of file |
| `file_hash` | str | SHA-256 hash for deduplication |
| `file_size` | int | Size in bytes |
| `mime_type` | str | MIME type |
| `source_connection_type` | str | API, FILESYSTEM, etc. |
| `file_destination` | tuple | Target destination for HITL routing |
| `is_executed` | bool | Execution status flag |
| `file_number` | int | Position in batch |
Sources: backend/workflow_manager/endpoint_v2/dto.py11-54
The get_file_batches() method splits files into parallel batches for concurrent processing. Batch size is configurable per organization via the `MAX_PARALLEL_FILE_BATCHES` setting.
Batching Algorithm:
This ensures even distribution across batches while maintaining file order within each batch.
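A minimal sketch of an even split consistent with this description (a hypothetical re-implementation, not the actual `get_file_batches()` code):

```python
def get_file_batches(files: list, max_parallel_file_batches: int) -> list[list]:
    """Split files into contiguous, near-equal batches.

    Batch sizes differ by at most one, and the original file order is
    preserved within (and across) batches.
    """
    n_batches = max(1, min(len(files), max_parallel_file_batches))
    base, extra = divmod(len(files), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches each take one additional file.
        size = base + (1 if i < extra else 0)
        batches.append(files[start:start + size])
        start += size
    return batches
```

For example, 10 files with a limit of 3 parallel batches yields batch sizes of 4, 3, and 3.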
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py89-134 backend/backend/settings/base.py198-205
Workflow execution uses Celery's chord pattern for parallel batch processing with a final aggregation callback:
Implementation:
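The fan-out/fan-in shape that Celery's chord provides (dispatch all header tasks, then invoke the callback exactly once with the collected results) can be simulated without a broker. The task bodies below are stand-ins, not the project's actual tasks:

```python
from concurrent.futures import ThreadPoolExecutor


def process_file_batch(batch: list) -> dict:
    # Stand-in for the Celery batch task: report how many files were handled.
    return {"processed": len(batch)}


def process_batch_callback(results: list) -> dict:
    # Stand-in for the aggregation callback that runs once ALL batches finish.
    return {"total_processed": sum(r["processed"] for r in results)}


def run_as_chord(batches: list) -> dict:
    # Chord semantics: run every header task in parallel, then pass the
    # ordered list of results to a single callback invocation.
    with ThreadPoolExecutor(max_workers=len(batches) or 1) as pool:
        results = list(pool.map(process_file_batch, batches))
    return process_batch_callback(results)
```

With Celery itself this shape is expressed as `chord(header_tasks)(callback_task)`.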
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py178-244 backend/celery_tasks/workflow_tasks.py1-500
Each batch task receives a FileBatchData object containing:
| Field | Type | Description |
|---|---|---|
| `files` | list[tuple] | List of (file_name, FileHash) tuples |
| `file_data` | FileData | Shared execution context |
FileData fields:
- `workflow_id`, `execution_id`, `pipeline_id`
- `source_config`, `destination_config`
- `organization_id`, `execution_mode`
- `use_file_history`, `q_file_no_list`
- `llm_profile_id`, `custom_data`

Sources: backend/workflow_manager/workflow_v2/dto.py142-186
The system uses 8 specialized Celery worker types, each consuming from dedicated RabbitMQ queues:
| Worker | Queue | Purpose | Pool Type |
|---|---|---|---|
| `worker-file-processing-v2` | `file_processing` | Process file batches (manual/ETL) | prefork |
| `worker-api-deployment-v2` | `api_deployments` | Process API deployment batches | prefork |
| `worker-callback-v2` | `file_processing_callback` | Aggregate batch results | solo |
| `worker-api-callback-v2` | `api_file_processing_callback` | Aggregate API batch results | solo |
| `worker-general-v2` | `celery` | General tasks, webhooks | prefork |
| `worker-notification-v2` | `notification` | Send notifications | prefork |
| `worker-log-consumer-v2` | `log_consumer` | Consume and publish logs | prefork |
| `worker-scheduler-v2` | `celery_periodic` | Scheduled tasks | solo |
Queue Routing Logic:
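A plausible sketch of the routing decision, derived from the queue table above (the function name and exact branching are assumptions):

```python
def route_batch_task(source_connection_type: str) -> tuple:
    """Pick the batch queue and its matching callback queue.

    API deployments get dedicated queues so bursty API traffic cannot
    starve manual/ETL file processing; queue names match the worker table.
    """
    if source_connection_type == "API":
        return ("api_deployments", "api_file_processing_callback")
    return ("file_processing", "file_processing_callback")
```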
Sources: backend/workflow_manager/workflow_v2/workflow_helper.py209-232 docker-compose.yaml200-400
The process_file_batch task (backend/celery_tasks/workflow_tasks.py100-300) processes each file in a batch sequentially:
Sources: backend/celery_tasks/workflow_tasks.py100-300
The process_batch_callback task aggregates results after all batches complete:
Sources: backend/celery_tasks/workflow_tasks.py300-400
The ToolSandboxHelper class (unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py53-700) orchestrates tool execution in isolated containers. It manages the container lifecycle, status polling, and result retrieval.
Key Responsibilities:
Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py53-700
Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py123-258 unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py260-445
The polling mechanism includes a grace period to handle race conditions during container startup:
This prevents duplicate processing when multiple workers attempt to process the same file simultaneously.
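One way to express grace-period polling (a hypothetical sketch; the timeout, grace window, and interval values are illustrative):

```python
import time


def poll_tool_status(get_status, timeout_s=300.0, grace_s=5.0, interval_s=0.5):
    """Poll for a tool container's status.

    During the startup grace period a missing status is tolerated, since
    the container may not have registered its status key yet. After the
    grace period, a missing status is treated as an error rather than
    being retried indefinitely.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        status = get_status()
        if status is not None:
            return status
        if time.monotonic() - start > grace_s:
            raise RuntimeError("tool status missing after startup grace period")
        time.sleep(interval_s)
    raise TimeoutError("tool did not report status within timeout")
```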
Sources: unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py139-206
The UnstractRunner class (runner/src/unstract/runner/runner.py38-600) is a Flask service that manages Docker/Kubernetes containers on behalf of workers. It provides REST endpoints for container lifecycle management:
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/v1/run` | POST | Launch tool container |
| `/api/v1/tool/status` | GET | Get container status |
| `/api/v1/tool` | DELETE | Remove container |
Container Configuration:
The runner configures two containers per tool execution: the tool container itself and a sidecar container that streams its logs.
Sources: runner/src/unstract/runner/runner.py38-600 runner/src/unstract/runner/api.py1-300
Sources: runner/src/unstract/runner/runner.py217-254 tool-sidecar/entrypoint.sh1-17 unstract/core/src/unstract/core/tool_execution_status.py1-100
The FileExecutionStatusTracker (unstract/core/src/unstract/core/file_execution_tracker.py1-200) tracks file processing stages using Redis as a distributed state store. Each file execution progresses through defined stages:
Execution Stages:
FileExecutionData Structure:
Redis Storage:
- Key pattern: `file_execution:status:{execution_id}:{file_execution_id}`
- Stored in a dedicated Redis database (`FILE_ACTIVE_CACHE_REDIS_DB`)

Sources: unstract/core/src/unstract/core/file_execution_tracker.py1-200 backend/backend/settings/base.py179-188
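The key layout described above can be illustrated with a dict-backed stand-in (the real tracker persists to Redis, and the stage names passed in would come from its stage enum):

```python
def file_status_key(execution_id: str, file_execution_id: str) -> str:
    # Key pattern documented above.
    return f"file_execution:status:{execution_id}:{file_execution_id}"


class InMemoryFileExecutionTracker:
    """In-memory stand-in for the Redis-backed FileExecutionStatusTracker."""

    def __init__(self):
        self._store = {}

    def set_stage(self, execution_id: str, file_execution_id: str, stage: str) -> None:
        self._store[file_status_key(execution_id, file_execution_id)] = stage

    def get_stage(self, execution_id: str, file_execution_id: str):
        return self._store.get(file_status_key(execution_id, file_execution_id))
```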
The ToolExecutionTracker (unstract/core/src/unstract/core/tool_execution_status.py1-100) stores tool container exit status in Redis. Tools write their exit status before terminating:
ToolExecutionData Structure:
Redis Storage:
- Key pattern: `tool_execution:status:{container_name}`

Sources: unstract/core/src/unstract/core/tool_execution_status.py1-100
Sources: backend/celery_tasks/workflow_tasks.py1-500 unstract/tool-sandbox/src/unstract/tool_sandbox/helper.py278-350
The DestinationConnector class (backend/workflow_manager/endpoint_v2/destination.py47-1200) routes workflow results to configured destinations. It supports four destination types:
| Destination Type | Configuration | Use Case |
|---|---|---|
| `FILESYSTEM` | `connector_instance` + `output_folder` | Write files to cloud storage |
| `DATABASE` | `connector_instance` + `table` + `columns` | Insert records into DB |
| `API` | No connector needed | Return JSON via REST API |
| `MANUALREVIEW` | No connector needed | Send to HITL queue |
Sources: backend/workflow_manager/endpoint_v2/destination.py47-1200
The _should_handle_hitl() method (backend/workflow_manager/endpoint_v2/destination.py172-260) determines whether a file should be sent to a manual review queue based on configured rules:
Decision Logic:
Rule Evaluation:
Rules are defined in the workflow configuration and evaluated against the tool execution result:
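A sketch of any-rule-matches evaluation; the rule shape (`field`/`op`/`value`) and operator set are assumptions, not the project's actual schema:

```python
import operator

# Hypothetical rule shape: {"field": ..., "op": ..., "value": ...}
_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "gt": operator.gt,
    "contains": lambda a, b: b in a,
}


def should_handle_hitl(result: dict, rules: list) -> bool:
    """Route the file to manual review if ANY rule matches the tool result."""
    for rule in rules:
        value = result.get(rule["field"])
        if value is not None and _OPS[rule["op"]](value, rule["value"]):
            return True
    return False
```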
Sources: backend/workflow_manager/endpoint_v2/destination.py172-260 backend/plugins/workflow_manager/workflow_v2/utils.py1-300
For DATABASE destinations, the connector inserts structured data into configured tables:
Column Modes:
Additional Metadata Columns:
- `file_path`: Source file path
- `execution_id`: Workflow execution ID
- `agent_name`: Tool that produced the output
- `timestamp`: Insertion timestamp
- `error`: Error message (if execution failed)

Implementation:
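A sketch of how a destination row could be assembled with those metadata columns (the column names follow the list above; the helper itself is hypothetical):

```python
from datetime import datetime, timezone


def build_destination_row(data: dict, file_path: str, execution_id: str,
                          agent_name: str, error=None) -> dict:
    """Combine the tool output with the standard metadata columns."""
    return {
        "data": data,                  # structured tool output
        "file_path": file_path,        # source file path
        "execution_id": execution_id,  # workflow execution ID
        "agent_name": agent_name,      # tool that produced the output
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error": error,                # populated only when execution failed
    }
```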
Sources: backend/workflow_manager/endpoint_v2/destination.py390-505 backend/workflow_manager/endpoint_v2/database_utils.py1-500
For API destinations, results are cached in Redis for retrieval by the client:
Result Cache Structure:
Update Flow:
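An in-memory sketch of an append-style result cache for API destinations; the key pattern and method names here are assumptions, not the actual `result_cache_utils` API:

```python
class InMemoryResultCache:
    """Dict-backed stand-in for the Redis result cache: each execution
    accumulates one result entry per processed file, which the API client
    later retrieves as a single response."""

    def __init__(self):
        self._results = {}

    def _key(self, workflow_id: str, execution_id: str) -> str:
        # Hypothetical key pattern.
        return f"api_results:{workflow_id}:{execution_id}"

    def update(self, workflow_id: str, execution_id: str, file_result: dict) -> None:
        self._results.setdefault(self._key(workflow_id, execution_id), []).append(file_result)

    def get_all(self, workflow_id: str, execution_id: str) -> list:
        return self._results.get(self._key(workflow_id, execution_id), [])
```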
Sources: backend/workflow_manager/endpoint_v2/result_cache_utils.py1-200 backend/workflow_manager/endpoint_v2/destination.py516-557
API deployments (and only API deployments) use a dual-layer rate limiting system to prevent resource exhaustion:
Important: Rate limiting does NOT apply to manual workflow runs from the UI or to scheduled ETL pipelines — only API deployment executions are limited.
Sources: backend/api_v2/rate_limiter.py1-400 backend/backend/settings/base.py111-131 docs/API_DEPLOYMENT_RATE_LIMITING.md1-500
Sources: backend/api_v2/rate_limiter.py30-200 backend/api_v2/api_deployment_views.py94-132
Sorted Sets (ZSETs) for Tracking Active Executions:
Distributed Locks:
Django Cache for Org Limits:
Sources: backend/api_v2/rate_limiter.py50-300 backend/api_v2/rate_limit_constants.py8-63
Acquisition (Pre-Execution):
Release (Post-Execution):
Automatic Cleanup:
A periodic task removes stale entries based on timestamp:
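The acquire/release/cleanup cycle can be modeled in memory: each entry corresponds to a ZSET member (the execution ID) whose score is the start timestamp, and stale eviction mirrors what trimming old scores (e.g. via `ZREMRANGEBYSCORE`) achieves. The limit and staleness window below are illustrative:

```python
import time


class ActiveExecutionLimiter:
    """In-memory analogue of the Redis ZSET tracker for active executions."""

    def __init__(self, limit: int, stale_after_s: float = 3600.0):
        self.limit = limit
        self.stale_after_s = stale_after_s
        self._active = {}  # execution_id -> start timestamp (the ZSET score)

    def try_acquire(self, execution_id: str, now=None) -> bool:
        now = time.time() if now is None else now
        self._evict_stale(now)
        if len(self._active) >= self.limit:
            return False  # over the concurrency limit: reject the request
        self._active[execution_id] = now
        return True

    def release(self, execution_id: str) -> None:
        # Post-execution release; idempotent if the entry was already evicted.
        self._active.pop(execution_id, None)

    def _evict_stale(self, now: float) -> None:
        # What the periodic cleanup task does: drop entries whose start
        # timestamp is older than the staleness window.
        cutoff = now - self.stale_after_s
        for key in [k for k, ts in self._active.items() if ts < cutoff]:
            del self._active[key]
```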
Sources: backend/api_v2/rate_limiter.py30-400 backend/api_v2/deployment_helper.py280-293
Tool Execution Retries:
The system retries tool execution on transient failures:
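A generic shape for such retries (the attempt count, backoff, and exception set are illustrative, not the project's actual values):

```python
import time


def run_with_retries(func, max_attempts: int = 3, backoff_s: float = 1.0,
                     retriable: tuple = (ConnectionError, TimeoutError)):
    """Retry func on transient errors with linear backoff; re-raise once
    the attempt budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retriable:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
```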
Sources: unstract/workflow-execution/src/unstract/workflow_execution/tools_utils.py185-200
When a tool image is not found in the container registry, the system uses a deterministic error code to identify this condition:
Error Code Pattern:
Detection and HTTP Status:
This returns 500 (server-side deployment issue) rather than 422 (client error) because the missing tool image is a platform configuration problem, not a user-actionable error.
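A sketch of the status mapping (the marker string here is a placeholder; the real system uses its own deterministic error code emitted by the runner):

```python
# Hypothetical marker for the "tool image not found in registry" condition.
TOOL_IMAGE_NOT_FOUND = "TOOL_IMAGE_NOT_FOUND"


def http_status_for_execution_error(error_code) -> int:
    """Map execution errors to HTTP status: a missing tool image is a
    platform-side deployment problem (500), not a client error (422)."""
    if error_code == TOOL_IMAGE_NOT_FOUND:
        return 500
    return 422
```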
Sources: backend/api_v2/exceptions.py95-165 backend/api_v2/api_deployment_views.py139-149 runner/src/unstract/runner/exception.py7-21
Exception Handler Example:
Sources: backend/api_v2/api_deployment_views.py111-133 backend/workflow_manager/workflow_v2/workflow_helper.py248-273 backend/api_v2/deployment_helper.py280-293
Tool containers handle SIGTERM signals gracefully to ensure proper cleanup:
Tool Container Shell Script:
Sidecar Container:
This ensures that the tool's exit status is recorded before termination, that log streaming completes, and that containers are cleaned up without leaving orphaned processes.
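The actual tool containers use a shell entrypoint; as a language-neutral illustration, here is how a long-running process might record a terminal status on SIGTERM before exiting (a hypothetical sketch, not the project's entrypoint):

```python
import signal


def install_sigterm_handler(write_exit_status) -> None:
    """On SIGTERM, record a terminal status (e.g. to Redis) before exiting
    with the conventional 128 + SIGTERM (= 143) exit code."""
    def _handler(signum, frame):
        write_exit_status("TERMINATED")
        raise SystemExit(128 + signal.SIGTERM)

    signal.signal(signal.SIGTERM, _handler)
```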
Sources: runner/src/unstract/runner/runner.py293-329 tool-sidecar/entrypoint.sh1-17