Skip to content

ENH: error_handling options (raise and continue)#854

Merged
basnijholt merged 158 commits intomainfrom
continue-on-errors2
Dec 12, 2025
Merged

ENH: error_handling options (raise and continue)#854
basnijholt merged 158 commits intomainfrom
continue-on-errors2

Conversation

@basnijholt
Copy link
Collaborator

@basnijholt basnijholt commented Jul 7, 2025

Supersedes #842.

@codspeed-hq
Copy link

codspeed-hq bot commented Jul 7, 2025

CodSpeed Performance Report

Merging #854 will not alter performance

Comparing continue-on-errors2 (ffb7f3d) with main (2a859f7)

Summary

✅ 6 untouched

@codecov
Copy link

codecov bot commented Jul 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
pipefunc/_error_handling.py 100.00% <100.00%> (ø)
pipefunc/_pipefunc.py 100.00% <100.00%> (ø)
pipefunc/_pipefunc_utils.py 100.00% <100.00%> (ø)
pipefunc/_pipeline/_base.py 100.00% <100.00%> (ø)
pipefunc/exceptions.py 100.00% <100.00%> (ø)
pipefunc/map/_adaptive_scheduler_slurm_executor.py 100.00% <100.00%> (ø)
pipefunc/map/_prepare.py 100.00% <ø> (ø)
pipefunc/map/_progress.py 100.00% <100.00%> (ø)
pipefunc/map/_run.py 100.00% <100.00%> (ø)
pipefunc/map/_run_eager.py 100.00% <100.00%> (ø)
... and 3 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Add pipefunc/_error_handling.py with common error handling utilities:
  - check_for_error_inputs(): Check if inputs contain ErrorSnapshot objects
  - create_propagated_error(): Create PropagatedErrorSnapshot instances
  - handle_error_inputs(): Combined error checking and propagation
  - cloudpickle_function_state/cloudunpickle_function_state(): Common pickling helpers

- Refactor duplicate error detection code in pipefunc/map/_run.py to use common functions
- Update ErrorSnapshot and PropagatedErrorSnapshot pickling to use common helpers
- Pass error_handling parameter to prepare_run in async and eager execution modules
- Add missing error_handling parameter descriptions to docstrings

All tests pass with both parallel=False and parallel=True execution modes.
- Consolidate development notes into DEVELOPMENT_NOTES.md
- Update PLAN.md to reflect completed implementation and refactoring
- Remove redundant documentation files
- Document parallel execution fix and code refactoring work
- Created ErrorInfo dataclass to improve type safety and structure
- Added factory methods from_full_error() and from_partial_error()
- Updated PropagatedErrorSnapshot to use ErrorInfo instead of dict
- Modified pickling/unpickling methods to handle the new dataclass
- All error handling tests pass with both parallel=False and parallel=True
…ling

- Created test_error_handling_coverage.py with tests for:
  - ErrorInfo dataclass creation methods
  - PropagatedErrorSnapshot with default reason
  - handle_error_inputs with raise mode
  - cloudpickle functions with missing keys
  - ErrorSnapshot save/load functionality
  - PropagatedErrorSnapshot pickling with complex error_info

- Created test_error_handling_final_coverage.py with tests for:
  - PropagatedErrorSnapshot string repr with partial errors
  - get_root_causes with partial errors (empty list case)

Both pipefunc._error_handling.py and pipefunc.exceptions.py now have 100% test coverage.
- Test error handling with dict, file_array, and zarr_memory storage
- Verify that ErrorSnapshot objects are preserved in dict storage
- Document expected behavior for storage backends that may not preserve error objects
- Add comprehensive error handling documentation with examples
…rror object storage

This commit addresses two critical issues in the error handling system that were
causing failures in parallel execution with file-based storage backends.

## Issue 1: Missing `in_executor` parameter in error handling

The `in_executor` parameter was removed in commit 0d8e419b but was still being
used in the error handling logic. This parameter is crucial for distinguishing
between sequential (main process) and parallel (subprocess) execution contexts.

### Root Cause
- Error handling needs to know execution context to properly wrap exceptions
- In sequential mode: exceptions are raised directly (in_executor=False)
- In parallel mode: exceptions are caught and wrapped as ErrorSnapshot (in_executor=True)
- The parameter was accidentally removed during refactoring

### Fix
Restored the conditional logic in `_run_iteration()`:
```python
in_executor = executor is not None
if error_handling == "raise":
    yield from func(*args, **kwargs, _return_mapspec_names=True, _in_executor=in_executor)
```

## Issue 2: FileArray not preserving ErrorSnapshot objects in parallel execution

FileArray storage was returning `np.ma.masked` instead of `PropagatedErrorSnapshot`
objects when errors occurred in parallel execution, causing test failures.

### Root Cause Analysis

1. **Storage Logic**: FileArray uses XOR logic to determine when to dump:
   ```python
   if force_dump or (array.dump_in_subprocess ^ in_post_process):
   ```

2. **The Problem**:
   - FileArray has `dump_in_subprocess=True`
   - In subprocess: `True ^ False = True` → dumps correctly
   - In main process: `True ^ True = False` → doesn't dump
   - PropagatedErrorSnapshot created in main process wasn't being stored

3. **Execution Flow**:
   - Sequential: ErrorSnapshot created and stored → PropagatedErrorSnapshot created and stored
   - Parallel: ErrorSnapshot created in subprocess → stored correctly
              PropagatedErrorSnapshot created in main process → NOT stored due to XOR logic

### Fix
Modified `_update_array()` to always dump error objects regardless of process context:
```python
from pipefunc.exceptions import ErrorSnapshot, PropagatedErrorSnapshot

# Always dump error objects, or follow normal dump rules
is_error_object = isinstance(_output, (ErrorSnapshot, PropagatedErrorSnapshot))
if is_error_object or force_dump or (array.dump_in_subprocess ^ in_post_process):
    # ... dump logic ...
```

## Testing and Validation

Created extensive test suite to understand the issue:
- `test_file_array_error.py`: Verified FileArray can store/retrieve ErrorSnapshot
- `test_error_propagation.py`: Confirmed error propagation works with dict storage
- `test_error_propagation_trace.py`: Traced execution flow differences
- `test_debug_storage.py`: Compared parallel vs sequential execution

Key findings:
- Dict storage worked correctly (no serialization involved)
- FileArray failed only in parallel mode when storing PropagatedErrorSnapshot
- The issue was specific to the XOR dumping logic, not serialization

## Impact

This fix ensures:
1. Error handling works correctly in both sequential and parallel execution
2. All storage backends (dict, file_array, zarr) properly preserve error objects
3. Error propagation chain is maintained across pipeline stages
4. Tests in `test_error_handling_storage.py` now pass for all storage backends

## Technical Details

- ErrorSnapshot: Captures exception info with metadata (timestamp, user, machine, IP)
- PropagatedErrorSnapshot: Created when a function receives ErrorSnapshot as input
- FileArray: Uses cloudpickle for serialization, returns np.ma.masked for missing files
- The XOR logic `dump_in_subprocess ^ in_post_process` is preserved for normal data
  but bypassed for error objects to ensure they are always persisted
- Fix _default_output_picker to handle ErrorSnapshot objects properly
- When a function returns an error, it's now returned as-is for all output names in a tuple
- Add comprehensive tests covering tuple outputs with error handling including:
  - Basic tuple outputs with element-wise mapspecs
  - Tuple outputs with reduction mapspecs
  - Custom output pickers with error handling
  - Parallel and sequential execution
  - Complex multi-dimensional mapspecs
  - Edge cases where all outputs in a tuple get the same error

This ensures that error_handling='continue' works correctly with functions that have tuple output names.
The error_handling parameter was present in run_map_async and run_map_eager_async
but missing from the Pipeline.map_async method signature and documentation.
This caused the docstring check to fail.
- Cache _ERROR_INFOS_NONE singleton to avoid NamedTuple creation per iteration
- Cache _RESOURCES_SKIPPED singleton to avoid dataclass creation per iteration
- Reorder eval_resources to check callable(func.resources) first, avoiding
  unnecessary dict lookups when resources is not callable (most common case)
- Remove unnecessary dict(kwargs) copy in _call_user
…cache logic

- Cache _ERROR_CONTEXT_RAISE_NONE singleton for the most common case
  (raise mode with no error info) to avoid dataclass creation per iteration
- Inline _get_or_set_cache logic for cache=None case to avoid inner
  function definition overhead on every call
- Together with previous optimizations, reduces overhead by ~6.5% locally
- Fix resume validation to reject runs with different error_handling modes
  to prevent silently reusing ErrorSnapshots from continue-mode runs
- Fix scan_inputs_for_errors crash on 0-D object arrays by handling them
  specially (np.where doesn't work on 0-D arrays)
- Fix output_picker exceptions bypassing continue handling by wrapping
  picker calls with error handling in both _pick_output and _dump_single_output

Tests added:
- test_empty_inputs_continue_mode
- test_resume_rejects_different_error_handling_mode
- test_output_picker_exception_in_continue_mode
- test_output_picker_exception_raises_in_raise_mode
- test_scan_inputs_for_errors_0d_object_array
- test_scan_inputs_for_errors_0d_numeric_array
Add FINAL_REVIEW.md documenting all bugs fixed and issues investigated
for the error handling "continue" mode feature:

- 3 confirmed bugs with regression tests
- 4 potential issues investigated (1 inefficiency, 2 by design, 1 inconclusive)

Also add test_error_handling_potential_issues.py with tests that document
the investigated potential issues and their findings.
Add assessment of new concerns from PR_REVIEW.md:
- Performance optimization already in place (StorageBase dtype check)
- Thread-safety: by design, test validates correct per-thread snapshots
- Deep recursion: theoretical risk only for 1000+ stage pipelines
- Partial array errors: documented behavior in get_root_causes()

Reorganize remaining TODOs into "Investigated" vs "Still Unvalidated" sections.
…tion regex

- Updated  in  to allow filtering for any resource scope when using  with SLURM. This prevents useless job submissions for map-scope resources when upstream errors exist.
- Added regression test .
- Fixed a regex pattern in  to match the actual error message.
- Added  with investigation details.
- Updated  to mark the issue as resolved.
Error objects were being dumped twice (once in subprocess, once in main
process) because they bypassed the XOR logic in _update_array. This was
inefficient I/O overhead.

Fix: Remove the is_error_object bypass - error objects now follow the
same XOR dump logic as normal values, ensuring exactly one dump per output.

Test: test_error_objects_dump_count_with_mock (previously xfail) now passes.
Merge two redundant tests into one focused test after the `func` argument
was removed from `should_filter_error_indices`. The previous tests created
multiple PipeFunc objects with different resource scopes but never used
them since the function no longer depends on scope.
- Add tests for output_picker exceptions with file storage (mapped functions)
- Add tests for output_picker exceptions in non-mapped functions with file
  storage (covers _dump_single_output lines 649-659)
- Add test for all-error indices being run locally instead of submitted to
  executor (covers _all_indices_propagate_errors path)
- Remove FINAL_REVIEW.md (PR review document)
- Remove docs/dev/ (development notes)
- Move test_output_picker_error_snapshot.py to tests/unit/error_handling/
@github-actions
Copy link
Contributor

✅ PR Title Formatted Correctly

The title of this PR has been updated to match the correct format. Thank you!

@basnijholt basnijholt merged commit 11f11fa into main Dec 12, 2025
37 checks passed
@basnijholt basnijholt deleted the continue-on-errors2 branch December 12, 2025 07:05
@basnijholt
Copy link
Collaborator Author

@MitchellAcoustics try out v0.90.0! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants