ENH: error_handling options (raise and continue)#854
Merged
basnijholt merged 158 commits intomainfrom Dec 12, 2025
Merged
Conversation
CodSpeed Performance ReportMerging #854 will not alter performanceComparing Summary
|
Codecov Report✅ All modified and coverable lines are covered by tests.
🚀 New features to boost your workflow:
|
- Add pipefunc/_error_handling.py with common error handling utilities: - check_for_error_inputs(): Check if inputs contain ErrorSnapshot objects - create_propagated_error(): Create PropagatedErrorSnapshot instances - handle_error_inputs(): Combined error checking and propagation - cloudpickle_function_state/cloudunpickle_function_state(): Common pickling helpers - Refactor duplicate error detection code in pipefunc/map/_run.py to use common functions - Update ErrorSnapshot and PropagatedErrorSnapshot pickling to use common helpers - Pass error_handling parameter to prepare_run in async and eager execution modules - Add missing error_handling parameter descriptions to docstrings All tests pass with both parallel=False and parallel=True execution modes.
- Consolidate development notes into DEVELOPMENT_NOTES.md - Update PLAN.md to reflect completed implementation and refactoring - Remove redundant documentation files - Document parallel execution fix and code refactoring work
- Created ErrorInfo dataclass to improve type safety and structure - Added factory methods from_full_error() and from_partial_error() - Updated PropagatedErrorSnapshot to use ErrorInfo instead of dict - Modified pickling/unpickling methods to handle the new dataclass - All error handling tests pass with both parallel=False and parallel=True
…ling - Created test_error_handling_coverage.py with tests for: - ErrorInfo dataclass creation methods - PropagatedErrorSnapshot with default reason - handle_error_inputs with raise mode - cloudpickle functions with missing keys - ErrorSnapshot save/load functionality - PropagatedErrorSnapshot pickling with complex error_info - Created test_error_handling_final_coverage.py with tests for: - PropagatedErrorSnapshot string repr with partial errors - get_root_causes with partial errors (empty list case) Both pipefunc._error_handling.py and pipefunc.exceptions.py now have 100% test coverage.
- Test error handling with dict, file_array, and zarr_memory storage - Verify that ErrorSnapshot objects are preserved in dict storage - Document expected behavior for storage backends that may not preserve error objects - Add comprehensive error handling documentation with examples
…rror object storage
This commit addresses two critical issues in the error handling system that were
causing failures in parallel execution with file-based storage backends.
## Issue 1: Missing `in_executor` parameter in error handling
The `in_executor` parameter was removed in commit 0d8e419b but was still being
used in the error handling logic. This parameter is crucial for distinguishing
between sequential (main process) and parallel (subprocess) execution contexts.
### Root Cause
- Error handling needs to know execution context to properly wrap exceptions
- In sequential mode: exceptions are raised directly (in_executor=False)
- In parallel mode: exceptions are caught and wrapped as ErrorSnapshot (in_executor=True)
- The parameter was accidentally removed during refactoring
### Fix
Restored the conditional logic in `_run_iteration()`:
```python
in_executor = executor is not None
if error_handling == "raise":
yield from func(*args, **kwargs, _return_mapspec_names=True, _in_executor=in_executor)
```
## Issue 2: FileArray not preserving ErrorSnapshot objects in parallel execution
FileArray storage was returning `np.ma.masked` instead of `PropagatedErrorSnapshot`
objects when errors occurred in parallel execution, causing test failures.
### Root Cause Analysis
1. **Storage Logic**: FileArray uses XOR logic to determine when to dump:
```python
if force_dump or (array.dump_in_subprocess ^ in_post_process):
```
2. **The Problem**:
- FileArray has `dump_in_subprocess=True`
- In subprocess: `True ^ False = True` → dumps correctly
- In main process: `True ^ True = False` → doesn't dump
- PropagatedErrorSnapshot created in main process wasn't being stored
3. **Execution Flow**:
- Sequential: ErrorSnapshot created and stored → PropagatedErrorSnapshot created and stored
- Parallel: ErrorSnapshot created in subprocess → stored correctly
PropagatedErrorSnapshot created in main process → NOT stored due to XOR logic
### Fix
Modified `_update_array()` to always dump error objects regardless of process context:
```python
from pipefunc.exceptions import ErrorSnapshot, PropagatedErrorSnapshot
# Always dump error objects, or follow normal dump rules
is_error_object = isinstance(_output, (ErrorSnapshot, PropagatedErrorSnapshot))
if is_error_object or force_dump or (array.dump_in_subprocess ^ in_post_process):
# ... dump logic ...
```
## Testing and Validation
Created extensive test suite to understand the issue:
- `test_file_array_error.py`: Verified FileArray can store/retrieve ErrorSnapshot
- `test_error_propagation.py`: Confirmed error propagation works with dict storage
- `test_error_propagation_trace.py`: Traced execution flow differences
- `test_debug_storage.py`: Compared parallel vs sequential execution
Key findings:
- Dict storage worked correctly (no serialization involved)
- FileArray failed only in parallel mode when storing PropagatedErrorSnapshot
- The issue was specific to the XOR dumping logic, not serialization
## Impact
This fix ensures:
1. Error handling works correctly in both sequential and parallel execution
2. All storage backends (dict, file_array, zarr) properly preserve error objects
3. Error propagation chain is maintained across pipeline stages
4. Tests in `test_error_handling_storage.py` now pass for all storage backends
## Technical Details
- ErrorSnapshot: Captures exception info with metadata (timestamp, user, machine, IP)
- PropagatedErrorSnapshot: Created when a function receives ErrorSnapshot as input
- FileArray: Uses cloudpickle for serialization, returns np.ma.masked for missing files
- The XOR logic `dump_in_subprocess ^ in_post_process` is preserved for normal data
but bypassed for error objects to ensure they are always persisted
- Fix _default_output_picker to handle ErrorSnapshot objects properly - When a function returns an error, it's now returned as-is for all output names in a tuple - Add comprehensive tests covering tuple outputs with error handling including: - Basic tuple outputs with element-wise mapspecs - Tuple outputs with reduction mapspecs - Custom output pickers with error handling - Parallel and sequential execution - Complex multi-dimensional mapspecs - Edge cases where all outputs in a tuple get the same error This ensures that error_handling='continue' works correctly with functions that have tuple output names.
The error_handling parameter was present in run_map_async and run_map_eager_async but missing from the Pipeline.map_async method signature and documentation. This caused the docstring check to fail.
- Cache _ERROR_INFOS_NONE singleton to avoid NamedTuple creation per iteration - Cache _RESOURCES_SKIPPED singleton to avoid dataclass creation per iteration - Reorder eval_resources to check callable(func.resources) first, avoiding unnecessary dict lookups when resources is not callable (most common case) - Remove unnecessary dict(kwargs) copy in _call_user
…cache logic - Cache _ERROR_CONTEXT_RAISE_NONE singleton for the most common case (raise mode with no error info) to avoid dataclass creation per iteration - Inline _get_or_set_cache logic for cache=None case to avoid inner function definition overhead on every call - Together with previous optimizations, reduces overhead by ~6.5% locally
…inlined cache logic" This reverts commit 7b64c4c.
- Fix resume validation to reject runs with different error_handling modes to prevent silently reusing ErrorSnapshots from continue-mode runs - Fix scan_inputs_for_errors crash on 0-D object arrays by handling them specially (np.where doesn't work on 0-D arrays) - Fix output_picker exceptions bypassing continue handling by wrapping picker calls with error handling in both _pick_output and _dump_single_output Tests added: - test_empty_inputs_continue_mode - test_resume_rejects_different_error_handling_mode - test_output_picker_exception_in_continue_mode - test_output_picker_exception_raises_in_raise_mode - test_scan_inputs_for_errors_0d_object_array - test_scan_inputs_for_errors_0d_numeric_array
Add FINAL_REVIEW.md documenting all bugs fixed and issues investigated for the error handling "continue" mode feature: - 3 confirmed bugs with regression tests - 4 potential issues investigated (1 inefficiency, 2 by design, 1 inconclusive) Also add test_error_handling_potential_issues.py with tests that document the investigated potential issues and their findings.
Add assessment of new concerns from PR_REVIEW.md: - Performance optimization already in place (StorageBase dtype check) - Thread-safety: by design, test validates correct per-thread snapshots - Deep recursion: theoretical risk only for 1000+ stage pipelines - Partial array errors: documented behavior in get_root_causes() Reorganize remaining TODOs into "Investigated" vs "Still Unvalidated" sections.
…tion regex - Updated in to allow filtering for any resource scope when using with SLURM. This prevents useless job submissions for map-scope resources when upstream errors exist. - Added regression test . - Fixed a regex pattern in to match the actual error message. - Added with investigation details. - Updated to mark the issue as resolved.
Error objects were being dumped twice (once in subprocess, once in main process) because they bypassed the XOR logic in _update_array. This was inefficient I/O overhead. Fix: Remove the is_error_object bypass - error objects now follow the same XOR dump logic as normal values, ensuring exactly one dump per output. Test: test_error_objects_dump_count_with_mock (previously xfail) now passes.
Merge two redundant tests into one focused test after the `func` argument was removed from `should_filter_error_indices`. The previous tests created multiple PipeFunc objects with different resource scopes but never used them since the function no longer depends on scope.
- Add tests for output_picker exceptions with file storage (mapped functions) - Add tests for output_picker exceptions in non-mapped functions with file storage (covers _dump_single_output lines 649-659) - Add test for all-error indices being run locally instead of submitted to executor (covers _all_indices_propagate_errors path)
- Remove FINAL_REVIEW.md (PR review document) - Remove docs/dev/ (development notes) - Move test_output_picker_error_snapshot.py to tests/unit/error_handling/
0d74d31 to
ffb7f3d
Compare
Contributor
✅ PR Title Formatted CorrectlyThe title of this PR has been updated to match the correct format. Thank you! |
Collaborator
Author
|
@MitchellAcoustics try out v0.90.0! 😄 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Supersedes #842.