Closed
Description
Expected Behavior
Bytewax materialization should run each pod successfully once and then set the job status to success.
Current Behavior
If a node crashes, the records of pods that already succeeded can be lost, and the job reruns all of those lost pods. If nodes crash often enough, the job keeps rerunning successful pods and never completes.
Steps to reproduce
Run a materialization job against a multi-node Kubernetes cluster. Terminate one of the nodes and observe that its pods are lost and rerun.
Specifications
- Version: 0.31
- Platform: Fedora Linux
- Subsystem: bytewax batch_engine
Possible Solution
For safety, the job should have a configurable activeDeadlineSeconds so that it cannot rerun pods indefinitely. It should also be possible to split the larger job into smaller batches, which limits how many successful pods a single node crash can cause to rerun.
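
A rough sketch of what this could look like, not the engine's actual code: the helper names (`chunk`, `build_job_manifest`, `plan_batched_jobs`), the job naming scheme, and the image name are all hypothetical. The only parts taken from Kubernetes itself are the `batch/v1` Job fields (`activeDeadlineSeconds`, `completions`, `parallelism`, `completionMode`). The sketch assumes one pod per input path and emits one Job manifest per batch.

```python
from typing import Any, Dict, List, Optional


def chunk(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_job_manifest(
    name: str,
    image: str,
    pod_count: int,
    active_deadline_seconds: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    spec: Dict[str, Any] = {
        "completionMode": "Indexed",
        "completions": pod_count,
        "parallelism": pod_count,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "worker", "image": image}],
            }
        },
    }
    # Only set the deadline when configured, so the default stays unbounded.
    if active_deadline_seconds is not None:
        spec["activeDeadlineSeconds"] = active_deadline_seconds
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": spec,
    }


def plan_batched_jobs(
    job_id: str,
    image: str,
    paths: List[str],
    batch_size: int,
    active_deadline_seconds: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Create one Job manifest per batch of input paths (assumes one pod per path)."""
    return [
        build_job_manifest(
            name=f"{job_id}-{index}",  # hypothetical naming scheme
            image=image,
            pod_count=len(batch),
            active_deadline_seconds=active_deadline_seconds,
        )
        for index, batch in enumerate(chunk(paths, batch_size))
    ]
```

With smaller batches, a node crash would at worst force a rerun of the pods in the batches running at that moment, and the deadline bounds how long any single batch can keep retrying.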