Description
During delta table snapshot loading, transient S3 timeout errors cause batches to be silently skipped instead of being retried or failing the pipeline. The snapshot is then incorrectly marked as completed with fewer records than expected, resulting in silent data loss.
Steps to Reproduce:
- Configure a delta table input connector pointing to a large Delta table on S3 (snapshot mode).
- Start the pipeline with unstable or slow S3 connectivity.
- Observe that S3 timeout errors are logged mid-snapshot but the pipeline continues and reports snapshot load as completed.
Observed Behavior:
The pipeline logs transport errors for the failing batches but proceeds to declare snapshot load completed with a lower record count. Ad-hoc queries confirm the missing records. The Errors tab shows a generic transport error message without enough detail about data loss.
- Logs show transport timeouts (e.g., ParquetError/HttpError Timeout) for failed batches.
- Pipeline skips them, logs partial completion, and commits the transaction.
- Ad-hoc pipeline queries (e.g., count(*)) confirm missing records.
- Errors tab shows generic message without data loss details.
Example log - Successful load (full records):
INFO dbsp_adapters::integrated::delta_table::input: delta_table my_table.unnamed-0: snapshot load completed (records: 3436416, version: 17)
Example log - Failed load (735K records missing):
ERROR dbsp_adapters::server: error on input endpoint 'my_table.unnamed-0': error retrieving batch 333 of initial snapshot query ... ParquetError(External(Generic { store: "S3", source: HttpError { kind: Timeout, source: reqwest::Error { kind: Body, source: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } } } } }))
INFO dbsp_adapters::integrated::delta_table::input: delta_table my_table.unnamed-0: snapshot load completed (records: 2701184, version: 17)
Both the successful and the failed run end with:
INFO dbsp_adapters::controller: Committing transaction 1
INFO dbsp_adapters::controller: Transaction 1 committed
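Until this is fixed, one way to spot the problem from the outside is to compare the record count in the "snapshot load completed" message against a known-good count. A minimal sketch, assuming the log format shown above stays as is; `snapshot_record_count` is a hypothetical helper, not part of dbsp_adapters:

```rust
/// Hypothetical helper: extracts the record count from a
/// "snapshot load completed (records: N, version: V)" log line.
/// Returns None if the line is not such a message or has no numeric count.
fn snapshot_record_count(line: &str) -> Option<u64> {
    if !line.contains("snapshot load completed") {
        return None;
    }
    let marker = "records: ";
    let start = line.find(marker)? + marker.len();
    // Take the run of ASCII digits that follows the marker.
    let digits: String = line[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```

Applied to the two log lines above, this gives 3436416 and 2701184, a shortfall of 735232 records.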
Proposed Fix / Options:
- Make it a fatal error: if a batch cannot be retrieved after all retries are exhausted, fail the pipeline rather than silently skipping the batch.
- Retry the full snapshot: on transient errors (e.g., a timeout), reset the job queue and recreate the stream, up to a configurable number of retries (e.g., 2–3), before giving up.
Ideally, the behavior could be user-configurable via a connector config option.
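To make the two options concrete, here is a rough sketch of how a configurable policy might look. All names here (`SnapshotErrorPolicy`, `SnapshotError`, `load_with_policy`) are hypothetical and do not exist in the connector today; the closure stands in for "reset the job queue, recreate the stream, and read every batch", and returns the record count or the first batch error.

```rust
/// Hypothetical connector option: what to do when a snapshot batch fails.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotErrorPolicy {
    /// Fail the pipeline as soon as a snapshot attempt hits a batch error.
    Fatal,
    /// Restart the whole snapshot up to `max_retries` times, then fail.
    RetrySnapshot { max_retries: u32 },
}

/// Returned when every allowed attempt has failed. The caller is expected
/// to fail the pipeline instead of committing a partial snapshot.
#[derive(Debug, PartialEq, Eq)]
struct SnapshotError {
    attempts: u32,
    last_error: String,
}

/// Runs `load_snapshot` (called with a 1-based attempt number) until it
/// succeeds or the policy's attempt budget is used up.
/// Assumption: each call starts from a clean state, i.e. records from a
/// failed attempt are discarded before the next one begins.
fn load_with_policy<F>(
    policy: SnapshotErrorPolicy,
    mut load_snapshot: F,
) -> Result<u64, SnapshotError>
where
    F: FnMut(u32) -> Result<u64, String>,
{
    let max_attempts = match policy {
        SnapshotErrorPolicy::Fatal => 1,
        SnapshotErrorPolicy::RetrySnapshot { max_retries } => max_retries.saturating_add(1),
    };
    let mut last_error = String::new();
    for attempt in 1..=max_attempts {
        match load_snapshot(attempt) {
            Ok(records) => return Ok(records),
            Err(e) => last_error = e,
        }
    }
    // Never report success with missing batches: surface a hard error.
    Err(SnapshotError { attempts: max_attempts, last_error })
}
```

The important property is that neither policy has a path that reports "snapshot load completed" after a batch error. `RetrySnapshot { max_retries: 2 }` allows three attempts in total; `Fatal` allows one.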
Additional Context:
S3 region: us-west-1
Table size: ~3.4M records across multiple Parquet files