[adapters] delta input connector drops batches due to transient S3 timeout #5750

@swanandx

Description
During delta table snapshot loading, transient S3 timeout errors cause batches to be silently skipped rather than retried or failing the pipeline. The snapshot is then incorrectly marked as completed with fewer records than expected, resulting in silent data loss.

Steps to Reproduce:

  • Configure a delta table input connector pointing to a large Delta table on S3 (snapshot mode).
  • Start the pipeline with unstable or slow S3 connectivity.
  • Observe that S3 timeout errors are logged mid-snapshot but the pipeline continues and reports snapshot load as completed.

Observed Behavior:

The pipeline logs transport errors for the failing batches but proceeds to declare the snapshot load completed with a lower record count:

  • Logs show transport timeouts (e.g., ParquetError/HttpError Timeout) for the failed batches.
  • The pipeline skips those batches, reports the snapshot as completed, and commits the transaction.
  • Ad-hoc pipeline queries (e.g., count(*)) confirm the missing records.
  • The Errors tab shows only a generic transport error message, with no indication that data was lost.

Example log - Successful load (full records):

INFO dbsp_adapters::integrated::delta_table::input: delta_table my_table.unnamed-0: snapshot load completed (records: 3436416, version: 17)

Example log - Failed load (735K records missing):


ERROR dbsp_adapters::server: error on input endpoint 'my_table.unnamed-0': error retrieving batch 333 of initial snapshot query ... ParquetError(External(Generic { store: "S3", source: HttpError { kind: Timeout, source: reqwest::Error { kind: Body, source: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } } } } }))
INFO dbsp_adapters::integrated::delta_table::input: delta_table my_table.unnamed-0: snapshot load completed (records: 2701184, version: 17)

Both runs end with the transaction being committed:

INFO dbsp_adapters::controller:  Committing transaction 1
INFO dbsp_adapters::controller:  Transaction 1 committed

Proposed Fix / Options:

  • Make it a fatal error: if a batch cannot be retrieved after all retries are exhausted, fail the pipeline rather than silently skipping the batch.
  • Retry the full snapshot: on transient errors (e.g., timeouts), reset the job queue and recreate the stream, up to a configurable number of retries (e.g., 2–3), before giving up.
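A minimal sketch of the first option, as a generic retry wrapper. This is not the actual dbsp_adapters API; `fetch_batch_with_retry`, its parameters, and the backoff policy are all illustrative. The point is that once retries are exhausted the error is propagated to the caller (which can then fail the pipeline) instead of being logged and swallowed:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical retry helper (not part of dbsp_adapters): retries `fetch`
/// up to `max_retries` times with linear backoff, then surfaces the error
/// so the caller can fail the pipeline instead of skipping the batch.
fn fetch_batch_with_retry<T, E, F>(
    mut fetch: F,
    max_retries: u32,
    backoff: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match fetch() {
            Ok(batch) => return Ok(batch),
            Err(_) if attempt < max_retries => {
                attempt += 1;
                // Transient error (e.g., an S3 timeout): back off and retry.
                sleep(backoff * attempt);
            }
            // Retries exhausted: propagate as a fatal error rather than
            // under-reporting the snapshot record count.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a fetch that times out twice, then succeeds.
    let mut calls = 0;
    let result = fetch_batch_with_retry(
        || {
            calls += 1;
            if calls < 3 { Err("timeout") } else { Ok(1024usize) }
        },
        3,
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(1024));
    println!("batch fetched after {calls} attempts");
}
```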

Ideally, the behavior could be user-configurable via a connector config option.
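For illustration only, such a config option might look like the sketch below. The option names (`on_batch_error`, `max_snapshot_retries`) are hypothetical and do not exist in the current connector config:

```yaml
# Hypothetical connector config sketch; option names are illustrative,
# not an existing dbsp_adapters/Feldera API.
transport:
  name: delta_table_input
  config:
    uri: s3://my-bucket/my_table
    mode: snapshot
    # Proposed: what to do when a batch cannot be retrieved
    # after per-request retries are exhausted.
    on_batch_error: fail          # or: retry_snapshot
    max_snapshot_retries: 3       # only used with retry_snapshot
```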

Additional Context:

S3 region: us-west-1
Table size: ~3.4M records across multiple Parquet files

Labels: connectors (issues related to the adapters/connectors crate)