Skip to content

adapters: parallel delta output encoder#5869

Open
swanandx wants to merge 2 commits intomainfrom
parallel-delta-output
Open

adapters: parallel delta output encoder#5869
swanandx wants to merge 2 commits intomainfrom
parallel-delta-output

Conversation

@swanandx
Copy link
Copy Markdown
Contributor

@swanandx swanandx commented Mar 19, 2026

  • Parallelize Delta table output encoding by splitting batches across configurable worker threads, each writing independent Parquet files concurrently
  • Add threads config option (default: 1) to DeltaTableWriterConfig to control the number of parallel encoding threads

Describe Manual Test Plan

Added unit tests and ran locally

for write retry, I changed local permissions for local path to simulate write failure

chmod 000 /tmp/feldera-delta-profiling

and to restore

chmod 755 /tmp/feldera-delta-profiling

& commit retries were tried by renaming the directory

what can be tested further:

  • on fatal error do we want to stop whole pipeline, just stall it or do something else?
  • try this out with IRSA and expiring credentials?

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

we're adding new config options for delta connector but should not be a breaking change.

@swanandx swanandx force-pushed the parallel-delta-output branch 3 times, most recently from 56f4f84 to 52065ab Compare March 23, 2026 13:06
@swanandx swanandx requested a review from mythical-fred March 23, 2026 13:10
@swanandx swanandx marked this pull request as ready for review March 23, 2026 13:10
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three hard blocks, one correctness concern.

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two code bugs (inline) plus missing docs — blocking on the docs.

threads and fatal_on_max_retries are new user-visible config options but neither appears in docs.feldera.com/docs/connectors/sinks/delta.md. The Documentation checkbox is unchecked. Please add entries to the config table for both options.

@ryzhyk ryzhyk self-requested a review March 23, 2026 19:04
@swanandx swanandx force-pushed the parallel-delta-output branch from 6738817 to ac1dfbd Compare March 24, 2026 06:31
@swanandx swanandx requested a review from mythical-fred March 24, 2026 06:32
Use SplitCursor to split the batch and distribute it across tasks,
each task retries encoding and writing to delta lake and then returns
Add actions which main task retries to commit to delta lake

Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
@swanandx swanandx force-pushed the parallel-delta-output branch from ac1dfbd to adf3e2c Compare March 24, 2026 06:52
Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't do anything meaningful with fatal errors today, let's not bother adding that option. We'll do another pass to rework connector error handling later.

}

// prints timing for manual comparison, and asserts output correctness
#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's turn this into a proper benchmark under adapters/bench.
What benchmark results did you get?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$cargo bench --package dbsp_adapters --bench delta_encoder --features with-deltalake

     Running benches/delta_encoder.rs (target/release/deps/delta_encoder-836871c7c7c0c9cd)
Gnuplot not found, using plotters backend
delta_indexed_encode/workers/1
                        time:   [35.170 ms 35.201 ms 35.233 ms]
                        thrpt:  [2.8382 Melem/s 2.8408 Melem/s 2.8433 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
delta_indexed_encode/workers/2
                        time:   [24.756 ms 25.310 ms 25.873 ms]
                        thrpt:  [3.8650 Melem/s 3.9511 Melem/s 4.0394 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
delta_indexed_encode/workers/4
                        time:   [15.615 ms 15.899 ms 16.199 ms]
                        thrpt:  [6.1734 Melem/s 6.2897 Melem/s 6.4042 Melem/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
delta_indexed_encode/workers/8
                        time:   [9.1359 ms 9.2460 ms 9.3869 ms]
                        thrpt:  [10.653 Melem/s 10.816 Melem/s 10.946 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

@swanandx swanandx requested review from mythical-fred and ryzhyk and removed request for mythical-fred March 25, 2026 16:47
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous two blockers (schema minimum = 1 and docs for threads) are resolved — thank you. One remaining issue:

pub max_retries: Option<u32>,

/// Number of parallel threads used by the connector.
///
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message requested changes on the latest commit is a hard block. Please squash or rebase: git rebase -i and rewrite it with an imperative summary describing what changed, e.g. [adapters] Use Option<usize> for threads; remove fatal_on_max_retries. Resources: https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all comments! I only have a couple of questions left (inline)

if threads > 1 && key_schema.is_none() {
return Err(ControllerError::invalid_transport_configuration(
endpoint_name,
"parallel writes (threads > 1) require an index (key_schema) to ensure \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            "Parallel writes (threads > 1) require the view to have a unique key to ensure correct ordering of inserts and deletes. Please specify the `index` property in the connector configuration. For more details, see: https://docs.feldera.com/connectors/unique_keys"

// cargo bench -p dbsp_adapters --bench delta_encoder
#[test]
#[ignore]
fn bench_parallel_encoding() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keep the test and why #[ignore] it? If it improves test coverage, perhaps we can make a version with less data that can run as a test?

let mut storage_options: serde_json::Value = serde_json::to_value(object_store_config).unwrap();
storage_options["uri"] = json!(table_uri);
storage_options["mode"] = json!("truncate");
if let Some(threads) = threads {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it used in any test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants