-
Micro-benchmark suite for individual DBSP operators

Mentors: Simon, Gerd

Expanded Description: The goal of this project is to extend our micro-benchmarking framework, which exercises and profiles individual operators in the DBSP runtime, the core incremental query engine within Feldera. DBSP (Database Stream Processor) is the foundational model that underlies Feldera's incremental execution engine. It expresses streaming computation as circuits of primitive operators, such as lifting, delay, and integration/differentiation, that collectively implement relational algebra and SQL semantics fully incrementally. These operators maintain state and process change-sets (Z-sets) over time, which directly impacts performance across workloads. The aim of this project is to systematically evaluate the performance characteristics of individual DBSP operators in isolation. Understanding per-operator scalability, latency, memory/disk behavior, and CPU cost is critical both for core engine optimization and for guiding SQL planner decisions.

In this project you will:

- Identify the primitive operators in the DBSP runtime (Rust code) that form the building blocks of incremental query execution. Examples include stateful operators like delay (z⁻¹), merge, join, and aggregate, as well as stateless transforms.
- Extend our micro-benchmark suite in Rust and Python to instantiate small DBSP circuits as pipelines that exercise one operator at a time, feed them controlled input streams, and measure performance metrics such as throughput, latency, memory usage/bandwidth, and disk usage/bandwidth.
- Integrate with existing Feldera benchmarking tooling so that operator benchmarks can be run alongside, or as part of, the current benchmark harness. Ensure that results are reproducible, logged efficiently, and comparable over time.
- Build parsers and analyzers in Python to automate result aggregation, compute statistics (e.g., warm-up effects, variance), and generate plots or reports (e.g., latency histograms, throughput curves) for each operator.
- Document findings and performance profiles, highlighting performance bottlenecks, scaling behavior, and opportunities for optimization within the DBSP runtime.
- (Stretch) Optionally explore statistical approaches to reduce measurement noise and warm-up costs, e.g., adopting ideas from research on microbenchmark stability metrics, so that the suite yields robust and trustworthy numbers.

By focusing on micro-benchmarks rather than full pipelines, this project will expose fine-grained performance characteristics of DBSP operators, enabling data-driven engineering decisions in the Feldera runtime.
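The result-aggregation step described above could start from a sketch like this: a minimal Python summarizer that trims warm-up iterations and reports basic statistics per operator run. The function name and the fixed-count warm-up policy are assumptions for illustration, not part of any existing Feldera tooling.

```python
import statistics

def summarize(samples_ns, warmup=5):
    """Aggregate raw per-iteration latencies (nanoseconds) for one operator run.

    Drops the first `warmup` iterations to reduce warm-up bias, then reports
    simple summary statistics over the steady-state samples.
    """
    steady = samples_ns[warmup:]
    return {
        "n": len(steady),
        "mean_ns": statistics.fmean(steady),
        "p50_ns": statistics.median(steady),
        "stdev_ns": statistics.stdev(steady) if len(steady) > 1 else 0.0,
    }
```

A real analyzer would likely make the warm-up cutoff adaptive (e.g., detect when the running mean stabilizes) rather than dropping a fixed prefix.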
-
Incremental lakehouse format

Mentors: Swanand, Mihai

Modern lakehouse formats such as Delta Lake, Apache Iceberg, and Apache Hudi are designed around batch-oriented data production: writers periodically commit new files, and readers reason about snapshots of large immutable datasets. This model works well for analytical workloads, but it is a poor fit for incremental and streaming computation, where systems like Feldera continuously maintain derived tables and materialized views from fine-grained changes. The goal of this project is to explore and prototype an incremental lakehouse format: a storage layout and commit protocol that makes small, frequent updates a first-class concept rather than an afterthought. Feldera's execution engine (DBSP) naturally produces change streams (insertions, deletions, and updates) to tables at very high rates. Today, persisting these results into object storage typically requires batching changes into large files or repeatedly rewriting data, which introduces latency, inefficiency, and complexity. This project asks: what would a lakehouse format look like if it were designed for incremental computation from day one?

In this project you will:

- Study existing lakehouse formats (Delta, Iceberg, Hudi) and identify where their assumptions conflict with incremental workloads, e.g., file granularity, snapshot semantics, commit frequency, and metadata scaling.
- Define requirements for an incremental lakehouse format.
- Design a prototype format and commit protocol.
- Document the design and findings, highlighting how incremental-first storage changes the performance and usability envelope for real-time analytics and streaming SQL systems.

This project sits at the intersection of streaming systems, storage formats, and cloud object storage, and offers a chance to rethink a core piece of modern data infrastructure from an incremental-computation perspective. Note that this is a very ambitious project, and focusing on just parts of what's mentioned above is perfectly acceptable.
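To make the commit-protocol idea concrete, here is a toy in-memory model of an incremental-first log: each commit appends a small change-set of weighted rows (+1 insert, -1 delete, Z-set style), and a snapshot at any version is the fold of all deltas up to that version. This is purely illustrative; a real design would store deltas as objects in cloud storage with compaction, checkpoints, and concurrency control.

```python
from collections import Counter

class IncrementalLog:
    """Toy model of an incremental-first commit protocol.

    Each commit appends a small change-set (row -> weight delta) instead of
    rewriting large files; readers reconstruct a snapshot by folding deltas.
    """
    def __init__(self):
        self.commits = []  # ordered list of change-sets

    def commit(self, changes):
        self.commits.append(dict(changes))
        return len(self.commits)  # new version number

    def snapshot(self, version=None):
        state = Counter()
        for delta in self.commits[:version]:
            state.update(delta)
        # Rows whose weights cancel out are absent from the snapshot.
        return {row: w for row, w in state.items() if w != 0}
```

In practice the interesting questions are exactly the ones this sketch ignores: metadata scaling under very frequent commits, compaction policy, and time-travel reads.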
-
NATS output connector

Mentors: Abhinav, Kristoffer

Description: NATS is a lightweight, high-performance messaging system widely used for real-time event distribution, microservices communication, and streaming pipelines. Feldera already supports NATS as an input connector, allowing streaming data to flow into the DBSP-based incremental engine. This project completes the picture by adding a NATS output connector, enabling Feldera to publish incremental results back into NATS. Feldera's execution model naturally produces change streams: fine-grained insertions, deletions, and updates to tables and views. Exposing these changes directly via NATS allows Feldera to integrate seamlessly into event-driven architectures, where downstream systems react to updates in real time rather than polling databases or object storage.

In this project you will:

- Implement a NATS publisher in Rust.

By adding a NATS output connector, this project enables closed-loop, event-driven dataflows with Feldera at the core, making it easy to plug incremental SQL pipelines into modern microservice and streaming ecosystems.
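One design question for the connector is how to map change events onto NATS subjects and payloads. The sketch below (in Python for brevity; the connector itself would be Rust) shows one possible encoding; the `feldera.<view>.<op>` subject layout and the JSON envelope are assumptions for illustration, not a decided wire format.

```python
import json

def encode_change(view, op, row):
    """Map one incremental change to a hypothetical NATS (subject, payload) pair.

    Publishing insertions and deletions on distinct subjects lets downstream
    subscribers filter by operation with a plain subject wildcard.
    """
    if op not in ("insert", "delete"):
        raise ValueError(f"unknown change operation: {op}")
    subject = f"feldera.{view}.{op}"
    payload = json.dumps({"op": op, "row": row}).encode("utf-8")
    return subject, payload
```

The real connector would also need to decide on batching, ordering guarantees, and whether to use core NATS or JetStream for at-least-once delivery.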
-
Improved SQLancer support

Mentors: Abhinav, Mihai

SQLancer is a SQL-level fuzzer that automatically generates schemas, data, and queries to uncover logic bugs (incorrect result sets) and, in some modes, performance pathologies in database systems. It does this by repeatedly building random databases and using oracles to cross-check results. Feldera already has an active fork of SQLancer at https://github.com/feldera/sqlancer. This project continues the Feldera-specific SQLancer integration.
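To illustrate the oracle idea mentioned above, here is a minimal Python sketch of a Ternary Logic Partitioning (TLP)-style check, one of the oracle families SQLancer uses (SQLancer itself is written in Java). A predicate with three-valued logic (True / False / None for SQL NULL) splits a result set into three partitions; the partitions together must reproduce the full result, or the engine has a logic bug.

```python
from collections import Counter

def tlp_partitions(rows, predicate):
    """Split rows by the three-valued outcome of `predicate`:
    True, False, or None (modeling SQL NULL)."""
    parts = {"true": [], "false": [], "null": []}
    for r in rows:
        v = predicate(r)
        key = "null" if v is None else ("true" if v else "false")
        parts[key].append(r)
    return parts

def tlp_check(rows, predicate):
    """Oracle: the three partitions combined must equal the full result multiset."""
    parts = tlp_partitions(rows, predicate)
    combined = Counter(parts["true"]) + Counter(parts["false"]) + Counter(parts["null"])
    return combined == Counter(rows)
```

In the real fuzzer, the partitions are produced by the database under test (three rewritten queries), not computed in the harness, so a mismatch points at an engine bug.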
-
CDF Support in Delta Lake Connector

Mentors: Swanand, Ben

Description: The current Delta input connector tails the transaction log, which becomes very expensive on large MERGE overwrites (for example, on a wide attribute table): each commit must be fully unpacked even when the net row changes are small. This project adds native bidirectional CDF (Change Data Feed) support to the Delta connector, handling such massive transactions efficiently by consuming and emitting fine-grained row changes directly.
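For context, Delta's Change Data Feed tags each row with a `_change_type` column: `insert`, `delete`, `update_preimage` (the old row), or `update_postimage` (the new row). The sketch below shows how a consumer folds such rows into a keyed table; the dict-based table and the `id` key are illustrative, not the connector's actual types.

```python
def apply_cdf(table, cdf_rows):
    """Apply Delta Change Data Feed rows to an in-memory table keyed by `id`.

    `_change_type` values follow Delta CDF semantics; metadata columns
    (prefixed with `_`) are stripped before storing the row.
    """
    for row in cdf_rows:
        ct = row["_change_type"]
        key = row["id"]
        if ct in ("insert", "update_postimage"):
            table[key] = {k: v for k, v in row.items() if not k.startswith("_")}
        elif ct == "delete":
            table.pop(key, None)
        elif ct == "update_preimage":
            # The preimage carries the old value; the paired postimage row
            # applies the actual change, so nothing to do here.
            pass
        else:
            raise ValueError(f"unknown _change_type: {ct}")
    return table
```

The efficiency win the project targets is visible even in this toy: a MERGE touching millions of files but changing few rows yields only a handful of CDF rows to process.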
-
Built-in Testing and Benchmarking for Feldera Pipelines

Mentors: George, ???

Description: Feldera compiles SQL programs into data processing pipelines: programs that ingest streams of input data and stream computation outputs. This project proposes adding built-in testing and benchmarking capabilities that allow users to declare test cases with input data tables and expected output views, execute them on demand or automatically after compilation, and track pass/fail status. Test data can be defined inline or referenced through existing input connectors. Users would be able to mark tests as manual, mandatory (must pass before the pipeline starts), or as benchmarks (averaging execution statistics across multiple runs), providing a streamlined workflow for validating query behavior and measuring performance on target infrastructure. The benchmark functionality would allow precise measurement of when a computation completes, which is currently not possible.

The implementation requires several core features: end-of-input events for connectors, forced computation steps when all inputs complete, end-of-processing event detection, connector re-initialization without releasing resources, and pipeline state reset for low-latency back-to-back test execution. Output validation would accumulate insert/delete changes to compare the final view state against expectations. This approach eliminates the current alternatives of manual ad-hoc queries, duplicate pipeline configurations, or external automation scripts, instead providing integrated test organization, automatic execution tracking, and performance measurement directly within the Feldera platform.
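The output-validation step described above (accumulating insert/delete changes and comparing the final view state against expectations) can be sketched in a few lines; the function names and the `(op, row)` change representation are illustrative, not Feldera's actual API.

```python
from collections import Counter

def final_state(changes):
    """Fold a stream of (op, row) changes into the final multiset of rows,
    crediting inserts and debiting deletes."""
    state = Counter()
    for op, row in changes:
        state[row] += 1 if op == "insert" else -1
    return +state  # unary + drops rows whose counts cancelled to <= 0

def check(changes, expected_rows):
    """Pass iff the accumulated view state matches the expected multiset."""
    return final_state(changes) == Counter(expected_rows)
```

Multiset (rather than set) comparison matters here because SQL views can legitimately contain duplicate rows.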
-
Unified Connector Interface for Feldera

Mentors: George, Gerd

Description: Feldera's DBSP engine transforms SQL queries into incremental streaming circuits, with connectors handling data ingestion and output to external systems like Kafka, Delta Lake, and S3. Currently, these connectors implement similar features (fault tolerance, backfill, seeking) without a shared API, making it difficult for external contributors to create new connectors with first-class engine integration. This project proposes refactoring existing connectors into standalone Rust libraries with a unified interface that maintains high performance through static compilation, while enabling independent development and distribution of new connectors via Cargo or GitHub sources.
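As a rough shape for such an interface, here is a hypothetical sketch in Python (the actual proposal would be a Rust trait). The method names model the shared features the description lists: output (`push`), fault tolerance (`checkpoint`), and seeking (`seek`); none of them are real Feldera APIs.

```python
from abc import ABC, abstractmethod

class OutputConnector(ABC):
    """Hypothetical unified output-connector interface."""
    @abstractmethod
    def push(self, batch): ...           # deliver a batch of rows downstream
    @abstractmethod
    def checkpoint(self): ...            # return a token describing durable progress
    @abstractmethod
    def seek(self, token): ...           # roll back / resume from a checkpoint token

class MemoryConnector(OutputConnector):
    """In-memory reference implementation, handy for engine tests."""
    def __init__(self):
        self.rows = []
    def push(self, batch):
        self.rows.extend(batch)
    def checkpoint(self):
        return len(self.rows)            # token: count of durably written rows
    def seek(self, token):
        self.rows = self.rows[:token]    # discard rows written after the checkpoint
```

A trivial in-memory implementation like this doubles as a conformance baseline: every real connector (Kafka, Delta, S3) should observably behave the same way under push/checkpoint/seek sequences.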
-
Pluggable Language Compiler Framework for Feldera

Mentors: George, ???

Description: Feldera compiles SQL queries into high-performance DBSP circuits through a two-stage pipeline: SQL-to-DBSP transformation (Java/Calcite-based) followed by DBSP-to-Rust code generation. This project proposes exposing the compilation pipeline as a configurable system of Rust libraries, enabling community-developed language frontends. Contributors could create compilers that either emit SQL text (leveraging the existing SQL -> DBSP -> Rust pipeline) or generate DBSP Rust code directly, for languages where SQL semantics are a poor fit, such as dataflow DSLs, streaming-native languages, or domain-specific notations for financial or scientific computing.
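The two frontend paths described above (emit SQL text, or emit DBSP Rust directly) suggest a small dispatch layer; the toy sketch below models that idea in Python. Every name here, including the example counting DSL, is hypothetical and exists only to illustrate the registry shape.

```python
def compile_pipeline(lang, program, frontends):
    """Dispatch a source program to its registered frontend.

    Each frontend returns a (target, code) pair, where target is either
    "sql" (reuse the existing SQL -> DBSP -> Rust pipeline) or
    "dbsp-rust" (hand-generated DBSP circuit code).
    """
    target, code = frontends[lang](program)
    if target not in ("sql", "dbsp-rust"):
        raise ValueError(f"frontend for {lang!r} produced unknown target {target!r}")
    return target, code

def count_dsl_frontend(program):
    """Hypothetical frontend for a one-line counting DSL, lowered to SQL text."""
    table = program.split()[-1]  # e.g. "count events" -> "events"
    return "sql", f"SELECT COUNT(*) FROM {table}"
```

The SQL-emitting path is the cheaper one for contributors, since it inherits all of the existing optimizer and code generator; the direct path trades that for full control over the circuit.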
-
Please write down the ideas you'd be happy to mentor as a reply