-
Micro-benchmark suite for individual DBSP operators

Mentors: Simon, Gerd

Expanded Description: The goal of this project is to extend our micro-benchmarking framework, which exercises and profiles individual operators in the DBSP runtime, the core incremental query engine within Feldera. DBSP (Database Stream Processor) is the foundational model that underlies Feldera's incremental execution engine. It expresses streaming computation as circuits of primitive operators, such as lifting, delay, and integration/differentiation, that collectively implement relational algebra and SQL semantics fully incrementally. These operators maintain state and process change-sets (Z-sets) over time, which directly impacts performance across workloads. The aim of this project is to systematically evaluate the performance characteristics of individual DBSP operators in isolation. Understanding per-operator scalability, latency, memory/disk behavior, and CPU cost is critical both for core engine optimization and for guiding SQL planner decisions.

In this project you will:

- Identify the primitive operators in the DBSP runtime (Rust code) that form the building blocks of incremental query execution. Examples include stateful operators like delay (z⁻¹), merge, join, and aggregate, as well as stateless transforms.
- Extend our micro-benchmark suite in Rust and Python to instantiate small DBSP circuits as pipelines that exercise one operator at a time, feed them controlled input streams, and measure performance metrics such as throughput, latency, memory usage/bandwidth, and disk usage/bandwidth.
- Integrate with existing Feldera benchmarking tooling so that operator benchmarks can be run alongside, or as part of, the current benchmark harness. Ensure that results are reproducible, logged efficiently, and comparable over time.
- Build parsers and analyzers in Python to automate result aggregation, compute statistics (e.g., warm-up effects, variance), and generate plots or reports (e.g., latency histograms, throughput curves) for each operator.
- Document findings and performance profiles, highlighting performance bottlenecks, scaling behavior, and opportunities for optimization within the DBSP runtime.
- (Stretch) Optionally explore statistical approaches to reduce measurement noise and warm-up costs, e.g., adopting ideas from research on microbenchmark stability metrics, so that the suite yields robust and trustworthy numbers.

By focusing on micro-benchmarks rather than full pipelines, this project will expose fine-grained performance characteristics of DBSP operators, enabling data-driven engineering decisions in the Feldera runtime.
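The result-aggregation step described above could start from a sketch like this: a minimal Python summarizer that trims warm-up iterations and reports basic statistics per operator run. The function name and the fixed-count warm-up policy are assumptions for illustration, not part of any existing Feldera tooling.

```python
import statistics

def summarize(samples_ns, warmup=5):
    """Aggregate raw per-iteration latencies (nanoseconds) for one operator run.

    Drops the first `warmup` iterations to reduce warm-up bias, then reports
    simple summary statistics over the steady-state samples.
    """
    steady = samples_ns[warmup:]
    return {
        "n": len(steady),
        "mean_ns": statistics.fmean(steady),
        "p50_ns": statistics.median(steady),
        "stdev_ns": statistics.stdev(steady) if len(steady) > 1 else 0.0,
    }
```

A real analyzer would likely make the warm-up cutoff adaptive (e.g., detect when the running mean stabilizes) rather than dropping a fixed prefix.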
-
Incremental lakehouse format

Mentors: Swanand, Mihai

Modern lakehouse formats such as Delta Lake, Apache Iceberg, and Apache Hudi are designed around batch-oriented data production: writers periodically commit new files, and readers reason about snapshots of large immutable datasets. This model works well for analytical workloads, but it is a poor fit for incremental and streaming computation, where systems like Feldera continuously maintain derived tables and materialized views from fine-grained changes. The goal of this project is to explore and prototype an incremental lakehouse format: a storage layout and commit protocol that makes small, frequent updates a first-class concept rather than an afterthought. Feldera's execution engine (DBSP) naturally produces change streams (insertions, deletions, and updates) to tables at very high rates. Today, persisting these results into object storage typically requires batching changes into large files or repeatedly rewriting data, which introduces latency, inefficiency, and complexity. This project asks: what would a lakehouse format look like if it were designed for incremental computation from day one?

In this project you will:

- Study existing lakehouse formats (Delta, Iceberg, Hudi) and identify where their assumptions conflict with incremental workloads, e.g., file granularity, snapshot semantics, commit frequency, and metadata scaling.
- Define requirements for an incremental lakehouse format.
- Design a prototype format and commit protocol.
- Document the design and findings, highlighting how incremental-first storage changes the performance and usability envelope for real-time analytics and streaming SQL systems.

This project sits at the intersection of streaming systems, storage formats, and cloud object storage, and offers a chance to rethink a core piece of modern data infrastructure from an incremental-computation perspective. Note that this is a very ambitious project, and focusing on just parts of what's mentioned above is perfectly acceptable.
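To make the commit-protocol idea concrete, here is a toy in-memory model of an incremental-first log: each commit appends a small change-set of weighted rows (+1 insert, -1 delete, Z-set style), and a snapshot at any version is the fold of all deltas up to that version. This is purely illustrative; a real design would store deltas as objects in cloud storage with compaction, checkpoints, and concurrency control.

```python
from collections import Counter

class IncrementalLog:
    """Toy model of an incremental-first commit protocol.

    Each commit appends a small change-set (row -> weight delta) instead of
    rewriting large files; readers reconstruct a snapshot by folding deltas.
    """
    def __init__(self):
        self.commits = []  # ordered list of change-sets

    def commit(self, changes):
        self.commits.append(dict(changes))
        return len(self.commits)  # new version number

    def snapshot(self, version=None):
        state = Counter()
        for delta in self.commits[:version]:
            state.update(delta)
        # Rows whose weights cancel out are absent from the snapshot.
        return {row: w for row, w in state.items() if w != 0}
```

In practice the interesting questions are exactly the ones this sketch ignores: metadata scaling under very frequent commits, compaction policy, and time-travel reads.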
-
NATS output connector

Mentors: Abhinav, Kristoffer

Description: NATS is a lightweight, high-performance messaging system widely used for real-time event distribution, microservices communication, and streaming pipelines. Feldera already supports NATS as an input connector, allowing streaming data to flow into the DBSP-based incremental engine. This project completes the picture by adding a NATS output connector, enabling Feldera to publish incremental results back into NATS. Feldera's execution model naturally produces change streams: fine-grained insertions, deletions, and updates to tables and views. Exposing these changes directly via NATS allows Feldera to integrate seamlessly into event-driven architectures, where downstream systems react to updates in real time rather than polling databases or object storage.

In this project you will:

- Implement a NATS publisher in Rust.

By adding a NATS output connector, this project enables closed-loop, event-driven dataflows with Feldera at the core, making it easy to plug incremental SQL pipelines into modern microservice and streaming ecosystems.
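One design question for the connector is how to map change events onto NATS subjects and payloads. The sketch below (in Python for brevity; the connector itself would be Rust) shows one possible encoding; the `feldera.<view>.<op>` subject layout and the JSON envelope are assumptions for illustration, not a decided wire format.

```python
import json

def encode_change(view, op, row):
    """Map one incremental change to a hypothetical NATS (subject, payload) pair.

    Publishing insertions and deletions on distinct subjects lets downstream
    subscribers filter by operation with a plain subject wildcard.
    """
    if op not in ("insert", "delete"):
        raise ValueError(f"unknown change operation: {op}")
    subject = f"feldera.{view}.{op}"
    payload = json.dumps({"op": op, "row": row}).encode("utf-8")
    return subject, payload
```

The real connector would also need to decide on batching, ordering guarantees, and whether to use core NATS or JetStream for at-least-once delivery.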
-
Improved SQLancer support

Mentors: Abhinav, Mihai

SQLancer is a SQL-level fuzzer that automatically generates schemas, data, and queries to uncover logic bugs (incorrect result sets) and, in some modes, performance pathologies in database systems. It does this by repeatedly building random databases and using oracles to cross-check results. Feldera already has an active fork of SQLancer at https://github.com/feldera/sqlancer. This project continues the Feldera-specific SQLancer integration.
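To illustrate the oracle idea mentioned above, here is a minimal Python sketch of a Ternary Logic Partitioning (TLP)-style check, one of the oracle families SQLancer uses (SQLancer itself is written in Java). A predicate with three-valued logic (True / False / None for SQL NULL) splits a result set into three partitions; the partitions together must reproduce the full result, or the engine has a logic bug.

```python
from collections import Counter

def tlp_partitions(rows, predicate):
    """Split rows by the three-valued outcome of `predicate`:
    True, False, or None (modeling SQL NULL)."""
    parts = {"true": [], "false": [], "null": []}
    for r in rows:
        v = predicate(r)
        key = "null" if v is None else ("true" if v else "false")
        parts[key].append(r)
    return parts

def tlp_check(rows, predicate):
    """Oracle: the three partitions combined must equal the full result multiset."""
    parts = tlp_partitions(rows, predicate)
    combined = Counter(parts["true"]) + Counter(parts["false"]) + Counter(parts["null"])
    return combined == Counter(rows)
```

In the real fuzzer, the partitions are produced by the database under test (three rewritten queries), not computed in the harness, so a mismatch points at an engine bug.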
-
CDF Support in Delta Lake Connector

Mentors: Swanand, Ben

Description: The current Delta input connector tails the transaction log, which becomes very expensive on large MERGE overwrites (for example, on a wide attribute table): each commit must be fully unpacked even when the net row changes are small. This project adds native bidirectional CDF (Change Data Feed) support to the Delta connector, handling such massive transactions efficiently by consuming and emitting fine-grained row changes directly.
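For context, Delta's Change Data Feed tags each row with a `_change_type` column: `insert`, `delete`, `update_preimage` (the old row), or `update_postimage` (the new row). The sketch below shows how a consumer folds such rows into a keyed table; the dict-based table and the `id` key are illustrative, not the connector's actual types.

```python
def apply_cdf(table, cdf_rows):
    """Apply Delta Change Data Feed rows to an in-memory table keyed by `id`.

    `_change_type` values follow Delta CDF semantics; metadata columns
    (prefixed with `_`) are stripped before storing the row.
    """
    for row in cdf_rows:
        ct = row["_change_type"]
        key = row["id"]
        if ct in ("insert", "update_postimage"):
            table[key] = {k: v for k, v in row.items() if not k.startswith("_")}
        elif ct == "delete":
            table.pop(key, None)
        elif ct == "update_preimage":
            # The preimage carries the old value; the paired postimage row
            # applies the actual change, so nothing to do here.
            pass
        else:
            raise ValueError(f"unknown _change_type: {ct}")
    return table
```

The efficiency win the project targets is visible even in this toy: a MERGE touching millions of files but changing few rows yields only a handful of CDF rows to process.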
-
Built-in Testing and Benchmarking for Feldera Pipelines

Mentors: George, ???

Description: Feldera compiles SQL programs into data processing pipelines: programs that ingest streams of input data and stream computation outputs. This project proposes adding built-in testing and benchmarking capabilities that allow users to declare test cases with input data tables and expected output views, execute them on demand or automatically after compilation, and track pass/fail status. Test data can be defined inline or referenced through existing input connectors. Users would be able to mark tests as manual, mandatory (must pass before the pipeline starts), or as benchmarks (averaging execution statistics across multiple runs), providing a streamlined workflow for validating query behavior and measuring performance on target infrastructure. The benchmark functionality would allow precise measurement of when a computation completes, which is currently not possible.

The implementation requires several core features: end-of-input events for connectors, forced computation steps when all inputs complete, end-of-processing event detection, connector re-initialization without releasing resources, and pipeline state reset for low-latency back-to-back test execution. Output validation would accumulate insert/delete changes to compare the final view state against expectations. This approach eliminates the current alternatives of manual ad-hoc queries, duplicate pipeline configurations, or external automation scripts, instead providing integrated test organization, automatic execution tracking, and performance measurement directly within the Feldera platform.
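The output-validation step described above (accumulating insert/delete changes and comparing the final view state against expectations) can be sketched in a few lines; the function names and the `(op, row)` change representation are illustrative, not Feldera's actual API.

```python
from collections import Counter

def final_state(changes):
    """Fold a stream of (op, row) changes into the final multiset of rows,
    crediting inserts and debiting deletes."""
    state = Counter()
    for op, row in changes:
        state[row] += 1 if op == "insert" else -1
    return +state  # unary + drops rows whose counts cancelled to <= 0

def check(changes, expected_rows):
    """Pass iff the accumulated view state matches the expected multiset."""
    return final_state(changes) == Counter(expected_rows)
```

Multiset (rather than set) comparison matters here because SQL views can legitimately contain duplicate rows.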
-
Unified Connector Interface for Feldera

Mentors: George, Gerd

Description: Feldera's DBSP engine transforms SQL queries into incremental streaming circuits, with connectors handling data ingestion and output to external systems like Kafka, Delta Lake, and S3. Currently, these connectors implement similar features (fault tolerance, backfill, seeking) without a shared API, making it difficult for external contributors to create new connectors with first-class engine integration. This project proposes refactoring existing connectors into standalone Rust libraries with a unified interface that maintains high performance through static compilation, while enabling independent development and distribution of new connectors via Cargo or GitHub sources.
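As a rough shape for such an interface, here is a hypothetical sketch in Python (the actual proposal would be a Rust trait). The method names model the shared features the description lists: output (`push`), fault tolerance (`checkpoint`), and seeking (`seek`); none of them are real Feldera APIs.

```python
from abc import ABC, abstractmethod

class OutputConnector(ABC):
    """Hypothetical unified output-connector interface."""
    @abstractmethod
    def push(self, batch): ...           # deliver a batch of rows downstream
    @abstractmethod
    def checkpoint(self): ...            # return a token describing durable progress
    @abstractmethod
    def seek(self, token): ...           # roll back / resume from a checkpoint token

class MemoryConnector(OutputConnector):
    """In-memory reference implementation, handy for engine tests."""
    def __init__(self):
        self.rows = []
    def push(self, batch):
        self.rows.extend(batch)
    def checkpoint(self):
        return len(self.rows)            # token: count of durably written rows
    def seek(self, token):
        self.rows = self.rows[:token]    # discard rows written after the checkpoint
```

A trivial in-memory implementation like this doubles as a conformance baseline: every real connector (Kafka, Delta, S3) should observably behave the same way under push/checkpoint/seek sequences.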
-
Pluggable Language Compiler Framework for Feldera

Mentors: George, ???

Description: Feldera compiles SQL queries into high-performance DBSP circuits through a two-stage pipeline: SQL-to-DBSP transformation (Java/Calcite-based) followed by DBSP-to-Rust code generation. This project proposes exposing the compilation pipeline as a configurable system of Rust libraries, enabling community-developed language frontends. Contributors could create compilers that either emit SQL text (leveraging the existing SQL -> DBSP -> Rust pipeline) or generate DBSP Rust code directly, for languages where SQL semantics are a poor fit, such as dataflow DSLs, streaming-native languages, or domain-specific notations for financial or scientific computing.
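The two frontend paths described above (emit SQL text, or emit DBSP Rust directly) suggest a small dispatch layer; the toy sketch below models that idea in Python. Every name here, including the example counting DSL, is hypothetical and exists only to illustrate the registry shape.

```python
def compile_pipeline(lang, program, frontends):
    """Dispatch a source program to its registered frontend.

    Each frontend returns a (target, code) pair, where target is either
    "sql" (reuse the existing SQL -> DBSP -> Rust pipeline) or
    "dbsp-rust" (hand-generated DBSP circuit code).
    """
    target, code = frontends[lang](program)
    if target not in ("sql", "dbsp-rust"):
        raise ValueError(f"frontend for {lang!r} produced unknown target {target!r}")
    return target, code

def count_dsl_frontend(program):
    """Hypothetical frontend for a one-line counting DSL, lowered to SQL text."""
    table = program.split()[-1]  # e.g. "count events" -> "events"
    return "sql", f"SELECT COUNT(*) FROM {table}"
```

The SQL-emitting path is the cheaper one for contributors, since it inherits all of the existing optimizer and code generator; the direct path trades that for full control over the circuit.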
-
Please write down the ideas you'd be happy to mentor as a reply