[migrations] Spark-to-Feldera migration tool PoC. #5837
Draft
wilmaontherun wants to merge 31 commits into main from felderize
Commits
046f795 [migrations] Spark-to-Feldera migration tool PoC. (wilmaontherun)
bb1b24c [ci] apply automatic fixes (feldera-bot)
871d79c Intermediate progress based on Mihai's comments. (wilmaontherun)
e856080 fixed comments on skills (wilmaontherun)
b672679 Fixed all comments before we refactor skills (wilmaontherun)
3d8536f Merge skills (wilmaontherun)
219f8b7 Fixed the rest of the code w.r.t. new skill file (wilmaontherun)
83d7731 Revised doc indexing (wilmaontherun)
98f7dee merged skills (wilmaontherun)
56bd911 add --verbose flag, translate-file, combined demos, and Feldera PK/qu… (wilmaontherun)
b10f7a4 more demo files (wilmaontherun)
7d56201 revised samples and skills (wilmaontherun)
9e17a33 [ci] apply automatic fixes (feldera-bot)
3a75739 add --compiler option, fix no-compiler handling, improve example list… (wilmaontherun)
ab4f746 [ci] apply automatic fixes (feldera-bot)
6672146 fixed readme (wilmaontherun)
9975f33 [ci] apply automatic fixes (feldera-bot)
14e7cc6 Add --model option, remove OpenAI support and hardcoded compiler path (wilmaontherun)
7ac2fb9 Use sqlparse for SQL splitting, fix README inconsistencies (wilmaontherun)
6655f7c Add prompt caching and rate limit retry; skip examples on first pass (wilmaontherun)
8fa1210 Clean up code quality: fix imports, types, and consistency issues (wilmaontherun)
4358104 Fix spark_skills.md inconsistencies (wilmaontherun)
80a5e7f [ci] apply automatic fixes (feldera-bot)
7da4c30 Verify and fix spark_skills.md against Apache Spark SQL reference (wilmaontherun)
589406c Overhaul spark/data/samples: fix errors, add new patterns, remove tri… (wilmaontherun)
b53a97a Fix skills inconsistencies: QUARTER unsupported, contains/binary, pmo… (wilmaontherun)
05388f9 Rename misnamed sample files to match their content (wilmaontherun)
643ba2b Improve and expand sample demos (wilmaontherun)
d1b0c95 Fix demo files: remove unsupported patterns, add dates and arithmetic… (wilmaontherun)
d7057b9 Fix aggregations and arithmetic demos to use only supported Feldera f… (wilmaontherun)
b438037 split_part skill & base_url config (anandbraman)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| # Felderize — Spark SQL to Feldera SQL Translator | ||
|
|
||
| felderize attempts to translate Spark SQL schemas and queries into valid [Feldera](https://www.feldera.com/) SQL using LLM-based translation with optional compiler validation. | ||
|
|
||
| ## Setup | ||
|
|
||
| ```bash | ||
| cd python/felderize | ||
| python3 -m venv .venv | ||
| source .venv/bin/activate | ||
| pip install -e . | ||
| ``` | ||
|
|
||
| > **Note:** `pip install -e .` is required before running `felderize`. It registers the package and CLI command. | ||
|
|
||
| Create a `.env` file: | ||
|
|
||
| ```bash | ||
| ANTHROPIC_API_KEY=your-key-here | ||
| FELDERA_COMPILER=/path/to/sql-to-dbsp # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp | ||
| FELDERIZE_MODEL=claude-sonnet-4-5 | ||
| ``` | ||
|
|
||
| The `FELDERA_COMPILER` path is required for validation. Without it, translation still works but output SQL is not verified. You can also pass it per-command with `--compiler PATH`. | ||
|
|
||
| The compiler must be built before use (requires Java 19–21 and Maven). Run the build from the Feldera repository root: | ||
|
|
||
| ```bash | ||
| cd sql-to-dbsp-compiler | ||
| ./build.sh | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Run a built-in example | ||
|
|
||
| ```bash | ||
| # List available examples | ||
| felderize example | ||
|
|
||
| # Translate an example (validates by default) | ||
| felderize example simple | ||
|
|
||
| # Without compiler validation | ||
| felderize example simple --no-validate | ||
|
|
||
| # Log SQL submitted to the validator at each attempt | ||
| felderize example json --verbose | ||
|
|
||
| # Use a specific compiler binary | ||
| felderize example simple --compiler /path/to/sql-to-dbsp | ||
|
|
||
| # Output as JSON | ||
| felderize example simple --json-output | ||
| ``` | ||
|
|
||
| Available examples: | ||
|
|
||
| | Name | Description | | ||
| |------|-------------| | ||
| | `simple` | Date truncation, GROUP BY | | ||
| | `strings` | INITCAP, LPAD, NVL, CONCAT_WS | | ||
| | `arrays` | array_contains, size, element_at | | ||
| | `joins` | Null-safe equality (`<=>`) | | ||
| | `windows` | LAG, running SUM OVER | | ||
| | `aggregations` | COUNT DISTINCT, HAVING (includes unsupported: COLLECT_LIST, PERCENTILE_APPROX) | | ||
| | `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* | | ||
| | `topk` | ROW_NUMBER TopK, QUALIFY, DATEDIFF → TIMESTAMPDIFF *(combined file)* | | ||
|
|
||
| The JSON output contains: | ||
|
|
||
| ```json | ||
| { | ||
| "feldera_schema": "...", // translated DDL (CREATE TABLE statements) | ||
| "feldera_query": "...", // translated query (CREATE VIEW statements) | ||
| "unsupported": [...], // unsupported Spark features found | ||
| "warnings": [...], // non-fatal issues | ||
| "explanations": [...], // explanations for translation decisions | ||
| "status": "success|unsupported|error" | ||
| } | ||
| ``` | ||
|
|
||
| ### Translate your own SQL | ||
|
|
||
| Two input formats are supported: | ||
|
|
||
| **Separate schema and query files:** | ||
| ```bash | ||
| felderize translate path/to/schema.sql path/to/query.sql | ||
| felderize translate path/to/schema.sql path/to/query.sql --validate | ||
|
||
| ``` | ||
|
|
||
| **Single combined file** (CREATE TABLE and CREATE VIEW statements in one file): | ||
| ```bash | ||
| felderize translate-file path/to/combined.sql | ||
| felderize translate-file path/to/combined.sql --validate | ||
| ``` | ||
|
|
||
| > **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler. | ||
|
|
||
| Both commands accept: | ||
| - `--validate` to validate output against the Feldera compiler (opt-in for `translate` and `translate-file`; `example` validates by default and takes `--no-validate` to skip) | ||
| - `--compiler PATH` to specify the path to the Feldera compiler binary (overrides `FELDERA_COMPILER` env var) | ||
| - `--model MODEL` to specify the LLM model (overrides `FELDERIZE_MODEL` env var) | ||
| - `--no-docs` to disable Feldera SQL reference docs in the prompt | ||
| - `--force-docs` to include docs on the first pass instead of only as a fallback | ||
| - `--verbose` to log the SQL submitted to the validator at each repair attempt | ||
| - `--json-output` to output results as JSON | ||
|
|
||
| ## Configuration | ||
|
|
||
| Environment variables (set in `.env`): | ||
|
|
||
| | Variable | Description | Default | | ||
| |---|---|---| | ||
| | `ANTHROPIC_API_KEY` | Anthropic API key | (required) | | ||
| | `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required unless `--model` is passed) | | ||
| | `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) | | ||
|
|
||
| ## How it works | ||
|
|
||
| 1. Loads translation rules from a single skill file (`spark/data/skills/spark_skills.md`) | ||
| 2. Sends Spark SQL to the LLM with rules, validated examples, and relevant Feldera SQL documentation (from `docs.feldera.com/docs/sql/`) | ||
| 3. Parses the translated Feldera SQL from the LLM response | ||
| 4. Optionally validates output against the Feldera compiler, retrying with error feedback if needed | ||
|
|
||
| ## Support | ||
|
|
||
| Contact us at support@feldera.com for assistance with unsupported Spark SQL features. | ||
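
Reader's note on step 4 of "How it works" above: the validate-and-retry behaviour can be sketched as a small loop that feeds the compiler error back into the next translation attempt. This is not the code in this PR. `translate_with_repair`, `RepairResult`, the two callables, and `max_attempts` are hypothetical names chosen for illustration, and the real tool's retry limit and result fields may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RepairResult:
    """Outcome of a translate/validate loop (hypothetical shape)."""
    sql: str
    status: str  # "success" or "error"
    attempts: int
    errors: list = field(default_factory=list)


def translate_with_repair(
    spark_sql: str,
    translate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> RepairResult:
    """Translate, validate, and retry with the last error as feedback.

    translate(spark_sql, feedback) returns candidate Feldera SQL;
    feedback is None on the first pass.
    validate(candidate) returns None if the candidate compiles,
    otherwise an error message.
    """
    errors = []
    feedback = None
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = translate(spark_sql, feedback)
        error = validate(candidate)
        if error is None:
            return RepairResult(candidate, "success", attempt, errors)
        # Keep every error for reporting; only the latest is fed back.
        errors.append(error)
        feedback = error
    return RepairResult(candidate, "error", max_attempts, errors)
```

In a real run `translate` would wrap the LLM call and `validate` would invoke the compiler binary. Passing both as callables keeps the loop testable with stubs and makes the `--no-validate` path trivial: skip the loop and return the first candidate.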
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| [build-system] | ||
| requires = ["setuptools>=68.0"] | ||
| build-backend = "setuptools.build_meta" | ||
|
|
||
| [project] | ||
| name = "felderize" | ||
| version = "0.1.0" | ||
| description = "Spark SQL to Feldera SQL translator agent" | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| "anthropic>=0.39.0", | ||
| "sqlparse>=0.5.0", | ||
| "click>=8.1.0", | ||
| "pyyaml>=6.0", | ||
| "python-dotenv>=1.0.0", | ||
| ] | ||
|
|
||
| [tool.setuptools.package-dir] | ||
| felderize = "spark" | ||
|
|
||
| [tool.setuptools.package-data] | ||
| felderize = [ | ||
| "data/skills/*.md", | ||
| "data/skills/**/*.md", | ||
| "data/samples/*.md", | ||
| "data/demo/*.sql", | ||
| ] | ||
|
|
||
| [project.scripts] | ||
| felderize = "felderize.cli:cli" |
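
Reader's note on the `[project.scripts]` entry above: `felderize = "felderize.cli:cli"` follows the standard `module:attr` entry-point convention, so the generated `felderize` command imports the module `felderize.cli` and calls its `cli` attribute. Given the `package-dir` mapping, that module presumably lives under `spark/`. The sketch below is a simplified illustration of how such a string resolves, not the installer's actual code; `resolve_entry_point` is a hypothetical helper, and it is demonstrated on standard-library modules only.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a 'module:attr' entry-point string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    if not module_name or not attr_path:
        raise ValueError(f"expected 'module:attr', got {spec!r}")
    obj: Any = importlib.import_module(module_name)
    # The attribute part may be dotted, e.g. 'pkg.mod:Class.method'.
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj
```

With the package installed via `pip install -e .`, `resolve_entry_point("felderize.cli:cli")` would return the object the console script invokes. If that call raises `ModuleNotFoundError`, the `package-dir` mapping or the install step is the first thing to check.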