
[migrations] Spark-to-Feldera migration tool PoC #5837

Draft
wilmaontherun wants to merge 31 commits into main from felderize

Conversation

@wilmaontherun
Collaborator

A CLI tool that uses an LLM to translate Spark SQL programs to Feldera SQL and syntactically validate the output.

Requires Anthropic API key in felderize/.env

Describe Manual Test Plan

No automated tests yet. Tested manually using examples in the demo folder.

@wilmaontherun wilmaontherun force-pushed the felderize branch 3 times, most recently from f878931 to 3f816fb on March 16, 2026 at 18:48
CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.

Signed-off-by: Wilma <wilmaontherun@gmail.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
@mihaibudiu
Contributor

We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL.

wilmaontherun and others added 13 commits March 19, 2026 10:26
addressed remaining comments
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
- Added --model CLI option to translate, translate-file, and example commands
- Model and compiler path now read exclusively from .env / CLI flags
- Removed OpenAI provider support (untested)
- Removed hardcoded default compiler path
- Updated README for consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wilmaontherun and others added 2 commits March 19, 2026 20:17
- Replaced custom semicolon scanner with sqlparse.split() — handles string literals, comments, block comments correctly
- Added sqlparse>=0.5.0 to dependencies, removed openai dependency
- Fixed README: clarified FELDERA_COMPILER comment (not a default, just repo location)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
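The switch to sqlparse.split() fixes a real failure mode. A toy sketch (not the PR's code, and deliberately ignoring comments and escaped quotes) of why naive semicolon splitting breaks:

```python
def split_quote_aware(sql: str) -> list[str]:
    """Toy statement splitter that only respects single-quoted literals.
    The PR instead uses sqlparse.split(), which also handles -- line
    comments and /* */ block comments."""
    statements, buf, in_string = [], [], False
    for ch in sql:
        if ch == "'":
            in_string = not in_string
        if ch == ";" and not in_string:
            statements.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    if "".join(buf).strip():
        statements.append("".join(buf).strip())
    return statements

sql = "SELECT ';' AS sep; SELECT 2"
print(sql.split(";"))          # naive: 3 fragments, the literal is torn apart
print(split_quote_aware(sql))  # ["SELECT ';' AS sep", "SELECT 2"]
```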
- llm.py: wrap system prompt in cache_control ephemeral block to enable
  Anthropic prompt caching; add retry with exponential backoff on rate limits
- translator.py: omit examples on first translation attempt (skills only)
  to reduce token usage and latency (~20s → ~4s for simple queries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
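The rate-limit retry described for llm.py can be sketched generically; the names below are illustrative, not the PR's actual code:

```python
import random
import time

def with_backoff(call, max_attempts=5, base=1.0, retry_on=(Exception,)):
    """Retry `call` with exponential backoff plus jitter, the pattern the
    commit describes for Anthropic rate-limit errors. Hypothetical sketch."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # sleep base * 2^attempt, with a little jitter to avoid thundering herd
            time.sleep(base * (2 ** attempt) + random.random() * 0.1)
```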
wilmaontherun and others added 10 commits March 21, 2026 16:56
- docs.py: replace module-level _FUNC_ANCHORS with per-dir _get_cats_and_anchors() cache
- llm.py: move imports to top level, add unreachable guard
- translator.py: move sqlparse import to top level, fix LLMClient type annotation, remove double-strip
- feldera_client.py: keep f.name usage inside with block
- skills.py: remove redundant intermediate sort
- cli.py: remove untested batch command, fix Status import, add missing --compiler/--model to all commands
- pyproject.toml: remove unused httpx dependency
- README.md: update to reflect removed batch command and full options list
- spark_skills.md: add rewrite rules and unsupported constructs from test investigation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- STRING/TEXT type mapping: STRING→VARCHAR, TEXT→VARCHAR
- Remove duplicate HEX/UNHEX from Hashing section
- Remove CAST(INTERVAL SECOND AS DECIMAL) from Unsupported (contradicted Supported section)
- Window: unify ROWS/RANGE BETWEEN into one Unsupported entry
- split(str,delim,limit): clarify 2-arg form is supported
- LN/LOG10: "runtime error" → "drops the row (WorkerPanic)" for negative input
- TIMESTAMP_NTZ: clarify "replace with TIMESTAMP in DDL"
- FIRST_VALUE/LAST_VALUE notes consistent with Window unsupported section
- Scalar subquery rule: fix incorrect "subquery with FROM → mark unsupported"
- Remove unexplained CREATE TYPE + jsonstring_as_ hint
- trunc(d,'Q'): move to Unsupported (DATE_TRUNC QUARTER fails at runtime)
- make_timestamp: move to Rewritable with PARSE_TIMESTAMP rewrite
- from_unixtime: use TIMESTAMPADD directly (consistent with to_timestamp)
- encode/decode: remove misleading "IS rewritable as CASE WHEN" note
- width_bucket: remove stray extra column
- SIGN: remove misleading "Input/output: DECIMAL" note
- date_format: handle TIMESTAMP input via CAST to DATE
- log(base,x): add examples to reinforce arg swap rule
- [GBD-ARRAY-ORDER]: new GBD entry; annotate ARRAY_UNION/ARRAY_EXCEPT
- Bitwise scalar operators moved to Unsupported section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
- Add ground truth note: all signatures from spark.apache.org/docs/latest/api/sql/index.html
- Fix unix_millis/unix_micros: take timestamp arg (not no-arg current-time)
- Fix pmod: unified formula MOD(MOD(a,ABS(b))+ABS(b),ABS(b)) for all divisor signs
- Move try_divide/try_add/try_subtract/try_multiply to unsupported (semantic mismatch)
- Move map_entries to unsupported (returns array of structs, no Feldera equivalent)
- Fix from_unixtime: note STRING vs TIMESTAMP type difference, mark fmt-arg as unsupported
- Fix posexplode: subtract 1 from ORDINALITY (Spark 0-based, SQL 1-based)
- Add translate warning: REGEXP_REPLACE treats chars as regex patterns
- Fix lpad/rpad: document optional pad arg (defaults to space)
- Fix months_between: add roundOff note, precise fractional example
- Fix trunc WEEK: move to unsupported (same Sunday/Monday mismatch as date_trunc WEEK)
- Move try_* from String to Math in unsupported section
- Add trunc YYYY/MM/MON aliases, to_date using PARSE_TIMESTAMP

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
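The posexplode fix above rests on an off-by-one between dialects: Spark's posexplode emits 0-based positions while SQL's UNNEST ... WITH ORDINALITY is 1-based. A quick check of the adjustment:

```python
tags = ["a", "b", "c"]

# Spark posexplode(tags) emits (pos, col) with 0-based pos
spark_rows = list(enumerate(tags))

# SQL UNNEST ... WITH ORDINALITY numbers elements from 1
ordinality_rows = list(enumerate(tags, start=1))

# hence the translation subtracts 1 from the ordinality column
adjusted = [(i - 1, v) for i, v in ordinality_rows]
assert adjusted == spark_rows
```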
…vial entries

- Fix json_extraction: lateral alias not supported in Feldera; repeat PARSE_JSON per field
- Fix datediff: use correct DATEDIFF(DAY, start, end) instead of TIMESTAMPDIFF
- Replace null_safe_equality with LOG argument order reversal (critical gotcha)
- Replace nvl_coalesce with LPAD/RPAD rewrite (no native support in Feldera)
- Improve array_map_functions: add element_at(map,key) → map[key], CARDINALITY NULL note
- Add explode_unnest: LATERAL VIEW explode/posexplode/inline → UNNEST patterns
- Add json_extraction: get_json_object → PARSE_JSON + bracket syntax, CTE for GROUP BY
- Remove array_lambda (unsupported-only, no rewrite value)
- Remove row_number_topk (trivial CREATE VIEW wrapper, no translation needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
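The get_json_object note above hinges on a type detail: Spark's get_json_object returns scalars as STRING, so the PARSE_JSON translation needs explicit CASTs for numeric fields. A toy model of that behavior (illustrative, not the skill's code):

```python
import json

def get_json_object_like(payload: str, field: str):
    """Toy model of Spark get_json_object('$.field') for top-level scalar
    fields: every scalar comes back as a STRING, which is why translated
    queries must CAST numeric fields after extraction."""
    value = json.loads(payload).get(field)
    return None if value is None else str(value)

row = '{"user_id": 42, "amount": 9.5}'
print(get_json_object_like(row, "user_id"))  # "42" -- a string, not an int
```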
…d, to_date, date_format

- Remove trunc(d,'Q') from supported section — DATE_TRUNC(QUARTER) fails at
  runtime in Feldera (date_trunc_quarter_DateN missing); unsupported entry at
  line 716 was already correct
- Fix contains(binary,...) — POSITION rewrite works for binary args; only
  boolean args are truly unsupported
- Fix pmod formula to CASE WHEN MOD(a,b)<0 AND b>0 THEN MOD(a,b)+b ELSE
  MOD(a,b) END (empirically verified against all sign combinations)
- Fix to_date: use PARSE_DATE not PARSE_TIMESTAMP (panics on date-only strings)
- Fix date_format: FORMAT_DATE truncates time; use CONCAT+EXTRACT for time
  components; FORMAT_TIMESTAMP does not exist; LPAD does not work here
- Fix JSON lateral alias note: Feldera does NOT support lateral aliases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
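The pmod CASE WHEN rewrite can be sanity-checked in Python. SQL's MOD uses the truncated-division remainder (sign follows the dividend), while Spark's pmod is non-negative for positive divisors; the check below covers only positive divisors, where Spark's result matches Python's `%`:

```python
import math

def sql_mod(a: int, b: int) -> int:
    """SQL MOD: truncated-division remainder (sign follows the dividend)."""
    return int(math.fmod(a, b))

def pmod_rewrite(a: int, b: int) -> int:
    """The commit's rewrite:
    CASE WHEN MOD(a,b) < 0 AND b > 0 THEN MOD(a,b) + b ELSE MOD(a,b) END"""
    m = sql_mod(a, b)
    return m + b if m < 0 and b > 0 else m

# For b > 0, Python's % is the non-negative modulus, like Spark's pmod
for a in (-7, -6, 0, 5, 7):
    assert pmod_rewrite(a, 3) == a % 3
```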
- nvl_coalesce.md → lpad_rpad.md (contains LPAD/RPAD rewrite)
- null_safe_equality.md → log_arg_order.md (contains LOG arg order reversal)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- left_semi_join: add CREATE TABLE schemas, fix WHERE→ON clause placement
- to_date_date_format (new): PARSE_DATE vs PARSE_TIMESTAMP, FORMAT_DATE+EXTRACT
  pattern for time components, FORMAT_TIMESTAMP nonexistence
- window_functions (new): ROW_NUMBER TopK, LAG/LEAD, partition SUM; notes on
  ROWS/RANGE frames unsupported and TopK outer-WHERE requirement
- pmod_try_arithmetic (new): pmod CASE WHEN rewrite, try_divide NULL approximation,
  try_subtract direct translation with overflow warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
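The ROW_NUMBER TopK pattern in the new window_functions skill can be sketched outside SQL: number rows within each partition by descending score, then keep `rn <= k` in an outer WHERE. Illustrative only; names are not from the PR:

```python
from itertools import groupby

def top_k_per_group(rows, key, score, k=3):
    """Sketch of ROW_NUMBER() OVER (PARTITION BY key ORDER BY score DESC)
    followed by an outer WHERE rn <= k."""
    rows = sorted(rows, key=lambda r: (key(r), -score(r)))
    out = []
    for _, group in groupby(rows, key=key):
        for rn, row in enumerate(group, start=1):  # ROW_NUMBER is 1-based
            if rn <= k:                            # outer WHERE rn <= k
                out.append(row)
    return out
```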
… examples

- windows_query: remove ROWS BETWEEN frame (unsupported in Feldera)
- aggregations_query: replace PERCENTILE_APPROX with STDDEV (no Feldera equivalent)
- json_combined: replace $.items[0] array path with scalar path (array paths unsupported)
- topk_combined: replace Feldera 3-arg DATEDIFF with Spark 2-arg datediff (Spark input)
- Add dates_combined: to_date / date_format Spark input demo
- Add arithmetic_combined: pmod / try_divide / try_subtract Spark input demo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eatures

- aggregations: remove COLLECT_LIST (unsupported), STDDEV → SUM
- arithmetic: replace try_divide/try_subtract with NULLIF division and direct subtraction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mihaibudiu mihaibudiu requested a review from anandbraman March 25, 2026 17:04
@mihaibudiu
Contributor

@anandbraman I think we should both review this PR

@mihaibudiu mihaibudiu left a comment

Maybe @anandbraman and I should make the fixes I recommend and take over this project.

@@ -0,0 +1,129 @@
# Felderize — Spark SQL to Feldera SQL Translator
Contributor

I expect we will have multiple felderize programs, one for each input dialect.
We should plan to organize them in this way. At least the directory, if not the program name, should reflect the source dialect.

```sql
CREATE TABLE orders (
order_id BIGINT, customer_id BIGINT, region STRING,
amount DECIMAL(12,2), status STRING, created_at TIMESTAMP
Contributor

this actually works well because this is a DECIMAL, but our INTEGER aggregations, particularly AVG, would require casting to DECIMAL to get non-truncated results.

Not sure where this could be described, though.
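A quick model of the truncation the comment describes, assuming AVG over an INTEGER column uses integer division (the DECIMAL column in the example sidesteps this):

```python
from fractions import Fraction

values = [1, 2, 2]  # an INTEGER column

# integer AVG truncates: SUM / COUNT in integer arithmetic
truncated_avg = sum(values) // len(values)

# CAST(col AS DECIMAL) before AVG keeps the fractional part
exact_avg = Fraction(sum(values), len(values))

print(truncated_avg, exact_avg)  # 1 vs 5/3
```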


-- NOTE: CARDINALITY returns NULL for NULL input; Spark size() returns -1.
SELECT user_id,
CARDINALITY(tags) AS tag_count,
Contributor

Should this be COALESCE(CARDINALITY(tags), -1) for an exact translation?
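The suggested exact translation is easy to model: Spark's size() returns -1 for NULL input, while CARDINALITY returns NULL, so COALESCE supplies the -1. Illustrative sketch:

```python
def spark_size(arr):
    """Spark size(col): -1 for NULL input (per the note in the demo)."""
    return -1 if arr is None else len(arr)

def coalesce_cardinality(arr):
    """COALESCE(CARDINALITY(arr), -1): CARDINALITY is NULL for NULL input."""
    cardinality = None if arr is None else len(arr)
    return cardinality if cardinality is not None else -1

assert coalesce_cardinality(None) == spark_size(None) == -1
```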


Feldera uses VARIANT type. Parse with PARSE_JSON, then access fields with bracket syntax.
get_json_object returns STRING only — numbers/booleans need CAST to correct type.
Feldera does NOT support lateral aliases — repeat PARSE_JSON(col) per field or use a CTE.
Contributor

Feldera DOES support lateral aliases; these workarounds are not necessary.


Feldera:
```sql
-- Repeat PARSE_JSON per field (no lateral alias in Feldera)
Contributor

Yes, we do have it, so SELECT PARSE_JSON(payload) as X, X['user_id'], X['amount'] will work

|----------|-------|
| `map_entries(m)` | Returns an array of `{key, value}` structs — no equivalent in Feldera |
| `map_concat(m1, m2)` | No equivalent |
| `map_contains_key(m, k)` | No equivalent |
Contributor

this is supported


| Function | Notes |
|----------|-------|
| `try_divide(a, b)` | Spark always returns DOUBLE and returns NULL on divide-by-zero; Feldera integer division returns INT and drops the row on divide-by-zero — not safely rewritable → [GBD-DIV-ZERO] |
Contributor

We never drop any rows; we may panic.
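Spark's try_divide semantics, and the common NULLIF approximation mentioned elsewhere in the PR, can be sketched as follows (illustrative model, not the skill's text):

```python
def try_divide(a, b):
    """Model of Spark try_divide: the result is DOUBLE, and divide-by-zero
    yields NULL instead of an error. The usual plain-SQL approximation is
    a / NULLIF(b, 0)."""
    if b == 0:
        return None      # NULLIF(b, 0) makes the divisor NULL
    return a / b         # Python '/' is float division, like Spark's DOUBLE

print(try_divide(10, 4))  # 2.5
print(try_divide(10, 0))  # None
```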

@click.option(
"--verbose", is_flag=True, help="Log SQL submitted to validator at each attempt"
)
def example(
Contributor

What is the point of the examples? Are they for humans or for the LLM?
Isn't the skills file sufficient for the LLM?

Collaborator Author

We do not use examples now; I removed them from the flow to make sure the skills are as complete as possible. At some point, we will add them back.


_CONTACT_MESSAGE = (
"\n Some Spark SQL features are not yet supported in Feldera.\n"
" Contact us at support@feldera.com to request support for these features."
Contributor

or consider filing an issue at github.com/feldera/feldera/issues

] = {}


def _get_cats_and_anchors(
Contributor

for a dog person you like cats too much


| Function | Notes |
|----------|-------|
| `split_part(str, delim, n)` | Feldera's `SPLIT_PART` treats the delimiter as a regex — special chars like `.` match any character and produce wrong results. Negative indices not supported. Always mark unsupported. |
Contributor

Minor correction: Feldera's split_part delimiter is not a regex.

[image attachment]
