Draft
31 commits
046f795
[migrations] Spark-to-Feldera migration tool PoC.
wilmaontherun Mar 16, 2026
bb1b24c
[ci] apply automatic fixes
feldera-bot Mar 16, 2026
871d79c
Intermediate progress based on Mihai's comments.
wilmaontherun Mar 19, 2026
e856080
fixed comments on skills
wilmaontherun Mar 19, 2026
b672679
Fixed all comments before we refactor skills
wilmaontherun Mar 19, 2026
3d8536f
Merge skills
wilmaontherun Mar 19, 2026
219f8b7
Fixed the rest of the code w.r.t. new skill file
wilmaontherun Mar 19, 2026
83d7731
Revised doc indexing
wilmaontherun Mar 19, 2026
98f7dee
merged skills
wilmaontherun Mar 19, 2026
56bd911
add --verbose flag, translate-file, combined demos, and Feldera PK/qu…
wilmaontherun Mar 19, 2026
b10f7a4
more demo files
wilmaontherun Mar 19, 2026
7d56201
revised samples and skills
wilmaontherun Mar 20, 2026
9e17a33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
3a75739
add --compiler option, fix no-compiler handling, improve example list…
wilmaontherun Mar 20, 2026
ab4f746
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
6672146
fixed readme
wilmaontherun Mar 20, 2026
9975f33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
14e7cc6
Add --model option, remove OpenAI support and hardcoded compiler path
wilmaontherun Mar 20, 2026
7ac2fb9
Use sqlparse for SQL splitting, fix README inconsistencies
wilmaontherun Mar 20, 2026
6655f7c
Add prompt caching and rate limit retry; skip examples on first pass
wilmaontherun Mar 20, 2026
8fa1210
Clean up code quality: fix imports, types, and consistency issues
wilmaontherun Mar 21, 2026
4358104
Fix spark_skills.md inconsistencies
wilmaontherun Mar 24, 2026
80a5e7f
[ci] apply automatic fixes
feldera-bot Mar 24, 2026
7da4c30
Verify and fix spark_skills.md against Apache Spark SQL reference
wilmaontherun Mar 24, 2026
589406c
Overhaul spark/data/samples: fix errors, add new patterns, remove tri…
wilmaontherun Mar 24, 2026
b53a97a
Fix skills inconsistencies: QUARTER unsupported, contains/binary, pmo…
wilmaontherun Mar 25, 2026
05388f9
Rename misnamed sample files to match their content
wilmaontherun Mar 25, 2026
643ba2b
Improve and expand sample demos
wilmaontherun Mar 25, 2026
d1b0c95
Fix demo files: remove unsupported patterns, add dates and arithmetic…
wilmaontherun Mar 25, 2026
d7057b9
Fix aggregations and arithmetic demos to use only supported Feldera f…
wilmaontherun Mar 25, 2026
b438037
split_part skill & base_url config
anandbraman Mar 27, 2026
129 changes: 129 additions & 0 deletions python/felderize/README.md
@@ -0,0 +1,129 @@
# Felderize — Spark SQL to Feldera SQL Translator

`felderize` attempts to translate Spark SQL schemas and queries into valid [Feldera](https://www.feldera.com/) SQL, using an LLM for the translation and, optionally, the Feldera compiler to validate the result.

## Setup

```bash
cd python/felderize
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

> **Note:** `pip install -e .` is required before running `felderize`. It registers the package and CLI command.

Create a `.env` file:

```bash
ANTHROPIC_API_KEY=your-key-here
FELDERA_COMPILER=/path/to/sql-to-dbsp # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp
FELDERIZE_MODEL=claude-sonnet-4-5
```

The `FELDERA_COMPILER` path is required for validation. Without it, translation still works but output SQL is not verified. You can also pass it per-command with `--compiler PATH`.

The compiler must be built before use (requires Java 19–21 and Maven). From the root of the Feldera repository:

```bash
cd sql-to-dbsp-compiler
./build.sh
```

## Usage

### Run a built-in example

```bash
# List available examples
felderize example

# Translate an example (validates by default)
felderize example simple

# Without compiler validation
felderize example simple --no-validate

# Log SQL submitted to the validator at each attempt
felderize example json --verbose

# Use a specific compiler binary
felderize example simple --compiler /path/to/sql-to-dbsp

# Output as JSON
felderize example simple --json-output
```

Available examples:

| Name | Description |
|------|-------------|
| `simple` | Date truncation, GROUP BY |
| `strings` | INITCAP, LPAD, NVL, CONCAT_WS |
| `arrays` | array_contains, size, element_at |
| `joins` | Null-safe equality (`<=>`) |
| `windows` | LAG, running SUM OVER |
| `aggregations` | COUNT DISTINCT, HAVING (includes unsupported: COLLECT_LIST, PERCENTILE_APPROX) |
| `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* |
| `topk` | ROW_NUMBER TopK, QUALIFY, DATEDIFF → TIMESTAMPDIFF *(combined file)* |

The JSON output contains:

```json
{
  "feldera_schema": "...",   // translated DDL (CREATE TABLE statements)
  "feldera_query": "...",    // translated query (CREATE VIEW statements)
  "unsupported": [...],      // unsupported Spark features found
  "warnings": [...],         // non-fatal issues
  "explanations": [...],     // explanations for translation decisions
  "status": "success|unsupported|error"
}
```
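
As a rough illustration of how a script might consume this output, the sketch below parses a result and branches on `status`. It assumes the real output is strict JSON (without the `//` annotations shown above) and that `unsupported` and `warnings` are lists of strings; the sample values and the `summarize` helper are invented for the example.

```python
import json

# Hypothetical sample of what `--json-output` might print; the values are invented.
sample_output = """
{
  "feldera_schema": "CREATE TABLE t (id INT);",
  "feldera_query": "CREATE VIEW v AS SELECT id FROM t;",
  "unsupported": [],
  "warnings": ["output was not validated"],
  "explanations": [],
  "status": "success"
}
"""


def summarize(raw: str) -> str:
    """Return a one-line summary of a felderize JSON result."""
    result = json.loads(raw)
    status = result["status"]
    if status == "success":
        return f"ok ({len(result['warnings'])} warning(s))"
    if status == "unsupported":
        return "unsupported: " + ", ".join(result["unsupported"])
    return f"failed with status {status!r}"


print(summarize(sample_output))
```

In practice the string would come from the command's standard output, for example via a shell pipe or `subprocess.run`.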

### Translate your own SQL

Two input formats are supported:

**Separate schema and query files:**
```bash
felderize translate path/to/schema.sql path/to/query.sql
felderize translate path/to/schema.sql path/to/query.sql --validate
```

**Single combined file** (CREATE TABLE and CREATE VIEW statements in one file):
```bash
felderize translate-file path/to/combined.sql
felderize translate-file path/to/combined.sql --validate
```

> **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler.
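
To make the combined-file format concrete, here is a deliberately naive sketch of separating schema statements from query statements. It is not the tool's implementation (the commit history indicates felderize uses `sqlparse` for splitting); `split_combined` is a hypothetical helper that splits on semicolons and will mis-handle semicolons inside string literals or statements that begin with a comment.

```python
from typing import List, Tuple


def split_combined(sql: str) -> Tuple[List[str], List[str]]:
    """Split a combined file into (CREATE TABLE statements, other statements)."""
    tables: List[str] = []
    views: List[str] = []
    for chunk in sql.split(";"):
        stmt = chunk.strip()
        if not stmt:
            continue  # skip the empty tail after the last semicolon
        # Compare the first two keywords, ignoring case and extra whitespace.
        head = " ".join(stmt.upper().split()[:2])
        if head == "CREATE TABLE":
            tables.append(stmt + ";")
        else:
            views.append(stmt + ";")
    return tables, views


combined = "CREATE TABLE t (id INT);\nCREATE VIEW v AS SELECT id FROM t;\n"
tables, views = split_combined(combined)
print(tables)
print(views)
```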

Both `translate` and `translate-file` accept:
- `--validate` to validate output against the Feldera compiler (opt-in for these two commands; `example` validates by default and takes `--no-validate` to skip)
- `--compiler PATH` to specify the path to the Feldera compiler binary (overrides `FELDERA_COMPILER` env var)
- `--model MODEL` to specify the LLM model (overrides `FELDERIZE_MODEL` env var)
- `--no-docs` to disable Feldera SQL reference docs in the prompt
- `--force-docs` to include docs on the first pass instead of only as a fallback
- `--verbose` to log the SQL submitted to the validator at each repair attempt
- `--json-output` to output results as JSON

## Configuration

Environment variables (set in `.env`):

| Variable | Description | Default |
|---|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key | (required) |
| `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (none; required) |
| `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
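
The precedence described above (a command-line flag overrides the corresponding environment variable) can be sketched as follows. This is only an illustration of the rule, not felderize's actual code; `resolve_setting` and the sample paths are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the CLI flag; otherwise fall back to the environment variable."""
    if cli_value:
        return cli_value
    # Treat an empty variable the same as an unset one.
    return env.get(env_name) or None


# A stand-in for the process environment, so the example has no side effects.
fake_env = {"FELDERIZE_MODEL": "claude-sonnet-4-5"}

model = resolve_setting(None, "FELDERIZE_MODEL", fake_env)
compiler = resolve_setting("/opt/sql-to-dbsp", "FELDERA_COMPILER", fake_env)
missing = resolve_setting(None, "FELDERA_COMPILER", fake_env)
print(model, compiler, missing)
```

A `None` result for `FELDERA_COMPILER` corresponds to the case where translation can still run but validation cannot.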

## How it works

1. Loads translation rules from a single skill file (`spark/data/skills/spark_skills.md`)
2. Sends Spark SQL to the LLM with rules, validated examples, and relevant Feldera SQL documentation (from `docs.feldera.com/docs/sql/`)
3. Parses the translated Feldera SQL from the LLM response
4. Optionally validates output against the Feldera compiler, retrying with error feedback if needed
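
The validate-and-retry behavior in step 4 can be pictured as a small loop. The sketch below is an assumption about the general shape only: the function names, the callback signatures, and the limit of three attempts are all hypothetical, and the LLM and compiler are replaced by toy stand-ins.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callback types; these are not felderize's real API.
Translator = Callable[[str, Optional[str]], str]  # (spark_sql, last_error) -> feldera_sql
Validator = Callable[[str], Optional[str]]        # feldera_sql -> error message, or None if valid


def translate_with_repair(
    spark_sql: str,
    translate: Translator,
    validate: Validator,
    max_attempts: int = 3,  # assumed limit, chosen for the example
) -> Tuple[str, List[str]]:
    """Translate, validate, and retry with the compiler error as feedback."""
    errors: List[str] = []
    last_error: Optional[str] = None
    feldera_sql = ""
    for _ in range(max_attempts):
        feldera_sql = translate(spark_sql, last_error)
        last_error = validate(feldera_sql)
        if last_error is None:
            return feldera_sql, errors  # compiler accepted the output
        errors.append(last_error)
    return feldera_sql, errors  # still failing after the last attempt


# Toy stand-ins: the "LLM" only fixes NVL once it has seen an error,
# and the "compiler" rejects any SQL that still contains NVL.
def fake_translate(sql: str, error: Optional[str]) -> str:
    return sql.replace("NVL", "COALESCE") if error else sql


def fake_validate(sql: str) -> Optional[str]:
    return "unknown function NVL" if "NVL" in sql else None


sql, errs = translate_with_repair("SELECT NVL(a, 0) FROM t", fake_translate, fake_validate)
print(sql)
print(errs)
```

Passing the translator and validator in as callbacks keeps the loop testable without an API key or a compiler build.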

## Support

Contact us at support@feldera.com for assistance with unsupported Spark SQL features.
30 changes: 30 additions & 0 deletions python/felderize/pyproject.toml
@@ -0,0 +1,30 @@
[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"

[project]
name = "felderize"
version = "0.1.0"
description = "Spark SQL to Feldera SQL translator agent"
requires-python = ">=3.10"
dependencies = [
    "anthropic>=0.39.0",
    "sqlparse>=0.5.0",
    "click>=8.1.0",
    "pyyaml>=6.0",
    "python-dotenv>=1.0.0",
]

[tool.setuptools.package-dir]
felderize = "spark"

[tool.setuptools.package-data]
felderize = [
    "data/skills/*.md",
    "data/skills/**/*.md",
    "data/samples/*.md",
    "data/demo/*.sql",
]

[project.scripts]
felderize = "felderize.cli:cli"