feldera · wilmaontherun · Mar 16, 2026 · Mar 16, 2026 · Mar 19, 2026 · Mar 19, 2026
diff --git a/python/felderize/README.md b/python/felderize/README.md
@@ -0,0 +1,129 @@
+# Felderize — Spark SQL to Feldera SQL Translator
+
+felderize attempts to translate Spark SQL schemas and queries into valid [Feldera](https://www.feldera.com/) SQL using LLM-based translation with optional compiler validation.
+
+## Setup
+
+```bash
+cd python/felderize
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e .
+```
+
+> **Note:** `pip install -e .` is required before running `felderize`. It registers the package and CLI command.
+
+Create a `.env` file:
+
+```bash
+ANTHROPIC_API_KEY=your-key-here
+FELDERA_COMPILER=/path/to/sql-to-dbsp  # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp
+FELDERIZE_MODEL=claude-sonnet-4-5
+```
+
+The `FELDERA_COMPILER` path is required for validation. Without it, translation still works but output SQL is not verified. You can also pass it per-command with `--compiler PATH`.
+
+The compiler must be built before use (requires Java 19–21 and Maven):
+
+```bash
+cd sql-to-dbsp-compiler
+./build.sh
+```
+
+## Usage
+
+### Run a built-in example
+
+```bash
+# List available examples
+felderize example
+
+# Translate an example (validates by default)
+felderize example simple
+
+# Without compiler validation
+felderize example simple --no-validate
+
+# Log SQL submitted to the validator at each attempt
+felderize example json --verbose
+
+# Use a specific compiler binary
+felderize example simple --compiler /path/to/sql-to-dbsp
+
+# Output as JSON
+felderize example simple --json-output
+```
+
+Available examples:
+
+| Name | Description |
+|------|-------------|
+| `simple` | Date truncation, GROUP BY |
+| `strings` | INITCAP, LPAD, NVL, CONCAT_WS |
+| `arrays` | array_contains, size, element_at |
+| `joins` | Null-safe equality (`<=>`) |
+| `windows` | LAG, running SUM OVER |
+| `aggregations` | COUNT DISTINCT, HAVING (includes unsupported: COLLECT_LIST, PERCENTILE_APPROX) |
+| `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* |
+| `topk` | ROW_NUMBER TopK, QUALIFY, DATEDIFF → TIMESTAMPDIFF *(combined file)* |
+
+The JSON output contains:
+
+```json
+{
+  "feldera_schema": "...",   // translated DDL (CREATE TABLE statements)
+  "feldera_query": "...",    // translated query (CREATE VIEW statements)
+  "unsupported": [...],      // unsupported Spark features found
+  "warnings": [...],         // non-fatal issues
+  "explanations": [...],     // explanations for translation decisions
+  "status": "success|unsupported|error"
+}
+```
+
+### Translate your own SQL
+
+Two input formats are supported:
+
+**Separate schema and query files:**
+```bash
+felderize translate path/to/schema.sql path/to/query.sql
+felderize translate path/to/schema.sql path/to/query.sql --validate
+```
+
+**Single combined file** (CREATE TABLE and CREATE VIEW statements in one file):
+```bash
+felderize translate-file path/to/combined.sql
+felderize translate-file path/to/combined.sql --validate
+```
+
+> **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler.
+
+Both commands accept:
+- `--validate` to validate output against the Feldera compiler (opt-in; `example` validates by default, use `--no-validate` to skip)
+- `--compiler PATH` to specify the path to the Feldera compiler binary (overrides `FELDERA_COMPILER` env var)
+- `--model MODEL` to specify the LLM model (overrides `FELDERIZE_MODEL` env var)
+- `--no-docs` to disable Feldera SQL reference docs in the prompt
+- `--force-docs` to include docs on the first pass instead of only as a fallback
+- `--verbose` to log the SQL submitted to the validator at each repair attempt
+- `--json-output` to output results as JSON
+
+## Configuration
+
+Environment variables (set in `.env`):
+
+| Variable | Description | Default |
+|---|---|---|
+| `ANTHROPIC_API_KEY` | Anthropic API key | (required) |
+| `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required, set in `.env`) |
+| `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
+
+## How it works
+
+1. Loads translation rules from a single skill file (`spark/data/skills/spark_skills.md`)
+2. Sends Spark SQL to the LLM with rules, validated examples, and relevant Feldera SQL documentation (from `docs.feldera.com/docs/sql/`)
+3. Parses the translated Feldera SQL from the LLM response
+4. Optionally validates output against the Feldera compiler, retrying with error feedback if needed
+
+## Support
+
+Contact us at support@feldera.com for assistance with unsupported Spark SQL features.
diff --git a/python/felderize/pyproject.toml b/python/felderize/pyproject.toml
@@ -0,0 +1,30 @@
+[build-system]
+requires = ["setuptools>=68.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "felderize"
+version = "0.1.0"
+description = "Spark SQL to Feldera SQL translator agent"
+requires-python = ">=3.10"
+dependencies = [
+    "anthropic>=0.39.0",
+    "sqlparse>=0.5.0",
+    "click>=8.1.0",
+    "pyyaml>=6.0",
+    "python-dotenv>=1.0.0",
+]
+
+[tool.setuptools.package-dir]
+felderize = "spark"
+
+[tool.setuptools.package-data]
+felderize = [
+    "data/skills/*.md",
+    "data/skills/**/*.md",
+    "data/samples/*.md",
+    "data/demo/*.sql",
+]
+
+[project.scripts]
+felderize = "felderize.cli:cli"
diff --git a/python/felderize/spark/__init__.py b/python/felderize/spark/__init__.py