Skip to content

feat(snowflake): switch plugin/parser, advisors, and db driver to the omni parser#20567

Open
h3n4l wants to merge 20 commits into
mainfrom
feat/snowflake-omni-integrate
Open

feat(snowflake): switch plugin/parser, advisors, and db driver to the omni parser#20567
h3n4l wants to merge 20 commits into
mainfrom
feat/snowflake-omni-integrate

Conversation

@h3n4l

@h3n4l h3n4l commented Jun 10, 2026

Copy link
Copy Markdown
Member

What

Switch the Snowflake plugin from the legacy ANTLR parser (github.com/bytebase/parser/snowflake) to the omni hand-written parser (github.com/bytebase/omni/snowflake) — zero legacy imports remain. Follows the Trino cutover precedent (#20517). 23 files: 7 core parser features, 14 SQL-review advisors, the db-driver LIMIT rewriter, plus transition scaffolding (added then removed).

How (feature-by-feature, each subagent-implemented + cross-reviewed)

  • Core: SplitSQL, Diagnose, GetStatementRanges, query classification (GetQueryType/validateQuery), GetQuerySpan (column lineage, 1630→~580 lines), shared omni normalization helpers.
  • Advisors: all 14 rules rewritten from ANTLR-listener to omni-AST walking (redshift GetOmniNode style) via a temporary dual-AST OmniAST wrapper, so every commit kept the suite green; yaml advice expectations byte-identical throughout.
  • db driver: LIMIT injection rewritten to omni Loc byte-offset text splicing (test_limit.yaml 15/15 byte-exact).

Cross-review findings fixed along the way (Codex + orchestrator verification per feature)

  • query-span: UNION BY NAME positional-merge (legacy-retained bug, fixed forward), CASE/window lineage drops, JOIN USING/NATURAL collapse (masking-class), USE/SET error→empty-span.
  • Upstream omni fixes this differential surfaced: omni#294 (generated walker skipped ALL non-Node helper-struct fields — SELECT targets/CTEs/CASE whens/OVER were invisible to Inspect; mysql/tidb + pg/redshift same-class follow-ups filed) and omni#297 (FETCH noise words must be optional; TOP 125 * mis-lexed).

Documented divergences (ledgered; first-principles judged, test-locked where behavior changed)

  • SplitSQL drops empty ;-only statements (legacy emitted them as empty=false and spuriously processed them; all consumers skip empties; matches MySQL/Trino-omni). Test-locked.
  • Diagnose no longer flags trailing-garbage tokens (SELECT * FFROM users) — omni is parse-prefix; follow-up filed omni-side. Unicode: omni correctly rejects unquoted non-ASCII identifiers legacy wrongly accepted; quoted forms fine.
  • migration_compatibility: CREATE TABLE LIKE + same-batch DROP no longer warns (legacy fired only via an accidental grammar-listener gap; in-batch creations suppress their own drop by design). Test-locked.
  • PIVOT/UNPIVOT column lineage: omni parses but does not model the clause in its AST → pivot.yaml lineage suite disabled with assertions preserved (omni follow-up needed). NATURAL JOIN star-expansion column order kept un-hoisted (docs silent; lineage merged correctly).

Verification

Whole-repo go build ./backend/... green; parser/snowflake, advisor/snowflake, db/snowflake suites green (-count=1, hard timeouts); all yaml behavioral gates unchanged except the explicitly locked divergences above; grep "bytebase/parser/snowflake" → 0. omni pinned at v0.0.0-20260610063901 (post-#294/#297).

🤖 Generated with Claude Code

h3n4l and others added 16 commits June 9, 2026 17:32
…h core 2/7)

Rewrite diagnose.go to use github.com/bytebase/omni/snowflake/diagnostics
(Analyze) instead of the legacy ANTLR error-listener path; drop the legacy
parser + antlr imports from this file. Mirror trino/diagnose.go: map each
omni Diagnostic to base.SyntaxError via a byte-offset position mapper
(rune-based column) and convert with base.ConvertSyntaxErrorToDiagnostic.

Add diagnose_test.go modeled on the Trino precedent, with cases pinned to
omni's actual behavior and inline notes on each omni-vs-legacy divergence
(trailing-garbage acceptance, empty/;-as-no-error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…switch core 3/7)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eQuery) to omni (bytebase-switch core 4/7)

Rewrite query_type.go + query.go onto omni's hand-written parser/AST,
mirroring the Trino migration:

- query_type.go: getQueryType(ast.Node) type-switches omni AST nodes to
  base.QueryType (SELECT/set-op->Select, INSERT/UPDATE/DELETE/MERGE/COPY INTO
  ->DML, SHOW/DESCRIBE->SelectInfoSchema, USE/SET->Select, COMMENT/everything
  else->DDL), replacing the legacy ANTLR queryTypeListener.
- query.go: validateQuery splits via the omni splitter and classifies each
  segment via omni, preserving the legacy SQL-editor contract
  (allReadOnly, allReturnData, error) and the read-only set
  {SELECT, SHOW, DESCRIBE, SET, EXPLAIN}; SET is read-only but not
  data-returning; USE is rejected.

omni's Snowflake parser does not yet support EXPLAIN (parser.Parse errors),
so EXPLAIN is detected lexically and accepted as read-only+data-returning,
matching the legacy other_command.explain branch which never inspected the
inner statement.

The still-legacy query_span_extractor.go is the sole consumer of the old
queryTypeListener; it was relocated there verbatim (no behavior change) so
the extractor keeps compiling on the legacy ANTLR stack until its own
migration. ANTLR/antlr4-go imports are removed from query.go + query_type.go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…omni does parse EXPLAIN; lexical short-circuit is for legacy parity)
…ase-switch core 5/7)

pivot.yaml disabled: omni parses PIVOT/UNPIVOT but does not model the clause
in the AST (consume-and-discard), so pivot lineage is unrecoverable until omni
gains PIVOT AST nodes. Assertions preserved for re-enable.
… USING/NATURAL collapse, USE/SET empty span (cross-review findings)

Four verified bugs in the omni-AST query-span extractor (column lineage
feeds data masking):

1. UNION [ALL] BY NAME was merged positionally, attributing the wrong
   source columns when the branches list columns in different orders.
   Now merges by case-insensitive column name (right-only columns
   appended), mirroring omni's own snowflake/analysis by-name merge.
   NOTE: the legacy ANTLR extractor was ALSO positional here — fixing
   forward, not a parity regression.

2. CASE WHEN ... THEN and window-function OVER (...) lineage was dropped
   entirely: omni's generated walker does not descend into
   CaseExpr.Whens ([]*WhenClause), FuncCallExpr.OrderBy, or
   FuncCallExpr.Over (non-Node embedded structs). The Inspect callback
   now recurses into those sub-trees manually; collection into a set
   keeps it idempotent if the upstream walker is fixed later.

3. JOIN ... USING fell through to the generic append, duplicating the
   shared column — shifting every later positional mask by one — and
   NATURAL JOIN collapsed but kept only the left side's lineage. USING
   now emits each using-column ONCE (using cols first in USING order,
   then left, then right non-shared, per Snowflake JOIN docs) with
   merged left+right lineage; NATURAL keeps its column order but merges
   right lineage into the shared columns. NOTE: legacy had NATURAL-only
   collapse (left lineage only) and no USING handling — fixing forward
   to docs-correct behavior. The pre-existing Natural Join yaml case
   expectations are updated accordingly (right-side lineage was missing).

4. USE / SET are classified base.Select but are not query expressions;
   they hit extractPseudoTableFromQueryNode's "unsupported query node
   type" error. They now return an empty-results span with no error,
   exactly as the legacy selectOnlyListener (which never fired for them)
   shaped it.

Regression yaml cases added for all four (UNION ALL BY NAME swapped
columns, CASE WHEN, ROW_NUMBER OVER, JOIN USING star expansion, NATURAL
JOIN star expansion, USE DATABASE, SET).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…igration (ANTLR advisors unaffected)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (omni divergence — legacy fired via Create_table_like listener gap)
…h teardown)

- parseSnowflakeStatements omni-only; OmniAST drops the wrapped ANTLR tree + provider
- generic_checker.go ANTLR walker deleted; ANTLR-free BaseRule plumbing kept (base_rule.go)
- db/snowflake LIMIT rewriter rewritten via omni Loc text-splicing (test_limit.yaml byte-exact;
  needed omni fix #297: optional FETCH noise words + TOP-count infix-* bug, both legacy/docs parity)
- zero github.com/bytebase/parser/snowflake imports remain; omni bumped to post-#297
@h3n4l h3n4l requested a review from a team as a code owner June 10, 2026 06:50
@cla-bot cla-bot Bot added the cla-signed label Jun 10, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da665ecf22

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Collect the star-token offsets first and emit advices in text order so
// the advice order matches the legacy depth-first listener walk.
var starOffsets []int
omniast.Inspect(node, func(n omniast.Node) bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**P1 Badge Walk CTE SELECTs when enforcing SELECT ***

When a SELECT * appears inside a CTE or a scalar subquery target, this rule no longer visits it because it relies only on omniast.Inspect; the new query-span walker explicitly notes that the generated walker skips SelectStmt.With and SelectStmt.Targets (backend/plugin/parser/snowflake/query_span_extractor.go:1065-1068). For example, WITH c AS (SELECT * FROM t) SELECT a FROM c now produces no SELECT * advice even though the legacy ANTLR listener walked the full tree and flagged the CTE body, so this SQL review rule can be bypassed in common Snowflake queries.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified empirically and this turned out to be a stale-comment artifact: the omni generated walker was fixed upstream (bytebase/omni#294) to descend into SelectStmt.With/Targets (and CASE whens, OVER, etc.) before this PR's pin. WITH c AS (SELECT * FROM t) SELECT a FROM c DOES produce "Avoid using SELECT *." — added a yaml lock case for exactly this shape (statement_select_no_select_all.yaml), and corrected the stale comments in query_span_extractor.go that suggested otherwise. (fed16ac)

// advice order matches the legacy depth-first listener walk (a CTE body
// is reported before the main SELECT that follows it).
var offsets []int
omniast.Inspect(node, func(n omniast.Node) bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Walk CTE SELECTs for WHERE-required checks

This traversal misses SELECTs in WITH clauses (and SELECT-list scalar subqueries) for the same reason the query-span code hand-walks those fields: the generated omni walker skips SelectStmt.With and SelectStmt.Targets (backend/plugin/parser/snowflake/query_span_extractor.go:1065-1068). A statement such as WITH c AS (SELECT * FROM t) SELECT * FROM c WHERE id = 1 will only inspect the outer SELECT and will not report the missing WHERE in the CTE body, regressing the documented behavior that this rule checks CTE bodies and subqueries.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root as the SELECT * comment: the walker gap was already fixed in bytebase/omni#294 (the extractor comment citing it was stale — now corrected). WITH c AS (SELECT x FROM t) SELECT a FROM c WHERE id = 1 DOES flag the WHERE-less CTE body (code 202) — added a yaml lock case (statement_where_require_select.yaml). (fed16ac)

// from the AST. The legacy ANTLR extractor handled both. The expected
// lineage assertions are preserved in pivot.yaml for when omni gains the
// feature. Tracked as an omni-vs-legacy divergence in the migration ledger.
// "test-data/query-span/pivot.yaml",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not silently accept unsupported PIVOT lineage

Disabling the PIVOT/UNPIVOT query-span cases leaves production code to parse FROM t PIVOT(...) AS p as a bare FROM t, as this comment says, so GetQuerySpan will return lineage and result columns for the underlying table instead of the pivoted output. For Snowflake queries that use PIVOT/UNPIVOT, this can corrupt masking/sensitive-column analysis; until omni models the clause, the parser should fail closed or retain the legacy handling rather than silently producing incorrect lineage.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — fixed fail-closed. bytebase/omni#300 models PIVOT/UNPIVOT on ast.TableRef (Pivot/Unpivot/Nested; also fixes a chained-pivot parse bug and omni-side deparse/analysis), and this PR's extractor now returns an explicit error for pivoted table sources in extractTableSourceFromTableRef instead of silently resolving the bare base table — locked by TestGetQuerySpan_PivotFailsClosed; pivot.yaml stays disabled (assertions preserved) until pivot projection lineage is implemented. (fed16ac)

h3n4l added 3 commits June 10, 2026 17:03
…y span; lock CTE-body advisor coverage

- PIVOT: omni #300 models PIVOT/UNPIVOT on ast.TableRef (+ fixes omni-side parse/deparse/analysis);
  extractTableSourceFromTableRef now errors on Pivot/Unpivot/Nested (no silent wrong lineage);
  TestGetQuerySpan_PivotFailsClosed locks it; omni pin bumped.
- CTE coverage: the advisors DO flag CTE bodies (walker fixed in omni #294) — added yaml lock
  cases for SELECT_NO_SELECT_ALL and WHERE_REQUIRE_SELECT; corrected the stale extractor
  comments that claimed the generated walker skips WITH/targets.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2491636145

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +445 to +447
if colRef, ok := expr.(*ast.ColumnRef); ok {
normalizedDatabaseName, normalizedSchemaName, normalizedTableName, normalizedColumnName := normalizeColumnRef(colRef)
querySpanResult, err := q.getField(normalizedDatabaseName, normalizedSchemaName, normalizedTableName, normalizedColumnName)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve positional $n column lineage

For Snowflake queries that use positional column references such as SELECT $1 FROM t, the legacy extractor explicitly recognized DOLLAR()/Column_position() and resolved $1 to the first field from the current FROM source. This port only treats target expressions as ColumnRef or generic expressions, so a $n reference is either looked up as a literal column name or contributes no source columns, which breaks query-span lineage and masking for those queries. The old positional-resolution branch has no replacement in this function.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 71453ad*ast.DollarRef (Positional) now resolves to the N-th FROM-scope column (qualified T.$2 included; out-of-range errors match legacy), restoring the legacy DOLLAR/Column_position behavior; additionally $N inside compound expressions and predicates (TO_DATE($1), $1+1 — where legacy was also blind) now contributes lineage via a shared resolvePositionalRef in the expression walker. Locked by two yaml cases (SELECT $1, T.$2 FROM T1 T; SELECT TO_DATE($1) FROM T1).

Restores legacy DOLLAR/Column_position parity for SELECT targets (bare $1 and
qualified T.$2 resolve to the N-th FROM-scope column, out-of-range errors) and
goes one step further than legacy: $N inside compound expressions (TO_DATE($1),
$1+1) and predicates now contributes the positional column's lineage via a
shared resolvePositionalRef helper in the expression walker (legacy was blind
there too). yaml lock cases added.
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants