feat: cut BigQuery + Spanner parsing to the omni googlesql parser#20569
feat: cut BigQuery + Spanner parsing to the omni googlesql parser#20569h3n4l wants to merge 6 commits into
Conversation
…lesql BigQuery plugin delegates to the omni hand-written parser + masking-grade analysis (replace directive pins the local omni-harden during validation): split/diagnose/classify via omni parser+diagnostics+analysis; the query-span extractor consumes resolved relation projections (StarSegments with ExceptColumns for USING-coalesced keys; StarMerge; deferred SetOpMerge with BY NAME name-merge honoring MatchColumns) and expands base-table stars via catalog metadata. Legacy ANTLR import removed from the plugin. 54-case legacy-recorded differential corpus green (goldens recorded from the legacy resolver, never hand-faked); leak-pin unit tests cover shapes the legacy resolver cannot record (lowercase-USING coalesce, UNNEST lineage, BY NAME merges) per the structural rule: correct lineage or fail closed, never silent under-attribution. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…esql Spanner plugin delegates to the omni hand-written parser + masking-grade analysis (DialectSpanner), mirroring the BigQuery cutover in this branch: - split.go: omni block-aware splitter with the legacy parse-tree splitter's conventions (contiguous Text from the previous statement's end through the trailing ';', positions via the shared byte-offset mapper). - dignose.go / query_type.go: omni diagnostics + classifier; the legacy spanner SET->Select special case re-applied. - query_span_extractor.go: the Spanner name model (named schemas under one database; the db part of db.schema.table ignored like legacy), system-only queries early-return an EMPTY SelectInfoSchema span and mixed user+system rejects (exact legacy behavior; SPANNER_SYS included), predicate columns empty (legacy parity), star-derived USING keys in metadata case. - spanner.go (legacy ANTLR wrapper) deleted; legacy import count is 0. Validated against the 52-case legacy-resolver differential corpus (recorded FROM the legacy spanner resolver; two ON+USING cases reshaped so both resolvers agree, the lowercase-USING case moved to a leak pin — legacy's case-fold non-coalesce is a positional masking leak omni fixes). Leak-pin unit tests cover legacy-unrecordable shapes: lowercase-USING coalesce, UNNEST lineage, BY NAME merge, schema-qualified dot-star (legacy errored; omni resolves with schema-qualified lineage), mixed/system-only handling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop the local-development replace directive and pin github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade relation-projection resolution for googlesql). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The bigquery and spanner plugins were structural copies (SonarCloud: 73% duplication on new code). Extract the omni-backed implementation into backend/plugin/parser/googlesql — the query-span extractor, splitter, diagnostics adapter, and query-type mapping — parameterized by a Config carrying the documented dialect deltas (name model, system-table handling, SET classification, naming/split conventions). Each engine package shrinks to a registration wrapper plus its dialect Config; the shared test harness moves to googlesql/googlesqltest. Besides the duplication gate, one shared code path is the right shape for masking-critical lineage: the two copies could otherwise drift independently. Behavior is unchanged — both legacy-recorded differential corpora (54 + 52 cases) and all leak-pin tests pass identically through the shared implementation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
bb80c97 to
3fb203b
Compare
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Warning Review the following alerts detected in dependencies. According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3fb203b4cb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if err != nil { | ||
| var notFound *base.ResourceNotFoundError | ||
| if errors.As(err, ¬Found) && firstNotFound == nil { | ||
| firstNotFound = notFound |
There was a problem hiding this comment.
Propagate non-not-found metadata errors
When GetDatabaseMetadataFunc fails for a reason other than ResourceNotFoundError (for example a store outage or context cancellation), this branch still falls through to the table-level fallback and continues. For SELECT secret FROM t, that leaves the result column with an empty SourceColumns set and no NotFoundError, so query_result_masker does not reject the span and can return sensitive data unmasked after a transient metadata failure; only resource-not-found should be converted into a non-fatal span error, while other metadata errors should be returned.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: expandTablesToColumns now returns fatally on any non-ResourceNotFoundError metadata failure (only genuine not-found degrades to span.NotFoundError, which the masking layer rejects). Star-expansion lookups only touch tables this pass already validated and cached, so the single fatal gate covers the lineage paths.
| var config = googlesql.Config{ | ||
| Dialect: analysis.DialectBigQuery, | ||
| UppercaseStarMergeName: true, |
There was a problem hiding this comment.
Preserve BigQuery SET as read-only
BigQuery's removed legacy listener classified SET statements as base.Select, but this BigQuery config leaves SetStatementIsSelect disabled even though the shared classifier notes omni reports SET as unknown. In the query path BigQuery uses EngineSupportQueryNewACL, where QueryTypeUnknown is rejected as "disallowed query type", so otherwise read-only BigQuery SET statements (for variables/options in scripts) regress from allowed to denied; set this flag for BigQuery as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: verified the removed legacy BigQuery listener also classified SET as base.Select, so the BigQuery config now sets SetStatementIsSelect as well.
… dedupe - Thread context.Context as a parameter through the metadata-touching methods instead of storing it on the extractor (SonarCloud S8242). - Remove the orderedColumns machinery that became dead once stars expand via omni StarSegments (golangci-lint unused-parameter), make applyStarModifiers a plain function, and move to slices.Sort/SortFunc/ContainsFunc. - Collapse the per-engine wrappers onto googlesql.Register (registers split/ diagnose/query-span and returns the handlers for the engine tests) and unify the table/view metadata lookup into a single lookup loop — removing the remaining duplication SonarCloud paired between the two plugin files and against the Trino extractor this implementation was templated from. Behavior unchanged: both legacy-recorded differential corpora and all leak-pin tests pass identically. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…s, BigQuery SET
- expandTablesToColumns: a metadata error that is NOT ResourceNotFoundError (a
store outage, a canceled context) now fails the span FATALLY instead of
degrading to a table-level fallback with silently-empty result lineage and no
NotFoundError — the fail-open positional masker would have returned sensitive
data unmasked after a transient infrastructure failure. NotFound still
degrades non-fatally (recorded as span.NotFoundError, which the masking layer
rejects). Star-expansion lookups only touch tables this pass already
validated (omni records every physical base table into AccessTables), so the
single fatal gate covers the lineage paths.
- BigQuery config: SetStatementIsSelect — the legacy BigQuery listener also
classified SET as Select ("treat SAFE SET as select"); without it omni's
Unknown would be rejected by the new-ACL access check, regressing read-only
SET statements.
- googlesql.Register returns a Handlers struct so each engine re-exports
SplitSQL/GetQuerySpan with its own declaration (revive: exported).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|



Cut BigQuery + Spanner parsing to the omni googlesql parser
Cuts
backend/plugin/parser/{bigquery,spanner}from the legacy ANTLR parser (github.com/bytebase/parser/googlesql) to the omni hand-written GoogleSQL parser + masking-grade analysis — the Trino #20517 precedent applied to both GoogleSQL engines (one omni grammar serves both dialects). The legacy import count in both plugins is 0.What changes (3 commits: bigquery, spanner, pin bump)
Per engine:
split.go(omni block-aware splitter, legacy position conventions preserved),dignose.go(omni diagnostics; valid-but-unmodeled stubs suppressed),query_type.go(omni classifier; spanner's legacySET→Selectre-applied),query_span_extractor.go(delegates lineage to omnianalysis.GetQuerySpanper dialect, then bytebase post-processing: metadata star expansion, system-table handling, legacy naming). Spanner specifics reproduced exactly: named-schema model (db part ofdb.schema.tableignored like legacy), system-only queries (INFORMATION_SCHEMA / SPANNER_SYS) early-return an emptySelectInfoSchemaspan, mixed user+system rejects, predicate columns empty (legacy parity). Dead legacy wrappers (bigquery.go,spanner.go) removed.Masking fidelity — how this was validated
This cutover was blocked and rebuilt once: the first attempt surfaced that omni's then best-effort lineage under-attributed ~7 query-shape classes vs the legacy precise resolvers (bytebase's masker is positional and fail-open, so under-attribution = sensitive columns returned unmasked). omni's analysis was hardened to masking-grade first (bytebase/omni#299), and this PR's bar is:
test-data/query-span/standard.yamlwere expanded to 54 (bigquery) + 52 (spanner) lineage edge cases and recorded from the legacy resolvers (set-ops incl. star arms and 3-way chains, subset-projecting/recursive/nested CTEs, derived tables, JOIN USING coalescing, qualified stars, struct paths, system schemas, EXCEPT/REPLACE). The omni-backed extractors reproduce them 106/106.query_span_leak_pins_test.go×2) for shapes the legacy resolver cannot record because it errors or itself leaks: lowercase-USINGcoalescing (a legacy case-fold bug that mis-aligned the positional masker — omni coalesces case-insensitively, strictly safer), UNNEST element lineage (legacy fail-closed; omni attributes correctly),BY NAMEset-ops (legacy can't parse; omni name-merges), schema-qualifiedsch.t.*(legacy errored; omni resolves with schema-qualified lineage), struct-field stars (fail closed — never silently empty).Notes for review
PredicateColumnsstays empty for both engines (legacy parity — the legacy extractors never populated it; omni computes it, surfacing later is a deliberate follow-up).;included); existing split fixtures pass unchanged.🤖 Generated with Claude Code