Skip to content

feat: cut BigQuery + Spanner parsing to the omni googlesql parser#20569

Open
h3n4l wants to merge 6 commits into
mainfrom
feat/googlesql-omni-cutover
Open

feat: cut BigQuery + Spanner parsing to the omni googlesql parser#20569
h3n4l wants to merge 6 commits into
mainfrom
feat/googlesql-omni-cutover

Conversation

@h3n4l

@h3n4l h3n4l commented Jun 10, 2026

Copy link
Copy Markdown
Member

Cut BigQuery + Spanner parsing to the omni googlesql parser

Cuts backend/plugin/parser/{bigquery,spanner} from the legacy ANTLR parser (github.com/bytebase/parser/googlesql) to the omni hand-written GoogleSQL parser + masking-grade analysis — the Trino #20517 precedent applied to both GoogleSQL engines (one omni grammar serves both dialects). The legacy import count in both plugins is 0.

What changes (3 commits: bigquery, spanner, pin bump)

Per engine: split.go (omni block-aware splitter, legacy position conventions preserved), dignose.go (omni diagnostics; valid-but-unmodeled stubs suppressed), query_type.go (omni classifier; spanner's legacy SET→Select re-applied), query_span_extractor.go (delegates lineage to omni analysis.GetQuerySpan per dialect, then bytebase post-processing: metadata star expansion, system-table handling, legacy naming). Spanner specifics reproduced exactly: named-schema model (db part of db.schema.table ignored like legacy), system-only queries (INFORMATION_SCHEMA / SPANNER_SYS) early-return an empty SelectInfoSchema span, mixed user+system rejects, predicate columns empty (legacy parity). Dead legacy wrappers (bigquery.go, spanner.go) removed.

Masking fidelity — how this was validated

This cutover was blocked and rebuilt once: the first attempt surfaced that omni's then best-effort lineage under-attributed ~7 query-shape classes vs the legacy precise resolvers (bytebase's masker is positional and fail-open, so under-attribution = sensitive columns returned unmasked). omni's analysis was hardened to masking-grade first (bytebase/omni#299), and this PR's bar is:

  • Legacy-resolver differential corpora — the query-span goldens in test-data/query-span/standard.yaml were expanded to 54 (bigquery) + 52 (spanner) lineage edge cases and recorded from the legacy resolvers (set-ops incl. star arms and 3-way chains, subset-projecting/recursive/nested CTEs, derived tables, JOIN USING coalescing, qualified stars, struct paths, system schemas, EXCEPT/REPLACE). The omni-backed extractors reproduce them 106/106.
  • Leak-pin unit tests (query_span_leak_pins_test.go ×2) for shapes the legacy resolver cannot record because it errors or itself leaks: lowercase-USING coalescing (a legacy case-fold bug that mis-aligned the positional masker — omni coalesces case-insensitively, strictly safer), UNNEST element lineage (legacy fail-closed; omni attributes correctly), BY NAME set-ops (legacy can't parse; omni name-merges), schema-qualified sch.t.* (legacy errored; omni resolves with schema-qualified lineage), struct-field stars (fail closed — never silently empty).
  • 6 independent review rounds (Claude + Codex) over the omni resolver + both extractors; every finding fixed-and-pinned or defended against the legacy ground truth.
  • Documented golden deltas (in-corpus case descriptions): two ON+USING legacy over-accept cases reshaped so both resolvers agree; the lowercase-USING corpus case moved to a leak pin (its legacy golden encodes the leak).

Notes for review

  • PredicateColumns stays empty for both engines (legacy parity — the legacy extractors never populated it; omni computes it, surfacing later is a deliberate follow-up).
  • Splitter conventions preserved per engine: spanner keeps the legacy contiguous-text convention (leading whitespace attaches to the following statement, trailing ; included); existing split fixtures pass unchanged.
  • The omni dependency bump is forward-only; trino/doris (already on omni) are unaffected — the whole parser tree builds and their packages are untouched.

🤖 Generated with Claude Code

@h3n4l h3n4l requested a review from a team as a code owner June 10, 2026 07:02
@cla-bot cla-bot Bot added the cla-signed label Jun 10, 2026
h3n4l and others added 4 commits June 10, 2026 16:01
…lesql

BigQuery plugin delegates to the omni hand-written parser + masking-grade
analysis (replace directive pins the local omni-harden during validation):
split/diagnose/classify via omni parser+diagnostics+analysis; the query-span
extractor consumes resolved relation projections (StarSegments with
ExceptColumns for USING-coalesced keys; StarMerge; deferred SetOpMerge with
BY NAME name-merge honoring MatchColumns) and expands base-table stars via
catalog metadata. Legacy ANTLR import removed from the plugin.

54-case legacy-recorded differential corpus green (goldens recorded from the
legacy resolver, never hand-faked); leak-pin unit tests cover shapes the
legacy resolver cannot record (lowercase-USING coalesce, UNNEST lineage,
BY NAME merges) per the structural rule: correct lineage or fail closed,
never silent under-attribution.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…esql

Spanner plugin delegates to the omni hand-written parser + masking-grade
analysis (DialectSpanner), mirroring the BigQuery cutover in this branch:

- split.go: omni block-aware splitter with the legacy parse-tree splitter's
  conventions (contiguous Text from the previous statement's end through the
  trailing ';', positions via the shared byte-offset mapper).
- dignose.go / query_type.go: omni diagnostics + classifier; the legacy
  spanner SET->Select special case re-applied.
- query_span_extractor.go: the Spanner name model (named schemas under one
  database; the db part of db.schema.table ignored like legacy), system-only
  queries early-return an EMPTY SelectInfoSchema span and mixed user+system
  rejects (exact legacy behavior; SPANNER_SYS included), predicate columns
  empty (legacy parity), star-derived USING keys in metadata case.
- spanner.go (legacy ANTLR wrapper) deleted; legacy import count is 0.

Validated against the 52-case legacy-resolver differential corpus (recorded
FROM the legacy spanner resolver; two ON+USING cases reshaped so both
resolvers agree, the lowercase-USING case moved to a leak pin — legacy's
case-fold non-coalesce is a positional masking leak omni fixes). Leak-pin
unit tests cover legacy-unrecordable shapes: lowercase-USING coalesce, UNNEST
lineage, BY NAME merge, schema-qualified dot-star (legacy errored; omni
resolves with schema-qualified lineage), mixed/system-only handling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop the local-development replace directive and pin
github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade
relation-projection resolution for googlesql).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The bigquery and spanner plugins were structural copies (SonarCloud: 73%
duplication on new code). Extract the omni-backed implementation into
backend/plugin/parser/googlesql — the query-span extractor, splitter,
diagnostics adapter, and query-type mapping — parameterized by a Config
carrying the documented dialect deltas (name model, system-table handling,
SET classification, naming/split conventions). Each engine package shrinks to
a registration wrapper plus its dialect Config; the shared test harness moves
to googlesql/googlesqltest.

Besides the duplication gate, one shared code path is the right shape for
masking-critical lineage: the two copies could otherwise drift independently.
Behavior is unchanged — both legacy-recorded differential corpora (54 + 52
cases) and all leak-pin tests pass identically through the shared
implementation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@h3n4l h3n4l force-pushed the feat/googlesql-omni-cutover branch from bb80c97 to 3fb203b Compare June 10, 2026 08:12
@socket-security

socket-security Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedgolang/​github.com/​bytebase/​omni@​v0.0.0-20260610065225-2cf04e304cb1 ⏵ v0.0.0-20260610065946-2e66c343ef9977 +1100100100100

View full report

@socket-security

socket-security Bot commented Jun 10, 2026

Copy link
Copy Markdown

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action Severity Alert  (click "▶" to expand/collapse)
Warn High
Obfuscated code: golang github.com/bytebase/omni is 90.0% likely obfuscated

Confidence: 0.90

Location: Package overview

From: go.modgolang/github.com/bytebase/omni@v0.0.0-20260610065946-2e66c343ef99

ℹ Read more on: This package | This alert | What is obfuscated code?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore golang/github.com/bytebase/omni@v0.0.0-20260610065946-2e66c343ef99. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fb203b4cb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +284 to +287
if err != nil {
var notFound *base.ResourceNotFoundError
if errors.As(err, &notFound) && firstNotFound == nil {
firstNotFound = notFound

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate non-not-found metadata errors

When GetDatabaseMetadataFunc fails for a reason other than ResourceNotFoundError (for example a store outage or context cancellation), this branch still falls through to the table-level fallback and continues. For SELECT secret FROM t, that leaves the result column with an empty SourceColumns set and no NotFoundError, so query_result_masker does not reject the span and can return sensitive data unmasked after a transient metadata failure; only resource-not-found should be converted into a non-fatal span error, while other metadata errors should be returned.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: expandTablesToColumns now returns fatally on any non-ResourceNotFoundError metadata failure (only genuine not-found degrades to span.NotFoundError, which the masking layer rejects). Star-expansion lookups only touch tables this pass already validated and cached, so the single fatal gate covers the lineage paths.

Comment on lines +23 to +25
var config = googlesql.Config{
Dialect: analysis.DialectBigQuery,
UppercaseStarMergeName: true,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve BigQuery SET as read-only

BigQuery's removed legacy listener classified SET statements as base.Select, but this BigQuery config leaves SetStatementIsSelect disabled even though the shared classifier notes omni reports SET as unknown. In the query path BigQuery uses EngineSupportQueryNewACL, where QueryTypeUnknown is rejected as "disallowed query type", so otherwise read-only BigQuery SET statements (for variables/options in scripts) regress from allowed to denied; set this flag for BigQuery as well.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: verified the removed legacy BigQuery listener also classified SET as base.Select, so the BigQuery config now sets SetStatementIsSelect as well.

… dedupe

- Thread context.Context as a parameter through the metadata-touching methods
  instead of storing it on the extractor (SonarCloud S8242).
- Remove the orderedColumns machinery that became dead once stars expand via
  omni StarSegments (golangci-lint unused-parameter), make applyStarModifiers
  a plain function, and move to slices.Sort/SortFunc/ContainsFunc.
- Collapse the per-engine wrappers onto googlesql.Register (registers split/
  diagnose/query-span and returns the handlers for the engine tests) and
  unify the table/view metadata lookup into a single lookup loop — removing
  the remaining duplication SonarCloud paired between the two plugin files
  and against the Trino extractor this implementation was templated from.

Behavior unchanged: both legacy-recorded differential corpora and all
leak-pin tests pass identically.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…s, BigQuery SET

- expandTablesToColumns: a metadata error that is NOT ResourceNotFoundError (a
  store outage, a canceled context) now fails the span FATALLY instead of
  degrading to a table-level fallback with silently-empty result lineage and no
  NotFoundError — the fail-open positional masker would have returned sensitive
  data unmasked after a transient infrastructure failure. NotFound still
  degrades non-fatally (recorded as span.NotFoundError, which the masking layer
  rejects). Star-expansion lookups only touch tables this pass already
  validated (omni records every physical base table into AccessTables), so the
  single fatal gate covers the lineage paths.
- BigQuery config: SetStatementIsSelect — the legacy BigQuery listener also
  classified SET as Select ("treat SAFE SET as select"); without it omni's
  Unknown would be rejected by the new-ACL access check, regressing read-only
  SET statements.
- googlesql.Register returns a Handlers struct so each engine re-exports
  SplitSQL/GetQuerySpan with its own declaration (revive: exported).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants