Skip to content

feat(trino): migrate Trino parser plugin from ANTLR to omni#20517

Merged
d-bytebase merged 11 commits into
mainfrom
feat/trino-omni-integrate
Jun 5, 2026
Merged

feat(trino): migrate Trino parser plugin from ANTLR to omni#20517
d-bytebase merged 11 commits into
mainfrom
feat/trino-omni-integrate

Conversation

@h3n4l

@h3n4l h3n4l commented Jun 5, 2026

Copy link
Copy Markdown
Member

Cuts over bytebase's Trino parser from the legacy ANTLR parser (github.com/bytebase/parser/trino) to the hand-written omni parser (github.com/bytebase/omni/trino/*) — the final step of the Trino omni migration (17 omni nodes already merged to bytebase/omni). Follows the already-shipped doris precedent.

What changed

backend/plugin/parser/trino + backend/plugin/schema/trino rewired; the legacy bytebase/parser/trino import is gone; the ANTLR-only query_span_listener.go / query_span_predicate.go and the ~1700-line CodeCompletionCore completer are deleted. omni bumped to a main version with all trino packages.

P0 handler now backed by
SplitSQL trino/parser.Split
parseTrinoStatements trino/parser.Parse
Diagnose trino/parser.Diagnose
validateQuery / query-type omni AST inspection + analysis.Classify
GetQuerySpan (masking/lineage) trino/analysis.GetQuerySpan + metadata column expansion
GetDatabaseDefinition trino/deparse.GetDatabaseDefinition
Completion trino/completion.Complete

Verification

  • go build ./backend/... ✅ · go vet ✅ · gofmt clean ✅
  • go test ./backend/plugin/{parser,schema}/trino/... ✅ (the feature differential — existing trino tests pass)
  • doris (another omni engine) tests still ✅ with the omni bump.

Documented divergences from the legacy parser (please review)

  • Completion: the omni completer is intentionally not a c3 port — it emits {Text, Type} candidates only (no Definition/Priority), and column candidates are the in-scope FROM columns (query-span-resolved) rather than the legacy "every column in the default schema, ranked". Tests updated to the omni candidate model.
  • Query span: omni analysis.GetQuerySpan returns table access + result columns; the extractor expands tables to column-level SourceColumns via bytebase metadata and rebuilds per-result + predicate columns. A couple of query_span test expectations were adjusted to omni's (oracle-validated) behavior — review the diffs.

Reviewer focus: the query-span SourceColumns mapping (masking input) and the completion candidate-model change.

🤖 Generated with Claude Code

Cut over backend/plugin/parser/trino + backend/plugin/schema/trino from the
legacy ANTLR parser (github.com/bytebase/parser/trino) to the hand-written omni
parser (github.com/bytebase/omni/trino/*), following the doris precedent. The
legacy bytebase/parser/trino import is fully removed.

The 6 registered P0 handlers are now omni-backed:
- SplitSQL                → trino/parser.Split
- parseTrinoStatements    → trino/parser.Parse
- Diagnose                → trino/parser.Diagnose
- validateQuery/queryType → omni AST inspection + analysis.Classify
- GetQuerySpan            → trino/analysis.GetQuerySpan (+ metadata column expansion)
- GetDatabaseDefinition   → trino/deparse.GetDatabaseDefinition
- Completion              → trino/completion.Complete

Deleted the ANTLR-only query_span_listener.go / query_span_predicate.go and the
~1700-line CodeCompletionCore completer. omni bumped to a main version that
includes all trino packages.

Documented divergences from the legacy parser:
- Completion: the omni completer is intentionally not a c3 port — it emits
  {Text,Type} candidates only (no Definition/Priority), and column candidates
  are the in-scope FROM columns resolved via query-span analysis.
- Query span: built from omni analysis table access, expanded to column-level
  SourceColumns via bytebase metadata; per-result + predicate columns preserved.

Feature differential green: go test ./backend/plugin/{parser,schema}/trino/...
Whole backend builds; doris (another omni engine) unaffected by the omni bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@h3n4l h3n4l requested a review from a team as a code owner June 5, 2026 02:59
@cla-bot cla-bot Bot added the cla-signed label Jun 5, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5abec33f40

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/query_span_extractor.go Outdated
@h3n4l h3n4l requested a review from vsai12 June 5, 2026 06:11
h3n4l and others added 3 commits June 5, 2026 14:18
CI:
- golangci-lint: drop unused method receivers + an unused test param (0 issues).
- SonarCloud duplication: use the shared base.NewByteOffsetPositionMapper for
  byte->position conversion in split.go/diagnose.go/trino.go and extract the
  duplicated column-match loop into addMatchingColumns; net -91 lines and these
  files no longer duplicate the doris adapter's local helpers.

Correctness (Claude + Codex cross-family review):
- EXPLAIN span unwrap returned empty SourceColumns: ast.NodeLoc only handles
  ast-package nodes, not parser statement nodes, so the inner query never
  unwrapped. Use nodeSpan (QueryStmt.Loc / Span()).
- Security: EXPLAIN ANALYZE executes its inner statement but getQueryType
  reported it read-only unconditionally, so EXPLAIN ANALYZE <DML> passed the
  read-only SQL-editor gate. getQueryType now recurses into the inner statement
  (oracle-confirmed: Trino 481 runs it), so EXPLAIN ANALYZE UPDATE is rejected.

Tracked as follow-up tasks: aliased per-column masking; SELECT * result-column
order (positional masking); WITH FUNCTION read-only classification; quoted-
identifier case sensitivity in column matching.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parseTrinoStatements and parseTrinoSQL shared a near-identical per-segment
parse loop (split -> parse each -> wrap omniAST -> convert first error),
which SonarCloud flagged as duplicated against the doris adapter's
equivalent two functions (7.4% duplication on new code, gate <=3%).

Extract the common logic into parseSegment(stmt) (*omniAST, error), so the
two callers become thin loops with distinct bodies. dupl now reports 0
clone groups at threshold 50 and 70. Behavior is preserved (verified by
cross-agent review): empty segments yield a nil-AST placeholder in
parseTrinoStatements and are skipped in parseTrinoSQL; the first parse
error returns immediately via convertParseError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
go.mod pins github.com/bytebase/omni at 20260604221654-107b8b4db5bf, but
go.sum still carried the leftover hash lines for the prior 20260604130724
revision (not referenced by any module), so `go mod tidy && git diff
--exit-code` failed in the go-tests CI job. Remove the two orphaned lines.
go mod tidy produces no other changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c12d7b239

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/query_span_extractor.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f68d44d8fa

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/query_span_extractor.go Outdated
Addresses three masking-correctness issues raised in review of the query
span extractor, each in the data-leak (under-mask) direction. The masker
applies per-result maskers positionally against the executed result's
column order, so lineage order/coverage is security-relevant.

1. SELECT * column order: the star was expanded by ranging a Go map, so
   result columns came out in nondeterministic order and the positional
   masker could mask the wrong column. Expansion now uses an ordered slice
   built in FROM-table then metadata-column order. The ordered slice is NOT
   deduplicated, so a self-join (FROM users u1 JOIN users u2) emits each
   instance's columns and stays aligned with Trino's 2N output columns.

2. System / information_schema queries: column expansion resolved Trino's
   system pseudo-catalogs as Bytebase databases, producing a
   ResourceNotFoundError that sql_service turns into a hard "failed to mask
   data" rejection of valid SQL-info queries. System tables are now detected
   per resolved table (catalog=="system" or schema=="information_schema"),
   not by a query-text substring scan, so a "system." string literal cannot
   suppress masking of a real table in the same statement.

3. Aliased columns: an alias-qualified select item (u.email over "users u")
   was compared against physical table names and dropped, leaving empty
   lineage so the masker treated a sensitive column as a constant. Alias
   resolution is now additive (alias -> all physical tables it maps to;
   match accepts the written name or any mapped table), which resolves
   reused/shadowed aliases across subquery scopes by over-including
   (masks more) instead of under-matching.

Both fixes were cross-agent reviewed (Claude + Codex); Codex caught the
self-join, substring-bypass, and alias-scope edge cases above, which this
revision resolves. One documented best-effort residual remains: a mixed
SELECT * over a system-table-joined-with-real-table can misalign positions
because system tables have no enumerable column metadata (tracked as a
follow-up; such queries are already classified SelectInfoSchema).

Adds direct extractor tests for column order, self-join column count,
aliased and shadowed-alias lineage, the system-substring non-bypass, and
the no-NotFoundError behaviour for system queries.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 549f651ec2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/query_span_extractor.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d51f297fb2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/query_span_extractor.go Outdated
h3n4l and others added 2 commits June 5, 2026 17:37
A qualified star select item such as `SELECT u.* FROM users u` is named
"u.*" by omni, so it was not matched by the unqualified "*" expansion and
fell through to the column-name path, producing empty SourceColumns. The
masker then treated those (possibly sensitive) columns as constants and
returned them unmasked.

buildResults now detects a genuine qualified star by its source-ref shape
(Name ends ".*" and the single source ref's Column equals the qualifier),
not just the display name — so a column merely aliased to a quoted name
ending in ".*" (e.g. SELECT email AS "u.*") is NOT mistaken for a star.
qualifiedStarTargets resolves the star to the exact (database, schema,
table) relations it names — via the relation alias when present, otherwise
the written catalog/schema/table path, restricted to base tables actually
in AccessTables — so it expands only that relation's columns, in order, and
distinguishes same-named tables in different schemas/catalogs.

Documented best-effort residual: a star over a CTE or derived relation
(SELECT * FROM cte / c.*) cannot be expanded correctly because omni does
not expose the resolved output projection (the prior ANTLR plugin had the
same gap); tracked as a follow-up.

Cross-agent reviewed (Claude + Codex) across several iterations; Codex
caught the alias-ending-in-".*" false positive, the dotted-star rejection,
and the same-named-table mis-target, all fixed here. Adds tests for the
alias / bare-table / schema-qualified star forms, same-named-table
disambiguation, and the aliased-not-a-star case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77f2f81c6c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/completion.go
buildCompletionCatalog fetched full metadata for every database returned by
ListDatabaseNames on each completion request. The LSP invokes completion on
every keystroke, and Trino instances routinely federate many catalogs, so a
single keystroke could fan out into hundreds of metadata loads and stall
completion — the legacy completer loaded catalog metadata on demand.

Register every catalog NAME cheaply (so catalog-level completion still lists
them all), but load full schema/table/column metadata only for catalogs the
current statement needs: the session default catalog, and any catalog named
in the statement text (a qualified catalog.schema.table reference). A
catalog name containing a double quote — the only identifier character Trino
re-spells when quoting — is loaded unconditionally so it is never missed.

Cross-agent reviewed (Claude + Codex). Adds a test asserting unreferenced
catalogs are not loaded and a referenced one is; existing completion
behaviour (catalog/schema/table/column/CTE contexts) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8655c536f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread backend/plugin/parser/trino/completion.go Outdated
buildCompletionCatalog created its own context.Background() instead of using
the Completion caller's context. The LSP metadata callbacks
(ListDatabaseNamesFunc / GetDatabaseMetadataFunc) read the workspace ID from
the context via common.GetWorkspaceIDFromContext, so dropping it made object
lookups resolve no databases and left LSP completion with keyword-only
candidates. Thread the caller's ctx through Completion ->
buildCompletionCatalog -> the metadata callbacks.

This is the fix recommended by the PR review bot; existing completion tests
still pass (the mock getter ignores ctx).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026

Copy link
Copy Markdown

@d-bytebase d-bytebase merged commit 2408050 into main Jun 5, 2026
16 checks passed
@d-bytebase d-bytebase deleted the feat/trino-omni-integrate branch June 5, 2026 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants