feat(trino): migrate Trino parser plugin from ANTLR to omni#20517
Conversation
Cut over backend/plugin/parser/trino + backend/plugin/schema/trino from the
legacy ANTLR parser (github.com/bytebase/parser/trino) to the hand-written omni
parser (github.com/bytebase/omni/trino/*), following the doris precedent. The
legacy bytebase/parser/trino import is fully removed.
The 6 registered P0 handlers are now omni-backed:
- SplitSQL → trino/parser.Split
- parseTrinoStatements → trino/parser.Parse
- Diagnose → trino/parser.Diagnose
- validateQuery/queryType → omni AST inspection + analysis.Classify
- GetQuerySpan → trino/analysis.GetQuerySpan (+ metadata column expansion)
- GetDatabaseDefinition → trino/deparse.GetDatabaseDefinition
- Completion → trino/completion.Complete
Deleted the ANTLR-only query_span_listener.go / query_span_predicate.go and the
~1700-line CodeCompletionCore completer. omni bumped to a main version that
includes all trino packages.
Documented divergences from the legacy parser:
- Completion: the omni completer is intentionally not a c3 port — it emits
{Text,Type} candidates only (no Definition/Priority), and column candidates
are the in-scope FROM columns resolved via query-span analysis.
- Query span: built from omni analysis table access, expanded to column-level
SourceColumns via bytebase metadata; per-result + predicate columns preserved.
Feature differential green: go test ./backend/plugin/{parser,schema}/trino/...
Whole backend builds; doris (another omni engine) unaffected by the omni bump.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5abec33f40
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
CI: - golangci-lint: drop unused method receivers + an unused test param (0 issues). - SonarCloud duplication: use the shared base.NewByteOffsetPositionMapper for byte->position conversion in split.go/diagnose.go/trino.go and extract the duplicated column-match loop into addMatchingColumns; net -91 lines and these files no longer duplicate the doris adapter's local helpers. Correctness (Claude + Codex cross-family review): - EXPLAIN span unwrap returned empty SourceColumns: ast.NodeLoc only handles ast-package nodes, not parser statement nodes, so the inner query never unwrapped. Use nodeSpan (QueryStmt.Loc / Span()). - Security: EXPLAIN ANALYZE executes its inner statement but getQueryType reported it read-only unconditionally, so EXPLAIN ANALYZE <DML> passed the read-only SQL-editor gate. getQueryType now recurses into the inner statement (oracle-confirmed: Trino 481 runs it), so EXPLAIN ANALYZE UPDATE is rejected. Tracked as follow-up tasks: aliased per-column masking; SELECT * result-column order (positional masking); WITH FUNCTION read-only classification; quoted- identifier case sensitivity in column matching. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parseTrinoStatements and parseTrinoSQL shared a near-identical per-segment parse loop (split -> parse each -> wrap omniAST -> convert first error), which SonarCloud flagged as duplicated against the doris adapter's equivalent two functions (7.4% duplication on new code, gate <=3%). Extract the common logic into parseSegment(stmt) (*omniAST, error), so the two callers become thin loops with distinct bodies. dupl now reports 0 clone groups at threshold 50 and 70. Behavior is preserved (verified by cross-agent review): empty segments yield a nil-AST placeholder in parseTrinoStatements and are skipped in parseTrinoSQL; the first parse error returns immediately via convertParseError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
go.mod pins github.com/bytebase/omni at 20260604221654-107b8b4db5bf, but go.sum still carried the leftover hash lines for the prior 20260604130724 revision (not referenced by any module), so `go mod tidy && git diff --exit-code` failed in the go-tests CI job. Remove the two orphaned lines. go mod tidy produces no other changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c12d7b239
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…rate # Conflicts: # go.mod # go.sum
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f68d44d8fa
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Addresses three masking-correctness issues raised in review of the query span extractor, each in the data-leak (under-mask) direction. The masker applies per-result maskers positionally against the executed result's column order, so lineage order/coverage is security-relevant. 1. SELECT * column order: the star was expanded by ranging a Go map, so result columns came out in nondeterministic order and the positional masker could mask the wrong column. Expansion now uses an ordered slice built in FROM-table then metadata-column order. The ordered slice is NOT deduplicated, so a self-join (FROM users u1 JOIN users u2) emits each instance's columns and stays aligned with Trino's 2N output columns. 2. System / information_schema queries: column expansion resolved Trino's system pseudo-catalogs as Bytebase databases, producing a ResourceNotFoundError that sql_service turns into a hard "failed to mask data" rejection of valid SQL-info queries. System tables are now detected per resolved table (catalog=="system" or schema=="information_schema"), not by a query-text substring scan, so a "system." string literal cannot suppress masking of a real table in the same statement. 3. Aliased columns: an alias-qualified select item (u.email over "users u") was compared against physical table names and dropped, leaving empty lineage so the masker treated a sensitive column as a constant. Alias resolution is now additive (alias -> all physical tables it maps to; match accepts the written name or any mapped table), which resolves reused/shadowed aliases across subquery scopes by over-including (masks more) instead of under-matching. Both fixes were cross-agent reviewed (Claude + Codex); Codex caught the self-join, substring-bypass, and alias-scope edge cases above, which this revision resolves. One documented best-effort residual remains: a mixed SELECT * over a system-table-joined-with-real-table can misalign positions because system tables have no enumerable column metadata (tracked as a follow-up; such queries are already classified SelectInfoSchema). Adds direct extractor tests for column order, self-join column count, aliased and shadowed-alias lineage, the system-substring non-bypass, and the no-NotFoundError behaviour for system queries. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 549f651ec2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d51f297fb2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
A qualified star select item such as `SELECT u.* FROM users u` is named "u.*" by omni, so it was not matched by the unqualified "*" expansion and fell through to the column-name path, producing empty SourceColumns. The masker then treated those (possibly sensitive) columns as constants and returned them unmasked. buildResults now detects a genuine qualified star by its source-ref shape (Name ends ".*" and the single source ref's Column equals the qualifier), not just the display name — so a column merely aliased to a quoted name ending in ".*" (e.g. SELECT email AS "u.*") is NOT mistaken for a star. qualifiedStarTargets resolves the star to the exact (database, schema, table) relations it names — via the relation alias when present, otherwise the written catalog/schema/table path, restricted to base tables actually in AccessTables — so it expands only that relation's columns, in order, and distinguishes same-named tables in different schemas/catalogs. Documented best-effort residual: a star over a CTE or derived relation (SELECT * FROM cte / c.*) cannot be expanded correctly because omni does not expose the resolved output projection (the prior ANTLR plugin had the same gap); tracked as a follow-up. Cross-agent reviewed (Claude + Codex) across several iterations; Codex caught the alias-ending-in-".*" false positive, the dotted-star rejection, and the same-named-table mis-target, all fixed here. Adds tests for the alias / bare-table / schema-qualified star forms, same-named-table disambiguation, and the aliased-not-a-star case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77f2f81c6c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
buildCompletionCatalog fetched full metadata for every database returned by ListDatabaseNames on each completion request. The LSP invokes completion on every keystroke, and Trino instances routinely federate many catalogs, so a single keystroke could fan out into hundreds of metadata loads and stall completion — the legacy completer loaded catalog metadata on demand. Register every catalog NAME cheaply (so catalog-level completion still lists them all), but load full schema/table/column metadata only for catalogs the current statement needs: the session default catalog, and any catalog named in the statement text (a qualified catalog.schema.table reference). A catalog name containing a double quote — the only identifier character Trino re-spells when quoting — is loaded unconditionally so it is never missed. Cross-agent reviewed (Claude + Codex). Adds a test asserting unreferenced catalogs are not loaded and a referenced one is; existing completion behaviour (catalog/schema/table/column/CTE contexts) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8655c536f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
buildCompletionCatalog created its own context.Background() instead of using the Completion caller's context. The LSP metadata callbacks (ListDatabaseNamesFunc / GetDatabaseMetadataFunc) read the workspace ID from the context via common.GetWorkspaceIDFromContext, so dropping it made object lookups resolve no databases and left LSP completion with keyword-only candidates. Thread the caller's ctx through Completion -> buildCompletionCatalog -> the metadata callbacks. This is the fix recommended by the PR review bot; existing completion tests still pass (the mock getter ignores ctx). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|



Cuts over bytebase's Trino parser from the legacy ANTLR parser (
github.com/bytebase/parser/trino) to the hand-written omni parser (github.com/bytebase/omni/trino/*) — the final step of the Trino omni migration (17 omni nodes already merged to bytebase/omni). Follows the already-shipped doris precedent.What changed
backend/plugin/parser/trino+backend/plugin/schema/trinorewired; the legacybytebase/parser/trinoimport is gone; the ANTLR-onlyquery_span_listener.go/query_span_predicate.goand the ~1700-line CodeCompletionCore completer are deleted. omni bumped to a main version with all trino packages.SplitSQLtrino/parser.SplitparseTrinoStatementstrino/parser.ParseDiagnosetrino/parser.DiagnosevalidateQuery/ query-typeanalysis.ClassifyGetQuerySpan(masking/lineage)trino/analysis.GetQuerySpan+ metadata column expansionGetDatabaseDefinitiontrino/deparse.GetDatabaseDefinitionCompletiontrino/completion.CompleteVerification
go build ./backend/...✅ ·go vet✅ ·gofmtclean ✅go test ./backend/plugin/{parser,schema}/trino/...✅ (the feature differential — existing trino tests pass)Documented divergences from the legacy parser (please review)
{Text, Type}candidates only (noDefinition/Priority), and column candidates are the in-scope FROM columns (query-span-resolved) rather than the legacy "every column in the default schema, ranked". Tests updated to the omni candidate model.analysis.GetQuerySpanreturns table access + result columns; the extractor expands tables to column-levelSourceColumnsvia bytebase metadata and rebuilds per-result + predicate columns. A couple ofquery_spantest expectations were adjusted to omni's (oracle-validated) behavior — review the diffs.Reviewer focus: the query-span
SourceColumnsmapping (masking input) and the completion candidate-model change.🤖 Generated with Claude Code