Skip to content

Conversation

@chelsea-lin
Copy link
Contributor

Fixes internal issue 481740136 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Feb 4, 2026
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_selectstart branch from e085050 to f5a2276 Compare February 4, 2026 20:48
@chelsea-lin chelsea-lin marked this pull request as ready for review February 4, 2026 20:50
@chelsea-lin chelsea-lin requested review from a team as code owners February 4, 2026 20:50
Comment on lines +59 to +62
nodes.ScanItem(
identifiers.ColumnId(scan_item.source_id), scan_item.source_id
)
for scan_item in node.scan_list.items
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we really want is to order by the underlying physical schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When generating SQL, we should respect the original order of the selected columns and avoid reordering them to match the physical schema. While reordering can reveal more SELECT * optimizations when the query involves intermediate CTEs or subqueries, the current logic sorts the scan_list by the algebraic ordering of column names. This approach hides potential SELECT * optimizations, particularly in use cases like bpd.read_table("table_name")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this needs to be by physical schema id ideally, but I think it mostly doesn't matter when combined with other rewriters.

@property
def is_star_selection(self) -> bool:
physical_names = tuple(item.name for item in self.source.table.physical_schema)
scan_names = tuple(item.source_id for item in self.scan_list.items)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the scan list item id must equal source_id as well for this to work as intended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. It should be addressed in the new commit. Thanks.

@chelsea-lin chelsea-lin force-pushed the main_chelsealin_selectstart branch from f5a2276 to 15dad78 Compare February 5, 2026 22:52
@chelsea-lin chelsea-lin changed the title refactor: Enable SELECT * optimization when compiling read-table nodes into sqlglot refactor: Enable SELECT * optimization in the sqlglot compiler Feb 11, 2026
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_selectstart branch from 15dad78 to dd8825c Compare February 11, 2026 19:46
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_selectstart branch from dd8825c to a62598e Compare February 11, 2026 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants