Fix correlated subquery empty defaults for regr_count and approx_distinct #22319
Draft
nathanb9 wants to merge 1 commit into
Which issue does this PR close?
regr_count, approx_distinct in correlated subquery #22317.
Rationale for this change
Correlated scalar subqueries with ungrouped aggregates are decorrelated into joins. For unmatched outer rows, the rewritten join naturally produces NULLs on the right side, so DataFusion has compensation logic for aggregates that should return a non-NULL value on empty input.
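To make the rewrite concrete, here is a minimal sketch in plain Rust. It is a toy model, not DataFusion code: the function names, the use of `Option<u64>` for SQL NULL, and the single integer join key are all assumptions made for illustration. It compares a per-row evaluation of a correlated `count` with the decorrelated "aggregate once, then left join" form, and shows where the NULL for unmatched outer rows comes from.

```rust
use std::collections::HashMap;

/// Toy model: `None` stands for SQL NULL.
type Nullable = Option<u64>;

/// Reference semantics: evaluate the correlated subquery
/// (a count of inner rows whose key equals the outer key) once per outer row.
/// An outer row with no matching inner rows sees an empty input, so count is 0.
fn correlated_count(outer: &[i32], inner: &[i32]) -> Vec<Nullable> {
    outer
        .iter()
        .map(|k| Some(inner.iter().filter(|i| *i == k).count() as u64))
        .collect()
}

/// Decorrelated form: aggregate the inner side once per key, then left-join
/// the outer rows against that result. Outer keys with no group get NULL,
/// because the join has nothing to put on the right side.
fn decorrelated_count(outer: &[i32], inner: &[i32]) -> Vec<Nullable> {
    let mut grouped: HashMap<i32, u64> = HashMap::new();
    for k in inner {
        *grouped.entry(*k).or_insert(0) += 1;
    }
    outer.iter().map(|k| grouped.get(k).copied()).collect()
}

/// Compensation: replace the join's NULL with the aggregate's empty-input value.
/// Simplification: this toy cannot tell an unmatched row apart from a matched
/// row whose aggregate is genuinely NULL; a real planner has to.
fn compensate(joined: Vec<Nullable>, empty_default: Nullable) -> Vec<Nullable> {
    joined.into_iter().map(|v| v.or(empty_default)).collect()
}
```

With outer keys `[1, 2, 3]` and inner keys `[1, 1, 3]`, the decorrelated form yields NULL for key `2`, and only after `compensate(..., Some(0))` does it agree with the per-row evaluation.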
The compensation logic previously special-cased count by name. As a result, other aggregates with non-NULL empty-input results, such as regr_count and approx_distinct, incorrectly returned NULL after decorrelation.
What changes are included in this PR?
This PR updates decorrelation to use each aggregate UDF's default_value() instead of hard-coding count.
It also adds empty-input defaults for:
regr_count: UInt64(0)
approx_distinct: UInt64(0)
Regression coverage is added for correlated scalar subqueries using these aggregates in projection expressions and filters.
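The difference between the name-based check and the per-aggregate lookup can be sketched as follows. This is a simplified, hypothetical model: the `Aggregate` trait, the `Scalar` enum, and both helper functions are invented for illustration and do not reproduce DataFusion's actual trait or method signatures. The `Int64` type used for `count` is likewise only illustrative; the `UInt64(0)` values for `regr_count` and `approx_distinct` are the ones stated in this PR.

```rust
/// Toy stand-in for a scalar value; `Null` models SQL NULL.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    Int64(i64),
    UInt64(u64),
}

/// Hypothetical, simplified aggregate trait (not DataFusion's real signature).
trait Aggregate {
    fn name(&self) -> &str;
    /// Value the aggregate produces on empty input; NULL unless overridden.
    fn default_value(&self) -> Scalar {
        Scalar::Null
    }
}

struct Count;
struct RegrCount;
struct ApproxDistinct;
struct Sum;

impl Aggregate for Count {
    fn name(&self) -> &str { "count" }
    // Result type chosen for illustration only.
    fn default_value(&self) -> Scalar { Scalar::Int64(0) }
}
impl Aggregate for RegrCount {
    fn name(&self) -> &str { "regr_count" }
    fn default_value(&self) -> Scalar { Scalar::UInt64(0) }
}
impl Aggregate for ApproxDistinct {
    fn name(&self) -> &str { "approx_distinct" }
    fn default_value(&self) -> Scalar { Scalar::UInt64(0) }
}
impl Aggregate for Sum {
    fn name(&self) -> &str { "sum" }
    // Keeps the trait default: sum over empty input is NULL.
}

/// Old behavior as described above: only `count` is recognized by name.
fn empty_default_by_name(agg: &dyn Aggregate) -> Scalar {
    if agg.name() == "count" { Scalar::Int64(0) } else { Scalar::Null }
}

/// New behavior as described above: ask the aggregate for its own default.
fn empty_default_by_trait(agg: &dyn Aggregate) -> Scalar {
    agg.default_value()
}
```

In this model, `empty_default_by_name(&RegrCount)` yields `Scalar::Null` while `empty_default_by_trait(&RegrCount)` yields `Scalar::UInt64(0)`. Aggregates that keep the default, such as the toy `Sum`, are unaffected by the switch.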
Are these changes tested?
Yes.
cargo fmt --all
cargo test -p datafusion-sqllogictest --test sqllogictests -- subquery.slt
Are there any user-facing changes?
Yes. Queries using regr_count or approx_distinct in correlated scalar subqueries now return 0 for unmatched outer rows instead of NULL, matching the aggregate behavior on empty input.