Skip to content

Conversation

@cakedev0
Copy link
Contributor

@cakedev0 cakedev0 commented Oct 27, 2025

Reference Issues/PRs

Fixes: #32585

What does this implement/fix? Explain your changes.

Fix the logic for the sub-sampling (only affect the case ignore_implicit_zeros=True)

Adds a non-regression test.

Any other comments?

While reviewing the PR, you might notice "behavior discontinuity" that this PR fixes also exists for the case ignore_implicit_zeros=False. But for this case, it doesn't incur unexpected/incorrect results, just unexpectedly long/short runtime. See issue [TODO] about that.

I haven't written a change-log as nobody reported the bug except me, and I found it just by reading the code... Let me know if this is needed.

@github-actions
Copy link

github-actions bot commented Oct 27, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 326d0bf. Link to the linter CI: here

@cakedev0 cakedev0 marked this pull request as ready for review October 27, 2025 19:34
@cakedev0 cakedev0 changed the title [WIP] FIX: Fix QuantileTransformer(ignore_implicit_zeros=True) sub-sampling behavior FIX: Fix QuantileTransformer(ignore_implicit_zeros=True) sub-sampling behavior Oct 27, 2025
@betatim
Copy link
Member

betatim commented Oct 29, 2025

I'd add a change log entry, even if no one has reported a bug people might notice a change in behaviour. The entry is the way we want people to answer that kind of question for themselves without posting here.

Fix looks good to me. Not thought about the test yet.

mdshoaibuddinchanda pushed a commit to mdshoaibuddinchanda/scikit-learn that referenced this pull request Nov 2, 2025
…bug (scikit-learn#32587)

Add deterministic regression test that constructs a very sparse CSC matrix with two nearly-identical columns (one with subsample - 1 nonzeros, the other with subsample + 1 nonzeros) and verifies that the transformed columns are not degenerate after fit/transform.

For large declared n_samples, the bug could create a zero-sized column sample leading to collapsed quantiles (all zeros) for some columns.

References: scikit-learn#32587
mdshoaibuddinchanda pushed a commit to mdshoaibuddinchanda/scikit-learn that referenced this pull request Nov 2, 2025
…eros=True

When ignore_implicit_zeros=True, use subsample directly instead of scaling by n_samples to avoid zero-sized column samples. Also add max(1) guard for the else case.

Fixes scikit-learn#32587
mdshoaibuddinchanda pushed a commit to mdshoaibuddinchanda/scikit-learn that referenced this pull request Nov 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: QuantileTransformer(ignore_implicit_zeros=True) sub-sampling behavior is incorrect

2 participants