Remove unnecessary dtype conversion from pairwise_distances_argmin_* #32511

IgnacioJPickering · 2025-10-15T15:40:30Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In pairwise_distances_argmin_*, an initial call to check_pairwise_arrays(...) used the default argument dtype="infer_float". For boolean metrics this triggered an unnecessary conversion to float64, even when the arrays were originally bool. When the arrays were then forwarded to pairwise_distances, a second call to check_pairwise_arrays(...) cast them back to bool, and this triggered a DataConversionWarning.
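
To make the round trip concrete, here is a minimal sketch that mimics the two casts with plain NumPy. It does not call scikit-learn at all; the variable names are made up for illustration, and the two `astype` calls only stand in for what the two checks do to the dtype.

```python
import numpy as np

# Boolean input, as a caller would pass for a metric such as "jaccard".
X = np.array([[True, False, True], [False, False, True]])

# First check: an "infer_float"-style cast promotes bool to float64.
X_float = X.astype(np.float64)

# Second check: a boolean metric casts the data back to bool.
X_bool_again = X_float.astype(bool)

print(X_float.dtype)       # float64
print(X_bool_again.dtype)  # bool
# The round trip is lossless, so both casts were avoidable work.
print(np.array_equal(X, X_bool_again))  # True
```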

I've added a new utility function _find_floating_or_bool_dtype_allow_sparse(X, Y, metric, xp) which works in an equivalent way to _find_floating_dtype_allow_sparse but is metric-aware, and returns bool for boolean metrics.
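
As a rough illustration of what "metric-aware" means here, the sketch below picks the dtype argument from the metric name. The helper name `choose_check_dtype` and the three-element metric set are assumptions made for this example; they are not the helper added in this PR, and the full list of boolean metrics lives in scikit-learn.

```python
# Illustrative subset of boolean metric names (assumption); the real
# list is defined inside scikit-learn and is not reproduced here.
BOOLEAN_METRICS = frozenset({"jaccard", "dice", "yule"})


def choose_check_dtype(metric):
    """Return the dtype argument a pre-check could use for `metric`.

    Hypothetical helper: boolean metrics keep bool data as bool, every
    other metric (including callables) falls back to float inference.
    """
    if isinstance(metric, str) and metric in BOOLEAN_METRICS:
        return bool
    return "infer_float"


print(choose_check_dtype("jaccard"))    # <class 'bool'>
print(choose_check_dtype("euclidean"))  # infer_float
```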

Additionally, I've factored out the warnings into another helper function to reduce duplication.

@pushkar-hue

github-actions · 2025-10-15T15:41:15Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_Generated for commit: e1afdbd._

pushkar-hue · 2025-10-15T16:05:00Z

Hi @IgnacioJPickering,

Thanks for your offer to add my tests. I'd appreciate it! This is the test that I wrote. It confirms that pairwise_distances_argmin no longer raises a warning for boolean inputs. I also extended it to cover pairwise_distances_argmin_min.


```python
import warnings

import numpy as np

from sklearn.exceptions import DataConversionWarning
from sklearn.metrics import pairwise_distances_argmin, pairwise_distances_argmin_min


def test_pairwise_argmin_no_warning_for_bool():
    """
    Check that no DataConversionWarning is raised for boolean metric
    when the data is already boolean.
    Regression test for #32495.
    """
    # Create boolean input arrays.
    X = np.ones((5, 5), dtype=np.bool_)
    Y = np.ones((5, 5), dtype=np.bool_)
    # Call the function within a warning-catching context.
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always", DataConversionWarning)
        pairwise_distances_argmin(X, Y, metric="jaccard")
        # Check that the list of caught warnings is empty.
        assert len(w) == 0, "A DataConversionWarning was incorrectly raised."


def test_pairwise_argmin_min_no_warning_for_bool():
    """
    Check that no DataConversionWarning is raised for boolean metric
    when the data is already boolean.
    Regression test for #32495.
    """
    # Create boolean input arrays.
    X = np.ones((5, 5), dtype=np.bool_)
    Y = np.ones((5, 5), dtype=np.bool_)
    # Call the function within a warning-catching context.
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always", DataConversionWarning)
        pairwise_distances_argmin_min(X, Y, metric="jaccard")
        # Check that the list of caught warnings is empty.
        assert len(w) == 0, "A DataConversionWarning was incorrectly raised."
```
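
One possible variation on the pattern above, sketched with the standard library only: escalate a single warning category to an error instead of counting recorded warnings. `ConversionWarning`, `quiet`, and `noisy` are stand-ins invented for this example (the real tests would use `DataConversionWarning` and the scikit-learn functions).

```python
import warnings


class ConversionWarning(UserWarning):
    """Stand-in for DataConversionWarning (assumption for this sketch)."""


def quiet(data):
    # Does its work without warning.
    return list(data)


def noisy(data):
    # Emits the warning category we want to detect.
    warnings.warn("Data was converted", ConversionWarning)
    return list(data)


def raises_conversion_warning(func, data):
    """Return True if calling func(data) emits a ConversionWarning."""
    with warnings.catch_warnings():
        # Escalate only this category to an exception; other warning
        # categories keep whatever filters were already in place.
        warnings.simplefilter("error", ConversionWarning)
        try:
            func(data)
        except ConversionWarning:
            return True
    return False


print(raises_conversion_warning(quiet, [1, 0]))  # False
print(raises_conversion_warning(noisy, [1, 0]))  # True
```

A possible advantage of this form is that an unrelated warning from another library would not make the check fail, whereas `len(w) == 0` counts every recorded warning regardless of category.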


IgnacioJPickering · 2025-10-15T17:21:06Z

@pushkar-hue I added you as a collaborator to my fork, I believe that may be the easiest way for you to push the tests there.


pushkar-hue · 2025-10-15T18:36:07Z

@IgnacioJPickering I have just pushed the commit with the tests. Please take a look and let me know if there's anything I need to change. Thanks again for the collaboration!

ogrisel

Thanks for the fix @IgnacioJPickering and @pushkar-hue. Could you please add a changelog entry for this?

See instructions in https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md for details.

pushkar-hue · 2025-10-17T14:11:32Z

Hi @ogrisel, The changelog has been added. The PR should be ready for final review now. Thanks!

ogrisel · 2025-10-17T14:59:09Z

@pushkar-hue can you have a look at the codecov report:

https://app.codecov.io/gh/scikit-learn/scikit-learn/pull/32511?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=scikit-learn

and see if there is an easy way to cover those lines by a small extension to the tests (e.g. testing with Python lists)?

pushkar-hue · 2025-10-17T15:24:29Z

I really apologize for the confusion. @IgnacioJPickering had already added those tests for this very reason, but when I added the changelog and tried to clean up my commit history, the test commit was somehow removed. I have re-added those tests, which should address the codecov warning. Again, I apologize for my mistake.


IgnacioJPickering · 2025-10-18T00:04:32Z

@ogrisel no problem, I've reinstated the correct tests.


doc/whats_new/upcoming_changes/sklearn.metrics/32511.fix.rst

sklearn/metrics/tests/test_pairwise.py

sklearn/metrics/pairwise.py

lucyleeow · 2025-10-28T03:38:00Z

sklearn/metrics/pairwise.py

+            Y is not None and not xp.isdtype(Y.dtype, "bool")
+        ):
+            msg = f"Data was converted to boolean for metric {metric}"
+            warnings.warn(msg, DataConversionWarning)


Why not let pairwise_distances give this warning? The concern here is also that we issue this warning before we actually do the conversion.

You're right. I looked into it, and it looks like @IgnacioJPickering just moved the existing logic inside the helper function.

The old code in pairwise_distances had a manual block that issued this exact warning:

    dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else "infer_float"

    if dtype is bool and (X.dtype != bool or (Y is not None and Y.dtype != bool)):
        msg = "Data was converted to boolean for metric %s" % metric
        warnings.warn(msg, DataConversionWarning)

    X, Y = check_pairwise_arrays(
        X, Y, dtype=dtype, ensure_all_finite=ensure_all_finite
    )

    # precompute data-derived metric params
    params = _precompute_metric_params(X, Y, metric=metric, **kwds)
    kwds.update(**params)

    if effective_n_jobs(n_jobs) == 1 and X is Y:
        return distance.squareform(distance.pdist(X, metric=metric, **kwds))
    func = partial(distance.cdist, metric=metric, **kwds)

it is now moved into the helper functions and used as such:

    X, Y, dtype = _find_dtype_for_check_pairwise_arrays(X, Y, metric)
    X, Y = check_pairwise_arrays(
        X, Y, dtype=dtype, ensure_all_finite=ensure_all_finite
    )

    # precompute data-derived metric params
    params = _precompute_metric_params(X, Y, metric=metric, **kwds)
    kwds.update(**params)

    if effective_n_jobs(n_jobs) == 1 and X is Y:
        return distance.squareform(distance.pdist(X, metric=metric, **kwds))
    func = partial(distance.cdist, metric=metric, **kwds)
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

We are actually issuing the warning after the conversion, as it was before the refactor.

@lucyleeow I understand the concern; it's not great that the warning is about what a different function does. (I believe what @pushkar-hue means is that this is how it was done in the code before too, but I see it's not optimal.)

The reason we can't delegate the warning to pairwise_distances is that this function may not get called if ArgKmin is usable for the metric.

Do you think it would be better to write a wrapper _check_pairwise_arrays_for_metric(...)? This would find the dtype, raise the warning and also call check_pairwise_arrays which is what ultimately does the conversion.
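
A minimal sketch of what such a wrapper could look like, written against plain NumPy rather than scikit-learn. Everything here is hypothetical: `check_arrays_for_metric` is not the function from the PR, `UserWarning` stands in for `DataConversionWarning`, and the metric set is an illustrative subset.

```python
import warnings

import numpy as np

BOOLEAN_METRICS = frozenset({"jaccard", "dice", "yule"})  # illustrative subset


def check_arrays_for_metric(X, Y, metric):
    """Hypothetical wrapper: pick the dtype, convert, and warn in one place."""
    X = np.asarray(X)
    Y = X if Y is None else np.asarray(Y)
    is_boolean_metric = metric in BOOLEAN_METRICS
    target = np.dtype(bool) if is_boolean_metric else np.dtype(np.float64)
    # Remember whether a boolean metric received non-boolean data.
    converted = is_boolean_metric and (X.dtype != target or Y.dtype != target)
    X_out = X.astype(target, copy=False)
    Y_out = Y.astype(target, copy=False)
    if converted:
        # Warn only after the conversion has actually happened.
        # UserWarning stands in for DataConversionWarning here.
        warnings.warn(
            f"Data was converted to boolean for metric {metric}", UserWarning
        )
    return X_out, Y_out
```

Keeping the dtype choice, the cast, and the warning in one function means the warning always describes a conversion that the same function performed.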

@lucyleeow I went ahead and did this, since I thought it was cleaner. I also clarified the comment and made it a bit more precise. Hopefully things are much clearer now.

> The reason we can't delegate the warning to pairwise_distances is because that function may not get called if ArgKmin is usable for the metric.

Aren't bool metrics specifically excluded from ArgKmin?

scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py

Lines 66 to 78 in ce27c87

    def valid_metrics(cls) -> List[str]:
        excluded = {
            # PyFunc cannot be supported because it necessitates interacting with
            # the CPython interpreter to call user defined functions.
            "pyfunc",
            "mahalanobis",  # is numerically unstable
            # In order to support discrete distance metrics, we need to have a
            # stable simultaneous sort which preserves the order of the indices
            # because there generally is a lot of occurrences for a given values
            # of distances in this case.
            # TODO: implement a stable simultaneous_sort.
            "hamming",
            *BOOL_METRICS,

The issue is that check_pairwise_arrays has to be called before the ArgKmin.is_usable_for(...)

Why is this a problem?

I don't understand https://github.com/scikit-learn/scikit-learn/pull/32511/files#r2485450646: check_pairwise_arrays would no longer be called at all for PAIRWISE_BOOLEAN metrics: so error messages about inconsistent shapes would not be raised when passing invalid inputs for boolean metrics.

@ogrisel check_pairwise_arrays would still be called, since for boolean metrics the function would delegate to pairwise_distances_chunked, which itself delegates to pairwise_distances, which calls check_pairwise_arrays. I recognize it is a bit confusing though, but I think the logic checks out.

@lucyleeow It is not a problem, but we must first filter for boolean arrays. Calling check_pairwise_arrays in these functions without first filtering for boolean metrics is the bug that is currently in main.

By default, if check_pairwise_arrays is called with infer_float it converts the arrays to float unconditionally, with no warning, which is not necessary since the arrays are bool to begin with.

Afterwards a second check in pairwise_distances checks the dtypes of the arrays, converts the arrays back to bool, and raises a conversion warning. This means there are 2 casts and 1 warning where there should have been none.

If I filter for PAIRWISE_BOOLEAN first then I can call check_pairwise_arrays only in the case that ArgKmin may be called, and delegate the rest of the checks to pairwise_distances_chunked.
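
The ordering described above can be sketched as a toy dispatcher that only records which steps would run. This is an assumption-laden illustration: `argmin_dispatch`, the metric sets, and the step labels are invented here, and the real usability check depends on more than the metric name.

```python
BOOLEAN_METRICS = frozenset({"jaccard", "dice", "yule"})  # illustrative subset


def argmin_dispatch(metric, fast_path_metrics):
    """Return the ordered list of steps a toy dispatcher would take."""
    steps = []
    if metric not in BOOLEAN_METRICS:
        # Validate early only when the fast reduction might run.
        steps.append("check_pairwise_arrays(dtype='infer_float')")
        if metric in fast_path_metrics:
            steps.append("ArgKmin")
            return steps
    # Boolean metrics skip the early float cast; validation and any
    # conversion warning are left to the downstream function.
    steps.append("pairwise_distances_chunked")
    return steps


print(argmin_dispatch("jaccard", {"euclidean"}))
# ['pairwise_distances_chunked']
print(argmin_dispatch("euclidean", {"euclidean"}))
# ["check_pairwise_arrays(dtype='infer_float')", 'ArgKmin']
```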

@lucyleeow @ogrisel
From the comments in the PR I believe this, together with a comment specifying why check_pairwise_arrays is being called early in the case that ArgKmin may be called, is preferable, since it seems to me the code is still hard to understand.

OK, I've modified the code so that the checks are delegated to pairwise_distances_chunked. I believe this should get rid of the confusion.

The TypeError is raised only in the case where sparse arrays are forwarded to pairwise_distances, which is a single place in the code, the same place where the warning is raised, and the cast is performed.

This only required an if check around the check_pairwise_arrays call, which is now only performed if the metric is not a PAIRWISE_BOOLEAN metric, so we avoid the unnecessary cast that currently happens in main. I added a comment for extra clarity.

In the end I think my initial fix was overly complicated; this gets rid of the issue with minimal modifications.

sklearn/metrics/tests/test_pairwise.py

sklearn/metrics/pairwise.py


lucyleeow

One comment but otherwise looks good.

@ogrisel I think this may be worth a second review from you as the code has changed quite a bit.

sklearn/metrics/pairwise.py


Remove unnecessary dtype conversion from pairwise_distances_argmin_*

5f70cc6

github-actions bot added the module:metrics label Oct 15, 2025

IgnacioJPickering mentioned this pull request Oct 15, 2025

Prevent warning for bool dtypes in pairwise_argmin #32510

Closed

Fix typo

79a82c6

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

9af8675

IgnacioJPickering and others added 4 commits October 15, 2025 13:32

Fix incorrect unpacking of return value

932ed5b

Lint

b0ce86d

test for pairwise_distances_argmin and pairwise_distances_argmin_min for incorrect dtype conversion

4a87908

ruff formating

40514e3

IgnacioJPickering added 3 commits October 15, 2025 17:16

Make test more comprehensive for code coverage

36364ef

Fix test

eb745aa

Fix test

c068c82

ogrisel approved these changes Oct 17, 2025


changelog entry

21bbfbf

pushkar-hue force-pushed the fix/ipickering/remove-incorrect-dtype-conversions branch from 2ec3d1e to 21bbfbf Compare October 17, 2025 14:06

ogrisel added the Quick Review and Waiting for Second Reviewer labels Oct 17, 2025

readded parameterized test

1c4f655

IgnacioJPickering added 2 commits October 17, 2025 19:25

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

0e40a94

Reinstate

78d1b2f

IgnacioJPickering added 2 commits October 17, 2025 20:10

Fix

5e24431

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

7ffaa79

lucyleeow added the Array API label Oct 27, 2025

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

5d1b8aa

lucyleeow removed the Quick Review label Oct 28, 2025

lucyleeow reviewed Oct 28, 2025


pushkar-hue and others added 10 commits October 28, 2025 09:27

Update doc/whats_new/upcoming_changes/sklearn.metrics/32511.fix.rst

ee4b85f

Co-authored-by: Lucy Liu <jliu176@gmail.com>

Update sklearn/metrics/tests/test_pairwise.py

9dc8368

Co-authored-by: Lucy Liu <jliu176@gmail.com>

used msg variable for warnings

9f830a1

Update sklearn/metrics/pairwise.py

f1f3b9c

Co-authored-by: Lucy Liu <jliu176@gmail.com>

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

268fafc

Expand comment

c583c1d

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

80441a8

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

d3fbdb9

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

2b2f1ce

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

0a26f34

ogrisel moved this to Todo in Array API Oct 30, 2025

ogrisel added this to Array API Oct 30, 2025

ogrisel moved this from Todo to In Progress in Array API Oct 30, 2025

pushkar-hue and others added 7 commits October 30, 2025 22:03

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

59cb3c1

Add name of metric

b3e32fd

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

8833dc1

Merge branch 'fix/ipickering/remove-incorrect-dtype-conversions' of github.com:IgnacioJPickering/scikit-learn into fix/ipickering/remove-incorrect-dtype-conversions

e32309a

Make comments more clear

c7ecbee

Revert change of data conversion

6166727

trigger ci

a51f9a3

lucyleeow reviewed Nov 3, 2025


ogrisel reviewed Nov 3, 2025


IgnacioJPickering added 4 commits November 3, 2025 13:54

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

cbd7b20

trigger ci

c1f56f4

Merge branch 'fix/ipickering/remove-incorrect-dtype-conversions' of github.com:IgnacioJPickering/scikit-learn into fix/ipickering/remove-incorrect-dtype-conversions

4a92662

Merge branch 'main' into fix/ipickering/remove-incorrect-dtype-conversions

e1afdbd

