ENH: add string bin support for multivariate histograms by JosephMehdiyev · Pull Request #31276 · numpy/numpy

JosephMehdiyev · 2026-04-19T16:59:39Z

PR summary

fixes #20215

Adds multi-dimensional bin width histogram algorithm(s)
Also generalizes the existing bin width histogram algorithms to arrays (for dimensionality);
the majority of them will raise a not-implemented error when D > 1
Generalizes some helper functions to multiple dimensions
Adds string support for multi-dimensional histograms

Other comments

I am not content with "auto" choice but it should be practical enough.

AI Disclosure

~~I used Claude as sanity check on my fixes. All modifications etc are done solely by my decisions and manually typed or copy pasted documentation from the related PR above.~~
Claude was used extensively on some parts of the code, especially on _get_bin_edges. Other than that I do not remember to be honest.

This will also add additional string arguments for `bin` for `histogramdd`. Similarly, `histogram2d` will change.
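For readers on a NumPy release without this change, a rough approximation of per-dimension string bins is already possible, assuming each axis is treated independently: compute the edges per column with the existing 1-D `np.histogram_bin_edges` and pass them to `np.histogramdd`. This is only a sketch and not the algorithm in this PR, which generalizes the estimators to D dimensions. The helper name `histogramdd_with_1d_estimator` is made up for illustration.

```python
import numpy as np


def histogramdd_with_1d_estimator(sample, estimator="fd"):
    """Hypothetical helper: apply a 1-D bin estimator to each column independently."""
    sample = np.asarray(sample, dtype=float)
    # One edge array per dimension, using the existing 1-D string estimators.
    edges = [
        np.histogram_bin_edges(sample[:, j], bins=estimator)
        for j in range(sample.shape[1])
    ]
    return np.histogramdd(sample, bins=edges)


rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 2))
hist, edges = histogramdd_with_1d_estimator(sample, "fd")
print(hist.shape, int(hist.sum()))
```

Because every edge array spans the full range of its column, all 500 samples land in some bin.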

JosephMehdiyev · 2026-04-19T20:28:47Z

It seems I focused on the old PR so much that I missed many things (other than the stubs). Please review once the PR is marked ready for review.

Also, this fixes some obvious bugs that were introduced by the last modification of `histogramdd`.

JosephMehdiyev · 2026-04-20T14:30:28Z

@jorenham (afaik you are the author of these stubs) is there a specific reason why the overload stubs of histogram2d are more detailed (a couple of them even look redundant to me) and written differently from those of histogram and histogramdd? While I am here changing the file, I might also rewrite them to be similar to histogram etc., unless the differences are intended.

For example, deleting this ~~won't change things~~ will still let the tests run successfully (also true before the PR changes):

@overload
def histogram2d(
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | Sequence[Sequence[int] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[np.int_]: ...

since this below already handles that case

def histogram2d[ScalarT: _Number_co](
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | _ArrayLike1D[ScalarT] | Sequence[_ArrayLike1D[ScalarT] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[ScalarT]: ...

I have no experience with .pyi files; FYI, I am carrying over what I know from .hpp and .cpp templates.

jorenham · 2026-04-20T14:44:54Z


The difference here is in the bins: the first overload handles int sequences, and the second handles ScalarT. The important bit is that ScalarT only accepts np.number | np.bool, and would therefore reject int.

So these overloads are distinct and non-overlapping.

... or at least, that's what they should be. You added _BinKind to both, which makes them overlap in an incompatible way.
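The point about `int` can be checked at runtime: a plain Python `int` is not an instance of `np.number`, so a type variable bound to NumPy scalar types cannot bind to it. The sketch below uses a made-up function `first_edge` and a simplified `ScalarT` bound (the real stubs use private aliases such as `_Number_co` and `_ArrayLike1D`), so treat it as an illustration of the pattern rather than the actual stub.

```python
from collections.abc import Sequence
from typing import TypeVar, overload

import numpy as np
import numpy.typing as npt

# Simplified stand-in for the stub's type variable (assumption).
ScalarT = TypeVar("ScalarT", bound=np.number)


@overload
def first_edge(bins: Sequence[int]) -> int: ...
@overload
def first_edge(bins: npt.NDArray[ScalarT]) -> ScalarT: ...
def first_edge(bins):
    """Return the first bin edge; the overloads above describe the two input kinds."""
    return bins[0]


# Python ints are not NumPy scalars, so the two parameter types stay distinct.
print(isinstance(3, np.number))            # False
print(isinstance(np.int64(3), np.number))  # True
print(type(first_edge([1, 2, 3])))         # <class 'int'>
```

If the same string alias is added to the `bins` parameter of both overloads, a call such as `bins="auto"` matches both of them, which is the overlap described above.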

JosephMehdiyev · 2026-04-21T22:33:21Z

Not sure the failing tests are PR related.
I did not do anything about #31296, let me know if I should add some documentation about it
PR should be ready to review now. I looked around the stuff I did and did not find any issues (hopefully I am right)

JosephMehdiyev · 2026-04-30T21:39:25Z

Hey, can someone give feedback on this PR?

JosephMehdiyev · 2026-05-07T08:12:35Z

Other than minor fixes, is the ~~fix~~ PR good? Any big issues? Would like to finish this PR at this point

jorenham · 2026-05-07T08:43:44Z

I had Claude take a look at the stubs, and it actually found some real issues:

`complexfloating` overloads of `histogram2d` were NOT updated (bug)

_twodim_base_impl.pyi — the two overloads for ScalarT: np.complexfloating still have bins: int | Sequence[int]:

@overload
def histogram2d[ScalarT: np.complexfloating](
    x: _ArrayLike1D[ScalarT],
    y: _ArrayLike1D[ScalarT | _Float_co],
    bins: int | Sequence[int] = 10,  # ← missing _BinKind
    ...

Since np.complexfloating is a subtype of np.inexact, the type checker matches these overloads first for complex arrays. If you call np.histogram2d(complex_arr, complex_arr, bins="auto"), it won't resolve to the complexfloating overload and will silently fall back to a less-specific one with Any in the return type — losing the ScalarT binding. These should also have int | _BinKind | Sequence[int | _BinKind].

The same applies to the Sequence[complex] overload around _twodim_base_impl.pyi — it also doesn't include _BinKind.

`histogramdd` stubs don't precisely type per-dimension string bins

For histogramdd, the updated signature is bins: _BinKind | SupportsIndex | ArrayLike. This means bins=['auto', 'fd'] (list of per-dimension estimators) is technically only covered through ArrayLike, which is semantically incorrect (it's for numeric arrays, not a list of method name strings). A more precise type would be:

bins: _BinKind | SupportsIndex | ArrayLike | Sequence[_BinKind | SupportsIndex | ArrayLike] = 10

This is a relatively minor precision issue but worth noting, since a mixed list like ['auto', 5, np.array([0,1,2])] is a valid input to histogramdd and wouldn't be well-typed.
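For reference, current NumPy releases already accept a per-dimension mix of a bin count and an explicit edge array, which is the part of the mixed-list case that does not involve strings. A small check, with made-up sample data:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 2.0, size=(100, 2))

# First dimension: 5 equal-width bins; second dimension: explicit edges.
hist, edges = np.histogramdd(sample, bins=[5, np.array([0.0, 1.0, 2.0])])
print(hist.shape)       # (5, 2)
print(int(hist.sum()))  # 100
```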

Ordering inconsistency across files (minor, marked resolved)

The resolved review thread noted int | _BinKind vs _BinKind | int. Within _twodim_base_impl.pyi, int | _BinKind is now consistent. However, _histograms_impl.pyi uses _BinKind | SupportsIndex | ... (BinKind first). This cross-file inconsistency was not addressed.


JosephMehdiyev · 2026-06-05T12:13:51Z

Changes are because of #20215 (comment)
There is still a lot to do, but the ~~stone~~ 'fd' and 'scott' algorithms should work.


JosephMehdiyev · 2026-06-08T18:18:26Z

stubs: The only changes in the stubs are that strings cannot appear inside an array (i.e. "auto" is fine, but not ["auto"] or ["auto", 2]) and that bins cannot be complex values. The complex values are unrelated to this PR, but I might as well fix them here.
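The two stub rules can be mirrored by a small runtime check. The function `classify_bins` below is hypothetical and not part of NumPy or of this PR; it only illustrates which inputs the stubs are meant to accept, assuming sequences are given as lists or tuples.

```python
import numpy as np


def classify_bins(bins):
    """Hypothetical validator mirroring the stub rules described above."""
    if isinstance(bins, str):
        # A bare estimator name such as "auto" is allowed.
        return "estimator name"
    items = bins if isinstance(bins, (list, tuple)) else [bins]
    for item in items:
        if isinstance(item, str):
            raise TypeError("string estimators are not allowed inside a sequence")
        if np.iscomplexobj(item):
            raise TypeError("bins cannot be complex")
    return "counts or edges"


print(classify_bins("auto"))                      # estimator name
print(classify_bins([5, np.array([0.0, 1.0])]))   # counts or edges
```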

I could only generalize the fd and scott algorithms, as there is no literature on the N-D case for the other algorithms. We could somewhat generalize the others to D dimensions by changing some variables to respect D, but it would be purely heuristic.
auto is tricky, as it is not possible to reuse the existing 1-D auto. fd or scott should be good enough for large dimensions because of the curse of dimensionality, but they may not work nicely in some cases for D=2 or D=3.
I tried to read papers and implement other multivariate bin width algorithms. These generally maximize some kind of likelihood function, but they are computationally expensive and the end result was not practically useful in my testing.
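As an illustration of what an N-D generalization can look like, here is a minimal sketch of Scott's normal reference rule in its commonly quoted multivariate form, h_k = 3.5 * sigma_k * n^(-1/(2+d)). It is not the code from this PR: the function names are made up, the constant 3.5 is the rounded textbook value, and the edges simply span each column's minimum to maximum.

```python
import numpy as np


def scott_widths_nd(sample):
    """Per-dimension bin widths h_k = 3.5 * sigma_k * n**(-1 / (2 + d))."""
    sample = np.asarray(sample, dtype=float)
    n, d = sample.shape
    return 3.5 * sample.std(axis=0) * n ** (-1.0 / (2 + d))


def edges_from_widths(sample, widths):
    """Turn widths into equal-width edges spanning each column's [min, max]."""
    sample = np.asarray(sample, dtype=float)
    edges = []
    for column, h in zip(sample.T, widths):
        lo, hi = column.min(), column.max()
        # Use a single bin when the width or the data range is degenerate.
        n_bins = int(np.ceil((hi - lo) / h)) if h > 0 and hi > lo else 1
        edges.append(np.linspace(lo, hi, n_bins + 1))
    return edges


rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 3))
widths = scott_widths_nd(sample)
hist, _ = np.histogramdd(sample, bins=edges_from_widths(sample, widths))
print(widths.shape, hist.shape, int(hist.sum()))
```

For d = 1 the exponent reduces to -1/3, which matches the usual 1-D form of the rule. The exponent shrinks in magnitude as d grows, so the bins get wider relative to the data, which is one way to see the curse of dimensionality mentioned above.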

JosephMehdiyev · 2026-06-09T16:50:44Z

cc @jorenham see the short comment above about the stub changes; feel free to review whenever (if) you want

I will continue to update the documentation, tests and clean up some code too.

JosephMehdiyev added 2 commits April 19, 2026 16:56

ENH: use _get_bin_edges for histogramdd

ba315ac

This will also add additional string arguments for `bin` for `histogramdd`. Similarly, `histogram2d` will change.

DOC: add clarification about str values of bins

e88e640

jorenham reviewed Apr 19, 2026


Comment thread numpy/lib/_histograms_impl.py

Comment thread numpy/lib/_twodim_base_impl.py Outdated

JosephMehdiyev added 4 commits April 19, 2026 21:49

TYP: fixed and made types consistent between histograms.

adab35a

ENH: histogram2d logic is handled in histogramdd now.

a0083db

Also, this fixes some obvious bugs that was introduced from the last modification of `histogramdd`.

TST: added new tests for the new features histogramdd can handle.

20ea56c

STY: fix lint errors

76d5e09

JosephMehdiyev force-pushed the hist branch from f0e36ac to 76d5e09 Compare April 20, 2026 12:51

JosephMehdiyev force-pushed the hist branch from d88bf8e to 25bd4cd Compare April 20, 2026 16:49

TYP: redo types of histogram2d without major changes.

82c1419

JosephMehdiyev force-pushed the hist branch from 25bd4cd to 82c1419 Compare April 20, 2026 16:54

JosephMehdiyev added 2 commits April 20, 2026 17:57

TST: add type tests for new bins values.

46f9444

DOC: add release notes

5be6d24

JosephMehdiyev force-pushed the hist branch from 8891c9d to 5be6d24 Compare April 20, 2026 17:38

JosephMehdiyev mentioned this pull request Apr 21, 2026

DOC: histogram2d behaves ambigiously on specific cases on bins #31296

Closed

TST: Add more tests for the new string 'bins'.

c0078c2

JosephMehdiyev marked this pull request as ready for review April 21, 2026 22:30

JosephMehdiyev commented Apr 21, 2026


Comment thread numpy/lib/_histograms_impl.py Outdated

JosephMehdiyev requested a review from jorenham April 21, 2026 22:35

DOC: modify warning message

f678cba

jorenham reviewed Apr 30, 2026


Comment thread numpy/lib/_histograms_impl.py Outdated

Comment thread numpy/lib/_twodim_base_impl.pyi Outdated

ENH: make code more readable

633aa9c

JosephMehdiyev force-pushed the hist branch from 9e3d590 to 633aa9c Compare May 3, 2026 17:32

JosephMehdiyev added 5 commits June 5, 2026 10:51

ENH: Generalize Freedman-Diaconis for multi dimensions.

b1a10df

ENH: generalization of _get_bin_edges for multi dimensional string bin algorithms.

dd126db

ENH: Naive implementation of _get_bin_edges that is sort of complete.

8ccdcec

TST: Update a test.

d141c76

ENH: Fix some error handling issues.

fd19970

JosephMehdiyev added 2 commits June 5, 2026 13:15

STY: Linter issues

f2b36c5

TYP: Change the stubs of the histograms for string support.

3028c09

jorenham self-requested a review June 5, 2026 12:34

JosephMehdiyev added 9 commits June 5, 2026 13:40

TYP/TST: Remove some tests after the proper implementation of the string bins.

bc806e3

DOC: Change histogram2d bin documentation.

b338ba9

DOC: Revert a part of the documentation to the old one.

8697988

ENH: Remove reduntant line from fd after the change in _get_bin_edges.

7852f13

MNT: Replace M with N.

c7059d8

ENH: simplify range handling on _get_outer_edges

9131bf3

ENH: Simplify _get_bin_edges and fix an error handling

ef9c72b

ENH: Generalize all the bin width algorithms to arrays.

f82890e

STY: linter fix

b6d25f8

JosephMehdiyev changed the title ~~ENH: use _get_bin_edges() on histogramdd for consistency.~~ ENH: add string bin support for multivariate histograms Jun 8, 2026

TYP: Make sure string bins in arrays are rejected.

1936726

JosephMehdiyev added 4 commits June 8, 2026 19:30

TYP: bins cannot be complex numbers.

7dd6b01

TST/TYP: Change the bin test values to only possible strings.

ca4b3fa

ENH: Write a simple "auto" bin width algorithm for N-D.

6db9f91

STY: fix linter errors

eaaa7dd

JosephMehdiyev marked this pull request as ready for review June 9, 2026 13:21

JosephMehdiyev marked this pull request as draft June 9, 2026 13:38

JosephMehdiyev marked this pull request as ready for review June 9, 2026 16:44

DOC: Update histogram_bin_edges doc for D dimensions.

5d9ae84


3 participants
