Skip to content

ENH: add string bin support for multivariate histograms#31276

Open
JosephMehdiyev wants to merge 40 commits into
numpy:mainfrom
JosephMehdiyev:hist
Open

ENH: add string bin support for multivariate histograms#31276
JosephMehdiyev wants to merge 40 commits into
numpy:mainfrom
JosephMehdiyev:hist

Conversation

@JosephMehdiyev

@JosephMehdiyev JosephMehdiyev commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

PR summary

fixes #20215

  1. Adds multi dimensional bin width histogram algorithm(s)
    also generalizes existing bin width histogram algorithms to arrays (for dimensionality)
    majority of them will throw not implemented error when D>1
  2. Generalizes some helper functions to multi dimensions
  3. Adds string support for multi dimensional histograms

Other comments

I am not content with "auto" choice but it should be practical enough.

AI Disclosure

I used Claude as sanity check on my fixes. All modifications etc are done solely by my decisions and manually typed or copy pasted documentation from the related PR above.
Claude was used extensively on some parts of the code, especially on _get_bin_edges. Other than that I do not remember to be honest.

This will also add additional string arguments for `bin` for
`histogramdd`. Similarly, `histogram2d` will change.
Comment thread numpy/lib/_histograms_impl.py
Comment thread numpy/lib/_twodim_base_impl.py Outdated
@JosephMehdiyev

JosephMehdiyev commented Apr 19, 2026

Copy link
Copy Markdown
Contributor Author

Seems like I have focused on the old PR so much that I have missed so many things (other than stubs), please review after PR is ready for review

@JosephMehdiyev

JosephMehdiyev commented Apr 20, 2026

Copy link
Copy Markdown
Contributor Author

@jorenham (afaik you are the author of these stubs) is there a specific reason why overload stubs of histogram2d are more detailed (a couple of them even looks reduntant to me) and written differently than histogram and histogramdd? While I am here changing the file I might also rewrite it similar to histogram etc, unless these are intended

for example deleting this wont change things will succesfully run tests anyways (also true before PR changes)

@overload
def histogram2d(
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | Sequence[Sequence[int] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[np.int_]: ...

since this below already handles that case

def histogram2d[ScalarT: _Number_co](
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | _ArrayLike1D[ScalarT] | Sequence[_ArrayLike1D[ScalarT] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[ScalarT]: ...

I have no experience with .pyi files, I am mirroring my knowledge from .hpp and cpp templates FYI

@jorenham

jorenham commented Apr 20, 2026

Copy link
Copy Markdown
Member

@jorenham (afaik you are the author of these stubs) is there a specific reason why overload stubs of histogram2d are more detailed (a couple of them even looks reduntant to me) and written differently than histogram and histogramdd? While I am here changing the file I might also rewrite it similar to histogram etc, unless these are intended

for example deleting this wont change things

@overload
def histogram2d(
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | Sequence[Sequence[int] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[np.int_]: ...

since this below already handles that case

def histogram2d[ScalarT: _Number_co](
    x: _ArrayLike1DNumber_co,
    y: _ArrayLike1DNumber_co,
    bins: _BinKind | _ArrayLike1D[ScalarT] | Sequence[_ArrayLike1D[ScalarT] | _BinKind],
    range: _ArrayLike2DFloat_co | None = None,
    density: bool | None = None,
    weights: _ArrayLike1DFloat_co | None = None,
) -> _Histogram2D[ScalarT]: ...

I have no experience with .pyi files, I am mirroring my knowledge from .hpp and cpp templates FYI

The difference here is in the bins, where the first overload here handles int sequences, and the second ScalarT. The important bit is that ScalarT only accept np.number | np.bool, and would therefore reject int.

So these overloads are distinct and non-overlapping.

... or at least, that's what they should be. You added _BinKind to both, which makes them overlap in an incompatible way.

@JosephMehdiyev JosephMehdiyev marked this pull request as ready for review April 21, 2026 22:30
@JosephMehdiyev

Copy link
Copy Markdown
Contributor Author

Not sure the failing tests are PR related.
I did not do anything about #31296, let me know if I should add some documentation about it
PR should be ready to review now. I looked around the stuff I did and did not find any issues (hopefully I am right)

Comment thread numpy/lib/_histograms_impl.py Outdated
@JosephMehdiyev JosephMehdiyev requested a review from jorenham April 21, 2026 22:35
@JosephMehdiyev

Copy link
Copy Markdown
Contributor Author

hey, can someone give a feedback for this PR?

Comment thread numpy/lib/_histograms_impl.py Outdated
Comment thread numpy/lib/_twodim_base_impl.pyi Outdated
@JosephMehdiyev

JosephMehdiyev commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Other than minor fixes, is the fix PR good? Any big issues? Would like to finish this PR at this point

@jorenham

jorenham commented May 7, 2026

Copy link
Copy Markdown
Member

I had Claude take a look at the stubs, and it actually found some real issues:

complexfloating overloads of histogram2d were NOT updated (bug)

_twodim_base_impl.pyi — the two overloads for ScalarT: np.complexfloating still have bins: int | Sequence[int]:

@overload
def histogram2d[ScalarT: np.complexfloating](
    x: _ArrayLike1D[ScalarT],
    y: _ArrayLike1D[ScalarT | _Float_co],
    bins: int | Sequence[int] = 10,  # ← missing _BinKind
    ...

Since np.complexfloating is a subtype of np.inexact, the type checker matches these overloads first for complex arrays. If you call np.histogram2d(complex_arr, complex_arr, bins="auto"), it won't resolve to the complexfloating overload and will silently fall back to a less-specific one with Any in the return type — losing the ScalarT binding. These should also have int | _BinKind | Sequence[int | _BinKind].

The same applies to the Sequence[complex] overload around _twodim_base_impl.pyi — it also doesn't include _BinKind.

histogramdd stubs don't precisely type per-dimension string bins

For histogramdd, the updated signature is bins: _BinKind | SupportsIndex | ArrayLike. This means bins=['auto', 'fd'] (list of per-dimension estimators) is technically only covered through ArrayLike, which is semantically incorrect (it's for numeric arrays, not a list of method name strings). A more precise type would be:

bins: _BinKind | SupportsIndex | ArrayLike | Sequence[_BinKind | SupportsIndex | ArrayLike] = 10

This is a relatively minor precision issue but worth noting, since a mixed list like ['auto', 5, np.array([0,1,2])] is a valid input to histogramdd and wouldn't be well-typed.

Ordering inconsistency across files (minor, marked resolved)

The resolved review thread noted int | _BinKind vs _BinKind | int. Within _twodim_base_impl.pyi, int | _BinKind is now consistent. However, _histograms_impl.pyi uses _BinKind | SupportsIndex | ... (BinKind first). This cross-file inconsistency was not addressed.

@JosephMehdiyev

JosephMehdiyev commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Changes are because of #20215 (comment)
There are still a lot to do but the stone 'fd' and scott algorithms should work.

@jorenham jorenham self-requested a review June 5, 2026 12:34
@JosephMehdiyev JosephMehdiyev changed the title ENH: use _get_bin_edges() on histogramdd for consistency. ENH: add string bin support for multivariate histograms Jun 8, 2026
@JosephMehdiyev

JosephMehdiyev commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

stubs: The only things changed in stubs are that strings cannot be in array i.e "auto" is fine but not ["auto"] or ["auto", 2] and bins cannot be complex values. Complex values are unrelated to PR, but might as well fix it here.

  1. I could only generalize fd and scott algorithms as other algorithms do not have literature about N-D case. We could somewhat generalize others to D dimension by changing some variables to respect D, but it would be purely heuristic

  2. auto is tricky as it is not possibe to use the existing 1-D auto. fd or scott should be good enough for large dimensions because of the curse of dimensionality, but they may not work nicely on some cases in D=2 or D=3

  3. I tried to read papers and implement other multivariate bin width algorithms, these are generally maximize some kind of likelihood function, but the problem is that it is computationally expensive and the end result is not practically useful in my testing

@JosephMehdiyev JosephMehdiyev marked this pull request as ready for review June 9, 2026 13:21
@JosephMehdiyev JosephMehdiyev marked this pull request as draft June 9, 2026 13:38
@JosephMehdiyev JosephMehdiyev marked this pull request as ready for review June 9, 2026 16:44
@JosephMehdiyev

Copy link
Copy Markdown
Contributor Author

cc @jorenham see the above short comment about the stub changes, be free to review whenever, (if) you want

I will continue to update the documentation, tests and clean up some code too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: np.histogram2d() does not support argument 'auto' for bins.

3 participants