ENH: New-style object sorting with descending support and NaN handling #31431

Merged: seberg merged 26 commits into numpy:main from MaanasArora:object-sorts on Jun 9, 2026

MaanasArora commented May 14, 2026

Contributor

Addresses part of #31423. Adds sorting ArrayMethods for object that support descending=True and new NaN-handling logic using templating. Treats any object such that obj != obj as NaN and sorts those to the end. ping @seberg, thanks!
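
A minimal Python sketch of the ordering described above, for illustration only. It assumes NaN-like objects stay at the end for both ascending and descending sorts; `is_nan_like` and `sort_nan_last` are hypothetical names, not functions from this PR.

```python
def is_nan_like(obj):
    """Treat any object that is not equal to itself as NaN-like."""
    return bool(obj != obj)


def sort_nan_last(values, descending=False):
    """Return a sorted list with NaN-like objects moved to the end."""
    regular = [v for v in values if not is_nan_like(v)]
    nan_like = [v for v in values if is_nan_like(v)]
    # Assumption: NaN-like objects go last regardless of sort direction.
    regular.sort(reverse=descending)
    return regular + nan_like
```

For example, `sort_nan_last([3, float("nan"), 1, 2])` places `1, 2, 3` first and the NaN last.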

I had to include a sentinel guard for out-of-bounds partitioning in quicksort because object comparisons can be unsafe, hopefully the constexpr avoids any performance deficit (probably will). The docs are a bit drafty maybe but should be ready for initial review at least!

AI Disclosure

I used LLMs a fair bit for debugging code snippets (much of which went nowhere, but they caught the out-of-bounds issue :))

MaanasArora changed the title from "ENH: New-style object sorting with NaN handling" to "ENH: New-style object sorting with descending support and NaN handling" on May 14, 2026

seberg left a comment

Member

Thanks, nice start and looks nice and simple for now!

> I had to include a sentinel guard for out-of-bounds partitioning in quicksort because object comparisons can be unsafe, hopefully the constexpr avoids any performance deficit (probably will).

Hmmmm, I am a bit surprised, is this for objects that return obj < obj == True?
This seems fine. I am wondering, though, whether simply hard-coding object identity to mean equal from a sorting perspective wouldn't make sense, since I think it solves this issue just as well.

The biggest churn will be seeing if we can handle errors gracefully... I think that would be really nice, but it might be annoying (it requires threading error handling through every npy::cmp call...).

Comment thread numpy/_core/src/common/numpy_tag.h Outdated
int isnan_a = isnan(a);
int isnan_b = isnan(b);
if (isnan_a < 0 || isnan_b < 0) {
return 0;

Member

We can rewrite this a bit, I think:

  • If LT returns true, we can swap (no further checks needed)
  • Since this is an || we don't have to evaluate both isnan() most of the time.
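
The two points above could be sketched in Python as follows. This is a hypothetical helper meant only to show the evaluation order, not the C++ code in numpy_tag.h.

```python
def less_nan_last(a, b):
    """Return True if a should sort before b, with NaN-like objects last."""
    if a < b:
        # An ordinary less-than result needs no NaN checks at all.
        return True
    # a < b was false, so a only sorts first if b is NaN-like and a is not.
    # `and` short-circuits: a != a is evaluated only when b is NaN-like.
    return bool(b != b) and not bool(a != a)
```

In the common case where neither operand is NaN-like, at most one `!=` check runs per comparison.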

Contributor Author

Yes makes sense, thanks! I pushed a refactor.

Comment thread numpy/_core/src/common/numpy_tag.h Outdated
static int less(PyObject *a, PyObject *b)
{
/*
* work around gh-3879, we cannot abort an in-progress quicksort

Member

Well, we are not using the compare function here now, so we should be able to do this, although we'll have to rely on the compiler to optimize the error path away in the other ones.

It is annoying, I admit: right now cmp() returns true/false, and once it can also return an error, all call sites will have to deal with that.

Can you see how that pans out -- because this is one actual advantage we have here?
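
As a rough illustration of what "all call sites will have to deal with that" means, here is a Python sketch of a comparison that reports failure through its return value and a sort loop that checks it. The -1/0/1 convention mirrors CPython's `PyObject_RichCompareBool`; `cmp_lt` and `insertion_sort` are hypothetical names and are not taken from this PR.

```python
def cmp_lt(a, b):
    """Return 1 if a < b, 0 if not, and -1 if the comparison raised."""
    try:
        return 1 if a < b else 0
    except Exception:
        return -1


def insertion_sort(items):
    """Sort in place; return 0 on success or -1 if a comparison failed."""
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        while j >= 0:
            res = cmp_lt(current, items[j])
            if res < 0:
                # Put the held element back into the open slot so the list
                # still contains every original object exactly once.
                items[j + 1] = current
                return -1
            if not res:
                break
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current
    return 0
```

Every loop that calls the comparison needs the extra `res < 0` branch, which is the churn being discussed.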

Contributor Author

Looking at this now! There are a fair number of call sites, but I agree that it makes sense to allow threading errors through.

Comment thread numpy/_core/src/common/numpy_tag.h Outdated
static int isnan(PyObject *a) {
if (a == NULL) {
return 1;
}

Member

This is a bit annoying. Ideally, we should just translate NULL to Py_None at some point, since that is what NumPy generally does (it shouldn't normally happen though).
I think that means returning False here, though? (But we also would need the treatment earlier in the less/greater)

(In practice it doesn't matter, but I guess swapping should work with the original raw values and preserve NULL for refcounting reasons.)

Contributor Author

Yes, thanks - I just moved this NULL check to the top of less and greater if that works? There is some more duplication, perhaps we can even just make a _cmp function that takes in a compare op... (as less and greater only differ in the op)?

seberg commented May 14, 2026

Member

> Treats any object such that obj != obj as NaN and sorts those to the end.

Just to write it down, as also mentioned yesterday: I think this is perfectly good even if it doesn't allow NA yet. [1]

Footnotes

  1. Nathan mentioned that for pandas NA support the StringDType actually allows errors to pass because pandas bool(NA) is an error. I don't mind trying to invent a pattern that makes pandas work, but I don't want to start with try/except on a hot path, and I am not sure there are other patterns that work by knowing that (NA < other) is NA. I suspect something may be workable, but one has to be exceedingly careful since e.g. (False != False) is False as well.

MaanasArora commented May 15, 2026

Contributor Author

> Hmmmm, I am a bit surprised, is this for objects that return obj < obj == True?

I think poor orderings, e.g. intransitive comparisons, are the reason it is unsafe - and they can be present in user code of course. We do actually have tests that fail (usually even segfault) if this check is excluded - they mostly seem to explicitly check if poor orderings work.
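
A small Python sketch of why an inconsistent ordering makes an unguarded partition scan unsafe. It assumes the guard behaves roughly like a bounds check on the scan; the function names are hypothetical. In C the unguarded version would read past the end of the buffer, while in Python it surfaces as an `IndexError`.

```python
import operator


def scan_unguarded(items, start, pivot, less):
    """Advance while items[i] < pivot, trusting the comparison to stop the scan."""
    i = start
    while less(items[i], pivot):
        i += 1
    return i


def scan_guarded(items, start, stop, pivot, less):
    """Same scan, but never step past `stop` even if `less` misbehaves."""
    i = start
    while i < stop and less(items[i], pivot):
        i += 1
    return i


def always_less(a, b):
    """A deliberately broken ordering: everything is less than everything."""
    return True
```

With a consistent ordering such as `operator.lt`, the scan stops when it reaches the pivot value; with `always_less`, only the guarded version stops.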

> The biggest churn will be seeing if we can't handle error gracefully...

grep -r "npy::cmp" reveals there are exactly 100 occurrences of npy::cmp in the code! I guess we're excluding mergesorts, but that's just seven (so still 93). Most of them are inlined, so it is hard to return on error. But I agree a refactor would be very nice, even for user dtypes if we ever export (similar) templates. Perhaps a macro is a reasonable compromise here? I asked Claude to dig for this in CPython, which seems to use one:

#define IFLT(X, Y) if ((k = ISLT(X, Y)) < 0) goto fail;  \
           if (k)

(https://github.com/python/cpython/blob/461b1d96313de02992d284c1782be9aff24586c9/Objects/listobject.c#L1715-L1716)

seberg commented May 15, 2026

Member

Yeah it isn't great... I really dislike Macros that include a return, but maybe it makes sense here?
Although, I guess even then you can't have the return inside the if, so hmmmm...
But I really would prefer to not do this error checking dance :/.

MaanasArora commented May 15, 2026

Contributor Author

Thanks, pushed a macro NPY_CMP and handled any error propagation through call sites! I think it turned out a bit nice even.

Edit: never mind, it seems statement expressions don't work with MSVC. I'm going to revert, sorry! Let me try to do a full refactor without a macro.

Edit: full refactor with error handling done! Doesn't look as pretty anymore, but perhaps more explicit for some loops anyway... :/ I left the string ones (and mergesort of course) unchanged.

seberg left a comment

Member

Thanks, FWIW, as much as it is too bad that this adds so many lines of code to propagate errors we don't need for most types, I think this is the right path.
At least unless we want to avoid implementing a specialized sort for object.

FWIW, in my quick timings this is around 25% faster than what we currently have for object dtype (both random and already sorted -- and for random we may add the isnan check).
Doesn't matter all that much, but maybe it is a nice bonus. (It may be cool to add object to the benchmarks for this, but doesn't have to be here.)

Comment thread numpy/_core/src/common/numpy_tag.hpp
Comment thread numpy/_core/src/npysort/quicksort.hpp Outdated
MaanasArora force-pushed the object-sorts branch 2 times, most recently from 12a530d to b9f04dd on May 25, 2026 07:15
MaanasArora commented May 25, 2026

Contributor Author

Thanks, I just rebased with main (to include the new sort benchmarks) and added object to the benchmarks! Here are the results, though I'll post a re-run (and include argsort) because they were a bit flaky:

Sort Benchmarks
Change Before [0e18dd2] After [5edfd04] <object-sorts~1> Ratio Benchmark (Parameter)
+ 25.5±0.2ms 37.9±0.2ms 1.48 bench_function_base.Sort.time_sort(True, False, 'object', ('ordered',))
+ 14.9±0.1ms 19.4±2ms 1.3 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 1000))
+ 13.6±0.06ms 17.2±0.06ms 1.27 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 10))
+ 13.6±0.03ms 17.1±0.06ms 1.26 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 10))
+ 19.0±0.2ms 23.2±0.8ms 1.22 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 100))
+ 14.6±0.08ms 17.3±0.1ms 1.19 bench_function_base.Sort.time_sort(False, True, 'float16', ('reversed',))
+ 11.2±0.02ms 13.4±0.04ms 1.19 bench_function_base.Sort.time_sort(False, True, 'float32', ('reversed',))
+ 14.6±0.09ms 17.3±0.04ms 1.18 bench_function_base.Sort.time_sort(False, False, 'float16', ('uniform',))
+ 5.41±0.1ms 6.29±0.3ms 1.16 bench_function_base.Sort.time_sort(False, True, 'int64', ('ordered',))
+ 8.20±0.1ms 9.14±0.4ms 1.11 bench_function_base.Sort.time_sort(True, True, 'float16', ('reversed',))
+ 15.7±0.2ms 17.3±0.5ms 1.1 bench_function_base.Sort.time_sort(False, False, 'float16', ('ordered',))
+ 29.1±0.7ms 32.1±0.7ms 1.1 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 10))
+ 6.21±0.04ms 6.77±0.03ms 1.09 bench_function_base.Sort.time_sort(False, True, 'float32', ('ordered',))
+ 36.9±0.1ms 39.9±0.06ms 1.08 bench_function_base.Sort.time_sort(False, True, 'float16', ('sorted_block', 1000))
+ 54.8±0.1ms 58.5±0.2ms 1.07 bench_function_base.Sort.time_sort(False, False, 'int16', ('random',))
+ 45.8±0.1ms 48.8±0.2ms 1.07 bench_function_base.Sort.time_sort(False, True, 'float16', ('sorted_block', 10))
+ 7.94±0.06ms 8.52±0.05ms 1.07 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 1000))
+ 39.0±0.4ms 41.3±0.2ms 1.06 bench_function_base.Sort.time_sort(False, False, 'float16', ('sorted_block', 1000))
+ 40.8±0.08ms 43.4±0.09ms 1.06 bench_function_base.Sort.time_sort(False, False, 'int16', ('sorted_block', 10))
+ 45.2±0.03ms 48.1±0.1ms 1.06 bench_function_base.Sort.time_sort(False, False, 'int16', ('sorted_block', 100))
+ 37.5±0.07ms 39.7±0.03ms 1.06 bench_function_base.Sort.time_sort(False, False, 'int16', ('sorted_block', 1000))
+ 33.0±0.2ms 35.1±0.7ms 1.06 bench_function_base.Sort.time_sort(True, False, 'object', ('uniform',))
+ 5.71±0.02ms 6.04±0.07ms 1.06 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 1000))
+ 12.5±0.02ms 13.2±0.2ms 1.05 bench_function_base.Sort.time_sort(False, True, 'float32', ('uniform',))
- 6.27±0.09ms 5.95±0.08ms 0.95 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 10))
- 12.7±0.05ms 12.0±0.02ms 0.95 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 1000))
- 4.97±0.05ms 4.70±0.04ms 0.95 bench_function_base.Sort.time_sort(False, False, 'uint8', ('ordered',))
- 82.6±1ms 78.6±0.2ms 0.95 bench_function_base.Sort.time_sort(True, False, 'float16', ('random',))
- 677±4ms 641±3ms 0.95 bench_function_base.Sort.time_sort(True, False, 'object', ('random',))
- 5.34±0.1ms 5.02±0.07ms 0.94 bench_function_base.Sort.time_sort(False, False, 'bool', ('ordered',))
- 5.28±0.01ms 4.95±0.03ms 0.94 bench_function_base.Sort.time_sort(False, False, 'bool', ('uniform',))
- 5.29±0.02ms 4.98±0.06ms 0.94 bench_function_base.Sort.time_sort(False, False, 'uint8', ('uniform',))
- 5.31±0.02ms 4.99±0.01ms 0.94 bench_function_base.Sort.time_sort(False, True, 'bool', ('reversed',))
- 14.7±0.02ms 13.8±0.05ms 0.94 bench_function_base.Sort.time_sort(False, True, 'float16', ('uniform',))
- 5.29±0.05ms 4.94±0.03ms 0.94 bench_function_base.Sort.time_sort(False, True, 'uint8', ('reversed',))
- 10.8±0.1ms 10.1±0.1ms 0.94 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 100))
- 5.28±0.02ms 4.92±0.02ms 0.93 bench_function_base.Sort.time_sort(False, False, 'bool', ('reversed',))
- 5.29±0.03ms 4.94±0.01ms 0.93 bench_function_base.Sort.time_sort(False, False, 'int8', ('reversed',))
- 5.32±0.03ms 4.96±0.02ms 0.93 bench_function_base.Sort.time_sort(False, True, 'bool', ('uniform',))
- 5.03±0.04ms 4.69±0.03ms 0.93 bench_function_base.Sort.time_sort(False, True, 'uint8', ('ordered',))
- 6.45±0.1ms 5.91±0.04ms 0.92 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 10))
- 17.8±2ms 16.4±0.1ms 0.92 bench_function_base.Sort.time_sort(True, True, 'float64', ('sorted_block', 100))
- 6.06±0.1ms 5.49±0.04ms 0.91 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 100))
- 5.44±0.1ms 4.98±0.05ms 0.91 bench_function_base.Sort.time_sort(False, True, 'uint8', ('uniform',))
- 647±40μs 590±4μs 0.91 bench_function_base.Sort.time_sort(True, True, 'int32', ('reversed',))
- 8.82±0.2ms 7.98±0.1ms 0.91 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 1000))
- 5.22±0.05ms 4.69±0.02ms 0.9 bench_function_base.Sort.time_sort(False, False, 'int8', ('ordered',))
- 386±10μs 349±0.8μs 0.9 bench_function_base.Sort.time_sort(False, False, 'uint32', ('uniform',))
- 18.6±0.2ms 16.3±0.1ms 0.87 bench_function_base.Sort.time_sort(True, True, 'object', ('sorted_block', 10))
- 8.80±0.1ms 7.58±0.7ms 0.86 bench_function_base.Sort.time_sort(False, True, 'int32', ('reversed',))
- 19.4±0.2ms 16.6±0.3ms 0.86 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 10))
- 26.2±2ms 22.2±0.1ms 0.85 bench_function_base.Sort.time_sort(True, True, 'float64', ('sorted_block', 10))
- 16.2±0.07ms 12.6±0.03ms 0.78 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 10))
- 20.9±0.07ms 16.0±0.1ms 0.77 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 10))
- 16.1±0.04ms 12.4±0.07ms 0.77 bench_function_base.Sort.time_sort(True, True, 'uint32', ('sorted_block', 10))
- 20.9±0.3ms 16.0±0.2ms 0.76 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 10))
- 718±9ms 541±10ms 0.75 bench_function_base.Sort.time_sort(False, False, 'object', ('random',))
- 26.1±0.7ms 17.6±0.4ms 0.68 bench_function_base.Sort.time_sort(True, False, 'object', ('reversed',))
- 344±0.6ms 202±3ms 0.59 bench_function_base.Sort.time_sort(False, False, 'object', ('ordered',))
- 628±1ms 373±1ms 0.59 bench_function_base.Sort.time_sort(False, False, 'object', ('reversed',))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Overall, object sorts are indeed faster, though there's a regression for ordered stable sorts for object; perhaps error handling is hurting there on a particular code path. There is some other flakiness too: there seem to be some regressions, but smaller sorted blocks are faster?

seberg commented Jun 1, 2026

Member

The object sort for already sorted input will have a regression because of the additional check for NaN. I.e., to check whether we are already sorted, it effectively uses !cmp(b, a), so not b < a. But for object that means we do at least one additional check for b != b (or a, not sure).
We could consider changing that by using a <= b and avoiding the not, at the cost of changing the actual comparison operator to include equality, plus slightly more complexity since we would need a new helper, cmp_eq or so...
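
A rough Python illustration of where the extra work comes from, assuming the NaN-aware comparison has the short-circuit shape suggested earlier in this review. `Counted`, `less_nan_last`, and `is_sorted` are hypothetical names used only to count `!=` calls.

```python
class Counted:
    """Wrapper that counts how often != is evaluated."""

    ne_calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

    def __ne__(self, other):
        Counted.ne_calls += 1
        return self.value != other.value


def less_nan_last(a, b):
    """NaN-aware less-than: NaN-like objects sort last."""
    if a < b:
        return True
    return bool(b != b) and not bool(a != a)


def is_sorted(items):
    """Already-sorted check written as not cmp(b, a) for each adjacent pair."""
    return all(
        not less_nan_last(items[k + 1], items[k])
        for k in range(len(items) - 1)
    )
```

On sorted input every `<` in the check returns false, so each adjacent pair falls through to one `!=` evaluation on top of the `<` call.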

@charris I was hoping you would chime in briefly on the API here, although I guess we use the != logic already in the NaN functions, so I don't see much problem. Maybe you have an opinion about the already sorted regression too.

MaanasArora commented Jun 5, 2026

Contributor Author

Thanks, sorry for the delay, I checked the benchmarks and they seemed to be stable, so this was a real regression. Yes, I flipped the not in the timsort by adding greater_equal and using a cmp_eq helper (except for strings); I only did this for the already sorted check. It didn't turn out too complex actually, thanks for the suggestion! The benchmarks have improved for object, but there are 3-4x regressions for float16 uniform/ordered:

Sort Benchmarks
Change Before [a20ef19] After [c77fbf5] Ratio Benchmark (Parameter)
+ 1.03±0.01ms 4.60±0.05ms 4.46 bench_function_base.Sort.time_sort(True, False, 'float16', ('uniform',))
+ 1.12±0.05ms 4.20±0.06ms 3.73 bench_function_base.Sort.time_sort(True, False, 'float16', ('ordered',))
+ 14.7±0.07ms 17.3±2ms 1.17 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 1000))
+ 12.1±0.1ms 13.9±0.1ms 1.15 bench_function_base.Sort.time_sort(False, True, 'float64', ('reversed',))
+ 15.2±0.3ms 17.3±0.09ms 1.14 bench_function_base.Sort.time_sort(False, False, 'float16', ('uniform',))
+ 708±4μs 807±40μs 1.14 bench_function_base.Sort.time_sort(True, False, 'uint32', ('reversed',))
+ 15.0±0.3ms 16.9±0.3ms 1.13 bench_function_base.Sort.time_sort(False, True, 'float16', ('reversed',))
+ 3.91±0.03ms 4.43±0.3ms 1.13 bench_function_base.Sort.time_sort(True, True, 'int16', ('reversed',))
+ 28.9±0.5ms 32.4±0.8ms 1.12 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 10))
+ 18.9±0.1ms 21.2±1ms 1.12 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 100))
+ 5.59±0.1ms 6.14±0.07ms 1.1 bench_function_base.Sort.time_sort(True, True, 'uint32', ('sorted_block', 1000))
+ 5.65±0.03ms 6.14±0.08ms 1.09 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 1000))
+ 16.2±0.3ms 17.5±0.6ms 1.08 bench_function_base.Sort.time_sort(True, False, 'float16', ('sorted_block', 1000))
+ 885±7μs 952±10μs 1.08 bench_function_base.Sort.time_sort(True, False, 'float32', ('uniform',))
+ 1.05±0.01ms 1.12±0.04ms 1.07 bench_function_base.Sort.time_sort(True, False, 'uint8', ('sorted_block', 100))
+ 7.61±0.04ms 8.15±0.04ms 1.07 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 100))
+ 13.2±0.3ms 14.0±0.5ms 1.06 bench_function_base.Sort.time_sort(True, False, 'float64', ('sorted_block', 1000))
+ 347±3μs 367±2μs 1.06 bench_function_base.Sort.time_sort(True, False, 'int16', ('ordered',))
+ 76.8±0.6ms 81.3±0.6ms 1.06 bench_function_base.Sort.time_sort(True, True, 'float64', ('random',))
- 23.6±3ms 22.4±0.06ms 0.95 bench_function_base.Sort.time_sort(False, False, 'uint8', ('sorted_block', 100))
- 6.52±0.1ms 6.18±0.04ms 0.95 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 10))
- 13.7±0.06ms 13.0±0.1ms 0.95 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 10))
- 15.7±0.09ms 14.8±0.09ms 0.94 bench_function_base.Sort.time_sort(False, True, 'float16', ('ordered',))
- 15.0±0.1ms 14.1±0.1ms 0.94 bench_function_base.Sort.time_sort(False, True, 'float16', ('uniform',))
- 5.31±0.02ms 5.02±0.03ms 0.94 bench_function_base.Sort.time_sort(False, True, 'int8', ('reversed',))
- 5.30±0.04ms 4.98±0.02ms 0.94 bench_function_base.Sort.time_sort(False, True, 'uint8', ('reversed',))
- 12.6±0.07ms 11.9±0.03ms 0.94 bench_function_base.Sort.time_sort(False, True, 'uint8', ('sorted_block', 1000))
- 10.7±0.3ms 10.1±0.08ms 0.94 bench_function_base.Sort.time_sort(True, True, 'object', ('sorted_block', 100))
- 5.98±0.2ms 5.50±0.04ms 0.92 bench_function_base.Sort.time_sort(False, False, 'float32', ('sorted_block', 100))
- 6.21±0.1ms 5.69±0.04ms 0.92 bench_function_base.Sort.time_sort(False, False, 'int32', ('sorted_block', 10))
- 5.14±0.04ms 4.72±0.02ms 0.92 bench_function_base.Sort.time_sort(False, True, 'int8', ('ordered',))
- 8.99±0.03ms 8.26±0.1ms 0.92 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 100))
- 8.78±0.04ms 8.10±0.1ms 0.92 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 100))
- 11.0±1ms 9.83±0.1ms 0.9 bench_function_base.Sort.time_sort(False, False, 'bool', ('random',))
- 6.82±3ms 6.10±0.06ms 0.9 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 10))
- 25.0±2ms 22.5±0.2ms 0.9 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 10))
- 6.07±0.2ms 5.44±0.05ms 0.9 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 1000))
- 5.83±0.1ms 5.22±0.1ms 0.9 bench_function_base.Sort.time_sort(False, True, 'int64', ('ordered',))
- 5.92±0.6ms 5.26±0.07ms 0.89 bench_function_base.Sort.time_sort(False, False, 'bool', ('uniform',))
- 5.64±0.4ms 5.00±0.02ms 0.89 bench_function_base.Sort.time_sort(False, False, 'uint8', ('reversed',))
- 13.5±1ms 12.0±0.08ms 0.89 bench_function_base.Sort.time_sort(False, False, 'uint8', ('sorted_block', 1000))
- 6.41±0.2ms 5.68±0.08ms 0.89 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 100))
- 6.89±0.07ms 6.13±0.1ms 0.89 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 1000))
- 6.76±0.02ms 6.03±0.04ms 0.89 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 1000))
- 26.0±1ms 23.2±0.3ms 0.89 bench_function_base.Sort.time_sort(True, True, 'float64', ('sorted_block', 10))
- 6.00±0.03ms 5.29±0.09ms 0.88 bench_function_base.Sort.time_sort(False, True, 'int16', ('reversed',))
- 24.8±2ms 21.6±0.1ms 0.87 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 100))
- 13.7±1ms 12.0±0.03ms 0.87 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 1000))
- 5.05±0.06ms 4.37±0.3ms 0.87 bench_function_base.Sort.time_sort(False, True, 'uint8', ('ordered',))
- 34.4±5ms 29.6±0.08ms 0.86 bench_function_base.Sort.time_sort(False, False, 'int8', ('random',))
- 6.45±0.7ms 5.42±0.04ms 0.84 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 1000))
- 6.09±0.05ms 5.14±0.04ms 0.84 bench_function_base.Sort.time_sort(False, True, 'int16', ('ordered',))
- 6.36±1ms 5.26±0.2ms 0.83 bench_function_base.Sort.time_sort(False, False, 'bool', ('ordered',))
- 20.9±0.1ms 17.2±0.2ms 0.83 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 10))
- 6.46±2ms 5.25±0.05ms 0.81 bench_function_base.Sort.time_sort(False, False, 'bool', ('reversed',))
- 21.2±0.5ms 17.1±0.2ms 0.81 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 10))
- 16.3±0.1ms 12.9±0.09ms 0.79 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 10))
- 16.5±0.07ms 13.0±0.5ms 0.79 bench_function_base.Sort.time_sort(True, True, 'uint32', ('sorted_block', 10))
- 7.08±0.8ms 5.53±0.3ms 0.78 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 100))
- 5.99±1ms 4.68±0.05ms 0.78 bench_function_base.Sort.time_sort(False, False, 'uint8', ('ordered',))
- 6.18±2ms 4.39±0.01ms 0.71 bench_function_base.Sort.time_sort(False, False, 'int8', ('ordered',))
- 26.8±0.7ms 19.0±0.6ms 0.71 bench_function_base.Sort.time_sort(True, False, 'object', ('ordered',))
- 27.6±0.9ms 18.7±1ms 0.68 bench_function_base.Sort.time_sort(True, False, 'object', ('reversed',))
- 8.88±1ms 5.78±0.02ms 0.65 bench_function_base.Sort.time_sort(False, False, 'int16', ('reversed',))
- 7.91±4ms 4.95±0.08ms 0.63 bench_function_base.Sort.time_sort(False, False, 'int8', ('uniform',))
- 364±30ms 212±0.5ms 0.58 bench_function_base.Sort.time_sort(False, False, 'object', ('ordered',))
- 663±50ms 382±2ms 0.58 bench_function_base.Sort.time_sort(False, False, 'object', ('reversed',))
- 973±200ms 558±10ms 0.57 bench_function_base.Sort.time_sort(False, False, 'object', ('random',))
- 33.6±0.3ms 13.0±0.3ms 0.39 bench_function_base.Sort.time_sort(True, False, 'object', ('uniform',))

Not sure where those are coming from, perhaps the comparator. Also, I did recover cmp rather than cmp_eq for descending, which probably makes sense as we use the inverted version (which usually returns early rather than goes all the way.)

With this, all objects need to implement <= and >= to work, as caught by the test_sort_bad_ordering test (which I updated to add a __le__ method to the bogus class for now). I guess that could be annoying, should we add a fallback? (If we do so in the comparators, it would be silently slower, but I don't know if we should alter the sorts clearly.)
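
To make the requirement concrete, here is a Python sketch of a class that defines only `__lt__`, together with one possible fallback. `OnlyLt` and `le_with_fallback` are hypothetical names, and the fallback assumes a total order without NaN-like values, so that `a <= b` is equivalent to `not (b < a)`.

```python
class OnlyLt:
    """An object that supports < but defines neither <= nor >=."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value


def le_with_fallback(a, b):
    """Evaluate a <= b, falling back to not (b < a) if <= is unsupported."""
    try:
        return bool(a <= b)
    except TypeError:
        # Assumption: a total order, so a <= b is the same as not (b < a).
        return not (b < a)
```

A try/except like this adds overhead on every failing `<=`, which is the "silently slower" concern; the inverted `<` can also be used directly without attempting `<=` at all.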

EDIT: The float16 stuff seems to be random compiler choices again, given there really is no branching difference... changing the _equal comparators a bit optimized it more for me, but no point in pushing I guess.

@MaanasArora

Contributor Author

Sorry, I think I misunderstood, we clearly don't need the __le__/__ge__ methods to do this; the inverted op is enough! Just pushed a change doing that instead. Benchmarks against main now:

Sort Benchmarks
Change Before [a20ef19] After [ca5446d] Ratio Benchmark (Parameter)
+ 1.10±0.4ms 3.58±0.4ms 3.27 bench_function_base.Sort.time_sort(True, False, 'float16', ('uniform',))
+ 1.16±0.07ms 3.52±0.1ms 3.03 bench_function_base.Sort.time_sort(True, False, 'float16', ('ordered',))
+ 1.17±0ms 3.52±0.04ms 3 bench_function_base.Sort.time_sort(True, True, 'float16', ('ordered',))
+ 1.16±0.01ms 3.08±0.3ms 2.65 bench_function_base.Sort.time_sort(True, True, 'float16', ('uniform',))
+ 889±30μs 1.19±0.1ms 1.34 bench_function_base.Sort.time_sort(True, True, 'float32', ('uniform',))
+ 13.5±0.09ms 17.2±0.08ms 1.27 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 10))
+ 22.0±0.7ms 26.0±0.2ms 1.18 bench_function_base.Sort.time_sort(True, False, 'float64', ('sorted_block', 10))
+ 14.9±0.07ms 17.0±0.1ms 1.15 bench_function_base.Sort.time_sort(False, True, 'float16', ('uniform',))
+ 12.0±0.04ms 13.8±0.1ms 1.15 bench_function_base.Sort.time_sort(False, True, 'float64', ('reversed',))
+ 15.7±0.2ms 17.4±0.07ms 1.11 bench_function_base.Sort.time_sort(False, False, 'float16', ('ordered',))
+ 5.20±0.05ms 5.79±0.09ms 1.11 bench_function_base.Sort.time_sort(False, False, 'int16', ('reversed',))
+ 352±2μs 385±5μs 1.09 bench_function_base.Sort.time_sort(False, False, 'int32', ('uniform',))
+ 17.0±0.3ms 18.3±0.06ms 1.08 bench_function_base.Sort.time_sort(False, False, 'float16', ('reversed',))
+ 39.2±0.4ms 42.3±0.09ms 1.08 bench_function_base.Sort.time_sort(False, False, 'float16', ('sorted_block', 1000))
+ 19.0±0.1ms 20.6±0.2ms 1.08 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 100))
+ 49.4±0.2ms 52.8±0.04ms 1.07 bench_function_base.Sort.time_sort(False, False, 'float16', ('sorted_block', 10))
+ 18.7±0.2ms 20.1±0.08ms 1.07 bench_function_base.Sort.time_sort(True, False, 'float32', ('sorted_block', 10))
+ 480±2μs 513±20μs 1.07 bench_function_base.Sort.time_sort(True, False, 'uint32', ('ordered',))
+ 5.71±0.02ms 6.08±0.1ms 1.07 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 1000))
+ 50.4±0.3ms 53.5±0.2ms 1.06 bench_function_base.Sort.time_sort(False, False, 'float16', ('sorted_block', 100))
+ 45.7±0.06ms 48.2±0.1ms 1.06 bench_function_base.Sort.time_sort(False, True, 'float16', ('sorted_block', 10))
- 22.7±0.09ms 21.6±0.04ms 0.95 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 100))
- 5.03±0.04ms 4.75±0.01ms 0.95 bench_function_base.Sort.time_sort(False, False, 'uint8', ('ordered',))
- 12.6±0.04ms 12.0±0.09ms 0.95 bench_function_base.Sort.time_sort(False, False, 'uint8', ('sorted_block', 1000))
- 9.49±0.1ms 8.98±0.1ms 0.95 bench_function_base.Sort.time_sort(False, True, 'int64', ('reversed',))
- 23.0±0.1ms 21.9±0.04ms 0.95 bench_function_base.Sort.time_sort(False, True, 'int8', ('sorted_block', 10))
- 8.50±0.06ms 8.04±0.07ms 0.95 bench_function_base.Sort.time_sort(False, True, 'uint32', ('reversed',))
- 24.0±0.09ms 22.4±0.04ms 0.94 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 10))
- 12.7±0.05ms 11.9±0.01ms 0.94 bench_function_base.Sort.time_sort(False, False, 'int8', ('sorted_block', 1000))
- 5.27±0.04ms 4.94±0.02ms 0.94 bench_function_base.Sort.time_sort(False, True, 'int8', ('reversed',))
- 12.6±0.05ms 11.8±0.02ms 0.94 bench_function_base.Sort.time_sort(False, True, 'uint8', ('sorted_block', 1000))
- 80.0±0.9ms 75.3±0.2ms 0.94 bench_function_base.Sort.time_sort(True, True, 'float16', ('random',))
- 5.65±0.03ms 5.27±0.05ms 0.93 bench_function_base.Sort.time_sort(False, True, 'uint32', ('uniform',))
- 6.10±0.1ms 5.61±0.09ms 0.92 bench_function_base.Sort.time_sort(False, False, 'bool', ('sorted_block', 100))
- 6.11±0.2ms 5.64±0.02ms 0.92 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 100))
- 8.87±0.2ms 8.16±0.2ms 0.92 bench_function_base.Sort.time_sort(False, True, 'int32', ('reversed',))
- 5.06±0.05ms 4.65±0.02ms 0.92 bench_function_base.Sort.time_sort(False, True, 'uint8', ('ordered',))
- 77.0±7ms 70.7±0.2ms 0.92 bench_function_base.Sort.time_sort(True, False, 'int32', ('random',))
- 30.2±0.5ms 27.8±0.1ms 0.92 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 10))
- 20.1±0.7ms 18.6±0.9ms 0.92 bench_function_base.Sort.time_sort(True, True, 'float32', ('sorted_block', 10))
- 15.5±0.1ms 14.2±0.05ms 0.91 bench_function_base.Sort.time_sort(False, True, 'float16', ('ordered',))
- 857±30μs 779±20μs 0.91 bench_function_base.Sort.time_sort(True, False, 'int64', ('uniform',))
- 5.86±0.01ms 5.26±0.1ms 0.9 bench_function_base.Sort.time_sort(False, True, 'int16', ('reversed',))
- 5.20±0.04ms 4.67±0.04ms 0.9 bench_function_base.Sort.time_sort(False, True, 'int8', ('ordered',))
- 6.01±0.2ms 5.38±0.03ms 0.89 bench_function_base.Sort.time_sort(False, True, 'bool', ('sorted_block', 1000))
- 18.4±0.3ms 16.2±0.2ms 0.88 bench_function_base.Sort.time_sort(True, True, 'object', ('sorted_block', 10))
- 5.38±0.04ms 4.64±0.05ms 0.86 bench_function_base.Sort.time_sort(False, False, 'int8', ('reversed',))
- 6.03±0.03ms 5.21±0.2ms 0.86 bench_function_base.Sort.time_sort(False, True, 'int16', ('ordered',))
- 19.7±0.3ms 16.8±0.2ms 0.85 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 10))
- 4.30±2ms 3.54±0.01ms 0.82 bench_function_base.Sort.time_sort(True, False, 'bool', ('sorted_block', 1000))
- 7.51±2ms 6.14±0.03ms 0.82 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 1000))
- 1.45±0.3ms 1.17±0.07ms 0.8 bench_function_base.Sort.time_sort(True, True, 'float64', ('uniform',))
- 21.3±0.2ms 16.9±0.2ms 0.79 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 10))
- 16.2±0.09ms 12.8±0.2ms 0.79 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 10))
- 16.2±0.04ms 12.7±0.2ms 0.78 bench_function_base.Sort.time_sort(True, True, 'uint32', ('sorted_block', 10))
- 17.0±6ms 12.8±0.05ms 0.75 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 10))
- 2.15±0.4ms 1.60±0ms 0.75 bench_function_base.Sort.time_sort(True, False, 'int8', ('sorted_block', 1000))
- 729±9ms 537±10ms 0.74 bench_function_base.Sort.time_sort(False, False, 'object', ('random',))
- 1.06±0.1ms 747±80μs 0.7 bench_function_base.Sort.time_sort(False, False, 'int64', ('uniform',))
- 12.0±7ms 8.16±0.3ms 0.68 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 100))
- 26.3±7ms 16.8±0.2ms 0.64 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 10))
- 39.5±7ms 23.5±0.5ms 0.6 bench_function_base.Sort.time_sort(True, False, 'object', ('uniform',))
- 344±0.8ms 202±1ms 0.59 bench_function_base.Sort.time_sort(False, False, 'object', ('ordered',))
- 633±20ms 366±2ms 0.58 bench_function_base.Sort.time_sort(False, False, 'object', ('reversed',))
- 31.0±6ms 17.6±2ms 0.57 bench_function_base.Sort.time_sort(True, False, 'object', ('reversed',))

@seberg

seberg commented Jun 8, 2026

Member

Thanks! Sorry, quick question: do you know what's up with the float16 benchmarks? I don't think we use it for that, but I wonder if adding NPY_FINLINE to the isnan and lt_nonan helpers might nudge the compiler to inline them.
(My guess is that it is suddenly failing to do so. The other fluctuations seem basically "random", but the 3x is a bit surprising.)

(I guess this should be settling, but if this "inverted" logic would hinder this, we can also revert it...)

@MaanasArora

MaanasArora commented Jun 8, 2026

Contributor Author

Not quite sure! The 3x is pretty stable across runs, so it does seem to be a real regression. Adding NPY_FINLINE didn't really help unfortunately:

float16-only sort benchmarks
Change Before [c80693c] After [12d1d249] Ratio Benchmark (Parameter)
+ 1.17±0.01ms 4.41±0.8ms 3.78 bench_function_base.Sort.time_sort(True, True, 'float16', ('uniform',))
+ 1.18±0ms 3.47±0.7ms 2.94 bench_function_base.Sort.time_sort(True, True, 'float16', ('ordered',))
+ 15.0±0.4ms 17.5±0.4ms 1.17 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 1000))
+ 15.1±0.2ms 17.3±0.03ms 1.15 bench_function_base.Sort.time_sort(False, False, 'float16', ('uniform',))
+ 14.9±0.2ms 16.8±0.1ms 1.13 bench_function_base.Sort.time_sort(False, True, 'float16', ('reversed',))
+ 36.8±0.2ms 38.6±0.1ms 1.05 bench_function_base.Sort.time_sort(False, True, 'float16', ('sorted_block', 1000))
- 8.51±0.2ms 8.06±0.09ms 0.95 bench_function_base.Sort.time_sort(True, False, 'float16', ('reversed',))
- 15.8±0.2ms 14.9±0.2ms 0.94 bench_function_base.Sort.time_sort(False, True, 'float16', ('ordered',))
- 14.9±0.8ms 14.0±0.2ms 0.94 bench_function_base.Sort.time_sort(False, True, 'float16', ('uniform',))

I suspect the !less in less_equal (and the same for greater) is triggering some compiler-specific optimization of the NaN boolean cases, since previously the !ret came after. Let me experiment with it a bit...

@MaanasArora

MaanasArora commented Jun 8, 2026

Contributor Author

OK, I pushed a change that expands the less_equal and greater_equal functions, which sped things up on my machine at least. I think the regressions are mostly gone; benchmarks below! Added a release note as well.

Sort Benchmarks
Change Before [c80693c] After [a84f7b6] Ratio Benchmark (Parameter)
+ 13.5±0.05ms 17.0±0.06ms 1.26 bench_function_base.Sort.time_sort(True, True, 'int32', ('sorted_block', 10))
+ 6.45±0.02ms 7.72±0.1ms 1.2 bench_function_base.Sort.time_sort(False, True, 'float64', ('ordered',))
+ 5.04±0.02ms 6.02±0.09ms 1.2 bench_function_base.Sort.time_sort(False, True, 'int16', ('ordered',))
+ 7.96±0.05ms 9.50±0.1ms 1.19 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 1000))
+ 14.6±0.09ms 17.3±1ms 1.19 bench_function_base.Sort.time_sort(True, True, 'float16', ('sorted_block', 1000))
+ 5.21±0.01ms 6.15±0.01ms 1.18 bench_function_base.Sort.time_sort(False, True, 'int16', ('reversed',))
+ 7.96±0.2ms 9.31±0.1ms 1.17 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 1000))
+ 14.9±0.06ms 17.1±0.03ms 1.14 bench_function_base.Sort.time_sort(False, False, 'float16', ('ordered',))
+ 16.0±0.2ms 18.1±0.03ms 1.13 bench_function_base.Sort.time_sort(False, False, 'float16', ('reversed',))
+ 9.96±0.1ms 11.2±0.1ms 1.13 bench_function_base.Sort.time_sort(True, False, 'int64', ('sorted_block', 100))
+ 5.62±0.02ms 6.35±0.1ms 1.13 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 1000))
+ 11.0±0.3ms 12.5±0.3ms 1.13 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 100))
+ 6.40±0.1ms 7.17±0.06ms 1.12 bench_function_base.Sort.time_sort(False, True, 'float32', ('ordered',))
+ 10.0±0.09ms 11.2±0.1ms 1.12 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 100))
+ 8.52±0.03ms 9.45±0.08ms 1.11 bench_function_base.Sort.time_sort(True, True, 'int64', ('sorted_block', 1000))
+ 25.2±0.5ms 27.8±0.3ms 1.1 bench_function_base.Sort.time_sort(True, False, 'object', ('ordered',))
+ 511±2ms 560±20ms 1.09 bench_function_base.Sort.time_sort(False, False, 'object', ('uniform',))
+ 11.6±0.03ms 12.7±0.1ms 1.09 bench_function_base.Sort.time_sort(False, True, 'float64', ('reversed',))
+ 40.7±0.04ms 44.5±0.2ms 1.09 bench_function_base.Sort.time_sort(False, True, 'int16', ('sorted_block', 10))
+ 12.8±0.1ms 13.8±0.09ms 1.08 bench_function_base.Sort.time_sort(False, True, 'float64', ('uniform',))
+ 45.1±0.03ms 48.8±0.1ms 1.08 bench_function_base.Sort.time_sort(False, True, 'int16', ('sorted_block', 100))
+ 37.4±0.04ms 40.5±0.07ms 1.08 bench_function_base.Sort.time_sort(False, True, 'int16', ('sorted_block', 1000))
+ 15.6±0.2ms 16.9±0.1ms 1.08 bench_function_base.Sort.time_sort(True, False, 'object', ('sorted_block', 10))
+ 7.63±0.01ms 8.21±0.4ms 1.08 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 100))
+ 702±4μs 760±7μs 1.08 bench_function_base.Sort.time_sort(True, True, 'float32', ('reversed',))
+ 54.4±0.1ms 58.4±0.2ms 1.07 bench_function_base.Sort.time_sort(False, True, 'int16', ('random',))
+ 29.5±0.2ms 31.5±0.6ms 1.07 bench_function_base.Sort.time_sort(True, False, 'float16', ('sorted_block', 10))
- 8.90±0.1ms 8.41±0.07ms 0.95 bench_function_base.Sort.time_sort(False, True, 'int64', ('reversed',))
- 792±90μs 753±4μs 0.95 bench_function_base.Sort.time_sort(True, False, 'int64', ('uniform',))
- 318±8μs 300±0.7μs 0.94 bench_function_base.Sort.time_sort(True, False, 'bool', ('uniform',))
- 4.82±0.05ms 4.50±0.02ms 0.93 bench_function_base.Sort.time_sort(False, True, 'int32', ('ordered',))
- 23.0±0.4ms 21.5±0.05ms 0.93 bench_function_base.Sort.time_sort(False, True, 'int8', ('sorted_block', 100))
- 324±10μs 303±0.2μs 0.93 bench_function_base.Sort.time_sort(True, True, 'bool', ('uniform',))
- 5.48±0.09ms 5.06±0.01ms 0.92 bench_function_base.Sort.time_sort(False, False, 'int16', ('uniform',))
- 1.14±0.03ms 1.05±0.01ms 0.92 bench_function_base.Sort.time_sort(True, False, 'float64', ('ordered',))
- 15.5±0.3ms 14.1±0.03ms 0.91 bench_function_base.Sort.time_sort(False, True, 'float16', ('ordered',))
- 8.78±0.2ms 7.96±0.1ms 0.91 bench_function_base.Sort.time_sort(False, True, 'int32', ('reversed',))
- 13.2±0.07ms 11.9±0.02ms 0.9 bench_function_base.Sort.time_sort(False, True, 'int8', ('sorted_block', 1000))
- 8.74±0.2ms 7.84±0.02ms 0.9 bench_function_base.Sort.time_sort(False, True, 'uint32', ('reversed',))
- 1.12±0.06ms 1.00±0.01ms 0.89 bench_function_base.Sort.time_sort(True, False, 'float64', ('reversed',))
- 26.5±0.4ms 23.5±0.2ms 0.89 bench_function_base.Sort.time_sort(True, True, 'float64', ('sorted_block', 10))
- 667±50μs 596±3μs 0.89 bench_function_base.Sort.time_sort(True, True, 'uint32', ('reversed',))
- 4.92±0.06ms 4.35±0.01ms 0.88 bench_function_base.Sort.time_sort(False, True, 'uint32', ('ordered',))
- 1.24±0.1ms 1.06±0.03ms 0.86 bench_function_base.Sort.time_sort(True, True, 'float64', ('uniform',))
- 1.05±0.01ms 875±3μs 0.84 bench_function_base.Sort.time_sort(True, False, 'float16', ('uniform',))
- 1.42±0.2ms 1.20±0.03ms 0.84 bench_function_base.Sort.time_sort(True, True, 'float64', ('reversed',))
- 1.09±0.01ms 884±8μs 0.81 bench_function_base.Sort.time_sort(True, False, 'float16', ('ordered',))
- 16.2±0.03ms 12.7±0.05ms 0.79 bench_function_base.Sort.time_sort(True, False, 'uint32', ('sorted_block', 10))
- 661±2ms 518±3ms 0.78 bench_function_base.Sort.time_sort(False, False, 'object', ('random',))
- 5.81±0.01ms 4.55±0.5ms 0.78 bench_function_base.Sort.time_sort(True, False, 'int32', ('sorted_block', 1000))
- 1.15±0ms 880±3μs 0.77 bench_function_base.Sort.time_sort(True, True, 'float16', ('ordered',))
- 1.15±0.01ms 882±10μs 0.77 bench_function_base.Sort.time_sort(True, True, 'float16', ('uniform',))
- 16.1±0.08ms 12.4±0.03ms 0.77 bench_function_base.Sort.time_sort(True, True, 'uint32', ('sorted_block', 10))
- 32.2±0.3ms 24.0±0.4ms 0.75 bench_function_base.Sort.time_sort(True, False, 'object', ('uniform',))
- 25.3±0.1ms 17.1±0.09ms 0.68 bench_function_base.Sort.time_sort(True, False, 'object', ('reversed',))
- 326±0.5ms 202±0.2ms 0.62 bench_function_base.Sort.time_sort(False, False, 'object', ('ordered',))
- 595±2ms 369±0.4ms 0.62 bench_function_base.Sort.time_sort(False, False, 'object', ('reversed',))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Argsort benchmarks
Change Before [c80693c] After [6d83c91] Ratio Benchmark (Parameter)
+ 16.9±0.4ms 22.7±0.8ms 1.35 bench_function_base.Sort.time_argsort(True, False, 'int32', ('sorted_block', 10))
+ 683±10μs 913±200μs 1.34 bench_function_base.Sort.time_argsort(True, True, 'int64', ('ordered',))
+ 8.35±0.09ms 10.7±1ms 1.28 bench_function_base.Sort.time_argsort(True, False, 'int32', ('sorted_block', 1000))
+ 12.2±0.4ms 15.2±0.3ms 1.25 bench_function_base.Sort.time_argsort(False, True, 'float32', ('reversed',))
+ 581±20μs 721±100μs 1.24 bench_function_base.Sort.time_argsort(True, False, 'int32', ('uniform',))
+ 575±7μs 703±70μs 1.22 bench_function_base.Sort.time_argsort(True, False, 'int32', ('ordered',))
+ 11.7±0.4ms 14.3±0.9ms 1.22 bench_function_base.Sort.time_argsort(True, False, 'uint32', ('sorted_block', 100))
+ 523±4μs 636±80μs 1.22 bench_function_base.Sort.time_argsort(True, False, 'uint8', ('uniform',))
+ 802±3μs 971±200μs 1.21 bench_function_base.Sort.time_argsort(True, True, 'uint32', ('reversed',))
+ 6.07±0.07ms 7.14±0.3ms 1.18 bench_function_base.Sort.time_argsort(False, True, 'int64', ('ordered',))
+ 6.97±0.4ms 8.17±0.6ms 1.17 bench_function_base.Sort.time_argsort(False, True, 'float32', ('ordered',))
+ 5.22±0.1ms 6.12±0.3ms 1.17 bench_function_base.Sort.time_argsort(True, False, 'uint8', ('random',))
+ 521±9μs 606±30μs 1.16 bench_function_base.Sort.time_argsort(True, False, 'uint8', ('ordered',))
+ 6.74±0.1ms 7.72±0.5ms 1.15 bench_function_base.Sort.time_argsort(False, True, 'bool', ('sorted_block', 10))
+ 12.3±0.2ms 14.2±0.5ms 1.15 bench_function_base.Sort.time_argsort(True, False, 'int32', ('sorted_block', 100))
+ 19.6±0.3ms 22.6±0.7ms 1.15 bench_function_base.Sort.time_argsort(True, False, 'object', ('ordered',))
+ 578±8μs 656±20μs 1.14 bench_function_base.Sort.time_argsort(True, False, 'uint32', ('ordered',))
+ 16.7±0.5ms 18.9±2ms 1.13 bench_function_base.Sort.time_argsort(False, False, 'float16', ('ordered',))
+ 519±5μs 588±8μs 1.13 bench_function_base.Sort.time_argsort(True, False, 'bool', ('ordered',))
+ 8.10±0.2ms 9.13±0.1ms 1.13 bench_function_base.Sort.time_argsort(True, False, 'uint32', ('sorted_block', 1000))
+ 578±3μs 652±40μs 1.13 bench_function_base.Sort.time_argsort(True, False, 'uint32', ('uniform',))
+ 2.65±0.06ms 3.01±0.1ms 1.13 bench_function_base.Sort.time_argsort(True, False, 'uint8', ('sorted_block', 1000))
+ 23.5±0.3ms 26.5±0.7ms 1.13 bench_function_base.Sort.time_argsort(True, True, 'object', ('sorted_block', 10))
+ 10.7±0.04ms 11.9±0.9ms 1.11 bench_function_base.Sort.time_argsort(False, True, 'int64', ('reversed',))
+ 16.3±0.4ms 18.2±0.2ms 1.11 bench_function_base.Sort.time_argsort(True, False, 'uint32', ('sorted_block', 10))
+ 3.24±0.01ms 3.56±0.4ms 1.1 bench_function_base.Sort.time_argsort(True, False, 'bool', ('sorted_block', 100))
+ 13.5±0.3ms 15.0±0.3ms 1.1 bench_function_base.Sort.time_argsort(True, False, 'object', ('sorted_block', 100))
+ 7.16±0.08ms 7.88±0.2ms 1.1 bench_function_base.Sort.time_argsort(True, False, 'uint8', ('reversed',))
+ 9.19±0.3ms 9.99±0.7ms 1.09 bench_function_base.Sort.time_argsort(True, False, 'object', ('sorted_block', 1000))
+ 5.87±0.07ms 6.37±0.3ms 1.08 bench_function_base.Sort.time_argsort(False, False, 'bool', ('uniform',))
+ 15.2±0.2ms 16.5±0.2ms 1.08 bench_function_base.Sort.time_argsort(False, False, 'int8', ('sorted_block', 1000))
+ 5.99±0.04ms 6.48±0.2ms 1.08 bench_function_base.Sort.time_argsort(False, True, 'bool', ('sorted_block', 1000))
+ 19.1±0.04ms 20.4±0.6ms 1.07 bench_function_base.Sort.time_argsort(False, False, 'float64', ('reversed',))
+ 6.02±0.05ms 6.42±0.2ms 1.07 bench_function_base.Sort.time_argsort(False, True, 'bool', ('ordered',))
+ 11.9±0.1ms 12.8±0.4ms 1.07 bench_function_base.Sort.time_argsort(False, True, 'bool', ('random',))
+ 56.3±0.3ms 60.1±0.7ms 1.07 bench_function_base.Sort.time_argsort(False, True, 'int64', ('sorted_block', 1000))
+ 4.15±0.02ms 4.43±0.2ms 1.07 bench_function_base.Sort.time_argsort(True, False, 'bool', ('sorted_block', 1000))
+ 13.6±0.06ms 14.5±0.08ms 1.07 bench_function_base.Sort.time_argsort(True, False, 'float64', ('sorted_block', 1000))
+ 20.5±0.2ms 21.8±0.4ms 1.07 bench_function_base.Sort.time_argsort(True, False, 'object', ('sorted_block', 10))
+ 31.5±0.7ms 33.4±0.9ms 1.06 bench_function_base.Sort.time_argsort(False, False, 'int32', ('random',))
+ 507±3ms 539±1ms 1.06 bench_function_base.Sort.time_argsort(False, False, 'object', ('uniform',))
+ 84.8±0.4ms 89.7±3ms 1.06 bench_function_base.Sort.time_argsort(False, True, 'float32', ('random',))
+ 14.8±0.1ms 15.6±0.1ms 1.05 bench_function_base.Sort.time_argsort(False, False, 'float16', ('uniform',))
+ 4.22±0.05ms 4.44±0.3ms 1.05 bench_function_base.Sort.time_argsort(True, False, 'int16', ('sorted_block', 10))
+ 88.4±0.2ms 92.9±1ms 1.05 bench_function_base.Sort.time_argsort(True, False, 'int32', ('random',))
+ 35.9±0.1ms 37.9±0.4ms 1.05 bench_function_base.Sort.time_argsort(True, True, 'float16', ('sorted_block', 10))
- 62.0±0.2ms 59.1±0.6ms 0.95 bench_function_base.Sort.time_argsort(False, True, 'int32', ('sorted_block', 10))
- 97.8±2ms 92.9±0.5ms 0.95 bench_function_base.Sort.time_argsort(True, False, 'float32', ('random',))
- 5.96±0.05ms 5.60±0.05ms 0.94 bench_function_base.Sort.time_argsort(False, True, 'int8', ('ordered',))
- 8.06±0.3ms 7.47±0.2ms 0.93 bench_function_base.Sort.time_argsort(True, True, 'int8', ('reversed',))
- 12.7±0.6ms 11.9±0.1ms 0.93 bench_function_base.Sort.time_argsort(True, True, 'uint32', ('sorted_block', 100))
- 8.12±0.2ms 7.48±0.2ms 0.92 bench_function_base.Sort.time_argsort(True, True, 'uint8', ('reversed',))
- 1.17±0.01ms 1.07±0.03ms 0.91 bench_function_base.Sort.time_argsort(True, False, 'float16', ('uniform',))
- 11.3±0.4ms 9.87±0.2ms 0.87 bench_function_base.Sort.time_argsort(False, True, 'int32', ('reversed',))
- 21.7±0.7ms 18.8±0.2ms 0.87 bench_function_base.Sort.time_argsort(True, True, 'float64', ('sorted_block', 100))
- 17.3±0.4ms 14.9±0.1ms 0.86 bench_function_base.Sort.time_argsort(False, True, 'float16', ('uniform',))
- 6.96±0.3ms 5.86±0.1ms 0.84 bench_function_base.Sort.time_argsort(False, True, 'uint32', ('ordered',))
- 3.09±0.2ms 2.57±0.04ms 0.83 bench_function_base.Sort.time_argsort(True, True, 'uint8', ('sorted_block', 10))
- 1.29±0.01ms 1.07±0.03ms 0.82 bench_function_base.Sort.time_argsort(True, False, 'float16', ('ordered',))
- 1.32±0.01ms 1.07±0.02ms 0.81 bench_function_base.Sort.time_argsort(True, True, 'float16', ('uniform',))
- 28.5±0.3ms 22.8±0.5ms 0.8 bench_function_base.Sort.time_argsort(True, False, 'object', ('uniform',))
- 7.39±0.3ms 5.87±0.2ms 0.79 bench_function_base.Sort.time_argsort(False, True, 'int32', ('ordered',))
- 1.42±0.1ms 1.05±0.02ms 0.74 bench_function_base.Sort.time_argsort(True, True, 'float16', ('ordered',))
- 3.70±0.6ms 2.72±0.07ms 0.74 bench_function_base.Sort.time_argsort(True, True, 'uint8', ('sorted_block', 1000))
- 591±10ms 359±2ms 0.61 bench_function_base.Sort.time_argsort(False, False, 'object', ('reversed',))
- 325±2ms 195±3ms 0.6 bench_function_base.Sort.time_argsort(False, False, 'object', ('ordered',))
- 19.5±0.3ms 10.9±0.09ms 0.56 bench_function_base.Sort.time_argsort(True, False, 'object', ('reversed',))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@seberg seberg left a comment

Member

Thanks. As we discussed a bit, I am really starting to think it was a terrible idea to consider <=-style comparisons, because if we explicitly avoid __le__ then I am not quite sure about the implementation.

So, if that makes tests fail (for good reasons or not), then I think it might be best to just not do that cmp_eq thing. In the end, it is only interesting if it avoids NaN checks when no NaNs are involved (and even then it only might be interesting).

Comment on lines +3 to +5
`np.sort` and `np.argsort` with arrays of dtype `object`
now support passing `descending=True` to sort in descending order.
Unordered objects, i.e. `obj` such that `obj != obj`, are sorted

Member

Suggested change
- `np.sort` and `np.argsort` with arrays of dtype `object`
- now support passing `descending=True` to sort in descending order.
- Unordered objects, i.e. `obj` such that `obj != obj`, are sorted
+ `np.sort` and `np.argsort` with arrays of dtype ``object``
+ now support passing `descending=True` to sort in descending order.
+ Unordered objects, i.e. ``obj`` such that ``obj != obj``, are now sorted

@MaanasArora MaanasArora Jun 8, 2026

Contributor Author

Thanks, fixed! I also cleaned up the release note a bit, not sure if obj != obj was actually nice, but seems more natural this way...
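
To make the ordering in the release note concrete, here is a minimal pure-Python model of it. It is a sketch, not the NumPy implementation: `is_nan_like` and `model_sort` are hypothetical names, and the only behaviour assumed from this PR is that objects with `obj != obj` are placed at the end for both ascending and descending sorts.

```python
def is_nan_like(obj):
    # The PR treats any object for which obj != obj holds as NaN-like.
    return bool(obj != obj)


def model_sort(values, descending=False):
    """Pure-Python model: sort the comparable items, then append NaN-like ones."""
    ordered = [v for v in values if not is_nan_like(v)]
    unordered = [v for v in values if is_nan_like(v)]
    ordered.sort(reverse=descending)
    return ordered + unordered


data = [3, float("nan"), 1, 2]
ascending = model_sort(data)
descending = model_sort(data, descending=True)
print(ascending[:3], descending[:3])
```

In both results the NaN stays in the last position; only the order of the comparable items is reversed.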

Comment thread numpy/_core/tests/test_multiarray.py Outdated
b = np.concatenate((a[~nanmask][::-1], a[nanmask]))
if np.issubdtype(a.dtype, np.object_):
    # cast to float for comparison, as object np.nan != np.nan
    a = a.astype(float)

Member

This looks wrong (i.e. the cast is before the actual sort).

Contributor Author

Fixed, thanks!

if (j < n && npy::cmp<Tag, reverse>(a[j], a[j + 1])) {
ret = npy::cmp<Tag, reverse>(a[j], a[j + 1]);
if (ret < 0) return ret;
if (j < n && ret) {

Member

This may be wrong, i.e. it computes ret even if j < n isn't true. You could inline the (ret = ...) == 1 although not the prettiest maybe.

(I guess we could probably just delete heapsort in practice, but maybe not as part of this PR.)
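
The ordering concern can be sketched in Python as a heapsort sift-down step on a 1-indexed heap. This is an illustrative model only: `cmp_lt` and `sift_down` are hypothetical names, and the three-state return value (1, 0, or -1 on error) is an assumption modelled on the `ret < 0` check in the snippet above. The point is that the two children are only compared after `j < n` has confirmed that the right child exists.

```python
def cmp_lt(x, y):
    """Hypothetical three-state comparison: 1 if x < y, 0 if not, -1 on error."""
    try:
        return 1 if x < y else 0
    except TypeError:
        return -1


def sift_down(a, i, n):
    """Restore the max-heap property below index i in the heap a[1..n].

    Index 0 is unused. Returns 0 on success and -1 if a comparison failed.
    """
    tmp = a[i]
    j = 2 * i
    status = 0
    while j <= n:
        # Compare the children only when the right child a[j + 1] exists.
        if j < n:
            ret = cmp_lt(a[j], a[j + 1])
            if ret < 0:
                status = -1
                break
            if ret:
                j += 1
        ret = cmp_lt(tmp, a[j])
        if ret < 0:
            status = -1
            break
        if not ret:
            break
        a[i] = a[j]
        i = j
        j = 2 * i
    # Always write tmp back so the list stays a permutation of its input.
    a[i] = tmp
    return status
```

With `n = 2` there is no right child, so `a[j + 1]` is never read; moving the comparison in front of the `j < n` test would read one element past the end of the heap.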

Contributor Author

This is done, I think!

Comment thread numpy/_core/src/common/numpy_tag.hpp
Comment thread numpy/_core/src/common/numpy_tag.hpp Outdated
return 0;
}

ret = PyObject_RichCompareBool(a, b, op);

Member

Hmm, if isnan(a) && isnan(b), then for this style of comparison they are considered equal, I think?
I guess in practice that might not even matter for timsort, since it would just take another sorting approach (i.e. if the "already sorted" pass fails for NaNs, that might be fine, but I am not quite sure. If you are, we should add a comment)...

I am now thinking I really led you astray here. I don't mind using <=, FWIW (maybe with a small release note); we can undo it if someone notices...
But at this point it feels like it is adding a lot of annoyances, and I would be just as happy to not do it here. If someone ever wants to optimize it, they could follow up.

But my guess is that your refactor optimized the already-sorted case from 3 to 2 Python comparisons, but if we change it to something like:

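def less_equal_with_nan(a, b):
    # Sort-order "a <= b" using only < and !=, with NaN-like objects last.
    if b < a:
        return 0
    elif b != b:
        # b is NaN-like, so it sorts at or after everything
        return 1
    elif a != a:
        # a is NaN-like but b is not
        return 0
    return 1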

which to me would seem safe, then we would again end up with 3 comparisons when a <= b is True and at that point the whole use of cmp_eq may be pretty much moot?
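
The comparison count can be checked with a small Python sketch. `Counted` and `count_comparisons` are hypothetical helpers, and the function is written with `b < a` as its first test, which is an assumption about the intent (a sort-order `a <= b` must return 1 whenever `a < b`).

```python
class Counted:
    """Hypothetical wrapper that counts the rich comparisons made on it."""

    calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.calls += 1
        return self.value < other.value

    def __ne__(self, other):
        Counted.calls += 1
        return self.value != other.value


def less_equal_with_nan(a, b):
    # Sort-order "a <= b" using only < and !=, with NaN-like objects last.
    if b < a:
        return 0
    elif b != b:
        return 1
    elif a != a:
        return 0
    return 1


def count_comparisons(x, y):
    """Return (result, number of Python-level comparisons) for x <= y."""
    Counted.calls = 0
    result = less_equal_with_nan(Counted(x), Counted(y))
    return result, Counted.calls


print(count_comparisons(1, 2))  # (1, 3)
print(count_comparisons(2, 1))  # (0, 1)
```

For ordinary values with `a <= b`, all three tests run before the function returns 1, which is the 3-comparison case described above; only `b < a` exits after a single comparison.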

Contributor Author

Yeah, it feels a bit moot with the added complexity now, even if it is a bit optimized (from 3 to 2), and totally moot if it isn't in the future. I don't think it warrants this much tweaking... I've just gone ahead and reverted these files to before the cmp_eq experiment, so we lose this baggage!

@@ -0,0 +1,6 @@
object array sorting supports `descending=True`

Member

Maybe mention NaN/NaN-like objects here?

Contributor Author

Changed it to include that, thanks!

@MaanasArora

Contributor Author

Thanks for reviewing! Yeah, I think reverting is a good choice here, as there was added complexity on a few fronts. At least we know this is tricky to do now :)

@seberg seberg left a comment

Member

Made a tiny tweak, mostly removed the with errstate context from the benchmarks (because it didn't make sense to me, but who knows maybe I missed something).
Tiny tweak to the release note, but it's good enough.

One thing that is, in a sense, missing is tests that actually exercise the error paths. That might be a good follow-up, but I don't want to hold this up because of it.

Thanks, I'll put it in once tests pass, if there is something more, we can follow-up.

@seberg seberg merged commit c0c20aa into numpy:main Jun 9, 2026
85 of 87 checks passed