Skip to content

TST: Add direct tests for join_inner in table/_np_utils Cython extension#19458

Open
ReemHamraz wants to merge 17 commits intoastropy:mainfrom
ReemHamraz:tst/direct-np-utils-cython
Open

TST: Add direct tests for join_inner in table/_np_utils Cython extension#19458
ReemHamraz wants to merge 17 commits intoastropy:mainfrom
ReemHamraz:tst/direct-np-utils-cython

Conversation

@ReemHamraz
Copy link
Copy Markdown
Contributor

@ReemHamraz ReemHamraz commented Mar 18, 2026

Part of GSoC 2026 proposal: Hardening Astropy's Core Stability
This PR adds direct tests for the join_inner function in astropy/table/_np_utils.pyx, bypassing the public astropy.table.join() API to test the compiled Cython interface directly.

join_inner() is the single function exposed by this extension and is the core index-computation engine for all table join operations in astropy. It currently has zero standalone test coverage, it is only exercised indirectly through the public API.

Tests cover

  • All four join types: inner (0), outer (1), left (2), right (3)
  • Cartesian product expansion for duplicate keys (the n×m case)
  • Output array length consistency (len(left_out) == n_out for all join types)
  • Mask array length consistency (len(left_mask) == n_out for all join types)
  • Edge cases: single row match, no overlap, 100-key scaling test

The helper _make_join_inputs() reproduces the exact index-preparation that operations.py performs before calling join_inner(), keeping the tests fully self-contained with no dependency on the public API.

Add test_np_utils_extension.py, which imports the compiled _np_utils
extension module directly (without going through astropy.table.Table
or astropy.table.join) and exercises join_inner() across all four
join types (inner/outer/left/right) with hand-constructed index inputs.

Also add test_np_utils.py with tests for common_dtype() in np_utils.py.

This is part of the work described in astropy#19249 to build a standalone
test layer for astropy's compiled extension modules.
@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@ReemHamraz
Copy link
Copy Markdown
Contributor Author

pre-commit.ci autofix

@pllim pllim added this to the v8.0.0 milestone Mar 18, 2026
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from a401338 to 14b0a45 Compare March 18, 2026 15:30
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from dddc9ec to 8c730c9 Compare March 18, 2026 16:12
@neutrinoceros neutrinoceros marked this pull request as draft March 18, 2026 16:24
@ReemHamraz
Copy link
Copy Markdown
Contributor Author

pre-commit.ci autofix

@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from c54b6e7 to 8dcadba Compare March 18, 2026 17:26
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 843ba46 to dad7647 Compare March 18, 2026 18:06
Expands the standalone test coverage for the  Cython
extension in .

Previously, tests only verified the structural output (e.g.,
and the  boolean flag). This commit adds strict assertions
using  to validate the actual contents
of the computed , , , and
arrays across all four join types (inner, outer, left, right).

Additionally, adds a  function to ensure the
C-extension handles zero-length inputs gracefully without segfaulting.
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 3180166 to 656cbd4 Compare March 19, 2026 16:20
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 1a9e5af to a7d9b4f Compare March 19, 2026 16:35
@neutrinoceros neutrinoceros marked this pull request as ready for review March 20, 2026 09:47
Copy link
Copy Markdown
Contributor

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start. I think we need to iterate a couple times to reduce in-between tests duplication though. I'd suggest moving as much tests to a general class (not bound to any particular jointype) and parametrize it over jointypes
as it makes sense. I'll then re-review in more depth. From a relatively quick gaze, tests look sensible and minimal, good job !

@ReemHamraz
Copy link
Copy Markdown
Contributor Author

@neutrinoceros I have refactored the test suite based on your feedback so like I condensed all the four classes into a single TestJoinInner class using @pytest.mark.parametrize.

I thought that since _get_join_sort_idxs requires full Astropy Table objects, and the goal of the PR was to bypass Table objects to test the C-arrays directly, but if you think it's better to avoid the code duplication and just pass dummy Table objects into _get_join_sort_idxs for the Cython test, I could rewrite it that way, just let me know what you prefer!

Also please lemme know what you think of this

Copy link
Copy Markdown
Contributor

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, it looks much better now !

I think all your tests are sensible, but they all focus on equal-size left/right inputs. I think we should also include more diverse test cases like

  • inputs of different sizes
  • inputs with 0 elements (left, right and both)

I'm also curious what happens with non-integer dtypes ? in particular, what happens if there's a NaN in left and right ? I'm asking because NaN doesn't compare equal to itself, so it could break the helper function in interesting ways.

@ReemHamraz
Copy link
Copy Markdown
Contributor Author

I think all your tests are sensible, but they all focus on equal-size left/right inputs. I think we should also include more diverse test cases like

  • inputs with 0 elements (left, right and both)

I'm also curious what happens with non-integer dtypes ? in particular, what happens if there's a NaN in left and right ? I'm asking because NaN doesn't compare equal to itself, so it could break the helper function in interesting ways.

I actually tried adding empty array tests previously, but it caused the test suite to instantly fail with a Segmentation Fault, and since the public Python API in operations.py explicitly guards against empty tables (if len_left == 0 or len_right == 0: raise ValueError), I left the 0-element tests out of this Cython-direct test suite.

@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 536e763 to b0d5d90 Compare March 22, 2026 14:22
@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 28ef2b3 to 0a17f0a Compare March 22, 2026 14:43
@neutrinoceros
Copy link
Copy Markdown
Contributor

it caused the test suite to instantly fail with a Segmentation Fault (...) the public Python API in operations.py explicitly guards against empty tables

Indeed, I see now what this happens. I wonder if raising an exception is actually the correct behavior, I would naively assume that at least some join types could still logically accept empty tables... Any way, happy to declare this out of scope for the present PR.

Copy link
Copy Markdown
Contributor

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're in the late stage review;

A high level remark is that it's a bit hard to read parametrizations when there are so many parameters in a single pytest.param.
One way to address this would be to define a couple small dataclasses to hold the information as:

from dataclasses import dataclass

Array = np.ndarray[tuple[int], np.dtype[np.float64]]
Mask = np.ndarray[tuple[int], np.dtype[np.bool_]]

@dataclass(kw_only=True, slots=True, frozen=True)
class ArrayMaskPair:
   array: Array
   mask: Mask

@dataclass(kw_only=True, slots=True, frozen=True)
class ExpectedResults:
    left: ArrayMaskPair
    right: ArrayMaskPair

@neutrinoceros
Copy link
Copy Markdown
Contributor

I intend my next review to be the final round. I'll be focusing on wether all tests are actually correct, but it'll be easier to do once my current set of suggestions are addressed.

@ReemHamraz ReemHamraz force-pushed the tst/direct-np-utils-cython branch from 7511d46 to 87fa88a Compare March 25, 2026 21:03

@dataclass(kw_only=True, slots=True, frozen=True)
class ArrayMaskPair:
array: list | np.ndarray
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this single type annotation has at least 3 flaws that we can use to up your skills !

list and np.ndarray are both generic types; in simple terms, this means they define the shape of the container, but additional information is needed to type their content (for instance list[float]).

np.ndarray isn't supported in numpy as a type annotation. It's quite a bit more involved to type these correctly, especially in cases where multiple dimensionalities and/or dtypes are possible, but to give a concise example of how to declare the type of a 2D array of single precision floats:
np.ndarray[tuple[int, int], np.dtype[np.float32]]

Moreover, type unions (|) are generally to be avoided when possible (because they make the whole exercise both more difficult and less impactful). They should not be needed here: I'm sure we can decide on a single type and use it unconditionally (probably arrays).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: none of this matters at runtime. You could use array: int and the tests would run just the same, but that's not a reason not to make type annotations actually useful at type-check time !


assert n_out == 3
assert masked == 0
np.testing.assert_array_equal(left_out, [0, 1, 2])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually just import this function at the top level

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants