Bug/AP-25573 Fix bugs with pandas >= 2.1.0 #56
Open
HedgehogCode wants to merge 8 commits into master
Pull request overview
This PR addresses bugs that emerged with pandas version 2.1.0 and later, specifically related to how pandas handles extension arrays and struct dict encoded data. The changes fix data corruption issues during pandas-arrow round-trips and ensure compatibility with newer pandas versions.
Changes:
- Fixed type inference issues in PyArrow array creation by explicitly specifying `type=pa.bool_()`
- Modified `KnimePandasExtensionArray.take()` to prevent struct dict encoding corruption when pandas re-batches data
- Added empty array handling in `isna()` method
- Converted test suite from unittest to pytest with parameterized testing for both populated and empty tables
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pytest.ini | Re-enabled pytest class discovery to support class-based tests |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_types.py | Added explicit boolean type to mask array creation |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_table.py | Added TODO comment about struct-dict-encoded array re-batching issue |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py | Fixed pandas 2.1.0+ compatibility by handling empty arrays and optimizing take() to prevent encoding corruption |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_dictencoding.py | Added explicit boolean type to mask array creation |
| org.knime.python3.arrow.tests/src/test/python/unittest/test_table.py | Converted from unittest to pytest with parameterized testing for empty/non-empty tables |
| org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py | Added regression test for struct dict encoding corruption bug |
| org.knime.python3.arrow.tests/src/test/python/unittest/structDictEncodedDataCellsWithBatches.zip | Added test data file for struct dict encoding regression test |
| org.knime.python3.arrow.tests/src/test/python/unittest/emptyGeneratedTestData.zip | Added test data file for empty table testing |
@HedgehogCode I've opened a new pull request, #57, to work on those changes. Once the pull request is ready, I'll request review from you.
ce94bd8 to 239a578
AP-25573 (Workflow tests fail with pandas 2.3)
AP-25573 (Workflow tests fail with pandas 2.3)
…mpty arrays The usual case works for empty arrays and handles dtypes of extension arrays correctly. AP-25573 (Workflow tests fail with pandas 2.3)
… of int/bool AP-25573 (Workflow tests fail with pandas 2.3)
Otherwise, if the chunked array has no chunks, we attempt to concatenate no arrays, which fails. AP-25573 (Workflow tests fail with pandas 2.3)
239a578 to 6f8f9dd
…are taken

Bug: In pandas 2.1.0+, pd.concat changed its behavior and now calls KnimePandasExtensionArray.take with indices [0,1,2,...,n] to effectively copy the array. When take is called on struct dict encoded arrays, it delegates to storage.take(), which merges chunks of the ChunkedArray without recalculating the struct-dict encoding. This breaks the dictionary structure across chunk boundaries: keys in chunk 2 reference indices that only exist in chunk 1's data array, causing "Cannot read DataCell with empty type information" errors.

Fix: Shortcut the specific case where all indices are taken sequentially by returning a copy instead of calling storage.take(). This fixes the pd.concat code path that triggers during arrow → pandas → arrow round-trips.

Limitation: This is only a partial fix for the very specific case of taking all indices. The general take() and re-batching behavior for struct dict encoded arrays remains buggy when partial indices are selected or arbitrary re-batching occurs.

AP-25573 (Workflow tests fail with pandas 2.3)
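
The shortcut described in this commit can be sketched without pandas or Arrow. The class below is a hypothetical stand-in, not `KnimePandasExtensionArray` itself: it only shows the idea of detecting a take of all indices in order and returning a copy, leaving every other request to the general path.

```python
class SketchArray:
    """Hypothetical stand-in for an extension array backed by some storage."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def copy(self):
        return SketchArray(self._values)

    def take(self, indices):
        indices = list(indices)
        # pd.concat-style copy: every index, in order, exactly once.
        if indices == list(range(len(self))):
            return self.copy()
        # General path; in the real class this is where storage.take() would run.
        return SketchArray(self._values[i] for i in indices)


arr = SketchArray(["a", "b", "c"])
full = arr.take([0, 1, 2])  # shortcut: plain copy
part = arr.take([2, 0])     # general path
```

As the commit message notes, only the first branch sidesteps re-batching; partial or reordered index lists still go through the general path.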
…ionArray.take AP-25573 (Workflow tests fail with pandas 2.3)
AP-25573 (Workflow tests fail with pandas 2.3)
6f8f9dd to e7ba282



No description provided.