Bug/AP-25573 Fix bugs with pandas >= 2.1.0 by HedgehogCode · Pull Request #56 · knime/knime-python

HedgehogCode · 2026-02-03T14:29:19Z

No description provided.

Copilot

Pull request overview

This PR addresses bugs that emerged with pandas version 2.1.0 and later, specifically related to how pandas handles extension arrays and struct dict encoded data. The changes fix data corruption issues during pandas-arrow round-trips and ensure compatibility with newer pandas versions.

Changes:

Fixed type inference issues in PyArrow array creation by explicitly specifying type=pa.bool_()
Modified KnimePandasExtensionArray.take() to prevent struct dict encoding corruption when pandas re-batches data
Added empty array handling in isna() method
Converted test suite from unittest to pytest with parameterized testing for both populated and empty tables

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pytest.ini	Re-enabled pytest class discovery to support class-based tests
org.knime.python3.arrow/src/main/python/knime/_arrow/_types.py	Added explicit boolean type to mask array creation
org.knime.python3.arrow/src/main/python/knime/_arrow/_table.py	Added TODO comment about struct-dict-encoded array re-batching issue
org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py	Fixed pandas 2.1.0+ compatibility by handling empty arrays and optimizing take() to prevent encoding corruption
org.knime.python3.arrow/src/main/python/knime/_arrow/_dictencoding.py	Added explicit boolean type to mask array creation
org.knime.python3.arrow.tests/src/test/python/unittest/test_table.py	Converted from unittest to pytest with parameterized testing for empty/non-empty tables
org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py	Added regression test for struct dict encoding corruption bug
org.knime.python3.arrow.tests/src/test/python/unittest/structDictEncodedDataCellsWithBatches.zip	Added test data file for struct dict encoding regression test
org.knime.python3.arrow.tests/src/test/python/unittest/emptyGeneratedTestData.zip	Added test data file for empty table testing
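
A class-based, parameterized pytest suite of the kind described for test_table.py might look like the following sketch. The names round_trip and TestRoundTrip are hypothetical, and the inline lists stand in for the populated and empty test tables, which the real tests presumably load from the zipped test data.

```python
import pytest


def round_trip(rows):
    """Hypothetical stand-in for a table -> pandas -> table conversion."""
    return list(rows)


# Parametrizing the class runs every test method once per parameter set,
# so the empty case is covered without duplicating test code.
@pytest.mark.parametrize("rows", [[1, 2, 3], []], ids=["populated", "empty"])
class TestRoundTrip:
    def test_preserves_rows(self, rows):
        assert round_trip(rows) == rows

    def test_preserves_length(self, rows):
        assert len(round_trip(rows)) == len(rows)
```

pytest only collects such a class if class discovery is enabled, which is what the pytest.ini change in this PR is described as restoring.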

pytest.ini

org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py

Copilot · 2026-02-05T09:19:14Z

@HedgehogCode I've opened a new pull request, #57, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

org.knime.python3.arrow.tests/src/test/python/unittest/test_table.py

org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py

…are taken

Bug: In pandas 2.1.0+, pd.concat changed its behavior and now calls KnimePandasExtensionArray.take with indices [0,1,2,...,n] to effectively copy the array. When take is called on struct dict encoded arrays, it delegates to storage.take(), which merges chunks of the ChunkedArray without recalculating the struct-dict encoding. This breaks the dictionary structure across chunk boundaries: keys in chunk 2 reference indices that only exist in chunk 1's data array, causing "Cannot read DataCell with empty type information" errors.

Fix: Shortcut the specific case where all indices are taken sequentially by returning a copy instead of calling storage.take(). This fixes the pd.concat code path that triggers during arrow → pandas → arrow round-trips.

Limitation: This is only a partial fix for the very specific case of taking all indices. The general take() and re-batching behavior for struct dict encoded arrays remains buggy when partial indices are selected or arbitrary re-batching occurs.

AP-25573 (Workflow tests fail with pandas 2.3)
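
The shortcut described in this commit message can be illustrated with a small pure-Python sketch. ToyExtensionArray and takes_everything_in_order are hypothetical names used only for illustration; this is not the code of KnimePandasExtensionArray, and the list-based general path merely stands in for storage.take().

```python
from typing import Sequence


def takes_everything_in_order(indices: Sequence[int], length: int) -> bool:
    """Return True if indices is exactly 0, 1, ..., length - 1."""
    return len(indices) == length and all(
        index == position for position, index in enumerate(indices)
    )


class ToyExtensionArray:
    """Hypothetical stand-in for an extension array backed by some storage."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def copy(self):
        return ToyExtensionArray(self._values)

    def take(self, indices):
        if takes_everything_in_order(indices, len(self)):
            # Identity take: return a copy and skip the general path,
            # which is where the real storage would be re-batched.
            return self.copy()
        # General path (assumption: stands in for storage.take()).
        return ToyExtensionArray(self._values[i] for i in indices)
```

As the limitation in the commit message notes, a check like this only covers the identity case; any partial or reordered selection still goes through the general path.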

sonarqubecloud · 2026-02-05T13:19:26Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

HedgehogCode requested a review from a team as a code owner February 3, 2026 14:29

HedgehogCode requested review from Copilot and knime-ghub-bot and removed request for a team February 3, 2026 14:29

Copilot AI reviewed Feb 3, 2026

pytest.ini (resolved)

org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py (outdated, resolved)

Copilot AI mentioned this pull request Feb 5, 2026

Update test convention docs to reflect class discovery for parameterized tests #57

Closed

HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from ce94bd8 to 239a578 on February 5, 2026 09:23

HedgehogCode added 5 commits February 5, 2026 10:29

AP-25573: Parameterize ArrowTableTest with empty table

d59fb50

AP-25573 (Workflow tests fail with pandas 2.3)

AP-25573: Add comment about changed pd.concat behavior

103ba71

AP-25573 (Workflow tests fail with pandas 2.3)

AP-25573: Remove special case in KnimePandasExtensionArray.take for empty arrays

dec9263

The usual case works for empty arrays and handles dtypes of extension arrays correctly. AP-25573 (Workflow tests fail with pandas 2.3)

AP-25573: Prevent using default float64 type for empty arrays instead of int/bool

264a497

AP-25573 (Workflow tests fail with pandas 2.3)

AP-25573: Shortcut for isna if extension array is empty

d198684

Otherwise, if the chunked array has no chunks, we attempt to concatenate no arrays, which fails. AP-25573 (Workflow tests fail with pandas 2.3)

HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from 239a578 to 6f8f9dd on February 5, 2026 09:29

Copilot AI review requested due to automatic review settings February 5, 2026 09:29

Copilot AI reviewed Feb 5, 2026

HedgehogCode added 3 commits February 5, 2026 12:58

AP-25573: Add todo comment for batch merging bug in KnimePandasExtensionArray.take

973b17c

AP-25573 (Workflow tests fail with pandas 2.3)

AP-25573: Add TODO comment about splitting before writing Arrow tables

e7ba282

AP-25573 (Workflow tests fail with pandas 2.3)

HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from 6f8f9dd to e7ba282 on February 5, 2026 11:58

HedgehogCode wants to merge 8 commits into master from bug/AP-25573-pandas-2-3-support
