FIX: Request SQL_CHAR as SQL_C_WCHAR in arrow fetch path #575
Pull request overview
Updates the Arrow fetch path in the C++ pybind layer to always request SQL_CHAR/SQL_VARCHAR data as SQL_C_WCHAR (UTF-16) so Arrow results are correct regardless of server/client codepage, locale, or platform—addressing the VARCHAR non-ASCII decoding issues reported in #553.
Changes:
- Switch Arrow batch binding/fetching for `SQL_CHAR`/`SQL_VARCHAR` from `SQL_C_CHAR` to `SQL_C_WCHAR` to avoid codepage-dependent decoding.
- Remove the narrow-char copy path for `SQL_CHAR`/`SQL_VARCHAR` in Arrow batch production and route through the existing wide-char → UTF-8 conversion logic.
- Add an Arrow regression test covering Unicode round-tripping through a UTF-8-collated `VARCHAR` column.
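To illustrate why a narrow-character fetch is codepage-dependent, here is a minimal Python sketch. It is not the driver's code; the sample string and variable names are assumptions chosen for the demonstration. The same text produces different bytes under CP1252 and UTF-8, so a consumer that guesses the wrong encoding either corrupts the text or fails, while UTF-16LE bytes decode the same way everywhere.

```python
# Illustrative only: sample text and variable names are hypothetical.
text = "Café Ünïcode"

# With a narrow C type, the bytes handed back depend on the active codepage.
cp1252_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

# Wrong guess 1: UTF-8 bytes read as CP1252 silently turn into mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Wrong guess 2: CP1252 bytes read as strict UTF-8 raise an error.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# With a wide C type, the buffer is UTF-16LE regardless of codepage or locale.
utf16_bytes = text.encode("utf-16-le")
roundtrip = utf16_bytes.decode("utf-16-le")
```

Requesting wide characters removes the guess entirely, at the cost of a larger intermediate buffer for mostly-ASCII data.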
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `mssql_python/pybind/ddbc_bindings.cpp` | Changes Arrow batch binding and fetch handling so VARCHAR is requested as SQL_C_WCHAR, ensuring consistent Unicode correctness. |
| `tests/test_004_cursor_arrow.py` | Adds a regression test to validate Arrow output for Unicode data stored in VARCHAR with UTF-8 collation. |
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
// it processes raw byte buffers directly, not via Python codecs.
ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize, SQL_C_CHAR);
// Always request WCHARs so we don't have to deal with CHAR encodings
ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize, SQL_C_WCHAR);
@ffelixg, I think we should avoid hardcoding SQL_C_WCHAR here. With the recent design update introduced in PR #495 for CP1252 character set handling, we've moved toward a more flexible approach. It would be good to align Arrow support with that design for consistency and maintainability.
Work Item / Issue Reference
Summary
Due to #495, we can now request SQL_CHAR data as SQL_C_WCHAR, i.e. UTF-16LE strings. Doing this for the Arrow path ensures that Arrow methods always return correct data regardless of encoding settings, locale, or operating system. There does not seem to be any significant negative performance impact.
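As a rough sketch of what the wide-char → UTF-8 step has to produce, the Python below converts UTF-16LE cell values into the offsets-plus-data layout that Arrow uses for string columns. The function name `utf16le_cells_to_utf8_buffers` and its input shape are hypothetical, and the actual conversion lives in the C++ binding layer, which this does not reproduce.

```python
from typing import List, Optional, Tuple


def utf16le_cells_to_utf8_buffers(
    cells: List[Optional[bytes]],
) -> Tuple[List[int], bytes, List[bool]]:
    """Hypothetical helper: transcode UTF-16LE cells into Arrow-style buffers.

    Returns (offsets, utf8_data, validity), where offsets has len(cells) + 1
    entries and a NULL cell contributes zero bytes.
    """
    offsets = [0]
    data = bytearray()
    validity = []
    for cell in cells:
        if cell is None:
            validity.append(False)  # NULL: offset repeats, no bytes added
        else:
            validity.append(True)
            data += cell.decode("utf-16-le").encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data), validity


# Example cells as they might arrive from a SQL_C_WCHAR fetch (assumed values).
cells = ["héllo".encode("utf-16-le"), None, "日本".encode("utf-16-le")]
offsets, data, validity = utf16le_cells_to_utf8_buffers(cells)
```

Because the input is always UTF-16LE, the transcoding step needs no knowledge of the column collation or the client locale.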