
Conversation


@shuchu shuchu commented Nov 17, 2023

What this PR does / why we need it:
Update pyarrow to the latest version, v14.0.1, which includes the fix for GHSA-5wvp-7f3h-6wmm (CVE-2023-47248).

Which issue(s) this PR fixes:
Fixes #3832

  1. Before this PR, Feast pinned pyarrow to v10.0.1, whose default parquet format version for "pyarrow.parquet.write_table()" is "2.4" (see the "version" argument).
  2. With parquet format v2.4, write_table() lowers the timestamps' resolution. From the documentation of the "coerce_timestamps" argument:
    "By default, for version='1.0' (the default) and version='2.4', nanoseconds are cast to microseconds ('us')."
  3. After upgrading Feast to "pyarrow==14.0.1", the default "version" is "2.6". As a result, timestamp columns keep the "datetime[ns]" type instead of being cast to "datetime[us]" as before. This change broke writing a pyarrow table to Google BigQuery and AWS Redshift: in my debugging, both raised a value error for columns of "datetime" type.
  4. I explicitly set the "coerce_timestamps" argument to "us" so that write_table() keeps the same timestamp behavior as before this pyarrow upgrade.
  5. Not all calls of "pyarrow.parquet.write_table()" are updated. Here is the list of call sites found:

```
./python/feast/transformation_server.py:54: writer.write_table(result_arrow)
./python/feast/infra/offline_stores/file.py:109: pyarrow.parquet.write_table(
./python/feast/infra/offline_stores/file.py:470: writer.write_table(new_table)
./python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py:236: pq.write_table(table, tmp_file.name)
./python/feast/infra/offline_stores/contrib/mssql_offline_store/mssql.py:373: pyarrow.parquet.write_table(
./python/feast/infra/offline_stores/bigquery.py:358: pyarrow.parquet.write_table(table=data, where=parquet_temp_file, coerce_timestamps="us")
./python/feast/infra/offline_stores/bigquery.py:407: pyarrow.parquet.write_table(table=table, where=parquet_temp_file, coerce_timestamps="us")
./python/feast/infra/utils/aws_utils.py:207: pq.write_table(table, file_path)
./python/feast/infra/utils/aws_utils.py:356: pq.write_table(table, parquet_temp_file, coerce_timestamps="us")
./python/feast/infra/utils/aws_utils.py:1049: pq.write_table(table, parquet_temp_file)
```

I tried to keep the change minimal. If an error shows up in the future, for example in the "upload_arrow_table_to_athena()" function at "aws_utils.py:1049", a new PR can be created with the necessary unit tests and integration tests.

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
@shuchu shuchu changed the title fix: upgrade the pyarrow to latest v14.0.1 for CVE-2023-47248. fix: Upgrade the pyarrow to latest v14.0.1 for CVE-2023-47248. Nov 17, 2023
@achals achals merged commit 052182b into feast-dev:master Nov 18, 2023
@cburroughs cburroughs mentioned this pull request Jan 13, 2024