2 changes: 1 addition & 1 deletion docs/reference/data-sources/overview.md
@@ -28,4 +28,4 @@ Below is a matrix indicating which data sources support which types.
| `float64` | yes | yes | yes | yes | yes | yes | yes | yes |
| `bool` | yes | yes | yes | yes | yes | yes | yes | yes |
| `timestamp` | yes | yes | yes | yes | yes | yes | yes | yes |
-| array types | yes | yes | yes | no | yes | yes | no | no |
+| array types | yes | yes | yes | no | yes | yes | yes | no |
4 changes: 2 additions & 2 deletions docs/reference/data-sources/trino.md
@@ -8,7 +8,7 @@ These can be specified either by a table reference or a SQL query.
## Disclaimer

The Trino data source does not achieve full test coverage.
Please do not assume complete stability.

## Examples

@@ -30,5 +30,5 @@ The full set of configuration options is available [here](https://rtd.feast.dev/

## Supported Types

-Trino data sources support all eight primitive types, but currently do not support array types.
+Trino data sources support all eight primitive types and their corresponding array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
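
For context, here is roughly what the newly documented array support looks like in a feature definition. This is a hedged sketch, not taken from this PR or the Feast docs: the `TrinoSource` import path and its parameters are assumptions based on Feast's contrib layout, and the table and column names are made up.

```python
# Hedged sketch: the TrinoSource import path and parameters are
# assumptions, not taken from this PR. Table/column names are made up.
from feast import Entity, FeatureView, Field
from feast.types import Array, Int64
from feast.infra.offline_stores.contrib.trino_offline_store.trino_source import (
    TrinoSource,
)

driver = Entity(name="driver", join_keys=["driver_id"])

# Hypothetical Trino table with an ARRAY(BIGINT) column.
source = TrinoSource(
    name="driver_stats_source",
    table="feature_store.driver_stats",
    timestamp_field="event_timestamp",
)

driver_stats = FeatureView(
    name="driver_stats",
    entities=[driver],
    schema=[
        # An array-typed feature, now supported by the Trino offline store.
        Field(name="trip_distances", dtype=Array(Int64)),
    ],
    source=source,
)
```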
@@ -207,9 +207,7 @@ def _to_df_internal(self, timeout: Optional[int] = None) -> pd.DataFrame:

def _to_arrow_internal(self, timeout: Optional[int] = None) -> pyarrow.Table:
"""Return payrrow dataset as synchronously including on demand transforms"""
-    return pyarrow.Table.from_pandas(
-        self._to_df_internal(timeout=timeout), schema=self.pyarrow_schema
-    )
+    return pyarrow.Table.from_pandas(self._to_df_internal(timeout=timeout))

why remove the explicit schema declaration?

Contributor Author:

Let's say I have a Pandas DataFrame like this:

feature_x    feature_y
[1]          None
[3, 4]       [54, 38]
[75, 1, 12]  [40, 0]

If I output the column dtypes, feature_x would still be object (array(int)), but feature_y would be converted to float by the underlying NumPy because of the null.

By default, pandas uses NumPy data types, which do not support missing values in integer arrays. If you create a Series or DataFrame column with integers and include a null value (e.g., None or np.nan), pandas will upcast the column to a floating-point type (float64) to accommodate the missing value.

If I later pass this dataframe WITH a forced schema, PyArrow will see that I want to cast float to array(int) and throw an error. If I don't pass the schema, though, it will infer the type itself and work as expected.
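
A minimal runnable sketch of this failure mode (illustrative only, not the Feast code path; column names are made up):

```python
import pandas as pd
import pyarrow as pa

# NumPy-backed pandas cannot store a missing value in an integer column,
# so a single null upcasts the whole column to float64.
print(pd.Series([1, None, 3]).dtype)  # float64

# An object column of lists is fine: PyArrow infers list<int64> on its
# own and maps None to a null entry.
lists = pd.DataFrame({"feature_x": [[1], [3, 4], [75, 1, 12]]})
print(pa.Table.from_pandas(lists).schema)  # feature_x: list<item: int64>

# But forcing a list schema onto a column that was already upcast to
# plain float64 asks PyArrow to cast float -> array(int), which fails.
upcast = pd.DataFrame({"feature_y": [1.0, 2.0, 3.0]})
forced = pa.schema([("feature_y", pa.list_(pa.int64()))])
try:
    pa.Table.from_pandas(upcast, schema=forced)
except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
    print(f"forced schema failed: {exc}")
```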


def to_sql(self) -> str:
"""Returns the SQL query that will be executed in Trino to build the historical feature table"""