Parquet Schema Inference only supports File, not directory #2685

@dvanbrug

When a FileSource in Parquet format points to a directory of partitioned Parquet files rather than a single file, schema inference fails. The following lines throw an error:

schema = ParquetFile(
    path if filesystem is None else filesystem.open_input_file(path)
).schema_arrow

OSError: Expected file path, but /home/ubuntu/project/data/driver_stats_partitioned is a directory

How to replicate:

  1. Start with a demo feast project (feast init).
  2. Create a partitioned Parquet dataset. Use the following to create a dataset with only a single timestamp column for inference:
     import pyarrow.parquet as pq
     df = pq.read_table("./data/driver_stats.parquet")
     df = df.drop(["created"])
     pq.write_to_dataset(df, "./data/driver_stats_partitioned")
  3. Update the file source in example.py to look like this:
     driver_hourly_stats = FileSource(
         path="/home/ubuntu/cado-feast/feature_store/exciting_sunbeam/data/driver_stats_partitioned2",
     )
  4. Run feast apply.

For now, I've been able to work around this by changing the lines above to:
schema = ParquetDataset(
    path if filesystem is None else filesystem.open_input_file(path)
).schema.to_arrow_schema()
