Issue

I have data in the form of a list of dicts (see MRE below). To keep everything strictly typed, I would like to always pass in the expected schema (dtypes) when I read in this data. The pl.DataFrame constructor offers this through either schema or schema_overrides. However, I frequently run into trouble with Datetime columns in the schema, especially when the values are present as strings in the dictionaries.

Traceback

polars.exceptions.ComputeError: could not append value: "2020-02-11" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

Question

Is there a way to "automatically" parse datetime strings when I construct the DataFrame (or use pl.from_dicts())? Something comparable to the solution implemented in early 2024 (github issue) for data that is present as integer timestamps in the dictionaries?

Is there something similar for date information present as a string (e.g. 2022-01-01)?

Or do I have to drop every pl.Datetime key from my schema_override and then convert those columns manually afterwards via

with_columns(pl.col(list_dropped_datetime_cols).cast(pl.Datetime))

MRE

import polars as pl

schema_override = {
    "some_int_override": pl.Int8,
    "some_date_override": pl.Datetime,
}

dict_data = [
    {
        "some_int_override": 1,
        "some_date_override": "2020-02-11",
        "some_date": "2025-02-11",
    }
]


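# Without overrides this works: the date columns are simply read as strings.
df_naiive = pl.DataFrame(dict_data)
print(df_naiive)

# With the pl.Datetime override this raises the ComputeError shown in the traceback above.
df_schema_override = pl.DataFrame(dict_data, schema_overrides=schema_override)
print(df_schema_override)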

  • I wonder if this is considered a bug? There seems to be a behaviour mismatch between the different frame construction methods. CSV "works" in this case: pl.read_csv(df_naiive.write_csv().encode(), schema_overrides=schema_override). The manual cast() also gives me a ComputeError, but .str.to_datetime() works. (It seems temporal cast is going to be deprecated: github.com/pola-rs/polars/issues/23363) Commented Aug 2 at 8:21
  • I did not know this, thanks. Inconsistent behavior across construction methods could be considered a bug imo. However, I believe it could be more of a side effect in pl.read_csv. Do you think it is worth raising this as a GitHub issue, maybe with an enhancement proposal? I know more people who have the same problem, e.g. they have a pipeline with multiple containers that need to access data and just want a dtype map to reconstruct the data types from the previous step. Commented Aug 4 at 17:44
  • I did find github.com/pola-rs/polars/issues/19258 but it's about CSV/JSON/NDJSON behaving differently. I can't find any issues about how DataFrame() behaves. I guess people must be using the with_columns approach if it hasn't been raised before. Commented Aug 5 at 17:22
  • github.com/pola-rs/polars/issues/12900#issuecomment-1842170343 seems to suggest the schema= behaviour is expected for DataFrame() Commented Aug 8 at 9:55
