Create Polars DataFrame with Flattened Json File

Question

I am trying to read a flattened JSON file into a Polars DataFrame in Rust.

Here is a sample of the flattened JSON. How can this structure be read into a DataFrame without declaring each column's dtype in a struct?

{
  "data": [
    {
      "requestId": "IBM",
      "date": "2024-03-19",
      "sales": 61860,
      "company": "International Business Machines",
      "price": 193.34,
      "score": 7
    },
    {
      "requestId": "AAPL",
      "date": "2024-03-19",
      "sales": 383285,
      "company": "Apple Inc.",
      "price": 176.08,
      "score": 9
    },
    {
      "requestId": "MSFT",
      "date": "2024-03-19",
      "sales": 211915,
      "company": "Microsoft Corporation",
      "price": 421.41,
      "score": 7
    } 
  ]
}

The data contains only integers, floats, and strings.

Here is the struct I tried. With 200+ columns that change over time, would it be better to store the columns dynamically in a HashMap?

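use std::collections::HashMap;

use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize, Serialize)]
#[serde(rename_all = "camelCase")]
struct Row {
    // Serialized as "requestId" through rename_all = "camelCase".
    request_id: String,
    date: String,
    // Every remaining key lands here, whatever its name or JSON type.
    #[serde(flatten)]
    company_data: HashMap<String, serde_json::Value>,
}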

This is a follow-up to my question about the non-flattened JSON data: Transform JSON Key into a Polars DataFrame

Chayim Friedman · Accepted Answer · 2024-03-26 12:32:21Z


This format is almost what Polars' JsonReader expects; the only problem is the top-level object. We can strip it with string manipulation:

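use std::error::Error;

use polars::prelude::*;

pub fn flattened(json: &str) -> Result<DataFrame, Box<dyn Error>> {
    // Peel off the outer `{ ... }` of the top-level object.
    let json = json.trim();
    let json = json
        .strip_prefix("{")
        .ok_or("invalid JSON")?
        .strip_suffix("}")
        .ok_or("invalid JSON")?;
    // Drop the `"data"` key and the colon after it, leaving only the array.
    let json = json.trim_start();
    let json = json.strip_prefix(r#""data""#).ok_or("invalid JSON")?;
    let json = json.trim_start();
    let json = json.strip_prefix(":").ok_or("invalid JSON")?;

    // No schema is supplied, so JsonReader infers the column dtypes.
    let json_reader = JsonReader::new(std::io::Cursor::new(json));
    let mut df = json_reader.finish()?;
    // Dates are read as strings; cast the column to a proper Date dtype.
    let date = df.column("date")?.cast(&DataType::Date)?;
    df.replace("date", date)?;

    Ok(df)
}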
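
To see exactly what the stripping steps hand to the reader, they can be pulled out into a small helper that uses only the standard library. This is a minimal sketch, not part of the original answer: the name `strip_data_wrapper` is hypothetical, and it assumes `"data"` is the only key of the top-level object (a second key after the array would leave trailing text that is no longer valid JSON).

```rust
/// Strip a top-level `{"data": ... }` wrapper and return the inner JSON text.
///
/// Hypothetical helper for illustration. Returns `None` when the input does
/// not start with `{`, end with `}`, and have `"data"` as its first key.
/// Assumption: `"data"` is the only key in the top-level object.
fn strip_data_wrapper(json: &str) -> Option<&str> {
    let inner = json
        .trim()
        // Remove the braces of the top-level object.
        .strip_prefix('{')?
        .strip_suffix('}')?
        .trim_start()
        // Remove the key, then the colon that separates it from the value.
        .strip_prefix(r#""data""#)?
        .trim_start()
        .strip_prefix(':')?;
    // Trim the whitespace left around the array itself.
    Some(inner.trim())
}
```

The returned slice borrows from the input, so nothing is copied; it can be wrapped in `std::io::Cursor::new(...)` and passed to `JsonReader::new` as in the function above. For input whose shape is less predictable, parsing with a real JSON parser and extracting the `data` value would be more robust than string stripping.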

edited Mar 26, 2024 at 12:32

answered Mar 25, 2024 at 22:03

Chayim Friedman


2 Comments

Trevor Seibert Over a year ago

Thank you again for the example. Do you know if there is a way to avoid declaring each field? This data is much larger, with 200+ fields, and declaring them would be time-consuming, especially when the fields change. I was trying to see if we could use requestId and date as a fixed reference that the rest of the fields align on, essentially a multi-index on date and requestId.

Chayim Friedman Over a year ago

@TrevorSeibert Yes, it is possible to omit the schema, and the JsonReader will guess it. Updated the answer.
