I am trying to read a flattened JSON file into a Polars DataFrame in Rust.

Here is an example of the flattened JSON format. How can this structure be read into a DataFrame without declaring each column's dtype in a struct?

{
  "data": [
    {
      "requestId": "IBM",
      "date": "2024-03-19",
      "sales": 61860,
      "company": "International Business Machines",
      "price": 193.34,
      "score": 7
    },
    {
      "requestId": "AAPL",
      "date": "2024-03-19",
      "sales": 383285,
      "company": "Apple Inc.",
      "price": 176.08,
      "score": 9
    },
    {
      "requestId": "MSFT",
      "date": "2024-03-19",
      "sales": 211915,
      "company": "Microsoft Corporation",
      "price": 421.41,
      "score": 7
    } 
  ]
}

The data contains only integers, floats, and strings.

Here is the struct I tried. If there are 200+ columns and they change over time, would it be better to store the columns dynamically in a HashMap?

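use serde::{Deserialize, Serialize};
use std::collections::HashMap;

#[derive(Debug, Deserialize, Serialize)]
#[serde(rename_all = "camelCase")]
struct Row {
    request_id: String, // maps to the "requestId" key via rename_all
    date: String,
    // Collects every remaining key without declaring it as a field.
    #[serde(flatten)]
    company_data: HashMap<String, serde_json::Value>,
}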

This is a follow-up to my question about the non-flattened JSON data: Transform JSON Key into a Polars DataFrame

1 Answer

This format is almost what polars' JsonReader expects; the only problem is the top-level object that wraps the array. We can strip that wrapper with string manipulation and let JsonReader infer the schema, so no column has to be declared:

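use polars::prelude::*; // JsonReader requires the polars `json` feature
use std::error::Error;

pub fn flattened(json: &str) -> Result<DataFrame, Box<dyn Error>> {
    // Strip the outer `{ ... }` so that only `"data": [ ... ]` remains.
    let json = json.trim();
    let json = json
        .strip_prefix("{")
        .ok_or("invalid JSON")?
        .strip_suffix("}")
        .ok_or("invalid JSON")?;
    // Strip the `"data"` key and the colon, leaving the bare array.
    let json = json.trim_start();
    let json = json.strip_prefix(r#""data""#).ok_or("invalid JSON")?;
    let json = json.trim_start();
    let json = json.strip_prefix(":").ok_or("invalid JSON")?;

    // No schema is supplied, so JsonReader infers the column dtypes.
    let json_reader = JsonReader::new(std::io::Cursor::new(json));
    let mut df = json_reader.finish()?;
    // Cast the `date` column to the Date dtype.
    let date = df.column("date")?.cast(&DataType::Date)?;
    df.replace("date", date)?;

    Ok(df)
}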
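
The wrapper-stripping step can also be pulled out into its own function and checked without polars. The following is a minimal sketch using only the standard library; the name `strip_data_wrapper` is hypothetical, and it assumes `"data"` is the only top-level key. It is string slicing rather than JSON parsing, so it only checks the outer shape of the input.

```rust
/// Returns the text of the JSON array stored under the top-level `"data"` key,
/// or `None` if the input does not look like `{ "data": [ ... ] }`.
///
/// Assumption: `"data"` is the only top-level key. The function does not parse
/// JSON; it only slices the input string.
pub fn strip_data_wrapper(json: &str) -> Option<&str> {
    // Remove the outer braces of the top-level object.
    let inner = json.trim().strip_prefix('{')?.strip_suffix('}')?;
    // Remove the `"data"` key, then the colon that follows it.
    let inner = inner.trim_start().strip_prefix(r#""data""#)?;
    let inner = inner.trim_start().strip_prefix(':')?.trim();
    // Accept the remainder only if it starts and ends like an array.
    if inner.starts_with('[') && inner.ends_with(']') {
        Some(inner)
    } else {
        None
    }
}
```

The returned slice borrows from the input, so it can be passed straight to `std::io::Cursor::new` as in `flattened` above. Returning `Option` keeps the helper free of error types; a caller can turn `None` into an error with `.ok_or("invalid JSON")?`.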

2 Comments

Thank you again for the example. Is there a way to avoid declaring each field? This data is much larger, with 200+ fields, and declaring them would be time-consuming, especially when the fields change. I was trying to see whether requestId and date could serve as fixed reference columns that the rest of the fields align on, essentially a multi-index on date and requestId.
@TrevorSeibert Yes, it is possible to omit the schema, and the JsonReader will guess it. Updated the answer.
