I have several ndjson files totaling nearly 800GB. They come from parsing the Wikipedia dump. I would like to remove duplicate HTML documents, so I group by `"html"` and keep the row with the most recent `"dateModified"`.
from pathlib import Path

import polars as pl

inDir = Path(r"E:\Personal Projects\tmp\tarFiles\result2")
outDir = Path(r"C:\Users\Akira\Documents\out_polars.ndjson")

schema = {
    "name": pl.String,
    "dateModified": pl.String,
    "identifier": pl.UInt64,
    "url": pl.String,
    "html": pl.String,
}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
lf = lf.group_by("html").agg(pl.max("dateModified").alias("dateModified"))
lf.sink_ndjson(outDir, maintain_order=False, engine="streaming")
However, I encounter an out-of-memory (OOM) error: RAM usage increases gradually until my laptop crashes:
The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info.
View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details.
How can I resolve the OOM problem? Thank you for your elaboration.
With `.scan_ndjson()`, a good approach is to leverage `PartitionByKey` on the key that is most important and least recurrent. Perform your aggregation function on each partitioned file, then concatenate those intermediate results. Finally, run your final aggregation on the combined output.

Do you need to group by `"html"`, since that's where the bulky website data is stored? You could group by `"url"` without loading `"html"` (exclude it from the schema), save the result to disk, and then do a streaming join where you load `"html"` and join it on the above result. Alternatively, you could use `polars.Expr.hash` on `"html"` in a first stream, do the group-by on the hash, save the result, and then do the above-mentioned streaming join. With either option you don't need to include the `"html"` data in the group-by, which I guess can't be streamed because of the max function, causing the OOM.

`polars.Expr.hash` is promising. Could you give me details and post it as an answer?