
I have several NDJSON files, nearly 800 GB in total, that come from parsing the Wikipedia dump. I would like to remove duplicate html entries, so I group by "html" and keep the record with the most recent "dateModified".

from pathlib import Path
import polars as pl

inDir   = r"E:\Personal Projects\tmp\tarFiles\result2"
outDir  = r"C:\Users\Akira\Documents\out_polars.ndjson"
inDir   = Path(inDir)
outDir  = Path(outDir)

schema = {"name"        : pl.String,
          "dateModified": pl.String,
          "identifier"  : pl.UInt64,
          "url"         : pl.String,
          "html"        : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
lf = lf.group_by(["html"]).agg(pl.max("dateModified").alias("dateModified"))
lf.sink_ndjson(outDir,
               maintain_order=False,
               engine="streaming")

However, I encounter an out-of-memory (OOM) error: RAM usage increases gradually until my laptop crashes:

The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click here (https://aka.ms/vscodeJupyterKernelCrash) for more info.
View Jupyter log for further details.

How can we resolve the OOM problem? Thank you for your elaboration.

  • Depending on your business case, and assuming you can use .scan_ndjson(), a good approach is to leverage PartitionByKey on the key that matters most and recurs least. Perform your aggregation on each partitioned file, concatenate those intermediate results, and then run the final aggregation on the combined output (a rough sketch of this partition idea follows these comments). Commented Dec 10 at 11:33
  • I assume the problem is "html", since that is where the website data is stored? You could group by "url" without loading "html" (exclude it from the schema), save the result to disk, and then do a streaming join where you load and join "html" onto that result (a sketch of this follows these comments). Alternatively, you could use polars.Expr.hash on "html" in a first streaming pass, group by the hash, save the result, and then do the same streaming join. In both options you don't need to include the "html" data in the group-by, which I guess can't be streamed because of the max function, causing the OOM. Commented Dec 10 at 13:46
  • @usdn The second approach using polars.Expr.hash is promising. Could you give me the details and post it as an answer? Commented Dec 10 at 13:50
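
For reference, a rough sketch of the partition-then-aggregate idea from the first comment, reusing inDir and schema from the question. Since this data has no small, low-cardinality key to hand to pl.PartitionByKey, the sketch buckets rows by a hash of "html" instead and filters one bucket per pass; bucketDir and the bucket count N are assumptions to adjust to your disk and RAM.

from pathlib import Path
import polars as pl

bucketDir = Path(r"E:\Personal Projects\tmp\buckets")   # hypothetical scratch directory
bucketDir.mkdir(parents=True, exist_ok=True)
N = 64                                                  # hypothetical bucket count; one bucket must fit in RAM

src = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)

for i in range(N):
    # Keep only the rows whose "html" hashes into bucket i, deduplicate that
    # bucket, and stream the result to disk. Identical "html" values always
    # land in the same bucket, so the per-bucket outputs are already globally
    # deduplicated and only need to be concatenated at the end.
    (src.filter(pl.col("html").hash() % N == i)
        .group_by("html")
        .agg(pl.max("dateModified"))
        .sink_ndjson(bucketDir / f"bucket_{i}.ndjson",
                     maintain_order=False,
                     engine="streaming"))

The trade-off is that the input is scanned N times, so this exchanges wall-clock time for bounded memory; a single-pass partitioned sink via pl.PartitionByKey avoids the repeated scans if your Polars version provides it.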
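
And a minimal sketch of the first option in the second comment (group by "url" with "html" excluded from the schema, then a streaming join to bring "html" back), again reusing inDir and schema from the question. scratchDir and the output file name are hypothetical, and it assumes that duplicate pages share the same "url"; if your Polars version insists on a full schema, pass the full schema and select only the small columns instead, relying on projection pushdown.

from pathlib import Path
import polars as pl

scratchDir = Path(r"E:\Personal Projects\tmp\scratch")   # hypothetical scratch directory
scratchDir.mkdir(parents=True, exist_ok=True)

# First pass: group by "url" with "html" left out of the schema, so the
# large column never has to be materialized.
slim_schema = {"name"        : pl.String,
               "dateModified": pl.String,
               "identifier"  : pl.UInt64,
               "url"         : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=slim_schema)
lf = lf.group_by("url").agg(pl.max("dateModified"))
lf.sink_ipc(scratchDir / "url_groupby.arrow",
            maintain_order=False,
            engine="streaming")

# Second pass: stream the data again, this time with "html", and join it
# onto the (url, most recent dateModified) pairs. Exact ties on both
# columns would still need a final unique() pass.
keep = pl.scan_ipc(scratchDir / "url_groupby.arrow")
full = (pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
          .select(pl.col("url", "dateModified", "html")))
out  = keep.join(full, on=["url", "dateModified"], how="left")
out.sink_ndjson(scratchDir / "dedup_by_url.ndjson",
                maintain_order=False,
                engine="streaming")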

1 Answer


Following up on my comment, the approach below could be an option. It avoids pulling all the "html" data into the group_by, which is the likely cause of the OOM. Since there is no MRE, I cannot test it to make sure it is correct...

# Create hash of "html" using streaming
schema = {"name"        : pl.String,
          "dateModified": pl.String,
          "identifier"  : pl.UInt64,
          "url"         : pl.String,
          "html"        : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
lf = lf.select(pl.col("html", "dateModified"),
               pl.col("html").hash(10, 20, 30, 40).alias("html_hash"))
lf.sink_ipc(tmpDir / "hash.arrow",   # tmpDir: a Path to a scratch directory with enough free disk space
            maintain_order=False,
            engine="streaming")

# Group-by on hash, not including "html"
df = pl.read_ipc(tmpDir / "hash.arrow", columns=["html_hash", "dateModified"])
df = df.group_by("html_hash").agg(pl.max("dateModified"))
df.write_ipc(tmpDir / "groupby.arrow")

# Join "html" on result of group-by using streaming
df = pl.scan_ipc(tmpDir / "groupby.arrow")
lf = pl.scan_ipc(tmpDir / "hash.arrow").select(pl.col("html_hash", "html"))
df = df.join(lf, on="html_hash", how="left")
df.sink_ipc(outDir / "output.arrow",
            maintain_order=False,
            engine="streaming")

3 Comments

Two different html values may have the same hash, so I guess we would need to group by "html" within each sub-dataframe whose rows share the same "html_hash". Of course, each such sub-dataframe is small enough to fit in RAM. (A quick collision check is sketched after these comments.)
Hash collisions are something to consider, and the risk depends on how many individual "html" values you have. Since Polars creates a UInt64 hash, the number of "html" values at which a collision becomes more likely than not is roughly Sqrt[n] = Sqrt[2^64] = 2^32, i.e. about 4 billion (source and more details: stackoverflow.com/questions/22029012/…).
Thank you very much for your elaboration. The benefit of reduced computation outweighs the possibility of duplication.
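
For what it's worth, here is a minimal sketch of such a check against the intermediate hash.arrow from the answer, assuming a second, independently seeded hash (the seeds are arbitrary): if no "html_hash" group contains more than one distinct second hash, the hash-based dedup lost nothing.

import polars as pl

lf = (pl.scan_ipc(tmpDir / "hash.arrow")
        .with_columns(pl.col("html").hash(50, 60, 70, 80).alias("html_hash2")))

# Any "html_hash" whose group holds more than one distinct second hash is a
# real collision; only those groups would need the exact group-by on "html"
# suggested in the first comment above.
collisions = (lf.group_by("html_hash")
                .agg(pl.col("html_hash2").n_unique().alias("n_distinct"))
                .filter(pl.col("n_distinct") > 1)
                .collect(engine="streaming"))
print(collisions)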
