
I have several NDJSON files, nearly 800 GB in total, that come from parsing the Wikipedia dump. I would like to remove duplicate html entries, so I group by "html" and keep the record with the most recent "dateModified".

from pathlib import Path
import polars as pl

inDir   = r"E:\Personal Projects\tmp\tarFiles\result2"
outDir  = r"C:\Users\Akira\Documents\out_polars.ndjson"
inDir   = Path(inDir)
outDir  = Path(outDir)

schema = {"name"        : pl.String,
          "dateModified": pl.String,
          "identifier"  : pl.UInt64,
          "url"         : pl.String,
          "html"        : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
lf = lf.group_by(["html"]).agg(pl.max("dateModified").alias("dateModified"))
lf.sink_ndjson(outDir,
               maintain_order=False,
               engine="streaming")

However, I encounter an out-of-memory (OOM) error: RAM usage increases gradually until my laptop crashes:

The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click here (https://aka.ms/vscodeJupyterKernelCrash) for more info.
View Jupyter log for further details.

How can we resolve the OOM problem? Thank you for your elaboration.

  • Depending on your business case, and assuming you can use .scan_ndjson(), a good approach is to leverage PartitionByKey on the key that matters most and recurs least. Perform your aggregation on each partitioned file, concatenate those intermediate results, and then run the final aggregation on the combined output (a rough sketch of this partition idea follows these comments). Commented Dec 10 at 11:33
  • I assume the problem is "html", since that is where the website data is stored? You could group by "url" without loading "html" (exclude it from the schema), save the result to disk, and then do a streaming join where you load and join "html" onto that result (a sketch of this follows these comments). Alternatively, you could use polars.Expr.hash on "html" in a first streaming pass, group by the hash, save the result, and then do the same streaming join. In both options you don't need to include the "html" data in the group-by, which I guess can't be streamed because of the max function, causing the OOM. Commented Dec 10 at 13:46
  • @usdn The second approach using polars.Expr.hash is promising. Could you give me the details and post it as an answer? Commented Dec 10 at 13:50
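
For reference, a rough sketch of the partition-then-aggregate idea from the first comment, reusing inDir and schema from the question. Since this data has no small, low-cardinality key to hand to pl.PartitionByKey, the sketch buckets rows by a hash of "html" instead and filters one bucket per pass; bucketDir and the bucket count N are assumptions to adjust to your disk and RAM.

from pathlib import Path
import polars as pl

bucketDir = Path(r"E:\Personal Projects\tmp\buckets")   # hypothetical scratch directory
bucketDir.mkdir(parents=True, exist_ok=True)
N = 64                                                  # hypothetical bucket count; one bucket must fit in RAM

src = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)

for i in range(N):
    # Keep only the rows whose "html" hashes into bucket i, deduplicate that
    # bucket, and stream the result to disk. Identical "html" values always
    # land in the same bucket, so the per-bucket outputs are already globally
    # deduplicated and only need to be concatenated at the end.
    (src.filter(pl.col("html").hash() % N == i)
        .group_by("html")
        .agg(pl.max("dateModified"))
        .sink_ndjson(bucketDir / f"bucket_{i}.ndjson",
                     maintain_order=False,
                     engine="streaming"))

The trade-off is that the input is scanned N times, so this exchanges wall-clock time for bounded memory; a single-pass partitioned sink via pl.PartitionByKey avoids the repeated scans if your Polars version provides it.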
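
And a minimal sketch of the first option in the second comment (group by "url" with "html" excluded from the schema, then a streaming join to bring "html" back), again reusing inDir and schema from the question. scratchDir and the output file name are hypothetical, and it assumes that duplicate pages share the same "url"; if your Polars version insists on a full schema, pass the full schema and select only the small columns instead, relying on projection pushdown.

from pathlib import Path
import polars as pl

scratchDir = Path(r"E:\Personal Projects\tmp\scratch")   # hypothetical scratch directory
scratchDir.mkdir(parents=True, exist_ok=True)

# First pass: group by "url" with "html" left out of the schema, so the
# large column never has to be materialized.
slim_schema = {"name"        : pl.String,
               "dateModified": pl.String,
               "identifier"  : pl.UInt64,
               "url"         : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=slim_schema)
lf = lf.group_by("url").agg(pl.max("dateModified"))
lf.sink_ipc(scratchDir / "url_groupby.arrow",
            maintain_order=False,
            engine="streaming")

# Second pass: stream the data again, this time with "html", and join it
# onto the (url, most recent dateModified) pairs. Exact ties on both
# columns would still need a final unique() pass.
keep = pl.scan_ipc(scratchDir / "url_groupby.arrow")
full = (pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
          .select(pl.col("url", "dateModified", "html")))
out  = keep.join(full, on=["url", "dateModified"], how="left")
out.sink_ndjson(scratchDir / "dedup_by_url.ndjson",
                maintain_order=False,
                engine="streaming")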

1 Answer


Following up on my comment, the approach below could be an option. It avoids pulling all the "html" data into the group_by, which is the likely cause of the OOM. Since there is no MRE, I cannot test it to make sure it is correct...

# Create hash of "html" using streaming
schema = {"name"        : pl.String,
          "dateModified": pl.String,
          "identifier"  : pl.UInt64,
          "url"         : pl.String,
          "html"        : pl.String}

lf = pl.scan_ndjson(inDir / "*wiktionary*.ndjson", schema=schema)
lf = lf.select(pl.col("html", "dateModified"),
               pl.col("html").hash(10, 20, 30, 40).alias("html_hash"))
lf.sink_ipc(tmpDir / "hash.arrow",   # tmpDir: a Path to a scratch directory with enough free disk space
            maintain_order=False,
            engine="streaming")

# Group-by on hash, not including "html"
df = pl.read_ipc(tmpDir / "hash.arrow", columns=["html_hash", "dateModified"])
df = df.group_by("html_hash").agg(pl.max("dateModified"))
df.write_ipc(tmpDir / "groupby.arrow")

# Join "html" on result of group-by using streaming
df = pl.scan_ipc(tmpDir / "groupby.arrow")
lf = pl.scan_ipc(tmpDir / "hash.arrow").select(pl.col("html_hash", "html"))
df = df.join(lf, on="html_hash", how="left")
df.sink_ipc(outDir / "output.arrow",
            maintain_order=False,
            engine="streaming")

3 Comments

Two different html values may have the same hash, so I guess we would need to group by "html" within each sub-dataframe whose rows share the same "html_hash". Of course, each such sub-dataframe is small enough to fit in RAM. (A quick collision check is sketched after these comments.)
Hash collisions are something to consider, and the risk depends on how many individual "html" values you have. Since Polars creates a UInt64 hash, the number of "html" values at which a collision becomes more likely than not is roughly Sqrt[n] = Sqrt[2^64] = 2^32, i.e. about 4 billion (source and more details: stackoverflow.com/questions/22029012/…).
Thank you very much for your elaboration. The benefit of reduced computation outweighs the possibility of duplication.
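
For what it's worth, here is a minimal sketch of such a check against the intermediate hash.arrow from the answer, assuming a second, independently seeded hash (the seeds are arbitrary): if no "html_hash" group contains more than one distinct second hash, the hash-based dedup lost nothing.

import polars as pl

lf = (pl.scan_ipc(tmpDir / "hash.arrow")
        .with_columns(pl.col("html").hash(50, 60, 70, 80).alias("html_hash2")))

# Any "html_hash" whose group holds more than one distinct second hash is a
# real collision; only those groups would need the exact group-by on "html"
# suggested in the first comment above.
collisions = (lf.group_by("html_hash")
                .agg(pl.col("html_hash2").n_unique().alias("n_distinct"))
                .filter(pl.col("n_distinct") > 1)
                .collect(engine="streaming"))
print(collisions)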
