I have several NDJSON files totaling nearly 800GB. They come from parsing the Wikipedia dump. I would like to remove duplicate html values, so I group by "html" and keep the JSON record with the most recent "dateModified".
import duckdb
from pathlib import Path
inDir = r"E:\Personal Projects\tmp\tarFiles\result2"
outDir = r"C:\Users\Akira\Documents\out_duckdb2.ndjson"
inDir = Path(inDir)
outDir = Path(outDir)
con = duckdb.connect()
result = con.sql(f"""
SET threads=10;
SET memory_limit='10GB';
SET preserve_insertion_order=false;
COPY(SELECT
html,
dateModified,
ROW_NUMBER() OVER (PARTITION BY html ORDER BY dateModified DESC) AS rn
FROM read_ndjson('{inDir / "*wiktionary*.ndjson"}'))
TO "{outDir}"
""")
Then I encounter this error:
---------------------------------------------------------------------------
OutOfMemoryException Traceback (most recent call last)
Cell In[3], line 10
7 outDir = Path(outDir)
9 con = duckdb.connect()
---> 10 result = con.sql(f"""
11 SET threads=10;
12 SET memory_limit='10GB';
13 SET preserve_insertion_order=false;
14 COPY(SELECT
15 html,
16 dateModified,
17 ROW_NUMBER() OVER (PARTITION BY html ORDER BY dateModified DESC) AS rn
18 FROM read_ndjson('{inDir / "*wiktionary*.ndjson"}'))
19 TO "{outDir}"
20 """)
OutOfMemoryException: Out of Memory Error: could not allocate block of size 256.0 KiB (9.3 GiB/9.3 GiB used)
Possible solutions:
* Reducing the number of threads (SET threads=X)
* Disabling insertion-order preservation (SET preserve_insertion_order=false)
* Increasing the memory limit (SET memory_limit='...GB')
See also https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads
My laptop has 32GB of RAM and 8 CPU cores (16 threads). I have read the Memory Management page in the DuckDB documentation but could not see how to fine-tune the parameters.
Could you explain how to fine-tune the parameters for my workload?
Comments:

- Use arg_max(field, dateModified) AS field for each field you want to retain. (Sorry, Chrome is not letting me Reply for some reason.)
- Regarding arg_max(field, dateModified) AS field and arg_max(field2, dateModified) AS field2: if two rows in the group have the same most recent dateModified but different values of field and field2, will the values of field and field2 come from the same row? So that we have consistency.
- Use unnest(arg_max(struct_pack(field1, field2), dateModified)).