
I have several NDJSON files, nearly 800 GB, that come from parsing the Wikipedia dump. I would like to remove duplicate HTML, so I group by "html" and pick the JSON object with the most recent "dateModified".

import duckdb
from pathlib import Path

inDir   = r"E:\Personal Projects\tmp\tarFiles\result2"
outDir  = r"C:\Users\Akira\Documents\out_duckdb2.ndjson"
inDir   = Path(inDir)
outDir  = Path(outDir)

con = duckdb.connect()
result = con.sql(f"""
    SET threads=10;
    SET memory_limit='10GB';
    SET preserve_insertion_order=false;
    COPY(SELECT
        html,
        dateModified,
        ROW_NUMBER() OVER (PARTITION BY html ORDER BY dateModified DESC) AS rn
    FROM read_ndjson('{inDir / "*wiktionary*.ndjson"}'))
    TO "{outDir}"
""")

Then I encounter this error:

---------------------------------------------------------------------------
OutOfMemoryException                      Traceback (most recent call last)
Cell In[3], line 10
      7 outDir  = Path(outDir)
      9 con = duckdb.connect()
---> 10 result = con.sql(f"""
     11     SET threads=10;
     12     SET memory_limit='10GB';
     13     SET preserve_insertion_order=false;
     14     COPY(SELECT
     15         html,
     16         dateModified,
     17         ROW_NUMBER() OVER (PARTITION BY html ORDER BY dateModified DESC) AS rn
     18     FROM read_ndjson('{inDir / "*wiktionary*.ndjson"}'))
     19     TO "{outDir}"
     20 """)

OutOfMemoryException: Out of Memory Error: could not allocate block of size 256.0 KiB (9.3 GiB/9.3 GiB used)

Possible solutions:
* Reducing the number of threads (SET threads=X)
* Disabling insertion-order preservation (SET preserve_insertion_order=false)
* Increasing the memory limit (SET memory_limit='...GB')

See also https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads

My laptop has 32 GB of RAM and 8 CPU cores (16 threads). I have read Memory Management in DuckDB but could not see how to fine-tune the parameters.

Could you explain how to fine-tune the parameters for my workload?

4 Comments

  • Yes, you can use arg_max(field, dateModified) as field for each field you want to retain. (Sorry, Chrome is not letting me Reply for some reason.) Commented Dec 10 at 23:15
  • @hawkfish Assume I use arg_max(field, dateModified) as field and arg_max(field2, dateModified) as field2. If two rows in the group have the same most recent dateModified but different values of field and of field2, will the values of field and field2 come from the same row, so that we have consistency? Commented Dec 10 at 23:29
  • You could use a struct (duckdb.org/docs/stable/sql/data_types/struct) and unnest it back into individual columns, e.g. unnest(arg_max(struct_pack(field1, field2), dateModified)); see the sketch after these comments. Commented Dec 11 at 11:18
  • @jcurious's solution will ensure that you end up with values from the same row. Commented Dec 12 at 7:24
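
For the follow-up about other fields, here is a minimal sketch of the struct_pack/arg_max pattern suggested in the comments above. The names field1 and field2 are hypothetical stand-ins for whatever other keys the NDJSON objects actually carry, and the glob path is only illustrative:

-- Sketch only: field1/field2 are placeholders for the real JSON keys.
SELECT
    html,
    max(dateModified) AS dateModified,
    -- arg_max keeps the struct from the row with the latest dateModified in each group;
    -- unnest expands that struct back into separate field1/field2 columns.
    unnest(arg_max(struct_pack(field1 := field1, field2 := field2), dateModified))
FROM read_ndjson('*wiktionary*.ndjson')
GROUP BY html;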

1 Answer


The query you are using does not actually remove duplicates - it adds a row number, in reverse dateModified order, to every row of the table. This is very memory-intensive because the entire table has to be sorted.

Instead, you just need to aggregate, grouping by html and extracting the latest dateModified:

select html, max(dateModified) as dateModified
from read_ndjson(...)
group by all
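
Wired back into the original COPY, a sketch might look like the following (paths and settings are the ones from the question; FORMAT json makes DuckDB write newline-delimited JSON, and the error message's hint about lowering threads still applies if memory stays tight):

SET threads=10;                       -- the error message suggests lowering this if memory is still tight
SET memory_limit='10GB';
SET preserve_insertion_order=false;

COPY (
    SELECT html, max(dateModified) AS dateModified
    FROM read_ndjson('E:\Personal Projects\tmp\tarFiles\result2\*wiktionary*.ndjson')
    GROUP BY ALL
) TO 'C:\Users\Akira\Documents\out_duckdb2.ndjson' (FORMAT json);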

1 Comment

Can I also select other variables? I mean the other variables of the JSON object with the most recent "dateModified" in each group.
