
I'm trying to find all JSON objects in my JSONL file that contain the same identifier value.

So if my data looks like:

{
   "data": {
      "value": 42,
      "url": "url.com",
      "details": {
         "timestamp": "07:32:29",
         "identifier": "123ABC"
      }
   },
   "message": "string"
}

I want to find every object that shares an identifier value with another object. The file is too large to load all at once, so instead I read it line by line and store just the identifier values. This has the drawback of missing the first object with a given identifier (i.e., if objects A, B, and C all share an identifier, I only end up with B and C saved). To recover those first occurrences, I read through the file a second time and pick up only the first time each duplicate identifier is found. This is where I run into problems.

This part works as intended:

import json_lines

identifiers = set()
duplicates = []

# First pass: remember every identifier seen so far; any object whose
# identifier has already been seen is a duplicate (except the very first
# occurrence, which has already gone by at this point).
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)

# The identifier values that actually occur more than once.
dup_IDs = {dup["data"]["details"]["identifier"] for dup in duplicates}

But when I read through the file a second time:

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
            # Stop as soon as the first occurrence of every duplicate
            # identifier has been picked up.
            if not dup_IDs:
                break

It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) this is because there's a problem with my code rather than with my computer, since the code is easier to fix.

  • I recommend using a database, which will clear out duplicates as the data is inserted. Commented Oct 10, 2019 at 18:34
  • can you test my code? Commented Oct 10, 2019 at 19:10
  • When working with large amounts of JSON data, the suggestions to use a database are good ones. I can also recommend looking into Spark; it handles this problem very elegantly and does the threading/large-data caching/optimisation for you. Commented Oct 10, 2019 at 21:04
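
For illustration, a minimal PySpark sketch of that Spark suggestion, assuming the file and field names from the question (the session name and output path are my own):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("find-duplicate-identifiers").getOrCreate()

# spark.read.json reads gzipped JSON Lines directly, one object per line.
df = spark.read.json("file.jsonlines.gz")

# Count how often each identifier occurs and keep only the duplicated ones.
counts = (df.select(F.col("data.details.identifier").alias("identifier"))
            .groupBy("identifier")
            .count()
            .filter(F.col("count") > 1))

# Join back to recover every object carrying a duplicated identifier.
dupes = df.join(counts, F.col("data.details.identifier") == counts["identifier"], "inner")
dupes.drop("identifier", "count").write.json("duplicates_out")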

2 Answers


If the file is too large, I'd suggest uploading the data into a SQL database and using SQL queries to filter out what you need.
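
For example, a rough sketch of that approach using Python's built-in gzip, json, and sqlite3 modules (the database file, table, and column names are my own assumptions). gzip.open streams the compressed file, so it never has to be fully unzipped on disk:

import gzip
import json
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (identifier TEXT, raw TEXT)")

# Stream the gzipped JSONL file line by line and insert each object.
with gzip.open("file.jsonlines.gz", "rt") as f:
    for line in f:
        obj = json.loads(line)
        conn.execute(
            "INSERT INTO records (identifier, raw) VALUES (?, ?)",
            (obj["data"]["details"]["identifier"], line),
        )
conn.commit()

# Every object whose identifier appears more than once.
rows = conn.execute("""
    SELECT raw FROM records
    WHERE identifier IN (
        SELECT identifier FROM records
        GROUP BY identifier
        HAVING COUNT(*) > 1
    )
""").fetchall()

duplicates = [json.loads(raw) for (raw,) in rows]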


1 Comment

Is there a way to upload a jsonl.gz file to my database without unzipping it first? The file is too big to unzip on my local computer.
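
This keeps only line indices in memory: the first pass records the line number where each identifier first appears and marks it once a duplicate of that identifier shows up, and the second pass then collects just those first occurrences.
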
import json_lines

duplicates = []
nb = {}      # identifier -> line index of its first occurrence
i = 0        # current line index

# First pass: remember where each identifier first appears. The index is
# stored as a str initially and converted to an int the first time the
# identifier shows up again, marking that identifier as duplicated.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            if isinstance(nb[ID], str):   # not yet marked as a duplicate
                nb[ID] = int(nb[ID])
        else:
            nb[ID] = str(i)
        i += 1

i = 0
# Line indices of the first occurrence of every duplicated identifier.
k = set(nb[ID] for ID in nb if isinstance(nb[ID], int))
del nb

# Second pass: pick up only those first occurrences.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1

print(duplicates)

