
I'm trying to find all JSON objects in my JSONL file that contain the same identifier value.

So if my data looks like:

{
   "data": {
      "value": 42,
      "url": "url.com",
      "details": {
         "timestamp": "07:32:29",
         "identifier": "123ABC"
      }
   },
   "message": "string"
}

I want to find every object that shares an identifier value with another object. The file is too large to load all at once, so instead I read it line by line and store just the identifier values. This has the drawback of missing the first object with a given identifier (i.e., if objects A, B, and C all share an identifier, I only end up with B and C saved). To recover those first occurrences, I read through the file a second time and pick up only the first time each duplicate identifier is found. This is where I run into problems.

This part works as intended:

import json_lines

identifiers = set()
duplicates = []

# First pass: remember every identifier seen so far; any object whose
# identifier has already been seen is a duplicate (except the very first
# occurrence, which has already gone by at this point).
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)

# The identifier values that actually occur more than once.
dup_IDs = {dup["data"]["details"]["identifier"] for dup in duplicates}

But when I read through the file a second time:

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
            # Stop as soon as the first occurrence of every duplicate
            # identifier has been picked up.
            if not dup_IDs:
                break

It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) this is because there's a problem with my code rather than with my computer, since the code is easier to fix.

  • I recommend using a database, which will clear out duplicates as the data is inserted. Commented Oct 10, 2019 at 18:34
  • can you test my code? Commented Oct 10, 2019 at 19:10
  • When working with large amounts of JSON data, the suggestions to use a database are good ones. I can also recommend looking into Spark; it handles this problem very elegantly and does the threading/large-data caching/optimisation for you. Commented Oct 10, 2019 at 21:04
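
For illustration, a minimal PySpark sketch of that Spark suggestion, assuming the file and field names from the question (the session name and output path are my own):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("find-duplicate-identifiers").getOrCreate()

# spark.read.json reads gzipped JSON Lines directly, one object per line.
df = spark.read.json("file.jsonlines.gz")

# Count how often each identifier occurs and keep only the duplicated ones.
counts = (df.select(F.col("data.details.identifier").alias("identifier"))
            .groupBy("identifier")
            .count()
            .filter(F.col("count") > 1))

# Join back to recover every object carrying a duplicated identifier.
dupes = df.join(counts, F.col("data.details.identifier") == counts["identifier"], "inner")
dupes.drop("identifier", "count").write.json("duplicates_out")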

2 Answers


If the file is too large, I'd suggest uploading the data into a SQL database and using SQL queries to filter out what you need.
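
For example, a rough sketch of that approach using Python's built-in gzip, json, and sqlite3 modules (the database file, table, and column names are my own assumptions). gzip.open streams the compressed file, so it never has to be fully unzipped on disk:

import gzip
import json
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (identifier TEXT, raw TEXT)")

# Stream the gzipped JSONL file line by line and insert each object.
with gzip.open("file.jsonlines.gz", "rt") as f:
    for line in f:
        obj = json.loads(line)
        conn.execute(
            "INSERT INTO records (identifier, raw) VALUES (?, ?)",
            (obj["data"]["details"]["identifier"], line),
        )
conn.commit()

# Every object whose identifier appears more than once.
rows = conn.execute("""
    SELECT raw FROM records
    WHERE identifier IN (
        SELECT identifier FROM records
        GROUP BY identifier
        HAVING COUNT(*) > 1
    )
""").fetchall()

duplicates = [json.loads(raw) for (raw,) in rows]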


1 Comment

Is there a way to upload a jsonl.gz file to my database without unzipping it first? The file is too big to unzip on my local computer.
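
This keeps only line indices in memory: the first pass records the line number where each identifier first appears and marks it once a duplicate of that identifier shows up, and the second pass then collects just those first occurrences.
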
import json_lines

duplicates = []
nb = {}      # identifier -> line index of its first occurrence
i = 0        # current line index

# First pass: remember where each identifier first appears. The index is
# stored as a str initially and converted to an int the first time the
# identifier shows up again, marking that identifier as duplicated.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            if isinstance(nb[ID], str):   # not yet marked as a duplicate
                nb[ID] = int(nb[ID])
        else:
            nb[ID] = str(i)
        i += 1

i = 0
# Line indices of the first occurrence of every duplicated identifier.
k = set(nb[ID] for ID in nb if isinstance(nb[ID], int))
del nb

# Second pass: pick up only those first occurrences.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1

print(duplicates)

