
I want to look at entities and relationships using Wikidata, so I downloaded the Wikidata JSON dump (the .bz2 file, ~18 GB).

However, I cannot open the file; it's just too big for my computer.

Is there a way to look into the file without extracting the full .bz2 archive, ideally using Python? I know there is a PHP dump reader, but I can't use it.

3 Answers


I came up with a strategy that lets you use the json module to access information without decompressing the whole file:

import bz2
import json

records = []
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if i == 10:
            break
        # The Wikidata dump is one huge JSON array with one entity per
        # line: skip the bracket lines and strip the trailing comma
        # before parsing each entity.
        line = line.strip()
        if line in ("[", "]"):
            continue
        records.append(json.loads(line.rstrip(",")))

In this way records will be a list of dictionaries that you can easily manipulate and, for example, shrink by removing keys you don't need.
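For instance, a minimal sketch of that trimming step (the fields kept here are just an illustrative choice):

# Keep only a few top-level fields of each entity (illustrative pick):
wanted = ("id", "labels", "claims")
records = [{k: r[k] for k in wanted if k in r} for r in records]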

Note also that the condition i == 10 can be changed arbitrarily to fit your needs. For example, you could parse a few lines at a time, analyze them, and write to a txt file the indices of the lines you actually want from the original file. It is then sufficient to read only those lines, using a similar condition on i in the for loop; a sketch of that two-pass idea follows.
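A minimal sketch of the two-pass approach, where select() and the wanted.txt file name are placeholders for whatever filtering you need:

import bz2
import json

# Pass 1: scan once and record the indices of interesting lines.
with bz2.open(filename, "rt") as bzinput, open("wanted.txt", "w") as out:
    for i, line in enumerate(bzinput):
        line = line.strip()
        if line in ("[", "]"):
            continue
        record = json.loads(line.rstrip(","))
        if select(record):          # your own predicate (placeholder)
            out.write(f"{i}\n")

# Pass 2: re-read the dump, keeping only the recorded lines.
with open("wanted.txt") as f:
    wanted = {int(n) for n in f}

records = []
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if i in wanted:
            records.append(json.loads(line.strip().rstrip(",")))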


You can use the BZ2File interface to manipulate the compressed file, but you can NOT simply json.load() the whole thing; it will take too much memory. You will have to index the file: read it line by line, save the position and length of each interesting object in a dictionary (hash table), and then you can seek to a given object and load it with the json module.
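A minimal sketch of that indexing idea, assuming a local dump file named latest-all.json.bz2 and keying the index by line number (keying by entity ID instead would be a straightforward change):

import bz2
import json

path = "latest-all.json.bz2"

# Pass 1: record the decompressed byte offset and length of every line.
index = {}
with bz2.BZ2File(path) as f:
    pos = 0
    for n, line in enumerate(f):
        index[n] = (pos, len(line))
        pos += len(line)

# Later: extract a single object without holding everything in memory.
# BZ2File.seek() still decompresses up to the target offset internally,
# so this trades CPU/IO time for a tiny memory footprint.
with bz2.BZ2File(path) as f:
    offset, length = index[42]      # e.g. the 43rd line
    f.seek(offset)
    raw = f.read(length)
    entity = json.loads(raw.strip().rstrip(b","))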

2 Comments

Thanks, that really helped.
Will json.load do string interning to save memory on identical key strings? E.g., I'd like to load wikidata-20231213-lexemes.json.bz2, but even json.load(bz2.open("wikidata-20231213-lexemes.json.bz2", "rt")) takes too long. The bz2 file is 300 MB; uncompressed on disk it is 5 GB.

You'd have to do line-by-line processing:

import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        # Skip the enclosing array brackets and strip the trailing
        # comma the dump places after every entity line.
        line = line.strip()
        if line in (b"[", b"]"):
            continue
        entity = json.loads(line.rstrip(b","))
        # do your processing here
        print(str(entity)[:50] + "...")

Seeing as the Wikidata dump is now 70 GB+, you might wish to process it directly from the URL:

import bz2
import json
from urllib.request import urlopen

url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(url) as stream:
    # BZ2File accepts an existing file object, so the dump can be
    # decompressed on the fly without saving the archive to disk.
    with bz2.BZ2File(stream) as file:
        ...  # same line-by-line processing as above
