
I want to look at entities and relationships using Wikidata, so I downloaded the Wikidata JSON dump (the .bz2 file, ~18 GB).

However, I cannot open the file; it's just too big for my computer.

Is there a way to look into the file without extracting the full .bz2 archive, ideally using Python? I know there is a PHP dump reader, but I can't use it.

3 Answers


I came up with a strategy that lets you use the json module to access information without decompressing the whole file:

import bz2
import json

records = []
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if i == 10:
            break
        # The Wikidata dump is one huge JSON array with one entity per
        # line: skip the bracket lines and strip the trailing comma
        # before parsing each entity.
        line = line.strip()
        if line in ("[", "]"):
            continue
        records.append(json.loads(line.rstrip(",")))

In this way records will be a list of dictionaries that you can easily manipulate and, for example, shrink by removing keys you don't need.
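For instance, a minimal sketch of that trimming step (the fields kept here are just an illustrative choice):

# Keep only a few top-level fields of each entity (illustrative pick):
wanted = ("id", "labels", "claims")
records = [{k: r[k] for k in wanted if k in r} for r in records]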

Note also that the condition i == 10 can be changed arbitrarily to fit your needs. For example, you could parse a few lines at a time, analyze them, and write to a txt file the indices of the lines you actually want from the original file. It is then sufficient to read only those lines, using a similar condition on i in the for loop; a sketch of that two-pass idea follows.
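A minimal sketch of the two-pass approach, where select() and the wanted.txt file name are placeholders for whatever filtering you need:

import bz2
import json

# Pass 1: scan once and record the indices of interesting lines.
with bz2.open(filename, "rt") as bzinput, open("wanted.txt", "w") as out:
    for i, line in enumerate(bzinput):
        line = line.strip()
        if line in ("[", "]"):
            continue
        record = json.loads(line.rstrip(","))
        if select(record):          # your own predicate (placeholder)
            out.write(f"{i}\n")

# Pass 2: re-read the dump, keeping only the recorded lines.
with open("wanted.txt") as f:
    wanted = {int(n) for n in f}

records = []
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if i in wanted:
            records.append(json.loads(line.strip().rstrip(",")))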


You can use the BZ2File interface to manipulate the compressed file, but you can NOT simply json.load() the whole thing; it will take too much memory. You will have to index the file: read it line by line, save the position and length of each interesting object in a dictionary (hash table), and then you can seek to a given object and load it with the json module.
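A minimal sketch of that indexing idea, assuming a local dump file named latest-all.json.bz2 and keying the index by line number (keying by entity ID instead would be a straightforward change):

import bz2
import json

path = "latest-all.json.bz2"

# Pass 1: record the decompressed byte offset and length of every line.
index = {}
with bz2.BZ2File(path) as f:
    pos = 0
    for n, line in enumerate(f):
        index[n] = (pos, len(line))
        pos += len(line)

# Later: extract a single object without holding everything in memory.
# BZ2File.seek() still decompresses up to the target offset internally,
# so this trades CPU/IO time for a tiny memory footprint.
with bz2.BZ2File(path) as f:
    offset, length = index[42]      # e.g. the 43rd line
    f.seek(offset)
    raw = f.read(length)
    entity = json.loads(raw.strip().rstrip(b","))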

2 Comments

Thanks, that really helped.
Will json.load do string interning to save memory on identical key strings? E.g., I'd like to load wikidata-20231213-lexemes.json.bz2, but even json.load(bz2.open("wikidata-20231213-lexemes.json.bz2", "rt")) takes too long. The bz2 file is 300 MB; uncompressed on disk it is 5 GB.

You'd have to do line-by-line processing:

import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        # Skip the enclosing array brackets and strip the trailing
        # comma the dump places after every entity line.
        line = line.strip()
        if line in (b"[", b"]"):
            continue
        entity = json.loads(line.rstrip(b","))
        # do your processing here
        print(str(entity)[:50] + "...")

Seeing as the Wikidata dump is now 70 GB+, you might wish to process it directly from the URL:

import bz2
import json
from urllib.request import urlopen

url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(url) as stream:
    # BZ2File accepts an existing file object, so the dump can be
    # decompressed on the fly without saving the archive to disk.
    with bz2.BZ2File(stream) as file:
        ...  # same line-by-line processing as above
