Python conversion from JSON to JSONL

Question

I want to convert a standard JSON document into a format where each line contains a separate, self-contained, valid JSON object. See JSON Lines.

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end. This leaves a valid JSON object on each line, rather than one nested object spanning several lines.

I wonder if there is a more elegant solution? I suspect something could go wrong using string manipulation on the file.

The motivation is to read JSON files into an RDD on Spark. See the related question: Reading JSON with Apache Spark - `corrupt_record`

That's not valid JSON input, nor valid JSON output. You are handling Python objects here, not JSON serialisation. Even if your output was valid JSON, it would not be valid JSONL because you have trailing commas.
– Martijn Pieters, Commented Aug 12, 2016 at 10:01
Also, if the objects in the output were valid JSON, there would be no trailing commas.
– user824425, Commented Aug 12, 2016 at 10:03

Martijn Pieters · Accepted Answer · 2016-08-12 10:08:11Z

104

Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
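Putting both steps together, a minimal sketch of the whole conversion could look like the following. The function name json_array_to_jsonl and the file paths are made up for illustration, and it assumes the input file holds a single top-level JSON array that fits in memory:

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Convert a JSON file holding one top-level list into a JSONL file.

    Hypothetical helper: assumes the whole input fits in memory.
    Returns the number of records written.
    """
    with open(src_path, encoding="utf-8") as infile:
        entries = json.load(infile)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst_path, "w", encoding="utf-8") as outfile:
        for entry in entries:
            # json.dumps escapes newlines inside strings, so every
            # record is guaranteed to occupy exactly one line.
            outfile.write(json.dumps(entry) + "\n")
    return len(entries)
```

Calling json_array_to_jsonl('input.json', 'output.jsonl') would then write one object per line; reading the result back with json.loads() on each line is a cheap way to check it.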

edited Aug 12, 2016 at 10:08

answered Aug 12, 2016 at 10:03

Martijn Pieters


wouter bolsterlee · Answer · 2024-12-04 16:07:11Z

74

The jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(Yes, I wrote it years after you posted your original question.)

edited Dec 4, 2024 at 16:07

answered Sep 18, 2018 at 20:09

wouter bolsterlee


2 Comments

Jose R. Zapata Over a year ago

items is a list

John Slegers Over a year ago

Kudos for taking the effort to maintain & document your project better than many (if not most) corporations out there!

Jonas Ferreira · Answer · 2023-04-29 21:56:07Z

12

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

sudo apt-get install jq

CentOS and RHEL users should run:

sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install jq -y

Basic usage:

jq -c '.[]' some_json.json >> output.jsonl

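If you need to handle huge files, I strongly recommend the --stream flag, which makes jq parse the JSON content in streaming mode. In that mode jq emits path/value events rather than whole values, so the array elements have to be reassembled with fromstream and truncate_stream:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl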

But if you need to do this from Python, you can use bigjson, a useful library that parses the JSON in streaming mode:

pip3 install bigjson

To read a huge JSON file (in my case, it was 40 GB):

import json

import bigjson

# Reads JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterates over input json
        for data in json_data:
            # Converts json to a Python dict
            dict_data = data.to_python()

            # Saves the output to output file
            outfile.write(json.dumps(dict_data)+"\n")

If you like, try parallelizing this code to improve performance, and post the result here :)

Documentation and source code: bigjson
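If installing bigjson is not an option, the same streaming idea can be approximated with the standard library alone. The sketch below is not how bigjson works internally; it swaps in json.JSONDecoder.raw_decode(), which parses one value from the front of a string and reports where it stopped. The names iter_array_items and chunk_size are hypothetical, and the sketch assumes a top-level array whose elements are objects or arrays (a bare number could be cut in half at a chunk boundary and misread):

```python
import json


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: `fp` is a text-mode file object. Only the
    element currently being parsed is held in memory.
    """
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        # Skip whitespace and the comma separating two elements
        # (lenient: repeated commas are not rejected).
        buf = buf.lstrip().lstrip(",").lstrip()
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # The element is incomplete: read more and retry.
            more = fp.read(chunk_size)
            if not more:
                raise  # truncated or malformed input
            buf += more
            continue
        yield item
        buf = buf[end:]


def stream_to_jsonl(src_path, dst_path):
    """Write each array element of src_path as one line of dst_path."""
    with open(src_path, encoding="utf-8") as infile, \
            open(dst_path, "w", encoding="utf-8") as outfile:
        for item in iter_array_items(infile):
            outfile.write(json.dumps(item) + "\n")
```

Retrying the parse after every chunk is wasteful for very large individual elements, so for serious workloads a dedicated streaming parser is probably the better choice.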

edited Apr 29, 2023 at 21:56

Peter Mortensen


answered Mar 19, 2021 at 14:08

Jonas Ferreira


2 Comments

TheTechRobo Over a year ago

As of now this answer only works with Debian and derivatives. Are there installation instructions for other operating systems?

Jonas Ferreira Over a year ago

Yes, but they are quite long, so follow this link to install on RHEL/CentOS: cyberithub.com/…

Laya · Answer · 2023-04-29 22:01:46Z

1

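Note that JSONL files are usually written compactly. The format only requires each record to be valid JSON on a single line, but you can pass separators without spaces to drop the optional whitespace: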

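import json

# Assumes `rs` is an iterable of records and DateTimeEncoder is a
# user-defined json.JSONEncoder subclass (neither is defined here).
with open(output_file_jsonl, 'a', encoding='utf8') as json_file: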
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('\n')
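The snippet above does not show DateTimeEncoder. A minimal sketch of what such an encoder might look like follows; the class body and the helper to_jsonl_line are assumptions made for illustration, not the answerer's actual code:

```python
import json
from datetime import date, datetime


class DateTimeEncoder(json.JSONEncoder):
    """Hypothetical encoder: writes date/datetime values as ISO 8601 strings."""

    def default(self, o):
        if isinstance(o, (date, datetime)):
            return o.isoformat()
        # Anything else falls through to the default behaviour (TypeError).
        return super().default(o)


def to_jsonl_line(record):
    """Serialise one record compactly, without spaces after ',' and ':'."""
    return json.dumps(dict(record), separators=(",", ":"), cls=DateTimeEncoder)
```

With the default separators, json.dumps() puts a space after each comma and colon; both forms are valid JSONL, the compact one is just smaller.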

edited Apr 29, 2023 at 22:01

Peter Mortensen


answered Nov 4, 2022 at 13:14

Laya


Yana · Answer · 2023-01-11 09:20:29Z

0

This is a variation of the json.dump answer above that takes into account the possibility of special symbols or a different alphabet in the JSONL file. For example, I use Cyrillic, and without setting the encoding and ensure_ascii parameters I get really ugly results (every non-ASCII character is written as a \uXXXX escape). I thought it could be useful:

import json

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
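To make the difference concrete, here is a small sketch comparing the two settings on a made-up record (the dictionary contents are only an example):

```python
import json

entry = {"name": "Яна"}

# Default: non-ASCII characters are replaced by \uXXXX escapes.
escaped = json.dumps(entry)

# ensure_ascii=False: characters are written as-is, so the file must
# be opened with an encoding that can represent them (e.g. UTF-8).
readable = json.dumps(entry, ensure_ascii=False)

# Both strings are valid JSON and decode to the same object.
assert json.loads(escaped) == json.loads(readable) == entry
```

The escaped form is not wrong, only harder to read and larger on disk.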

answered Jan 11, 2023 at 9:20

Yana


questionto42 · Answer · 2024-01-29 18:03:14Z

You might also get this done with a RegEx search and replace, for example in VSCode: switch on RegEx mode, search for \n, and replace all matches with nothing. Then save the changed JSON file as "myfile.jsonl".

That works if the file holds a single JSON object (which should be the default). If it holds a list of objects, as in your example, use a negative lookahead "(?!...)" instead (see the Stack Overflow question "Find 'word' not followed by a certain character"): search for \n(?!\s*\{) so that newlines followed by optional whitespace and an opening brace are kept, and each object ends up on its own line.

Then clean up the remaining unneeded characters as you described yourself, but with RegEx. The same replacements could also be automated in Python with the re package instead of being done by hand in VSCode:

Replace ^\[ with nothing to get rid of the first bracket "[".
Replace \]$ with nothing to get rid of the last bracket "]".
Replace \,$ with nothing to get rid of the trailing commas ",".
Replace ^\s* with nothing to get rid of spaces at the beginning of a line.

Out:

{u'index': 1,  u'no': 'A',  u'met': u'1043205'}
{u'index': 2,  u'no': 'B',  u'met': u'000031043206'}
{u'index': 3,  u'no': 'C',  u'met': u'0031043207'}
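The replacements listed above could be sketched with the re package roughly as follows. The function name is hypothetical, and the sketch assumes the exact layout from the question (every object starts on a new line with its opening brace, nested continuation lines do not). It uses [ \t]+ instead of \s* for the leading whitespace so that the pattern cannot swallow line breaks:

```python
import re


def flatten_pretty_printed_list(text):
    """Apply the search-and-replace steps above to a pretty-printed list.

    Hypothetical helper; fragile by design, since it depends on the
    layout of the input rather than on its structure.
    """
    # Drop every newline that is not followed by whitespace and "{".
    text = re.sub(r"\n(?!\s*\{)", "", text)
    # Remove the opening bracket at the start and the closing one at the end.
    text = re.sub(r"^\[", "", text)
    text = re.sub(r"\]$", "", text)
    # Remove the comma that ends each line.
    text = re.sub(r",$", "", text, flags=re.MULTILINE)
    # Remove indentation at the start of each line.
    text = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)
    return text
```

This only reshapes text and never checks that the lines are valid JSON, so parsing with the json module remains the safer route whenever the input really is JSON.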

Konchog · Answer · 2023-04-29 21:58:31Z

-2

If you don't want a library, it's easy enough to do using JSON directly.

Source

[
    {"index": 1,"no": "A","met": "1043205"},
    {"index": 2,"no": "B","met": "000031043206"},
    {"index": 3,"no": "C","met": "0031043207"}
]

Code

import json

with open("test.json", 'r') as infile:
    data = json.load(infile)
    if data:
        # Header row: the keys of the first record
        print(json.dumps(list(data[0])))
        for record in data:
            # One row of values per record, in key order
            print(json.dumps(list(record.values())))

Result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]
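A consumer of this header-plus-rows layout has to pair the values with the keys again. A minimal sketch of that step, assuming the first line is the header and every later line has the same number of values (the function name rows_to_records is made up):

```python
import json


def rows_to_records(lines):
    """Rebuild dictionaries from a header line followed by value rows."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return []
    header, *values = rows
    # Pair each value with the key in the same position of the header.
    return [dict(zip(header, row)) for row in values]
```

Passing the four result lines above to rows_to_records() gives back the three original dictionaries.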

edited Apr 29, 2023 at 21:58

Peter Mortensen


answered Apr 20, 2022 at 13:05

Konchog


1 Comment

Banty Over a year ago

Your result is a valid JSONL in traditional CSV format, but the question clearly wants an output in key, value pairs e.g. {"index": 1,"no": "A","met": "1043205"}.
