62

I want to convert a standard JSON document into a file in which each line contains a separate, self-contained, valid JSON object. See JSON Lines.

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, which leaves a valid JSON object on each line rather than one enclosing list.

Is there a more elegant solution? I suspect that string manipulation on the file could go wrong.

The motivation is to read JSON files into an RDD on Spark. See related question - Reading JSON with Apache Spark - `corrupt_record`

2
  • That's not valid JSON input, nor valid JSON output. You are handling Python objects here, not JSON serialisation. Even if your output was valid JSON, it would not be valid JSONL because you have trailing commas. Commented Aug 12, 2016 at 10:01
  • Also, if the objects in the output would be valid JSON, there would be no trailing commas. Commented Aug 12, 2016 at 10:03

7 Answers

104

Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}
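As a small check of the claim that the default configuration embeds no newlines, the sketch below serialises an entry whose string value contains a line break. The `entry` value is made up for illustration; the point is that `json.dumps()` escapes the newline, so each record still occupies exactly one line, whereas passing `indent` would break the one-record-per-line layout.

```python
import json

# Hypothetical entry with a newline inside a string value.
entry = {"index": 1, "note": "first line\nsecond line"}

# Default settings: the newline is escaped as the two characters \ and n.
line = json.dumps(entry)
print(line)

# Pretty-printing spreads one record over several lines, which is not JSONL.
pretty = json.dumps(entry, indent=2)
print("\n" in line, "\n" in pretty)  # False True
```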

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
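A minimal sketch of that two-step approach, assuming the source file holds a single top-level JSON array and fits in memory. The function name `json_array_to_jsonl` and the file paths are placeholders chosen for this example, not part of any library.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Read a JSON file holding one top-level array; write one entry per line."""
    with open(src_path, encoding="utf-8") as infile:
        entries = json.load(infile)  # parses the whole document into memory

    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst_path, "w", encoding="utf-8") as outfile:
        for entry in entries:
            json.dump(entry, outfile)  # default settings: no embedded newlines
            outfile.write("\n")

    return len(entries)  # number of lines written
```

Calling `json_array_to_jsonl("input.json", "output.jsonl")` would then produce one JSON object per line in `output.jsonl`.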


74

The jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(Yes, I wrote it years after you posted your original question.)

2 Comments

items is a list
Kudos for taking the effort to maintain & document your project better than many (if not most) corporations out there!
12

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

sudo apt-get install jq

CentOS and RHEL users should run:

sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install jq -y

Basic usage:

jq -c '.[]' some_json.json >> output.jsonl

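If you need to handle huge files, I strongly recommend using the --stream flag, which makes jq parse your JSON content in streaming mode. In that mode jq emits path/value events rather than whole values, so the array elements have to be reassembled with fromstream:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl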

If you need to do this from a Python script instead, you can use bigjson, a useful library that parses the JSON in streaming mode:

pip3 install bigjson

To read a huge JSON file (in my case, it was 40 GB):

import json

import bigjson

# Reads JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterates over input json
        for data in json_data:
            # Converts json to a Python dict
            dict_data = data.to_python()

            # Saves the output to output file
            outfile.write(json.dumps(dict_data)+"\n")

If you want, try parallelizing this code to improve performance. Post the result here :)

Documentation and source code: bigjson

2 Comments

As of now this answer only works with Debian and derivatives. Are there other possible installation instructions for other operating systems?
Yes, but they are quite long, so follow this link to install on RHEL/CentOS: cyberithub.com/…
1

Note that JSONL files are usually written in compact form, with each record on a single line. To drop the spaces after the separators as well, pass separators without spaces:

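import json

# rs, output_file_jsonl and DateTimeEncoder (a custom json.JSONEncoder subclass) are defined elsewhere
with open(output_file_jsonl, 'a', encoding='utf8') as json_file:
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('\n')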
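The snippet above relies on a DateTimeEncoder class that is not shown. The sketch below is one plausible shape for it, not the answer author's actual class: it assumes the encoder only needs to turn date and datetime values into ISO 8601 strings. The helper `to_compact_line` and the sample `row` are likewise made up for illustration.

```python
import json
from datetime import date, datetime


class DateTimeEncoder(json.JSONEncoder):
    """Hypothetical encoder: serialise date/datetime values as ISO 8601 strings."""

    def default(self, o):
        if isinstance(o, (date, datetime)):
            return o.isoformat()
        # Anything else falls back to the standard behaviour (raises TypeError).
        return super().default(o)


def to_compact_line(record):
    """Serialise one record without spaces after ',' and ':'."""
    return json.dumps(record, separators=(",", ":"), cls=DateTimeEncoder)


row = {"index": 1, "seen": datetime(2016, 8, 12, 10, 1)}
print(to_compact_line(row))  # {"index":1,"seen":"2016-08-12T10:01:00"}
```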


0

This is a variation on the json.dump() answer above that takes into account special symbols or a different alphabet in the JSONL file. For example, I use Cyrillic, and without setting the encoding and ensure_ascii parameters I get really ugly results. I thought it could be useful:

import json

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
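To make the difference concrete, here is a short comparison of the two settings on a made-up Cyrillic entry. Both outputs are valid JSON and decode to the same data; only the readability of the file differs.

```python
import json

# Illustrative entry; any non-ASCII text behaves the same way.
entry = {"city": "Москва"}

escaped = json.dumps(entry)                       # default: ensure_ascii=True
readable = json.dumps(entry, ensure_ascii=False)  # keeps the characters as-is

print(escaped)   # {"city": "\u041c\u043e\u0441\u043a\u0432\u0430"}
print(readable)  # {"city": "Москва"}

# Both forms round-trip to the same Python object.
same = json.loads(escaped) == json.loads(readable) == entry
```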


0

You might also get this done with a RegEx search and replace, for example in VSCode: switch on RegEx mode, search for \n, and replace all matches with nothing. Then save the changed JSON file as "myfile.jsonl".

That works if you have only one JSON object in your file (which should be the default). If you have a list of JSON objects, as in your example, you can still use a RegEx search and replace, this time with a negative lookahead "(?!...)" (see the Stack Overflow question "Find 'word' not followed by a certain character"). Searching for \n(?!\s*\{) matches only the line breaks that are not followed by optional whitespace and an opening brace, so the line breaks between objects are kept. Replace all matches with nothing.

Clean up the remaining unneeded characters as you described yourself, but use RegEx for it. You could also run these replacements automatically in Python with the re package instead of doing them by hand in VSCode.

  • Replace ^\[ with nothing to get rid of the first bracket "[".
  • Replace \]$ with nothing to get rid of the last bracket "]".
  • Replace \,$ with nothing to get rid of the trailing commas ",".
  • Replace ^\s* with nothing to get rid of spaces at the beginning of a line.

Out:

{u'index': 1,  u'no': 'A',  u'met': u'1043205'}
{u'index': 2,  u'no': 'B',  u'met': u'000031043206'}
{u'index': 3,  u'no': 'C',  u'met': u'0031043207'}
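The replacement steps above could be scripted with the re package roughly as follows. This is a sketch under assumptions: the input is a valid JSON array laid out like the question's example (one object per group of lines, no line-ending brackets or commas inside string values), and the name `regex_json_to_jsonl` is invented here. Parsing with the json module is more robust; this only mirrors the manual search-and-replace procedure.

```python
import json
import re


def regex_json_to_jsonl(text):
    """Apply the search-and-replace steps to a pretty-printed JSON array."""
    text = re.sub(r"\n(?!\s*\{)", "", text)       # join lines unless the next one starts an object
    text = re.sub(r"^\[", "", text, flags=re.M)   # drop the opening bracket
    text = re.sub(r"\]$", "", text, flags=re.M)   # drop the closing bracket
    text = re.sub(r",$", "", text, flags=re.M)    # drop trailing commas
    text = re.sub(r"^\s*", "", text, flags=re.M)  # drop leading whitespace
    return text


SAMPLE = (
    '[{"index": 1,\n  "no": "A",\n  "met": "1043205"},\n'
    ' {"index": 2,\n  "no": "B",\n  "met": "000031043206"}]\n'
)

jsonl_text = regex_json_to_jsonl(SAMPLE)
print(jsonl_text)

# Every resulting line should parse on its own.
records = [json.loads(line) for line in jsonl_text.splitlines()]
```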


-2

If you don't want a library, it's easy enough to do using JSON directly.

Source

[
    {"index": 1,"no": "A","met": "1043205"},
    {"index": 2,"no": "B","met": "000031043206"},
    {"index": 3,"no": "C","met": "0031043207"}
]

Code

import json

with open("test.json", 'r') as infile:
    data = json.load(infile)
    if len(data) > 0:
        print(json.dumps([t for t in data[0]]))
        for record in data:
            print(json.dumps([v for (k,v) in record.items()]))

Result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]

1 Comment

Your result is a valid JSONL in traditional CSV format, but the question clearly wants an output in key, value pairs e.g. {"index": 1,"no": "A","met": "1043205"}.
