I need to convert a pandas DataFrame to JSONL format. I couldn't find a good package for this, so I tried to implement it myself, but the result looks ugly and inefficient.

For example, given a pandas df:

        label      pattern
  0      DRUG      aspirin
  1      DRUG    trazodone
  2      DRUG   citalopram

I need to convert it to a text file of the form:

{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"trazodone"}]}
{"label":"DRUG","pattern":[{"lower":"citalopram"}]}

I tried to_dict('records'), but the result is missing the [ ] and the nested 'lower' key.

df.to_dict('records')

creates:

[{'label': 'DRUG', 'pattern': 'aspirin'},
 {'label': 'DRUG', 'pattern': 'trazodone'},
 {'label': 'DRUG', 'pattern': 'citalopram'}]

I thought about converting the 'pattern' column so that it includes the nested 'lower' key. Is that the right approach?

UPD

So far, I have managed to convert 'pattern' into a list:

df_new = pd.concat((df[['label']], df[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_new.columns = ['label', 'pattern']
df_new.head()

The result:

    label   pattern
0   DRUG    [aspirin]
1   DRUG    [trazodone]
2   DRUG    [citalopram]

and then:

df_new.to_dict(orient='records')

[{'label': 'DRUG', 'pattern': ['aspirin']},
 {'label': 'DRUG', 'pattern': ['trazodone']},
 {'label': 'DRUG', 'pattern': ['citalopram']}]

UPD 2

Eventually, I managed to get what I want, but in a very un-Pythonic way.

df_1 = pd.DataFrame(df[['pattern']].apply(lambda x: {'lower': x.iloc[0]}, axis=1))
df_1.columns = ['pattern']

df_fin = pd.concat((df[['label']], df_1[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_fin.columns = ['label', 'pattern']
df_fin.to_json(orient='records')

 '{'label': 'DRUG', 'pattern': [{'lower': 'aspirin'}]}
  {'label': 'DRUG', 'pattern': [{'lower': 'trazodone'}]}
  {'label': 'DRUG', 'pattern': [{'lower': 'citalopram'}]}'

Any chance you can show a neat solution?

  • Pandas DataFrame.to_json with orient='records' may be what you're looking for. pandas.pydata.org/pandas-docs/stable/generated/… Commented Aug 9, 2018 at 20:24
  • @MichaelB, thanks, I tried, but it does not create '[ ]' after "pattern". Basically, 'pattern' values should be a list. Commented Aug 9, 2018 at 20:32
  • Have you tried df.to_json(orient='table')? Commented Aug 9, 2018 at 20:47
  • yes, just tried. Not even close :/ Commented Aug 9, 2018 at 20:55

3 Answers

Since pandas 0.19.0, DataFrame.to_json has a lines parameter that, combined with orient='records', writes out JSONL format.

Given that, a more succinct version of your solution might look like this:

import pandas as pd

data = [{'label': 'DRUG', 'pattern': 'aspirin'},
        {'label': 'DRUG', 'pattern': 'trazodone'},
        {'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)

# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})

# Output in JSONL format
print(df.to_json(orient='records', lines=True))

Output:

{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}
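
Note that the output above wraps each pattern in a dictionary but not in the list that the question asks for. A minimal sketch of the same approach with the list added, assuming the same two-column frame as in the question:

```python
import json
import pandas as pd

df = pd.DataFrame({
    "label": ["DRUG", "DRUG", "DRUG"],
    "pattern": ["aspirin", "trazodone", "citalopram"],
})

# Wrap each pattern in a one-element list holding a {"lower": ...} dict
df["pattern"] = df["pattern"].apply(lambda x: [{"lower": x}])

# One JSON object per line
jsonl = df.to_json(orient="records", lines=True)
print(jsonl)
```

Each printed line should then have the form {"label":"DRUG","pattern":[{"lower":"aspirin"}]}.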
2 Comments

Thank you very much for this solution. I was grappling in the dark with the jsonlines and json libraries while this is a lot cleaner.
Since I've accidentally bumped into this answer a few times over the years and not had any luck: this does not work as expected with complex/nested data. It won't error, but instead will output struct field values without their keys, which is not very helpful if you were expecting something more like what happens when you do df.write.json(...) in spark
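
If to_json does not give the nesting you expect, one alternative that may help is to serialize each row with the standard json module. This is only a sketch, and it assumes the cells hold plain Python dicts, lists, and other values that json.dumps can serialize. It is not a drop-in replacement for what Spark does.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "label": ["DRUG", "DRUG"],
    "pattern": [[{"lower": "aspirin"}], [{"lower": "trazodone"}]],
})

# One json.dumps call per row; nested dicts and lists keep their keys
lines = [json.dumps(record, ensure_ascii=False)
         for record in df.to_dict(orient="records")]

# Join into JSONL text with a trailing newline
jsonl = "\n".join(lines) + "\n"
print(jsonl, end="")
```

Unlike to_json, json.dumps puts a space after each colon and comma by default, so the lines look slightly different but parse to the same objects.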
Very short code that should be easy to copy and paste.

output_path = "/data/meow/my_output.jsonl"

with open(output_path, "w") as f:
    f.write(df_result.to_json(orient='records', lines=True, force_ascii=False))

If you are using a Jupyter notebook, use with open(output_path, "w") as f rather than f = open(output_path, "w"), so that the file is closed properly and is ready to read in the next cell.
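
As a possible shortcut, to_json also accepts a file path as its first argument, in which case pandas opens and closes the file itself. Here is a small sketch of that, with a read-back via pd.read_json(lines=True) to check the result. The temporary directory and the one-row df_result are stand-ins made up for the example.

```python
import os
import tempfile
import pandas as pd

# Stand-in data for the example
df_result = pd.DataFrame({"label": ["DRUG"], "pattern": ["aspirin"]})

# Hypothetical output location inside a temporary directory
output_path = os.path.join(tempfile.mkdtemp(), "my_output.jsonl")

# Passing a path lets pandas open, write, and close the file
df_result.to_json(output_path, orient="records", lines=True, force_ascii=False)

# Read the JSONL file back into a DataFrame
df_back = pd.read_json(output_path, lines=True)
print(df_back)
```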

To write to a file, I modified the last line of @kmsquire's answer:

import pandas as pd

data = [{'label': 'DRUG', 'pattern': 'aspirin'},
        {'label': 'DRUG', 'pattern': 'trazodone'},
        {'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)

# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})

# Output in JSONL format into a file; the with block closes the file afterwards
with open('records.jsonl', 'w') as f:
    print(df.to_json(orient='records', lines=True), file=f)

1 Comment

Do not forget to close your file handler f with f.close()
