Good afternoon!
I have a .csv file like this (when opened with Notepad):

"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
"2,"" Proin a tortor leo. Morbi dictum laoreet nulla sit amet luctus. Donec euismod egestas velit, eget consequat ex porttitor vitae. Sed venenatis ornare enim sed rutrum. Aenean congue purus vitae congue rutrum. Ut ex felis, viverra imperdiet est vel, hendrerit luctus ligula.
"""
"2,"" estibulum consequat lorem enim, ut semper erat fringilla id.
"""
"2,"" Praesent a lobortis justo. Cras in sapien enim.
"""
...

I use this to get data from a file:

train = pd.read_csv('yelp_review_polarity_csv/train.csv', 
                    header=None, 
                    names=['Class', 'Review'],
                    encoding="cp1251",
                    sep=",")

Here is what I get: the second column is filled with "Null" values. I need it to look something like this:

Class     Review
2         Lorem ipsum dolor sit amet...

In other words, the data should be split into two columns on the "," delimiter. How can I fix this?
Note: I am using encoding cp1251 so that there are no problems with some characters from another language.

  • The quotes in "2," escape the comma, so it's not considered a column separator. You could change the quote character, but then the quotes would be part of the value. The fundamental problem is that it isn't a CSV file, so some other type of parsing is needed. You may be able to hack it by simply removing the quotes (likely read line by line, remove " and save to a temp file) and then reading into pandas. Commented Jul 8, 2022 at 19:18
  • @tdelaney The matter is that I can have the text in some lines. So I can't just remove the quotes. If I change the second quotes with a comma, will it work? Commented Jul 8, 2022 at 19:22
  • I'm not sure what you mean by the second quotes. The lines with just """? I think you'd want to remove them completely. If you have a line 2, Lorem ipsum dolor sit amet, consectetur adipiscing elit., that's a 2 column CSV (with some left side padding on column 2) and I got there just by removing quotes. Commented Jul 8, 2022 at 19:27
  • @tdelaney I mean there are lines like this: "2,"" blablablabla \n blablablabla \n blablablabla \n """ P.S. "\n" just to understand that there is a line break. Commented Jul 8, 2022 at 19:29
  • Oh, those extra quotes are on the end of the first line, not on their own lines? This isn't any standard quoting scheme that I am aware of, and it's the quotes that are messing things up, so from the bit of data I've seen, just get rid of them. It could be problematic if there are embedded commas in the second column, though. Commented Jul 8, 2022 at 19:32
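
To see why the second column comes back empty, here is a minimal sketch that feeds two records in the question's format to Python's standard csv module. It uses the same default quoting rules that pd.read_csv() uses (quote character " with a doubled "" as an escaped quote). The sample string is typed in by hand for illustration, it is not read from the real train.csv.

```python
import csv
import io

# Two hand-typed records in the question's format (illustrative only).
sample = (
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n'
    '"""\n'
    '"2,"" Praesent a lobortis justo. Cras in sapien enim.\n'
    '"""\n'
)

# The opening quote starts a quoted field, so the comma after 2 is data,
# each "" is an escaped quote, and the field only closes on the last quote
# of the """ line.
records = list(csv.reader(io.StringIO(sample)))

for record in records:
    print(len(record), repr(record[0]))
```

Each record comes out as a single field, so a parser that was told to expect two columns has nothing to put in the second one.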

1 Answer

You can iterate over the lines and try parts = s.split(',""', 1) to split the input into 2 values and strip the bogus "" from the Review column value.

Assuming every line in your "CSV" file has the same format, you can parse the file like this:

import pandas as pd

val1 = []
val2 = []
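# cp1251 matches the encoding used in the question
with open("yelp_review_polarity_csv/train.csv", encoding="cp1251") as fin:
    for s in fin:
        s = s.strip()
        if not s or s == '"""':
            # skip blank lines and lines that contain only """
            continue
        if s.startswith('"'):
            # drop the leading quote: "2 becomes 2
            s = s[1:]
        # split once on ,"" into the class and the review text
        parts = s.split(',""', 1)
        val1.append(parts[0])
        val2.append(parts[1])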

# construct a data frame from the 2 lists
df = pd.DataFrame({'Class': val1, 'Review': val2})
print(df)

Output:

  Class                                             Review
0     2   Lorem ipsum dolor sit amet, consectetur adipi...
1     2   Proin a tortor leo. Morbi dictum laoreet null...
2     2   estibulum consequat lorem enim, ut semper era...
3     2    Praesent a lobortis justo. Cras in sapien enim.

If the format varies, you will need to tweak the code accordingly.
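
One variation raised in the comments is a review that spans several lines, which a line-by-line loop cannot handle. A possible tweak, sketched below, relies on an assumption about the file: each record looks like a normal CSV row that was quoted a second time as a whole. If that holds, the csv module can decode every record twice, first the outer quoting and then the inner row. The function name decode_twice and the sample text are made up for illustration.

```python
import csv
import io


def decode_twice(text):
    """Parse records that are assumed to be CSV rows wrapped in extra quotes."""
    rows = []
    # First pass: each record is one quoted field, possibly spanning lines.
    for record in csv.reader(io.StringIO(text, newline="")):
        if not record:
            continue  # blank line
        # Second pass: the unwrapped field is itself a CSV row.
        inner = next(csv.reader(io.StringIO(record[0], newline="")))
        if len(inner) != 2:
            raise ValueError(f"expected 2 fields, got {len(inner)}: {inner!r}")
        rows.append([inner[0], inner[1].strip()])
    return rows


# Hand-typed sample; the second record has a review on two lines.
sample = (
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n'
    '"""\n'
    '"2,"" first line\n'
    'second line\n'
    '"""\n'
)
rows = decode_twice(sample)
```

The resulting list can be passed to pd.DataFrame(rows, columns=['Class', 'Review']). If the real file is not consistently double-quoted, this approach will not apply as written.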

Alternatively, you could change the format of the text file from
old: "2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit. """
new: 2,"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
then pd.read_csv() will parse the input file correctly.
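
A minimal sketch of that conversion, assuming each review sits on a single line as in the sample. It lets csv.writer do the quoting, so every row is written as "2","text". The function name convert_lines and the sample lines are hypothetical.

```python
import csv
import io


def convert_lines(lines, out):
    """Rewrite '"2,"" text' style lines as '"2","text"' rows on `out`."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        s = line.strip()
        if s.endswith('"""'):
            s = s[:-3]  # drop the trailing """ marker
        if not s:
            continue  # blank line, or a line that held only """
        if s.startswith('"'):
            s = s[1:]  # drop the stray leading quote
        if ',""' not in s:
            raise ValueError(f"unexpected line: {line!r}")
        label, review = s.split(',""', 1)
        writer.writerow([label, review.strip()])


# Hand-typed sample in the question's format (illustrative only).
sample = [
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n',
    '"""\n',
    '"2,"" Praesent a lobortis justo. Cras in sapien enim.\n',
    '"""\n',
]
buffer = io.StringIO()
convert_lines(sample, buffer)
converted = buffer.getvalue()

# Reading the converted text back now yields two columns per row.
rows = list(csv.reader(io.StringIO(converted)))
```

For the real file, the same function can take the input file object as lines and an output file opened with newline="" as out. The rewritten file should then load with the pd.read_csv() call from the question (header=None, names=['Class', 'Review']).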

2 Comments

Thanks, this basically solves the problem, but I realized that I have the wrong file and I'll just fix it, and convert it to this format: "2","text" just by iterating over all the lines.
@maxet24 Change it to the "2","text" format (no space after the comma, or pass skipinitialspace=True), then pd.read_csv() will work as expected.
