I need to convert XLS files to CSV in order to load the data they contain into a PostgreSQL database. I used the following code to do the conversion:

import xlrd
import unicodecsv

def xls2csv (xls_filename, csv_filename):
    # Converts an Excel file to a CSV file.
    # If the excel file has multiple worksheets, only the first worksheet is converted.
    # Uses unicodecsv, so it will handle Unicode characters.
    # Uses a recent version of xlrd, so it should handle old .xls and new .xlsx equally well.

    wb = xlrd.open_workbook(xls_filename)
    sh = wb.sheet_by_index(0)

    fh = open(csv_filename,"wb")
    csv_out = unicodecsv.writer(fh, encoding='utf-8')

    for row_number in xrange (sh.nrows):
        csv_out.writerow(sh.row_values(row_number))

    fh.close()

The XLS files I'm using contain 212 columns and at least 100 rows. When I test the code with just 4 rows it works fine, but when nrows > 5 the interpreter raises the following errors:

xls2csv ('e:/t.xls', 'e:/wh.csv')
WARNING *** file size (353829) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'
Traceback (most recent call last):

  File "<ipython-input-14-ccae93f2d633>", line 1, in <module>
    xls2csv ('e:/t.xls', 'e:/wh.csv')

  File "C:/Users/hey/.spyder/temp.py", line 10, in xls2csv
    wb = xlrd.open_workbook(xls_filename)

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
    bk.get_sheets()

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\book.py", line 678, in get_sheets
    self.get_sheet(sheetno)

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\book.py", line 669, in get_sheet
    sh.read(self)

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\sheet.py", line 804, in read
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)

  File "C:\Users\hey\Anaconda2\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 2: ordinal not in range(128)

2 Answers

There is a decoding issue when you open the xls file; I suspect the 5th row of the file contains a special character. Based on the xlrd documentation, you can use encoding_override="cp1251" to translate it to Unicode:

wb = xlrd.open_workbook(xls_filename, encoding_override="cp1251")
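For context on why the override helps: the byte 0xb2 that the ascii codec rejects in your traceback is a perfectly valid character in cp1251. A minimal sketch of the difference (only the 0xb2 byte is taken from your traceback; the surrounding bytes are made up for illustration):

```python
raw = b"AB\xb2"  # 0xb2 is the byte from the UnicodeDecodeError above

# Decoding as ASCII fails, exactly as in the traceback
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print(e)

# Decoding as cp1251 succeeds: 0xb2 maps to a Cyrillic letter
text = raw.decode("cp1251")
print(text)
```

If cp1251 is not the right codepage for your data, the same encoding_override parameter accepts any codec name Python knows (e.g. "latin-1", "cp1252").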

2 Comments

Do you have any idea how to set the separator of the generated CSV to ;?
Simply try: csv_out = unicodecsv.writer(fh, delimiter=';', encoding='utf-8')
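The delimiter option in the comment above is the same one the standard-library csv module exposes, so you can sanity-check the output format without the unicodecsv dependency (the row values here are illustrative):

```python
import csv
import io

# Write one row into an in-memory buffer using ';' as the separator
buf = io.StringIO()
writer = csv.writer(buf, delimiter=';')
writer.writerow(["a", "b", "c"])

print(buf.getvalue())  # a;b;c
```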
It looks like the error isn't caused by the number of rows, but by a problem handling Unicode characters in your source file.

I'd recommend trying Pandas:

import pandas as pd

df = pd.read_excel('input.xls')
df.to_csv('output.csv', encoding='utf-8')

Note that (while you don't expand on the Postgres part) if this is a first step to getting your data into Postgres, once your data is loaded into a Pandas dataframe, you can send it straight to Postgres.
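As a hedged sketch of that last step (the table name and connection are made up for illustration): pandas.DataFrame.to_sql writes a dataframe to a database; an in-memory SQLite database stands in for Postgres below, and in practice you would pass a SQLAlchemy engine created with something like create_engine('postgresql://...') instead.

```python
import sqlite3
import pandas as pd

# Stand-in for the frame you'd get from pd.read_excel('input.xls')
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# SQLite stands in for Postgres here; swap the connection for a
# SQLAlchemy Postgres engine when loading into your real database.
conn = sqlite3.connect(":memory:")
df.to_sql("my_table", conn, index=False, if_exists="replace")

count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print(count)  # 2
```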

2 Comments

I've already tested it, but it didn't work. Here are the errors:
It requires installing pandas and xlrd as top-level packages.
