I have an .xls file with an unusual layout that I can't change, so I just need to parse it as it is.

[Image: screenshot of the spreadsheet, showing the merged cells]

As you can see, some cells are merged. I want to fill the empty values that come from merged cells (like "ffill") while keeping genuinely empty cells as they are.

Something like this:

EIM, C,NI1 Enescu_Ioan | EIM, S,NI11,Enescu_Ioan
EIM, C,NI1 Enescu_Ioan | Empty
EIM, C,NI1 Enescu_Ioan | EIM, S,NI11,Enescu_Ioan
EIM, C,NI1,Enescu_Ioan | Empty

This is how I'm loading the file right now:

import pandas as pd

xl = pd.ExcelFile("data/file.xls")
df = xl.parse(0, header=None)

I've also tried opening the file with xlrd to access the merged cells, but I get an empty list:

import xlrd
book = xlrd.open_workbook("data/file.xls")
book.sheet_by_index(0).merged_cells # This is empty []

Is there any way I could achieve this? Thanks!

EDIT

There might be some confusion about the question, so I'll try to explain it better. The attached image is a subset of a larger file where the columns may appear in a different order. What I'm trying to achieve is a way to tell apart the NaN values that come from merged cells (in a merged range only the first cell has a value, the rest are all NaN) from the NaN values of genuinely empty cells.

3 Answers

I managed to find a fix:

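import pandas as pd
import xlrd

def read_excel(path):
    # Note: xlrd 2.0 and later only reads .xls, so the .xlsx branch needs an older xlrd.
    if path.endswith('xlsx'):
        excel = pd.ExcelFile(xlrd.open_workbook(path), engine='xlrd')
    elif path.endswith('xls'):
        # formatting_info=True makes xlrd load formatting, including merged cells
        excel = pd.ExcelFile(xlrd.open_workbook(path, formatting_info=True), engine='xlrd')
    else:
        raise ValueError("Could not read this type of data")
    return excel

def parse_excel(excel_file):
    # header=None keeps DataFrame positions aligned with the sheet's row/column indices
    sheet_0 = excel_file.book.sheet_by_index(0)
    df = excel_file.parse(0, header=None)
    return sheet_0, df

def fill_merged_na(sheet, dataframe):
    # Each entry is (row_lo, row_hi, col_lo, col_hi) with exclusive upper bounds
    for rl, rh, cl, ch in sheet.merged_cells:
        base_value = sheet.cell_value(rl, cl)
        dataframe.iloc[rl:rh, cl:ch] = base_value
    return dataframe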

The important parts are opening the workbook with formatting_info=True, so that xlrd also reads formatting such as merged cells, and the fill_merged_na function, which fills only the NaN values that belong to merged cells and leaves genuinely empty cells as they were.
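To see the effect without an actual .xls file, here is a minimal sketch of the same idea on a toy DataFrame. The frame contents, the merged_cells list and the fill_merged helper are made up for illustration; only the (row_lo, row_hi, col_lo, col_hi) layout with exclusive upper bounds follows what xlrd reports in sheet.merged_cells.

```python
import pandas as pd

# Toy frame standing in for xl.parse(0, header=None): missing values mark
# both the covered part of a merged range and genuinely empty cells.
df = pd.DataFrame(
    [
        ["group A", "x"],
        [None, None],
        [None, "y"],
        [None, None],
    ]
)

# Hypothetical merged ranges in xlrd's (row_lo, row_hi, col_lo, col_hi) form,
# with exclusive upper bounds: rows 0-3 of column 0 form one merged cell.
merged_cells = [(0, 4, 0, 1)]

def fill_merged(frame, ranges):
    """Copy each range's top-left value over the whole range."""
    out = frame.copy()
    for row_lo, row_hi, col_lo, col_hi in ranges:
        out.iloc[row_lo:row_hi, col_lo:col_hi] = out.iat[row_lo, col_lo]
    return out

filled = fill_merged(df, merged_cells)
print(filled)
```

Column 0 ends up fully filled because it is covered by a merged range, while the missing values in column 1 stay missing because no range touches them. A plain forward fill could not make that distinction.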

2 Comments

xlrd no longer supports xlsx. This will raise the error: XLRDError: Excel xlsx file; not supported.
Has anyone found a way to easily do this with openpyxl? Right now I'm doing it manually using Worksheet.merged_cells to determine which cells are merged, and applying the value to each of them.

This is what the DataFrame looks like if you read it with df = pd.read_excel('path'):

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

Now forward-fill the value from the first entry of the merged cell (Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions):

df['Col1'] = df['Col1'].ffill()

Then use pandas.DataFrame.replace to turn the remaining NaN values into empty strings:

df['Col2'] = df['Col2'].replace(np.nan,'')

An example:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Col1' : ['EIM, C,NI1 Enescu_Ioan',np.nan,np.nan,np.nan], 
                   'Col2' : ['EIM, S,NI11,Enescu_Ioan',np.nan,'EIM, S,NI11,Enescu_Ioan',np.nan]})

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

df['Col1'] = df['Col1'].ffill()
df['Col2'] = df['Col2'].replace(np.nan,'')
print(df)
                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1  EIM, C,NI1 Enescu_Ioan                         
2  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
3  EIM, C,NI1 Enescu_Ioan                         

2 Comments

Yeah, that works in this case but the columns can appear in any order and this was just a small subset of the file. What I need is a method to distinguish between merged cells nan and empty cells nan.
This matches your expected output. Please edit your question to include the complete problem and expected output. Also, as far as I understand, pandas returns NaN for any kind of empty cell when reading.

I updated @radulescu-petru's solution to work with openpyxl:

import pandas as pd
import openpyxl

def read_excel(path):
    return openpyxl.load_workbook(path)

def parse_excel(excel_file, path, sheet_name):
    sheet = excel_file[sheet_name]
    # header=None keeps DataFrame positions aligned with the sheet's rows and columns
    df = pd.read_excel(path, sheet_name=sheet_name, header=None)
    return sheet, df

def fill_merged_na(sheet, dataframe):
    for merged_cell_range in sheet.merged_cells.ranges:
        # Get the range boundaries (1-based, inclusive)
        min_col, min_row, max_col, max_row = openpyxl.utils.cell.range_boundaries(str(merged_cell_range))
        # Get the value from the first (top-left) cell
        base_value = dataframe.iat[min_row-1, min_col-1]
        dataframe.iloc[min_row-1:max_row, min_col-1:max_col] = base_value
    return dataframe

path = 'some_file.xlsx'
excel_file = read_excel(path)
sheet, df = parse_excel(excel_file, path, 'sheet A')
df_filled = fill_merged_na(sheet, df)
df_filled.head()
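For anyone who wants to try the openpyxl approach without a file on disk, the following is a small sketch that builds a workbook in memory, merges a range, and fills a plain list-of-lists grid from the merged ranges. The cell values and the A1:A3 range are invented for the example, and it assumes a reasonably recent openpyxl where ws.merged_cells.ranges yields range objects with a bounds property.

```python
from openpyxl import Workbook

# Build a tiny in-memory sheet: A1:A3 is merged, B2 is genuinely empty.
wb = Workbook()
ws = wb.active
ws["A1"] = "group A"
ws["B1"] = "x"
ws["B3"] = "y"
ws.merge_cells("A1:A3")

# Read the raw values; the covered part of a merged range and
# genuinely empty cells both come back as None.
grid = [list(row) for row in ws.iter_rows(values_only=True)]

# bounds is (min_col, min_row, max_col, max_row), 1-based and inclusive.
for merged_range in ws.merged_cells.ranges:
    min_col, min_row, max_col, max_row = merged_range.bounds
    base_value = grid[min_row - 1][min_col - 1]
    for r in range(min_row - 1, max_row):
        for c in range(min_col - 1, max_col):
            grid[r][c] = base_value

print(grid)
```

Only the cells inside the merged range receive the top-left value; B2 stays None because it is not part of any merged range. The grid can be passed to pd.DataFrame(grid) if a DataFrame is needed afterwards.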

1 Comment

I think the var df is not defined in the function fill_merged_na
