Pandas merged cell issue when reading from excel

Question

I have an .xls file with some odd styling. There's nothing I can do about that, so I just need to parse it.

As you can see I have some merged cells. What I want to do is fill the empty values for the merged cells ("ffill") but also keep the empty cells like they are.

Something like this

EIM, C,NI1 Enescu_Ioan | EIM, S,NI11,Enescu_Ioan
EIM, C,NI1 Enescu_Ioan | Empty
EIM, C,NI1 Enescu_Ioan | EIM, S,NI11,Enescu_Ioan
EIM, C,NI1,Enescu_Ioan | Empty

The way I'm loading the file right now is this:

xl = pd.ExcelFile("data/file.xls")
df = xl.parse(0, header=None)

I've also tried to open the file like this and access the merged cells but I get an empty list.

book = xlrd.open_workbook("data/file.xls")
book.sheet_by_index(0).merged_cells # This is empty []

Is there any way I could achieve this? Thanks!

EDIT

There might be some confusion regarding the question, so I'll try to explain better. The attached image is a subset of a larger file where the columns may appear in a different order. What I'm trying to achieve is a way of differentiating between NaN values that come from merged cells (in a merged cell only the first cell has a value, the rest are all NaN) and NaN values that come from genuinely empty cells.

Did you try pandas read_excel? pandas.pydata.org/pandas-docs/stable/reference/api/…
– Mahendra Singh, commented Mar 11, 2020 at 8:36

Radulescu Petru · Accepted Answer · 2020-03-12 08:54:27Z

6

Managed to find a fix

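import pandas as pd
import xlrd

def read_excel(path):
    if path.endswith('.xlsx'):
        # Note: xlrd 2.0+ no longer reads .xlsx, so this branch needs xlrd < 2.0
        excel = pd.ExcelFile(xlrd.open_workbook(path), engine='xlrd')
    elif path.endswith('.xls'):
        # formatting_info=True makes xlrd load formatting, including merged cells
        excel = pd.ExcelFile(xlrd.open_workbook(path, formatting_info=True), engine='xlrd')
    else:
        raise ValueError("Could not read this type of data")
    return excel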

def parse_excel(excel_file):
    sheet_0 = excel_file.book.sheet_by_index(0)
    df = excel_file.parse(0, header=None)
    return sheet_0, df

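def fill_merged_na(sheet, dataframe):
    # Each entry is (row_lo, row_hi, col_lo, col_hi); the hi bounds are exclusive
    for rl, rh, cl, ch in sheet.merged_cells:
        # The value of a merged range is stored in its top-left cell
        base_value = sheet.cell_value(rl, cl)
        dataframe.iloc[rl:rh, cl:ch] = base_value
    return dataframe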

The important bits are opening the Excel file with formatting_info=True, so that xlrd also reads formatting such as merged cells, and the fill_merged_na function, which fills only the NaN values that belong to merged cells and leaves the genuinely empty cells as they were.
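
To see what the fill step does without an .xls file at hand, here is a minimal sketch on made-up data. It assumes the merged ranges are given as (row_lo, row_hi, col_lo, col_hi) tuples with exclusive upper bounds, the same layout xlrd uses for merged_cells. The helper name fill_merged_ranges, the sample frame, and the range list are all hypothetical.

```python
import numpy as np
import pandas as pd

def fill_merged_ranges(dataframe, merged_ranges):
    """Copy each range's top-left value over the whole range; other NaN stay NaN."""
    filled = dataframe.copy()
    for rlo, rhi, clo, chi in merged_ranges:
        # Upper bounds are exclusive, so they map directly onto iloc slices
        filled.iloc[rlo:rhi, clo:chi] = filled.iat[rlo, clo]
    return filled

# Made-up data: column 0 is one cell merged over four rows, column 1 has real gaps
df = pd.DataFrame({0: ["A", np.nan, np.nan, np.nan],
                   1: ["B", np.nan, "B", np.nan]})
merged_ranges = [(0, 4, 0, 1)]  # hypothetical range list in xlrd's layout

result = fill_merged_ranges(df, merged_ranges)
print(result)
```

Only the cells covered by a listed range are filled, so the gaps in column 1 survive as NaN.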

2 Comments

secsilm Over a year ago

xlrd doesn't support xlsx now. This will give the error: XLRDError: Excel xlsx file; not supported.

KCharlie Over a year ago

Has anyone found a way to easily do this with openpyxl? Right now I'm doing it manually using Worksheet.merged_cells to determine which cells are merged, and applying the value to each of them.

ManojK · Accepted Answer · 2020-03-11 08:55:20Z

1

This is what the DataFrame looks like if you run df = pd.read_excel('path'):

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

Now, fill the value from the first entry of the merged cell:

df['Col1'] = df['Col1'].ffill()

And use pandas.DataFrame.replace to replace the NaN values with blanks:

df['Col2'] = df['Col2'].replace(np.nan,'')

An example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1' : ['EIM, C,NI1 Enescu_Ioan',np.nan,np.nan,np.nan],
                   'Col2' : ['EIM, S,NI11,Enescu_Ioan',np.nan,'EIM, S,NI11,Enescu_Ioan',np.nan]})

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

df['Col1'] = df['Col1'].ffill()
df['Col2'] = df['Col2'].replace(np.nan,'')
print(df)
                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1  EIM, C,NI1 Enescu_Ioan                         
2  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
3  EIM, C,NI1 Enescu_Ioan
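
Note that this relies on knowing in advance which column holds the merged cell. As a small check on the same sample data, forward-filling every column also fills the cells in Col2 that were genuinely empty, because after loading both kinds of cell are just NaN. The names filled_everywhere, empty_before and empty_after are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['EIM, C,NI1 Enescu_Ioan', np.nan, np.nan, np.nan],
    'Col2': ['EIM, S,NI11,Enescu_Ioan', np.nan, 'EIM, S,NI11,Enescu_Ioan', np.nan],
})

# Forward-fill the whole frame instead of a hand-picked column
filled_everywhere = df.ffill()

empty_before = int(df['Col2'].isna().sum())
empty_after = int(filled_everywhere['Col2'].isna().sum())
print(empty_before, empty_after)  # prints: 2 0
```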

2 Comments

Radulescu Petru Over a year ago

Yeah, that works in this case but the columns can appear in any order and this was just a small subset of the file. What I need is a method to distinguish between merged cells nan and empty cells nan.

ManojK Over a year ago

This matched your expected output. Please edit your question to include the complete problem and expected output. Also, as far as I understand, pandas returns NaN when reading any sort of empty cell.

LaserJesus · Accepted Answer · 2023-11-28 06:59:05Z

1

I updated @radulescu-petru's solution to work with openpyxl:

import pandas as pd
import openpyxl

def read_excel(path):
    # Load the workbook with openpyxl so the merged-cell ranges are available
    return openpyxl.load_workbook(path)

def parse_excel(excel_file, path, sheet_name):
    sheet = excel_file[sheet_name]
    # header=None keeps the first sheet row as data instead of column names
    df = pd.read_excel(path, sheet_name=sheet_name, header=None)
    return sheet, df

def fill_merged_na(sheet, dataframe):
    for merged_cell_range in sheet.merged_cells:
        # Get the 1-based, inclusive bounds of the merged range
        min_col, min_row, max_col, max_row = openpyxl.utils.cell.range_boundaries(str(merged_cell_range))
        # Get the value from the first (top-left) cell
        base_value = dataframe.iat[min_row-1, min_col-1]
        dataframe.iloc[min_row-1:max_row, min_col-1:max_col] = base_value
    return dataframe

path = 'some_file.xlsx'
excel_file = read_excel(path)
sheet, df = parse_excel(excel_file, path, 'sheet A')
df_filled = fill_merged_na(sheet, df)
df_filled.head()
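
For a version that can be run without an existing file, here is a rough sketch that builds a small workbook in memory and fills the merged range. Two things in it are my own assumptions rather than part of the approach above: the DataFrame is built from the worksheet values (instead of pd.read_excel) so that row and column positions line up with the sheet by construction, and the helper name sheet_to_filled_frame and the sample cells are made up. With a real file, openpyxl.load_workbook(path)[sheet_name] would take the place of ws.

```python
import openpyxl
import pandas as pd

def sheet_to_filled_frame(sheet):
    """Build a DataFrame from a worksheet and fill merged ranges from their top-left cell."""
    # Cells covered by a merge (other than the top-left one) come back as None
    frame = pd.DataFrame(list(sheet.values))
    for merged_range in sheet.merged_cells.ranges:
        # bounds are 1-based and inclusive: (min_col, min_row, max_col, max_row)
        min_col, min_row, max_col, max_row = merged_range.bounds
        base_value = sheet.cell(row=min_row, column=min_col).value
        frame.iloc[min_row - 1:max_row, min_col - 1:max_col] = base_value
    return frame

# Hypothetical sample sheet: A1:A4 is merged, B2 and B4 are genuinely empty
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "EIM, C,NI1 Enescu_Ioan"
ws.merge_cells("A1:A4")
ws["B1"] = "EIM, S,NI11,Enescu_Ioan"
ws["B3"] = "EIM, S,NI11,Enescu_Ioan"

filled = sheet_to_filled_frame(ws)
print(filled)
```

The first column ends up filled on all four rows, while the two empty cells in the second column stay missing.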

1 Comment

nenetto Over a year ago

I think the var df is not defined in the function fill_merged_na
