0

I want to use python to read an Excel file and transform it into a different structure (Example).

  • On the left side of the red marked area there are about 15 more columns
  • The red marked area continues for 5 more years to the right

My current approach is following

  1. Read the black area (it always has a fixed number of columns, so know where to start and to end)
  2. Read the remaining red area
  3. Merge the data

Python dataframe example to generate the red marked area above:

df = pd.DataFrame({'col1':  {0: 'Year'  , 1: 'Option'   , 2: 'Category' , 3: 'Type' , 4: 'Country'  , 5: 'Australia', 6: 'New Zealand'},
                   'col2':  {0: '2024'  , 1: 'S'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '-1,0'     , 6: '-2,0'},
                   'col3':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '-100,5'   , 6: '-200,5'},
                   'col4':  {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '-1000'    , 6: '-2000'},
                   'col5':  {0: ''      , 1: 'T'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '1,0'      , 6: '2,0'},
                   'col6':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '100,5'    , 6: '200,5'},
                   'col7':  {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '1000'     , 6: '2000'},
                   'col8':  {0: '2025'  , 1: 'S'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '-3,0'     , 6: '-4,0'},
                   'col9':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '-300,5'   , 6: '-400,5'},
                   'col10': {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '3000'     , 6: '-4000'},
                   'col11': {0: ''      , 1: 'T'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '3,0'      , 6: '4,0'},
                   'col12': {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '300,5'    , 6: '400,5'},
                    'col13': {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'     , 5: '3000'     , 6: '4000'},
                    })

I am struggling to read the data and set the multi index columns of the dataframe, as the country column does not fit into the hierarchy.

Because I have to use df = pd.read_excel(...usecols='T:Z', header=None... I am reading the data and the header separately and then add the headers using df.columns = pd.MultiIndex.from_arrays(...).

The result (for 2024) looks like this

2024
S T
A B C A B C
Country
0 Australia -1,0 -100,5 -1000 1,0 100,5 1000
1 New Zealand -2,0 -200,5 -2000 2,0 200,5 2000

Here I am stuck, I tried using .stack and .melt to achieve the target structure, however was not able to achieve it.

1 Answer 1

1

Regarding the red area, you can use the following code to obtain the desired format:

table = df.T # Assuming "df" is the dataframe you gave    
# Reset and eliminate the index
table.reset_index(inplace=True)
del table['index']

# Reset the dataframe to make the columns the first row
table.columns = table.iloc[0]
table = table[1:]

# Remove irrelevant columns
del table['Country']
del table['Category']

# Blank entries must be populated with the previous entry
def ReplaceBlankEntriesWithPrevious(col):
    out = col.copy()
    for i in range(1, len(col)+1):
        out[i] = col[i] if col[i] != "" else out[i-1]
    return out

table['Year'] = ReplaceBlankEntriesWithPrevious(table['Year'])
table['Option'] = ReplaceBlankEntriesWithPrevious(table['Option'])

# Set two of the three indexes
table.set_index(['Year', 'Option'], inplace=True)

# Swap the "Type" and "Country" positions
table = table.pivot(columns='Type')
table = table.stack(level=0)
table.rename_axis(index={0:"Country"}, inplace=True)

# Optionally, you can swap the index order to put it in the order of 
# Country, Year, Option
table = table.swaplevel(i=0, j=2)
table = table.swaplevel(i=1, j=2)

Note that I started by transposing the table, since both "Year" and "Option" were columns. Additionally, I had to fill empty cells with their implied values. This solution isn't the most elegant, but it gets the job done. I would welcome a more succinct solution.

Sign up to request clarification or add additional context in comments.

1 Comment

Amazing! this helped to solve it!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.