Reading multiple Excel files with Dask

Question

Can someone help me understand how to read multiple Excel files in Dask? In pandas, I would use glob and do this:

import glob
import pandas as pd
files = glob.glob('Working Files/*.xlsx')
df = pd.concat([pd.read_excel(i, skiprows=2) for i in files], ignore_index=True)

I need help doing the same in Dask.

Thanks,

Jac

SultanOrazbayev · Accepted Answer · 2021-06-20 14:04:50Z


The easiest solution is to wrap your function with dask.delayed:

import glob

import dask
import pandas as pd

files = glob.glob('Working Files/*.xlsx')

# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=2) for i in files]

# the line below launches the actual computation; unpacking with * makes
# dask.compute return one DataFrame per file rather than a nested list
results = dask.compute(*delayeds)

# once the computation is over, results is a tuple of pandas DataFrames
df = pd.concat(results, ignore_index=True)
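The same pattern can be packaged as a small helper. This is only a sketch: `read_all` and its `reader` parameter are hypothetical names introduced here, and `reader` defaults to `pd.read_excel` so that a different loader can be swapped in when trying the code without Excel files. One detail worth noting is that `dask.compute` returns a tuple with one entry per positional argument, so the list of delayed objects is unpacked with `*`.

```python
import dask
import pandas as pd


def read_all(paths, reader=pd.read_excel, **reader_kwargs):
    """Read every path in parallel and return one pandas DataFrame.

    `read_all` and `reader` are hypothetical names used for this sketch.
    """
    # Wrap only the reader function; the calls are recorded, not executed yet.
    tasks = [dask.delayed(reader)(path, **reader_kwargs) for path in paths]
    # Unpacking matters: compute(*tasks) returns one result per task, while
    # compute(tasks) returns a one-element tuple that holds a list.
    frames = dask.compute(*tasks)
    return pd.concat(frames, ignore_index=True)
```

With the files from the question, this could be called as `read_all(glob.glob('Working Files/*.xlsx'), skiprows=2)`.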


2 Comments

Jac Over a year ago

Thanks a lot for your quick response. One question: by calling pd.concat, I will be moving from Dask back to pandas, right? I'm trying to switch to Dask because I read that it is faster, so is this method still the optimal way to do it? Can I concatenate in Dask itself, and won't that be faster? PS: pardon me if it's a dumb thought :P, I'm fairly new to this.

SultanOrazbayev Over a year ago

Depending on your use case, data, etc., it might be faster to keep the data in pandas...

user17743486 · Accepted Answer · 2022-09-07 17:50:11Z


Following the approach above, I had some issues with pd.concat, so I changed it to build a Dask DataFrame with dd.from_delayed instead of concatenating. Hope it works!

import glob

import dask
import dask.dataframe as dd
import pandas as pd

files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")

# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]

# build a lazy Dask DataFrame with one partition per file, here instead of pd.concat
ddf = dd.from_delayed(delayeds)

# writing launches the actual computation; Dask writes one CSV per partition
# and replaces * with the partition number
ddf.to_csv(r"D:\XX\XX\XX\XX\XXX\*.csv")  # Please be aware of the dtypes in your Excel files.

