
Multiprocessing a for loop in Python

I have a program that currently takes a very long time to run because it processes a large number of files. I was hoping to run the program across all 12 processors on my computer at once to decrease the run time. I've been trying to get it to work for a while now, but something seems to go wrong whenever I run it. My program looks something like this before I tried to introduce multiprocessing:

from os import listdir
from os.path import isfile, join
import gc

import numpy as np
import xarray as xr

files = [file for file in listdir("data/") if isfile(join("data/", file))]

for file in files:
    filename = file
    ir = xr.open_dataset(path + filename)

    if __name__ == "__main__":

        pic_nr = np.unique(ir.pic)[0]

        image_lst = ir.time.searchsorted(
            ir.where(ir.pic == pic_nr, drop=True).time
        )
        run_and_save(image_lst[0:6], pic_nr)
        gc.collect()

Essentially, I want the code to run the entire for-loop across several processors at once, with each processor working on one file from the list called 'files', but I can't seem to get it right even after reading some guides. Would anyone know the quickest way to get it to work correctly?

1 Answer

I suppose the quickest / simplest way to get there is to use a multiprocessing pool and let it run across an iterable (of your files)... A minimal example with a fixed number of workers and a little extra output to observe the behavior would be:

import datetime
import time

from multiprocessing import Pool

def long_running_task(filename):
    time.sleep(1)
    print(f"{datetime.datetime.now()} finished: {filename}")

filenames = range(15)

with Pool(10) as mp_pool:
    mp_pool.map(long_running_task, filenames)

This creates a pool of 10 workers and calls long_running_task with each item from filenames (here just a series of ints 0-14 as a stand-in) as tasks finish and workers become available.
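
Applied to your case, a minimal sketch of the same pattern could look like the code below. It assumes the per-file work from your snippet (opening the dataset, picking pic_nr, building image_lst and calling run_and_save) can be moved into a standalone function; path and run_and_save are your own names and would need to be defined or imported in the same module.

from multiprocessing import Pool
from os import listdir
from os.path import isfile, join

import numpy as np
import xarray as xr

def process_file(filename):
    # One worker processes one file; everything the task needs is derived
    # from the filename argument. path and run_and_save come from your code.
    ir = xr.open_dataset(path + filename)
    pic_nr = np.unique(ir.pic)[0]
    image_lst = ir.time.searchsorted(
        ir.where(ir.pic == pic_nr, drop=True).time
    )
    run_and_save(image_lst[0:6], pic_nr)

if __name__ == "__main__":
    files = [f for f in listdir("data/") if isfile(join("data/", f))]
    with Pool(12) as mp_pool:
        mp_pool.map(process_file, files)

The if __name__ == "__main__": guard matters here: with the spawn start method (the default on Windows and macOS) each worker re-imports the module, and unguarded pool creation would try to start workers recursively.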

Alternatively, if you wanted to iterate over the inputs yourself, you could do something like:

with Pool(10) as mp_pool:
    for fn in range(15):
        mp_pool.apply_async(long_running_task, (fn,))
    mp_pool.close()
    mp_pool.join()

This passes fn as the first positional argument of each long_running_task call... After submitting all the work, we close the pool so it stops accepting new tasks and join it to wait for any outstanding jobs to finish.
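
If you also need the return values, apply_async hands back AsyncResult objects that you can keep and .get() once everything has been submitted; a small sketch, assuming the task returns something useful:

with Pool(10) as mp_pool:
    async_results = [
        mp_pool.apply_async(long_running_task, (fn,)) for fn in range(15)
    ]
    # .get() blocks until the corresponding task has finished and
    # re-raises any exception that was raised inside the worker.
    results = [res.get() for res in async_results]

For the Pool.map variant, the results already come back as a list in the same order as the inputs.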
