
Multiprocessing a for loop in Python

I have a program that currently takes a very long time to run because it processes a large number of files. I was hoping to run the program across all 12 processors on my computer at once to decrease the run time. I've been trying to get it to work for a while now, but something seems to go wrong whenever I run it. My program looks something like this before I tried to introduce multiprocessing:

from os import listdir
from os.path import isfile, join
import gc

import numpy as np
import xarray as xr

files = [file for file in listdir("data/") if isfile(join("data/", file))]

for file in files:
    filename = file
    ir = xr.open_dataset(path + filename)

    if __name__ == "__main__":

        pic_nr = np.unique(ir.pic)[0]

        image_lst = ir.time.searchsorted(
            ir.where(ir.pic == pic_nr, drop=True).time
        )
        run_and_save(image_lst[0:6], pic_nr)
        gc.collect()

Essentially, I want the code to run the entire for-loop across several processors at once, with each processor working on one file from the list called 'files', but I can't seem to get it right even after reading some guides. Would anyone know the quickest way to get it to work correctly?

1 Answer

I suppose the quickest / simplest way to get there is to use a multiprocessing pool and let it run across an iterable (of your files)... A minimal example with a fixed number of workers and a little extra output to observe the behavior would be:

import datetime
import time

from multiprocessing import Pool

def long_running_task(filename):
    time.sleep(1)
    print(f"{datetime.datetime.now()} finished: {filename}")

filenames = range(15)

with Pool(10) as mp_pool:
    mp_pool.map(long_running_task, filenames)

This creates a pool of 10 workers and calls long_running_task with each item from filenames (here just a series of ints 0-14 as a stand-in) as tasks finish and workers become available.
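
Applied to your case, a minimal sketch of the same pattern could look like the code below. It assumes the per-file work from your snippet (opening the dataset, picking pic_nr, building image_lst and calling run_and_save) can be moved into a standalone function; path and run_and_save are your own names and would need to be defined or imported in the same module.

from multiprocessing import Pool
from os import listdir
from os.path import isfile, join

import numpy as np
import xarray as xr

def process_file(filename):
    # One worker processes one file; everything the task needs is derived
    # from the filename argument. path and run_and_save come from your code.
    ir = xr.open_dataset(path + filename)
    pic_nr = np.unique(ir.pic)[0]
    image_lst = ir.time.searchsorted(
        ir.where(ir.pic == pic_nr, drop=True).time
    )
    run_and_save(image_lst[0:6], pic_nr)

if __name__ == "__main__":
    files = [f for f in listdir("data/") if isfile(join("data/", f))]
    with Pool(12) as mp_pool:
        mp_pool.map(process_file, files)

The if __name__ == "__main__": guard matters here: with the spawn start method (the default on Windows and macOS) each worker re-imports the module, and unguarded pool creation would try to start workers recursively.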

Alternatively, if you wanted to iterate over the inputs yourself, you could do something like:

with Pool(10) as mp_pool:
    for fn in range(15):
        mp_pool.apply_async(long_running_task, (fn,))
    mp_pool.close()
    mp_pool.join()

This passes fn as the first positional argument of each long_running_task call... After submitting all the work, we close the pool so it stops accepting new tasks and join it to wait for any outstanding jobs to finish.
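
If you also need the return values, apply_async hands back AsyncResult objects that you can keep and .get() once everything has been submitted; a small sketch, assuming the task returns something useful:

with Pool(10) as mp_pool:
    async_results = [
        mp_pool.apply_async(long_running_task, (fn,)) for fn in range(15)
    ]
    # .get() blocks until the corresponding task has finished and
    # re-raises any exception that was raised inside the worker.
    results = [res.get() for res in async_results]

For the Pool.map variant, the results already come back as a list in the same order as the inputs.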
