
First of all, I know there are quite a few threads about multiprocessing in Python already, but none of them seems to solve my problem.

Here is my problem: I want to implement Random Forest Algorithm, and a naive way to do so would be like this:

def random_tree(Data):
    tree = calculation(Data)
    forest.append(tree)

forest = list()
for i in range(300):
    random_tree(Data)

And the forest, with 300 "trees" inside, would be my final result. In this case, how do I turn this code into a multiprocessing version?


Update: I just tried Mukund M K's method, in a very simplified script:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

data = np.array([1,2,5])

pool = Pool(processes=4)
forest = pool.map(f, (data for i in range(4)))
# I use range() instead of xrange() because I am using Python 3.4

And now the script runs forever. I opened a Python shell, entered the script line by line, and these are the messages I got:

> Process SpawnPoolWorker-1:
> Traceback (most recent call last):
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
>     self.run()
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
>     self._target(*self._args, **self._kwargs)
>   File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
>     task = get()
>   File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
>     return ForkingPickler.loads(res)
> AttributeError: Can't get attribute 'f' on

(The same traceback is printed by SpawnPoolWorker-2 through SpawnPoolWorker-4, interleaved.)

Update: I edited my sample code, following some other example code, like this:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

if __name__ == '__main__':
    data = np.array([1,2,3])
    with Pool(5) as p:
        result = p.map(f, (data for i in range(300)))

And it works now. What I need to do next is fill this in with a more sophisticated algorithm.
Yet another question in my mind is: why does this code work, while the previous version couldn't?

  • "Data" is a 2-D (100*3) numpy array. Commented Dec 25, 2015 at 15:11
  • Are you just reading it, or modifying the contents as well in calculation? If so, does the order in which it is modified matter? Commented Dec 25, 2015 at 15:16
  • I only read the data. In the random forest algorithm, I randomly sample from the original data ("Data") to build each tree. So every iteration is independent; that is why I think it should be parallelizable. Commented Dec 25, 2015 at 15:26
  • I know this is old, but just in case: the culprit here is probably the missing if __name__ == '__main__':. If you read the multiprocessing Python docs you will find that this is an explicit requirement for mp to work. Commented Apr 20, 2017 at 11:50

2 Answers


You can do it with multiprocessing this way:

from multiprocessing import Pool

def random_tree(Data):
    return calculation(Data)

pool = Pool(processes=4)
forest = pool.map(random_tree, (Data for i in range(300)))

5 Comments

I don't get the point of this suggestion. But I tried it in my test script (initializing the Pool after defining the function), and it still runs without stopping.
Well, when I ran it on Python 3 it seemed to pickle the data. Since the function was defined later on, it wasn't able to find it and threw an error. The code seems to work for me; your simplified script works as well.
I tried to run the script in a Python shell and it still doesn't work. (See above.)
This might help? Since you are using Windows.
Nope, this did not really help with my original problem. But somehow I found a way to get my sample code to work. (See updates.)

The processing package might help you. Check it out here.

