
First of all, I know there are quite a few threads about multiprocessing in Python already, but none of them seems to solve my problem.

Here is my problem: I want to implement Random Forest Algorithm, and a naive way to do so would be like this:

def random_tree(Data):
    tree = calculation(Data)
    forest.append(tree)

forest = list()
for i in range(300):
    random_tree(Data)

And the forest, with 300 "trees" inside, would be my final result. In this case, how do I turn this code into a multiprocessing version?


Update: I just tried Mukund M K's method, in a very simplified script:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

data = np.array([1,2,5])

pool = Pool(processes=4)
forest = pool.map(f, (data for i in range(4)))
# I use range() instead of xrange() because I am using Python 3.4

And now the script runs forever. I opened a Python shell, entered the script line by line, and these are the messages I got:

> Process SpawnPoolWorker-1:
> Traceback (most recent call last):
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
>     self.run()
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
>     self._target(*self._args, **self._kwargs)
>   File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
>     task = get()
>   File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
>     return ForkingPickler.loads(res)
> AttributeError: Can't get attribute 'f' on

(The same traceback is printed by SpawnPoolWorker-2 through SpawnPoolWorker-4, interleaved.)

Update: I edited my sample code, following some other example code, like this:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

if __name__ == '__main__':
    data = np.array([1,2,3])
    with Pool(5) as p:
        result = p.map(f, (data for i in range(300)))

And it works now. What I need to do next is fill this in with a more sophisticated algorithm.
Yet another question in my mind is: why does this code work, while the previous version couldn't?

  • "Data" is a 2-D (100*3) numpy array. Commented Dec 25, 2015 at 15:11
  • Are you just reading it, or modifying the contents as well in calculation? If so, does the order in which it is modified matter? Commented Dec 25, 2015 at 15:16
  • I only read the data. In the random forest algorithm, I randomly sample from the original data ("Data") to build each tree. So every iteration is independent; that is why I think it should be parallelizable. Commented Dec 25, 2015 at 15:26
  • I know this is old, but just in case: the culprit here is probably the missing if __name__ == '__main__':. If you read the multiprocessing Python docs you will find that this is an explicit requirement for mp to work. Commented Apr 20, 2017 at 11:50

2 Answers


You can do it with multiprocessing this way:

from multiprocessing import Pool

def random_tree(Data):
    return calculation(Data)

pool = Pool(processes=4)
forest = pool.map(random_tree, (Data for i in range(300)))

5 Comments

I don't get the point of this suggestion. But I tried it in my test script (initializing the Pool after defining the function), and it still runs without stopping.
Well, when I ran it on Python 3 it seemed to pickle the data. Since the function was defined later on, it wasn't able to find it and threw an error. The code seems to work for me; your simplified script works as well.
I tried to run the script in a Python shell and it still doesn't work. (See above.)
This might help? Since you are using Windows.
Nope, this did not really help with my original problem. But somehow I found a way to get my sample code to work. (See updates.)

The processing package might help you. Check it out here.

