Cause of the problem
Recently I needed to split a text into several topics and build a separate regressor for each topic. The regressors are independent of one another, and their outputs are combined at the end to produce the overall prediction. That's right: similar to a bagging ensemble! I just skipped the sampling step. The text is not large, about 3,000 lines, and there are 8 topics, so I wrote a serial program that processed one topic after another. But within each topic I used GridSearchCV for parameter tuning, and since I had to select features as well as tune the regressor's parameters, there were 1782 parameter combinations in total. I badly underestimated how long the tuning would take. The program ran for a whole day and night, and in the end I had forgotten to import a library, so the final prediction accuracy was never computed. Afterwards it occurred to me: since each topic's prediction is independent, why not run them in parallel?
Multi threading and multiprocessing in Python
However, Python's multithreading cannot truly exploit multiple cores: because of the Global Interpreter Lock (GIL), only one thread executes Python bytecode at a time, so for CPU-bound work multithreading gives you concurrency on a single core, not parallelism. Multiple processes, on the other hand, can genuinely use multiple cores: each process is independent, shares no resources by default, and different processes can run on different cores, achieving real parallelism. In my problem the topics are independent of each other and require no inter-process communication; only the final results need to be collected, so multiprocessing is a good choice.
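This claim is easy to check yourself. Below is a minimal sketch (the function name cpu_bound and the workload sizes are my own choices, not from any library): it runs the same CPU-bound function through a thread pool and a process pool. On a multi-core machine, timing the two map() calls typically shows the process pool finishing much faster, because the thread pool is serialized by the GIL.

```python
import math
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, but backed by threads

def cpu_bound(n):
    # pure computation: a thread running this holds the GIL almost the whole time
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    work = [500000] * 4

    t0 = time.time()
    with ThreadPool(4) as tp:
        thread_results = tp.map(cpu_bound, work)
    t_threads = time.time() - t0

    t0 = time.time()
    with Pool(4) as pp:
        proc_results = pp.map(cpu_bound, work)
    t_procs = time.time() - t0

    assert thread_results == proc_results  # same answers either way
    print('threads: %.2fs  processes: %.2fs' % (t_threads, t_procs))
```

multiprocessing.dummy deliberately mirrors the multiprocessing API with threads, which makes this comparison a one-line swap.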
Multiprocessing
A subprocess
The multiprocessing module provides the Process class for creating a new process. The following code creates a new child process.
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))  # create child process p; target is the function f, args is the tuple of arguments passed to f
    p.start()  # start the child process
    p.join()   # wait for the child process to finish
The p.join() in the code above means: wait for the child process to finish before executing the subsequent code. It is commonly used to coordinate processes. For example, if there is a write process pw and a read process pr, and reading must not start until all the writing is done, you call pw.join() before starting pr, so that the program waits for the write process to finish before the read process begins.
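A minimal sketch of that pattern (the function name write and the sample items are mine): the main process starts a writer, joins it, and only then reads, so the read is guaranteed to see everything the writer produced.

```python
from multiprocessing import Process, Queue

def write(q):
    # the write process puts a few items on the shared queue
    for item in ['a', 'b', 'c']:
        q.put(item)

def run_demo():
    q = Queue()
    pw = Process(target=write, args=(q,))
    pw.start()
    pw.join()  # wait for the write process to finish before reading
    return [q.get() for _ in range(3)]

if __name__ == '__main__':
    print(run_demo())  # ['a', 'b', 'c']
```

One caveat: for large payloads you would drain the queue before joining, because a Queue's background feeder thread can keep the writer alive and block the join; for a few small items as here it is safe.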
Multiple sub processes
If you want to create multiple child processes at once, use the multiprocessing.Pool class. It creates a pool of worker processes and distributes the submitted tasks to them, so they execute across multiple cores.
import multiprocessing

def func(msg):
    print(multiprocessing.current_process().name + '-' + msg)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)  # create a pool of 4 worker processes
    for i in range(10):
        msg = "hello %d" % i
        pool.apply_async(func, (msg,))
    pool.close()  # close the pool: no more tasks can be submitted
    pool.join()   # wait for all workers to finish; must be called after close()
    print("Sub-process(es) done.")
The output is as follows (the worker lines interleave differently on every run; "Sub-process(es) done." is printed only after pool.join() returns, so it comes last):
PoolWorker-34-hello 1
PoolWorker-33-hello 0
PoolWorker-35-hello 2
PoolWorker-36-hello 3
PoolWorker-34-hello 7
PoolWorker-33-hello 4
PoolWorker-35-hello 5
PoolWorker-36-hello 6
PoolWorker-33-hello 8
PoolWorker-36-hello 9
Sub-process(es) done.
The pool.apply_async() in the code above is a variant of apply(): apply_async() is the non-blocking (asynchronous) version, and apply() is the blocking version. With apply(), the main process blocks until the submitted function has finished executing, which is why it is called the blocking version. (Pool.apply() merely shares its name with Python 2's built-in apply(); the two are not the same function.) Notice that the output is not in the order of the for loop: tasks finish in whatever order the workers happen to complete them.
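The difference can be seen by timing the two calls. In this sketch (slow_square is a made-up example function), apply() does not return until the worker has finished, while apply_async() returns an AsyncResult immediately and only blocks later, when get() is called:

```python
import time
from multiprocessing import Pool

def slow_square(x):
    time.sleep(0.2)  # simulate real work
    return x * x

if __name__ == '__main__':
    with Pool(2) as pool:
        t0 = time.time()
        r1 = pool.apply(slow_square, (3,))         # blocks for about 0.2 s
        t_apply = time.time() - t0

        t0 = time.time()
        res = pool.apply_async(slow_square, (4,))  # returns almost instantly
        t_async = time.time() - t0

        r2 = res.get()  # block only when actually fetching the value
        print(r1, r2, t_apply > t_async)
```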
Multiple child processes and return values
apply_async() can also hand back the return value of the function executed in the worker process. In the multi-process code above, if func returns a value, then pool.apply_async(func, (msg,)) returns an AsyncResult object (note: an object wrapping the result, not the value itself); calling its get() method retrieves the actual return value.
import multiprocessing

def func(msg):
    return multiprocessing.current_process().name + '-' + msg

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)  # create a pool of 4 worker processes
    results = []
    for i in range(10):
        msg = "hello %d" % i
        results.append(pool.apply_async(func, (msg,)))
    pool.close()  # close the pool: no more tasks can be submitted; must be called before join()
    pool.join()   # wait for all workers to finish
    print("Sub-process(es) done.")
    for res in results:
        print(res.get())
The output result of the above code is as follows:
Sub-process(es) done.
PoolWorker-37-hello 0
PoolWorker-38-hello 1
PoolWorker-39-hello 2
PoolWorker-40-hello 3
PoolWorker-37-hello 4
PoolWorker-38-hello 5
PoolWorker-39-hello 6
PoolWorker-37-hello 7
PoolWorker-40-hello 8
PoolWorker-38-hello 9
Unlike the previous output, these results are printed in order: the results list preserves the order in which the tasks were submitted, and each res.get() blocks until that particular result is ready, so the values come out in submission order even though the workers may finish in a different order.
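If ordered results are all you need, Pool.map(), a parallel counterpart of the built-in map, does the submit-and-collect-in-order dance in a single call. A minimal sketch with a toy square function:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # map() blocks until every task finishes and returns results in input order
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```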
If the machine has eight cores and you create eight processes, you can run the top command in Ubuntu and press 1: the usage of each CPU is roughly even, as shown in the figure below:

The difference in the CPU usage curves before and after running with multiple processes can also be seen clearly in the system monitor:
