
My script doesn't work. All 10 processes take the first list item and then stop, so the output contains the first list entry 10 times. How can I fix this? Is the error in the loop, or do I need a queue for this?

import finanzen_fundamentals.stocks as ff
import mysql.connector
import pandas as pd
import multiprocessing
import time

results = []

def get_list():
    try:
        mydb = mysql.connector.connect( host="localhost", user="changed", password="changed", database="stockdata")
        mycursor = mydb.cursor()
        mycursor.execute("select * from url_name")
        record = mycursor.fetchall()
        return record
    except Exception as e:
        return str(e)

def create_json(record):
    for row in record:
        try:
            df = ff.get_current_value_lxml(str(row[2])[:-1], exchange = "FSE")
            print('Name:' + row[0] + ' WKN:' + df['wkn'].values[0] + ' Preis:' + str(df['price'].values[0]) + ' Currency:' + df['currency'].values[0] + ' Zeit:' + df['time'].values[0])
            result = [[row[0], df['wkn'].values[0], df['price'].values[0], df['currency'].values[0], df['time'].values[0]]]
            return result
        except Exception as e:
            print(str(e))

def collect_results(result):
    results.extend(result)

if __name__ == '__main__':
    record = get_list()
    start_time = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    for i in range(10):
        pool.apply_async(create_json, args=(record, ), callback=collect_results)
    pool.close()
    pool.join()

    df_out = pd.DataFrame(results, columns=['Name', 'WKN', 'Preis', 'Currency', 'Zeit'])
    print(df_out)

Output:

                      Name     WKN  Preis Currency        Zeit
0  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
1  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
2  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
3  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
4  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
5  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
6  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
7  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
8  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020
9  21VIANET GRP ADR A/6 O.  A1H9DT   20.0      EUR  23.10.2020

1 Answer


You got the loop structure wrong. Inside create_json you loop over the rows of record, but you always call it with the same original record list and return on the first iteration. So every worker just processes the first row. You need to change the worker function to operate on a single row:

def create_json(row):
    try:
        df = ff.get_current_value_lxml(str(row[2])[:-1], exchange = "FSE")
        print('Name:' + row[0] + ' WKN:' + df['wkn'].values[0] + ' Preis:' + str(df['price'].values[0]) + ' Currency:' + df['currency'].values[0] + ' Zeit:' + df['time'].values[0])
        result = [[row[0], df['wkn'].values[0], df['price'].values[0], df['currency'].values[0], df['time'].values[0]]]
        return result
    except Exception as e:
        print(str(e))

Then, in the main code, call it once per row:

if __name__ == '__main__':
    ...
    for row in record:
        pool.apply_async(create_json, args=(row, ), callback=collect_results)
    ...
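
For completeness, here is what the assembled main block would look like, a sketch reusing the pool setup, results callback, and DataFrame output from the question as-is:

if __name__ == '__main__':
    record = get_list()
    start_time = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    # submit one task per row instead of one task per loop index
    for row in record:
        pool.apply_async(create_json, args=(row, ), callback=collect_results)
    pool.close()
    pool.join()

    df_out = pd.DataFrame(results, columns=['Name', 'WKN', 'Preis', 'Currency', 'Zeit'])
    print(df_out)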

Note that in this case, instead of looping and calling apply_async, you can just use map. It already returns a list of the results, so you don't even need the callback anymore. Something like:

def create_json(row):
    try:
        df = ff.get_current_value_lxml(str(row[2])[:-1], exchange = "FSE")
        print('Name:' + row[0] + ' WKN:' + df['wkn'].values[0] + ' Preis:' + str(df['price'].values[0]) + ' Currency:' + df['currency'].values[0] + ' Zeit:' + df['time'].values[0])
        result = [row[0], df['wkn'].values[0], df['price'].values[0], df['currency'].values[0], df['time'].values[0]]
        # NOTE THAT NOW IT'S A 1-D LIST!
        return result
    except Exception as e:
        print(str(e))

if __name__ == '__main__':
    record = get_list()
    start_time = time.time()
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # rows whose fetch raised an exception return None; drop them so
        # the DataFrame constructor doesn't trip over them
        results = [r for r in pool.map(create_json, record) if r is not None]

    df_out = pd.DataFrame(results, columns=['Name', 'WKN', 'Preis', 'Currency', 'Zeit'])
    print(df_out)
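
If the order of the rows doesn't matter, pool.imap_unordered is a variation worth trying: it yields each result as soon as its worker finishes instead of waiting for the whole batch. A sketch, where the chunksize of 8 is an assumption to tune:

if __name__ == '__main__':
    record = get_list()
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # yields results in completion order; chunksize batches rows per
        # worker task to cut down on inter-process communication overhead
        results = [r for r in pool.imap_unordered(create_json, record, chunksize=8)
                   if r is not None]

    df_out = pd.DataFrame(results, columns=['Name', 'WKN', 'Preis', 'Currency', 'Zeit'])
    print(df_out)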

2 Comments

Your answer works pretty well, but it isn't as fast as expected. Do you have any recommendations to improve the speed? (Better hardware, of course, but first on the software side?) Here is my updated method:

def on_call():
    record = get_list()
    start_time = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    for row in record:
        pool.apply_async(create_json, args=(row, ), callback=collect_results)
    pool.close()
    pool.join()
Did you try the map version I provided? It might be faster because the loop is internal rather than an explicit for. You might also try a thread pool by changing the import to multiprocessing.dummy, but I'm not sure whether that will help because of the GIL.
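
For reference, the thread-pool variant mentioned in the last comment only needs the import swapped: multiprocessing.dummy exposes a Pool with the identical API, backed by threads. Since create_json spends most of its time waiting on the network, the GIL is released during those waits, so threads may actually help here. A sketch, not benchmarked; the worker count of 16 is an assumption:

from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, thread-backed

if __name__ == '__main__':
    record = get_list()
    # for I/O-bound work the thread count can exceed the CPU count
    with ThreadPool(16) as pool:
        results = [r for r in pool.map(create_json, record) if r is not None]

    df_out = pd.DataFrame(results, columns=['Name', 'WKN', 'Preis', 'Currency', 'Zeit'])
    print(df_out)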
