
I'm currently optimising my code and I have found a bottleneck. I have a dataframe df with a column 'Numbers' containing integers from 1 to 100. I would like to map those numbers with a dictionary. I know that I can use the .map() or .replace() function, but both solutions seem slow and neither takes into account that the numbers in 'Numbers' are indices into my dictionary (which is a series), i.e. I would like to perform the following:

# build the lookup: position i -> 'a<i>'
dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

# map every value through the lookup, one Python call per row
df['Numbers_with_a'] = df['Numbers'].apply(lambda x: dict_simple[x])

Unfortunately the apply function is also very slow. Is there any other way to do this faster? The dataframe has 50M+ records.

I have tried the .map(), .replace() and .apply() functions from the pandas package, but performance is very poor. I would like to improve the calculation time.

  • In your example dict_simple is a list... so this is confusing. Are you just trying to map those 100 possible integers to strings? And it's 1 to 100, inclusive, not 0 to 99? Commented Aug 10, 2023 at 20:47
  • In the example I wanted to stress that I do not need to check every element in dict_simple to perform the mapping. I know exactly which element I want (its position in the list). Commented Aug 10, 2023 at 20:54
  • It's not obvious what you mean by poor performance. Let's say you have code like mapper = pd.Series('a', index=range(100)) + pd.Series(range(100), dtype=str); seq = pd.Series(rng.choice(100, size=50_000_000)); seq.map(mapper). It runs in 2 sec on my quite old machine. Is that not enough? Commented Aug 10, 2023 at 21:07

4 Answers


Convert your list to a NumPy array and index it directly, as below:

import numpy as np

# build the same lookup list and turn it into a NumPy array
dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

dict_array = np.array(dict_simple)

# fancy indexing: one vectorised lookup instead of a Python call per row
df['Numbers_with_a'] = dict_array[df['Numbers'].values]
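
One caveat (not part of the original answer): the question says the numbers run from 1 to 100 inclusive, so a 100-element lookup array would raise IndexError for the value 100. A minimal sketch, assuming 1-based codes, that sizes the array accordingly:

import numpy as np
import pandas as pd

# hypothetical frame with 1-based codes 1..100, as described in the question
df = pd.DataFrame({"Numbers": np.random.randint(1, 101, size=1_000_000)})

# lookup table covering indices 0..100 so that code 100 stays in bounds;
# alternatively keep 100 entries and index with df['Numbers'] - 1
dict_array = np.array(['a' + str(i) for i in range(101)])
df['Numbers_with_a'] = dict_array[df['Numbers'].to_numpy()]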



A pandas.Series has an index that can be used to map one value to another natively in pandas, without the extra expense of calling apply for each row or converting values to the Python int type. Since the numbers you want to map start from zero and a Series is indexed from 0 by default, you can:

import pandas as pd

df = pd.DataFrame({"numbers":[1,4,22,7,99]})
str_map = pd.Series([f'a{i}' for i in range(100)])
df['numbers_with_a'] = str_map.iloc[df.numbers].reset_index(drop=True)
print(df)

str_map is a Series created from your "a0"... strings. str_map.iloc[df.numbers] uses your numbers as indices, giving you a new Series of the mapped values. That Series is indexed by your numbers, so you drop that index and assign the result back to the original dataframe.
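
Equivalently (a small sketch, not from the original answer), the reset_index step can be skipped by pulling the lookup values out as a NumPy array and indexing positionally:

# same mapping without index alignment: take the underlying array of
# str_map and index it positionally with the numbers
df['numbers_with_a'] = str_map.to_numpy()[df.numbers.to_numpy()]
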

2 Comments

@Vitalizzare - A mistake. I was playing with better ways to index the array.
I like the idea of using iloc for mapping in cases like this one. Good catch!

Thanks for all the answers. I have done some comparisons:

import pandas as pd
import time
import numpy as np

df=pd.DataFrame(np.random.randint(1,10,size=(10000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(10):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(10)])

print('map')
start_time=time.time()
df['Numbers_map']=df['N'].map(dict_dictionary)
print(time.time()-start_time)

print('replace')
start_time=time.time()
df['Numbers_replace']=df['N']
df['Numbers_replace'].replace(dict_dictionary,inplace=True)
print(time.time()-start_time)

print('array')
start_time=time.time()
df['Numbers_array']=dict_array[df['N'].values]
print(time.time()-start_time)

print('series')
start_time=time.time()
df['Numbers_series']=dict_series.iloc[df.N].reset_index(drop=True)
print(time.time()-start_time)

print('end')

Results are as follows:

map
1.424480676651001
replace
3.657830238342285
array
1.4687621593475342
series
0.4687619209289551
end

"replace" gains some performance for small dictionaries, but overall approach with series is the fastest.

2 Comments

Interesting. I thought I had a crappy machine, but my times were faster and .map beat the rest: map 0.207, replace 0.584, array 0.473, series 0.368. Versions are: python 3.10.12, numpy 1.21.5, pandas 1.3.5
Do you get the same results with a higher number of integers in df (randint 1 to 90) and dicts of up to 100 entries? I thought that .map and .replace should be slower because dicts are not indexed.

I have updated numpy and pandas to the newest versions, and now "map" is very close to the "series" approach. Computing time depends on the size of the dictionary: sometimes "map" is better and sometimes "series". I have also tried parallel pandas to use all CPUs, but performance was worse than "map". Quite probably p_map performance is worse because I'm only using 2 CPUs.

import pandas as pd
from parallel_pandas import ParallelPandas
import time
import numpy as np
ParallelPandas.initialize(n_cpu=2, split_factor=2, disable_pr_bar=True)

df=pd.DataFrame(np.random.randint(1,99,size=(1000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(100):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(100)])
print('p_map')
start_time=time.time()
df['Numbers_p_map']=df['N'].p_map(dict_dictionary)
print(time.time()-start_time)
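
If more cores are available, the same measurement can be repeated with a larger n_cpu (a sketch reusing the setup above; n_cpu=8 is an assumption about the machine). For such a cheap per-element lookup, the cost of splitting the frame and reassembling the result may still outweigh the parallel gain:

# re-initialize the pool with more workers (only useful if the machine
# actually has them) and time p_map again on the same column
ParallelPandas.initialize(n_cpu=8, split_factor=4, disable_pr_bar=True)

start_time = time.time()
df['Numbers_p_map'] = df['N'].p_map(dict_dictionary)
print(time.time() - start_time)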

