
I'm currently optimising my code and I have found a bottleneck. I have a dataframe df with a column 'Numbers' containing integers from 1 to 100. I would like to map those numbers with a dictionary. I know that I can use the .map() or .replace() function, but both solutions seem slow and neither takes into account that the numbers in 'Numbers' are indices into my dictionary (which is a series), i.e. I would like to perform the following:

# build the lookup: position i -> 'a<i>'
dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

# map every value through the lookup, one Python call per row
df['Numbers_with_a'] = df['Numbers'].apply(lambda x: dict_simple[x])

Unfortunately the apply function is also very slow. Is there any other way to do this faster? The dataframe has 50M+ records.

I have tried the .map(), .replace() and .apply() functions from the pandas package, but performance is very poor. I would like to improve the calculation time.

  • In your example dict_simple is a list... so this is confusing. Are you just trying to map those 100 possible integers to strings? And it's 1 to 100, inclusive, not 0 to 99? Commented Aug 10, 2023 at 20:47
  • In the example I wanted to stress that I do not need to check every element in dict_simple to perform the mapping. I know exactly which element I want (its position in the list). Commented Aug 10, 2023 at 20:54
  • It's not obvious what you mean by poor performance. Let's say you have code like mapper = pd.Series('a', index=range(100)) + pd.Series(range(100), dtype=str); seq = pd.Series(rng.choice(100, size=50_000_000)); seq.map(mapper). It runs in 2 sec on my quite old machine. Is that not enough? Commented Aug 10, 2023 at 21:07

4 Answers


Convert your list to a NumPy array and index it directly, as below:

import numpy as np

# build the same lookup list and turn it into a NumPy array
dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

dict_array = np.array(dict_simple)

# fancy indexing: one vectorised lookup instead of a Python call per row
df['Numbers_with_a'] = dict_array[df['Numbers'].values]
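
One caveat (not part of the original answer): the question says the numbers run from 1 to 100 inclusive, so a 100-element lookup array would raise IndexError for the value 100. A minimal sketch, assuming 1-based codes, that sizes the array accordingly:

import numpy as np
import pandas as pd

# hypothetical frame with 1-based codes 1..100, as described in the question
df = pd.DataFrame({"Numbers": np.random.randint(1, 101, size=1_000_000)})

# lookup table covering indices 0..100 so that code 100 stays in bounds;
# alternatively keep 100 entries and index with df['Numbers'] - 1
dict_array = np.array(['a' + str(i) for i in range(101)])
df['Numbers_with_a'] = dict_array[df['Numbers'].to_numpy()]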



A pandas.Series has an index that can be used to map one value to another natively in pandas, without the extra expense of calling apply for each row or converting values to the Python int type. Since the numbers you want to map start from zero and a Series is indexed from 0 by default, you can:

import pandas as pd

df = pd.DataFrame({"numbers":[1,4,22,7,99]})
str_map = pd.Series([f'a{i}' for i in range(100)])
df['numbers_with_a'] = str_map.iloc[df.numbers].reset_index(drop=True)
print(df)

str_map is a Series created from your "a0"... strings. str_map.iloc[df.numbers] uses your numbers as indices, giving you a new Series of the mapped values. That Series is indexed by your numbers, so you drop that index and assign the result back to the original dataframe.
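
Equivalently (a small sketch, not from the original answer), the reset_index step can be skipped by pulling the lookup values out as a NumPy array and indexing positionally:

# same mapping without index alignment: take the underlying array of
# str_map and index it positionally with the numbers
df['numbers_with_a'] = str_map.to_numpy()[df.numbers.to_numpy()]
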

2 Comments

@Vitalizzare - A mistake. I was playing with better ways to index the array.
I like the idea of using iloc for mapping in cases like this one. Good catch!

Thanks for all the answers. I have done some comparisons:

import pandas as pd
import time
import numpy as np

df=pd.DataFrame(np.random.randint(1,10,size=(10000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(10):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(10)])

print('map')
start_time=time.time()
df['Numbers_map']=df['N'].map(dict_dictionary)
print(time.time()-start_time)

print('replace')
start_time=time.time()
df['Numbers_replace']=df['N']
df['Numbers_replace'].replace(dict_dictionary,inplace=True)
print(time.time()-start_time)

print('array')
start_time=time.time()
df['Numbers_array']=dict_array[df['N'].values]
print(time.time()-start_time)

print('series')
start_time=time.time()
df['Numbers_series']=dict_series.iloc[df.N].reset_index(drop=True)
print(time.time()-start_time)

print('end')

Results are as follows:

map
1.424480676651001
replace
3.657830238342285
array
1.4687621593475342
series
0.4687619209289551
end

"replace" gains some performance for small dictionaries, but overall approach with series is the fastest.

2 Comments

Interesting. I thought I had a crappy machine, but my times were faster and .map beat the rest: map 0.207, replace 0.584, array 0.473, series 0.368. Versions are: python 3.10.12, numpy 1.21.5, pandas 1.3.5
Do you get the same results with a higher number of integers in df (randint 1 to 90) and dicts of up to 100 entries? I thought that .map and .replace should be slower because dicts are not indexed.

I have updated numpy and pandas to the newest versions, and now "map" is very close to the "series" approach. Computing time depends on the size of the dictionary: sometimes "map" is better and sometimes "series". I have also tried parallel pandas to use all CPUs, but performance was worse than "map". Quite probably p_map performance is worse because I'm only using 2 CPUs.

import pandas as pd
from parallel_pandas import ParallelPandas
import time
import numpy as np
ParallelPandas.initialize(n_cpu=2, split_factor=2, disable_pr_bar=True)

df=pd.DataFrame(np.random.randint(1,99,size=(1000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(100):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(100)])
print('p_map')
start_time=time.time()
df['Numbers_p_map']=df['N'].p_map(dict_dictionary)
print(time.time()-start_time)
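
If more cores are available, the same measurement can be repeated with a larger n_cpu (a sketch reusing the setup above; n_cpu=8 is an assumption about the machine). For such a cheap per-element lookup, the cost of splitting the frame and reassembling the result may still outweigh the parallel gain:

# re-initialize the pool with more workers (only useful if the machine
# actually has them) and time p_map again on the same column
ParallelPandas.initialize(n_cpu=8, split_factor=4, disable_pr_bar=True)

start_time = time.time()
df['Numbers_p_map'] = df['N'].p_map(dict_dictionary)
print(time.time() - start_time)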

