I have to concatenate many short strings into on a dictionary entry and I realized it was quite slow (roughly quadratic). However, if the strings are concatenated beforehand and then added to the dictionary then the concatenation time is almost linear.
Here is a simple illustration. This functions basically concatenate many strings and create a dictionary with a single entry that contains the concatenated string. The first functions does it directly on the entry "d[key] += str(num)" and the second on a string "out_str += str(num)". Down below the times are printed and can be seen the quadratic and linear behaviour.
I wonder where the overhead is coming from. Thanks!
def method1(loop_count):
""" String concatenation on dictionary entry directly (slow)"""
out_str = ''
d={}
key = 'key'
d[key] = ''
for num in range(loop_count):
d[key] += str(num)
return d
def method2(loop_count):
""" Concatenation 'on string' and then add to dictionary (fast) """
out_str = ''
d={}
key = 'key'
for num in range(loop_count):
out_str += str(num)
d[key] = out_str
return d
def method3(loop_count):
""" Concatenation 'on string' and then add to dictionary (fast) """
out_str = ''
d={}
key = 'key'
out_str = ''.join(str(n) for n in range(loop_count))
d[key] = out_str
return d
from timeit import default_timer as timer
import numpy as np
for p in range(10,20):
t0 = timer()
method1(np.power(2,p))
t1 = timer()
method2(np.power(2,p))
t2 = timer()
method3(np.power(2,p))
t3 = timer()
print("2^{}:\t{:4.2g}\t{:4.2g}\t{:4.2g}".format(p, t1-t0, t2-t1, t3-t2))
in dict += join
2^10: 0.0003 0.0002 0.0002
2^11: 0.00069 0.0004 0.00038
2^12: 0.0017 0.00079 0.00076
2^13: 0.0057 0.0016 0.0015
2^14: 0.021 0.0032 0.0031
2^15: 0.095 0.0065 0.0065
2^16: 0.77 0.013 0.013
2^17: 3.2 0.026 0.027
2^18: 15 0.052 0.052
2^19: 67 0.1 0.11
Note: This is not a question strictly about efficient string concatenation. It is about when string optimizations take place in string concatenation.
Note 2: I added a third method using the "join" idiom and it takes exactly the same time as +=
As the answers suggest this seems to be an optimization issue. Further tests seem to prove this. The code below shows that += reuses many of the strings whether the concatenation in dictionary entry does not:
a=''
for n in range(10):
print(id(a))
a+=str(n)
140126222965424
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126042796464
140126042796464
d={}
d['key']=''
for n in range(10):
print(id(d['key']))
d['key']+=str(n)
140126222965424
140126042643120
140126042643232
140126042643176
140126042643120
140126042643232
140126042643176
140126042643120
140126042761520
140126042761456
I still wonder why is so. Thank you!
+=on dictionary items. BTW, you might find it even faster to accumulate them in a list and then join them from there (you'd have to try it).+=is quadratic. There has been some effort at adding optimizations that will not use quadratic-time algorithms under the hood when you do a loop with+=with strings, but it is hard for the interpreter to optimize anything but the most obvious instances. In your case, the interpreter is not able to optimize it. But you shouldn't rely on interpreter optimizations, rather, use the canonical way of joining many strings: use''.join