
I have to concatenate many short strings into a dictionary entry, and I noticed it was quite slow (roughly quadratic). However, if the strings are concatenated beforehand and only then stored in the dictionary, the concatenation time is almost linear.

Here is a simple illustration. These functions concatenate many strings and create a dictionary with a single entry containing the concatenated string. The first function does the concatenation directly on the entry (d[key] += str(num)); the second does it on a local string first (out_str += str(num)). The measured times are printed below, where the quadratic and linear behaviour can be seen.

I wonder where the overhead is coming from. Thanks!

def method1(loop_count):
    """ String concatenation on dictionary entry directly (slow)"""
    d = {}
    key = 'key'
    d[key] = ''
    for num in range(loop_count):
        d[key] += str(num)
    return d

def method2(loop_count):
    """ Concatenation 'on string' and then add to dictionary (fast) """
    out_str = ''
    d = {}
    key = 'key'
    for num in range(loop_count):
        out_str += str(num)

    d[key] = out_str
    return d

def method3(loop_count):
    """ Concatenation 'on string' and then add to dictionary (fast) """
    out_str = ''
    d={}
    key = 'key'

    out_str = ''.join(str(n) for n in range(loop_count))

    d[key] = out_str
    return d

from timeit import default_timer as timer
for p in range(10, 20):
    t0 = timer()
    method1(2**p)
    t1 = timer()
    method2(2**p)
    t2 = timer()
    method3(2**p)
    t3 = timer()
    print("2^{}:\t{:4.2g}\t{:4.2g}\t{:4.2g}".format(p, t1-t0, t2-t1, t3-t2))

        in dict   +=    join
2^10:   0.0003  0.0002  0.0002
2^11:   0.00069 0.0004  0.00038
2^12:   0.0017  0.00079 0.00076
2^13:   0.0057  0.0016  0.0015
2^14:   0.021   0.0032  0.0031
2^15:   0.095   0.0065  0.0065
2^16:   0.77    0.013   0.013
2^17:    3.2    0.026   0.027
2^18:     15    0.052   0.052
2^19:     67     0.1    0.11

Note: This is not a question strictly about efficient string concatenation. It is about when string optimizations take place in string concatenation.

Note 2: I added a third method using the ''.join idiom, and it takes essentially the same time as +=.
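One of the comments suggests accumulating the pieces in a list and joining once at the end. A quick sketch of that variant (the name method4 is mine, not part of the original code):

```python
def method4(loop_count):
    """Accumulate the pieces in a list, then join once at the end.
    This does not rely on the CPython in-place resize optimization,
    so it stays linear on any Python implementation."""
    d = {}
    d['key'] = ''.join([str(num) for num in range(loop_count)])
    return d
```

I would expect this to time about the same as method2 and method3; the point is that its linearity is guaranteed rather than dependent on an interpreter optimization.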


As the answers suggest, this seems to be an optimization issue. Further tests appear to confirm it. The code below shows that += on a plain variable reuses many of the string objects, whereas concatenation on a dictionary entry does not:

a=''
for n in range(10):
    print(id(a))
    a+=str(n)

140126222965424
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126042796464
140126042796464

d={}
d['key']=''
for n in range(10):
    print(id(d['key']))
    d['key']+=str(n)

140126222965424
140126042643120
140126042643232
140126042643176
140126042643120
140126042643232
140126042643176
140126042643120
140126042761520
140126042761456

I still wonder why this is so. Thank you!
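The effect does not seem to be specific to dictionaries. As a further check (my own extension of the experiment above, not from the original post), repeating the id() experiment with a list element shows the same behaviour, since the list also keeps a reference to the string:

```python
lst = ['']
for n in range(10):
    print(id(lst[0]))  # ids vary much more than in the plain += case
    lst[0] += str(n)
```

Any container holding a reference to the string appears to defeat the reuse seen with a bare local variable.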

  • Python optimizes some string concatenations. My guess is that "some" does not include += on dictionary items. BTW, you might find it even faster to accumulate them in a list and then join them from there (you'd have to try it). Commented Oct 31, 2018 at 16:53
  • Possible duplicate of How to speed up string concatenation in Python? Commented Oct 31, 2018 at 16:57
  • Yes, because you should assume that string concatenation with += is quadratic. There has been some effort at adding optimizations that avoid quadratic-time behaviour under the hood when you do a loop with += on strings, but it is hard for the interpreter to optimize anything but the most obvious instances. In your case, the interpreter is not able to optimize it. But you shouldn't rely on interpreter optimizations; rather, use the canonical way of joining many strings: ''.join. Commented Oct 31, 2018 at 19:31
  • Updated with join. The same performance as +=. Commented Nov 2, 2018 at 10:06
  • So it seems the issue is an optimization that does not kick in with dictionaries. I suspected something like that. Thank you! I wonder if there is any Python documentation where this behaviour is described... Commented Nov 2, 2018 at 10:09

1 Answer

"E.Coms" linked article points to an old mailing list entry: https://mail.python.org/pipermail/python-dev/2004-August/046686.html which talks about code like:

s = ''
for x in y:
  s += some_string(x)

mentioning:

The question is important because the difference in performance is enormous -- we are not talking about 2x or even 10x faster but roughly Nx faster where N is the size of the input data set.

Interesting, as I'd assumed that

out_str = ''.join(str(n) for n in range(loop_count))

would have been faster than

out_str = ''
for num in range(loop_count):
    out_str += str(num)

but they have the same performance as far as I can measure. I think I made this assumption because of the repeated advice that str.join() is a "good thing".

Not sure if that answers the question, but I found the history interesting!

As indicated by juanpa.arrivillaga below, method1 is slow because storing a reference to the string elsewhere invalidates the above optimisation. There is also an O(n) cost for the extra dictionary lookups, but at these sizes that small cost is dominated by the O(n^2) work of creating n copies of the string, rather than the amortised O(n) behaviour the optimisation permits.
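This can be demonstrated without any container at all: merely holding a second reference to the string defeats the optimisation, because CPython only resizes a string in place when it effectively owns the sole reference at the moment of the +=. A sketch (the function names are mine, and the timing behaviour assumes CPython):

```python
from timeit import default_timer as timer

def concat_plain(loop_count):
    """+= on a local holding the only reference: CPython can
    resize the string in place, so this is roughly linear."""
    out = ''
    for num in range(loop_count):
        out += str(num)
    return out

def concat_aliased(loop_count):
    """The identical loop, but an alias keeps a second reference
    alive, so every += must copy the whole string: quadratic."""
    out = ''
    for num in range(loop_count):
        alias = out  # refcount of `out` is > 1 at the +=
        out += str(num)
    return out

n = 2**16
t0 = timer(); concat_plain(n); t1 = timer(); concat_aliased(n); t2 = timer()
print('plain: {:.3g}s  aliased: {:.3g}s'.format(t1 - t0, t2 - t1))
```

At this size the aliased version should be substantially slower under CPython, matching the dict-entry timings in the question even though no dictionary is involved.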


9 Comments

Yes, because the interpreter is able to optimize that simple loop. Adding the indirection of the dictionary makes the interpreter unable to optimize it. But you should never write code relying on these interpreter optimizations, which were really put into place because too many people cannot be bothered to learn the correct way to do things. So if you are joining many strings, use ''.join.
@juanpa.arrivillaga have updated my answer, hope I explained it usefully
"O(n) cost to doing the extra dictionary lookups as well" um, I don't think so. What do you mean? dict lookups in Python are safely assumed to be O(1). Of course, it is possible that certain dictionary structures will fail at this, but it is one of the most optimized data-structures in the language, especially for anything like str and int objects. Note, many variable references, like global ones are simply dict accesses
I think we're saying the same thing, just using different words! If you've read Knuth's TAOCP I think I might be going into that level of detail — basically unnecessary for answering the OP.
Agreed, speaking informally, one scales linearly, the other quadratically.
