
I have to concatenate many short strings into a dictionary entry, and I noticed it was quite slow (roughly quadratic). However, if the strings are concatenated beforehand and only then stored in the dictionary, the concatenation time is almost linear.

Here is a simple illustration. These functions concatenate many strings and create a dictionary with a single entry containing the concatenated string. The first function does the concatenation directly on the entry (d[key] += str(num)); the second does it on a local string first (out_str += str(num)). The measured times are printed below, where the quadratic and linear behaviour can be seen.

I wonder where the overhead is coming from. Thanks!

def method1(loop_count):
    """ String concatenation on dictionary entry directly (slow)"""
    d = {}
    key = 'key'
    d[key] = ''
    for num in range(loop_count):
        d[key] += str(num)
    return d

def method2(loop_count):
    """ Concatenation 'on string' and then add to dictionary (fast) """
    out_str = ''
    d = {}
    key = 'key'
    for num in range(loop_count):
        out_str += str(num)

    d[key] = out_str
    return d

def method3(loop_count):
    """ Concatenation 'on string' and then add to dictionary (fast) """
    out_str = ''
    d={}
    key = 'key'

    out_str = ''.join(str(n) for n in range(loop_count))

    d[key] = out_str
    return d

from timeit import default_timer as timer
for p in range(10, 20):
    t0 = timer()
    method1(2**p)
    t1 = timer()
    method2(2**p)
    t2 = timer()
    method3(2**p)
    t3 = timer()
    print("2^{}:\t{:4.2g}\t{:4.2g}\t{:4.2g}".format(p, t1-t0, t2-t1, t3-t2))

        in dict   +=    join
2^10:   0.0003  0.0002  0.0002
2^11:   0.00069 0.0004  0.00038
2^12:   0.0017  0.00079 0.00076
2^13:   0.0057  0.0016  0.0015
2^14:   0.021   0.0032  0.0031
2^15:   0.095   0.0065  0.0065
2^16:   0.77    0.013   0.013
2^17:    3.2    0.026   0.027
2^18:     15    0.052   0.052
2^19:     67     0.1    0.11

Note: This is not a question strictly about efficient string concatenation. It is about when string optimizations take place in string concatenation.

Note 2: I added a third method using the ''.join idiom, and it takes essentially the same time as +=.
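One of the comments suggests accumulating the pieces in a list and joining once at the end. A quick sketch of that variant (the name method4 is mine, not part of the original code):

```python
def method4(loop_count):
    """Accumulate the pieces in a list, then join once at the end.
    This does not rely on the CPython in-place resize optimization,
    so it stays linear on any Python implementation."""
    d = {}
    d['key'] = ''.join([str(num) for num in range(loop_count)])
    return d
```

I would expect this to time about the same as method2 and method3; the point is that its linearity is guaranteed rather than dependent on an interpreter optimization.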


As the answers suggest, this seems to be an optimization issue. Further tests appear to confirm it. The code below shows that += on a plain variable reuses many of the string objects, whereas concatenation on a dictionary entry does not:

a=''
for n in range(10):
    print(id(a))
    a+=str(n)

140126222965424
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126042796464
140126042796464

d={}
d['key']=''
for n in range(10):
    print(id(d['key']))
    d['key']+=str(n)

140126222965424
140126042643120
140126042643232
140126042643176
140126042643120
140126042643232
140126042643176
140126042643120
140126042761520
140126042761456

I still wonder why this is so. Thank you!
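The effect does not seem to be specific to dictionaries. As a further check (my own extension of the experiment above, not from the original post), repeating the id() experiment with a list element shows the same behaviour, since the list also keeps a reference to the string:

```python
lst = ['']
for n in range(10):
    print(id(lst[0]))  # ids vary much more than in the plain += case
    lst[0] += str(n)
```

Any container holding a reference to the string appears to defeat the reuse seen with a bare local variable.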

  • Python optimizes some string concatenations. My guess is that "some" does not include += on dictionary items. BTW, you might find it even faster to accumulate them in a list and then join them from there (you'd have to try it). Commented Oct 31, 2018 at 16:53
  • Possible duplicate of How to speed up string concatenation in Python? Commented Oct 31, 2018 at 16:57
  • Yes, because you should assume that string concatenation with += is quadratic. There has been some effort at adding optimizations that avoid quadratic-time behaviour under the hood when you do a loop with += on strings, but it is hard for the interpreter to optimize anything but the most obvious instances. In your case, the interpreter is not able to optimize it. But you shouldn't rely on interpreter optimizations; rather, use the canonical way of joining many strings: ''.join. Commented Oct 31, 2018 at 19:31
  • Updated with join. The same performance as +=. Commented Nov 2, 2018 at 10:06
  • So it seems the issue is an optimization that does not kick in with dictionaries. I suspected something like that. Thank you! I wonder if there is any Python documentation where this behaviour is described... Commented Nov 2, 2018 at 10:09

1 Answer

"E.Coms" linked article points to an old mailing list entry: https://mail.python.org/pipermail/python-dev/2004-August/046686.html which talks about code like:

s = ''
for x in y:
  s += some_string(x)

mentioning:

The question is important because the difference in performance is enormous -- we are not talking about 2x or even 10x faster but roughly Nx faster where N is the size of the input data set.

Interesting, as I'd assumed that

out_str = ''.join(str(n) for n in range(loop_count))

would have been faster than

out_str = ''
for num in range(loop_count):
    out_str += str(num)

but they have the same performance as far as I can measure. I think I made this assumption because of the repeated advice that str.join() is a "good thing".

Not sure if that answers the question, but I found the history interesting!

As indicated by juanpa.arrivillaga below, method1 is slow because storing a reference to the string elsewhere invalidates the above optimisation. There is also an O(n) cost for the extra dictionary lookups, but at these sizes that small cost is dominated by the O(n^2) work of creating n copies of the string, rather than the amortised O(n) behaviour the optimisation permits.
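This can be demonstrated without any container at all: merely holding a second reference to the string defeats the optimisation, because CPython only resizes a string in place when it effectively owns the sole reference at the moment of the +=. A sketch (the function names are mine, and the timing behaviour assumes CPython):

```python
from timeit import default_timer as timer

def concat_plain(loop_count):
    """+= on a local holding the only reference: CPython can
    resize the string in place, so this is roughly linear."""
    out = ''
    for num in range(loop_count):
        out += str(num)
    return out

def concat_aliased(loop_count):
    """The identical loop, but an alias keeps a second reference
    alive, so every += must copy the whole string: quadratic."""
    out = ''
    for num in range(loop_count):
        alias = out  # refcount of `out` is > 1 at the +=
        out += str(num)
    return out

n = 2**16
t0 = timer(); concat_plain(n); t1 = timer(); concat_aliased(n); t2 = timer()
print('plain: {:.3g}s  aliased: {:.3g}s'.format(t1 - t0, t2 - t1))
```

At this size the aliased version should be substantially slower under CPython, matching the dict-entry timings in the question even though no dictionary is involved.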


9 Comments

Yes, because the interpreter is able to optimize that simple loop. Adding the indirection of the dictionary makes the interpreter unable to optimize it. But you should never write code relying on these interpreter optimizations, which were really put into place because too many people cannot be bothered to learn the correct way to do things. So if you are joining many strings, use ''.join.
@juanpa.arrivillaga have updated my answer, hope I explained it usefully
"O(n) cost to doing the extra dictionary lookups as well" um, I don't think so. What do you mean? dict lookups in Python are safely assumed to be O(1). Of course, it is possible that certain dictionary structures will fail at this, but it is one of the most optimized data-structures in the language, especially for anything like str and int objects. Note, many variable references, like global ones are simply dict accesses
I think we're saying the same thing, just using different words! If you've read Knuth's TAOCP I think I might be going into that level of detail — basically unnecessary for answering the OP.
Agreed, speaking informally, one scales linearly, the other quadratically.
