
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. To do this, I have a recurrence of the form:

M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)

which is inherently sequential, since each entry depends on already-updated neighbours. P1 is some additive constant. My code, which works more or less, is:

import numpy as np

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = np.array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = np.inf
    return M

Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts that cut the time down. However, since I need to evaluate potentially the entire matrix, I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
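A profile like this can be produced with cProfile; here is a minimal sketch, assuming the optimalcost above and a random test matrix (the exact numbers will of course differ per machine):

import cProfile
import numpy as np

test = np.random.rand(400, 400)
# Per-function breakdown: the time spent inside the built-in min() is reported
# separately from the interpreter overhead of the double loop itself.
cProfile.run('optimalcost(test)', sort='cumulative')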

Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration depends on the previous ones, unless there's a smarter way to memoize things.

There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?

I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?

#include <math.h>  /* for fmin */
#define P1 10

void optimalcost(double **costin, double **costout){
    /*
        We assume that costout is initially
        filled with costin's values; the first
        row and column are left untouched.
    */
    double a, b, c;
    int i, j;

    for(i = 1; i < 400; i++){
        for(j = 1; j < 400; j++){
            a = costout[i-1][j]   + P1;   /* M[i-1][j]   + P1 */
            b = costout[i][j-1]   + P1;   /* M[i][j-1]   + P1 */
            c = costout[i-1][j-1];        /* M[i-1][j-1]      */
            costout[i][j] += fmin(a, fmin(b, c));
        }
    }
}
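If the C route ever becomes necessary, the file-IO faff can be avoided entirely by calling the compiled routine from Python with ctypes. A minimal sketch, assuming the function above is built into a shared library (the library name, build command and helper below are illustrative, not a tested setup):

import ctypes
import numpy as np

# Hypothetical build step:  cc -O2 -shared -fPIC optimalcost.c -o liboptimalcost.so -lm
lib = ctypes.CDLL('./liboptimalcost.so')
DoublePtr = ctypes.POINTER(ctypes.c_double)

def as_row_pointers(M):
    # Build the double** the C function expects from a C-contiguous 2D array
    rows = (DoublePtr * M.shape[0])()
    for i in range(M.shape[0]):
        rows[i] = M[i].ctypes.data_as(DoublePtr)
    return rows

costin = np.ascontiguousarray(np.random.rand(400, 400))
costout = costin.copy()              # the C code assumes costout starts as a copy of costin
lib.optimalcost(as_row_pointers(costin), as_row_pointers(costout))
# costout has now been updated in place; no files involved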

Update:

I'm on a Mac, and I don't want to install a whole new Python toolchain, so I used Homebrew.

> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba

New "numba'd" code:

from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height,width = left.shape
    cost = np.zeros((height,width,width))

    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row,x,y] = abs(left[row,x]-right[row,y])

    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(initcost.shape[0]):
        costs[row,:,:] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            # three-way minimum from the recurrence; first row/column left as-is
            M[i,j] += min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
    return M

prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)

print '---------- Numba Time ----------'
t =  time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t

print '---------- Native python Time --'
t =  time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t

It's interesting writing code in Python that is so un-Pythonic. A note for anyone interested in writing Numba code: you need to write the loops out explicitly. Before, I had the neat Numpy one-liner,

abs(left[row,:][:,np.newaxis] - right[row,:])

to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.

Comparing it to native Python code is a bit unfair, because Numpy can do that particular operation pretty quickly anyway, but:

Numba compiled: 0.509318113327s

Native: 172.70626092s

I'm impressed both by the numbers and how utterly simple the conversion is.

  • Is the wrapping around of the indexing intentional or a bug? When you run your code, the first item that gets looked at is i = 0, j = 0, so you get M[0, 0] = M[0, 0] + min(M[-1, -1], M[-1, 0] + P1, M[0, -1] + P1). It seems to me that your try was trying to catch indices out of bounds (you should, by the way, make explicit what you are trying to catch, i.e. do except IndexError), but the -1s in your indices are taken to mean "last element along that dimension" so nothing ever gets set to np.inf. I don't think that's what you want, but please confirm. Commented Dec 6, 2013 at 16:57
  • Yes, unintentional. In practice it doesn't seem to have affected anything which is why I didn't catch it! Commented Dec 6, 2013 at 17:32
  • Note that you actually don't need to expand out all of the loops. In Numba, whether you write native Python for-loops or you write Numpy-based vectorized operations, the Numba JIT will automatically convert either case to the exact same optimized C code. And it's even better at this if you give it some type info with the regular decorators instead of autojit. Commented Dec 6, 2013 at 19:26
  • I found that this was fine in general, but the one-line Numpy case I gave only improved once I wrote out the loops. Commented Dec 7, 2013 at 0:44
  • +1 for excellent self-contained application of numba. Just what I was looking for. I've taken the liberty of correcting indentation and a few other things in the "numba'd" code. Commented Mar 1, 2017 at 10:36

2 Answers


If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic programming algorithm would probably offer a lot of speedup without making you leave Python.
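One of the comments on the question notes that giving Numba explicit type information through the regular jit decorator (rather than autojit) can help it further. A minimal sketch of what that might look like for the optimalcost routine; the signature syntax and what compiles in nopython mode vary between Numba versions, so treat this as illustrative:

from numba import jit
import numpy as np

# Hypothetical typed version of the question's optimalcost:
# 'f8[:,:](f8[:,:])' declares a float64 2D array in, float64 2D array out.
@jit('f8[:,:](f8[:,:])')
def optimalcost_typed(cost):
    width1, width2 = cost.shape
    P1 = 10
    M = cost.copy()
    for i in range(1, width1):
        for j in range(1, width2):
            M[i,j] += min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
    return M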


1 Comment

Thanks for the suggestion, looks like it's cut the runtime down to something very manageable. I think to get it much lower is going to require tinkering in C, but this will do! I didn't bother with Anaconda, I've installed it via brew/pip. See my edit for details.

Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest (or lowest-energy) path above, you can vectorize the problem by thinking about what can be computed at the same time, and by avoiding copies where possible.

Suppose we are finding the shortest path in the "row" direction (i.e. horizontally). We can first create our algorithm input:

import numpy as np

# The problem: 300 400*400 matrices
# Create an infinitely high boundary so that we don't need to handle index "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf

then prepare some utility arrays which we will use later (creation takes constant time):

# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(
    a,
    (a.shape[1], len(a), a.shape[2]-2, 3),
    (a.strides[1], a.strides[0], a.strides[2], a.strides[2]))
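# (Optional sanity check on the strided view: A[i, n, j, k] aliases a[n, i, j+k],
#  i.e. window position k of column j in row i of batch element n.)
assert A[5, 7, 11, 2] == a[7, 5, 11 + 2]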

# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that the first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)

# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')

and finally do the computation and time it:

%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)   # three-way minimum over (left, middle, right) of row i, written into T
    B[i] += T        # accumulate into row i+1, for all 300 matrices at once

The timing result on my (super old laptop) machine is 1.78s, which is already way faster than 3 minutes. I believe you can improve even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis), as sketched below.
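A minimal sketch of that multiprocessing.Pool route, reusing the loop-based optimalcost from the question; each of the 300 matrices is an independent task (shapes and pool size here are just illustrative):

import numpy as np
from multiprocessing import Pool

def optimalcost(cost, P1=10):
    # Same recurrence as in the question, applied to one 2D matrix
    M = np.array(cost)
    h, w = M.shape
    for i in range(1, h):
        for j in range(1, w):
            M[i,j] += min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
    return M

if __name__ == '__main__':
    batch = np.random.rand(300, 400, 400)
    pool = Pool()                                   # defaults to one worker per core
    results = pool.map(optimalcost, list(batch))    # one matrix per task
    pool.close()
    pool.join()
    costs = np.array(results)                       # shape (300, 400, 400)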

