Comments on Technical Discovery: Speeding up Python Again
Blog post by Travis Oliphant; 14 comments.

farhan (2012-03-26):
I'm usually looking around the internet for articles that can help me. Thank you; this one is extremely helpful to me. Would you mind updating your blog with more information? (source: www.wbupdates.com)

Lauraine (2011-11-17):
Thanks for sharing your info. I really appreciate your efforts and will be waiting for your further write-ups. Thanks once again.
WINDOWS PHONE 7 DEVELOPMENT (http://WINDOWSPHONE7DEVELOPMENT.NET)

peda (2011-10-27):
Hi, I was wondering why C++/Weave turned out to be so much slower than Fortran in your tests.
There are two major differences:
* the Weave code uses functions for indexing the array, probably including bounds checks etc.
* the Fortran code is compiled with optimizations

After optimizing the code, Weave beats Fortran on my machine:
Looped Fortran:     2.050914 seconds
Vectorized Fortran: 1.505481 seconds
Weave:              1.351250 seconds
(Weave with OpenMP: 0.716219 seconds)

Some notes:
* This test bench does not call the Weave update before running the actual timing test, which means Weave will be slow the first time you use it, since it needs to compile and cache the code first. So you have to ignore the first run.
* The multi-core-aware solution is easy (@MySchizoBuddy)! I just added a single OpenMP line and some compiler options, and the calculation completed almost 2x faster on my AMD quad core. Change "fno-openmp" to "fopenmp" to test the OpenMP solution.
* When changing compiler flags such as "fopenmp", be aware that Weave does NOT recompile the code, since the code string itself did not change. To test different compiler flags, you have to clear your code cache (~/.python27_compiled).
* There is a bug in gcc that prevents vectorization when OpenMP is enabled: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
You will notice this when looking at the verbose gcc output.
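For readers comparing these kernels: the sweep being timed is the Laplace relaxation update from the post, which can be sketched in plain NumPy in both looped and vectorized forms. A minimal sketch; the grid size, spacing, and boundary values below are illustrative assumptions, and both variants read only the previous grid, so their results agree exactly:

```python
import numpy as np

def slow_update(u, dx2, dy2):
    """One relaxation sweep with explicit Python loops (the part the
    compiled Weave/Fortran kernels replace)."""
    new = u.copy()
    ny, nx = u.shape
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            new[i, j] = ((u[i - 1, j] + u[i + 1, j]) * dy2 +
                         (u[i, j - 1] + u[i, j + 1]) * dx2) / (2.0 * (dx2 + dy2))
    return new

def vectorized_update(u, dx2, dy2):
    """The same sweep expressed as whole-array NumPy slice operations."""
    new = u.copy()
    new[1:-1, 1:-1] = ((u[:-2, 1:-1] + u[2:, 1:-1]) * dy2 +
                       (u[1:-1, :-2] + u[1:-1, 2:]) * dx2) / (2.0 * (dx2 + dy2))
    return new

# Small grid with an assumed boundary condition: top edge held at 1.0.
u = np.zeros((20, 20))
u[0, :] = 1.0
dx2 = dy2 = 0.1 ** 2
assert np.allclose(slow_update(u, dx2, dy2), vectorized_update(u, dx2, dy2))
```

The inner double loop is what dominates the pure-Python timing; Weave, Fortran, and the vectorized NumPy form all compute the same slice expression per sweep.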
So once this bug is fixed, one could check out the fully optimized, vectorized, parallelized calculation. :D

And this is the code: http://pastebin.com/MNfhWjSm

MySchizoBuddy (2011-09-24):
Is either of these solutions multi-core aware?

Guilherme Ferrari (2011-09-20):
Hi, could anyone include a comparison between Weave and Instant? Thanks!

srepmub (2011-07-09):
Compiling laplace2.py using 'shedskin -b' gives a speedup of about 100x on my PC (http://shedskin.googlecode.com).

Anonymous (2011-07-07):
If you run laplace_for.f90, make sure you change N to 150 and use the for_update2 (vectorized) function in there (for_update1, a simple loop, is the default). Then I get 0.500 s on my computer. For the NumPy solution I get 5.56 s, which seems comparable to your timings. So "pure vectorized Fortran" (0.5 s) improves on the "vectorized Fortran" (1.42 s) timing almost three times.
(f2py fails to compile on my computer due to some missing main() routine, so I can't easily check it myself.)

Let me know if someone can reproduce my pure Fortran timing, i.e. being 10x faster than NumPy, so that I don't jump to conclusions too fast.

Anonymous (2011-07-07):
Hi Travis, you should also mention that the loop and the vectorized loop are *not* mathematically equivalent, as printing, for example, the sum of the values of the array shows. You would need to greatly increase the number of iterations; eventually they converge to the same values, but not for Niter=8000. This was confusing to me when I was testing the Fortran solution.

Another note: the pure Fortran solution still seems quite a bit faster than the Python + f2py + f90 solution. You might say that the main loop should always be in Python to be fair, but I would argue that since PyPy can do some magic (?) optimizations even for the main loop, it makes sense to really see what the maximum speed is. One can just run the laplace_for.f90 example in the scipy/speed repository.

Alex (2011-07-05):
For what it's worth, if I switch this to use array.array("d", [0.0]) * N when creating the arrays, it goes about 60% faster on my machine.
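The non-equivalence mentioned above has a simple source: an in-place loop updates each cell using neighbors already updated in the same sweep (Gauss-Seidel), while the vectorized code reads only the previous sweep's values (Jacobi). A minimal sketch, with grid size, spacing, and boundary values assumed for illustration:

```python
import numpy as np

def inplace_sweep(u, dx2, dy2):
    """Loop version updating in place: each cell sees neighbors that were
    already updated during this same sweep (Gauss-Seidel)."""
    ny, nx = u.shape
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            u[i, j] = ((u[i - 1, j] + u[i + 1, j]) * dy2 +
                       (u[i, j - 1] + u[i, j + 1]) * dx2) / (2.0 * (dx2 + dy2))
    return u

def jacobi_sweep(u, dx2, dy2):
    """Vectorized version: every cell reads only the previous sweep's values."""
    new = u.copy()
    new[1:-1, 1:-1] = ((u[:-2, 1:-1] + u[2:, 1:-1]) * dy2 +
                       (u[1:-1, :-2] + u[1:-1, 2:]) * dx2) / (2.0 * (dx2 + dy2))
    return new

dx2 = dy2 = 0.1 ** 2
u = np.zeros((10, 10))
u[0, :] = 1.0  # assumed boundary condition
a = inplace_sweep(u.copy(), dx2, dy2)
b = jacobi_sweep(u.copy(), dx2, dy2)
# After a single sweep the array sums already differ.
print("in-place sweep sum:", a.sum())
print("Jacobi sweep sum:  ", b.sum())
```

Both schemes converge to the same steady state, but Gauss-Seidel typically needs fewer sweeps, which is why the sums differ at a fixed Niter.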
As Ismael mentioned, this will eventually be handled basically automatically.

Prabhu Ramachandran (2011-07-05):
Thanks for posting the code on GitHub. The PyPy results are amazing. At some level this changes everything.

Ismael (2011-07-05):
You should add Cython without NumPy.

There is work going on in PyPy to optimize lists of objects of the same type to be as fast as arrays, so you can expect further speedups in the future. :)

Anonymous (2011-07-05):
Hi Travis. I've been going through a similar test after my EuroPython High Performance Python tutorial; I've written up a v0.1 doc here:
http://ianozsvald.com/2011/06/29/high-performance-python-tutorial-v0-1-from-my-4-hour-tutorial-at-europython-2011/
and v0.2 should be in the works in a couple of weeks. I'll be taking Maciej's advice and trying arrays in PyPy (and hopefully testing the micronumpy lib too). I'm impressed (as you are) with the improvements the PyPy team are bringing out!
The trunk version is even faster than the official v1.5.
Ian.

Richard Lincoln (2011-07-05):
Does the impressive performance of PyPy suggest that there may be value in a pure-Python NumPy? It would run on IronPython!

Richard

Maciej Fijalkowski (2011-07-05):
You would get much better performance in PyPy if you used NumPy arrays or array.array instead of a list of lists.

Also, for obscure reasons, a while loop is a tiny bit faster than a "for i in xrange(..)" one.

Cheers,
fijal
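The switch Alex and Maciej describe can be sketched as follows: a list of lists stores boxed Python floats, while a flat array.array("d", ...) stores raw C doubles that PyPy's JIT can access without unboxing. A small sketch; the grid size, flat (i, j) -> i * N + j indexing, and the boundary fill are illustrative assumptions:

```python
from array import array

N = 100

# List of lists: every element is a separately allocated Python float.
grid = [[0.0] * N for _ in range(N)]

# Flat array.array of raw doubles, built with the trick mentioned above:
# repeating a one-element array; cell (i, j) lives at index i * N + j.
u = array("d", [0.0]) * (N * N)

# Fill the top boundary row with 1.0 using a while loop, which the
# comment above reports as slightly faster than "for i in xrange(..)"
# on PyPy at the time.
j = 0
while j < N:
    u[j] = 1.0
    j += 1
```

The flat layout also keeps the whole grid in one contiguous buffer, which is what later allowed PyPy's micronumpy work to match typed-array performance.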