Profiling Python C extensions

14/04/2010

In the last few days I was optimizing some code I'd written for PyQuante
(http://pyquante.sourceforge.net/), and I had to do a lot of searching to
find my way around profiling C extensions from Python.

I'll also digress on the various tools I used along the way; feel free to
skip ahead to "The Solution" section.

Everything is tested on Linux; if you need help on other platforms, I'll
see what I can do.

The Test Case

Let's look at a typical example: imagine you have some data (numbers) and
you want to compute something from it.

The data may come from an external source or be generated by Python
control code; in this example we'll read it from XML, but there are many
possible use cases.

Here's a snippet. This code can lock up your computer even with a small
data set (say, 100 numbers):

#!/usr/bin/python
# -*- coding: utf-8 -*-
import math
import xml.etree.ElementTree as ET

def parse_numbers(xmlfile):
    '''
    Parses numeric data from a dummy xmlfile
    '''

    tree = ET.parse(xmlfile)
    root = tree.getroot()
    numbers = []

    for element in root:
        numbers.append(float(element.text))
    return numbers

def stddev(numbers):
    res = 0
    for n in numbers:
        res += math.pow(n, 2)
    return math.sqrt(res)

def heavycrunch(numbers):
    res = 0
    for n1 in numbers:
        for n2 in numbers:
            for n3 in numbers:
                for n4 in numbers:
                    tocalc = (n1, n2, n3, n4)
                    res += stddev(tocalc)
    return res / len(numbers) ** 4

if __name__ == '__main__':
    numbers = parse_numbers('data.xml')
    result = heavycrunch(numbers)
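
For reference, data.xml is just a flat list of numeric elements; something
like this works (the tag names are arbitrary, since parse_numbers simply
reads every child of the root):

<?xml version="1.0"?>
<data>
  <number>0.5</number>
  <number>1.5</number>
  <number>2.5</number>
</data>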

First point: profile and optimize it in Python. It's not uncommon to reach
sufficient speed in pure Python, and staying in Python has a lot of
advantages, like ease of deployment and ease of testing. The cProfile
profiler is pretty good for this.

This command profiles the code and sorts the results by time spent in each
function (excluding subfunctions):

python -m cProfile -s time example.py

         77765214 function calls (77765107 primitive calls) in 173.175 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 12960000   89.673    0.000  156.182    0.000 example.py:33(stddev)
 51840000   53.832    0.000   53.832    0.000 {math.pow}
        1   16.961   16.961  173.143  173.143 example.py:39(heavycrunch)
 12960001   12.677    0.000   12.677    0.000 {math.sqrt}
        1    0.007    0.007    0.008    0.008 sre_compile.py:307(_optimize_unicode)
       61    0.003    0.000    0.004    0.000 ElementTree.py:1075(start)
        1    0.002    0.002  173.174  173.174 example.py:5(<module>)
        2    0.002    0.001    0.002    0.001 {range}
The most important columns:

  • ncalls: number of calls
  • tottime: total time spent in this function, excluding the time spent
    in subfunctions (this is the most important one)
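
You can also drive the profiler from code with the standard library's
cProfile and pstats modules; a minimal sketch (assuming it runs inside
example.py, where both functions are defined):

import cProfile
import pstats

# profile one call, dump the raw stats to a file, then print the ten
# functions with the highest tottime (time excluding subcalls)
cProfile.run("heavycrunch(parse_numbers('data.xml'))", 'heavy.prof')
pstats.Stats('heavy.prof').sort_stats('time').print_stats(10)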

This output tells me that 89 seconds are spent in the inner loop of stddev
(and perhaps in call overhead), 53 seconds in computing the powers, and 16
seconds in the heavycrunch loops.

This is quite optimizable in pure Python; several suggestions come to
mind:

  • Inlining the loop inside stddev
  • Substituting x*x for pow(x, 2)
  • Using sum(map(operator.mul, numbers, numbers)) instead of all these
    loops.

Let's see what happens after editing the relevant lines:

import operator

def heavycrunch(numbers):
    res = 0
    for n1 in numbers:
        for n2 in numbers:
            for n3 in numbers:
                for n4 in numbers:
                    tocalc = (n1, n2, n3, n4)
                    res += sum(map(operator.mul, tocalc, tocalc))**0.5
    return res / (len(numbers)**4)

The profiling output:

         25925214 function calls (25925107 primitive calls) in 79.984 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   34.535   34.535   79.951   79.951 example.py:43(heavycrunch)
 12960000   29.576    0.000   29.576    0.000 {map}
 12960000   15.840    0.000   15.840    0.000 {sum}
        1    0.008    0.008    0.009    0.009 sre_compile.py:307(_optimize_unicode)
       61    0.003    0.000    0.004    0.000 ElementTree.py:1075(start)
        2    0.002    0.001    0.002    0.001 {range}

We have roughly doubled the speed, but we can't squeeze much more out of
it this way.
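
If you want to double-check the speedup outside the profiler (profiling
itself adds overhead), timeit works well. A quick sketch, with
heavycrunch_v1 and heavycrunch_v2 as stand-in names for the original and
edited versions:

import timeit

# one run each is enough, since a single call already takes many seconds
setup = ("from example import parse_numbers, heavycrunch_v1, heavycrunch_v2\n"
         "numbers = parse_numbers('data.xml')")
print(timeit.timeit('heavycrunch_v1(numbers)', setup=setup, number=1))
print(timeit.timeit('heavycrunch_v2(numbers)', setup=setup, number=1))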

Anyway, profiling in Python is not the main purpose of this article :).

Writing the function and the wrapper in C

Let's implement the interesting functions in C, with the header in
example.h and the source in example.c.

Nothing much to say about them; they're just plain C functions.

/* example.h */

double stddev(double *numbers, int len);
double heavycrunch(double *numbers, int len);

/* example.c */

#include <math.h>

double stddev(double *numbers, int len)
{
  double res = 0;
  int i;
  for (i = 0; i < len; i++)
    {
      res += pow(numbers[i], 2);
    }
  return sqrt(res);  /* matches the Python version: sqrt of the sum of squares */
}

double heavycrunch(double *numbers, int len)
{
  double res = 0;  /* must be initialized before the += below */
  double topass[4];
  int i, j, k, l;

  for (i = 0; i < len; i++)
    {
      for (j = 0; j < len; j++)
        {
          for (k = 0; k < len; k++)
            {
              for (l = 0; l < len; l++)
                {
                  topass[0] = numbers[i];
                  topass[1] = numbers[j];
                  topass[2] = numbers[k];
                  topass[3] = numbers[l];

                  res += stddev(topass, 4);
                }
            }
        }
    }
  return res / pow(len, 4);
}

Now we have to implement the wrapper; we will wrap only heavycrunch, since
we don't use stddev externally.

I used Cython to wrap the extension and I suggest you do the same; it
makes the process of wrapping C code straightforward and fun :).

# example_wrap.pyx
from stdlib cimport *  # we need to allocate a double *

cdef extern from "example.h":
    double cheavycrunch "heavycrunch" (double *numbers, int len)

def heavycrunch(numbers):
    cdef double *numarray

    # copy the Python sequence into a plain C array
    numarray = <double *> malloc(sizeof(double) * len(numbers))
    for i, num in enumerate(numbers):
        numarray[i] = num

    res = cheavycrunch(numarray, len(numbers))
    free(numarray)
    return res

OK, sorry for all the code; you'll find everything at the end of the
article so you can run a working example.

Let's compile and run. I used SCons for this, but you can use the
following commands (well, SCons generated them for me):

gcc -o example.os -c -fPIC -I/usr/include/python2.6 example.c
cython -o example_wrap.c example_wrap.pyx
gcc -o example_wrap.os -c -fPIC -I/usr/include/python2.6 example_wrap.c
gcc -o example_wrap.so -shared example_wrap.os example.os -lpython2.6
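
For reference, a minimal SConstruct that should produce an equivalent
build (my original file isn't shown here, so the cython Command rule below
is just a sketch):

# SConstruct
env = Environment()
env.Append(CPPPATH=['/usr/include/python2.6'])

# generate the wrapper C source from the .pyx file
env.Command('example_wrap.c', 'example_wrap.pyx',
            'cython -o $TARGET $SOURCE')

# build example_wrap.so; SHLIBPREFIX='' drops the default 'lib' prefix
env.SharedLibrary('example_wrap', ['example_wrap.c', 'example.c'],
                  LIBS=['python2.6'], SHLIBPREFIX='')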

That's it for the setup; let's test how much speed we've gained.

from example_wrap import heavycrunch

if __name__ == '__main__':
    numbers = parse_numbers('data.xml')
    result = heavycrunch(numbers)

         5213 function calls (5106 primitive calls) in 0.402 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.369    0.369    0.369    0.369 {example_wrap.heavycrunch}
        1    0.007    0.007    0.008    0.008 sre_compile.py:307(_optimize_unicode)
       61    0.004    0.000    0.005    0.000 ElementTree.py:1075(start)
        2    0.002    0.001    0.002    0.001 {range}

Nice! But now the real purpose of the article begins: how do we get
profiling information for what happens *inside* heavycrunch?

If you pass a big XML file with ~100 numbers, we still have a problem in
C, and you may want to profile and optimize it.

gprof – The plain old way

You can try to profile the functions independently of the Python code:
write a main.c file and implement some tests. This means feeding the data
into the function yourself, so you would need a C XML parser.

After that, you can profile your code with the usual C tools (like gprof).

This is a lot of work; "scripting" in C is somewhat tedious and you can
lose a lot of time.

There should be another way to accomplish this, but I couldn't find out
how: compiling the whole Python interpreter with profiling flags. That
could work as well, but it's a bit overwhelming.

The Solution – google-perftools

There's a really nice library out there: google-perftools
(http://code.google.com/p/google-perftools/).

This library is not invasive: no special build is needed, and you can run
the code under whatever conditions you like. Nice!

The library works in this way:

ProfilerStart("logfile.log")
... code to profile ...
ProfilerStop()

It then dumps profiling information about the code in between into
logfile.log.

Question: what if we call these two functions *from Python*?
Answer: we get profiling information about what happens between the two
calls!

So basically we have to install the library and wrap these two functions;
that's quite easy with Cython.

# prof.pyx
cdef extern from "google/profiler.h":
    void ProfilerStart( char* fname )
    void ProfilerStop()

def profiler_start(fname):
    ProfilerStart(<char *>fname)

def profiler_stop():
    ProfilerStop()

Compiling goes something like this:

cython -o prof.c prof.pyx
gcc -o prof.so -shared prof.c -fPIC -lpython2.6 -lprofiler -I/usr/include/python2.6

Now that we have our profiling extension, we can profile and analyze the
output:

from example_wrap import heavycrunch
from prof import profiler_start, profiler_stop

if __name__ == '__main__':
    numbers = parse_numbers('data.xml')
    profiler_start("heavycrunch.log")
    result = heavycrunch(numbers)
    profiler_stop()
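
Since the start/stop pair should stay balanced even if the profiled code
raises an exception, a small context manager helps (a convenience helper
of mine, not part of google-perftools):

from contextlib import contextmanager
from prof import profiler_start, profiler_stop

@contextmanager
def profiled(logfile):
    # runs google-perftools' CPU profiler around a block of code,
    # stopping it even if the block raises
    profiler_start(logfile)
    try:
        yield
    finally:
        profiler_stop()

# usage:
#   with profiled("heavycrunch.log"):
#       result = heavycrunch(numbers)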

OK, we have produced our profiling data; we can analyze it in many ways
(see http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html).
Personally I like the graphical interface of KCachegrind.

pprof --callgrind example_wrap.so heavycrunch.log > heavycrunch.callgrind
kcachegrind heavycrunch.callgrind
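
If you don't have KCachegrind at hand, pprof also has a plain-text mode
that prints a flat profile straight to the terminal:

pprof --text example_wrap.so heavycrunch.log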

The graphical interface is quite intuitive; have a look at the picture.

[Figure: KCachegrind example]

There is a "Flat Profile" panel on the left, with these fields:

  • incl: the "time" (well, it's more like a rate) spent in this
    function, subcalls included
  • self: the "time" spent in the function itself, subcalls excluded;
    this is the most important field
  • The other two are self-explanatory

There's a nice feature that gives you a graphical representation of all
this: on the right, go to "Callee Map". It shows you graphically and
interactively how the time is distributed among the callees.

In our case it's evident that we have to optimize stddev (you had already
guessed it); anyway, we can cut this time down by about 75%. But that is
left as an exercise.

If any reader has questions or suggestions to improve the article, I'd be
happy to answer quickly! Bye!
