Optimisation – it’s sometimes needed.

Tim Wintle - November 10th, 2009

Here at Team Rubber, we pride ourselves on working in a fairly agile manner. For the viral ad network in particular, this means that the most important thing at the end of an iteration is to ship working features, rather than to wait until everything is perfect before shipping.

As a side effect of this, optimisation is generally left until it becomes a noticeable issue. Obviously we worry about the complexity of our algorithms, and not designing ourselves into a corner – but I’m generally happy to take a constant speed reduction in exchange for faster development.

The code I’ve been working on this iteration was a different story, however, and I thought this story might be useful to people in the same situation.

My first implementation took about six CPU hours (in userspace alone!) to process just 1.5 Gb of data. Sure it scales sub-linearly, but I don’t want to have to bring loads of extra hardware on-line just to support this program.

The first thing to look at was the profiling information. pstats showed me that we were spending over 35 minutes looking up entries in my custom cache class – which is shared between different applications. This has a knock-on effects on the rest of the system as our cached items have to expire within a set time – this 35 minute delay means at least 10,000 extra (expensive) cache misses during this run. Each cache miss takes an average of 0.03 CPU seconds, so that’s an extra five minutes on top

Writing a C extension module to behave in the same way as my python class proved relatively painless – and reduced memory overhead at the same time (a single empty dict takes up about 140 bytes of space).  As I was working in C, I could easily optimise the structure for looking up the most common items, and tune the memory management for my use case.

Since I know this code isn’t going anywhere near windows, I could also inline the gettimeofday() calls into my code – which reduces a few python function calls.

I tried two different methods – a binary tree (which let me drop entire sections of the cache at once), and a prioritised array (i.e. the most requested are highest up). Here are my results (tested over a mixture of lookups and assignment – a simulation of my real data):

Time for 9,000,000 entries:
python
took  44.56 seconds
C (tree)
took  31.20 seconds
C (array)
took  15.74 seconds

That’s about 15 minutes of CPU time saved already on my test data :-)

Next on my list of possible improvements was this little method:

def batchRow(self, rowdata):
    self._batchlist.append(rowdata)
    self._batchsize += 1
    if self._batchsize > self._batchmax:
        self.sendBatch()

It turns out that this single method was taking over 26 minutes of CPU time.

Why does this take so much time? Well I tend to write analysis code in a map and reduce fashion so it can be easily distributed (note: I’m not using a framework like Hadoop for this – I mean that in the functional way). This method is where the mappers add their results to a batch list ready for a reduce step, and it gets called many times for each entry being processed.

I tried re-writing this as such:

def batchRow(self, rowdata):
    self._batchlist.append(rowdata)
    i = self._batchsize + 1
    self._batchsize = i
    if i > self._batchmax:
        self.sendBatch()

, which did give a noticable speed improvement (due to one less property lookup), but not significant enough.

Clearly the easiest things to optimise were the property access, and the  integer comparison – so I turned to Cython to (painlessly) re-write this method on a mixin class as such:

cdef class _MixinClass:
    cdef long _batchmax
    cdef long _batchsize
    cdef object _batchlist
    def batchRow(object self, object rowdata):
        self._batchlist.append(rowdata)
        self._batchsize += 1
        if self._batchsize > self._batchmax:
        self.sendBatch()

How this works:

The “cdef” statements serve two purposes – they provide the (C) type of the variables, and they move the class properties into a C struct.

For example, this generates C that looks something like this:

struct _MixinClass {
  PyObject_HEAD;
  long _batchmax;
  long _batchsize;
  PyObject *_batchlist;
}


, which means that property access becomes very cheap. The integer comparison part above becomes something far more like:

self->_batchsize += 1;
if( self->_batchsize > self._batchmax ){
    // Do method call
}


Results (1,000,000 runs):

python:
Took  0.995 seconds
_MixinClass
Took  0.187 seconds
Subclass of _MixinClass
Took  0.223 seconds

- that’s a five times increase, which I’m more than happy with :-)

2 Responses to “Optimisation – it’s sometimes needed.”

  1. Richard Barrell says:

    The improvement there is quite impressive, especially when it’s almost as great even when accessed through the subclass machinery.

    Next question: since self._batchmax is known in advance, why not use an array of PyObject pointers instead of a Python list for _batchlist? That way, there’d be no array resizing as extra elements get added.

  2. Tim Wintle says:

    … because I’m leaving subclasses the option of setting _batchlist to a different (list-like) class in case I want to do lazy joins in the future.

    (there is actually a descriptor called “batchlist” that lets you get/set the _batchlist variable that I’ve missed out from this)

    For the same reason, I only cast batchlist to (PyObject *) -”object” in Cython.

    I suspect it would be faster by importing listobject.h and casting direct to the list type (Cython’s generated C is actually quite messy around calling the “append” method in case the method doesn’t exist) – but Cython makes use of a lot of gcc optimisations so the effect shouldn’t be too great.

Leave a Reply