Python tail-optimisation using byteplay

Tim Wintle - April 20th, 2009

(I’ll start by emphasising that this is not for production use – it was just a bit of harmless fun while I was looking at the structure of Python’s bytecode, and I thought it might make interesting reading for others.)

There have been quite a few hacks in the past to add tail-call optimisation to python – normally implemented in pure Python. While I was playing with the byteplay module, I thought I’d have a go at writing a function that recompiles a function with basic tail-call optimisation inserted.

My method is basic, and converts tail-recursive calls in a (pure) function into jump statements.

For example, take this (very simple) Python function:

def f(x):
    """pointless sample function"""
    return f(x)

It starts out with bytecode like this:

1 LOAD_GLOBAL          f
2 LOAD_FAST            x
3 CALL_FUNCTION        1
4 RETURN_VALUE

The CALL_FUNCTION opcode can be replaced with STORE_FAST operations (to rebind the arguments) followed by a JUMP_ABSOLUTE back to the first opcode:

2 LOAD_FAST x
3 STORE_FAST x
4 JUMP_ABSOLUTE to 2
5 RETURN_VALUE

Obviously the function above simply gets stuck in an infinite loop, but I’m keeping the sample bytecode short.
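The same rewrite applied by hand at the source level to a terminating tail-recursive function – Euclid’s gcd, one of the functions tested below – looks like this (my own sketch of the equivalent source, not output of the tool):

```python
def gcd(a, b):
    """Euclid's algorithm, tail-recursive."""
    if b == 0:
        return a
    return gcd(b, a % b)

def gcd_loop(a, b):
    """The same function with the tail call turned into a jump by hand."""
    while True:
        if b == 0:
            return a
        # rebinding the parameters corresponds to the STORE_FAST operations;
        # falling back to the top of the loop corresponds to JUMP_ABSOLUTE
        a, b = b, a % b
```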

(The reason you might want to make such a replacement is that it saves the interpreter from creating a new local scope and stack frame for each recursive call, and from unwinding through multiple RETURN_VALUE operations as each inner call returns – and it keeps the stack from growing.)
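The stack growth is the most visible difference: without the optimisation, deep recursion exhausts CPython’s frame stack. A small illustration (a hypothetical `countdown` function, not one of the tested ones; modern Python 3 raises RecursionError where the Python 2 of the time raised RuntimeError):

```python
import sys

def countdown(n):
    """Tail-recursive: consumes one stack frame per call."""
    if n == 0:
        return "done"
    return countdown(n - 1)

def countdown_loop(n):
    """The jump-based equivalent: constant stack depth."""
    while n != 0:
        n -= 1
    return "done"

try:
    # needs roughly twice the allowed number of frames
    countdown(sys.getrecursionlimit() * 2)
    blew_up = False
except RecursionError:  # RuntimeError on Python 2
    blew_up = True
```

The loop version handles the same input without touching the recursion limit.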

My simple function takes a function, decompiles it, looks for tail calls like the one above, and replaces each with opcodes that set the parameter variables and then jump back to the beginning (as above).
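The detection step amounts to scanning the bytecode for a load of the function’s own name followed by a call and an immediate RETURN_VALUE. byteplay itself is Python 2 only, but the scan can be sketched with the standard dis module (opcode names vary between versions – CALL_FUNCTION became CALL in 3.11 – so this is an approximation, and it only inspects the final return path):

```python
import dis

def is_self_tail_call(func):
    """Rough check: does func end by calling itself and immediately
    returning the result?  Misses tail calls on other return paths."""
    ops = list(dis.get_instructions(func))
    if len(ops) < 2 or ops[-1].opname != "RETURN_VALUE":
        return False
    # CALL_FUNCTION before Python 3.11, CALL afterwards
    if not ops[-2].opname.startswith("CALL"):
        return False
    # the function's own (global) name must have been loaded for the call
    return any(o.opname == "LOAD_GLOBAL" and o.argval == func.__name__
               for o in ops)

def f(x):
    return f(x)       # tail call: detected

def g(x):
    return x + 1      # no call in tail position: not detected
```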

I tested this on various recursive functions, and here are the results:

********************
Testing: gcd - 'greatest common divisor - Euclid'
( 59.70% speed increase )
********************
Testing: gcd_dijkstra - 'greatest common divisor - Dijkstra'
( 147.89% speed increase )
********************
Testing: power - 'raise to the power - no square optimisation (lisp-like)'
( 95.60% speed increase )
********************
Testing: quickpower - 'raise to the power - with square optimisation'
( 2.79% speed increase )
********************
Testing: fact - 'Factorial (lisp-like)'
( 115.74% speed increase )
********************
Testing: binary_search - 'Simple binary search of ordered list'
( 25.95% speed increase )
********************
Testing: f - 'pointless sample function'
( 165.97% speed increase )
********************
Testing: ackermann - 'Ackermann-Peter function (doesn't get properly optimised)'
( 1.94% speed increase )
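A comparison along these lines can be reproduced with timeit; here is a minimal harness of my own (the `fact` body is a guess at a "lisp-like" accumulator-style factorial, not the author’s original, and the loop version stands in for the recompiled one):

```python
import timeit

def fact(n, acc=1):
    """Tail-recursive factorial, accumulator ('lisp-like') style."""
    if n <= 1:
        return acc
    return fact(n - 1, acc * n)

def fact_loop(n, acc=1):
    """The same function after the tail-call -> jump transformation."""
    while n > 1:
        n, acc = n - 1, acc * n
    return acc

slow = timeit.timeit(lambda: fact(100), number=1000)
fast = timeit.timeit(lambda: fact_loop(100), number=1000)
print("%.2f%% speed increase" % ((slow / fast - 1) * 100))
```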

For the interested, here are the functions that were tested:

tested functions

There’s a lot more that could be done though – for example, if a variable hasn’t changed between calls then it doesn’t have to be stored again.
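For instance, in a tail call like the following the first argument is passed through unchanged on every call, so the STORE_FAST for it could be elided entirely (a hand-picked illustration of the pattern, not something the tool currently does):

```python
def search(needle, items, i=0):
    """Tail-recursive linear search: 'needle' never changes between calls,
    so rebinding it on each simulated jump would be wasted work."""
    if i >= len(items):
        return -1
    if items[i] == needle:
        return i
    return search(needle, items, i + 1)  # needle passed through unchanged
```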