(I’m going to start off by emphasizing that this is not for production use, it is just a bit of harmless fun while I was looking at the structure of python’s bytecode, and I thought it might be interesting reading for others)
There have been quite a few hacks in the past to add tail-call optimisation to python – normally in cross-interpreter python, but while I was playing with the byteplay module thought I’d have a go at writing a function that re-compiles a function with basic tail-call optimisation inserted.
My method is basic, and converts tail-recursive calls in a (pure) function into jump statements.
For example, the (very simple) python function:
def f(x):
"""pointless sample function"""
return f(x)
Starts out with bytecode like this:
1 LOAD_GLOBAL f 2 LOAD_FAST x 3 CALL_FUNCTION 1 4 RETURN_VALUE
The CALL_FUNCTION opcode can be replaced with a JUMP_ABSOLUTE to the first opcode:
2 LOAD_FAST x 3 STORE_FAST x 4 JUMP_ABSOLUTE to 2 5 RETURN_VALUE
Obviously the function above simply gets stuck in an infinite loop, but I’m keeping the sample bytecode short.
(The reason you might want to make such a replacement is to prevent the interpreter from having to create a new local scope for the new function call on the stack, having to jump through multiple RETURN_VALUE operations as it returns from each inner function call, and to reduce the size of the stack.)
My simple function takes a function, decompiles it, looks for tail calls like the above and replaces the tail calls with opcodes to set the parameter variables, and then loop back to the beginning (as above).
I tested this on various recursive functions, and here are the results:
******************** Testing: gcd - 'greatest common devisor - Euclid' ( 59.70% speed increase ) ******************** Testing: gcd_dijkstra - 'greatest common devisor - Dijkstra' ( 147.89% speed increase ) ******************** Testing: power - 'raise to the power - no square optimisation (lisp-like)' ( 95.60% speed increase ) ******************** Testing: quickpower - 'raise to the power - with square optimisation' ( 2.79% speed increase ) ******************** Testing: fact - 'Factorial (lisp-like)' ( 115.74% speed increase ) ******************** Testing: binary_search - 'Simple binary search or ordered list' ( 25.95% speed increase ) ******************** Testing: f - 'pointless sample function' ( 165.97% speed increase ) ******************** Testing: ackermann - 'Ackermann-Peter function (doesn't get properly optimised)' ( 1.94% speed increase )
For the interested, here are the functions that were tested:
There’s a lot more that could be done though – for example if variables haven’t been changed on the stack then they don’t have to be saved again.












Interesting experiment!
Even though I believe tail-recursive calls are pretty rare in real-world scenarios, it would be interesting to know which percentage of the calls are eligible for TCO in a typical Python program. Would you happen to have the numbers at hand?
Also, reusing the frame is not going to be possible when it is not wide enough to hold all the variables of the callee. I’ll probably toy a bit with the idea in our Python JIT when I get the opportunity; tell me if you would like to hear about the results.
I’d definitely be interested in hearing your results.
For what it’s worth, these results were on python 2.6.
Crosstwine looks interesting, but unfortunately bytecodehacks isn’t available for python 3 (and 2to3 hasn’t managed an automatic conversion), so I haven’t been able to quickly test if this still offers a performance increase on crosstwine.
I don’t know what kind of JIT you are using, but I think that a tracing JIT could be modified automatically do most of the optimisation I’ve done – I started having a look at the tracing JIT in the newest version of pypy to see if it does, but I haven’t managed to get my head around the code well enough yet.
I’m afraid I don’t have any stats about how often tail-recursive calls are likely to be used, although there is also the wider variety of calls that could be converted to one.
e.g.
def f(x):
return x*f(x-1)
(which can easily be converted to the lisp-like version used above)
Hi Tim,
I’ll keep you in the loop. And yes, it’s a pity we only support 3.x for now (we were not expecting to have an usable engine so “early,” and were hoping for a faster uptake of the new version). In fact, that makes it very difficult for us to cut our teeth on real-world examples; I guess I’ll just bite the bullet and integrate with 2.6 too.
CrossTwine Linker does not currently include a tracing JIT, it’s more of a “classical” design. I’m not sure a tracing JIT buys you a lot on this particular front, though: the compiler still has to recognize the situation to be able to reuse frames (TANSTAAFL). On the other hand, you are right that it might be able to automatically inline—and possibly coalesce—frame creation, and to peephole the sequence of returns away.
Truth to be told, TCO (and inlining in general) are not “free” optimizations, especially in a dynamic language where a program can examine (and manipulate!) its call stack at any time. I’ll probably hack the Python compiler to do analysis on a couple large applications before I venture down that path…
P.-S. — As you mentioned, your “factorial” function is not eligible for TCO, which would require an accumulator to be used. I’m not aware of any (non-research) compiler which does that kind of transformation behind the scenes, though, which leads me to think it’s not worth the trouble. (But then, I’ve been wrong before.