2012-12-09

How much of Python can be written in Python?


Now I don't mean in the PyPy sense where you can bootstrap yourself with another Python installation. No, I'm talking about the case where all you have is a checkout of the CPython repository and a C compiler. How far could you go in writing stuff for Python in Python and not C? (My own motivation is maintainability; for others it might be ease of extensibility.) In Python 3.3 we now have import written in Python. Technically the main import loop that is used is implemented in C to save 5% at startup, but that is entirely optional as equivalent pure Python code still exists. It's actually faster than the C version from Python 3.2 thanks to directory content caching. So it is not entirely ridiculous to think about how far one could push the idea of replacing C code in CPython with Python code.
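To make the import point concrete, here is a small sketch that exercises importlib directly. The module name fresh_module_demo is made up for the example. Because the finders cache directory contents, a module file created after the interpreter started may not be noticed until importlib.invalidate_caches() is called, so the sketch calls it before importing.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "fresh_module_demo"  # hypothetical name, exists only for this demo

with tempfile.TemporaryDirectory() as directory:
    sys.path.insert(0, directory)
    try:
        # Create a module on disk after the interpreter has started.
        Path(directory, MODULE_NAME + ".py").write_text("ANSWER = 42\n")
        # Finders cache directory listings, so ask them to forget stale ones.
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        answer = module.ANSWER
        # importlib.__import__ is the Python-level counterpart of the
        # built-in __import__(); here it just returns the cached module.
        same_module = importlib.__import__(MODULE_NAME)
    finally:
        sys.path.remove(directory)
        sys.modules.pop(MODULE_NAME, None)
```

After this runs, answer is the value read from the freshly written module, and same_module is the very object import_module() returned.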

What restrictions do we have for this thought experiment? One is that CPython needs to continue to be performant. That means either that the feature is not executed constantly or can be made to work as close to C code as possible. The other requirement is that it can't really have dependencies on the stdlib beyond built-in modules. Since this concept works based on freezing Python bytecode into C-level char arrays you don't want to have to pull in half the stdlib just to make something work. But that's pretty much it.
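As a rough illustration of what "freezing" means, the sketch below compiles some source, marshals the code object, and renders the bytes as a C array. The helper names and the _frozen_demo identifier are my own for the example and are not what CPython's freeze tooling actually emits.

```python
import marshal


def freeze_to_c_array(source, module_name):
    """Compile source and render the marshalled code object as a C array."""
    code = compile(source, f"<frozen {module_name}>", "exec")
    data = marshal.dumps(code)
    body = ",".join(str(byte) for byte in data)
    # Hypothetical naming scheme, chosen only for this sketch.
    return f"static const unsigned char _frozen_{module_name}[] = {{{body}}};"


def load_frozen(data):
    """Unmarshal bytecode and run it in a fresh namespace."""
    namespace = {}
    exec(marshal.loads(data), namespace)
    return namespace


SOURCE = "def greet(name):\n    return 'hello ' + name\n"
c_source = freeze_to_c_array(SOURCE, "demo")

# Round-trip the same bytes to show that they really are runnable bytecode.
frozen_bytes = marshal.dumps(compile(SOURCE, "<frozen demo>", "exec"))
namespace = load_frozen(frozen_bytes)
```

Anything the frozen code imports has to be available at that point too, which is why depending on much of the stdlib gets expensive fast.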

The first possibility is the parser. Whether you generated the parser, as CPython does (its parser has not really changed much since Guido wrote it way back when), or wrote a recursive descent one by hand, it could probably be written in Python. The real problem is how much performance would suffer. Now if you are working off of bytecode files then this really is only a one-time cost per bytecode file creation. But if you are working primarily with modules that you specify on the command line then they get parsed every time you invoke the interpreter, and that could be costly if you can't get performance to be good enough.
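For a feel of what a hand-written recursive descent parser looks like in Python, here is a minimal sketch for a toy arithmetic grammar. It is nowhere near Python's own grammar, and the tuple-based tree it returns is just a convenient stand-in for a real syntax tree.

```python
import re

# A number, or any single other character (operators, parentheses).
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")


def tokenize(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUM", int(number)))
        elif op.strip():  # skip stray whitespace
            tokens.append(("OP", op))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", None)

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        # term := atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = (op, node, self.atom())
        return node

    def atom(self):
        # atom := NUM | '(' expr ')'
        kind, value = self.advance()
        if kind == "NUM":
            return value
        if (kind, value) == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek()[0] != "EOF":
        raise SyntaxError("trailing input")
    return tree


tree = parse("1 + 2 * (3 - 4)")
```

Each grammar rule becomes one method, which is what makes this style pleasant to maintain. Every token costs several Python-level calls, though, and that is exactly where the performance worry comes from.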

Going down the compiler chain, you could also go from CST (concrete syntax tree) to AST (abstract syntax tree) in pure Python. You can already get to the CST from the parser module, so the work to expose the CST at the Python level is done. And with the ast module already exposed it then becomes a matter of creating the AST nodes from the CST. But once again, it's a question of performance since this is invoked every time source code is compiled.
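Creating AST nodes from Python code is straightforward with the ast module. The sketch below uses a hand-written nested tuple as a stand-in for a concrete syntax tree (an assumption made to keep the example self-contained) and converts it into ast nodes that the built-in compile() accepts. It relies on ast.Constant, so it needs a reasonably recent Python 3.

```python
import ast

# Map operator tokens to the AST classes that represent them.
OPERATORS = {"+": ast.Add, "-": ast.Sub, "*": ast.Mult, "/": ast.Div}


def to_ast(node):
    """Convert a (op, left, right) tuple tree into ast nodes."""
    if isinstance(node, int):
        return ast.Constant(value=node)
    op, left, right = node
    return ast.BinOp(left=to_ast(left), op=OPERATORS[op](), right=to_ast(right))


def build_expression(tree):
    expression = ast.Expression(body=to_ast(tree))
    # compile() insists on line and column info; fill in defaults.
    return ast.fix_missing_locations(expression)


tuple_tree = ("+", 1, ("*", 2, ("-", 3, 4)))  # stand-in for a CST
expression_ast = build_expression(tuple_tree)
result = eval(compile(expression_ast, "<tuple tree>", "eval"))
```

Here result is the value of 1 + 2 * (3 - 4), computed from nodes that were assembled entirely in Python.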

Next would be transforming the AST to bytecode. The AST is already exposed to Python code, so once again the initial work for access is done. But also once again there is the question of performance, as this step is on the critical path if you are continually compiling Python source code because you are executing scripts instead of importing code that was previously stored as a bytecode file.
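To give a flavour of the AST-to-bytecode step, here is a sketch that walks an AST and emits instructions for a made-up stack machine. The opcode names (PUSH, ADD, SUB, MUL) are invented for the example and are not CPython's bytecode; a real compiler also has to deal with scopes, jumps, and much more.

```python
import ast

# Hypothetical opcode names for a toy stack machine.
BINARY_OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def compile_expr(node, code=None):
    """Emit (opcode, argument) pairs for an arithmetic expression AST."""
    code = [] if code is None else code
    if isinstance(node, ast.Expression):
        compile_expr(node.body, code)
    elif isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
        # Post-order walk: operands first, then the operator.
        compile_expr(node.left, code)
        compile_expr(node.right, code)
        code.append((BINARY_OPS[type(node.op)], None))
    else:
        raise NotImplementedError(type(node).__name__)
    return code


instructions = compile_expr(ast.parse("1 + 2 * 3", mode="eval"))
```

The post-order walk is the core idea: by the time an operator instruction is emitted, the code that pushes both of its operands is already in the list.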

You can't do anything for the interpreter eval loop as that becomes a bootstrap issue. If you really wanted to push this you could do a basic eval loop to bootstrap a more complex one, but that seems like more work than it's worth.
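For a sense of what a "basic eval loop" amounts to, here is a toy one for a made-up stack machine. The instruction format and opcode names are assumptions for the sketch; CPython's real loop handles frames, exceptions, and well over a hundred opcodes.

```python
import operator

# Hypothetical opcodes mapped to the operations they perform.
HANDLERS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def run(instructions):
    """Execute (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0  # index of the next instruction
    while pc < len(instructions):
        opcode, arg = instructions[pc]
        pc += 1
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode in HANDLERS:
            right = stack.pop()
            left = stack.pop()
            stack.append(HANDLERS[opcode](left, right))
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return stack.pop()


# Equivalent to 1 + 2 * 3.
program = [("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("MUL", None), ("ADD", None)]
result = run(program)
```

Of course something still has to execute run() itself, which is the bootstrap problem in a nutshell.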

I suspect most of Python's builtins could be re-implemented in pure Python without any trouble. Re-implementing something like any(), map(), etc. is not exactly difficult. In this instance, though, performance definitely becomes a key issue due to the extensive use of builtin functions. And in the case of exceptions you have to worry about the C API surrounding them on top of any possible performance issue from exception raising (although I'm willing to bet this can easily be alleviated by just caching at the interpreter level the builtin exception classes so that at the C level it's still just PyObject pointers instead of having to extract them dynamically every time from the builtin module).
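As a quick sketch of how little code some builtins need, here are pure Python stand-ins for any() and map(). The py_ prefix is only there to avoid shadowing the real ones, and py_map is a simplification: being a generator, it does not raise TypeError up front when called with no iterables the way the builtin does.

```python
def py_any(iterable):
    """Return True if any element is truthy, stopping at the first hit."""
    for element in iterable:
        if element:
            return True
    return False


def py_map(function, *iterables):
    """Lazily apply function across iterables, stopping at the shortest."""
    for args in zip(*iterables):
        yield function(*args)


squares = list(py_map(pow, [2, 3, 4], [2, 2, 2]))
found = py_any(n > 10 for n in squares)
```

The logic is trivial; the cost is that every iteration now runs through the eval loop instead of a tight C loop.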

And as always, no module in the stdlib has to be implemented in C code if it doesn't wrap other C code. In that case it is simply a matter of taking the time to either copy over and get working the pure Python versions of modules that other VMs have written, or to write one from scratch. But thanks to PEP 399 this is only an issue for pre-existing modules. That is also why no one has bothered to backfill all of those modules: the other VMs have already done the work for themselves, so no one really needs this to happen. I opened issue 16651 to find out exactly what modules don't have a pure Python version.
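The usual shape of a module that has both a pure Python version and an optional C accelerator looks roughly like the sketch below. The function and the _bitcount_accel module are made up for the example; the point is only the try/except import at the end.

```python
def count_set_bits(value):
    """Pure Python reference implementation: number of 1 bits in value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    count = 0
    while value:
        value &= value - 1  # clear the lowest set bit
        count += 1
    return count


try:
    # Hypothetical C accelerator; if it exists, its version replaces
    # the pure Python one defined above.
    from _bitcount_accel import count_set_bits
except ImportError:
    # No accelerator available: keep the pure Python definition.
    pass
```

On a VM without the accelerator the import fails quietly and the module still works, just more slowly.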

In other words, various possibilities exist for writing more of CPython in pure Python, but performance considerations will quite possibly make them not worth pursuing (though I would be quite happy to be proved wrong =).