I always wondered how Python can be one of the world's most popular languages without anyone (any company) stepping up and making the runtime as fast as modern JavaScript runtimes.
For the lazy who just want to know if this makes Python faster yet, this is foundational work to enable later improvements:
The initial benchmarks show something of a 2-9% performance improvement.
I think that whilst the first version of this JIT isn’t going to seriously dent any benchmarks (yet), it opens the door to some huge optimizations and not just ones that benefit the toy benchmark programs in the standard benchmark suite.
From the write-up, I honestly don't understand how this paves the way. I don't see an architectural path from a copy-and-patch JIT to an optimizing one; skipping that kind of analysis is the whole point of a copy-and-patch JIT.
Hasn't Python allowed type specifiers (type hints) since 3.5, albeit ones the CPython interpreter ignores? The JIT might take advantage of them, which ought to improve performance significantly for some code.
What makes Python flexible is what makes it slow. Restricting that flexibility where possible offers opportunities to improve performance (and allows tools and humans to spot errors more easily).
Isn't CL a good counter-example to that "dynamism inherently stunts performance" mantra?
To the contrary. In CL some flexibility was given up (compared to other LISP dialects) in favor of enabling optimizing compilers, e.g. the standard symbols cannot be reassigned (also preserving the sanity of human readers). CL also offers what some now call 'gradual typing', i.e. optional type declarations. And remaining flexibility, e.g. around the OO support, limits how well the compiler can optimize the code.
But type declarations in Python are not required to be correct, are they? You are allowed to write
    def twice(x: int) -> int:
        return x + x

    print(twice("nope"))
and it should print "nopenope". Right?
Yep. Therefore it's better to write:
    def twice(x: int) -> int:
        if not isinstance(x, int):
            raise TypeError("Expected x to be an int, got " + str(type(x)))
        return x + x
Surely this is the job for a linter or code generator (or perhaps even a hypothetical ‘checked’ mode in the interpreter itself)? Ain’t nobody got time to add manual type checks to every single function.
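For illustration, here's a toy sketch of what such a "checked mode" could look like in userland: a hypothetical enforce_annotations decorator (not an existing library, just an assumption for this example) that reads the annotations and performs the isinstance checks so nobody has to hand-write them in every function:

    import functools
    import inspect

    def enforce_annotations(func):
        # Toy "checked mode": validate annotated arguments at call time.
        sig = inspect.signature(func)
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = func.__annotations__.get(name)
                if isinstance(expected, type) and not isinstance(value, expected):
                    raise TypeError(f"{name} should be {expected.__name__}, "
                                    f"got {type(value).__name__}")
            return func(*args, **kwargs)
        return wrapper

    @enforce_annotations
    def twice(x: int) -> int:
        return x + x

    twice(21)        # fine
    twice("nope")    # raises TypeError instead of returning "nopenope"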
This can have substantial performance implications, not to mention DX considerations.
Or use mypy.
AFAIK good JITs like V8 can do runtime introspection and recompile on the fly if types change. Maybe using the type hints will be helpful but I don't think they are necessary for significant improvement.
Are there any benchmarks that give an idea of how much this might improve Python's speed?
PyPy is a different JIT that gives anything from a slowdown or no change to a 100x speedup, depending on the benchmark. They give a geometric mean of 4.8x speedup across their suite of benchmarks. https://speed.pypy.org/
Well, GraalPython is a Python JIT compiler which can exploit dynamically determined types, and it advertises 4.3x faster, so it's possible to do drastically better than a few percent. I think that's state of the art but might be wrong.
That's for this benchmark:
https://pyperformance.readthedocs.io/
Note that this is with a relatively small investment as these things go; the GraalPython team is ~3 people, I guess, looking at the GH repo. It's an independent implementation, so most of the work went into being compatible with Python, including native extensions (the hard part).
But this speedup depends a lot on what you're doing. Some types of code can go much faster. Others will be slower even than CPython, for example if you want to sandbox the native code extensions.
Doesn't Python already do this? https://www.youtube.com/watch?v=shQtrn1v7sQ
You can't really rely on type annotations to help interpret the code.
I doubt it, at least with a copy-and-patch JIT the way they work now. I'm a serious mypy/python-static-types user, and as-is the type hints wouldn't let you do much optimization-wise:
- All integers are still big integers
- Use of the typing opt-out 'Any' is very common
- All functions/methods can still be overwritten at runtime
- Fields can still be added and removed from objects at runtime
The combination basically rules out native arithmetic, forces everything onto the heap, and requires multiple levels of indirection for looking up any variable, field, or function. A CPU-performance nightmare. You need a real optimizing JIT to track when integers stay in a narrow range and when things aren't getting redefined at runtime.
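To make the list above concrete, here's a minimal illustration (plain Python, nothing CPython-specific) of the dynamism being described; everything after the first call is legal, which is why a JIT can only specialize if it also emits guards and a deoptimization path:

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    def norm(p):
        return (p.x ** 2 + p.y ** 2) ** 0.5

    p = Point(3, 4)
    norm(p)                                    # 5.0

    # All of this is legal at runtime:
    p.z = 7                                    # fields added after construction
    norm = lambda p: 0.0                       # module-level function rebound
    Point.__init__ = lambda self, x, y: None   # methods overwritten
    p.x = 10 ** 100                            # ints silently become big integers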
I don't see an architectural path from a copy-and-patch JIT to something optimizing.
One approach used in V8 is to have a dumb-but-very-fast JIT (i.e. this), keep counters of how often each block of code runs (perhaps actual counters, perhaps using CPU sampling features), and then run any block of code executed more than a few thousand times through a far more complex yet slower optimizing JIT.
That has the benefit that the 0.2% of your code which uses 95% of the runtime is the only part that has to undergo the expensive optimization passes.
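As a rough sketch of the counter idea (a toy decorator, not how V8 or CPython actually implement tier-up), imagine something like:

    HOT_THRESHOLD = 1000   # arbitrary; real VMs tune this per tier

    def tiered(func):
        # Count calls; once the function is "hot", pretend to hand it to a
        # slower-but-smarter optimizing tier (here it's just a print).
        state = {"calls": 0, "promoted": False}
        def wrapper(*args, **kwargs):
            state["calls"] += 1
            if not state["promoted"] and state["calls"] >= HOT_THRESHOLD:
                state["promoted"] = True
                print(f"{func.__name__} is hot; recompile with the optimizing tier")
            return func(*args, **kwargs)
        return wrapper

    @tiered
    def work(x):
        return x * x

    for i in range(2000):
        work(i)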
Note that V8 didn't have a dumb-but-very-fast JIT (Sparkplug) until 2021; the interpreter (Ignition) did that block counting and sent it straight to the optimizing JIT (TurboFan).
V8 pre-2021 (i.e., only Ignition+TurboFan) was significantly faster than current CPython is, and the full current four-tier bundle (Ignition+Sparkplug+Maglev+TurboFan) only scores roughly twice as well on Speedometer as pure Ignition does. (Ignition+Sparkplug is about 40% faster than Ignition alone; compare that “dumbness” with CPython's 2–9%.) The relevant lesson should be that things like a very carefully designed value representation and IR are a much more important piece of the puzzle than having as many tiers of compilation as possible.
In case anyone is interested, V8 pre-ignition/TurboFan had different tiers [1]: full-codegen (dumb and fast) and crankshaft (optimizing). It's interesting to see how these things change over time.
keep counters of how often each block of code runs ... and then any block of code running more than a few thousand times run through a far more complex yet slower optimizing jit.
That's just all JITs. Sometimes it's counters for going from interpreter -> JIT rather than between levels of JITs, but this idea is as old as JITs.
It should be fairly easy to add instruction fusing, where they recognize often-used instruction pairs, combine their C code, and then let the compiler optimize the combined code. Combining LOAD_CONST with the instruction following it if that instruction pops the const from the stack seems an easy win, for example.
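To illustrate what such a fused "super-instruction" looks like, here is a toy stack-machine interpreter (made-up opcodes, not CPython's real bytecode) where LOAD_CONST and the following add are collapsed into one dispatch:

    # Opcodes for a toy stack machine (made up, not CPython's bytecode).
    LOAD_CONST, LOAD_VAR, BINARY_ADD, LOAD_CONST_ADD = range(4)

    def run(code, consts, env):
        stack = []
        for op, arg in code:
            if op == LOAD_CONST:
                stack.append(consts[arg])
            elif op == LOAD_VAR:
                stack.append(env[arg])
            elif op == BINARY_ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == LOAD_CONST_ADD:
                # Fused super-instruction: one dispatch instead of two,
                # and no intermediate push/pop of the constant.
                stack[-1] = stack[-1] + consts[arg]
        return stack.pop()

    # "x + 1", first as separate instructions, then with the fused pair.
    plain = [(LOAD_VAR, "x"), (LOAD_CONST, 0), (BINARY_ADD, None)]
    fused = [(LOAD_VAR, "x"), (LOAD_CONST_ADD, 0)]
    assert run(plain, [1], {"x": 41}) == run(fused, [1], {"x": 41}) == 42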
If it was that easy, you'd do that in the interpreter and proportionally reduce interpretation overhead.
In the interpreter, I don’t think it would reduce overhead much, if at all. You’d still have to recognize the two byte codes, and your interpreter would spend additional time deciding, for most byte code pairs, that it doesn’t know how to combine them.
With a compiler, that part is done once and, potentially, run zillions of times.
If fusing a certain pair would significantly improve performance of most code, you'd just add that fused instruction to your bytecode and let the C compiler optimize the combined code in the interpreter. I have to assume CPython has already done that for all the low-hanging fruit.
In fact, for such a fused instruction to be optimized that way on a copy-and-patch JIT it'd need to exist as a new bytecode in interpreter. A JIT that fuses instructions is no longer a copy-and-patch JIT.
A copy-and-patch JIT reduces interpretation overhead by making sure the branches in the executed machine code are the branches in the code to be interpreted, not branches in the interpreter.
This makes a huge difference in more naive interpreters, not so much in a heavily optimized threaded-code interpreter.
The 10% is great, and nothing to sneeze at for a first commit. But I'd actually like some realistic analysis of the next steps for improvement, because I'm skeptical that instruction fusing and the other things being hand-waved are it. Certainly not on a copy-and-patch JIT.
For context: I spent significant effort trying to add such instruction fusing to a simple WASM AOT compiler and got nowhere (the equivalent of constant loading was precisely one of the pairs). Only moving to a much smarter JIT (capable of looking at whole basic blocks of instructions) started making a difference.
There's a lot of effort going on to improve CPython performance, with optimization tiers, etc. It seems the JIT is how at least part of that effort will materialize: https://github.com/python/cpython/issues/113710
We're getting a JIT. Now it's time to optimize the traces to pass them to the JIT.
Support for generating machine code at all seems like a necessary building block to me and probably is quite a bit of effort to work on top of a portable interpreter code base.
Honestly, 2-9% already seems like a very significant improvement, especially since, as they mention, "remember that CPython is already written in C". Whilst it's great to look at the potential for even greater gains by building upon this work, I feel we shouldn't undersell what's been accomplished.
Also recall that a 50% speed improvement in SQLite was achieved through 50-100 different optimisations that each eked out 0.5-1% speedups. On phone now, don’t have the ref, but it all adds up.
That's true, and Rust compiler speed has seen similar speedups from lots of 1% improvements.
But even if you can get a 2x improvement from lots of 1% improvements (if you work really really hard), you're never going to get a 10x improvement.
Rust is never going to compile remotely as quickly as Go.
Python is never going to be remotely as fast as Rust, C++, Go, Java, C#, Dart, etc.
Does it matter?
Trains are never going to beat jets in pure speed. But in certain scenarios, trains make a lot more sense to use than jets, and in those scenarios, it is usually preferable having a 150 mph train to a 75 mph train.
Looking at the world of railways, high-speed rail has attracted a lot more paying customers than legacy railways, even though it doesn't even try to achieve flight-like speeds.
Same with programming languages, I guess.
What is the programming analogy here?
Two decades ago, you could (as e.g. Paul Graham did at the time) argue that dynamically typed languages can get your ideas to market faster so you become viable and figure out optimization later.
It's been a long time since that argument held. Almost every dynamic programming language still under active development is adding some form of gradual typing because the maintainability benefits alone are clearly recognized, though such languages still struggle to optimize well. Now there are several statically typed languages to choose from that get those maintainability benefits up-front and optimize very well.
Different languages can still be a better fit for different projects, e.g. Rust, Go, and Swift are all statically typed compiled languages better fit for different purposes, but in your analogy they're all jets designed for different tactical roles, none of them are "trains" of any speed.
Analogies about how different programming languages are like different vehicles or power tools or etc go way back and have their place, but they have to recognize that sometimes one design approach largely supersedes another for practical purposes. Maybe the analogy would be clearer comparing jets and trains which each have their place, to horse-drawn carriages which still exist but are virtually never chosen for their functional benefits.
I cut my teeth on C/C++, and I still develop the same stuff faster in Python, with which I have less overall experience by almost 18 years. Python is also much easier to learn than, say, Rust, or the current standard of C++ which is a veritable and intimidating behemoth.
In many domains, it doesn't really matter if the resulting program runs in 0.01 seconds or 0.1 seconds, because the dominant time cost will be in user input, DB connection etc. anyway. But it matters if you can crank out your basic model in a week vs. two.
I tried searching for that article because I vaguely recall it, but can't find it either. But yeah, a lot of small improvements add up. Reminds me of this talk: https://www.youtube.com/watch?v=NZ5Lwzrdoe8
Here is a source for the SQLite case: https://topic.alibabacloud.com/a/sqlite-387-a-large-number-o...
That looks like blogspam to me, rather than an actual source.
Marginal gains. https://www.bbc.co.uk/news/magazine-34247629
Many small improvements is the way to go in most situations. It's not great clickbait, but we should remember that we got from a single cell at some time to humans through many small changes. The world would be a lot better if people just embraced the grind of many small improvements...
"remember that CPython is already written in C"
What is this supposed to say? Most scripting language interpreters are written in low level languages (or assembly), but that alone doesn't say anything about the performance of the language itself.
This means that a lot of Python libraries, like Polars or TensorFlow, are not written in Python.
So Python programs that already spend most of their CPU time running those libraries' code won't see much of an impact.
Isn't the point that if pure Python was faster they wouldn't need to be written in other [compiled] languages? Having dealt with Cython it's not bad, but if I could write more of my code in native Python my development experience would be a lot simpler.
Granted we're still very far from that and probably won't ever reach it, but there definitely seems to be a lot of progress.
Since Nim compiles to C, a middle step worth being aware of is Nim + nimporter which isn't anywhere near "just python" but is (maybe?) closer than "compile a C binary and call it from python".
Or maybe it's just syntactic sugar around that. But sugar can be nice.
I think they mean that a lot of runtime of any benchmark is going to be spent in the C bits of the standard library, and therefore not subject to the JIT. Only the glue code and the bookkeeping or whatnot that the benchmark introduces would be improved by the JIT. This reduces the impact that the JIT can make.
What is being accomplished then?
2-9%
Anyone know if there will be any better tools for cross-compiling python projects?
The package management and build tools for python have been so atrociously bad (environments add far too much complexity to the ecosystem) that it turns many developers away from the language altogether. A system like Rust's package management, build tools, and cross compilation capability is an enormous draw, even without the memory safety. The fact that it actually works (because of the package management and build tools) is the main reason to use the language really. Python used to do that ~10 years ago. Now absolutely nothing works. It takes weeks to get simple packages working, only can do anything under extremely brittle conditions that nullify the project you're trying to use this other package for, etc.
If Python could ever get its act together and make better package management, and allow for cross-compiling, it could make a big difference. (I am aware of the very basic fact that it's interpreted rather than compiled, yada yada - there are still ways to make executables, they are just awful.) Since Python is data-science centric, it would be good to have decent data management capabilities too, but perhaps that could come after the fundamental problems are dealt with.
I tried looking at mojo, but it's not open source, so I'm quite certain that kills any hope of it ever being useful at all to anyone. The fact that I couldn't even install it without making an account made me run away as fast as possible.
"It takes weeks to get simple packages working"
Can you expand on what you mean by that? I have trouble imagining a Python packaging problem that takes weeks to resolve - I'd expect them to either be resolvable in relatively short order or for them to prove effectively impossible such that people give up.
- Trying to figure out what versions the scripts used and specifying them in a new poetry project
- Realizing some OS-dependent software is needed, so making a docker file/docker-compose.yml
- Getting some of it working in the container with a poetry environment
- Realizing that other parts of the code work with other versions, so making a different poetry environment for those parts
- Trying to tie this package/container in as a dependency of another project
- Oh actually, this is a dependency of a dependency
- How do you call a function from a package running in a container with multiple poetry environments in a package?
- What was I doing again?
- 2 weeks have passed trying to get this to work, perhaps I'll just do something else
Rinse and repeat.
¯\_(ツ)_/¯ That's python!
I can't answer your initial question, but I do like to pile onto the package management points.
Package consumption sucks so bad, since the sensible way of using packages is virtual envs into which you copy all dependencies. Then freezing venvs or dumping package versions, so you can port your project to a different system, doesn't consider only the packages actually used/imported in code; it just dumps everything in the venv. The fact that you need external tools for this is frustrating.
Then there is package creation. Legacy vs modern approach, cryptic __init__ files, multiple packaging backends, endless sections in pyproject.toml, manually specifying dependencies and dev-dependencies, convoluted ways of getting package metadata actually in code without having it in two places (such as CLI programs with --version).
Cross compilation really would be a nice feature to simply distribute a single-file executable. I haven't tested it, but a Linux system with Wine should in theory be capable of "cross" compiling between Linux and Windows.
Still, like you, as a first step I would prefer a sensible package management and package creation process.
Have you taken a look at Nuitka with GitHub actions for cross compilation? https://github.com/Nuitka/Nuitka-Action
An important piece of context here is that the same code was reused for the interpreter and JIT implementations (that's a main selling point of a copy-and-patch JIT). In other words, this 2-9% improvement mostly represents the core interpreter overhead that the JIT is able to remove. It was even possible that the JIT would have had no performance impact at all, so this result is actually very encouraging; any future opcode specialization and refinement should directly translate into a measurable improvement.
Copy&patch seems not much worse than compiling pure Python with Cython, which roughly corresponds to "just call whatever CPython API functions the bytecode interpreter would call for this bunch of Python", so that's roughly a baseline for how much overhead you get from the interpreter bit.
There would be no reason to use a copy-and-patch JIT if that were the case, because the good old threaded interpreter would have been fine. There is other optimization work in parallel with this JIT effort, including finer-grained micro operations (uops) that can replace the usual opcodes at higher tiers. Uops themselves can be used without the JIT, but the interpreter overhead is proportional to the number of (u)ops executed and would be too large for uops. The hope is that the copy-and-patch JIT combined with uops will be much faster than threaded code.
I wouldn't be so enthusiastic. Look at other languages that have JIT now: Ruby and PHP. After years of efforts, they are still an order of magnitude slower than V8 and even PyPy [1]. It seems to me that you need to design a JIT implementation from ground up to get good performance – V8, Dart and LuaJIT are like this; if you start with a pure interpreter, it may be difficult to speed it up later.
PyPy is designed from the ground up and is still slower than V8 AFAIK. Don’t forget that v8 has enormous amounts of investment from professionally paid developers whereas PyPy is funded by government grants. Not sure about Ruby & PHP and it’s entirely possible that the other JIT implementations are choosing simplicity of maintenance over eking out every single bit of performance.
Python also has structural challenges like native extensions (don’t exist in JavaScript) where the API forces slow code or massive hacks like avoiding the C API at all costs (if I recall correctly I read that’s being worked on) and the GIL.
One advantage Python had is the ability to use multiple cores way before JS but the JS ecosystem remained single threaded longer & decided to use message passing instead to build WebWorkers which let the JIT remain fast.
PyPy is only twice as slow as v8 and is about an order of magnitude faster than CPython. It is quite an achievement. I would be very happy if CPython could get this performance, but I doubt it.
You're right, and in this case "foundational work" even undersells how minimal this work really is compared to the results it already gets.
I recommend that people watch Brandt Bucher's "A JIT Compiler for CPython" from last year's CPython Core Developer Sprint[0]. It gives a good impression of the current implementation and its limitations, and some hints at what may or may not work out. It also indirectly gives a glimpse into the process of getting this into Python through the exchanges during the Q&A discussion.
One thing to especially highlight is that this copy-and-patch has a much, much lower implementation complexity for the maintainers, as a lot of the heavy lifting is offloaded to LLVM.
Case in point: as of the talk this was all just Brandt Bucher's work. The implementation at the time was ~700 lines of "complex" Python, ~100 lines of "complex" C, plus of course the LLVM dependency. This produces ~3000 lines of "simple" generated C, requires an additional ~300 lines of "simple" hand-written C to come together, and no further dependencies (so no LLVM necessary to run the JIT. Also "complex" and "simple" qualifiers are Bucher's terms, not mine).
Another thing to note is that these initial performance improvements are just from getting this first version of the copy-and-patch JIT to work at all, without really doing any further fine-tuning or optimization.
This may have changed a bit in the months since, but the situation is probably still comparable.
So if one person can get this up and running in a few klocs, most of which are generated, I think it's reasonable to have good hopes for its future.
I removed the SDKs of some big (big for the wrong reasons) open source projects which generate a lot of code using python3 scripts.
In those custom SDKs, I generate all the code at the start of the build, which takes a significant amount of time for code generation that is mostly no longer pertinent or inappropriately done. I will really feel the python3 speed improvement in those builds.
Unfortunate to see a couple of comments here drive-by pulling out the “x% faster” stat whilst minimising the context. This is a big deal and it’s effectively a given that this’ll pave the way for further enhancements.
It is a very big deal, as it will finally shift the mentality regarding:
- "C/C++/Fortran libs are Python"
- "Python is too dynamic", while disregarding Smalltalk, Common Lisp, Dylan, SELF, NewtonScript JIT capabilities, all dynamic languages where anything can change at any given moment
What do you mean by "it will shift the mentality"? There is no magical JIT that will ever make, e.g., the data science Python & C++ amalgamations slower than pure Python. That's likely never happening, either.
Also no mentality shift is expected on the "Python is too dynamic" -- which is a strange thing to say anyway -- because Python is not getting any more static due to these JIT news.
Python with JIT is faster than Python without JIT.
Having a Python with JIT, in many cases it will be fast enough for most cases.
Data science running CUDA workloads isn't the only use case for Python.
I think Python without a JIT in many cases is already fast enough for most cases.
I don't do data science.
Sure, for UNIX scripting; for everything else it is painfully slow.
I have known Python since version 1.6, and it is my scripting language in UNIX-like environments. During my time at CERN, I was one of the CMT build infrastructure engineers on the ATLAS team.
It has never been the language I would reach for when not doing OS scripting, and usually when a GNU/Linux GUI application happens to be slow as molasses, it has been written in Python.
A Python web service my team maintains, running at a higher request rate and with lower CPU and RAM requirements than most of the Java services I see around us, would like a word with you.
How many requests per second are we talking, ballpark, and what's the workload?
~5k requests/second for the Python service, we tend to go for small instances for redundancy so that's across a few dozen nodes. The workload comparison is unfair to the Java service, if I'm honest :). But we're running Python on single vCPU containers with 2G RAM, and the Java service instances are a lot larger than that.
Flask, gunicorn, low single digit millisecond latency. Definitely optimised for latency over throughput, but not so much that we've replatformed it onto something that's actually designed for low latency :P. Callers all cache heavily with a fairly high hit ratio for interactive callers and a relatively low hit ratio for batch callers.
I guess those Java developers really aren't.
There's a lot of Django going on in the world.
shrug. If we're talking personal experience, I've been using Python since 1.4. It's been my primary development language since the late 1990s, with of course speed critical portions in C or C++ when needed - and I know a lot of people who also primarily develop in Python.
And there's a bunch of Python development at CERN for tasks other than OS scripting. ("The ease of use and a very low learning curve makes Python a perfect programming language for many physicists and other people without the computer science background. CERN does not only produce large amounts of data. The interesting bits of data have to be stored, analyzed, shared and published. Work of many scientists across various research facilities around the world has to be synchronized. This is the area where Python flourishes" - https://cds.cern.ch/record/2274794)
I simply don't see how a Python JIT is going to make that much of a difference. We already have PyPy for those needing pure Python performance, and Numba for certain types of numeric needs.
PyPy's experience shows we'll not be expecting a 5x boost any time soon from this new JIT framework, while C/C++/Fortran/Rust are significantly faster.
There's a lot of Django going on in the world.
Unfortunately.
And there's a bunch of Python development at CERN for tasks other than OS scripting
Of course there is, CMT was a build tool, not OS scripting.
No need to give CERN links to me to show me Python bindings to ROOT, or Jupyter notebooks.
PyPy's experience shows we'll not be expecting a 5x boost any time soon from this new JIT framework, while C/C++/Fortran/Rust are significantly faster.
I really don't get the attitude that if it doesn't 100% fix all the world problems, then it isn't worth it.
The link wasn't for you - the link was for other HN users who might look at your mention of your use at CERN and mistakenly assume it was a more widespread viewpoint there.
I really don't get the attitude that if it doesn't 100% fix all the world problems, then it isn't worth it.
Then it's a good thing I'm not making that argument, but rather that "Having a Python with JIT, in many cases it will be fast enough for most cases." has very little information content, because Python without a JIT already meets the consequent.
My teams deploy Python web APIs and yes, it is slow compared to other languages and runtimes.
But on the whole, machines are cheaper than other engineering approaches to scaling.
For us, and many others, fast enough is fast enough.
I really wouldn't mind Python being faster than it is, and I really didn't mind at all getting a practically free ~30% performance increase just by updating to 3.11. There are tons of applications which just passively benefit from these optimizations. Sure, you might argue "but you shouldn't have written that parser or that UI handling a couple thousand items in Python", but lots of people do and did just that.
I wouldn't mind either.
Do you agree with me that Python is already fast enough for most cases, even without a JIT?
If not, how would a 30% boost improve things enough to change the balance?
I'm fairly certain that this is false, and am working on proving it. In the cases that Numba is optimised for it's already faster than plausible C++ implementations of the same kernels.
https://stackoverflow.com/questions/36526708/comparing-pytho...
It's not faster, it's about as fast as C++ compiled with O3 optimizations, which is great and also much more likely to be true.
Numba is basically another language embedded in Python. It (sometimes severely) modifies the semantics of code.
Disregarding the fact that python is an awful programming language for anything other than jupyter notebooks
Ah I'd say the exact opposite, python in general is pretty good but jupyter sucks because the syntax isn't compatible with regular python and I avoid it like the plague.
What does a jupyter notebook have to do with python syntax?
Take the code you find in an average notebook, copy it to a .py text file, run it with python. Does it run? In my experience the answer is usually 'no' because of some extra-ass syntax sugar jupyter has that doesn't exist in python.
Another one who hasn't seen UNIX scripting in shell languages or Perl, or Apache modules, before Python came to be.
This comment is really just bordering on a rule violation and doesn’t add to the conversation at all.
Facts are objective; "Python is awful" is your opinion.
it will finally shift the mentality regarding "C/C++/Fortran libs are Python"
But pjmlp, I use Python because it's a wrapper for C/C++/Fortran libs. - Chocolate Giddyup
Just like Tcl happens to be.
I can dig it!
maybe, maybe not. time will tell. ahead-of-time compilation is even better known for improving performance and yet perl's compile-to-c backend turned out to fail to do that
ahead-of-time compilation is even better known for improving performance
Not necessarily, not for dynamic languages.
With very dynamic languages you can make only very limited assumptions about e.g. function argument types, which lead you to compiled functions that have to handle any possible case.
A JIT compiler can notice that the given function is almost always (or always) used to operate on a pair of integers, and do a vastly superior specialized compilation, with guards to fallback on the generic one. With extensive inlining, you can also deduplicate a lot of the guards.
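A conceptual sketch of that specialize-with-guards idea, written in plain Python just to show the shape of it (a real JIT emits machine code, and the "fast path" would be a native integer add rather than another Python call):

    def add_generic(a, b):
        # Fully dynamic path: __add__ dispatch, big ints, overloading, etc.
        return a + b

    def specialize(fast_path, guard, fallback):
        # Shape of what a JIT emits for a hot call site that has only ever
        # seen one combination of types.
        def specialized(a, b):
            if guard(a, b):              # cheap type check, inlined by the JIT
                return fast_path(a, b)
            return fallback(a, b)        # guard failed: deoptimize to generic code
        return specialized

    add_int = specialize(
        fast_path=lambda a, b: a + b,    # stands in for a native integer add
        guard=lambda a, b: type(a) is int and type(b) is int,
        fallback=add_generic,
    )

    add_int(2, 3)        # guarded fast path
    add_int("a", "b")    # guard fails, falls back to the generic path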
yes, that is true. but aot compilers never make things slower than interpretation, and they can afford more expensive optimizations
also, even mature jit compilers often only make limited improvements; jython has been stuck at near-parity with cpython's terrible performance for decades, for example, and while v8 was an enormous improvement over old spidermonkey and squirrelfish, after 15 years it's still stuck almost an order of magnitude slower than c https://benchmarksgame-team.pages.debian.net/benchmarksgame/... which is (handwaving) like maybe a factor of 2 or 3 slower than self
typically when i can get something to work using numpy it's only about a factor of 5 slower than optimized c, purely interpretively, which is competitive with v8 in many cases. luajit, by contrast, is goddam alien technology from the future
with respect to your int×int example, if an int×int specialization is actually vastly superior, for example because the operation you're applying is something like + or *, an aot compiler can also insert the guard and inline the single-instruction implementation, and it can also do extensive inlining and even specialization (though that's rare in aots and common in jits). it can insert the guards because if your monomorphic sends of + are always sending + to a rational instance or something, the performance gain from eliminating megamorphic dispatch is comparatively slight, and the performance loss from inserting a static hardcoded guess of integer math before the megamorphic dispatch is also comparatively slight, though nonzero
this can fall down, of course, when your arithmetic operations are polymorphic over integer and floating-point, or over different types of integers; but it often works far better than it has any right to. in most code, most arithmetic and ordered comparison is integers, most array indexing is arrays, most conditionals are on booleans (and smalltalk actually hardcodes that in its bytecode compiler). this depends somewhat on your language design, of course; python using the same operator for indexing dicts, lists, and even strings hurts it here
meanwhile, back in the stop-hitting-yourself-why-are-you-hitting-yourself department, fucking cpython is allocating its integers on the heap and motherfucking reference-counting them
fucking cpython is allocating its integers on the heap and motherfucking reference-counting them
And here I thought that it was shocking to learn that v8 allocates doubles on the heap recently. (I mean, I'm not a compiler writer, I have no idea how hard it would be to avoid this, but it feels like mandatory boxed floats would hurt performance a lot)
nanboxing as used in spidermonkey (https://piotrduperas.com/posts/nan-boxing) is a possible alternative, but i think v8 works pretty hard to not use floats, and i don't think local-variable or temporary floats end up on the heap in v8 the way they do in cpython. i'm not that familiar with v8 tho (but i'm pretty sure it doesn't refcount things)
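For anyone unfamiliar with nan-boxing, the bit trick itself can be demonstrated with Python's struct module; this only shows that a quiet NaN has 48 free payload bits, whereas real engines pack pointers and type tags in there:

    import math
    import struct

    QNAN = 0x7FF8_0000_0000_0000      # quiet-NaN pattern; low 48 bits are unused

    def box_int(value):
        # Stash a small integer in the free payload bits of a NaN double.
        bits = QNAN | (value & 0xFFFF_FFFF_FFFF)
        return struct.unpack("<d", struct.pack("<Q", bits))[0]

    def unbox_int(boxed):
        bits = struct.unpack("<Q", struct.pack("<d", boxed))[0]
        return bits & 0xFFFF_FFFF_FFFF

    v = box_int(12345)
    assert math.isnan(v)              # still a NaN to the float machinery
    assert unbox_int(v) == 12345      # but the payload round-trips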
i think v8 works pretty hard to not use floats
Correct, to the point where at work a colleague and I actually have looked into how to force using floats even if we initiate objects with a small-integer number (the idea being that ensuring our objects having the correct hidden class the first time might help the JIT, and avoids wasting time on integer-to-float promotion in tight loops). Via trial and error in Node we figured that using -0 as a number literal works, but (say) 1.0 does not.
i don't think local-variable or temporary floats end up on the heap in v8 the way they do in cpython
This would also make sense - v8 already uses pools to re-use common temporary object shapes in general IIRC, I see no reason why it wouldn't do at least that with heap-allocated doubles too.
so then the remaining performance-critical case is where you have a big array of floats you're looping over. in firefox that works fine (one allocation per lowest-level array, not one allocation and unprefetchable pointer dereference per float), but maybe in chrome you'd want to use a typedarray?
Maybe, at that point it is basically similar to the struct-of-arrays vs array-of-structs trade-off, except with significantly worse ergonomics and less pay-off.
As I understand it, V8 keeps track of an ElementsKind for each array (or, more precisely, for the elements of every object; arrays are not special in this sense). If an array only contains floats, then they will all be stored unboxed and inline. See here: https://source.chromium.org/chromium/chromium/src/+/main:v8/...
I assume that integers are coerced to floats in this mode, and that there's a performance cliff if you store a non-number in such an array, but in both cases I'm just guessing.
In SpiderMonkey, as you say, we store all our values as doubles, and disguise the non-float values as NaNs.
There is already an AOT compiler for Python: Nuitka[0]. But I don't think it's much faster.
And then there is mypyc[1] which uses mypy's static type annotations but is only slightly faster.
And various other compilers like Numba and Cython that work with specialized dialects of Python to achieve better results, but then it's not quite Python anymore.
thanks, i'd forgotten about nuitka and didn't know about mypyc!
I so much agree with your comment on memory allocation. Everybody is focusing on JIT, but allocating everything on the heap, with no possibility to pack multiple values contiguously in a struct or array, will still be a problem for performance.
Ahead-of-time compilation is a bad solution for dynamic languages, so that is an expected outcome for Perl.
The base line should be how heavily dynamic languages like my favourite set, Smalltalk, Common Lisp, Dylan, SELF, NewtonScript, ended up gaining from JIT, versus the original interpreters, while being in the genesis of many relevant papers for JIT research.
when i wrote ur-scheme one of the surprising things i learned from it was that ahead-of-time compilation worked amazingly well for scheme. scheme is ruthlessly monomorphic but i was still doing a type check on every primitive argument
i didn't realize they ever jitted newtonscript
NewtonScript 2.0 introduced a mechanism to manually JIT code, functions marked as native get compiled into machine code.
Had the Newton not been canceled, probably there would be an evolution from that support.
See "Compiling Functions for Speed"
https://www.newted.org/download/manuals/NewtonToolkitUsersGu...
this is great, thanks! but it sounds like it was an aot compiler, not a jit compiler; for example, it explains that a drawback of compiling functions to native code is that they use more memory, and that the compiler still produces bytecode for the functions it compiles natively, unless you suppress the bytecode compilation in project settings
Yeah, I guess if one wants to get more technical, I see it as the first step of a JIT that didn't have the opportunity to evolve due to market decisions.
i guess if they had, we would know whether a jit made newtonscript faster or slower, but they didn't, so we don't. what we do know is that an aot compiler sometimes made newtonscript faster (though maybe only if you added enough manifest static typing annotations to your source code)
that seems closer to the opposite of what you were saying in the point on which we were in disagreement?
I guess my recollection regarding NewtonScript wasn't correct, if you prefer that I put it like that; however I am quite certain in regards to the other languages in my list.
i agree that the other languages gained a lot for sure
maybe i should have said that up front!
except maybe common lisp; all the implementations i know are interpreted or aot-compiled (sometimes an expression at a time, like sbcl), but maybe there's a jit-compiled one, and i bet it's great
probably with enough work python could gain a similar amount. it's possible that work might get done. but it seems likely that it'll have to give up things like reference-counting, as smalltalk did (which most of the other languages never had)
I don't see this as an enhancement.
Not pursuing JIT or efficient compilation in general was a deliberate decision way back when Python made some kind of sense. It was the simplicity of implementation valued over performance gains that motivated this decision.
The mantra Python programmers liked to repeat was that "the performance is good enough, and if you want to go fast, write in C and make a native module".
And if you didn't like that, there was always Java.
Today, Python is getting closer and closer to be "the crappy Java with worse syntax". Except we already have that: it's called Groovy.
What are you talking about? From what I can read here there is no syntax change, just a framework for faster execution. Plus, Python's use case has HEAVILY evolved over the last few years since it's now the de facto language for machine learning. It's great that the core devs are keeping up with the times.
The language is definitely getting more complex syntactically, and I'm not a huge fan of some of those changes, but it's nowhere near Java or C++ or anything else. You can still write simple Python with all of these changes.
Is it any different from or comparable to numba or pyjion? Not following Python closely in recent years, but I recall those two projects having huge potential.
I don’t know Pyjion, but I have used Numba for real work. It’s a great package and can lead to massive speed-ups.
However, last time I used it, it (1) didn’t work with many third-party libraries (e.g. SciPy was important for me), and (2) didn’t work with object-oriented code (all your @njit code had to be wrapped in functions without classes). Those two limitations have restricted which projects I could adopt Numba in, despite loving it in the cases where it worked.
I don’t know what limitations the built-in Python JIT has, but hopefully it might be a more general JIT that works for all Python code.
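For reference, the kind of numeric kernel where Numba shines is a plain loop over a NumPy array; a minimal example (@njit is standard Numba API, though the actual speedup obviously depends on the workload):

    import numpy as np
    from numba import njit

    @njit                      # compiled to native code on first call
    def total(a):
        s = 0.0
        for i in range(a.shape[0]):
            s += a[i]
        return s

    data = np.random.rand(1_000_000)
    total(data)                # first call pays the compilation cost
    total(data)                # later calls run the cached machine code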
This is so true!
A JIT compiler is a big deal for performance improvements, especially where it matters (in large repetitive loops).
Anyone cynical about the potential a python JIT offers should take a look at pypy which has a 5x speed up over regular python, mainly though JIT operations: https://www.pypy.org/
Honestly I don't understand the pessimistic view here. I think every release since Microsoft started funding Python has increased best-case performance by high single digits.
Rather than focusing on the raw number, compare to Python 3.5 or so. It's still getting significantly faster.
If they keep doing this steady pace they are slowly saving the planet!
I think the pessimism really comes from a dislike for Python
While very very very popular, Python is, I think, a very disliked language. It doesn't have, and is not built around, the current programming language features that programmers like: it's not functional or immutable by default, it's not fast, the tooling is complex, and it uses indentation for code blocks (a feature that was cool in the 90s but dreaded since at least 2010).
So I guess if Python becomes faster, this will ensure its continued dominance, and all those hoping that one day it will be replaced by a nicer, faster language are disappointed.
this pessimism is the aching voice of the developers who were hoping for a big python replacement
To each his own, but the things you list are largely subjective/inaccurate, and there are many, many, many developers who use Python because they enjoy it and like it a lot.
Python is a very widely used language, and like any popular thing, yes, many many many like it, and many many many dislike it. It is that big: Python can be disliked by a million developers and still be a lot more liked than disliked.
But I also think it's true that Python is not, and has not been for a while, considered a modern or technically advanced language.
The hype currently is for typed or gradually typed languages, functional languages, immutable data, systems languages, type-safe languages, languages with advanced parallelism and concurrency support, etc.
Python is old, boring OOP. If you like it, then like millions of developers you are not picky about programming languages; you use what works, what pays.
But for devs passionate about programming languages, Python is a relic they hope will vanish.
devs passionate about programming languages, Python is a relic they hope will vanish
Statements like this are obviously untrue for large numbers of people, so I'm not sure of the point you're trying to make.
But certainly it's true that there are both objective and subjective reasons for using a particular tool, so I hope you are in a position to use the tools that you prefer the most. Have a great day!
the hype currently is for typed or gradually typed languages
So Python with mypy
but for devs passionate about programming languages, Python is a relic they hope will vanish
If you asked me what language I would consider to be a relic that I hope would vanish, I'd go with Perl.
(this feature was cool in the 90s, but dreaded since at least 2010)
LOL this is a dead giveaway you haven't been around long. There have been people kvetching about the whitespace since the beginning. Haskell went on to be the next big thing for reddit/HN/etc for years and it also uses whitespace.
Sorry, but reality bites https://en.wikipedia.org/wiki/Amdahl%27s_law
It's not that simple.
Amdahl's Law is about expected speedup/decrease in latency. That actually isn't strongly correlated to "saving the planet" afaik (where I interpret that as reducing direct energy usage, as well as embodied energy usage by reducing the need to upgrade hardware).
If anything, increasing speed and/or decreasing latency of the whole system often involves adding some form of parallelism, which brings extra overhead and requires extra hardware. Note that prefetching/speculative execution kind of counts here as well, since that is essentially doing potentially wasted work in parallel. In the past, boosting the clock rate of the CPU was also a thing, until thermodynamics said no.
OTOH, letting your CPU go to sleep faster should save energy, so repeated single-digit perf improvements via wasting fewer instructions do matter.
But then again, that could lead to Jevons Paradox (the situation where increasing efficiency encourages more wasteful use than the increase in efficiency saves - Wirth's Law but generalized and older, basically).
So I'd say there's too many interconnected dynamics at play to really simply state "optimization good" or "optimization useless". I'm erring on the side of "faster Python probably good".
When did Microsoft start funding Python?
Also, such a shame that it takes sooo long for crucial open source to be funded properly. Kudos to Microsoft for doing it, shame on everyone else for not pitching in sooner.
FYI Python was launched 32 years ago, Python 2 was released 24 years ago and Python 3 was released 16 years ago.
Julia is my source of pessimism. Julia is super fast once it's warmed up, but before it gets there, it's painfully slow. They seem to be making progress on this, but it's been gradual. I understand that Java had similar growing pains, but it's better now. Combined with the boondoggle of py3, I'm worried for the future of my beloved language as it enters another phase of transformation.
If JIT is a good thing for Python, why not just compile to Java or .NET bytecode and use their already-optimized infrastructure?
Given how many Microsoft employees today steer the Python decision making process, I'm sure in not so distant future, we might see a new CLR-based Python implementation.
Maybe Microsoft doesn't know yet how to sell this thing, or maybe they are just boiling the frog. Time will tell. But I'm pretty sure your question will be repeated as soon as people get used to the idea of Python with a JIT.
Doesn't this already exist in IronPython?
Correct. I just checked the project yesterday and they are presently at 3.4 :-|
“Python code runs 15% faster and 20% cheaper on Azure than AWS, thanks to our optimized azurePython runtime. Use it for Azure Functions and ML training”
Just a guess at the pitch.
If you're interested in learning more about the challenges and tradeoffs, both Jython (https://www.jython.org/) and IronPython (https://ironpython.net/) have been around for a long time and there's a lot of reading material on that subject.
Graal Python exists too: https://www.graalvm.org/python/
It beats Python on performance, supposedly, but compatibility has never been great.
I've found the startup time for Graal Python to be terrible compared with other Graal languages like JS. When I did some profiling, it seemed that the vast majority of the time was spent loading the standard library. If implemented lazily, that should have a negligible performance impact.
Python is a convenient friendly syntax for calling code implemented in C. While you can easily re-implement the syntax, you then have to decide how much of that C to re-implement. A few of the builtin types are easy (eg strings and lists), but it soon becomes a mountain of code and interoperability, especially if you want to get the semantics exactly right. And that is just the beginning - a lot of the value of Python is in the extensions, and many popular ones (eg numpy, sqlite3) are implemented in C and need to interoperate with your re-implementation. Trying to bridge from Java or .NET to those extensions will overwhelm any performance advantages you got.
This JIT approach is improving the performance of bits of the interpreter while maintaining 100% compatibility with the rest of the C code base, its object model, and all the extensions.
This is what you are looking for. Python on the GraalVM
I love Python and use it for everything other than web development.
One reason is performance. So if Python has a faster future ahead of it: Hurray!
The other reason is that the Python ecosystem moved away from stateless requests like CGI or mod_php use and now is completely set on long running processes.
Does this still mean you have to restart your local web application after any change you make to it? I heard that some developers automate that, so that every time they save a file, the web application is restarted. That seems pretty expensive in terms of resource consumption. And complex, as you would have to run some kind of watcher process which handles watching your files and restarting the application?
The restart isn't expensive in absolute terms, on a human level it's practically instant. You would only do this during development, hopefully your local machine isn't the production environment.
It's also very easy, often just adding a CLI flag to your local run command.
edit: Regarding performance, Python today can easily handle at least 1k requests per second. The vast vast vast majority of web applications today don't need anywhere near that kind of performance.
The thing is, I don't run my applications locally with a "local run command".
I prefer to have a local system set up just like the production server, but in a container.
Maybe using WSGI with MaxConnectionsPerChild=1 could be a solution? But that would start a new (for example) Django instance for every request. Not sure how fast Django starts.
Another option might be to send a HUP signal to Apache:
apachectl -k restart
That will only kill the worker threads. And when there are none (because another file save triggered it already), this operation might be almost free in terms of resource usage. This also would require WSGI or similar. Not sure if that is the standard approach for Django+Apache.
I would still recommend running it properly locally, but whatever. Pseudo-devcontainer it is. I assume the code is properly volume mounted.
In production, you would want to run your app through gunicorn/uvicorn/whatever on an internal-only port, and reverse-proxy to it with a public-facing apache or similar.
Set up apache to reverse proxy like you would on prod, and run gunicorn/uvicorn/whatever like you would on prod, except you also add the autoreload flag. E.g.
uvicorn main:app --host 0.0.0.0 --port 12345 --reload
If production uses containers, you should keep the Python image slim and simple, including only gunicorn/uvicorn, and have the reverse proxy in another container. Etc.
Been working with Python for the web for over a decade. This is basically a solved issue, and the performance is a non-issue day to day.
Python is amazing and shines for Web development. I'd recommend taking a look at https://www.tornadoweb.org/en/stable/index.html. I use this in production on my pet project at https://www.meecal.co/. Put Nginx in front and you're golden.
Definitely take a look, it's come a long way from ten years ago.
If you run the debug web server command (e.g. Django's `manage.py runserver`), yes, it has a watcher that will automatically restart the web server process if there is a code change.
Once you deploy it to production, you usually run it using a WSGI/ASGI server such as Gunicorn or Uvicorn and let whatever deployment process you use handle the lifecycle. You usually don't use a watcher in production.
Basically similar stuff with nodejs, rails, etc.
That seems pretty expensive in terms of resource consumption. And complex as you would have to run some kind of watcher process which handles watching your files and restarting the application?
What? No, in reality it’s just running your app in debug mode (just a cli flag), and when you save the files the next refresh of the browser has the live version of the app. It’s neither expensive nor complex.
In dev, this is handled mostly by the OS with things like inotify, so it has little perf impact.
In prod, you don't do it. Deployment implies sending a signal like HUP to your app, so that it reloads the code gracefully.
All in all, everybody is moving to this, even PHP. This allows for persistent connections, function memoization, delegation to threadpools, etc.
The last two-ish years have been insane for Python performance. Something clicked with the core team and they obviously made this a serious goal of theirs and the last few years have been incredible to see.
It’s because the total dollars of capitalized software deployed in the world using Python has absolutely exploded from AI stuff. Just like how the total dollars of business conducted on the web was a big driver of JS performance earlier.
But all the AI heavy lifting is done in native code.
AI heavy lifting isn't just model training. There's about a million data pipelines and processes before the training data gets loaded into a PyTorch tensor.
also done in native code
There were no noticeable performance improvements in the course of the last two years. I have no idea what you are talking about.
The major change that's been going on in the Python core development team is that Microsoft gets more and more power over what happens to Python. Various PSF authorities have had strong links to Microsoft, and today the head of the PSF is straight up a Microsoft employee. Microsoft doesn't like to advertise this fact, because it rightfully suspects that rebranding Python as "Microsoft Python" would scare off some old-timers at least, but de facto it is "Microsoft Python".
The community has gone from bad to worse. Any real discussion about the language stopped years ago. Today it's a pretty top-down decision-making process where there's no feedback, no criticism is allowed, etc. My guess is that Microsoft doesn't have a plan for the third "E" here, but who knows? Maybe eventually they'll find a way to move Python to the CLR and will peddle their version of it? -- I wouldn't be surprised if that happened, actually.
Microsoft already tried Python on the CLR! They didn't stick with it. https://en.wikipedia.org/wiki/IronPython
There were no noticeable performance improvements in the course of the last two years.
In fairness, Python did get faster. Python 3.9 took 82 seconds for sudoku solving and 62 seconds for interval query. Python 3.11 took 53 and 43 seconds, respectively [1]. v3.12 may be better. That said, whether the speedup is noticeable can be subjective. 10x vs 15x slower than v8 may not make much difference mentally.
Microsoft are paying core devs to work on it full time, for one.
I think it's really cool that Haoran Xu and Fredrik Kjolstad's copy-and-patch technique[0] is catching on, I remember discovering it through Xu's blog posts about his LuaJIT remake project[1][2], where he intends to apply these techniques to Lua (and I probably found those through a post here). I was just blown away by how they "recycled" all these battle-tested techniques and technologies, and used it to synthesize something novel. I'm not a compiler writer but it felt really clever to me.
I highly recommend the blog posts if you're into learning how languages are implemented, by the way. They're incredible deep dives, but he uses the details-element to keep the metaphorical descents into Mariana Trench optional so it doesn't get too overwhelming.
I even had the privilege of congratulating him on the 1000th star of the GH repo[3], where he reassured me and others that he's still working on it despite the long pause after the last blog post, and that this mainly has to do with behind-the-scenes rewrites that make no sense to publish in part.
[0] https://arxiv.org/abs/2011.13127
[1] https://sillycross.github.io/2022/11/22/2022-11-22/
[2] https://sillycross.github.io/2023/05/12/2023-05-12/
[3] https://github.com/luajit-remake/luajit-remake/issues/11
M. Anton Ertl and David Gregg. 2004. Retargeting JIT Compilers by using C-Compiler Generated Executable Code. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). IEEE Computer Society, USA, 41–50.
While it bears a significant resemblance, Ertl and Gregg's approach is not automatic, and every additional architecture requires a significant understanding of the target architecture, including an ability to ensure that fully relocatable code can be generated and extracted. In comparison, the copy-and-patch approach can be thought of as a simple dynamic linker, and objects generated by unmodified C compilers are far more predictable and need much less architecture-specific information for linking.
Does Ertl and Gregg's approach have any "upsides" over copy-and-patch? Or is it a case of just missing those one or two insights (or technologies) that make the whole thing a lot simpler to implement?
I think so, but I can't say this with any more confidence until I get an actual copy of their paper (I used other review papers to get the main idea instead).
Anton Ertl! <3
Context: I've been on a concatenative language binge recently, and his work on Forth is awesome. In my defense he doesn't seem to list this paper among his publications[0]. Will give this paper a read, thanks for linking it! :)
If they missed the boat on getting credit for their contributions then at least the approach finally starts to catch on I guess?
(I wonder if he got the idea from his work on optimizing Forth somehow?)
Reminds me of David K who is local to me in Florida, or was, last I spoke to him. He has been a Finite State Machine advocate for ages, and it's a well-known concept, but you'd be surprised how useful they can be. He pushes it for front-end a lot, and even implemented a Tic Tac Toe sample using it.
At the end of the day, the number of optimizations that even a JIT can do on Python is limited because all variables are boxed (each time the variable is accessed the type of the variable needs to be checked because it could change) and then function dispatches must be chosen based on the type of the variable. Without some mechanism to strictly type variables, the number of optimizations will always be limited.
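To make that concrete, here's a tiny illustration (plain Python, nothing CPython-specific): the exact same bytecode has to handle whatever types flow through it, so every operation needs a runtime type check and dynamic dispatch.

    def total(values):
        it = iter(values)
        acc = next(it)
        for v in it:
            # '+' must dispatch on the runtime types of acc and v every time
            acc = acc + v
        return acc

    print(total([1, 2, 3]))        # 6: integer addition
    print(total(["a", "b", "c"]))  # 'abc': string concatenation, same bytecode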
Per the spec all JS values are boxed too (aside from values in TypedArrays). The implementations managed to work their way around that too for the most part.
JavaScript is insanely more optimized but has the same limitations as Python. So there is likely a lot more you can do despite the flexibility, like figuring out in hot code which flexibility features are not used, and optimizing around that.
Couldn’t you say the same for e.g. JavaScript? The variables aren’t typed there either and prototypes are mutable. I could definitely see things being harder with Python which has a lot of tricky metaprogramming available that other interpreted languages don’t but I don’t think it’s as simple as a lack of explicit types.
Don't worry. Python already has syntactical constructs with mandatory type annotations. I won't be surprised if, a few years from now, those type annotations become mandatory in other contexts as well.
Can’t the happy path be branch predicted and speculatively executed, though? AFAIK V8 seems to do this: https://web.dev/articles/speed-v8#the_optimizing_compiler
IIRC Instagram's flavour of Python had unboxed primitives (if the types were constrained enough).
If Python became fast, there's a chance it could become a language eater.
What languages do you think it could realistically eat (that it hasn’t already)?
Javascript, if it becomes viable for web development?
I'd love to see it eat JavaScript and Java for back-end code.
But I doubt that's going to ever happen.
2-9% isn't changing any language hierarchies
Groundwork for the future.
Great article, but small typo when the author says "copy-any-patch JIT"
That’s not a typo, that’s the name of the technique.
i think it's 'copy-and-patch'
D’oh! Of course you’re correct. I skipped over “any”, and focused on “patch”. Sorry about that.
no harm done :)
The article describes that the new JIT is a "copy-and-patch JIT" (I've previously heard this called a "splat JIT"). This is a relatively simple JIT architecture where you have essentially pre-compiled blobs of machine code for each interpreter instruction that you patch immediate arguments into by copying over them.
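As a toy illustration of the idea (a hypothetical sketch in plain Python, with a made-up stencil and placeholder, not CPython's actual implementation): a pre-compiled template with a "hole" in it gets copied, and the hole is overwritten with the immediate value at JIT time.

    HOLE = b"\xde\xad\xbe\xef"  # hypothetical 4-byte placeholder baked into the stencil

    # pretend this is a pre-compiled stencil for "load a constant into a register",
    # e.g. x86-64 "mov rax, imm32" with the immediate left as a hole
    STENCIL = b"\x48\xc7\xc0" + HOLE

    def copy_and_patch(stencil, value):
        code = bytearray(stencil)                    # copy the template
        i = code.find(HOLE)
        code[i:i + 4] = value.to_bytes(4, "little")  # patch the immediate in place
        return bytes(code)

    print(copy_and_patch(STENCIL, 42).hex())         # 48c7c02a000000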
I once wrote an article about very simple JITs, and the first example in my article uses this style: https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-...
I take some issue with this statement, made later in the article, about the pros/cons vs a "full" JIT:
The big downside with a “full” JIT is that the process of compiling once into IL and then again into machine code is slow. Not only is it slow, but it is memory intensive.
I used to think this was true also, because my main exposure to JITs was the JVM, which is indeed memory-intensive and slow.
But then in 2013, a miraculous thing happened. LuaJIT 2.0 was released, and it was incredibly fast to JIT compile.
LuaJIT is undoubtedly a "full" JIT compiler. It uses SSA form and performs many optimizations (https://github.com/tarantool/tarantool/wiki/LuaJIT-Optimizat...). And yet feels no more heavyweight than an interpreter when you run it. It does not have any noticeable warm up time, unlike the JVM.
Ever since then, I've rejected the idea that JIT compilers have to be slow and heavyweight.
»LuaJIT is undoubtedly a "full" JIT compiler.«
Yes, and it's practically unmaintained. Pull requests to add support for various architectures have remained largely unanswered, including RISC-V.
I think Mike Pall has done enough work on LuaJIT for several lifetimes. If nobody else wants to merge pull requests and make sure everything still works then maybe LuaJIT isn't important enough to the world.
Doesn't change parent's point, clearly proves it's possible
Wasn't CPython supposed to remain very simple in its codebase, with the heavy optimization left for other implementations to tackle? I seem to remember hearing as much a few years back.
That was the original idea, when Python started attracting interest from big corporations. It has however become clear that maintaining alternative implementations is very difficult and resource-intensive; and if you have to maintain compatibility with the wider ecosystem anyway (because that's what users want), you might as well work with upstream to find solutions that work for everyone.
The copy-and-patch approach was explicitly chosen in order to minimize additional impacts on non-JIT-specific code base.
Does Python even have a language specification? I've been told that CPython IS the specification. I don't know if this is still true. In the Java world there is a specification and a set of tests to check for conformance, so it's easier to have alternative implementations of the JVM. If what I said is correct, then I can see how the optimized alternative implementation idea is less likely to happen.
The bit everyone wants:
The initial benchmarks show something of a 2-9% performance improvement.
Which is underwhelming (as mentioned in the article), especially if we look at PyPy[0]. But it's a step forward nonetheless.
At the moment, the JIT is only used if the function contains the JUMP_BACKWARD opcode which is used in the while statement but that will change in the future.
It's a bit less underwhelming if you consider that only function objects with loops are being JITed. nb: for loops in Python also use the JUMP_BACKWARD op.
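You can check this yourself with the dis module: on CPython 3.11 and later, both while and for loops close with a JUMP_BACKWARD instruction (exact output varies by version).

    import dis

    def count_down(n):
        while n:
            n -= 1

    def summed(xs):
        total = 0
        for x in xs:
            total += x
        return total

    dis.dis(count_down)  # the while loop ends with JUMP_BACKWARD
    dis.dis(summed)      # the for loop does too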
PyPy never managed to replace CPython, in spite of being free of the compatible C API. CPython is trying to move fast without breaking the C API, and a 2-9% improvement is in fact very encouraging for that and other reasons (see my other comment).
What is it really JIT-ing? Given it says that it's only relevant for those building CPython. So it's not JIT-ing my Python code, right? And the interpreter is in C. So what is it JIT-ing? Or am I misunderstanding something?
A copy-and-patch JIT only requires that the LLVM JIT tools be installed on the machine where CPython is compiled from source, and for most people that means the machines of the CI that builds and packages CPython.
The code fragments that implement each opcode in the core interpreter loop are additionally compiled so that each fragment becomes a relocatable binary object. Once processed that way, the runtime code generator can join the required fragments by patching relocations, essentially doing the job of a dynamic linker. So it is compiling your Python code, but the compiled result is composed of pre-baked fragments with patches.
How do you access optimizations such as dead code removal and constant propagation using this technique?
I believe a JIT using this technique could eliminate dead code at the Python bytecode level, but not at the machine code level. That seems pretty reasonable to me.
How's this different than running pypy?
It supports all your existing python code and python libraries (at the cost of being significantly slower than PyPy).
Can someone explain what a JIT compiler means in the case of an interpreted language?
Basically a JIT (Just In Time), is also known as a dynamic compiler.
It is an approach that traces back to the original Lisp and BASIC systems, among other lesser-known ones.
The compiler is part of the language runtime, and code gets dynamically compiled into native code.
Why is this a good approach?
It allows for experiences that are much harder to implement in languages that traditionally compile straight to native code like C (note there are C interpreters).
So you can have an interpreter-like experience, and code gets compiled to native code before execution on the REPL, either straight away or after execution passes a specific threshold.
Additionally, since dynamic languages by definition can change all the time, a JIT can profit from code instrumentation and generate machine code that takes into account the types actually being used, something that an AOT compiler for a dynamic language cannot predict, so those optimizations are hardly an option in most cases.
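A toy sketch of that idea in plain Python (illustrative only, the names are made up): observe the types at a call site, install a fast path behind a cheap guard, and fall back to the generic path when the guard fails, which is roughly the deoptimization story in engines like V8.

    def generic_add(a, b):
        return a + b                             # full dynamic dispatch

    def specialized_add(a, b):
        if type(a) is int and type(b) is int:    # cheap guard on the observed types
            return a + b                         # a real JIT would emit native code here
        return generic_add(a, b)                 # "deoptimize" back to the generic path

    print(specialized_add(2, 3))                 # hits the specialized path
    print(specialized_add("a", "b"))             # guard fails, falls back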
The initial benchmarks show something of a 2-9% performance improvement. You might be disappointed by this number, especially since this blog post has been talking about assembly and machine code and nothing is faster than that right?
Indeed, reading the blog post built up much higher expectations.
Just running machine code itself does not make a program magically faster. It's all about the amount of work the machine code is doing.
For example, if the JIT compiler realizes the program is adding two integers it could potentially replace the code with two MOVs and a single ADD. However, what about the error handling in the case of an overflow? Python switches to its internal BigInt representation in this case and cannot rely on architecture specific instructions alone once the result gets too large to fit into a register.
Modern programming languages are all about trading performance for convenience and that is what makes them slow — not because they are running an interpreter and not compiling to machine code.
I still don't get why they didn't reduce API of the interpreter internals in Python 3 so that things like this would be more achievable.
If you're going to break backwards compatibility, it's not like Unicode was the only foundational problem Python 2 had.
They did change the API for Python modules implemented in C. That was actually part of the reason why the 2->3 transition went so badly.
It wasn't realistic to switch to 3.x when the libraries either weren't there or were a lot slower (due to using pure Python instead of C code).
It also wasn't realistic to rewrite the libraries when the users weren't there.
It was in many respects a perfect case study in how not to do version upgrades.
It's interesting to see these 2-9% improvements from version to version. They are always talked about with disappointment, as if they are too small, but they also keep coming, with each version being faster than the previous one. I prefer a steady 10% per version over breaking things because you are hoping for bigger numbers. Those percentages add up!
Well I think they even multiply, making it even better news!
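Rough back-of-the-envelope: even the low end compounds into something noticeable over a handful of releases.

    low, high = 1.02, 1.09
    print(f"{low ** 5:.2f}x")   # ~1.10x after five releases at 2% each
    print(f"{high ** 5:.2f}x")  # ~1.54x after five releases at 9% each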
Maybe relevant blogpost is https://devblogs.microsoft.com/python/python-311-faster-cpyt... dated Oct 2022. The team behind this and some other recent improvements to Python are at Microsoft.
The PR message with a riff off the Night Before Christmas is gold.
I wish the money could be spent on PyPy, but PyPy has its problems - you don't get a big boost on small programs that run often because the warmup time isn't that fabulous.
For larger programs you sometimes hit some incredibly complicated incompatibility problem. For me bitbake was one of those - it could REALLY benefit from PyPy but didn't work properly and I couldn't fix it.
If this works more reliably or has a faster warmup then... well, it could help to fill in some gaps.
Finally!
Regardless of the work being done in PyPy, Jython, GraalPy and IronPython, having a JIT in CPython seems to be the only way beyond the "C/C++/Fortran libs are Python" mindset.
Looking forward to its evolution, from 3.13 onwards.
This was a fantastic, very clear, write-up on the subject. Thanks for sharing!
If the further optimizations that this change allows, as explained at the end of this post, are covered as well as this one, it promises to be a very interesting series of blog posts.
Brandt gave a talk about this at the CPython Core Developer Sprint late last year https://www.youtube.com/watch?v=HxSHIpEQRjs
Why has it taken so much longer for CPython to get a JIT than, say, PyPy? I would imagine the latter has far less engineering effort and funding put into it.
"by Anthony Shaw, January 9, 2023"
2024, right?
What are those future optimizations he talks about?
he talks about an IL, but what's that IL? does that mean that the future optimization will involve that IL?
Maybe the article should be dated "January 9, 2024" ??? (or is it really a year old article?)
The article presents a copy-and-patch JIT as something new, but I remember DOS's QuickBASIC doing the same thing. It generated very bad assembly code in memory by patching together template assembly blocks with filled-in values, with a lot of INT instructions calling back into the QuickBASIC runtime, but it did compile, not interpret.
It never hurts for any language to get an uplift in performance. Exciting to see python getting that treatment
I'm even more excited about no-GIL in 3.13. I wonder how these two features will play together?
I love the description in the draft PR:
'Twas the night before Christmas, when all through the code
Not a core dev was merging, not even Guido;
The CI was spun on the PRs with care
In hopes that green check-markings soon would be there;
...
...
...
--enable-experimental-jit, then made it,
And away the JIT flew as their "+1"s okay'ed it.
But they heard it exclaim, as it traced out of sight,
"Happy JIT-mas to all, and to all a good night!"
https://github.com/python/cpython/pull/113465
Woah - very interesting!
At the moment, the JIT is only used if the function contains the JUMP_BACKWARD opcode which is used in the while statement but that will change in the future.
Isn't this the main reason why it's only a 2-9% improvement? Not much Python code uses the while statement in my experience.
a 2-9% improvement at global scale is insane! This is not a small number by any means.
Because it's already fast enough for most of us? Anecdote, but I've had my share of slow things in JavaScript that are not slow in Python. Try to generate a SHA256 checksum for a big file in the browser...
Good to see progress anyways.
Python's SHA256 is written in C. And I'd guess the Web Crypto API for JS is in the same ballpark.
SHA256 in pure Python would be unusably slow. In Javascript it would be at least usably slow.
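For reference, the usual way to hash a big file from Python just feeds buffers into hashlib; all the actual digest work happens in the C (OpenSSL) backend.

    import hashlib

    def sha256_of_file(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):  # stream in 1 MiB chunks
                h.update(chunk)                 # the digest update runs in C
        return h.hexdigest()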
Javascript is fast. Browsers are fast.
Have you tried to generate a SHA256 checksum for a file in the browser, no matter what crypto lib or API is available to you? Have you tried to generate it using the Python standard lib?
I did, and doing it in the browser was so bad that it was unusable. I suspect that it's not the crypto that's slow but the file reading. But anyway...
None would do that because:
Hence why comparing "pure python" to "pure javascript" is mostly irrelevant for most day to day tasks, like most benchmarks.
Well, no they were not for my use case. Browsers are really slow at generating file checksums.
The Python standard lib calls out to hand-optimized assembly language versions of the crypto algos. It is of no relevance to a JIT-vs-interpreted debate.
It absolutely is relevant to the "python is slow reee" nonsense tho, which is the subject. Python-the-language being slow is not relevant for a lot of the users, because even if they don't know they use Python mostly as a convenient interface to huge piles of native code which does the actual work.
And as noted upthread that's a significant part of the uptake of Python in scientific fields, and why PyPy, despite the heroic work that's gone into it, is often a non-entity.
Python is slow, reee.
This is a major problem in scientific fields. Currently there are sort of "two tiers" of scientific programmers: ones who write the fast binary libraries and ones that use these from Python (until they encounter e.g. having to loop and they are SOL).
This is known as the two language problem. It arises from Python being slow to run and compiled languages being bad to write. Julia tries to solve this (but fails due to implementation details). Numba etc try to hack around it.
PyPy is sadly vaporware. The failure from the beginning was not supporting the most popular (scientific) Python libraries. It nowadays kind of does, but it is brittle and often hard to set up. And anyway PyPy is not very fast compared to e.g. V8 or SpiderMonkey.
Reee.
Care to list some of those details? (I have zero knowledge of Julia)
This is quite a good intro: https://viralinstruction.com/posts/badjulia/
fyi: the author of that post is a current Julia user and intended the post as counterpoint to their normally enthusiastic endorsements. so while it is a good intro to some of the shortfalls of the language, I'm not sure the author would agree that Julia has "failed" due to these details
Yes, but it's a good list of the major problems, and laudable for a self-professed "stan" to be upfront about them.
It's my assessment that the problems listed there are a cause why Julia will not take off and we're largely stuck with Python for the foreseeable future.
It is worth noting that the first of the reasons presented is significantly improved in Julia 1.9 and 1.10 (released ~8 months and ~1 month ago). The time for `using BioSequences, FASTX` on 1.10 is down to 0.14 seconds on my computer (from 0.62 seconds on 1.8 when the blog post was published).
TTFX is indeed getting a lot better. But e.g. "using DynamicalSystems" is still over 5 seconds.
There is something big going on in caching the binaries, so there's a chance the TTFX will get workable.
The major problem in scientific fields is not this, but the amount of incompetence and the race-to-the-bottom environment which enables it. Grant organizations don't demand rigor and efficiency, they demand shiny papers. And that's what we get. With god awful code and very questionable scientific value.
There are such issues, but I don't think they are a very direct cause of the two language problem.
And even these issues are part of the greater problem of late stage capitalism that in general produces god awful stuff with questionable value. E.g. vast majority of industry code is such.
There is pleeeenty of mission critical stuff written in Python, for which interpreter speed is a primary concern. This has been true for decades. Maybe not in your industry, but there are other Python users.
Just for giggles I tried this and I'm getting ~200ms when reading and hashing a 50MB file in the browser (Chromium based) vs ~120ms using Python 3.11.6.
https://gist.github.com/jiripospisil/1ae8b877b1c728536e382fc...
https://jsfiddle.net/yebdnz6x/
Not so bad compared to what I tried a few years ago. Might finally be usable for us...
All major browsers have supported it for over eight years. Maybe the problem was between the seat and the keyboard?
https://caniuse.com/mdn-api_crypto_subtle
Maybe 8 years is not much in a career? Maybe we had to support one of those browsers that did not support it? Maybe your snarky comment is out of place? And even to this day it's still significantly slower than the Python stdlib according to the tester. So much for "why python not as fast as js, python is slow, blah blah blah".
Have you tried to do this in Python?
A Node comparison would be more appropriate.
The point of Python is quickly integrating a very wide range of fast libraries written in other languages though, you can't ignore that performance just because it's not written in Python.
Always interested in replies to this kind of comment, which basically boil down to "Python is so slow that we have to write any important code in C. And this is somehow a good thing."
I mean, it's great that you can write some of your code in C. But wouldn't it be great if you could just write your libraries in Python and have them still be really fast?
But wouldn't it be great if you could just write your libraries in Python
Everybody obviously wants that. The question is are you willing to lose what you have in order to hopefully, eventually, get there. If Python 3 development stopped and Python 4 came out tomorrow and was 5x faster than python 3 and a promise of being 50-100x faster in the future, but you have to rewrite all the libraries that use the C API, it would probably be DOA and kill python. People who want a faster 'almost python' already have several options to choose from, none of which are popular. Or they use Julia.
Why are you assuming that they'd have to rewrite all of their libraries? I don't see anything in the article that says that.
The reason this approach is so much slower than some of the other 'fast' pythons out there that have come before is that they are making sure you don't have to rewrite a bunch of existing libraries.
That is the problem with all the fast python implementations that have come before. Yes, they're faster than 'normal' python in many benchmarks, but they don't support the entire current ecosystem. For example Instagram's python implementation is blazing fast for doing exactly what Instagram is using python for, but is probably completely useless for what I'm using python for.
Aaah, so it's not this approach that you're saying is an issue, it's the ones that significantly change Python. Gotcha, that makes sense. Thank you.
Yes, but not so good when the JIT-ed Python can no longer reference the fast C code others have written. Every Python JIT project so far has suffered from incompatibility with some C-based Python extension, and users just go back to the slow interpreter in those cases.
"not so good when the JIT-ed Python can no longer reference those fast C code others have written"
I don't see an indication in the article that that's the case. Am I missing something?
This was a big obstacle for PyPy specifically.
https://www.pypy.org/posts/2011/05/numpy-follow-up-692862769...
https://doc.pypy.org/en/latest/faq.html#what-about-numpy-num...
I'm not sure at what version they gave up.
languages don’t need to all be good at the same thing. Python currently excels as a glue language you use to write drivers for modules written in lower-level languages, which is a niche that (afaik) nobody else seems to fill right now.
While I’m all for making Python itself faster, it would be a shame to lose the glue language par excellence.
Pure JS libs are more portable. In Python, portability doesn't matter as much.
it depends what speed is most important to you.
When I was a scientist, speed was getting the code written during my break, and if it took all afternoon to run that was fine because I was in the lab anyway.
Even as I moved more in the software engineering direction and started profiling code more, most of the bottlenecks came from things like "creating objects on every invocation rather than pooling them", "blocking IO", "using a bad algorithm" or "using the wrong data structure for the task": problems that exist in every language. Though a bad algorithm or the wrong data structure might matter less in a faster language, you're still leaving performance on the table.
The good thing is that Python has a very vibrant ecosystem filled with great libraries, so we don't have to write it in C, because somebody else already has. We can just benefit from that when the situation calls for it.
That really depends.
To make the issue clear, let's think about a similar situation:
bash is nice because you can plug together inputs and outputs of different sub-executables (like grep, sed and so on) and have a big "functional" pipeline deliver the final result.
Your idea would be "wouldn't it be great if you could just write your libraries in bash and have them still be really fast?". Not if you make bash into C, tanking productivity. And definitely not if that new bash can't run the old grep anymore (which is what usually is implied by the proposal in the case of Python).
Also, I'm fine with not writing my search engines, databases and matrix multiplication algorithm implementations in bash, really. So are most other people, I suspect.
Also, many proposals would weaken Python-the-language so it's not as expressive anymore. But I want it to stay as dynamic as it is. It's nice as a scripting language about 30 levels above bash.
As always, there are tradeoffs. Also with this proposal there will be tradeoffs. Are the tradeoffs worth it or not?
For the record, rewriting BLAS in Python (or anything else), even if the result was faster (!), would be a phenomenally bad idea. It would just introduce bugs, waste everyone's time, essentially be a fork of BLAS. There's no upside I can see that justifies it.
Between writing C code and writing Python code, there is also Cython.
But sure, I'm all for removing build steps and avoiding yet another layer.
Python is already fast where it matters: often, it is just used to integrate existing C/C++ libraries like numpy or pytorch. It is more an integration language than one where you write your heavy algorithms in.
For JS, during the time that it received its JITs, there was no cross-platform native code equivalent like wasm yet. JS had to compete with plugins written in C/C++, however. There was also competition between browser vendors, which gave the period the name "browser wars". Nowadays, at least, the speed improvements for the end user thanks to the JIT aren't that great either; Apple even provides a mode to turn off the JIT entirely for security.
Having recently implemented parallel image rendering in corrscope (https://github.com/corrscope/corrscope/pull/450), I can say that friends don't let friends write performance-critical code in Python. Depending on prebuilt C++ libraries hampers flexibility (eg. you can't customize the memory management or rasterization pipeline of matplotlib). Python's GIL inhibits parallelism within a process, and the workaround of multiprocessing and shared memory is awkward, has inconsistencies between platforms, and loses performance (you can't get matplotlib to render directly to an inter-process shared memory buffer, and the alternative of copying data from matplotlib's framebuffer to shared memory wastes CPU time).
Additionally, a lot of the libraries/ecosystem around shared memory (https://docs.python.org/3/library/multiprocessing.shared_mem...) seem poorly conceived. If you pre-open shared memory in a ProcessPoolExecutor's initializer function, you can't close it when the worker process exits (which might be fine, nobody knows!), but if you instead open and close a shared memory segment on every executor job, it measurably reduces performance, presumably from memory-mapping overhead or TLB/page-table thrashing.
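For what it's worth, here's a minimal sketch of the pre-open-in-initializer pattern I mean (names and sizes are made up, and note the worker-side handles are never explicitly closed, which is exactly the murky part):

    from concurrent.futures import ProcessPoolExecutor
    from multiprocessing import shared_memory

    _shm = None  # per-worker handle, attached once per worker process

    def _init_worker(shm_name):
        # attach once, instead of opening/closing a segment on every job
        global _shm
        _shm = shared_memory.SharedMemory(name=shm_name)

    def _fill(args):
        offset, value, length = args
        _shm.buf[offset:offset + length] = bytes([value]) * length
        return offset

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=1 << 20)
        try:
            jobs = [(i * 1024, i + 1, 1024) for i in range(4)]
            with ProcessPoolExecutor(initializer=_init_worker,
                                     initargs=(shm.name,)) as pool:
                list(pool.map(_fill, jobs))
            print(bytes(shm.buf[:4]))  # b'\x01\x01\x01\x01'
        finally:
            shm.close()
            shm.unlink()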
What would you use instead of Python?
Cython? :o
That's quite surprising to learn, as I didn't think the initializer ran in a specialized context (like a pthread_atfork postfork hook in the child).
What happens when you try to close an initializer-allocated SharedMemory object on worker exit?
But what is the counterfactual? Implementing the whole thing in Python? It seems much more work than forking/fixing matplotlib.
That's why optional GIL will be so important.
Well, imho the biggest problem with this approach to parallelism is that you're stepping out of the Python world with gc'ed objects etc. and into a world of ctypes and serialization. It's like you're not even programming Python anymore, but something closer to C with the speed of an interpreted language.
I think usually the term “browser wars” refers to the time when Netscape and Microsoft were struggling for dominance, which concluded in 2001.
JavaScript JITs only emerged around 2008 with SpiderMonkey’s TraceMonkey, JavaScriptCore’s SquirrelFish Extreme, and V8’s original JIT.
There were multiple browser wars, otherwise you wouldn't need -s there ;-)
A big part of what made Python so successful was how easy it was to extend with C modules. It turns out to be very hard to JIT Python without breaking these, and most people don’t want a Python that doesn’t support C extension modules.
The JavaScript VMs often break their extensions APIs for speed, but their users are more used to this.
On the other hand, rewriting the C modules and adapting them to a different C API is very straightforward after you've done 1 or 2 of such modules. Perhaps it's even something that could be done by training an LLM like Copilot.
That's breakage you'd have to tread carefully on; and given the 2to3 experience, there would have to be immediate reward to entice people to undertake the conversion. No one's interested in even minor code breakage for minor short-term gain.
JS doesn't really have the tradition of external modules that Python has, for a long time it only really existed inside the browser.
Which is why I'm shocked that Python's big "we're breaking backwards compatibility" release (Python 3) was mostly just for Unicode strings. It seems like the C API and the various __builtins__ introspection API thingies should've been the real focus on breaking backwards compatibility so that Python would have a better future for improvements like this.
anyone (company) stepping up and make the runtime as fast as modern JavaScript runtimes.
There are a lot of faster Python runtimes out there. Both Google and Instagram/Meta have done a lot of work on this, mostly to solve internal problems they've been having with Python performance. Microsoft has also done work on parallel Python. There's PyPy and Pythran and no doubt several others. However none of these attempts have managed to be 100% compatible with the current CPython (and more importantly the CPython C API), so they haven't been considered as replacements.
JavaScript had the huge advantage that there was very little mission-critical legacy JavaScript code around they had to take into consideration, and no C libraries that they had to stay compatible with, meaning that modern JavaScript runtime teams could more or less start from scratch. Also the JavaScript world at the time was a lot more OK with different JavaScript runtimes not being 100% compatible with each other. If you 'just' want a faster Python runtime that supports most of Python and many existing libraries, but are OK with having to rewrite some of your existing Python code or third-party libraries to make it work on that runtime, then there are several to choose from.
JS also had the major advantage of being sandboxed by design, so they could work from there. Most of the technical legacy centered around syntax backwards compatibility, but it's all isolated - so much easier to optimize.
Python with its C API basically gives you the keys to the kingdom at the machine code level. Modifying something that has an API to connect to essentially anything is not an easy proposition. Of course, it has the advantage that you can make Python faster through performance analysis and moving the expensive parts to optimized C code, if you have the resources.
Google/Instagram have done bits, but the company that's done the most serious work on Python performance is actually Oracle. GraalPython is a meaningfully faster JIT (430% faster vs 7% for this JITC!) and most importantly, it can utilize at least some CPython modules.
They test it against the top 500 modules on PyPI and it's currently compatible with about half:
https://www.graalvm.org/python/compatibility/
But investment continues. It has some other neat features too like sandboxing and the ability to make single-binary programs.
The GraalPython guys are working on the HPy effort as well, which is an attempt to give Python a properly specified and engine-neutral extension API.
Node.js and Python 3 came out at around the same time. Python had their chance to tell all the "mission critical legacy code" that it was time to make hard changes.
There have been several attempts. For example, Google tried to introduce a JIT in 2011 with a project named Unladen Swallow, but that ended up getting abandoned.
Unladen Swallow was massively over-hyped. It was talked about as though Google had a large team writing “V8 for Python”, but IIRC it was really just an internship project.
well, there were a couple of guys working on it
Teaching. So many colleges/unis I know teach "Introduction to Programming" with Python these days, especially to non-CS students/pupils.
I think python is very well suited to people who do computation in Excel spreadsheets. For actual CS students, I'd rather see something like scheme be a first language (but maybe I'm just an old person)
They do both Python and Scheme in the same Berkeley intro to CS class. But I think the point of Scheme is more to expand students' thinking with a very different language. The CS fundamentals are still covered more in the Python part of the course.
Easy, and the number-crunching is handled by libs optimized in (generally) C.
and FORTRAN.
Billions of dollars of product decisions use JS benchmark speed as one of the standard benchmarks to base buying decisions on (for good reason).
For machine learning speed compiling to the right CUDA / OpenCL kernel is much more crucial, so there's where the money goes.
Because enough users find the performance sufficient.
Because the reason why Python is one of the world's most popular languages (a large set of scientific computing C extensions) is bound to every implementation detail of the interpreter itself.
New runtimes like NodeJS have expanded JS beyond web, and JS's syntax has improved the past several years. But before that happened, Python on its own was way easier for non-web scripts, web servers, and math/science/ML/etc. Optimized native libs and ecosystems for those things got built a lot earlier around Python, in some cases before NodeJS even existed.
Python's syntax is still nicer for mathy stuff, to the point where I'd go into job coding interviews using Python despite having used more JS lately. And I'm comparing to JS because it's the closest thing, while others like Java are/were far more cumbersome for these uses.
I still scratch my head why it’s not installed by default on Windows.
I think the thing with Python is that it's always been "fast enough", and if not you can always reach out to natively implemented modules. On the flipside, JavaScript was the main language embedded in web browsers.
There has been a lot of competition to make browsers fast. Nowadays there are 3 main JS engines: V8 backed by Google, JavaScriptCore backed by Apple, and SpiderMonkey backed by Mozilla.
If Python had been the language embedded into web browsers, then maybe we would see 3 competing Python engines with crazy performance.
The alternative interpreters for Python have always been a bit more niche than CPython, but now that Guido works at Microsoft there has been a bit more of a push to make it faster.
JavaScript has to be fast because its users were traditionally captive on the platform (it was the only language in the browser).
Python's users can always swap out performance critical components to another language. So Python development delivered more when it focussed on improving strengths rather than mitigating weaknesses.
In a way, Python being slow is just a sign of a healthy platform ecosystem allowing comparative advantages to shine.
In lots of applications, all the computations already happen inside native libraries, e.g. Numpy, PyTorch, TensorFlow, JAX etc.
And if you have a complicated computation graph, there are already JITs on this level, based on Python code, e.g. see torch.compile, TF XLA (done by default via tf.function), JAX, etc.
It's also important to do JIT on this level, to really be able to fuse CUDA ops, etc. A generic Python JIT probably cannot really do this, as this is CUDA specific, or TPU specific, etc.
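A minimal sketch of what that graph-level JIT looks like from user code, assuming PyTorch 2.x and its torch.compile entry point; the compiler can then fuse the elementwise chain into a single kernel on the backend:

    import torch

    @torch.compile  # PyTorch 2.x graph-level JIT
    def fused(x):
        # an elementwise chain that a graph compiler can fuse into one kernel
        return torch.relu(x * 2.0 + 1.0)

    print(fused(torch.randn(8)))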
You might want to checkout Mojo, which is not a runtime but a different language, but also designed to be a superset of Python. Beware though that it's not yet open source, which is slated for this Q1
https://docs.modular.com/mojo/manual/
edit: The main point I forgot to mention - it aims to compete with "low-level" languages like C and Rust in performance
Meta has actually been doing that — helping improve python's speed — with things like [1,2]
[1] https://peps.python.org/pep-0703/
[2] https://news.ycombinator.com/item?id=36643670
Two reasons:
1. Javascript is a less dynamic language than Python and numbers are all float64 which makes it a lot easier to make fast.
2. If you want to run fast code on the web you only have one option: make Javascript faster. (Ok we have WASM now but that didn't exist at the time of the Javascript Speed wars.) If you want to run fast code on your desktop you have a MUCH easier option: don't use Python.
Because it doesn't useGrossCamelCaseAsOften