Asyncio allows you to replace the event loop with an implementation of your own. For Temporal Python we represent workflows as custom, durable asyncio event loops, so things like asyncio.sleep are durable timers (i.e. code can resume on another machine, so you can sleep for weeks). Here is a post explaining how it's done: https://temporal.io/blog/durable-distributed-asyncio-event-l....
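A minimal sketch of the general idea (not Temporal's actual implementation, which the post describes): if you drive the coroutine yourself, you decide what an awaited "sleep" means, so it can be recorded as a durable timer instead of a real wait. The DurableSleep and run_deterministically names are made up for illustration.

```python
class DurableSleep:
    """Awaitable that yields a command object back to the custom loop."""
    def __init__(self, seconds):
        self.seconds = seconds

    def __await__(self):
        # Hand control to whoever is driving the coroutine.
        yield self


def run_deterministically(coro, timer_log):
    """Step the coroutine by hand, deciding ourselves how sleeps are satisfied."""
    try:
        command = coro.send(None)
        while True:
            if isinstance(command, DurableSleep):
                # A durable runtime would persist the timer and suspend here;
                # this sketch just records it and resumes immediately.
                timer_log.append(command.seconds)
            command = coro.send(None)
    except StopIteration as stop:
        return stop.value


async def workflow():
    await DurableSleep(7 * 24 * 3600)  # "sleep" for a week
    return "done"


log = []
print(run_deterministically(workflow(), log), log)  # done [604800]
```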
The biggest problem with asyncio is how easy and common it is in Python to block the asyncio thread with synchronous calls, gumming up the whole system. Python sorely needs a static analysis tool that can build a call graph to help detect whether a known thread-blocking call is invoked directly or indirectly from an async def.
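For anyone who hasn't hit this, a quick illustration of how a single synchronous call starves every other task on the loop:

```python
import asyncio
import time


async def blocking_handler():
    time.sleep(2)              # blocks the whole event loop for 2 seconds
    # await asyncio.sleep(2)   # the non-blocking equivalent


async def heartbeat():
    for _ in range(4):
        print("tick", time.monotonic())
        await asyncio.sleep(0.5)   # gets no CPU time while time.sleep() runs


async def main():
    await asyncio.gather(blocking_handler(), heartbeat())


asyncio.run(main())
```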
Why not use threads? I'm still trying to understand whether Python is really meant for concurrency. Asyncio always felt like it's barely hanging on (maybe it's just me), but C# has a cleaner async implementation in my opinion.
We need hundreds or thousands of them and they need to be cheap, and, in Temporal's case, the cooperative multitasking needs to be deterministic, which native threads are not. Temporal's C# SDK uses .NET tasks the same way, and tasks have a lot of parallels with asyncio (though they combine with threads in one static thread pool, which can be confusing in itself).
Is C# generally faster or more efficient here?
Forgive my ignorance, but I always figured Python just doesn't scale well for this use case.
Massively so. You can spawn 1 million tasks waiting on a Task.Delay and the thread pool will trivially survive the punishment. Not so much with Python, which has a lot of problems scaling to all cores and also deals with the interpreter lock (I assume it is better now and no longer pure GIL style?). The advice for Python seems to be to use multiprocessing, and even then there are orders of magnitude of difference in CPU time spent per operation. One language compiles to codegen that is not far from what you get out of GCC or Clang; the other is purely interpreted, which means a significant difference in resource utilization.
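For the Python side of that comparison, a rough way to measure the per-task overhead yourself (the task count and sleep duration here are arbitrary):

```python
import asyncio
import time


async def idle():
    await asyncio.sleep(1)


async def main(n=100_000):
    start = time.perf_counter()
    # Spawn n tasks that each just wait on a timer, then time setup + completion.
    tasks = [asyncio.create_task(idle()) for _ in range(n)]
    await asyncio.gather(*tasks)
    print(f"{n} sleeping tasks finished in {time.perf_counter() - start:.2f}s")


asyncio.run(main())
```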
Although the top positions do a lot of "cheating", take a look at where most Python entries sit: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...
The fastest Python entry starts at 234th place while the .NET one is at 16th (and that with a still somewhat heavy implementation; the upper bracket of Fortunes is bottlenecked by the DB driver as well, which is what the top entries win with).
What a great response.
I had the same feeling here: .NET is just faster, but I think a lot of teams get into this workflow where they already have a Python solution and they don't want to rewrite it.
I'm not saying everything needs to be written in a systems language like Rust, but Python always strikes me as a weird choice where performance is a concern.
I'm pretty amazed to see a JavaScript runtime ranking so high here
Make sure to examine the code of the top entries if this interests you; there are a lot of shenanigans there. Though some are much more well-behaved than others (.NET's one does look sketchy, unfortunately, even if it doesn't have to be...). Also, Solid.js is pretty much a thin wrapper on top of syscalls, so it's best to look at other JS submissions.
Do you have a source with more realistic benchmarks?
I think an argument could be made that without real-life use cases, these metrics are useless.
These, technically, are realistic benchmarks, as they execute your average web application code... or what used to remotely resemble one.
Comparing purely interpreted languages, or interpreted + weak JIT + dynamically typed ones (Python, JavaScript, Ruby, PHP, the BEAM family), to the ones that get compiled in their entirety to machine code (C#, Java, Go, C/C++/Rust, etc.) is not rocket science.
There is a hard ceiling on how fast an interpreter can go: it has to parse text first (if it's a pure scripting language) and then interpret the AST, or it has to interpret bytecode, but either way it means spending dozens, hundreds, or even thousands of CPU cycles per individual operation in the worst case.
Consider addition: it can be encoded in bytecode as, let's say, a single 0x20 byte followed by two numeric literals, each taking 4 bytes (i32). In order to execute such an operation, an interpreter has to fetch the opcode and its arguments, then dispatch on a jump table (we assume it's an optimized interpreter like Python's) to the specific opcode handler, which will then load these two numbers into CPU registers, do the addition and store the evaluation result, and then jump back to fetch the next opcode, dispatch it, and so on and so forth.
Each individual step like this takes at least 10 cycles (or 25, or 100, depending on complexity) even on a fancy new big core, and can also have hidden latency due to how caches and memory prefetching work. Now, when a CPU executes machine code, a modern core can execute multiple additions per cycle, like 4 or even 6 on the newest cores. This alone means a 20-60x difference (because IPC is never perfect), and this is with the simplest operation that has just two operands and no data dependencies or other complex conditions.
Once you know this, it becomes easier to reason about overhead of interpreted languages.
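To make the dispatch overhead described above concrete, here is a toy sketch of that kind of loop in Python (the opcodes are made up; this is not CPython's actual bytecode):

```python
import struct

OP_ADD, OP_HALT = 0x20, 0x00


def run(code):
    """Interpret a tiny bytecode: every ADD pays for a fetch, a dispatch,
    operand decoding, and a store before the next fetch can start."""
    pc, stack = 0, []
    while True:
        op = code[pc]                                        # fetch opcode
        if op == OP_ADD:                                     # dispatch (real VMs use a jump table)
            a, b = struct.unpack_from("<ii", code, pc + 1)   # decode the two i32 literals
            stack.append(a + b)                              # execute and store the result
            pc += 9                                          # 1 opcode byte + 2 * 4 operand bytes
        elif op == OP_HALT:
            return stack


# ADD 2 3, then HALT -> leaves 5 on the stack
program = bytes([OP_ADD]) + struct.pack("<ii", 2, 3) + bytes([OP_HALT])
print(run(program))   # [5]
```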
So I take it that unless you're doing something that requires quick prototyping or a specific machine learning library, one should always avoid interpreted languages when performance is a concern?
I love C# and I'm really productive in it, but I've worked at so many places that have tried to get performance out of Python.
https://youtu.be/v1CmGbOGb2I?si=8aGIDjI44pZGiTAU&t=234
Yes, C# is faster just because it's a faster runtime, but Python does scale reasonably well for this use case.
Thanks for the info!
Good luck with that. A simple read() might block or not, depending on what the descriptor is and how it's configured. How would you statically detect this?
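To make that concrete, the same os.read() either blocks or raises depending on how the descriptor was configured at runtime, which a purely static call graph cannot see (Unix-only sketch):

```python
import os

r, w = os.pipe()
print(os.get_blocking(r))    # True: os.read(r, 1) here would block forever
os.set_blocking(r, False)    # flip the very same descriptor to non-blocking
try:
    os.read(r, 1)            # now raises instead of blocking
except BlockingIOError:
    print("would have blocked; raised instead")
```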
As with many static analyzers in ambiguous situations, you have to decide whether the possibility of false positives is worth the safety. In this case I'd flag it, knowing that it could be marked as ignored as needed (same with open() if it were used).
I have toyed with a MyPy plugin to do this, but the arbitrary state storage/cache is a bit too limited to do good call graph analysis and function coloring (metadata can live on types but not on functions). And there just aren't many other good _extensible_ static analysis libraries. Languages like Go have static analysis libraries that let you store/cache arbitrary state/facts on all constructs.
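Even without a full call graph, a shallow version of the check is easy to prototype with the standard-library ast module; the BLOCKING set here is a made-up stand-in for a real catalogue of thread-blocking calls, and it only catches direct calls inside an async def:

```python
import ast

BLOCKING = {("time", "sleep"), ("requests", "get"), (None, "open")}


class BlockingCallChecker(ast.NodeVisitor):
    """Flag known blocking calls made directly inside an async def."""

    def __init__(self):
        self.warnings = []
        self._in_async = False

    def visit_AsyncFunctionDef(self, node):
        prev, self._in_async = self._in_async, True
        self.generic_visit(node)
        self._in_async = prev

    def visit_Call(self, node):
        if self._in_async:
            name = None
            if isinstance(node.func, ast.Attribute) and isinstance(node.func.value, ast.Name):
                name = (node.func.value.id, node.func.attr)   # e.g. time.sleep(...)
            elif isinstance(node.func, ast.Name):
                name = (None, node.func.id)                   # e.g. open(...)
            if name in BLOCKING:
                self.warnings.append((node.lineno, name))
        self.generic_visit(node)


source = """
import time

async def handler():
    time.sleep(1)
"""
checker = BlockingCallChecker()
checker.visit(ast.parse(source))
print(checker.warnings)   # [(5, ('time', 'sleep'))]
```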
You must also detect "too long loops"… which is dangerously close to solving the halting problem.
This and other similar CPU-bound operations may be unreasonable to detect at analysis time. I would encourage users to run with asyncio debug mode in tests to help catch these: https://docs.python.org/3/library/asyncio-dev.html#debug-mod.... Granted, that's runtime, but there are limits to what static analysis can do; it's just meant to help catch obvious issues.
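For reference, debug mode can be enabled directly on asyncio.run, and the slow-callback threshold is tunable; a step that hogs the loop gets reported as a warning:

```python
import asyncio
import logging
import time

logging.basicConfig()   # asyncio's "Executing ... took N seconds" warnings go to the log


async def suspicious():
    time.sleep(0.3)      # synchronous call that hogs the loop


async def main():
    asyncio.get_running_loop().slow_callback_duration = 0.1   # 0.1s is the default
    await suspicious()


asyncio.run(main(), debug=True)
```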
In the workflow example, is `Purchaser.purchase` supposed to be `do_purchase`?
Fixed, thanks! (it had undergone revisions, hence the function name mismatch)
Temporal is seriously cool.
When I found out how it had implemented the asyncio event loop, it was a real mind-blown moment.
How? Is there a source to read?
Maybe this is a bad idea, but... What if, instead of the current way where every call (including native stuff) is sync by default, it were the other way around? You'd quickly whitelist basic things like arithmetic and data struct access as sync, then you could maybe detect other things that should be sync if the event loop is spinning suspiciously fast.