
How Python asyncio works: recreating it from scratch

kodablah
21 replies
4d12h

Asyncio allows you to replace the event loop with an implementation of your own. For Temporal Python we represent workflows as custom, durable asyncio event loops, so things like asyncio.sleep are durable timers (i.e. code can resume on another machine, so you can sleep for weeks). Here is a post explaining how it's done: https://temporal.io/blog/durable-distributed-asyncio-event-l....
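A minimal sketch of the underlying mechanism (not Temporal's actual loop): an `async def` coroutine can be driven by hand with `send()`, which is all any event loop, custom or not, really does — it decides when each suspended coroutine resumes:

```python
import asyncio

async def workflow():
    await asyncio.sleep(0)   # suspension point: control returns to the driver
    return "done"

# Step the coroutine manually, the way an event loop does internally.
coro = workflow()
coro.send(None)              # run until the first suspension point
try:
    coro.send(None)          # resume; the coroutine runs to completion
    result = None
except StopIteration as stop:
    result = stop.value      # the coroutine's return value, "done"
```

A custom loop like Temporal's sits in the driver's seat here, which is why it can persist state between those `send()` calls and resume elsewhere.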

The biggest problem with asyncio is how easy and common it is in Python to block the asyncio thread with synchronous calls, gumming up the whole system. Python sorely needs a static analysis tool that can build a call graph to detect whether a known thread-blocking call is reachable, directly or indirectly, from an async def.
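The simplest version of such a check can be sketched with the stdlib `ast` module (the deny-list names here are hypothetical, and this only catches direct calls — the hard part, the cross-function call graph, is omitted):

```python
import ast

# Hypothetical deny-list of known thread-blocking call names.
BLOCKING_NAMES = {"sleep", "urlopen", "recv"}

class BlockingCallFinder(ast.NodeVisitor):
    """Flags direct calls to known blocking names inside `async def` bodies."""

    def __init__(self):
        self.findings = []
        self._in_async = False

    def visit_AsyncFunctionDef(self, node):
        prev, self._in_async = self._in_async, True
        self.generic_visit(node)
        self._in_async = prev

    def visit_FunctionDef(self, node):
        prev, self._in_async = self._in_async, False
        self.generic_visit(node)
        self._in_async = prev

    def visit_Call(self, node):
        # Handles both `sleep(...)` and `time.sleep(...)` call shapes.
        name = getattr(node.func, "attr", None) or getattr(node.func, "id", None)
        if self._in_async and name in BLOCKING_NAMES:
            self.findings.append((node.lineno, name))
        self.generic_visit(node)

source = """
import time
async def handler():
    time.sleep(5)   # blocks the event loop thread
"""
finder = BlockingCallFinder()
finder.visit(ast.parse(source))
```

A real tool would also have to follow calls out of the async def into ordinary functions, which is exactly the call-graph bookkeeping that is hard to do with today's Python analysis libraries.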

999900000999
11 replies
2d18h

Why not use threads? I'm still trying to understand if Python is really meant for concurrency. Asyncio always felt like it's barely hanging on. Maybe it's just me, but C# has a cleaner async implementation in my opinion.

kodablah
10 replies
2d6h

We need hundreds or thousands of them, and they need to be cheap; in Temporal's case, the cooperative multitasking also needs to be deterministic, which native threads are not. Temporal's C# SDK uses .NET tasks the same way, and tasks have a lot of parallels with asyncio (though they combine with threads in one static thread pool, which can be confusing in itself).

999900000999
9 replies
2d5h

Is C# generally faster or more efficient here?

Forgive my ignorance, but I always figured Python just doesn't scale well for this use case.

neonsunset
6 replies
2d4h

Massively so. You can spawn 1 million tasks waiting on a Task.Delay and the thread pool will trivially survive the punishment. Not so much with Python, which has a lot of problems scaling to all cores and also contends with the interpreter lock (I assume it is better now and no longer pure GIL style?). The advice for Python seems to be to use multiprocessing, and even then there is an orders-of-magnitude difference in CPU time spent per operation. One language compiles to code generation that is not far from what you see out of GCC or Clang; the other is purely interpreted, which means a significant difference in resource utilization.
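To be fair, asyncio tasks themselves are also cheap — it's the per-operation interpreter cost, not the task abstraction, where the gap comes from. A rough Python counterpart of the Task.Delay experiment (task count and delay chosen arbitrarily):

```python
import asyncio
import time

async def main(n: int = 10_000) -> float:
    start = time.perf_counter()
    # All n sleeps run concurrently on one thread, so total wall time
    # is roughly one sleep (~0.05s), not n * 0.05s.
    await asyncio.gather(*(asyncio.sleep(0.05) for _ in range(n)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
```

Ten thousand concurrent sleeps complete in well under a second of wall time; the CPU-time-per-operation difference the parent describes shows up once the tasks do real work in Python bytecode.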

Although top positions do a lot of "cheating", take a look where most Python entries sit: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...

The fastest Python entry starts at 234th place while the .NET one is at 16th (and that is with a still somewhat heavy implementation; the upper bracket of Fortunes is bottlenecked by the DB driver, which is what the top entries win with).

999900000999
5 replies
2d3h

What a great response.

I had the same feeling here: .NET is just faster. But I think a lot of teams get into this workflow where they already have a Python solution and they don't want to rewrite it.

I'm not saying everything needs to be written in a systems language like Rust, but Python always strikes me as a weird choice where performance is a concern.

I'm pretty amazed to see a JavaScript runtime ranking so high here.

neonsunset
3 replies
2d3h

Make sure to examine the code of the top entries if this interests you; there are a lot of shenanigans there. Though some are much better behaved than others (.NET's does look sketchy, unfortunately, even if it doesn't have to...). Also, the Solid.js entry is pretty much a thin wrapper on top of syscalls, so it's best to look at other JS submissions.

999900000999
2 replies
2d1h

Do you have a source with more realistic benchmarks?

I think an argument could be made that without real life use cases, these metrics are useless.

neonsunset
1 replies
2d1h

These, technically, are realistic benchmarks as they execute your average web application code...or what used to remotely resemble one.

Comparing purely interpreted languages, or dynamically typed ones with a weak JIT (Python, JavaScript, Ruby, PHP, the BEAM family), to those that get compiled in their entirety to machine code (C#, Java, Go, C/C++/Rust, etc.) is not rocket science.

There is a hard ceiling on how fast an interpreter can go. It has to parse text first (if it's a pure scripting language) and then interpret the AST, or it has to interpret bytecode; either way, that means spending dozens, hundreds, or even thousands of CPU cycles per individual operation in the worst case.

Consider addition: it can be encoded in bytecode as, say, a single 0x20 byte followed by two numeric literals, each taking 4 bytes (i32). To execute such an operation, an interpreter has to fetch the opcode and its arguments, dispatch on a jump table (assuming it's an optimized interpreter like Python's) to the specific opcode handler, which then loads the two numbers into CPU registers, does the addition, stores the result, and jumps back to fetch and dispatch the next opcode, and so on.

Each individual step like this takes at least 10 cycles (or 25, or 100, depending on complexity) even on a fancy new big core, and can also incur hidden latency from the way caches and memory prefetching work. When a CPU executes machine code, by contrast, a modern core can retire multiple additions per cycle, like 4 or even 6 on the newest cores. That alone means a 20-60x difference (because IPC is never perfect), and this is with the simplest operation, one with just two operands and no data dependencies or other complicating conditions.
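The fetch/decode/dispatch cycle described above is easy to see with CPython's `dis` module — even a one-line addition compiles to several bytecode instructions, each of which pays the dispatch cost (exact opcode names vary by CPython version):

```python
import dis

def add(a, b):
    return a + b

# Each of these opcodes goes through the interpreter's dispatch loop.
ops = [ins.opname for ins in dis.get_instructions(add)]
# Typically something like: LOAD_FAST, LOAD_FAST, BINARY_OP, RETURN_VALUE
```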

Once you know this, it becomes easier to reason about overhead of interpreted languages.

999900000999
0 replies
1d19h

So I take it that unless you're doing something which requires quick prototyping, or you need a specific machine learning library, one should always avoid interpreted languages when performance is a concern?

I love C# and I'm really productive in it, but I've worked at so many places which have tried to get performance out of Python.

mattpallissard
0 replies
1d

> I'm pretty amazed to see a JavaScript runtime ranking so high here

The bodies of many, many PhD students have been hurled unto a big mound to optimize JavaScript within an inch of its life.

https://youtu.be/v1CmGbOGb2I?si=8aGIDjI44pZGiTAU&t=234

kodablah
1 replies
2d4h

Yes, C# is faster just because it's a faster runtime, but Python does scale reasonably well for this use case.

999900000999
0 replies
2d4h

Thanks for the info!

guappa
3 replies
4d9h

Good luck with that. A simple read() might block or not, depending on what the descriptor is and how it's configured. How would you statically detect this?
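Concretely, the very same `os.read` call either blocks or raises depending only on runtime file-descriptor state, which no static analyzer can see (POSIX-flavored sketch):

```python
import os

r, w = os.pipe()

os.set_blocking(r, False)       # flip the fd to non-blocking at runtime
try:
    os.read(r, 1)               # empty non-blocking pipe: raises, doesn't block
    blocked = None
except BlockingIOError:
    blocked = False

os.write(w, b"x")
data = os.read(r, 1)            # data is available now, so it returns at once

os.close(r)
os.close(w)
```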

kodablah
2 replies
4d6h

Like many static analyzers in ambiguous situations, you have to decide whether the possibility of false positives is worth the safety. In this case I'd flag it knowing that it could be marked ignored as needed (same with open() if it was used).

I have toyed with a MyPy plugin to do this, but the arbitrary state storage/cache is a bit too limited for good call-graph and function-coloring analysis (metadata can be attached to types but not to functions). And there just aren't many other good _extensible_ static analysis libraries. Languages like Go have static analysis libraries that let you store/cache arbitrary state/facts on all constructs.

guappa
1 replies
4d5h

You must also detect "too long loops"… which is dangerously close to solving the halting problem.

kodablah
0 replies
4d4h

This and other similar CPU-bound operations may be unreasonable to detect at analysis time. I would encourage users to run with asyncio debug mode in tests to help catch these: https://docs.python.org/3/library/asyncio-dev.html#debug-mod.... Granted that's runtime, but there are limits to what static analysis can do; it's just meant to help catch obvious issues.

dnadler
1 replies
2d5h

In the workflow example, is `Purchaser.purchase` supposed to be `do_purchase`?

kodablah
0 replies
2d2h

Fixed, thanks! (It had undergone revisions, hence the function name mismatch.)

chippiewill
1 replies
4d7h

Temporal is seriously cool.

When I found out how it had implemented the asyncio event loop it was a real mind blown moment.

swyx
0 replies
1d23h

how? is there a source to read?

hot_gril
0 replies
3d21h

Maybe this is a bad idea, but... What if, instead of the current way where every call (including native stuff) is sync by default, it were the other way around? You'd quickly whitelist basic things like arithmetic and data struct access as sync, then you could maybe detect other things that should be sync if the event loop is spinning suspiciously fast.

throwawaymaths
20 replies
4d13h

Completely insane that Python decided to hijack "def" to create an object that very much is not a function. The least they could have done was create a different keyword.

throwaway81523
12 replies
4d11h

They somewhat fixed that with "async def" but the whole thing is a huge tragedy anyway.

cuu508
11 replies
4d11h

Do you mean they fixed the hijacking by adding the "async" keyword?

By the whole thing do you mean the async functionality in Python in general? What is the tragedy?

jononomo
5 replies
4d11h

I'm also curious to know what the tragedy is.

darthrupert
4 replies
4d9h

Dunno if they meant this, but I feel that adding async to a language/ecosystem where it previously wasn't used makes everything awkward for a long time. Especially when it's done as liberally as Python did it, with both sync and async styles still supported.

It feels like the worst of both worlds at times. I think Rust suffers from this too, whereas Node.js has managed to make async the default mode of operation for enough things that it doesn't feel so bad. Especially with Typescript.

jononomo
0 replies
4d6h

I actually learned a lot from watching that video.

andrewstuart
1 replies
4d6h

> whereas Node.js has managed to make async the default mode of operation for enough things that it doesn't feel so bad

Javascript is inherently async

darthrupert
0 replies
10h30m

Well yeah. That's probably why it feels better there.

throwaway81523
4 replies
4d8h

Yes, I mean the async keyword fixes the hijacking although the hijacked version is still there for compatibility. So much for Python 3 breaking stuff for the purpose of fixing old cruft.

The tragedy is the whole Python 3 debacle, and if they were going to make a break like that, they should have introduced BEAM-like concurrency instead of async.

For reasons I don't understand, the Python community seems to have an intense phobia and hatred towards threads. I've always used threads instead of async though, and by writing in an Erlang-like style (message passing and no shared mutable data) you can keep things reliable. This has worked ok for me with 1000s of threads, though not millions or anything like that.
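A minimal sketch of that Erlang-ish style in Python — each thread owns its state and communicates only through queues (the names here are illustrative):

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # The thread's "mailbox" loop: no shared mutable state, only messages.
    for msg in iter(inbox.get, None):        # None is the stop sentinel
        outbox.put(msg * 2)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for i in range(3):
    inbox.put(i)
inbox.put(None)                              # ask the worker to stop
t.join()

results = [outbox.get() for _ in range(3)]   # [0, 2, 4]
```

Because `queue.Queue` is thread-safe and nothing else is shared, the GIL and locking concerns mostly disappear from the application code.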

Python by now is mostly legacy cruft, it feels to me sometimes. Everything else of course has its own problems.

hot_gril
2 replies
3d20h

The problem with threads in Python is they're 1:1 with OS threads in every major implementation. It's not like Golang where you have greenthreading, and you don't even get all the benefits of regular threads due to the GIL. It's been too long since I've used Erlang, but the way I remember it, the concurrency is almost like using RPCs. Which is great in some applications, but Python users want shared mutable data for simple things and will make microservices for larger systems.

For a simple language, Python has so many different ways to do concurrency, some of which are misleading to new users. Event loop ("async") would've been nice and simple from the start, but they were late to that party. Everything vaguely related to packaging is also confusing, from the __init__.py files to the PIP/Conda/whatever stuff. They had a whole breaking Py2->3 transition and still didn't fix that.

throwaway81523
1 replies
3d18h

The GIL is an implementation artifact mostly due to the reference counting memory management. Besides phobia towards threads, the Python crowd also seems phobic towards tracing garbage collection. There is an ongoing project to make CPython run without the GIL yet still use reference counting. I wonder how many bugs this will surface or introduce.

Most concurrent Python code that I've written has not been cpu intensive, FWIW. Just an application talking to a bunch of i/o ports or sockets, but mostly idle. My CPU intensive stuff has tended to be single threaded so I can just launch instances in separate processes. YMMV.

Yeah the tooling situation is crazy enough that I don't begin to understand it.

hot_gril
0 replies
3d18h

Yeah, Python is made for non-CPU-intensive, IO-bound concurrent code. But that use case is a great fit for async/event-loop.

bormaj
0 replies
4d6h

As someone who started learning to code with C++ followed by Python and then Rust, proper threading approaches weren't drilled into my head until I started working with Rust. I agree with you that threading with message passing is a safer approach. Threading does add a level of complexity that isn't always necessary.

AFAIK it's not outlined in a PEP, but my intuition is that Python was designed to be accessible and "simple". It wasn't built to be a systems programming language. So for all its faults with the GIL I don't think it should be a major point of contention. For any task it's always about choosing the right tool for the job.

There is considerable effort at the moment to remove the GIL in future versions and implement better threading support. At 3.11/3.12, it has come a long way in terms of performance.

maxbond
2 replies
4d13h

It's a function; it's just been transformed into a function with a different return type (that of a generator or coroutine; you can see the type signatures here [1]). You could do something similar in vanilla Python without language-level sugar, e.g. using decorators.

[1] https://docs.python.org/3/library/typing.html#typing.Generat...

As a caveat, I point out the type signature for academic interest, I prefer to use the simpler Iterable and Awaitable types in type signatures.
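The point is easy to check — the `def` still binds an ordinary function; it's only the *call* that produces the generator object:

```python
import inspect

def countdown(n):
    while n > 0:
        yield n
        n -= 1

assert inspect.isfunction(countdown)   # still a plain function
g = countdown(3)
assert inspect.isgenerator(g)          # calling it creates the generator
assert list(g) == [3, 2, 1]
```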

xmcqdpt2
1 replies
4d7h

I think the type signature and example make the parent's point, frankly. It has three type parameters, one of which is contravariant, two different usages of the yield keyword, and two exit points.

Now granted most uses of generators are simpler than that, but still that's a lot of special syntax. And if you are writing Python without type-checking, that's a lot of bookkeeping you'd have to do when reading code.

maxbond
0 replies
3d21h

That type signature is gnarly, I agree. I've seen people get tripped up because they think the function is the generator, and are confused about what behavior to expect from concurrent readers of different instances of the generator (because they don't realize they are different instances in the first place). So my position is more about how to work with Python effectively and understand it accurately than a defense of its syntax.
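The instance confusion shows up in a couple of lines — each call creates an independent generator with its own state:

```python
def ticker():
    n = 0
    while True:
        n += 1
        yield n

a, b = ticker(), ticker()
assert (next(a), next(a)) == (1, 2)
assert next(b) == 1     # b advances independently of a
```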

zo1
0 replies
4d8h

Are you talking about decorators? Very confused trying to parse what you're saying in the context of python.

hot_gril
0 replies
3d21h

The bigger problem with Python to begin with is that you have to use def for functions, meaning you can't inline them the way you can with JS arrow functions or C++ lambdas. And this isn't a language-purity thing; it's just inconvenient.

globular-toast
0 replies
4d9h

What's the actual problem with it? Or is this just a language "purity" thing?

cqqxo4zV46cp
0 replies
4d8h

What? There is no tragedy here. It sounds like it just put DX semantics ahead of…what? some sort of language purity? I can’t say that I’ve ever thought “gosh! that async function is…an async function, and that’s tripped me up!” before. Not even once.

krackers
6 replies
4d13h

This seems to "busy wait" when sleeping, i.e. the event loop keeps running even if nothing is currently runnable. I remember reading about another toy implementation which handled it correctly (the way I believe it's actually done in asyncio) by tracking the tasks' next-runnable times in sorted order, so that if nothing is currently runnable you can put the event loop to sleep. That was then extended so that instead of next-runnable depending only on wall-clock time, a task could also depend on a socket or the like, which lets you use select with a timeout.
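The non-busy-waiting approach can be sketched in a few lines — timers go in a heap, and when nothing is runnable the loop sleeps exactly until the earliest deadline (a toy, not asyncio's actual implementation):

```python
import heapq
import time

class TinyTimerLoop:
    """Toy loop: callbacks fire in deadline order; idle time is slept, not spun."""

    def __init__(self):
        self._timers = []   # heap of (wake_time, seq, callback)
        self._seq = 0       # tiebreaker so callbacks never get compared

    def call_later(self, delay, callback):
        heapq.heappush(self._timers, (time.monotonic() + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._timers:
            when, _, callback = heapq.heappop(self._timers)
            pause = when - time.monotonic()
            if pause > 0:
                time.sleep(pause)   # sleep until the next deadline, no spinning
            callback()

order = []
loop = TinyTimerLoop()
loop.call_later(0.02, lambda: order.append("second"))
loop.call_later(0.01, lambda: order.append("first"))
loop.run()
```

The socket extension mentioned above amounts to replacing the `time.sleep(pause)` with `select(..., timeout=pause)`, so I/O readiness can wake the loop early.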

theelous3
1 replies
4d6h

There's nothing really wrong with this, on one level. Your event loop doesn't even have to loop: it can just start by running your main and end when that ends. Imagine starting a server with a while-true loop that waits on a socket and, on some shutdown condition or interrupt, ends, ending the program. No busy waiting, no messing with sleeps or sockets from the event loop's perspective. Run-until-complete vs. run-forever.

Better off, if making some toy loop, not to bother with the forever case.

immibis
0 replies
3d16h

How do you wait for a socket if the event loop doesn't support waiting for sockets?

chippiewill
1 replies
4d7h

In asyncio itself you can have custom event loop implementations in addition to the out of the box one.

One popular implementation is uvloop (https://github.com/MagicStack/uvloop) which basically just implements the loop using libuv which takes care of doing stuff like `select` as you describe.

LtWorf
0 replies
4d2h

I've encountered bugs with it. Unix sockets not being created at all. So I went back to the regular loop.

coolThingsFirst
0 replies
3d21h

What's the best book to learn about things like this?

tda
2 replies
4d12h

There is a brilliant talk by David Beazley on asyncio. I used it to write a Discrete Event Simulation tool. Pretty cool to be able to make your own implementation of asyncio and replace the system clock with simulated time.
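The simulated-clock trick fits in a few lines — instead of sleeping, the loop jumps its notion of "now" to the next scheduled event, so arbitrarily long waits complete instantly (an illustrative toy, not the talk's code):

```python
import heapq

class SimLoop:
    """Discrete event simulation: time advances by jumping, never by sleeping."""

    def __init__(self):
        self.now = 0.0
        self._events = []   # heap of (when, seq, callback)
        self._seq = 0

    def call_at(self, when, callback):
        heapq.heappush(self._events, (when, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            when, _, callback = heapq.heappop(self._events)
            self.now = max(self.now, when)   # jump the simulated clock forward
            callback()

loop = SimLoop()
fired_at = []
loop.call_at(3_600_000, lambda: fired_at.append(loop.now))  # far-future event
loop.run()   # returns immediately; no real waiting happens
```

This is the same idea behind durable timers: since "now" is just a number the loop controls, a sleep of weeks costs nothing at run time.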

theelous3
0 replies
4d6h

dabeaz's async talks are imo the most important things you can consume to understand event loops and async, followed closely by njsmith's analysis and criticism of different methods of async invocation and call structure: https://vorpus.org/blog/notes-on-structured-concurrency-or-g...

tusharsadhwani
1 replies
4d5h

This is brilliant stuff, a very nice high level explanation that skips everything that would make it too boring for a first time reader.

Adding other resources at the end which explain how it _really_ works under the hood would be great.

xmcqdpt2
0 replies
4d6h

I would have preferred if the article did the same without using yield at all, because frankly that's where I feel the real magic is.

See here for a much more in-depth description of coroutines in python,

https://aosabook.org/en/500L/a-web-crawler-with-asyncio-coro...

guappa
0 replies
4d9h

No mention of poll()? It's not at all how asyncio works then.
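For reference, the readiness-notification piece being alluded to looks like this at the OS level (POSIX-only sketch using the stdlib `select.poll`):

```python
import os
import select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)

assert p.poll(0) == []        # nothing readable yet; returns immediately

os.write(w, b"x")
events = p.poll(1000)         # the fd is ready now, so this doesn't wait
ready_fds = [fd for fd, _ in events]

os.close(r)
os.close(w)
```

A real event loop combines this with its timer heap: the poll timeout is the time until the next scheduled timer, so the loop wakes on whichever comes first, I/O or a deadline.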