HN comments for: Show HN: Hatchet – Open-source distributed task queue

kcorbitt

30 replies

2024-03-08 18:13:21 UTC

I love your vision and am excited to see the execution! I've been looking for exactly this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.

One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over eg. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.

(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)

doctorpangloss

9 replies

22h2m

2024-03-08 20:36:06 UTC

Why does the RabbitMQ dependency matter?

It was pretty painless for me to set up and write tests against. The operator works well and is really simple if you want to save money.

I mean, isn’t Hatchet another dependency? Graphile Worker? I like all these things, but why draw the line at one thing over another over essentially aesthetics?

You better start believing in dependencies if you’re a programmer.

eska

6 replies

21h14m

2024-03-08 21:24:11 UTC

Introducing another piece of software instead of using one you already use anyway introduces new failures. That’s hardly aesthetics.

As a professional I’m allergic to statements like “you better start believing in X”. How can you even have objective discourse at work like that?

doctorpangloss

5 replies

20h51m

2024-03-08 21:47:06 UTC

Introducing another piece of software instead of using one you already use anyway introduces new failures.

Okay, but we're talking about this on a post about using another piece of software.

What is the rationale for, well, this additional dependency, Hatchet, that's okay, and its inevitable failures are okay, but this other dependency, RabbitMQ, which does something different, but will have fewer failures for some objective reasons, that's not okay?

Hatchet is very much about aesthetics. What else does Hatchet have going on? It doesn't have a lot of history, it's going to have a lot of bugs. It works as a DSL written in Python annotations, which is very much an aesthetic choice, very much something I see a bunch of AI startups doing, which I personally think is kind of dumb. Like OpenAI tools are "just" JSON schemas, they don't reinvent everything, and yet Trigger, Hatchet, Runloop, etc., they're all doing DSLs. It hews to a specific promotional playbook that is also very aesthetic. Is this not the "objective discourse at work" you are looking for?

I am not saying it is bad, I am saying that 99% of people adopting it will be doing so for essentially aesthetic reasons - and being less knowledgeable about alternatives might describe 50-80% of the audience, but to me, being less knowledgeable as a "professional" is an aesthetic choice. There's nothing wrong with this.

You can get into the weeds about what you meant by whatever you said. I am aware. But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency." It's an oxymoron plainly on its face. It's not in their marketing copy but the author is talking about it here, and maybe the author isn't completely sincere, maybe the author doesn't care and will happily write everything on top of RabbitMQ if someone were willing to pay for it, because that decision doesn't really matter. The author is just being reactive to people's aesthetics, that programmers on social media "like" Postgres more than RabbitMQ, for reasons, and that means you can "only" use one, but that none of those reasons are particularly well informed by experience or whatever, yet nonetheless strongly held.

When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.

necovek

2 replies

12h41m

2024-03-09 05:57:12 UTC

There is some implicit context you are missing here.

Tools like hatchet are one less dependency for projects already using Postgres: Postgres has become a de-facto database to build against.

Compare that to an application built on top of Postgres and using Celery + Redis/RabbitMQ.

Also, it seems like you are confusing aesthetic with ergonomics. Since forever, software developers have tried to improve on all of "aesthetics" (code/system structure appearance), "ergonomics" (how easy/fast is it to build with) and "performance" (how well it works), and the cycle has been continuous (we introduce extra abstractions, then do away with some when it gets overly complex, and on and on).

danielovichdk

1 replies

10h14m

2024-03-09 08:23:43 UTC

"Since forever, software developers have tried to improve on all of "aesthetics" (code/system structure appearance), "ergonomics" (how easy/fast is it to build with) and "performance" (how well it works), and the cycle has been continuous"

Fast, easy, well, cheap is not a quality measure, but it sure is a way to build more useless abstractions. You tell me which abstractions have made your software twice as effective.

hosh

0 replies

4h44m

2024-03-09 13:54:02 UTC

Efficacy has more to do with the specific situation than the tools you use. Rather, it is versatility of a tool that allows someone to take advantage of the situation.

What makes abstractions more versatile has more to do with its composability and expressiveness of those compositions.

An abstraction that attempts to (apparently) reduce complexity without also being composable, is overall less versatile. Usually, something that does one thing well, is designed to also be as simple as possible. Otherwise you are increasing the overall complexity (and reducing reliability or making it fragile instead of anti-fragile) for very little gain.

eska

0 replies

19h20m

2024-03-08 23:17:15 UTC

You can get into the weeds about what you meant by whatever you said. I am aware.

When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.

What an attitude and way to kill a discussion. Again, hard for me to imagine that you're able to have objective discussions at work. As you wish I won't engage in discourse with you so you can feel smart.

danielovichdk

0 replies

20h5m

2024-03-08 22:32:47 UTC

I fully agree with you.

'But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency."'

"Advances in software technology and increasing economic pressure have begun to break down many of the barriers to improved software productivity. The ${PRODUCT} is designed to remove the remaining barriers […]"

It reads like the above quote from the pitch of r1000 in 1985. https://datamuseum.dk/bits/30003882

otabdeveloper4

0 replies

4h21m

2024-03-09 14:16:57 UTC

You better start believing in dependencies if you’re a programmer.

Yeah, faith will be your last resort when the resulting tower of babel fails in hitherto unknown to man modes.

blandflakes

0 replies

20h32m

2024-03-08 22:06:00 UTC

And you better start critically assessing dependencies if you're a programmer. They aren't free; this is a wild take.

ako

5 replies

20h35m

2024-03-08 22:03:01 UTC

Funny how this is a vision now. I started my career 29 years ago at a company that built exactly this, but based on Oracle. The agents would run on Solaris, aix, vax vms, hpux, windows nt, iris, etc. It was also used to create an automated CI/CD pipeline to build all binaries on all these different systems.

throwawaymaths

3 replies

10h15m

2024-03-09 08:22:35 UTC

Also basically has existed as an open source (pro version has web dashboard and complex task zoo) drop-in library (no sidecar dependencies outside of postgres) in Elixir for years called Oban.

cpursley

2 replies

7h35m

2024-03-09 11:02:19 UTC

Yep, it feels like half the Show HN launches are for infrastructure tooling that already exists natively or as plug-and-play libraries for Elixir/Erlang.

I really try to suggest people skip Node and learn a proper backend language with a solid framework with a proven architecture.

zepolen

1 replies

5h32m

2024-03-09 13:05:16 UTC

Oban looks great, how would one run a python cuda based workload on it?

hosh

0 replies

4h53m

2024-03-09 13:44:49 UTC

You could shell out to execute with porcelain, make the python a long-running process and use ports, or port your python code to NX.

sixdimensional

0 replies

3h56m

2024-03-09 14:41:29 UTC

Because people don’t know what they don’t know, and, learning from others (along with human knowledge sharing and transfer) doesn’t seem to be what society often prioritizes in general.

Not so much talking about the original post, I think it’s awesome what they are building, and clearly they have learned by observing other things.

abelanger

4 replies

2024-03-08 18:23:12 UTC

Thank you, appreciate the kind words! What boxes are you looking to check?

Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.

It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.

jaggederest

2 replies

18h56m

2024-03-08 23:41:37 UTC

I come to this only as an interested observer, but my experience with listen/notify is that it outperforms rabbitmq/kafka in small to medium operations and has always pleasantly surprised me. You might find out it's a little easier than you think to slim your dependency stack down.

hosh

1 replies

4h52m

2024-03-09 13:45:57 UTC

How do you handle things when no listeners are available to be notified?

abelanger

0 replies

14m

2024-03-09 18:23:42 UTC

Presumably there'd be a messages table that you listen/notify on, and you'd replay messages that weren't consumed when a listener rejoins. But yeah, this is the overhead I was referencing.
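A minimal sketch of that replay idea, with in-memory Python objects standing in for the Postgres messages table and the LISTEN/NOTIFY channel. The class and method names are hypothetical and this is not how Hatchet implements it; the point is only that the row is written first, notifications reach connected listeners only, and a rejoining listener scans for unconsumed rows.

```python
from dataclasses import dataclass, field


@dataclass
class MessageTable:
    """Stand-in for a durable messages table plus a NOTIFY channel."""
    rows: dict = field(default_factory=dict)       # id -> [payload, consumed]
    listeners: list = field(default_factory=list)  # currently connected listeners
    next_id: int = 1

    def publish(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.rows[msg_id] = [payload, False]       # durable insert happens first
        for listener in list(self.listeners):      # notify only reaches connected listeners
            listener.on_notify(self, msg_id)
        return msg_id

    def unconsumed(self):
        return [i for i, (_, consumed) in sorted(self.rows.items()) if not consumed]


class Listener:
    def __init__(self):
        self.handled = []

    def on_notify(self, table, msg_id):
        payload, consumed = table.rows[msg_id]
        if not consumed:
            self.handled.append(payload)
            table.rows[msg_id][1] = True           # mark the row as consumed

    def join(self, table):
        table.listeners.append(self)
        for msg_id in table.unconsumed():          # replay anything missed while away
            self.on_notify(table, msg_id)

    def leave(self, table):
        table.listeners.remove(self)
```

The extra scan on every rejoin, and the bookkeeping for the consumed flag, is the kind of overhead the comment refers to.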

kcorbitt

0 replies

18h50m

2024-03-08 23:47:48 UTC

Boxes-wise, I'd like a management interface at least as good as the one Sidekiq had in Rails for years. Would also need some hard numbers around performance and probably a bit more battle-testing before using this in our current product.

bevekspldnw

2 replies

20h20m

2024-03-08 22:17:39 UTC

You can do a fair amount of this with Postgres using locks out of the box. It’s not super intuitive but I’ve been using just Postgres and locks in production for many years for large task distribution across independent nodes.

renegade-otter

1 replies

16h3m

2024-03-09 02:34:15 UTC

I wrote about one simple implementation:

https://renegadeotter.com/2023/11/30/job-queues-with-postrgr...

bevekspldnw

0 replies

14h9m

2024-03-09 04:29:07 UTC

Looks very similar to my solution. :-)

BenjieGillam

2 replies

22h45m

2024-03-08 19:52:19 UTC

Not sure if you saw it but Graphile Worker supports jobs written in arbitrary languages so long as your OS can execute them: https://worker.graphile.org/docs/tasks#loading-executable-fi...

Would be interested to know what features you feel it’s lacking.

kcorbitt

1 replies

18h49m

2024-03-08 23:49:02 UTC

That's interesting! Would that still involve each worker node needing to have Nodejs installed to run the process that actually reads from the queue? That's doable, but makes the deployment story a little more annoying/complicated if I want a worker that just runs Python or Rust or something.

Feature-wise, the biggest missing pieces from Graphile Worker for me are (1) a robust management web ui and (2) really strong documentation.

BenjieGillam

0 replies

18h8m

2024-03-09 00:29:13 UTC

Yes, currently Node is the runtime but we could bundle that up into a binary blob if that would help; one thing to download rather than installing Node and all its dependencies?

A UI is a common request, something I’ve been considering investing effort into. I don’t think we’ll ever have one in the core package, but probably as a separate package/plugin (even a third party one); we’ve been thinking more about the events and APIs such a system would need and making these available, and adding a plugin system to enable tighter integration.

Could you expand on what’s missing in the documentation? That’s been a focus recently (as you may have noticed with the new expanded docusaurus site linked previously rather than just a README), but documentation can always be improved.

simplyinfinity

0 replies

2h28m

2024-03-09 16:09:48 UTC

Hope I'm not misunderstanding, but have you checked Gearman? While I haven't used it personally, I've used a similar thing in C#, namely Hangfire.

rubenfiszel

0 replies

6h23m

2024-03-09 12:14:46 UTC

Windmill is built exactly like that. What box is left unchecked for it, if you had time to review it?

magic_hamster

0 replies

2h4m

2024-03-09 16:33:40 UTC

For what it's worth, RabbitMQ is extremely low maintenance, fire and forget. In the multiple years we've used it in production I can't remember a single time we had an issue with rabbit or that we needed to do anything after the initial set up.

leetrout

9 replies

23h7m

2024-03-08 19:30:45 UTC

Just pointing out even though this is a "Show HN" they are, indeed, backed by YC.

Is this going to follow the "open core" pattern or will there be a different path to revenue?

MuffinFlavored

5 replies

23h3m

2024-03-08 19:34:52 UTC

path to revenue

There have to be at least 10 different ways between different cloud providers to run a distributed task queue. Amazon, Azure, GCP

Self-hosting RabbitMQ, etc.

I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved (or already has it solved and is willing to migrate)

leetrout

2 replies

21h44m

2024-03-08 20:54:06 UTC

I am curious to see where they differentiate themselves on observability in the long run.

Comparing to rabbitmq it should be easier to see what is in the queue itself without mutating it, for instance.

MuffinFlavored

1 replies

17h28m

2024-03-09 01:10:04 UTC

https://www.rabbitmq.com/docs/management

leetrout

0 replies

16h57m

2024-03-09 01:40:30 UTC

Sure, but to see what is in the queue you have to operate on it, mutating it. With this using postgres we can just look in the table.

Kinrany

0 replies

22h57m

2024-03-08 19:40:42 UTC

There will be space for improvement until every cloud has a managed offering with exactly the same interface. Like docker, postgres, S3.

Aeolun

0 replies

14h2m

2024-03-09 04:36:03 UTC

I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved

Is there any task queue you are completely happy with?

I use Redis, but it’s only half of the solution.

wodenokoto

1 replies

10h5m

2024-03-09 08:33:09 UTC

Wasn’t the first Dropbox introduction also a show HN?

I don’t think this is out of place

leetrout

0 replies

5h1m

2024-03-09 13:36:40 UTC

I am not saying it is out of place but I feel for such a long winded explanation of what they are doing a missing "YC W24" was surprising.

abelanger

0 replies

21h1m

2024-03-08 21:36:34 UTC

Yep, we're backed by YC in the W24 batch - this is evident on our landing page [1].

We're both second time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open core model:

1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance, upgrades, and since we sit on a critical path, with uptime.

2. Hyperscalers folding your product in to their offering [3]. This is a bigger risk but is also a bit of a "champagne problem".

Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.

Based on all of this, here's where we land on things:

1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self hosting.

2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.

3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.

Would love to hear everyone's thoughts on this.

[1] https://hatchet.run

[2] https://github.com/porter-dev/porter

[3] https://www.elastic.co/blog/why-license-change-aws

moribvndvs

8 replies

21h47m

2024-03-08 20:50:52 UTC

One recurring issue I’ve had in my past position is the need to schedule an unlimited number of jobs, often months to a year from now. Example use case: a patient schedules an appointment for a follow up in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.

I started out by just entering a record into a database queue and just polling every few seconds. Functional, but our IO costs for polling weren’t ideal, and we wanted to distribute this without using stuff like schedlock. I switched to Redis but it got complicated dealing with multiple dispatchers, OOM issues, and having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc. but I’ve changed positions.

I can see a similar use case on my horizon and wondered if Hatchet would be suitable for it.

herval

3 replies

21h41m

2024-03-08 20:56:21 UTC

why do you need to schedule things 6 months in advance, instead of, say, check everything that needs notifications in a rolling window (eg 24h ahead) and schedule those?

moribvndvs

2 replies

21h9m

2024-03-08 21:28:44 UTC

Well, it was a dumbed down example. In that particular case, appointments can be added, removed, or moved at any moment, so I can’t just run one job every 24 hours to tee up the next day’s work and leave it at that. Simply polling the database for messages that are due to go out gives me my just-in-time queue, but then I need to build out the work to distribute it, and we didn’t like the IO costs.

I did end up moving it to Redis: basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time a job runs to move stuff that should have run but didn’t (rare, thankfully) into a remediation queue, and load the next block of items that should run between now and now + fence. At the service level, any items with a scheduled date within the fence get ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.

This worked. I was able to ramp up the polling rate to get near-real-time dispatch while also noticeably reducing costs. Problems were some occasional Redis issues (OOM and having to either keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used schedlock for that :/), and occasionally a bug where the poller craps out in the middle of the Redis work, resulting in an at-least-once SLA, which required downstream protections to make sure I don’t send the same message multiple times to the patient.

Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.
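The fence pattern described in this comment can be sketched roughly as follows. This is an in-memory illustration only: two plain dicts stand in for the durable database table and the Redis sorted set, and the class and method names are hypothetical. `schedule` mirrors the "ZADD if inside the fence" rule, `load_block` mirrors the periodic fence job, and `pop_due` mirrors ZRANGEBYSCORE followed by removal.

```python
class Scheduler:
    def __init__(self, fence_seconds):
        self.fence = fence_seconds
        self.db = {}   # job_id -> run_at; stand-in for the durable table
        self.hot = {}  # job_id -> run_at; stand-in for the Redis sorted set

    def schedule(self, job_id, run_at, now):
        self.db[job_id] = run_at
        if run_at <= now + self.fence:  # inside the fence: add to the hot set right away
            self.hot[job_id] = run_at

    def load_block(self, now):
        """Fence job: pull everything due before now + fence into the hot set."""
        for job_id, run_at in self.db.items():
            if run_at <= now + self.fence:
                self.hot[job_id] = run_at

    def pop_due(self, now):
        """Poller: return jobs with run_at <= now, oldest first, and remove them."""
        due = sorted((run_at, job_id) for job_id, run_at in self.hot.items() if run_at <= now)
        for _, job_id in due:
            del self.hot[job_id]
            del self.db[job_id]  # stand-in for marking the row as dispatched
        return [job_id for _, job_id in due]
```

Only jobs inside the fence occupy the hot set, which is why shrinking the fence reduces memory pressure at the cost of running the fence job more often.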

tonyhb

0 replies

17h52m

2024-03-09 00:46:06 UTC

I built https://www.inngest.com specifically because of healthcare flows. You should check it out, with the obvious disclaimer that I'm biased. Here's what you need:

1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).

2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).

3. General reliability and API access.

Inngest does that for you, but again — disclaimer, I made it and am biased.

herval

0 replies

18h41m

2024-03-08 23:56:47 UTC

Couldn’t u just enqueue + change a status, then check before firing? I don’t see why you’d need more than a dumb queue and a db table for that, unless you’re doing millions of qps

kbar13

1 replies

21h30m

2024-03-08 21:08:08 UTC

can you explain why this cannot be a simple daily cronjob to query for appointments upcoming next <time window> and send out notifications at that time? polling every few seconds seems way overkill

moribvndvs

0 replies

21h1m

2024-03-08 21:36:58 UTC

Sure: https://news.ycombinator.com/item?id=39646719

abelanger

1 replies

20h24m

2024-03-08 22:13:29 UTC

It wouldn't be suitable for that at the moment, but might be after some refactors coming this weekend. I wrote a very quick scheduling API which pushes schedules as workflow triggers, but it's only supported on the Go SDK. It also is CPU-intensive at thousands of schedules, as the schedules are run as separate goroutines (on a dedicated `ticker` service) - I'm not proud of this. This was a pattern that made sense for the cron schedule and I just adapted it for the one-time scheduling.

Looking ahead (and back) in the database and placing an exclusive lock on the schedule is the way to do this. You basically guarantee scheduling at +/- the polling interval if your service goes down while maintaining the lock. This allows you to horizontally scale the `tickers` which are polling for the schedules.

moribvndvs

0 replies

19h41m

2024-03-08 22:57:05 UTC

Thanks for the follow-up! I’ll keep an eye on the progress.

jerrygenser

8 replies

23h47m

2024-03-08 18:50:44 UTC

Something I really like about some pub/sub systems is Push subscriptions. For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint where events are pushed to.

The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.

Setting up autoscaling for workers can be a little bit more finicky, e.g. in kubernetes you might set up KEDA autoscaling based on some queue depth metrics but these might need to be exported from rabbit.

I suppose you could have a setup where your daemon worker is making http requests and in that sense "push" to the place where jobs are actually running but this adds another level of complexity.

Is there any plan to support a push model where you can push jobs into http and some daemons that are holding the http connections open?

tonyhb

1 replies

19h0m

2024-03-08 23:38:11 UTC

You might want to look at https://www.inngest.com for that. Disclaimer: I'm a cofounder. We released event-driven step functions about 20 months ago.

jerrygenser

0 replies

7h34m

2024-03-09 11:03:13 UTC

Looks cool but looks like it's only typescript. If there is a json payload, couldn't any web server handle it?

abelanger

1 replies

23h13m

2024-03-08 19:25:06 UTC

I like that idea, basically the first HTTP request ensures the worker gets spun up on a lambda, and the task gets picked up on the next poll when the worker is running. We already have the underlying push model for our streaming feature: https://docs.hatchet.run/home/features/streaming. Can configure this to post to an HTTP endpoint pretty easily.

The daemon feels fragile to me, why not just shut down the worker client-side after some period of inactivity?

jerrygenser

0 replies

22h47m

2024-03-08 19:50:40 UTC

I think it depends on the http runtime. One of the things with cloud run is that if the server is not handling requests, it doesn't get CPU time. So even if the first request is "wake up", it wouldn't get any CPU to poll outside of the request-response cycle.

You can configure cloud run to always allocate CPU but it's a lot more expensive. I don't think it would be a good autoscaling story since autoscaling is based on http requests being processed. (Maybe it can be done via CPU, but that may not be what you want; the workload may not even be CPU-bound.)

sixdimensional

0 replies

3h49m

2024-03-09 14:48:18 UTC

There are some tools like Apache Nifi which call this pattern an HTTP listener. It’s also basically a kind of sink, and also sort of resembles webhook architecture.

lysecret

0 replies

7h29m

2024-03-09 11:08:13 UTC

Yep we are using cloud tasks and pub sub a lot. Another big benefit is that the GCP infra is literally “pushing” your messages even if your infra goes down.

jsmeaton

0 replies

11h15m

2024-03-09 07:22:31 UTC

https://cloud.google.com/tasks is such a good model and I really want an open source version of it (or to finally bite the bullet and write my own).

Having http targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren’t tied to whatever backend the task system supports.

Set up a separate scaling group and away you go.

alexbouchard

0 replies

20h24m

2024-03-08 22:13:49 UTC

The push queue model has major benefits, as you mentioned. We've built Hookdeck (hookdeck.com) on that premise. I hope we see more projects adopt it.

topicseed

7 replies

1d1h

2024-03-08 17:26:29 UTC

What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?

abelanger

6 replies

2024-03-08 18:05:13 UTC

Each task in Hatchet is backed by a workflow [1]. Workflows are predefined steps which are persisted in PostgreSQL. If a worker dies or crashes midway through (stops heartbeating to the engine), we reassign tasks (assuming they have retries left). We also track timeouts in the database, which means if we miss a timeout, we simply retry after some amount of time. Like I mentioned in the post, we avoid some classes of faults just by relying on PostgreSQL and persisting each workflow run, so you don't need to time out with distributed locks in Redis, for example, or worry about data loss if Redis OOMs. Our `ticker` service is basically its own worker which is assigned a lease for each step run.

We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.

To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach; it makes things much easier if you're running a single-step workflow or if you know the workflow execution path ahead of time. You also avoid classes of problems related to workflow versioning: we can gracefully drain older workflow versions with a different execution path. It's also more natural to debug and see a DAG execution instead of debugging procedural logic.

The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept. Which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.

There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].

[1] https://docs.hatchet.run/home/basics/workflows

[2] https://temporal.io

[3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
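As a rough illustration of resuming a predeclared multi-step workflow from persisted step outputs, here is a minimal sketch. It is not Hatchet's API: `run_workflow`, `steps`, `deps`, and `store` are hypothetical names, a plain dict stands in for the PostgreSQL table of step outputs, and the steps are assumed to be declared in a valid execution order.

```python
def run_workflow(steps, deps, workflow_input, store):
    """Run steps in declared order, skipping any whose output is already stored.

    steps: dict of name -> function(inputs: dict) -> output, in execution order
    deps:  dict of name -> list of parent step names
    store: dict acting as the durable record of step outputs
    """
    for name, fn in steps.items():
        if name in store:            # finished in an earlier attempt: do not rerun
            continue
        inputs = {"input": workflow_input}
        inputs.update({parent: store[parent] for parent in deps.get(name, [])})
        store[name] = fn(inputs)     # persist the output before moving on
    return store
```

If a step raises partway through, the outputs of earlier steps stay in `store`, so calling `run_workflow` again with the same `store` replays only the failed step and whatever comes after it, with the same inputs.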

spenczar5

2 replies

23h48m

2024-03-08 18:49:27 UTC

What happens if a worker goes silent for longer than the heartbeat duration, then a new worker is spawned, then the original worker “comes back to life”? For example, because there was a network partition, or because the first worker’s host machine was sleeping, or even just that the first worker process was CPU starved?

abelanger

1 replies

20h15m

2024-03-08 22:22:42 UTC

The heartbeat duration (5s) is not the same as the inactive duration (60s). If a worker has been down for 60 seconds, we reassign to provide some buffer and handle unstable networks. Once someone asks we'll expose these options and make them configurable.

We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.

There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.
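The reassignment rule described in this comment could look roughly like the sketch below. The 60-second inactivity threshold and the retry check come from the thread; everything else (the function name, the task dict layout, and the `pick_worker` helper) is hypothetical and only meant to illustrate the logic.

```python
INACTIVE_AFTER = 60.0  # seconds of silence before a worker counts as inactive


def reassign_stale(tasks, last_heartbeat, now, pick_worker):
    """Reassign running tasks whose worker has gone silent, if retries remain.

    tasks:          list of dicts with 'worker', 'retries_left', 'status'
    last_heartbeat: dict of worker name -> timestamp of its last heartbeat
    pick_worker:    callable returning the name of a live worker (hypothetical)
    """
    for task in tasks:
        if task["status"] != "running":
            continue
        # A worker that never sent a heartbeat is treated as silent forever.
        silent_for = now - last_heartbeat.get(task["worker"], float("-inf"))
        if silent_for < INACTIVE_AFTER:
            continue
        if task["retries_left"] > 0:
            task["retries_left"] -= 1
            task["worker"] = pick_worker()
        else:
            task["status"] = "failed"
    return tasks
```

Nothing here stops the original worker from finishing the task after it has been reassigned, which is the duplicate-execution case raised in the parent comment.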

spenczar5

0 replies

19h1m

2024-03-08 23:36:49 UTC

Right, I meant inactive duration, of course.

Cancellation signals are tricky. You of course cannot be sure that the remote end receives it. This turns into the two generals problem.

Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.

Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.

Good docs on this point are important, as is configurability for cases when “at most once” is preferable.
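One common way to live with at-least-once delivery is to make the handler idempotent with a deduplication key. A minimal sketch, assuming each task carries a stable key and using a Python set as a stand-in for a durable store of processed keys (the names are hypothetical):

```python
def make_idempotent(handler, seen):
    """Wrap handler so a repeated delivery with the same key is a no-op.

    seen: a set standing in for a durable store of processed keys.
    """
    def wrapped(key, payload):
        if key in seen:
            return False   # duplicate delivery: skip the side effect
        handler(payload)
        seen.add(key)      # record the key only after the handler succeeded
        return True
    return wrapped
```

Note the remaining gap: if the process dies between `handler(payload)` and `seen.add(key)`, the side effect can still happen twice. Closing it requires the side effect and the key record to commit together, for example in one database transaction.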

sigmarule

1 replies

22h41m

2024-03-08 19:56:24 UTC

I think the answer is no but just to be sure: are you able to trigger step executions programmatically from within a step, even if you can't await their results?

Related, but separately: can you trigger a variable number of task executions from one step? If the answer to the previous question is yes then it would of course be trivial; if not, I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

For example some of the examples involve a load_docs step, but all loaded docs seem to be passed to the next step execution in the DAG together, unless I'm just misunderstanding something. How could we tweak such an example to have a separate task execution per document loaded? The benefits of durable execution and being able to resume an intensive workflow without repeating work is lessened if you can't naturally/easily control the size of the unit of work for task executions.

abelanger

0 replies

21h53m

2024-03-08 20:44:19 UTC

You can execute a new workflow programmatically, for example see [1]. So people have triggered, for example, 50 child workflows from a parent step. As you've identified the difficult part there is the "collect" or "gathering" step, we've had people hack around that by waiting for all the steps from a second workflow (and falling back to the list events method to get status), but this isn't an approach I'd recommend and it's not well documented. And there's no circuit breaker.

I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array, and a child step is dependent on that parent step, it fans out if a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution we'd prefer to support that.
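
A rough sketch of the fan-out idea described above, in plain Python rather than any Hatchet SDK: a parent step returns a list, and each item gets its own child execution. `load_docs`, `embed_doc`, and `run_fanout` are hypothetical names, and a thread pool stands in for the queue and workers.

```python
from concurrent.futures import ThreadPoolExecutor


def load_docs():
    # Parent step: returns a list, one entry per unit of work.
    return ["doc-a", "doc-b", "doc-c"]


def embed_doc(doc):
    # Child step: runs once per item instead of once for the whole list.
    return f"embedded:{doc}"


def run_fanout(parent, child, max_workers=4):
    """Run `child` once per item returned by `parent`, then gather the results."""
    items = parent()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, which makes the gather step simple.
        return list(pool.map(child, items))


results = run_fanout(load_docs, embed_doc)
```

The gather step is trivial here because everything runs in one process; in a distributed queue that is exactly the part that needs durable bookkeeping.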

The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide in each workflow execution. This is similar to AWS X-Ray.

As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.

[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...

topicseed

0 replies

2024-03-08 18:17:15 UTC

Thank you for the thorough response!

fcsp

5 replies

20h11m

2024-03-08 22:26:28 UTC

Hatchet is built on a low-latency queue (25ms average start)

That seems pretty long - am I misunderstanding something? By my understanding this means the time from enqueue to job processing, maybe someone can enlighten me.

mhh__

3 replies

20h5m

2024-03-08 22:33:09 UTC

It's only a few billion instructions on a decent sized server these days

spenczar5

2 replies

18h56m

2024-03-08 23:41:39 UTC

Damn, I want one of these 100GHz CPUs you have, that sounds great.

I think you mean million :)

mhh__

0 replies

9h38m

2024-03-09 09:00:09 UTC

My iPad has 8 cores executing about 4 to 6 billion instructions a second these days (3GHz at an IPC of at most about two)

jlokier

0 replies

12h54m

2024-03-09 05:43:46 UTC

You'd be surprised. 1 billion instructions in 25ms is realistic these days.

My laptop can execute about 400 billion CPU instructions per second on battery.

That's about 10 billion instructions in 25ms.

That's the CPU alone, i.e. not including the GPU which would increase the total considerably. Also not counting SIMD lanes as separate: The count is bona fide assembly language instructions.

It comes from cores running at ~4GHz, 8 issued instructions per clock, times 12 cores, plus 4 additional "efficiency" cores adding a bit more. People have confirmed by measurement the 8 instructions per clock is achievable (or close) in well-optimised code. Average code is more like 2-3 per cycle.
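
For anyone who wants to check the arithmetic, a small sketch using the figures quoted above (4GHz, 8 instructions per clock, 12 performance cores, a 25ms window). The efficiency cores are left out, and the 2.5 instructions per cycle for "average code" is just the midpoint of the 2-3 range, so treat the outputs as rough bounds.

```python
clock_hz = 4e9               # ~4GHz per core
peak_ipc = 8                 # issued instructions per clock in well-optimised code
typical_ipc = 2.5            # midpoint of the "2-3 per cycle" estimate (assumption)
performance_cores = 12       # efficiency cores ignored in this sketch
window_s = 0.025             # 25ms

peak_per_second = clock_hz * peak_ipc * performance_cores
peak_in_window = peak_per_second * window_s
typical_in_window = clock_hz * typical_ipc * performance_cores * window_s

print(f"peak:    {peak_per_second:.3g} instr/s, {peak_in_window:.3g} in 25ms")
print(f"typical: {typical_in_window:.3g} in 25ms")
```

That gives about 384 billion instructions per second at peak, so roughly 9.6 billion in 25ms, and about 3 billion in 25ms at the more typical rate.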

Only for short periods as the CPU is likely to get hot and thermally throttle even with its fan. But when it throttles it'll still exceed 1 billion in 25ms.

For perspective on how far silicon has come, the GPU on my laptop is reported to do about 14 trillion floating-point 32-bit calculations per second.

abelanger

0 replies

19h52m

2024-03-08 22:45:56 UTC

To clarify - you're right, this is a long time in a message/event queue.

It's not an eternity in a task queue which supports DAG-style workflows with concurrency limits and fairness strategies. The reason for this is you need to check all of the subscribed workers and assign a task in a transactional way.

The limit on the Postgres level is probably on the order of 5-10ms on a managed PG provider. Have a look at: https://news.ycombinator.com/item?id=39593384.

Also, these are not my benchmarks, but have a look at [1] for Temporal timings.

[1] https://www.windmill.dev/blog/launch-week-1/fastest-workflow...

bluehadoop

5 replies

2024-03-08 18:08:23 UTC

How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?

https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/

abelanger

4 replies

23h47m

2024-03-08 18:50:59 UTC

It's very similar - I used Temporal at a previous company to run a couple million workflows per month. The gRPC networking with workers is the most similar component, I especially liked that I only had to worry about an http2 connection with mTLS instead of a different broker protocol.

Temporal is a powerful system but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows in an intuitive way with OpenTelemetry and logging was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.

We have a section on the docs page for durable execution [1], also see the comment on HN [2]. Like I mention in that comment, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow, users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This is also something that requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure about compatibility with the way most people write their functions.

[1] https://docs.hatchet.run/home/features/durable-execution

[2] https://news.ycombinator.com/item?id=39643881

[3] https://github.com/temporalio/sdk-python

bicijay

2 replies

22h5m

2024-03-08 20:33:03 UTC

Well, you just got a user. Love the concept of Temporal, but I can't justify the overhead you need with infra to make it work for the upper guys... And the cloud offering is a bit expensive for small companies.

mfateev

1 replies

19h15m

2024-03-08 23:23:08 UTC

Do you know about the Temporal startup program? It gives enough credits to offset support fees for 2 years. https://temporal.io/startup

Aeolun

0 replies

13h59m

2024-03-09 04:39:02 UTC

If you are expecting to still be small after 2 years that just delays the expense until you are locked in?

dangoodmanUT

0 replies

4h20m

2024-03-09 14:17:41 UTC

we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in otel traces? And maybe another 15 to add their grafana dashboard?

What obstacles did you experience here?

tzahifadida

4 replies

2024-03-08 17:44:24 UTC

Why not use postgres listen/notify instead of rabbitmq pub sub.

anentropic

1 replies

2024-03-08 17:51:00 UTC

It uses Postgres rather than RabbitMQ: https://github.com/hatchet-dev/hatchet?tab=readme-ov-file#ho...

anentropic

0 replies

22h25m

2024-03-08 20:12:45 UTC

I see... apparently it uses both

abelanger

1 replies

2024-03-08 18:09:16 UTC

When I started on this codebase, we needed to implement some custom exchange logic that maps very neatly to fanout exchanges and non-durable queues in RabbitMQ and wasn't built out on our PostgreSQL layer yet. This was a bootstrapping problem. Like I mentioned in the comment, we'd like to switch to a pub/sub pattern that lets us distribute our engine over multiple geographies. Listen/notify could be the answer once we migrate to PG 16, though there are some concerns around connection poolers like PgBouncer having limited support for listen/notify. There's a GitHub discussion on this if you're curious: https://github.com/hatchet-dev/hatchet/discussions/224.

tzahifadida

0 replies

20h57m

2024-03-08 21:40:31 UTC

I use haproxy with Go listen/notify from one of the libs. It works as long as the connection is up. I.e. I have a timeout of 30 min configured in haproxy. Then you have to assume you lost sync and recheck. That is not that bad every 30 min... at least for me. You can configure it to never close...

krawczstef

4 replies

18h53m

2024-03-08 23:44:46 UTC

Can you explain why you chose every function to take in context? https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/...

This seems like a lot of boiler plate to write functions with to me (context I created http://github.com/DAGWorks-Inc/hamilton).

abelanger

3 replies

13h51m

2024-03-09 04:46:47 UTC

We did it because there are methods that should be accessed which don't map to `args` cleanly. For example, we let users call `context.log`, `context.done` (to determine whether to return on cancellation) or `context.step_output` (to dynamically access a parent's step output). Perhaps there's a more pythonic way to do this? Admittedly this is a pattern we adapted from Go.

kamikaz1k

1 replies

13h36m

2024-03-09 05:01:53 UTC

Probably just have it attached to self, like self.context

But nbd IMHO

krawczstef

0 replies

12h15m

2024-03-09 06:23:05 UTC

yep nbd, but :/

krawczstef

0 replies

12h16m

2024-03-09 06:21:35 UTC

you could just make them optional arguments that you inject if they're declared. Happy to chat more. With Hamilton we could actually build an alternative way to describe your API pretty easily...
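
A minimal sketch of what "inject if declared" could look like, using only the standard library's `inspect.signature`. The `Context` class here is a stand-in with just a `log` method, and `call_with_declared` is a hypothetical helper, not something from Hatchet or Hamilton.

```python
import inspect


class Context:
    """Stand-in for a step context; only `log` is modelled here."""

    def __init__(self):
        self.lines = []

    def log(self, msg):
        self.lines.append(msg)


def call_with_declared(func, available):
    """Call `func`, passing only the arguments its signature declares."""
    params = inspect.signature(func).parameters
    kwargs = {name: value for name, value in available.items() if name in params}
    return func(**kwargs)


def plain_step(payload):
    # Doesn't need the context, so it doesn't declare it.
    return payload["x"] + 1


def logging_step(payload, context):
    # Declares `context`, so the runner injects it.
    context.log("running")
    return payload["x"] * 2


ctx = Context()
available = {"payload": {"x": 20}, "context": ctx}
a = call_with_declared(plain_step, available)
b = call_with_declared(logging_step, available)
```

The tradeoff is that the function signature becomes part of the framework's contract, so a misspelled parameter name silently receives nothing unless the runner validates it.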

SCUSKU

4 replies

23h36m

2024-03-08 19:01:31 UTC

Looks pretty great! My biggest issue with Celery has been that the observability is pretty bad. Even if you use Celery Flower, it still just doesn’t give me enough insight when I’m trying to debug some problem in production.

I’m all for just using Postgres in service of the grug brain philosophy.

Will definitely be looking into this, congrats on the launch!

9dev

2 replies

21h31m

2024-03-08 21:06:52 UTC

In case you’re stuck with Celery for a while: I was hit with this same problem, and solved it by adding a sidecar HTTP server thread to the Python workers that would expose metrics written by the workers into a multithreaded registry. This has been working amazingly well in production for over two years now, and makes it really straightforward to get custom metrics out of a distributed Celery app.
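
A stripped-down sketch of the sidecar idea using only the standard library. This is not the commenter's actual code: it assumes the worker tasks run as threads in the same process as the HTTP server (a prefork pool would need some cross-process sharing instead), and `Registry`, `MetricsHandler`, and `start_sidecar` are made-up names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class Registry:
    """Thread-safe counter registry shared by worker threads and the sidecar."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def inc(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def render(self):
        # One "name value" pair per line, similar to a text exposition format.
        with self._lock:
            return "".join(f"{k} {v}\n" for k, v in sorted(self._counters.items()))


REGISTRY = Registry()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = REGISTRY.render().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the worker's stdout clean


def start_sidecar(port=0):
    """Serve metrics from a daemon thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_sidecar()
REGISTRY.inc("tasks_processed_total")   # what a task would call when it finishes
REGISTRY.inc("tasks_processed_total")
url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
scraped = urlopen(url).read().decode()
server.shutdown()
server.server_close()
```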

kamikaz1k

1 replies

13h42m

2024-03-09 04:55:28 UTC

Any chance you could share more specifics about your solution?

9dev

0 replies

9h46m

2024-03-09 08:51:20 UTC

Here you go: https://stackoverflow.com/questions/75652326/celery-spawn-si...

Plus some adjacent discussion on GitHub: https://github.com/prometheus/client_python/issues/902

Hope that helps!

abelanger

0 replies

23h25m

2024-03-08 19:12:33 UTC

Appreciate it, thank you! We've spent quite a bit of time in the Celery Flower console. Admittedly it's been a while, I'm not sure if they've added views for chains/groups/etc - it was just a linear task view when I used it.

A nice thing in Celery Flower is viewing the `args, kwargs`, whereas Hatchet operates on JSON request/response bodies, so some early users have mentioned that it's hard to get visibility into the exact typing/serialization that's happening. Something for us to work on.

toddmorey

3 replies

2024-03-08 18:27:07 UTC

I need task queues where the client (web browser) can listen to the progress of the task through completion.

I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.

Wondering if perhaps the Postgres underpinnings here would make that possible.

EDIT: seems so! https://docs.hatchet.run/home/features/streaming

abelanger

1 replies

23h58m

2024-03-08 18:39:46 UTC

Yep, exactly - Gabe has also been thinking about providing per-user signed URLs to task executions so clients can subscribe more easily without a long-lived token. So basically, you would start the workflow from your API, and pass back the signed URL to the client, where we would then provide a React hook to get task updates automatically. We need this ourselves once we open our cloud instance up to self-serve, since we want to provision separate queues per user, with a Hatchet workflow of course.

toddmorey

0 replies

23h56m

2024-03-08 18:42:02 UTC

Awesome to hear!

rad_gruchalski

0 replies

2024-03-08 18:32:05 UTC

If you need to listen for the progress only, try server-sent events, maybe?: https://en.wikipedia.org/wiki/Server-sent_events

It's dead simple: an existence of the URI means the topic/channel/whathaveu exists, to access it one needs to know the URI, data streamed but no access to old data, multiple consumers no problem.
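
For reference, the wire format is small enough to sketch by hand. The snippet below only builds the frames a progress endpoint would write to the response body (served with `Content-Type: text/event-stream`); `format_sse` and `progress_stream` are hypothetical helpers, and the event names are arbitrary.

```python
def format_sse(data, event=None):
    """Format one server-sent event: optional `event:` line, `data:` lines, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads are sent as several `data:` lines.
    for line in str(data).splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def progress_stream(total_steps):
    """Yield one frame per completed step, then a final completion frame."""
    for step in range(1, total_steps + 1):
        yield format_sse(f"{step}/{total_steps}", event="progress")
    yield format_sse("done", event="complete")


frames = list(progress_stream(2))
```

On the browser side an `EventSource` pointed at the URL would receive these as `progress` and `complete` events.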

Kinrany

3 replies

23h0m

2024-03-08 19:37:21 UTC

With NATS in the stack, what's the advantage over using NATS directly?

abelanger

2 replies

13h31m

2024-03-09 05:06:27 UTC

I'm assuming specifically you mean Nex functions? Otherwise NATS gives you connectivity and a message queue - it doesn't (or didn't) have the concept of task executions or workflows.

With regards to Nex -- it isn't fully stable and only supports Javascript/Webassembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.

rapnie

0 replies

8h39m

2024-03-09 09:58:47 UTC

I recently found Nex in the context of Wasmcloud [0] and ability for it to support long-running tasks/workflows. Impression that indeed Nex needs a good time to mature still. There was also a talk [1] about using Temporal here. For Hatchet it may be interesting to check it out (note: I am not affiliated with Wasmcloud, nor currently using it).

[0] https://wasmcloud.com

[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...

bruth

0 replies

6h46m

2024-03-09 11:51:29 UTC

(Disclaimer: I am a NATS maintainer and work for Synadia)

The parent comment may have been referring to the fact that NATS has support for durable (and replicated) work queue streams, so those could be used directly for queuing tasks and having a set of workers dequeuing concurrently. And this is regardless of whether you want to use Nex or not. Nex is indeed fairly new, but the team is iterating on it quickly and we are dog-fooding it internally to keep stabilizing it.

Another benefit of NATS is the built-in multi-tenancy, which would allow distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.

NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster in many different regions across the globe and across the three major cloud providers. As it applies to distributed work queues, you can place work queue streams in a cluster within a region/provider closest to the users/apps enqueuing the work, and then deploy workers in the same region for optimizing latency of dequeuing and processing.

Could be worth a deeper look on how much you could leverage for this use case.

zwaps

2 replies

22h34m

2024-03-08 20:03:37 UTC

You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?

Could not find any specifics on generative AI in your docs. Thanks

abelanger

1 replies

21h45m

2024-03-08 20:52:47 UTC

This isn't built specifically for generative AI, but generative AI apps typically have architectural issues that are solved by a good queueing system and worker pool. This is particularly true once you start integrating smaller, self-hosted LLMs or other types of models into your pipeline.

How do you distribute inference across workers?

In Hatchet, "run inference" would be a task. By default, tasks get randomly assigned to workers in a FIFO fashion. But we give you a few options for controlling how tasks get ordered and sent. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, you only run 1 inference task. The purpose of this would be fairness.
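
To illustrate the per-session limit described above, here is a toy in-memory sketch. It is not how Hatchet implements concurrency keys; `KeyedConcurrencyQueue` and its methods are hypothetical, with `max_runs` playing the role of `maxRuns`.

```python
from collections import deque


class KeyedConcurrencyQueue:
    """FIFO queue that runs at most `max_runs` tasks per concurrency key at once."""

    def __init__(self, max_runs=1):
        self.max_runs = max_runs
        self.pending = deque()   # (key, task) pairs in arrival order
        self.running = {}        # key -> number of tasks currently running

    def enqueue(self, key, task):
        self.pending.append((key, task))

    def next_runnable(self):
        """Return the oldest task whose key is under its limit, or None."""
        for i, (key, task) in enumerate(self.pending):
            if self.running.get(key, 0) < self.max_runs:
                del self.pending[i]
                self.running[key] = self.running.get(key, 0) + 1
                return key, task
        return None

    def finish(self, key):
        self.running[key] -= 1


q = KeyedConcurrencyQueue(max_runs=1)
q.enqueue("session-a", "infer-1")
q.enqueue("session-a", "infer-2")
q.enqueue("session-b", "infer-3")

first = q.next_runnable()    # session-a starts its first task
second = q.next_runnable()   # session-b is not blocked by session-a's backlog
third = q.next_runnable()    # nothing runnable: infer-2 waits for infer-1
q.finish("session-a")
fourth = q.next_runnable()   # now infer-2 can start
```

The point of the key is visible in the second call: one session with a backlog doesn't hold up another session's task.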

Can one use just any protocol

We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.

[1] https://docs.hatchet.run/home/features/concurrency/round-rob...

zwaps

0 replies

14h51m

2024-03-09 03:46:18 UTC

Got it, so the underlying infrastructure (the inference nodes, if you wish) would be something to be solved outside of Hatchet, but it would then allow scheduling inference tasks per user with limits.

welder

2 replies

10h8m

2024-03-09 08:29:36 UTC

Related, I also wrote my own distributed task queue in Python [0] and TypeScript [1] with a Show HN [2]. Time it took was about a week. I like your features, but it was easy to write my own so I'm curious how you're building a money making business around an open source product. Maybe the fact everyone writes their own means there's no best solution now, so you're trying to be that and do paid closed source features for revenue?

[0] https://github.com/wakatime/wakaq

[1] https://github.com/wakatime/wakaq-ts

[2] https://news.ycombinator.com/item?id=32730038

abelanger

1 replies

26m

2024-03-09 18:11:53 UTC

Nice, Waka looks cool! I've talked a bit about the tradeoffs with library-mode pollers, for example here: https://news.ycombinator.com/item?id=39644327. Which isn't to say they don't make sense, but scaling wise I think there can be some drawbacks.

I'm curious how you're building a money making business around an open source product.

We'd like to make money off of our cloud version. See the comment on pricing here - https://news.ycombinator.com/item?id=39653084 - which also links to other comments about pricing, sorry about that.

welder

0 replies

2024-03-09 18:29:34 UTC

Thanks. There's definitely a need for this, hence why I built WakaQ. Most distributed task queues have bugs or lack features and are overly complex. Would have been nice to find one I could have used instead of building my own. To be transparent, had Hatchet been around I probably would have self-hosted unless your cloud pricing gave similar throughput for the price I get on DigitalOcean. I'm unique, as a bootstrapped solo company. Maybe Hatchet can be the right solution for others. Keep the momentum going!

sixhobbits

2 replies

1d1h

2024-03-08 17:34:33 UTC

One of my favourite spaces and presentation in readme is clear and immediately told me what it is and most of the key information that I usually complain is missing.

However I am still missing a section on why this is different than any of the other existing and more mature solutions. What led you to develop this over existing options and what different tradeoffs did you make? Extra points if you can concisely tell me what you do badly that your 'competitors' do well because I don't believe there is a one best solution in this space, it is all tradeoffs

sixhobbits

1 replies

1d1h

2024-03-08 17:36:45 UTC

Sorry I am dumb and commented after clicking on the link. I would just add your hn text to the readme as that is exactly what I was looking for

abelanger

0 replies

2024-03-08 17:49:12 UTC

Done [1]. We'll expand this section over time. There are also definite tradeoffs to our architecture - spoke to someone wanting the equivalent of 1.5m PutRecord/s in Kinesis, which we're definitely not ready for because we persist every event + task execution in Postgres.

[1] https://github.com/hatchet-dev/hatchet/blob/main/README.md#h...

rubenfiszel

2 replies

6h8m

2024-03-09 12:29:23 UTC

Ola, fellow YC founders. Surely you have seen Windmill since you refer to it in the comments below. It looks like Hatchet, being a lot more recent, has currently a subset of what Windmill offers, albeit with a focus solely on the task queue and without the self-hosted enterprise focus. So it looks more like a competitor to Inngest than of Windmill. We released workflows as code last week which was the primary differentiator with other workflow engines and us so far: https://www.windmill.dev/docs/core_concepts/workflows_as_cod...

The license is more permissive than ours (MIT vs AGPLv3), and you're using Go vs Rust for us, but other than that the architecture looks extremely similar, also based mostly on Postgres with the same insight as us: it's sufficient. I'm curious where you see the main differentiator long-term.

HoyaSaxa

1 replies

4h12m

2024-03-09 14:26:05 UTC

No connection to either company, but for what it’s worth I’d never in a million years consider Windmill and this product to be direct competitors.

We’ve had a lot of pain with celery and Redis over the years and Hatchet seems to be a pretty compelling alternative. I’d want to see the codebase stabilize a bit before seriously considering it though. And frankly I don’t see a viable path to real commercialization for them so I’d only consider it if everything you needed really was MIT licensed.

Windmill is super interesting but I view it as the next evolution of something like Zapier. Having a large corpus of templates and integrations is the power of that type of product. I understand that under the hood it is a similar paradigm, but the market positioning is rightfully night and day. And I also do see a path to real commercialization of the Windmill product because of the above.

rubenfiszel

0 replies

3h52m

2024-03-09 14:45:14 UTC

Windmill is used by large enterprises to run critical jobs that require a predefined amount of resources and can run for months if needed, stream their logs, written in code at scale with utmost reliability, throughput and lowest overhead. The only insight from Zapier is how easy it is to develop new workflows.

I understand our positioning is not clear on our landing (and we are working on it), but my read of Hatchet is that what they put forward is mostly a durable execution engine for arbitrary code in Python/TypeScript on a fleet of managed workers, which is exactly what Windmill is. We are profitable and probably wouldn't be if we were MIT licensed with no enterprise features.

From reading their documentation, the implementation is extremely similar: you define workflows as code ahead of time, and then the engine makes sure they progress reliably on your fleet of workers (one of our customers has 600 workers deployed on edge environments). There are a few minor differences: we implement the workers as generic Rust binaries that pull the workflows, so you never have to redeploy them to test and deploy new workflows, whereas they have developed SDKs for each language to allow you to define your own deployable workers (which is more similar to Inngest/Temporal). Also we use polling and REST instead of gRPC for communications between workers and servers.

hinkley

2 replies

22h10m

2024-03-08 20:27:31 UTC

It’s been about a dozen years since I heard someone assert that some CI/CD services were the most reliable task scheduling software for periodic tasks (far better than cron). Shouldn’t the scheduling be factored out as a separate library?

I found that shocking at the time, if plausible, and wondered why nobody pulled on that thread. I suppose like me they had bigger fish to fry.

lelanthran

0 replies

11h49m

2024-03-09 06:48:18 UTC

Honestly, I'm doing something like that right now, just not in a position to show.

All I want is a simple way to specify a tree of jobs to run to do things like checkout a git branch, build it, run the tests, then install the artifacts.

Or push a new static website to some site. Or periodically do something.

My grug brain simply doesn't want to deal with modern way of doing $SHIT. I don't need to manage a million different tasks per hour, so scaling vertically is acceptable to me, and the benefits of scaling horizontally simply don't appear in my use cases.

abelanger

0 replies

20h2m

2024-03-08 22:35:56 UTC

This reminds me of: https://news.ycombinator.com/item?id=28234057

If you're saying that the scheduling in Hatchet should be a separate library, we rely on gocron [1] to run cron schedules.

[1] https://github.com/go-co-op/gocron

Yanael

2 replies

13h52m

2024-03-09 04:45:50 UTC

Looks very promising. Recently, I built an asynchronous DAG executor in Python, and I always felt I was reinventing the wheel, but when looking for a resilient and distributed DAG executor, nothing was really meeting the requirements. The feature set is appealing. Wondering if adding/removing/skipping nodes to the DAG dynamically at runtime is possible.

leafmeal

1 replies

12h18m

2024-03-09 06:19:23 UTC

a little late now, but I wonder if https://github.com/DataBiosphere/toil might meet your requirements

Yanael

0 replies

1h29m

2024-03-09 17:08:52 UTC

it's something interesting, I will have a closer look, thanks

Fiahil

2 replies

23h1m

2024-03-08 19:36:21 UTC

How does this compare to ZeroMQ (ZMQ) ?

https://zeromq.org/

vector_spaces

0 replies

22h58m

2024-03-08 19:39:57 UTC

Not the OP or familiar with Hatchet, but generally ZeroMQ is a bit lower down in the stack -- it's something you'd build a distributed task queue or protocol on top of, but not something you'd usually reach for if you needed one for a web service or similar unless you had very special requirements and a specific, careful design in mind.

This tool comes with more bells and whistles and presumably will be more constrained in what you can do with it, where ZeroMQ gives you the flexibility to build your own protocol. In principle they have many of the same use cases, like how you can buy ready made whipped cream or whip up your own with some heavy cream and sugar -- one approach is more constrained but works for most situations where you need some whipped cream, and the other is a lot more work and somewhat higher risk (you can over whip your cream and end up with butter), but you can do a lot more with it.

jeremyjh

0 replies

22h41m

2024-03-08 19:56:46 UTC

ZeroMQ is a library that implements an application layer network protocol. Hatchet is a distributed job server with durability and transaction semantics. Two completely different things at very different levels of the stack. ZeroMQ supports fan-out messaging and other messaging patterns that could maybe be used as part of a job server, but it doesn't have anything to say about durability, retries, or other concerns that job servers take care of, much less a user interface.

wodenokoto

1 replies

10h0m

2024-03-09 08:37:13 UTC

Since these are task executions in a DAG, to what degree does it compete with dagster or airflow? I get that I can’t define the task with Hatchet, but if I already want to separate my DAG from my tasks, is this a viable option?

abelanger

0 replies

17m

2024-03-09 18:20:31 UTC

It can be used as an alternative to dagster or airflow but doesn't have the prebuilt connectors that airflow offers. And yes, there are ways to reuse tasks across workflows, but the docs for that aren't quite there yet. The key is to call a `registerAction` method and create the workflow programmatically - but we have some work to do before we publicize this pattern (for one, removing the overloading of the term action, function, step and task).

We'll be posting updates and announcements in the Discord - and the Github in our releases - I'd expect that we document this pattern pretty soon.

treesciencebot

1 replies

22h4m

2024-03-08 20:33:48 UTC

Latency is really important and that is honestly why we re-wrote most of this stack ourselves, but the project with the guarantee of <25ms looks interesting. I wish there was an "instant" mode where, if enough workers are available, it could just do direct placement.

abelanger

0 replies

20h6m

2024-03-08 22:31:52 UTC

To be clear, the 25ms isn't a guarantee. We have a load testing CLI [1] and the secondary steps on multi-step workflows are in the range of 25ms, while the first steps are in the range of 50ms, so that's what I'm referencing.

There's still a lot of work to do for optimization though, particularly to improve the polling interval if there aren't workers available to run the task. Some people might expect to set a max concurrency limit of 1 on each worker and have each subsequent workflow take 50ms to start, which isn't the case at the moment.

[1] https://github.com/hatchet-dev/hatchet/tree/main/examples/lo...

rheckart

1 replies

21h0m

2024-03-08 21:38:03 UTC

Any plans for SDKs outside the current three? .NET Core & Java would be interesting to see..

abelanger

0 replies

20h13m

2024-03-08 22:24:36 UTC

Not at the moment - the biggest ask has been Rails, but on the other hand Sidekiq is so beloved that I'm not sure it makes sense at the moment. We have our hands very full with the 3 SDKs, though I'd love for us to support a community-backed SDK. If anyone's interested in working on that, feel free to message us in the Discord.

pyrossh

1 replies

2024-03-08 17:59:31 UTC

How is this different from pg-boss[1]? Other than the distributed part it also seems to use skip locked.

[1] https://github.com/timgit/pg-boss

abelanger

0 replies

2024-03-08 18:34:57 UTC

I haven't used pg-boss, and feature-wise it looks very similar and is an impressive project.

The core difference is that pg-boss is a library while Hatchet is a separate service which runs independently of your workers. This service also provides a UI and API for interacting with Hatchet - I don't think pg-boss has those things, so you'd probably have to build out observability yourself.

This doesn't make a huge difference when you're at 1 worker, but having each worker poll your database can lead to DB issues if you're not careful - I've seen some pretty low-throughput setups for very long-running jobs using a database with 60 CPUs because of polling workers. Hatchet distributes in two layers - the "engine" and the "worker" layer. Each engine polls the database and fans out to the workers over a long-lived gRPC connection. This reduces pressure on the DB and lets us manage which workers to assign tasks to based on things like max concurrent runs on each worker or worker health.
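
A toy sketch of the two-layer idea: one engine-side poll produces a batch, and the engine decides which worker gets each task based on health and max concurrent runs, so the workers themselves never touch the database. None of this is Hatchet's code; `Worker` and `dispatch` are hypothetical, and the "least loaded first" rule is just one possible assignment policy.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    max_runs: int                 # max concurrent runs this worker accepts
    healthy: bool = True
    assigned: list = field(default_factory=list)

    def has_capacity(self):
        return self.healthy and len(self.assigned) < self.max_runs


def dispatch(polled_tasks, workers):
    """Assign each polled task to the least-loaded healthy worker with capacity.

    Returns the tasks that could not be placed and stay queued for the next poll.
    """
    leftover = []
    for task in polled_tasks:
        candidates = [w for w in workers if w.has_capacity()]
        if not candidates:
            leftover.append(task)
            continue
        target = min(candidates, key=lambda w: len(w.assigned))
        target.assigned.append(task)
    return leftover


workers = [Worker("w1", 2), Worker("w2", 1), Worker("w3", 2, healthy=False)]
leftover = dispatch(["t1", "t2", "t3", "t4"], workers)
```

However many workers are connected, the database sees one poller per engine rather than one per worker.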

nextworddev

1 replies

1d1h

2024-03-08 17:31:56 UTC

I’m interested in self hosting this. What’s the recommendation here for state persistence and self healing? Wish there was a guide for a small team who wants to self host before trying managed cloud

abelanger

0 replies

2024-03-08 18:16:51 UTC

I think we might have had a dead link in the README to our self-hosting guide, here it is: https://docs.hatchet.run/self-hosting.

The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].

The other piece of this is PostgreSQL, use your favorite managed provider which has point-in-time restores and backups. This is the core of our self-healing, I'm not sure where it makes sense to route writes if the primary goes down.

Let me know what you need for self-hosted docs, happy to write them up for you.

[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...

mfrye0

1 replies

20h44m

2024-03-08 21:54:12 UTC

I've been looking for this exact thing for awhile now. I'm just starting to dig into the docs and examples, and I have a question on workflows.

I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step run logic is set up to run elsewhere? Essentially not having an inline run function defined, and another worker process listening for that step name.

abelanger

0 replies

13h47m

2024-03-09 04:50:19 UTC

This depends on the SDK - both Typescript and Golang support a `registerAction` method on the worker which basically let you register a single step to only run on that worker. You would then call `putWorkflow` programmatically before starting the worker. Steps are distributed by default so they run on the workers which have registered them. Happy to provide a more concrete example for the language you're using.

kevinlu1248

1 replies

2024-03-08 17:49:53 UTC

We're building a webhook service on FastAPI + Celery + Redis + Grafana + Loki and the experience with setting up every service incrementally was miserable, and even then it feels like logs are being dropped and we run into reliability issues. Felt like something like this should exist already but I couldn't find anything at the time. Really excited to see where this takes us!

tasn

0 replies

2024-03-08 17:56:05 UTC

That's exactly why we built Svix[1]. Building webhook services, even with amazing tools like FastAPI, Celery and Redis, is still a big pain. So we just built a product to solve it.

Hatchet looks cool nonetheless. Queues are a pain for many other use-cases too.

1: https://www.svix.com

fuddle

1 replies

21h9m

2024-03-08 21:29:10 UTC

Looks great! Do you publish pricing for your cloud offering? For the self hosted option, are there plans to create a Kubernetes operator? With an MIT license do you fear Amazon could create an Amazon Hatchet Service sometime in the future?

abelanger

0 replies

20h31m

2024-03-08 22:07:08 UTC

Thank you!

Do you publish pricing for your cloud offering?

Not yet, we're rolling out the cloud offering slowly to make sure we don't experience any widespread outages. As soon as we're open for self-serve on the cloud side, we'll publish our pricing model.

For the self hosted option, are there plans to create a Kubernetes operator?

Not at the moment, our initial plan was to help folks with a KEDA autoscaling setup based on Hatchet queue metrics, which is something I've done with Sidekiq queue depth. We'll probably wait to build a k8s operator after our existing Helm chart is relatively stable.
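For anyone unfamiliar with queue-depth autoscaling, the arithmetic is roughly the following. This is a sketch of the usual average-value calculation an HPA-style autoscaler performs, not Hatchet or KEDA code; the function name, the target of 50 queued items per replica, and the replica bounds are made-up values for illustration.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=20):
    """Return how many workers to run for a given queue depth.

    queue_depth: number of items currently waiting in the queue.
    target_per_replica: how many queued items one worker should be responsible for.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Scale so that each replica handles at most `target_per_replica` items...
    raw = math.ceil(queue_depth / target_per_replica)
    # ...then clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


idle = desired_replicas(0, 50)       # empty queue still keeps the minimum
busy = desired_replicas(120, 50)     # ceil(120 / 50) = 3
flooded = desired_replicas(5000, 50) # capped at max_replicas
```

The only Hatchet-specific piece would be exposing the queue depth as a metric the autoscaler can read.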

With an MIT license do you fear Amazon could create an Amazon Hatchet Service sometime in the future?

Yes. The question is whether that risk is worth the tradeoff of not being MIT-licensed. There are also paths to getting integrated into AWS marketplace we'll explore longer-term. I added some thoughts here: https://news.ycombinator.com/item?id=39646788.

ctoth

1 replies

21h22m

2024-03-08 21:15:39 UTC

My only question is why did you call it Hatchet if it doesn't cut down on your logs?

I'll show myself out.

abelanger

0 replies

13h41m

2024-03-09 04:56:52 UTC

Really have an axe to grind with this comment...

cebert

1 replies

4h38m

2024-03-09 14:00:09 UTC

The website for Hatchet and the GitHub repository make it look like a compelling distributed task queue solution. I see from the main website that this appears to have commercial aspirations, but I don’t see any pricing information available. Do you have a pricing model yet? I’d be apprehensive to consider using Hatchet in future projects without knowing how much it costs.

abelanger

0 replies

1h23m

2024-03-09 17:14:51 UTC

We'd like to make money off Hatchet Cloud, which is in early access - some more on that here [1] and here [2]. Pricing will be transparent once we're open access.

Like I mention in that comment, we'd like to keep our repository 100% MIT licensed. I realize this is unpopular among open source startups - and I'm sure there are good reasons for that. We've considered these reasons and still landed on the MIT license.

[1] https://news.ycombinator.com/item?id=39647101

[2] https://news.ycombinator.com/item?id=39646788

beerkat

1 replies

19h58m

2024-03-08 22:39:32 UTC

How does this compare to River Queue (https://riverqueue.com/)? Besides the additional Python and TS client libraries.

abelanger

0 replies

13h43m

2024-03-09 04:54:22 UTC

The underlying queue is very similar. See this comment, which details how we're different from a library client: https://news.ycombinator.com/item?id=39644327. We also have the concept of workflows, which last I checked doesn't exist in River.

I'm personally very excited about River and I think it fills an important gap in the Go ecosystem! Also now that sqlc w/ pgx seems to be getting more popular, it's very easy to integrate.

Nukesor

1 replies

15h47m

2024-03-09 02:51:09 UTC

Hey @abelanger,

I got a few feature requests for Pueue that were out of scope as they didn't fit Pueue's vision, but seem to fit Hatchet quite well (e.g. complex scheduling functionality and multi-agent support) :)

One thing I'm missing from your website, however, is an actual view of what the user interface looks like.

Having the possibility to schedule stuff in a smart way is nice and all, but how do you *overlook* it? It's important to get a good overview of how your tasks perform.

Once I'm convinced that this is actually a useful piece of software, I would like to reference you in the Readme of Pueue as an alternative for users that need more powerful scheduling features (or multi-client support) :) Would that be ok for you?

abelanger

0 replies

14h6m

2024-03-09 04:31:34 UTC

Pueue looks cool, it's not an alternative to Hatchet though - looks like it's meant to be run in the terminal or by a user? We're very much meant to run in an application runtime.

Like I mentioned here [1], we'll expand our comparison section over time. If Pueue's an alternative people are asking about, we'll definitely put it in there.

Having the possibility to schedule stuff in a smart way is nice and all, but how do you overlook it? It's important to get a good overview of how your tasks perform.

I'm not sure what you mean by this. Perhaps you're referring to this - https://news.ycombinator.com/item?id=39647154 - in which case I'd say: most software is far from perfect. Our scheduling works but has limitations and is being refactored before we advertise it and build it into our other SDKs.

[1] https://news.ycombinator.com/item?id=39643631

Kluggy

1 replies

2024-03-08 17:49:18 UTC

In https://docs.hatchet.run/home/quickstart/installation, it says

Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.

but the link to "self-hosted quickstart" links back to the same page

abelanger

0 replies

2024-03-08 18:18:03 UTC

This should be fixed now, here's the direct link: https://docs.hatchet.run/self-hosting.

wereHamster

0 replies

7h12m

2024-03-09 11:25:32 UTC

Does it (or will it, i.e. is it planned) support delayed execution? E.g. I have a task that I want to run at a certain time in the future.

sroussey

0 replies

22h55m

2024-03-08 19:42:53 UTC

Ah nice! I am writing a job queue this weekend for a DAG based task runner, so timing is great. I will have a look. I don't need anything too big, but I have written some stuff for using PostgreSQL (FOR UPDATE SKIP LOCKED for the win), sqlite, and in-memory, depending on what I want to use it for.

I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.

Interested to look at your particular approach.
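For readers who haven't seen the `FOR UPDATE SKIP LOCKED` trick mentioned above, a minimal sketch of the claim step looks something like this. The `jobs` table, its columns, and the `claim_next_job` helper are assumptions for illustration, not anyone's actual schema; `conn` is any DB-API 2.0 connection to PostgreSQL.

```python
# Claim one queued job atomically. Rows locked by other workers are skipped
# instead of waited on, so concurrent workers never block each other and
# never claim the same job.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_job(conn):
    """Return (id, payload) for the claimed job, or None if the queue is empty."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        # Release any row lock we may hold before re-raising.
        conn.rollback()
        raise
    finally:
        cur.close()
```

Each worker calls `claim_next_job` in a loop; retries and timeouts then become a matter of resetting `status` on rows that have been `running` for too long.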

serbrech

0 replies

11h29m

2024-03-09 07:08:28 UTC

I wish that this was just an SDK built on top of a provider/standard. AMQP 1.0 is a standard protocol. You can build all this without being tied to a product or to RabbitMQ, with a storage provider and an AMQP protocol layer.

radus

0 replies

2024-03-08 18:04:33 UTC

You've explained your value proposition vs. celery, but I'm curious if you also see Hatchet as an alternative to Nextflow/Snakemake which are commonly used in bioinformatics.

peterisdirsa

0 replies

14h41m

2024-03-09 03:56:20 UTC

This is not a viable product, it's a feature

pdksam

0 replies

50m

2024-03-09 17:47:21 UTC

How is this different from Cadence by Uber or SWF?

notpushkin

0 replies

9h9m

2024-03-09 09:28:16 UTC

Congrats on the launch!

You say Celery can use Redis or RabbitMQ as a backend, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely wouldn't recommend anybody use this in production now, but it seems to still work fine. [1]

How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?

[1]: https://stackoverflow.com/a/47604045/1593459

jbergstroem

0 replies

4h17m

2024-03-09 14:20:55 UTC

Have you considered https://github.com/tembo-io/pgmq for the queue bit?

iangregson

0 replies

18h48m

2024-03-08 23:49:18 UTC

I love this idea. I wish it existed a few years ago when I did a not so good job of implementing a distributed DAG processing system :D

Looking forward to trying it out!

hannasm

0 replies

14h12m

2024-03-09 04:25:39 UTC

Seems like this summary should be in the README

dataangel

0 replies

14h16m

2024-03-09 04:22:01 UTC

Distributed

Built on PostGRES

Not what people usually mean by distributed, caveat emptor

dalberto

0 replies

23h36m

2024-03-08 19:01:13 UTC

I'm curious if this supports coroutines as tasks in Python. It's especially useful for genAI, and legacy queues (namely Celery) are lacking in this regard.

It would help to see a mapping of Celery to Hatchet as examples. The current examples require you to understand (and buy into) Hatchet's model, but that's hard to do without understanding how it compares to existing solutions.

cybice

0 replies

22h7m

2024-03-08 20:30:26 UTC

Why might Hatchet be better than Windmill? Windmill uses the same approach with PostgreSQL, is very fast, and has an incredibly good UI.

adeptima

0 replies

7h31m

2024-03-09 11:06:55 UTC

Exciting time for distributed, transactional task queue projects built on top of PostgreSQL!

Here are the most heavily upvoted in the past 12 months:

Hatchet https://news.ycombinator.com/item?id=39643136

Inngest https://news.ycombinator.com/item?id=36403014

Windmill https://news.ycombinator.com/item?id=35920082

HN comments on Temporal.io https://github.com/temporalio https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Internally we rant about the complexity of the above projects vs. using transactional job queue libs like:

river https://news.ycombinator.com/item?id=38349716

neoq https://github.com/acaloiaro/neoq

gue https://github.com/vgarvardt/gue

Deep down, I can't wait to see someone like ThePrimeTimeagen review it ;) https://www.youtube.com/@ThePrimeTimeagen

acaloiar

0 replies

23h15m

2024-03-08 19:22:38 UTC

A related lively discussion from a few months ago: https://news.ycombinator.com/item?id=37636841

Long live Postgres queues.

CoolCold

0 replies

14h4m

2024-03-09 04:33:27 UTC

From your experience, what would be a good way of doing Postgres Master-Master? My understanding is that Postgres Professional/EnterpriseDB based solutions provide reliable M-M, and those are proprietary.