
Maestro: Netflix's Workflow Orchestrator

slt2021
56 replies
21h4m

I used to be impressed with these corporate techblogs and their internal proprietary systems, but not so much anymore. Because code is a liability.

I would rather use off-the-shelf open source stuff with a long history of maintenance and improvement than reinvent cron/celery/airflow/whatever, because code is a liability. Somebody needs to maintain it, fix bugs, add new features. Unless I get a +1 grade promotion and a salary/RSU bump, of course.

People need to realize that code is a liability; anything that isn't the business-critical stuff that earns/makes $$$ for the company is a distraction and a resource sink.

bluepizza
20 replies
20h20m

People need to realize that code is a liability

This is an extreme point of view, tightly connected to the MBA-driven min-maxing of everything under the sun.

I am glad that there are folks who aren't afraid to code new systems and champion new ideas. Even in the corporate sense, mediocre risk averse solutions will only take you so far. The most profitable companies tend to be quite daring in their tech.

Code is not a liability. Code is what makes a company move its gears.

delecti
17 replies
19h42m

Code being a liability is not a contradiction with code being what makes a company move its gears. The trucks of a delivery service are a liability (requiring maintenance, depreciation accounting, fuel), but are also the only thing that lets the company deliver. A delivery company should own as few trucks as necessary, and no fewer. Any company should publish/run/maintain as little code as necessary, and no less.

slt2021
7 replies
18h39m

For a trucking company, owning and developing trucks makes sense.

But does it make sense for a trucking (streaming) company to create its own plumbing equipment? I'd rather use the Plumbers Supply Inc gear that every other company buys from Plumber Depot, or use open-source-plumbers.com, because I am not in the plumbing business.

bhawks
6 replies
18h21m

The margin on trucking could be so much higher than plumbing that most plumbers could never afford the R&D necessary to advance flushing tech. Big truck operates at a scale where they materially benefit from better flushing, so they take their truck dollars and pour them into their own plumbing lab. Big truck sees this as a competitive advantage that no one else is positioned in the market to unlock. They may one day enter the general plumbing space and disrupt waste management, at their option not obligation of course.

This describes Google and Amazon perfectly - while you can armchair quarterback their biz decisions they are definitely doing well for themselves.

slt2021
5 replies
16h58m

Amazon actually steals a lot of open source and repackages it as a “managed AWS service”; they literally deployed managed Airflow as soon as it became popular.

The whole AWS re:Invent is repackaging whatever open source project is trending: hiding the control plane from the user, exposing it via the AWS control plane instead, and charging people per usage instead of per server.

narism
2 replies
16h11m

Can you steal something that is given away freely? People are buying those services so they must be providing some value.

slt2021
0 replies
15h14m

I was just replying to the comment above claiming that Amazon somehow rolls their own stuff and gives back to the community by open sourcing their systems.

Amazon’s approach is the opposite: steal the open source repo and make $$$ off of open source contributors’ labor.

bhawks
0 replies
15h13m

I can't find a single line in the Apache software license that would indicate that Amazon is breaking any agreements set forth for Airflow.

bushbaba
1 replies
15h40m

Except they also provide the security, billing/invoicing, IaC, support, provisioning, scaling; the list goes on.

As for pay per server vs. pay per usage: heck, Amazon actually bills the team that caused the cost, and gives finance a report on how much each team is spending and on what. Good luck doing that on prem.

slt2021
0 replies
15h12m

The question is how much do they give back to the open source community, after making boatloads of $$$ off of opensource contributions and whether their model is sustainable and healthy for the FOSS movement

bhawks
6 replies
18h30m

Trucks are literally an asset - you can't do depreciation on a liability.

The only way a 'truck' could be a liability is a lease for said truck.

There are plenty of economically rational reasons why a company may own more trucks than it strictly needs to manage deliveries: for example, wanting to handle seasonal bursts, wanting to ensure reliability, preparing for an expansion, or being able to lease capacity to other businesses.

Actually you can go replace truck with server and you describe what made AWS make initial sense.

Please stop misusing accounting concepts.

patmorgan23
2 replies
15h47m

A truck also comes with a maintenance liability if you want to continue getting value out of it, just like code.

bhawks
1 replies
15h31m

Liabilities are obligations of a company to pay money owed to a lender as a result of a previous transaction.

You are describing an operating expense which has an entirely different nature than a liability.

'Comes with a maintenance liability' is a handwaving statement that means practically nothing without a ton of contextual information. A true liability has a contractual set of obligations to pay defined amounts on an agreed-upon schedule. No one is going to come after you for not changing the oil on your truck; try missing payments on a lease.

fragmede
0 replies
1h6m

No one is going to come after you for not changing the oil on your truck

Several parties will come after you for not changing the oil on your semi-truck that is being used professionally for freight, starting with your driver, your insurance company, and the US Department of Transportation (DOT), specifically the Federal Motor Carrier Safety Administration (FMCSA), to whom you may have to provide maintenance records. Trucking is a highly regulated industry, and after CrowdStrike, software engineering is only going to get more regulated, not less.

cbsmith
1 replies
17h44m

Please stop misusing accounting concepts.

Assets can also be liabilities. The mortgages in a mortgage-backed security are both an asset and a liability, as was only too well demonstrated in 2008... They're an asset in the security portfolio, but until you sell the security, a liability for whoever is securitizing it.

bhawks
0 replies
15h17m

In the GFC the government literally created the Troubled _Asset_ Relief Program. Those MBSs were assets and didn't magically become liabilities.

The problem was that the market value of those assets plummeted, because no one expected them to generate the agreed-upon cash flows once the underlying loans went into correlated defaults. Despite all this, the only party that saw the mortgage as a liability was the individual whose responsibility it was to make a monthly payment on said mortgage.

Outside of swaps and other derivatives, financial instruments and other properties don't magically switch from being an asset to being a liability based on random external factors.

This conversation is like accountants talking about processes, threads, fibers and context switching... very imprecisely.

delecti
0 replies
4h42m

I'm not using liability in the accounting context, but in the colloquial one.

a person or thing whose presence or behavior is likely to cause embarrassment or put one at a disadvantage.

Code is absolutely a liability. Code deteriorates as conditions change, and unchanged code also becomes more vulnerable in a way that conventional objects can't.

bilalq
1 replies
19h5m

But thinking of those trucks primarily as a liability is exactly the kind of mindset that leads to companies minimizing their liabilities instead of maximizing their potential.

shermantanktop
0 replies
18h45m

Especially when the cost of minimizing (long hours, unsafe conditions) is not felt by decision makers, and may not materialize for a while, but the benefit of maximizing their potential is felt directly and immediately.

Incentives are everything. That's why managers are so careful when applying them to their own jobs.

pants2
1 replies
18h3m

Using open-source is a liability too, with added problems of code licensing conflicts, supply chain attacks, zero-day vulnerabilities, relying on maintainers that don’t work for you, etc.

cbsmith
0 replies
17h41m

Not open source is a liability too, with added problems of code licensing conflicts, supply chain attacks, zero-day vulnerabilities, relying on maintainers that don't work for you, etc... ;-)

cortesoft
15 replies
19h6m

Isn't this exactly WHY this blog post exists? They are open sourcing this software so that they don't have to maintain it all internally anymore.

They had a need that an existing "off-the-shelf open source" project didn't solve, so they created this and are now turning it into an "off-the-shelf open source" project so they can keep using it without having to maintain it entirely themselves.

How are these open source tools supposed to be created in the first place? This is the process; someone has to do it.

rjh29
7 replies
18h15m

Usually the corporate needs differ too much and they end up keeping their own fork anyway.

Netflix has the resources to maintain this. It's probably more of a PR move for their hiring division.

gregoriol
5 replies
9h6m

Indeed, this is not open-source: this is public-source. They don't really open the project to external contribution; they just publish their code and continue the project as their own tool. They will have no incentive to add features that aren't useful to their business, even if they're useful to the community (if provided via a PR, for example), because all of the project's developers are employed by the same company, and that company has no reason to review and fix code that isn't part of its business.

mhitza
2 replies
7h37m

Indeed, this is not open-source: this is public-source. They don't really open the project to external contribution

It's open source, and they don't have to accept external contributions. Terms have a well-defined meaning; please refrain from calling open source code not open source, and not-open-source code open source.

sramam
1 replies
5h53m

You are arguing the difference between the letter and spirit of the law.

homarp
0 replies
3h26m

You can fork it and continue development in case they change their mind; it is open source.

Any other 'source available' license would not (legally) let you do that.

mikepurvis
0 replies
1h29m

I think the contention here is more about whether it's an open project: does it have an open bug tracker, an open project management structure, clear governance, etc.?

It not having those things is fine, and eventually someone may still take the source and create an open project around it. But understanding that it is a Netflix project helps calibrate people's understanding of whether the model when you find a bug is going to be "fork, fix, and run the fork indefinitely" or "fork, fix, contribution accepted, drop fork and return to upstream."

darkwater
0 replies
14h47m

Absolutely this. It literally happened to me with a Netflix OSS project we were using at work. I found a bug that was biting us and opened a ticket with a PR attached with a possible fix, and got an answer after a few months: "ah yeah, we fixed this in our internal version a while ago, thanks, will merge it now."

slt2021
6 replies
18h16m

So Netflix expects open source community to pick up the maintenance tab ?

I understand how open source projects are born, but I struggle to see what the novelty of this project is. Just another Java CRUD app with some questionable design choices that are only applicable to Netflix:

1. They claim it is a distributed system, but it is just a regular Java CRUD app with a SQL backend

2. Java-like DSL with parser and classloader (why? Just why?)

Projects like these are the perfect example of Enterprise Grade FizzBuzz (https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...) and this is exactly what I don't like about it.

cortesoft
1 replies
15h30m

So Netflix expects open source community to pick up the maintenance tab?

Isn’t this the deal with all open source? They are giving something (the code and access to the project) in return for help maintaining it.

No one is being forced to do anything. It is not like there is some open source contributor somewhere now saying, “oh damn, now I have to maintain this, too?”

If people like it and find value in it, they can contribute to the project in the ways they want. Netflix gets to use those contributions in return for letting people use theirs. That is just how open source works.

nicce
0 replies
5h42m

Isn’t this the deal with all open source?

If Netflix still heavily uses this internally, they should still do the most maintenance. Others contribute based on their own needs.

jjuliano
0 replies
5h30m

So Netflix expects open source community to pick up the maintenance tab ?

I think the notion of open sourcing a project is that you are literally asking the community for help, and that the community will naturally help you with the maintenance.

geodel
0 replies
17h11m

You are making great points. It's the power of Netflix marketing and branding that they are considered a cutting-edge tech company. In reality, most of Netflix's Java projects are pretty mediocre enterprise Java stuff. In the last year or so they have mandated Spring Boot as the development platform for all their web services.

This is exactly the same stack I have to deal with daily, and management's reasoning is that it's the lowest common denominator that works well with a 3-month contract developer delivering the Nth microservice whose sole job is to call another service.

digger495
0 replies
5h28m

People want an alternative to things like Temporal, and don't want to handle DAGs with Kafka Streams.

cbsmith
0 replies
17h49m

So Netflix expects open source community to pick up the maintenance tab ?

In fairness, the very nature of open source is that the community is only going to pick up the maintenance tab if the value they're getting out of it is worth it.

YawningAngel
4 replies
20h23m

Off-the-shelf open source stuff is often the product of big companies open sourcing internal tools, though. Airflow, which you name-check, is a great example of this. Temporal is another example in the space. Someone has to be dumb enough to build new stuff.

slt2021
3 replies
17h58m

Airflow and Temporal have teams dedicated to maintaining and extending their systems. And those systems are business-critical for Astronomer and Temporal, respectively.

And they develop them in a way that works for many customers and use cases, not just Netflix.

But for Netflix this is just another auxiliary system, one of many. Just a nice GUI to schedule cron jobs, basically; does it make sense to sink resources into a custom cron?

hulahoof
2 replies
15h51m

But when Airbnb created airflow, you could have said the same. It’s just later in its lifecycle.

artwr
1 replies
13h13m

Agreed.

To be fair, I doubt Maestro will take off like Airflow did.

Airflow filled a void of an easier orchestrator for Big Data with a prettier UI than the competitors of the time (Oozie, Luigi), implementing some UX patterns which had been tested at scale at Facebook with data swarm.

The field is quite a bit more crowded now.

turtle4
0 replies
7h27m

Seems like you have some experience with the orchestrator offerings. Airflow still the way to go, or would you recommend something else for someone just starting down the path of selecting and implementing a data orchestrator?

bhawks
2 replies
19h50m

with long history of maintenance and improvement,

That is a huge load bearing statement.

Do you plan on any contributions back to the community yourself?

Build vs. buy is always an important conversation, but claiming that the 'buy'-side path has exactly zero maintenance and reliability costs reeks of naivety.

slt2021
1 replies
18h35m

If I needed container orchestration I would use k8s. I can improve it, propose patches/bug fixes, or chip into an open source maintainers' fund. I won't write my own orchestrator, especially being in a streaming business.

That's what I meant; it isn't even necessarily build-vs-buy, but rather use-open-source-and-contribute vs. reinvent-the-wheel-for-an-L6-promo-and-then-open-source-it.

Would the world be better off with 10 workflow orchestrator systems, or one mature one?

bhawks
0 replies
14h45m

Netflix is building a workflow orchestrator not a container orchestrator. The viable alternative would be Airflow or maybe something like Temporal. K8s alone isn't going to meet the need in this case.

Does the world need another workflow orchestrator? Who knows; some folks at Netflix seem willing to pay a handful of engineers $ to do so. Good luck to them.

wodenokoto
0 replies
14h3m

Aren’t most of those things developed in house at tech giants and later open sourced?

why-el
0 replies
19h46m

I am confused by this comment:

open source stuff with long history of maintenance and improvement

Improvement and maintenance are contingent on usage, and having been used at Netflix, this project is in a better position to have already faced whatever bug you are worried about (and let's be real, 99% of applications won't ever get the luck to exercise code paths sophisticated enough to find bugs Netflix has not found already).

You might be unnecessarily projecting here. You don't have evidence that open sourcing this was for any reason other than that it is simply good for the community to have.

ripped_britches
0 replies
20h51m

100%. Very rarely are these systems built as robustly as by external folks who earn a profit on building robustness. The best example of course being Stripe. But I see this in everything from visual snapshot testing tools to custom CI workflows. The good thing is you can always rely on competitive market dynamics to price the off-the-shelf solution down to a reasonable margin above maintenance costs.

makeset
0 replies
20h29m

anything that is not the business critical stuff

That's an important qualifier. For skilled teams in performance-critical domains, the inflection point where any outside code becomes a low-quality/low-control liability is not that far.

ldjkfkdsjnv
0 replies
16h47m

I was going to say this. I never mess with random libraries like this, always so much pain.

jefurii
0 replies
20h34m

This sounds like the beginning of a sales pitch.

jcgrillo
0 replies
2h47m

People need to realize that code is a liability

Code that you own and intimately understand is less of a liability than some 3rd party dependency (paid or free). Stitching together a patchwork of dependencies is not likely the optimal result. The more aligned your codebase is with the problem you're trying to solve the better, and if functionality is core to your business better to own than borrow or rent.

beanjuiceII
0 replies
14h21m

Off the shelf comes with its own set of burdens; it's not always sunshine, rainbows, and lollipops.

archerx
0 replies
2h52m

This is a naive view; other people’s code is even more of a liability. Look at CrowdStrike and open source infiltrations. Using open source software doesn’t magically grant you security or stability.

alfalfasprout
0 replies
19h59m

I very much disagree with this take, and the more I've experienced throughout my career, the more sure of it I am.

Companies spend an IMMENSE amount of time and effort adapting sometimes subpar off-the-shelf solutions to fit their infra, and pay an ongoing tax of increasing tech debt trying to support them. Often something bespoke, smaller, and more tailored would unlock significantly more productivity if the investment is made consciously.

Any code that is written comes with both assets and liabilities. But to claim it is a distraction and a resource sink is a very, very bad take. Every decision to build something in-house needs to be made thoughtfully and deliberately.

MetaWhirledPeas
0 replies
11m

code is a liability

3rd parties are also a liability. Pick your poison. Trust in unknown individuals, trust in megacorps, or trust your own people. Choosing wisely is why people get paid the big bucks.

hintymad
25 replies
21h42m

I wonder how many iterations we will need before engineers are happy with a workflow solution. Netflix had multiple solutions before Maestro, such as Metaflow. Uber built multiple solutions too. Amazon had at least a dozen internal workflow engines. It's quite curious why engineers are so keen on building their own workflow engines.

Update: I just find it really interesting that many individuals at many companies like to build workflow engines. This is not a deriding comment towards anyone, or Netflix in particular. To me, the observation is worth some friendly chitchat.

ilrwbwrkhv
8 replies
21h9m

It's because Netflix pretends to be a tech company to get the high market cap.

So they hire tons of engineers who have nothing to do but rearchitect the mess their microservices have created.

Then there are others who create observability and test harnesses for all of that.

When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit.

rty32
1 replies
19h49m

What is the methodology of the report?

Just one of the questions I have regarding this: China has nearly 1.4 billion people, and barely any of them use any of the services here. Instead, they have their own video platforms. And you're telling me that none of those platforms sees at least the same amount of traffic as Prime Video? I doubt it.

barnabyjones
0 replies
16h17m

I found the report the statistic is from [0]. But note that it says "by app," so I don't think it's actually all traffic, just the top apps. Their reported source is data from 300m customers in different regions.

[0] https://www.sandvine.com/hubfs/Sandvine_Redesign_2019/Downlo...

ATMLOTTOBEER
1 replies
20h21m

“Other” in your diagram is mostly porn

thfuran
0 replies
17h55m

Even supposing that "Other" is just pornhub and nothing else, that's less than one order of magnitude more than Netflix.

exe34
0 replies
20h40m

isn't it like 30%?

tempest_
0 replies
19h19m

To be fair, when Netflix started they were solving legitimate problems that a major streaming provider would have.

In the time since, those problems have been solved and are now offered as a service by most cloud providers (for a hefty fee, of course).

renewiltord
0 replies
16h23m

When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit.

That's nothing. My dedicated server delivers two orders of magnitude greater traffic than Pornhub (and everything in the Mindgeek network really). And I don't even need the cloud. Just better engineering.

pm90
2 replies
19h49m

It’s likely because we haven’t yet found a workflow engine/orchestrator that's capable of handling diverse tasks while still being easy to understand and operate.

It’s really easy to build a custom workflow engine and optimize it for specific use cases. I think we haven’t yet seen a convergence simply because this tool hasn’t yet been built.

Consider the recent rise of tools that quickly dominated their fields: Terraform (IaC), Kubernetes (distributed compute). Both systems are hella complex, but they solve hard problems. Generic workflow engines are complex to understand and difficult to operate and offer a middling experience so many folks don’t even bother.

fragmede
1 replies
16h6m

slurm? airflow?

alfalfasprout
2 replies
19h50m

The issue is that "workflow orchestration" is a broad problem space. Companies need to address a lot of disparate issues and so any solution ends up being a giant product w/ a lot of associated functionality and heavily opinionated as it grows into a big monolith. This is why almost universally folks are never happy.

In reality there are five main concerns:

1. Resource scheduling: "I have a job or collection of jobs to run... allocate them to the machines I have."

2. Dependency solving: if my jobs have dependencies on each other, perform the topological sort so I can dispatch things to my resource scheduler.

3. API/DSL for creating jobs and workflows: I want to define a DAG... sometimes static, sometimes on the fly.

4. Cron-like functionality: I want to be able to run things on a schedule or ad hoc.

5. Domain awareness: if doing ETL I want my DAGs to be data-aware; if doing ML/AI workflows, I want to be able to surface info about what I'm actually doing with them.

No one solution does all these things cleanly. So companies end up building or hacking around off the shelf stuff to deal with the downsides of existing solutions. Hence it's a perpetual cycle of everyone being unhappy.

I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.
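The dependency-solving concern above is the most mechanical of the bunch; Python's standard library even ships the core of it. A minimal sketch (the task names and `workflow` dict are made up for illustration, and a real scheduler would dispatch each ready batch in parallel rather than collect it into a list):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract":  set(),
    "clean":    {"extract"},
    "features": {"clean"},
    "train":    {"features"},
    "report":   {"clean", "train"},
}

ts = TopologicalSorter(workflow)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()   # every task whose dependencies are all done
    order.extend(ready)      # a real scheduler would dispatch these in parallel
    for task in ready:
        ts.done(task)

print(order)  # a valid execution order, e.g. ['extract', 'clean', 'features', 'train', 'report']
```

The `get_ready`/`done` loop is what makes this a scheduler skeleton rather than a plain `list(ts.static_order())`: each batch returned by `get_ready` is independently runnable.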

swyx
0 replies
5m

great insight, appreciate this. would also point out logging/event sourcing for "free" observability

SOLAR_FIELDS
0 replies
15h12m

The issue indeed is that "workflow orchestration" is a broad problem space. I would argue that the solution is not this:

I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.

But rather more specialized tools that solve specific issues.

What you describe just sounds like a better implemented version of Airflow or the over 100 other systems that are actively trying to be this today (Flyte, Dagster, Prefect, Argo Workflows, Kubeflow, Nifi, Oozie, Conductor, Cadence, Temporal, Step Functions, Logic Apps, your CI system of choice has their own, need I continue, that is not even scratching the surface). Most of those have some sort of "plugin" ecosystem for custom code, in varying degrees of robustness.

For what it is worth, everyone and their mom thinks they can make and wants to be this orchestrator. It's a problem that is just so generic and such a wide net that you end up with annoying-to-use building blocks because everyone wants to architecture astronaut themselves into being the generic workflow orchestration engine. The ultimate system design trap: Something so fundamentally easy to grok and conceptualize that you can PoC one in hours or days, but near infinite possibilities of what you can do with it, resulting in near infinite edge cases.

Instead, I'd rather companies just focus on the problem space that it lends itself to. Instead of Dagster saying "Automate any workflow" and try to capture that space, just make building blocks for data engineering workflows and get really good at that. Instead of Github Actions being a generic "workflow engine" just have it really good at making CI workflow building blocks.

But we can't have it that way. Because then some architecture astronaut will come around and design a generic workflow engine for orchestrating your domain specific workflow engines and say that you no longer need those.

Actually I think I just convinced myself that what you are suggesting actually IS the right way. If companies just said "we will provide an Airflow plugin" instead of building their own damn Airflow this would be easy. But we won't ever have that either. What we really need is some standards around that. Like if CNCF got together and got tired of this and said "This is THE canonical and supported engine for Kube workflows, bring your plugins here if you want us to pump you up". That might work. They've usually had better luck with putting people in lockstep in the Kube ecosystem at least than Apache has historically for more general FOSS stuff. Probably because the problem space there is more limited.

cbsmith
0 replies
17h35m

No, it's just the two things: naming things, cache invalidation, and off by one errors.

dinobones
1 replies
21h35m

We rolled our own workflow engine and it almost crashed one of our unrelated projects for having so many bugs and being so inflexible.

I’m starting to think workflow engines are somewhat of a design smell.

It’s enticing to think you can build this reusable thing once and use it for a ton of different workflows, but besides requiring more than one asynchronous step, these workflows have almost nothing in common.

Different data, different APIs, different feedback required from users or other systems to continue.

ryanianian
0 replies
20h49m

workflow engines are somewhat of a design smell

Probably so, but the real design smell seems to be thinking of a workflow engine as a panacea for sustainable business process automation.

You have to really understand the business flow before you automate it. You have to continuously update your understanding of it as it changes. You have to refactor it into sub-flows or bigger/smaller units of work. You have to have tests, tracer-bullets, and well-defined user-stories that the flows represent.

Else your business flow automation accumulates process debt. Just as much as a full-code-based solution accumulates technical debt.

And, just like technical debt, it's much easier (or at least more interesting) to propose a rewrite or framework change than it is to propose an investment in refactoring, testing, and gradual migrations.

savin-goyal
0 replies
21h19m

Metaflow sits on top of Maestro, and neither replaces the other

...Users can use Metaflow library to create workflows in Maestro to execute DAGs consisting of arbitrary Python code. from https://netflixtechblog.com/orchestrating-data-ml-workflows-...

The orchestration section in this article (https://netflixtechblog.com/supporting-diverse-ml-systems-at...) goes into detail on how Metaflow interplays with Maestro (and Airflow, Argo Workflows & Step Functions)

renewiltord
0 replies
16h26m

We all have different use-cases. We also have a workflow engine at work but that's because we wanted immediate execution. From submit to execute time can be 100 ms on our system, which makes it also work well for short jobs. Usually, the task coordinator overhead is greater than that on these things.

otabdeveloper4
0 replies
5h34m

why engineers are so keen on building their own workflow engines

Because all the existing ones suck.

(We built our own tiny one too. We need tight integration with systemd jobs and cgroups, and existing solutions don't do that.)

nijave
0 replies
21h9m

These things tend to be fairly complex and require lots of integration with various services to get working. I think it's a little more organic to start building something simple and progressively add more than to implement one from scratch (unless there are people around with experience).

dekhn
0 replies
21h12m

I wrote my own because I wanted to learn about DAGs and toposort and had some ideas about what nodes and edges in the workflow meant (i.e., does data flow over edges, or do the edges just represent the sequence in which things run? Is a node a bundle of code? Does it run continuously, or run then exit?). I almost ended up with reflow, which is a functional-programming approach based on Python, similar to Nextflow, but I found the whole functional approach extremely challenging to reason about and debug.

Often times what happens is the workflow engine is tailored to a specific problem and then other teams discover the engine and want to use it for their projects, but often need some additional feature, sometimes which completely up-ends the mental model of the engine itself.

Nathanba
0 replies
17h38m

It inherently asks for a custom implementation, because workflows are almost just how you'd have to code and run everything anyway. Conceptually: why wouldn't we want to reconnect to any work currently in progress, just like in a video game where, if we lose connection for a split second, we want to be able to keep going where we left off? So we must save the current step persistently and make sure that we can resume work and never lose it.

Workflow engines also do no magic: they still just run code, and if it fails in a place that we didn't manually checkpoint (by making it into a separate task/workflow/function/action/transaction that is persistable), then we still lose data. So at that point, why not just try doing it this way everywhere, whether it's running in a "workflow engine" or not? Before workflow engines we already had db transactions, but those were mostly for our benefit, so we don't mess up the db with partial inserts.

That said, what I've seen so far in open source workflow engines is that they don't let you work with user input easily; it's sad how they all start a new thread and then just block it while waiting for the user to send something. That is obviously not how you'd code a CRUD operation, and in my opinion it's a huge drawback of current workflow engines. If this were solved, I think we should literally do everything as a workflow. Every form submission could offer to let the user continue where he left off, since we saved all his data, so "he can reconnect to his game" (to revive the video game metaphor I started with).
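The reconnect-and-resume idea described here can be sketched in a few lines of Python. The JSON state file is a stand-in for whatever durable store a real engine would use, and the sketch deliberately ignores concurrency, retries, and a crash mid-checkpoint:

```python
import json
import os

def run_workflow(steps, state_file):
    """Run (name, fn) steps in order, checkpointing each completed
    step so an interrupted run resumes where it left off."""
    done = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            done = json.load(f)      # steps finished by a previous run
    for name, fn in steps:
        if name in done:
            continue                 # skip work already checkpointed
        fn()
        done.append(name)
        with open(state_file, "w") as f:
            json.dump(done, f)       # persist progress after every step
    os.remove(state_file)            # finished: clear the checkpoint

# One complete run.
log = []
run_workflow([("extract", lambda: log.append("extract")),
              ("transform", lambda: log.append("transform")),
              ("load", lambda: log.append("load"))],
             "demo_state.json")
print(log)  # ['extract', 'transform', 'load']
```

If the process dies between two steps, rerunning with the same state file replays only the steps that never checkpointed, which is exactly the "reconnect to the game" behavior the comment describes.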

meliora245
9 replies
22h23m

why would one consider this over something more established such as Temporal, also I see Maestro is written in Java vs Temporal's Go

trustno2
1 replies
6h54m

Temporal's go is... something. They used to use Java (I think), then they switched to Go, and the Go is very Java-like.

Or maybe I just don't know Fx.

https://github.com/temporalio/temporal/blob/main/service/mat...

The issue we hit with Temporal - again and again - is that it's very under-documented, and it's something you install at the core of your business, yet it's really hard to understand what is going on, through all the layers and through the very obtuse documentation.

Maestro has... no documentation? OK Temporal wins by default.

troebr
1 replies
21h36m

Didn't they rewrite some of Temporal's core in rust?

sjansen
0 replies
21h20m

They (re)wrote most of the client SDKs on a Rust core, but the Temporal server is still written in Go.

iamspoilt
0 replies
22h0m

That's also my question.

aimazon
0 replies
21h22m

isn’t Maestro an alternative to Airflow, not Temporal? Temporal isn’t a workflow orchestrator. There’s some overlap on the internals but they’re different designs for different use cases.

gtrubetskoy
5 replies
22h0m

The name Maestro has already been used for a workflow orchestrator which I worked on back in 2016. That maestro is SQL-centric and infers dependencies automatically by simply examining the SQL. It's written in Go and is BigQuery-specific (but could be easily adjusted to use any SQL-based system).

https://github.com/voxmedia/maestro/

stepanhruda
4 replies
21h17m

With all due respect, there are so many projects. They don’t care about clashing with a repo that has 12 stars and 14 commits.

nijave
3 replies
21h8m

Worked at a bank that named their container "cloud" platform GCP and it was in no way related to Google facepalm

stavros
2 replies
20h32m

Well, if you're so unimaginative as to call your cloud platform "<companyname> cloud platform", it's not the fault of the second company whose name also starts with a G.

nijave
1 replies
20h22m

Worse, the G was Gaia (ironically the personification of Earth in Greek mythology). They used "Gaia" as a name for all their internal cloud platforms

jdmichal
0 replies
19h22m

Hello fellow ex-employee of that bank. I was in a segment governed by PCI, and they wouldn't even let us touch Gaia in fear of the whole thing being declared in scope

andbberger
5 replies
19h22m

slightly off topic, but there is dire need for a scientific "workflow manager" built to FAANG engineering standards attuned for the needs of academia (ie primarily designed to facilitate execution of DAGs on clusters). The airflows of the world have complex unnecessary features and require extensive kitbashing to plug into slurm and the academic side of things is a huge mess. Snakemake comes the closest but suffers from massive feature creep, a bizarre specification DSL (superset of python) and blurred resource requirement abstraction boundaries.

torrance
2 replies
14h49m

What about Nextflow?

andbberger
0 replies
12h41m

I considered Nextflow before begrudgingly settling on snakemake for my current project. Didn't record why... possibly because snakemake was already a known quantity and I was under time pressure or because I felt the task DAG would be difficult to specify in WDL. It's certainly the most mature of the bunch.

_Wintermute
0 replies
5h36m

Nobody wants to write or debug groovy, especially scientists who are used to python. It also causes havoc on a busy SLURM scheduler with its lack of array jobs (heard this is being fixed soon).

slt2021
1 replies
16h47m

Academia would be better off learning k8s and one of the k8s-native workflow orchestrators. That is as close to FAANG-grade and open source as they can get, and arguably a bit better than this repo

andbberger
0 replies
12h36m

for better or worse slurm is the status quo for HPC. it works, every university has a slurm cluster, people already know how to use it

rubenfiszel
3 replies
5h8m

Founder of https://windmill.dev here, which shares many similarities with Maestro.

Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, task distribution to compute engines, etc.. Users can package their business logic in various formats such as Docker images, notebooks, bash script, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflow, and conditional branch, etc.

You could replace Maestro with Windmill here and it would be precisely correct. Their rollup is what we call the openflow state.

Main differences I see:

- Windmill is written in Rust instead of Java.

- Maestro relies on CockroachDB for state, whereas we use PostgreSQL for everything (state but also the queue). I can see why they would use CockroachDB; we had to roll out our own sharding algorithms to make Windmill horizontally scale on our very large-scale customer instances

- Maestro is Apache 2.0, whereas Windmill is AGPL, which is less friendly

- Maestro is backed by Netflix, so effectively infinite money; we are profitable but a much smaller company

- Maestro doesn't have extensive docs about self-hosting on k8s or docker-compose and either there is no UI to build stuff, or the UI is not yet well surfaced in their documentation

But overall, pretty cool stuff to open-source, will keep an eye on it and benchmark it asap

ensignavenger
1 replies
1h27m

Thanks for the great comparison! While Maestro is Apache licensed, if it depends on CockroachDB, Cockroach itself isn't even open source, so that isn't great. I would rather have an AGPL codebase than a non-open-source dependency. Of course, over time someone could add alternative DB support.

jamra
0 replies
18m

I really wonder why they didn’t choose something like RocksDB for more speed.

rwky
0 replies
5h7m

Been using windmill for a few months and so far it's rock solid keep it up!

antishatter
3 replies
14h53m

Anyone have a recommendation for a workflow orchestrator for single server deployments? Looking at running a project at home and for certain pieces think it would be easiest to orchestrate with a tool like Maestro or Airflow but they’re basically set up to run in clusters with admins to manage them.

ssfak
0 replies
11h54m

For Python tasks you can check Prefect, among others..

rwky
0 replies
5h44m

Windmill is pretty lightweight and easy to deploy. https://www.windmill.dev/ you can configure it to have a single worker on the same server as the ui and database.

katrotz
0 replies
11h31m

I'd recommend Kestra[1] since it can be run on a single node

[1] https://kestra.io/

Sparkyte
3 replies
22h23m

What's the difference between this and enqueuing work into a queue, then waiting for a job to pick it up at a scheduled time? Not saying build a Kafka cluster to serve this, but most cloud providers have queuing tools.

sjansen
0 replies
21h35m

Putting work in a queue is only the start. Most organizations start there and gradually write ad hoc logic as they discover problems like dependencies, retries, & scheduling.

Dependencies: what can be done in parallel and what must be done in sequence? For example, three tasks get pushed in the queue and only after all three finish a fourth task must be run.

Retries: The concept is simple. The details are killer. For example, if a task fails, how long should the delay between retries be? Too short and you create a retry storm. Forget to add some jitter and you get thundering herds all retrying at the same time.

Scheduling: Because cron is good enough, until it isn't.

A good workflow solution provides battle tested versions of all of the above. Better yet, a great workflow solution makes it easier to keep business logic separate from plumbing so that it's easier to reason about and test.
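The retry point above — exponential backoff plus jitter to avoid a thundering herd — can be sketched in a few lines. This is a generic "full jitter" sketch, not code from any workflow product; the function name and parameters are made up:

```python
import random
import time

def retry_with_jitter(fn, attempts=5, base=0.5, cap=30.0):
    """Exponential backoff with 'full jitter': sleep a random amount
    between 0 and min(cap, base * 2**attempt), so failing clients
    don't all retry in lockstep."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise               # retries exhausted, surface the error
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)

# A flaky task that succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry_with_jitter(flaky, base=0.01))  # → ok
```

A battle-tested engine also persists the retry schedule, so retries survive a process crash — which is exactly the kind of plumbing you end up rebuilding ad hoc on top of a bare queue.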

shawabawa3
0 replies
21h34m

workflows typically involve chains of jobs with state transitions, waits, triggers, error handling etc

a lot more than just e.g. celery jobs

skissane
2 replies
19h35m

I'm a bit confused about what is going on here: This project appears to use Netflix/conductor [0]. But you go to that repo, you see it has been archived, with a message saying it is replaced by Netflix's internal non-OSS version, and by unmentioned community forks – by which I assume they mean Orkes Conductor [1]. But this isn't using Orkes Conductor, it looks like it is using the discontinued Netflix version `com.netflix.conductor:conductor-core:2.31.5` [2] – and an outdated version of it too.

[0] https://github.com/Netflix/conductor

[1] https://github.com/conductor-oss/conductor

[2] https://github.com/Netflix/maestro/blob/e8bee3f1625d3f31d84d...

skissane
0 replies
18h45m

I haven't touched Conductor for a few years now, but back in 2020 I did some work trying to implement it, even submitted a few PRs – https://github.com/Netflix/conductor/pulls?q=is%3Apr+author%...

My impression of the code base, is I felt like it needed a lot of work to run in a non-Netflix environment. Which is part of why the project I was working on ended up abandoning Conductor – we were going to embed Conductor in our product as a workflow engine, we ended up building our own workflow engine from scratch instead. Another team did end up using it for some internal use cases, but scalability/reliability/etc are less of a concern for internal use cases as opposed to customer-facing ones.

And then Netflix abandons it – and then they open source something else which depends on an old version of it – well, I'm happy they open source anything, but it fits with my earlier impression – throwing stuff over the fence which can be a struggle to adopt in an outside environment. Still, throwing it over the fence is better than not releasing it at all.

iamsanteri
2 replies
22h52m

So will this serve as a stand-in replacement for something like Airflow?

then4p
0 replies
6h37m

I'm also missing comparisons to other existing tools like airflow, dagster, mlflow...

makestuff
0 replies
22h4m

Yeah, also curious if this is meant as a replacement for Airflow.

tiffanyh
1 replies
21h58m

Don't see many Java projects being posted on HN.

xyst
0 replies
21h55m

We only upvote Go or Rust projects here ;)

jekude
1 replies
21h55m

Seems like they re-engineered Temporal: https://temporal.io/

troebr
0 replies
21h40m

They did use Temporal at Netflix, they gave a couple presentations 2 years ago. I think this is very much not-Temporal because it relies on a DSL instead of workflow as code.

I don't know if it's a scale-thing, I'm not a workflow expert but this seems more in line with the map-reduce of yore, as in you get some big fat steps and you coordinate them, although you could have coarse-grained activities in Temporal workflows.

I'd be curious to see what the tradeoffs are between the two and if they still have usages for Temporal. Maybe Maestro is better for less technical people? Latency? Scale?

indiv0
1 replies
22h24m

Is this meaningfully different from Conductor (which they archived a while back)? Browsing through the code I see quite a few similarities. Plus the use of JSON as the workflow definition language.

halamadrid
1 replies
22h53m

Very nice, Netflix has a reputation of making great OSS products. I wonder where does this stand with Conductor.

willbeddow
0 replies
16h0m

I'm sure this is very nice, but the article reads as if written by AI. The first thing I'd want to see is an example workflow (both code and configuration) in a realistic use case. Instead, there's a lot of "powerful and flexible" language, but the example workflow doesn't come until halfway down, and then it's just foobar

skywhopper
0 replies
21h17m

Advice: don’t rely on any tool open-sourced by Netflix. They have a long history of dropping support for things after they’ve announced them. Someone got a checkmark on their promotion packet by getting this blog post and code sharing out the door, but don’t build your business on a solution like this.

saturn8601
0 replies
17h20m

Anyone here use Activebatch? To me it is the best software I wish had an equivalent for non enterprise users. I have tried and tried to use other "competitors" but Activebatch's simplicity of just attaching a simple MS SQL DB, installing the Windows GUI and execution agent is just click, click, click and now you have a robust GUI based automation environment where you don't have to use code...or if you want, go ahead and use code in any language if you want...but you don't have to.

Airflow may be robust but it is hidden behind a complexity fence that prevents most from seeing whatever its true capability may be. The same goes for other "open source" competitors.

Why can't someone just develop a robust DB backed GUI first system?

I have tried online services as well, they pale in comparison. I guess the cost of maintaining extensions is what kills simpler paid offerings?

Its a complete shame that ActiveBatch is walled off behind a stupid enterprise sales model. This has prevented this wonderful piece of software from being picked up by the wider community. Its like a hidden secret. :/

pantsforbirds
0 replies
22h50m

This is a really great-looking project. I know I've considered building (a probably worse) version of exactly this on almost every mixed ML + Data Engineering project I've ever worked on.

I'm looking forward to testing it out.

oneplane
0 replies
22h39m

Looks a bit like Argo Workflows combined with Argo Events. Makes sense to have so many projects and products converge around the same endstate.

nikhilsimha
0 replies
16h7m

great job on open sourcing!


mianos
0 replies
17h5m

Interesting how complete this is. It's almost as comprehensive as prefect.io

This is a critical software infrastructure I have been promoting for years yet almost everyone thinks they don't need it.

kabes
0 replies
12h5m

It says one of the big differentiators from 'traditional workflow orchestrators' is that it supports cyclic graphs. But BPMN (and the orchestrators using it) also supports loops.

febed
0 replies
11h29m

Dagster is a better alternative, because of its asset first philosophy. Task based workflows are still available if you really need it.

dboreham
0 replies
22h23m

Interesting. My team recently built a thing for managing long running, multi-machine, restartable, cascading batch jobs in an unrelated vehicle. Had no idea it was a category.

bjourne
0 replies
20h22m

What is a workflow in this context?

HugoLu88
0 replies
5h57m

I'm building something in the space (orchestra) so here's my take:

Folks making stuff open source and building in the open is obviously brilliant, but when it comes to "orchestrators" (as this is, and identifies as) there is already so much that has come before (Airflow and so on) that it's quite hard to see how this actually adds anything to the space other than another option nobody is ever going to use in a commercial setting.

Shameless plug: https://getorchestra.io