
How we migrated onto K8s in less than 12 months

solatic
91 replies
1d22h

I don't get the hate for Kubernetes in this thread. TFA is from Figma. You can talk all day long about how early startups just don't need the kind of management benefits that Kubernetes offers, but the article isn't written by someone working for a startup, it's written by a company that nearly got sold to Adobe for $20 billion.

Y'all really don't think a company like Figma stands to benefit from the flexibility that Kubernetes offers?

cwiggs
25 replies
1d20h

k8s is complex, if you don't need the following you probably shouldn't use it:

* Service discovery

* Auto bin packing

* Load Balancing

* Automated rollouts and rollbacks

* Horizontal scaling

* Probably more I forgot about

You also have secret and config management built in. If you use k8s you also have the added benefit of making it easier to move your workloads between clouds and bare metal. As long as you have a k8s cluster you can mostly move your app there.
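
As a rough sketch of what that built-in config/secret handling looks like (all the names here are made up, and in a real setup you'd populate the secret from a vault or sealed-secrets rather than committing it):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: myapp-config            # hypothetical name
    data:
      LOG_LEVEL: "info"
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: myapp-credentials       # hypothetical name
    type: Opaque
    stringData:
      DATABASE_URL: "postgres://user:pass@db:5432/app"
    # A Deployment's container spec can then pull both in as env vars:
    #   envFrom:
    #     - configMapRef:
    #         name: myapp-config
    #     - secretRef:
    #         name: myapp-credentials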

Problem is, most companies I've worked at in the past 10 years needed several of the features above, and they decided to roll their own solution with Ansible/Chef, Terraform, ASGs, Packer, custom scripts, custom apps, etc. Those solutions have always been worse than what k8s provides, and the result is a bespoke tool you can't hire for.

For what k8s provides, it isn't complex, and it's all documented very well, AND it's extensible so you can build your own apps on top of it.

I think there are more SWEs on HN than Infra/Platform/DevOps/buzzword engineers. As a result there are a lot of people who don't have much experience managing infra and think that spinning up their docker container on a VM is the same as putting an app in k8s. That's my opinion on why k8s gets so much hate on HN.

YZF
8 replies
1d18h

There are other out of the box features that are useful:

* Cert manager.

* External-dns.

* Monitoring stack (e.g. Grafana/Prometheus.)

* Overlay network.

* Integration with deployment tooling like ArgoCD or Spinnaker.

* Relatively easy to deploy anything that comes with a helm chart (your database or search engine or whatnot).

* Persistent volume/storage management (see the sketch after this list).

* High availability.
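
For the storage point, a claim is about as small as it gets - something like this (the name and the storage class are just placeholders; the class depends on which CSI driver or cloud the cluster runs on):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data           # hypothetical name
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard    # assumption: whatever class your cluster provides
      resources:
        requests:
          storage: 20Gi
    # A pod then mounts it via volumes -> persistentVolumeClaim.claimName: postgres-data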

It's also about using containers, which means there's a lot less to manage on the hosts.

I'm a fan of k8s. There's a learning curve but there's a huge ecosystem and I also find the docs to be good.

But if you don't need any of it - don't use it! It is targeting a certain scale and beyond.

kachapopopow
6 replies
1d17h

I started with kubernetes and have never looked back. Being able to bring up a network copy, deploy a clustered database, deploy a distributed fs all in 10 minutes (including the install of k3s or k8s) has been a game-changer for me.

You can run monolithic apps with no downtime restarts quite easily with k8s using rollout restart policy which is very useful when applications take minutes to start.
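
The no-downtime part is really just the Deployment's rolling-update strategy plus a readiness probe - roughly like this sketch (names, image, and probe path are all made up); `kubectl rollout restart deployment/myapp` then replaces pods one at a time and only shifts traffic once the probe passes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp                       # hypothetical
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: myapp
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0             # never kill an old pod before its replacement is ready
          maxSurge: 1
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: registry.example.com/myapp:latest   # hypothetical image
              readinessProbe:           # gates traffic until the slow startup finishes
                httpGet:
                  path: /healthz        # assumption: the app exposes a health endpoint
                  port: 8080
                initialDelaySeconds: 30
                periodSeconds: 10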

methodical
3 replies
1d6h

In the same vein here.

Every time I see one of these posts and the ensuing comments I always get a little bit of inverse imposter syndrome. All of these people saying "Unless you're at 10k users+ scale you don't need k8s". If you're running a personal project with a single-digit user count, then sure, but only on a pure cost-to-performance basis would I say k8s is unreasonable. Any scale larger, however, and I struggle to reconcile this position with the reality that anything with a consistent user base should have zero-downtime deployments, load balancing, etc. Maybe I'm just incredibly OOTL, but when did these features, which are simple to implement and essentially free from a cost standpoint, become optional? Perhaps I'm just misunderstanding the argument, and the argument is that you should use a Fly or Vercel-esque platform that provides some of these benefits without needing to configure k8s. Still, the problem with this mindset is that vendor lock-in is a lot harder to correct once a platform is in production and being used consistently without prolonged downtime.

Personally, I would do early builds with Fly and once I saw a consistent userbase I'd switch to k8s for scale, but this is purely due to the cost of a minimal k8s instance (especially on GKE or EKS). This, in essence, allows scaling from ~0 to ~1M+ with the only bottleneck being DB scaling (if you're using a single DB like CloudSQL).

Still, I wish I could reconcile my personal disconnect with the majority of people here who regard k8s as overly complicated and unnecessary. Are there really that many shops out there who consider the advantages of k8s beyond their needs, or are they just achieving the same result in a different manner?

One could certainly learn enough k8s in a weekend to deploy a simple cluster. Now I'm not recommending this for a company's production instance, due to the footguns if improperly configured, but the argument that k8s is too complicated to learn seems unfounded.

/rant

uaas
0 replies
20h12m

With the simplicity and cost of k3s and alternatives it can also make sense for personal projects from day one.

shakiXBT
0 replies
1d5h

I've been in your shoes for quite a long time. By now I've accepted that a lot of folks on HN and other similar forums simply don't know or care about the issues that Kubernetes resolves, or someone else in their company takes care of those for them.

AndrewKemendo
0 replies
1d

It’s actually much simpler than that

k8s makes it easier to build over-engineered architectures for applications that don’t need that level of complexity.

So while you are correct that it is not actually that difficult to learn and implement K8s, it’s also almost always completely unnecessary, even at the largest scale.

Given that you can do the largest-scale stuff without it and you should do most small-scale stuff without it, the number of people for whom all of the risks and costs balance out is much smaller than the amount of promotion and pushing it has received.

And given the fact that orchestration layers are a critical part of infrastructure, handing over or changing the data environment relationship in a multilayer computing environment to such an extent is a non-trivial one-way door.

BobbyJo
1 replies
1d16h

100%

I can bring up a service, connect it to a postgres/redis/minio instance, and do almost anything locally that I can do in the cloud. It's a massive help for iterating.

There is a learning curve, but you learn it and you can do so damn much so damn easily.

kachapopopow
0 replies
19h7m

+1 on the learning curve, took me 3 attempts (gave up twice) before I spent 1 day learning the docs, then spent a week moving some of my personal things to it.

Now I have a small personal cluster with machines and VPSes (in some regions I don't have enough deployments to justify an entire machine) with a distributed multi-site fs that's mostly as certified for workloads as any other cloud. CDN, GeoDNS, nameservers are all handled within the cluster. Any machine can go offline while connectivity remains the same, minus the roughly 5-minute timeout before downed pods get rescheduled for monolithic services.

Kubernetes also provides an amazing way to learn things like BGP, IPAM and many other things via Calico, MetalLB and whatever else you want to learn.

epgui
0 replies
1d18h

To this I would also add the ability to manage all of your infrastructure with k8s manifests (eg.: crossplane).

maccard
7 replies
1d19h

For anyone who thinks this is a laundry list - running two instances of your app with a database means you need almost all of the above.

The _minute_ you start running containers in the cloud you need to think of "what happens if it goes down/how do I update it/how does it find the database", and you need an orchestrator of some sort, IMO. A managed service (I prefer ECS personally as it's just stupidly simple) is the way to go here.

hnav
6 replies
1d19h

Eh, you can easily deploy containers to EC2/GCE and have an autoscaling group/MIG with healthchecks. That's what I'd be doing for a first pass or if I had a monolith (a lot of business is still deploying a big ball of PHP). K8s really comes into its own once you're running lots of heterogeneous stuff all built by different teams. Software reflects organizational structure so if you don't have a centralized infra team you likely don't want container orchestration since it's basically your own cloud.

cwiggs
2 replies
1d18h

Sure you can use an AWS ASG, but I assume you also tie that into an AWS ALB/NLB. Then you use ACM for certs and now you are locked in to AWS three times over.

Instead you can do those 3 and more in k8s, and it would be the same manifests regardless of which k8s cluster you deploy to: EKS, AKS, GKE, on prem, etc.
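
For example, a single Ingress like this covers the load balancing + TLS part, and it's identical whether the cluster is EKS, AKS, GKE or on-prem (hostname and issuer name are made up, and it assumes an ingress controller and cert-manager are installed in the cluster):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: myapp                                    # hypothetical
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt  # assumption: a ClusterIssuer with this name exists
    spec:
      ingressClassName: nginx                        # assumption: ingress-nginx is the installed controller
      tls:
        - hosts: [app.example.com]
          secretName: myapp-tls
      rules:
        - host: app.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: myapp
                    port:
                      number: 80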

Plus you don't get service discovery across VMs, you don't get a CSI so good luck if your app is stateful. How do you handle secrets, configs? How do you deploy everything, Ansible, Chef? The list goes on and on.

If your app is simple, sure, but I haven't seen a simple app in years.

maccard
1 replies
1d18h

I've never worked anywhere that has benefitted from avoiding lock-in. We would have saved thousands in dev-hours if we just used an ALB instead of tweaking nginx and/or caddy.

Also, if you can't convert an ALB into an Azure Load balancer, then you probably have no business doing any sort of software development.

hobobaggins
0 replies
19h4m

I don't disagree about avoiding lock-in, and I'm sure it was hyperbole, but if you really spent thousands of dev-hours (approx 1 year) on tweaking nginx, you needed different devs ;)

ALB costs get very steep very quickly too, but you're right - start with ALB and then migrate to nginx when costs get too high

maccard
1 replies
1d18h

By containers on EC2 you mean installing docker on AMI's? How do you deploy them?

I really do think Google Cloud Run/Azure Container Apps (and then in AWS-land ECS-on-fargate) is the right solution _especially_ in that case - you just shove a container on and tell it the resources you need and you're done.

gunapologist99
0 replies
1d18h

From https://stackoverflow.com/questions/24418815/how-do-i-instal... , here's an example that you can just paste into your load balancing LaunchConfig and never have to log into an instance at all (just add your own runcmd: section -- and, hey, it's even YAML like everyone loves)

  #cloud-config
  # Runs on first boot: adds Docker's official apt repository (pinned to
  # Docker's GPG key fingerprint) and installs the Docker engine and CLI.

  apt:
    sources:
      docker.list:
        source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
        keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

  packages:
    - docker-ce
    - docker-ce-cli

MrDarcy
0 replies
1d15h

I did this. It’s not easier than k8s, GKE, EKS, etc…. It’s harder cause you have to roll it yourself.

If you do this just use GKE autopilot. It’s cheaper and done for you.

Osiris
2 replies
1d20h

Those all seem important to even moderately sized products.

worldsayshi
0 replies
1d19h

As long as your requirements are simple the config doesn't need to be complex either. Not much more than docker-compose.

But once you start using k8s you probably tend to scope creep and find a lot of shiny things to add to your setup.

doctorpangloss
0 replies
1d18h

Some ways to tell if someone is a great developer are easy. JetBrains IDE? Ample storage space? Solving problems with the CLI? Consistently formatted code using the language's packaging ecosystem? No comments that look like this:

    # A verbose comment that starts capitalized, followed by a single line of code, cuing you that it was written by a ChatBot.
Some ways to tell if someone is a great developer are hard. You can't tell if someone is a brilliant shipper of features, choosing exactly the right concerns to worry about at the moment, like doing more website authoring and less devops, with a grand plan for how to make everything cohere later; or, if the guy just doesn't know what the fuck he is doing.

Kubernetes adoption is one of those, hard ones. It isn't a strong, bright signal like using PEP 8 and having a `pyproject.toml` with dependencies declared. So it may be obvious to you, "People adopt Kubernetes over ad-hoc decoupled solutions like Terraform because it has, in a Darwinian way, found the smallest set of easily surmountable concerns that should apply to most good applications." But most people just see, "Ahh! Why can't I just write the method bodies for Python function signatures someone else wrote for me, just like they did in CS50!!!"

st3fan
1 replies
1d5h

If you don't need any of those things then your use of k8s just becomes simpler.

I find k8s an extremely nice platform to deploy simple things in that don't need any of the advanced features. All you do is package your programs as containers and write a minimal manifest and there you go. You need to learn a few new things, but the list of things you no longer have to worry about is a really great return.

Nomad is a good contender in that space but I think HashiCorp is letting it slowly become EOL and there are basically zero Nomad-as-a-Service providers.

hylaride
0 replies
1d4h

If you don't need any of those things, going for a "serverless" option like fargate or whatever other cloud equivalents exist is a far better value prop. Then you never have to worry about k8s support or upgrades (of course, ECS/fargate is shit in its own ways, in particular the deployments being tied to new task definitions...).

tbrownaw
0 replies
1d14h

k8s is complex, if you don't need the following you probably shouldn't use it:

I use it (specifically, the canned k3s distro) for running a handful of single-instance things like for example plex on my utility server.

Containers are a very nice UX for isolating apps from the host system, and k8s is a very nice UX for running things made out of containers. Sure it's designed for complex distributed apps with lots of separate pieces, but it still handles the degenerate case (single instance of a single container) just fine.

gunapologist99
0 replies
1d18h

It's worth bearing in mind that, although any of these can be accomplished with any number of other products as you point out, LB and Horizontal Scaling, in particular, have been solved problems for more than 25 years (or longer depending on how you count)

For example, even servers (aka instances/vms/vps) with load balancers (aka fabric/mesh/istio/traefik/caddy/nginx/ha proxy/ATS/ALB/ELB/oh just shoot me) in front existed for apps that are LARGER than can fit on a single server (virtually the definition of horizontally scalable). These apps are typically monoliths or perhaps app tiers that have fallen out of style (like the traditional n-tier architecture of app server-cache-database, swap out whatever layers you like).

However, K8s is actually more about microservices. Each microservice can act like a tiny app on its own, but they are often inter-dependent and, especially at the beginning, it's often seen as not cost-effective to dedicate their own servers to them (along with the associated load balancing, redundancy, cross-AZ setup, etc). And you might not even know what the scaling pain points for an app are, so this gives you a way to easily scale up without dedicating slightly expensive instances or support staff to running each cluster; your scale point is the entire k8s cluster itself.

Even though that is ALL true, it's also true that k8s' sweet spot is actually pretty narrow, and many apps and teams probably won't benefit from it that much (or not at all and it actually ends up being a net negative, and that's not even talking about the much lower security isolation between containers compared to instances; yes, of course, k8s can schedule/orchestrate VMs as well, but no one really does that, unfortunately.)

But, it's always good resume fodder, and it's about the closest thing to a standard in the industry right now, since everyone has convinced themselves that the standard multi-AZ configuration of 2014 is just too expensive or complex to run compared to k8s, or something like that.

drdaeman
0 replies
1d18h

For what k8s provides, it isn't complex, and it's all documented very well

I had a different experience. Some years ago I wanted to set up a toy K8s cluster over an IPv6-only network. It was a total mess - the documentation did not cover this case (at least I did not find it back then) and there was a lot of code to dig through to learn that it was not really supported back then, as some code was hardcoded with AF_INET assumptions (I think it's all fixed nowadays). And maybe it's just me, but I really had a much easier time navigating the Linux kernel source than digging through the K8s and CNI codebases.

This, together with a few very trivial crashes of "normal" non-toy clusters that I've seen (like two nodes suddenly failing to talk to each other, typically for simple textbook reasons like conntrack issues), resulted in an opinion "if something about this breaks, I have very limited ideas what to do, and it's a huge behemoth to learn". So I believe that simple things beat complex contraptions (assuming a simple system can do all you want it to do, of course!) in the long run because of the maintenance costs. Yeah, deploying K8s and running payloads is easy. Long-term maintenance - I'm not convinced that it can be easy, for a system of that scale.

I mean, I try to steer away from K8s until I find a use case for it, but I've heard that when K8s fails, a lot of people just tend to deploy a replacement and migrate all payloads there, because it's easier to do so than troubleshoot. (Could be just my bubble, of course.)

BobbyJo
25 replies
1d21h

Kubernetes isn't even that complicated, and first party support from cloud providers often means you're doing something in K8s in lieu of doing it in a cloud-specific way (like ingress vs cloud-specific load balancer setups).

At a certain scale, K8s is the simple option.

I think much of the hate on HN comes from the "ruby on rails is all you need" crowd.

eutropia
12 replies
1d18h

I guess the ones who quietly ship dozens of rails apps on k8s are too busy getting shit done to stop and share their boring opinions about pragmatically choosing the right tool for the job :)

BobbyJo
11 replies
1d16h

"But you can run your rails app on a single host with embedded SQLite, K8s is unnecessary."

threeseed
5 replies
1d5h

Always said by people who haven't spent much time in the cloud.

Because single hosts will always go down. Just a question of when.

BossingAround
4 replies
1d5h

I love k8s, but bringing back up a single app that crashed is a very different problem from "our k8s is down" - because if you think your k8s won't go down, you're in for a surprise.

You can also view a single k8s cluster as a single host, which will go down at some point (e.g. a botched upgrade, cloud network partition, or something similar). While much less frequent, it's also much more difficult to get out of.

Of course, if you have a multi-cloud setup with automatic (and periodically tested!) app migration across clouds, well then... Perhaps that's the answer nowadays.. :)

solatic
3 replies
1d4h

if you think your k8s won't go down, you're in for a surprise

Kubernetes is a remarkably reliable piece of software. I've administered (large X) number of clusters that often had several years of cluster lifetime, each, everything being upgraded through the relatively frequent Kubernetes release lifecycle. We definitely needed some maintenance windows sometimes, but well, no, Kubernetes didn't unexpectedly crash on us. Maybe I just got lucky, who knows. The closest we ever got was the underlying etcd cluster having heartbeat timeouts due to insufficient hardware, and etcd healed itself when the nodes were reprovisioned.

There's definitely a whole lotta stuff in the Kubernetes ecosystem that isn't nearly as reliable, but that has to be differentiated from Kubernetes itself (and the internal etcd dependency).

You can view a single k8s also as a single host, which will go down at some point (e.g. a botched upgrade, cloud network partition, or something similar)

The managed Kubernetes services solve the whole "botched upgrade" concern. etcd is designed to tolerate cloud network partitions and recover.

Comparing this to sudden hardware loss on a single-VM app is, quite frankly, insane.

cyberpunk
1 replies
1d1h

Even if your entire control plane disappears your nodes will keep running and likely for enough time to build an entirely new cluster to flip over to.

I don’t get it either. It’s not hard at all.

BossingAround
0 replies
23h13m

Your nodes & containers keep running, but is your networking up when your control plane is down?

__turbobrew__
0 replies
1d

If you start using more esoteric features the reliability of k8s goes down. Guess what happens when you enable the in place vertical pod scaling feature gate?

It restarts every single container in the cluster at the same time: https://github.com/kubernetes/kubernetes/issues/122028

We have also found data races in the statefulset controller which only occur when you have thousands of statefulsets.

Overall, if you stay on the beaten path k8s reliability is good.

kayodelycaon
3 replies
1d4h

I've been working with rails since 1.2 and I've never seen anyone actually do this. Every meaningful deployment I've seen uses postgres or mysql. (Or god forbid mongodb.) It takes very little time with yours sol statements

You can run rails on a single host using a database on the same server. I've done it and it works just fine as long as you tune things correctly.

a_bored_husky
2 replies
1d2h

as long as you tune things correctly

Can you elaborate?

kayodelycaon
1 replies
1d1h

I don't remember the exact details because it was a long time ago, but what I do remember is

- Limiting memory usage and number of connections for mysql

- Tracking maximum memory size of rails application servers so you didn't run out of memory by running too many of them

- Avoid writing unnecessarily memory intensive code (This is pretty easy in ruby if you know what you're doing)

- Avoiding using gems unless they were worth the memory use

- Configuring the frontend webserver to start dropping connections before it ran out of memory (I'm pretty sure that was just a guess)

- Using the frontend webserver to handle traffic whenever possible (mostly redirects)

- Using IP tables to block traffic before hitting the webserver

- Periodically checking memory use and turning off unnecessary services and cronjobs

I had the entire application running on a 512mb VPS with roughly 70mb to spare. It was a little less spare than I wanted but it worked.

Most of this was just rate limiting with extra steps. At the time rails couldn't use threads, so there was a hard limit on the number of concurrent tasks.

When the site went down it was due to rate limiting and not the server locking up. It was possible to ssh in and make firewall adjustments instead of a forced restart.

a_bored_husky
0 replies
1d1h

Thank you.

ffsm8
0 replies
1d11h

And there is truth to that. Most deployments are at that level, and it absolutely is way more performant than the alternative. It just comes with several tradeoffs... But these tradeoffs are usually worth it for deployments with <10k concurrent users. Which Figma certainly isn't.

Though you probably could still do it, but that's likely more trouble than it's worth

(The 10k is just an arbitrary number I made up, there is no magic number which makes this approach unviable, it all depends on how the users interact with the platform/how often and where the data is inserted)

JohnMakin
4 replies
1d19h

I think much of the hate on HN comes from the "ruby on rails is all you need" crowd.

Maybe - people seem really gung-ho about serverless solutions here too

dexwiz
3 replies
1d18h

The hype for serverless cooled after that article about Prime Video dropping lambda. No one wants a product that a company won’t dogfood. I realize Amazon probably uses lambda elsewhere, but it was still a bad look.

cmckn
0 replies
1d18h

Amazon probably uses lambda elsewhere

Yes, you could say that. :)

LordKeren
0 replies
1d18h

I think it was much more about one specific use case of lambda that was a bad fit for the prime video team’s need and not a rejection of lambda/serverless. TBH, it kind of reflected more poorly on the team than lambda as a product

JohnMakin
0 replies
1d17h

not probably, their lambda service powers much of their control plane.

MrDarcy
3 replies
1d15h

Kubernetes isn't even that complicated

I’ve been struggling to square this sentiment as well. I spend all day in AWS and k8s and k8s is at least an order of magnitude simpler than AWS.

What are all the people who think operating k8s is too complicated operating on? Surely not AWS…

tbrownaw
0 replies
1d14h

The thing you already know tends to be less complicated than the thing you don't know.

rco8786
0 replies
1d7h

I think "k8s is complicated" and "AWS is even more complicated" can both be true.

Doing anything in AWS is like pulling teeth.

brainzap
0 replies
1d3h

The sum is complex, especially with the custom operators.

dorianmariefr
0 replies
1d6h

ruby on rails is all you need

chrischen
0 replies
1d12h

There are also a lot of cog-in-the-machine engineers here that totally do not get the bigger picture or the vantage point from another department.

bamboozled
0 replies
1d17h

Agreed, we're a small team and we benefit greatly from managed k8s (EKS). I have to say the whole ecosystem just continues to improve as far as I can tell and the developer satisfaction is really high with it.

Personally I think k8s is where it's at now. The innovation and open source contributions are immense.

I'm glad we made the switch. I understand the frustrations of the past, but I think it was much harder to use 4+ years ago. Now, I don't see how anyone could mess it up so hard.

logifail
14 replies
1d21h

it's written by a company that nearly got sold to Adobe for $20 billion

(Apologies if this is a dumb question) but isn't Figma big enough to want to do any of their stuff on their own hardware yet? Why would they still be paying AWS rates?

Or is it the case that a high-profile blog post about K8S and being provider-agnostic gets you sufficient discount on your AWS bill to still be value-for-money?

ijidak
2 replies
1d18h

It's a fair question.

Data centers are wildly expensive to operate if you want proper security, redundancy, reliability, recoverability, bandwidth, scale elasticity, etc.

And when I say security, I'm not just talking about software level security, but literal armed guards are needed at the scale of a company like Figma.

Bandwidth at that scale means literally negotiating to buy up enough direct fiber and verifying the routes that fiber takes between data centers.

At one of the companies I worked at, it was not uncommon to lose data center connectivity because a farmer's tractor cut a major fiber line we relied on.

Scalability might include tracking square footage available for new racks in physical buildings.

As long as your company is profitable, at anything but Facebook like scale, it may not be worth the trouble to try to run your own data center.

Even if the cloud doesn't save money, it saves mental energy and focus.

shrubble
0 replies
1d3h

This is a 20-years-ago take. If your datacenter provider doesn't have multiple fiber entry into the building with multiple carriers, you chose the wrong provider at this point.

mcpherrinm
0 replies
1d17h

There’s a ton of middle ground between a fully managed cloud like AWS and building your own hyperscaler datacenter like Facebook.

Renting a few hundred cabinets from Equinix or Digital Realty is going to potentially be hugely cheaper than AWS, but you probably need a team of people to run it. That can be worthwhile if your growth is predictable and especially if your AWS bandwidth bill is expensive.

But then you’re building on bare metal. Gotta deploy your own databases, maybe kubernetes for running workloads, or something like VMware for VMs. And you don’t get any managed cloud services, so that’s another dozen employees you might need.

hyperbolablabla
2 replies
1d20h

I work for a company making ~$9B in annual revenue and we use AWS for everything. I think a big aspect of that is just developer buy-in, as well as reliability guarantees, and being able to blame Amazon when things do go down

st3fan
1 replies
1d5h

Also, you don't have to worry about half of your stack? The shared responsibility model really works.

consteval
0 replies
1d3h

No, you still do. You just replace those sysadmins with AWS DevOps people. But ultimately your concerns haven't gone down, they've changed. It's true you don't have to worry about hardware. But, then again, you can use colo datacenters or even a VPS.

jeffbee
1 replies
1d20h

There are a lot of ex-Dropbox people at Figma who might have learned firsthand that bringing your stuff on-prem under a theory of saving money is an intensely stupid idea.

logifail
0 replies
1d20h

There are a lot of ex-Dropbox people at Figma who might have learned firsthand that bringing your stuff on-prem under a theory of saving money is an intensely stupid idea

Well, that's one hypothesis.

Another is that "Every maturing company with predictable products must be exploring ways to move workloads out of the cloud. AWS took your margin and isn't giving it back." ( https://news.ycombinator.com/item?id=35235775 )

tayo42
0 replies
1d19h

There must be a prohibitively expensive upfront cost to buy enough servers to do this. Plus bringing in all the skill that doesn't exist that can stand up and run something like they would require.

I wonder if, as time goes on, that skill to use hardware is disappearing. New engineers don't learn it, and the ones who did slowly forget. I'm not that sharp on anything I haven't done in years, even if it's in a related domain.

sangnoir
0 replies
1d18h

Why would they still be paying AWS rates?

They are almost certainly not paying sticker prices. Above a certain size, companies tend to have bespoke prices and SLAs that are negotiated in confidence.

ozim
0 replies
1d20h

They are preparing for next blog post in a year - „how we cut costs by xx% by moving to our own servers”.

manquer
0 replies
1d18h

A valuation is just a headline number which has no operational bearing.

Their ARR in 2022 was around $400M-450M. Say the infra budget at a typical 10% would be $50M. While it is a lot of money, it is not build-your-own-hardware money, and not all of it would be compute budget either. They would also be spending on other SaaS apps like Snowflake, and on special workloads like those needing GPUs, so not all workloads would be in-house ready. I would be surprised if their commodity compute/k8s is more than half their overall budget.

It is a lot more likely to slow product growth to focus on this now, especially since they were/are still growing rapidly.

Larger SaaS companies than them in ARR still find using cloud exclusively is more productive/efficient.

j_kao
0 replies
1d18h

Companies like Netflix with bigger market caps are still on AWS.

I can imagine the productivity of spinning up elastic cloud resources vs fixed data center resourcing being more important, especially considering how frequently a company like Figma ships new features.

NomDePlum
0 replies
1d20h

Much bigger companies use AWS for very practical well thought out reasons.

Not managing procurement of hardware, upgrades, etc., having a defined standard operating model with accessible documentation, being able to hire people with experience, and needing to hire fewer people because you are doing less - all of that is enough to build a viable and demonstrable business case.

Scale beyond a certain point is hard without support and delegated responsibility.

osigurdson
12 replies
1d12h

Kubernetes is the most amazing piece of software engineering that I have ever seen. Most of the hate is merely being directed at the learning curve.

otabdeveloper4
9 replies
1d7h

No, k8s is shit.

It's only useful for the degenerate "run lots of instances of webapp servers running slow interpreted languages" use case.

Trying to do anything else in it is madness.

And for the "webapp servers" use case they could have built something a thousand times simpler and more robust. Serving templated html ain't rocket science. (At least compared to e.g. running an OLAP database cluster.)

shakiXBT
5 replies
1d3h

Could you please bless us with another way to easily orchestrate thousands of containers in a cloud vendor agnostic fashion? Thanks!

Oh, and just in case your first rebuttal is "having thousands of containers means you've already failed" - not everyone works in a mom n pop shop

otabdeveloper4
3 replies
1d2h

Read my post again.

Just because k8s is the only game in town doesn't mean it is technically any good.

As a technology it is a total shitshow.

Luckily, the problem it solves ("orchestrating" slow webapp containers) is not a problem most professionals care about.

Feature creep of k8s into domains it is utterly unsuitable for because devops wants a pay raise is a different issue.

osigurdson
1 replies
23h56m

> As a technology it is a total shitshow.

What aspects are you referring to?

> is not a problem most professionals care about

professional as in True Scotsman?

otabdeveloper4
0 replies
20h9m

professional as in True Scotsman?

No, I mean that Kubernetes solves a super narrow and specific problem that most developers do not need to solve.

shakiXBT
0 replies
1d2h

Orchestrating containers is not a problem most professionals care about

I truly wish you were right, but maybe it's good job security for us professionals!

candiddevmike
0 replies
1d2h

Oh, and just in case your first rebuttal is "having thousands of containers means you've already failed" - not everyone works in a mom n pop shop

The majority of folks, whether or not they admit it, probably do...

otabdeveloper4
1 replies
20h6m

Yeah, they basically spent a shitload of effort developing their own cluster management platform that turns off all the Kubernetes functionality in Kubernetes.

Must be some artifact of hosting on Azure, because I can't imagine any other reason to do something this contorted.

osigurdson
0 replies
17h41m

How much hands on time do you personally have with Kubernetes?

spmurrayzzz
0 replies
1d4h

I agree with respect to admiring it from afar. I've gone through large chunks of the source many times and always have an appreciation for what it does and how it accomplishes it. It has a great, supportive community around it as well (if not a tiny bit proselytizing at times, which doesn't bother me really).

With all that said, while I have no "hate" for the stack, I still have no plans to migrate our container infrastructure to it now or in the foreseeable future. I say that precisely because I've seen the source, not in spite of it. The net ROI on subsuming that level of complexity for most application ecosystems just doesn't strike me as obvious.

hylaride
0 replies
1d4h

Not to be rude, but K8s has had some very glaring issues, especially early on when the hype was at max.

* Its secrets management was terrible, and for a while it stored them in plaintext in etcd.

* The learning curve was real and that's dangerous as there were no "best practice" guides or lessons learned. There are lots of horror stories of upgrades gone wrong, bugs, etc. Complexity leaves a greater chance of misconfiguration, which can cause security or stability problems.

* It was often redundant. If you're in the cloud, you already had load balancers, service discovery, etc.

* Upgrades were dangerous and painful in its early days.

* It initially had glaring third party tooling integration issues, which made monitoring or package management harder (and led to third party apps like Helm, etc).

A lot of these have been rectified, but a lot of us have been burned by the promise of a tool that google said was used internally, which was a bit of a lie as kubernetes was a rewrite of Borg.

Kubernetes is powerful, but you can do powerful in simple(r) ways, too. If it was truly "the most amazing" it would have been designed to be simple by default with as much complexity needed as everybody's deployments. It wasn't.

smitelli
2 replies
1d5h

I can only speak for myself, and some of the reasons why K8s has left a bad taste in my mouth:

- It can be complex depending on the third-party controllers and operators in use. If you're not anticipating how they're going to make your resources behave differently than the documentation examples suggest they will, it can be exhausting to trace down what's making them act that way.

- The cluster owners encounter forced software updates that seem to come at the most inopportune times. Yes, staying fresh and new is important, but we have other actual business goals we have to achieve at the same time and -- especially with the current cost-cutting climate -- care and feeding of K8s is never an organizational priority.

- A bunch of the controllers we relied on felt like alpha grade toy software. We went into each control plane update (see previous point) expecting some damn thing to break and require more time investment to get the cluster simply working like it was before.

- While we (cluster owners) begrudgingly updated, software teams that used the cluster absolutely did not. Countless support requests for broken deployments, which were all resolved by hand-holding the team through a Helm chart update that we advised them they'd need to do months earlier.

- It's really not cheaper than e.g. ECS, again, in my experience.

- Maybe this has/will change with time, but I really didn't see the "onboarding talent is easier because they already know it." They absolutely did not. If you're coming from a shop that used Istio/Argo and move to a Linkerd/Flux shop, congratulations, now there's a bunch to unlearn and relearn.

- K8s is the first environment where I palpably felt like we as an industry reached a point where there were so many layers and layers on top of abstractions of abstractions that it became un-debuggable in practice. This is points #1-3 coming together to manifest as weird latency spikes, scaling runaways, and oncall runbooks that were tantamount to "turn it off and back on."

Were some of these problems organizational? Almost certainly. But K8s had always been sold as this miracle technology that would relieve so many pain points that we would be better off than we had been. In my experience, it did not do that.

ahoka
1 replies
1d4h

What would be the alternative?

smitelli
0 replies
6h26m

Truthfully, I don't know. But I suspect I'm not the only one who feels a kind of debilitating ennui about the way things have gone and how they continue to go.

saaspirant
2 replies
1d14h

Unrelated: What does _TFA_ mean here? Google and GPT didn't help (even with context)

solatic
1 replies
1d13h

The Featured Article.

(or, if you read it in a frustrated voice, The F**ing Article.)

Xixi
0 replies
1d13h

Related acronyms: RTFA (Read The F**ing Article) and RTFM (Read The F**ing Manual). The latter was a very common answer when struggling with Linux in the early 2000s...

globular-toast
2 replies
1d12h

I don't get the hate even if you are a small company. K8s has massively simplified our deployments. It used to be that each app had its own completely different deployment process. Could have been a shell script that SSHed to some VM. Who managed said VM? Did it do its own TLS termination? Fuck knows. Maybe they used Ansible. Great, but that's another tool to learn, and do I really need to set up bare metal hosts from scratch for every service? No, so there's probably some other Ansible config somewhere that sets them up. And the secrets are stored where? Etc etc.

People who say "you don't need k8s" never say what you do need. K8s gives us a uniform interface that works for everything. We just have a few YAML files for each app and it just works. We can just chuck new things on there and don't even have to think about networking. Just add a Service and it's magically available with a name to everything in the cluster. I know how to do this stuff from scratch and I do not want to be doing it every single time.
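
Concretely, the "available with a name" bit is just this much YAML (names made up); anything else in the cluster can then reach it as `myapp` from the same namespace, or `myapp.default.svc.cluster.local` from anywhere:

    apiVersion: v1
    kind: Service
    metadata:
      name: myapp              # hypothetical
      namespace: default
    spec:
      selector:
        app: myapp             # matches the labels on the pods behind it
      ports:
        - port: 80             # what other workloads connect to: myapp:80
          targetPort: 8080     # the container's actual listening port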

shakiXBT
1 replies
1d3h

if you don't need High Availability you can even deploy to a single-node k3s cluster. It's still miles better than having to set up systemd services, an Apache/NGINX proxy, etc. etc.

globular-toast
0 replies
1d2h

Yep, and you can get far with k3s's "fake" load balancer (ServiceLB). Then when you need a more "real" cluster, basically all the concepts are the same; you just move to a new cluster.

tracerbulletx
0 replies
1d18h

Even on a small project it's actually better imo than tying everything to a platform like Netlify or Vercel. I have this little notepad app that I deploy to a two-node cluster in a GitHub Action and it's an excellent workflow. The k8s config to get everything deployed and TLS provisioned on every commit is like 150 lines of mostly boilerplate YAML, and I could pretty easily make it support branch previews or whatever too. https://github.com/SteveCastle/modelpad
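
The deploy step itself can stay tiny. A minimal version of that kind of workflow looks something like this (the secret name and manifest path are placeholders, not the exact setup in the repo above; it assumes the cluster's kubeconfig is stored base64-encoded in a repo secret):

    # .github/workflows/deploy.yml
    name: deploy
    on:
      push:
        branches: [main]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Configure cluster access
            run: |
              mkdir -p ~/.kube
              echo "${{ secrets.KUBECONFIG_DATA }}" | base64 -d > ~/.kube/config
          - name: Apply manifests
            # kubectl is preinstalled on GitHub-hosted Ubuntu runners
            run: kubectl apply -f k8s/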

roydivision
0 replies
1d9h

One thing I learned when I started learning Kubernetes is that it is two disciplines that overlap, but are distinct none the less:

- Platform build and management

- App build and management

Getting a stable K8s cluster up and running is quite different from building and running apps on it. Obviously there is overlap in the knowledge required, but there is a world of difference between using a cloud-based cluster and running your own home-made one.

We are a very small team and opted for cloud managed clusters, which really freed me up to concentrate on how to build and manage applications running on it.

xyst
83 replies
1d19h

Of course there’s no mention of performance loss or gain after migration.

I remember when microservices architecture was the latest hot trend that came off the presses. Small and big firms were racing to redesign/reimplement apps. But most forgot they weren’t Google/Netflix/Facebook.

I remember end user experience ended up being _worse_ after the implementation. There was a saturation point where a single micro service called by all of the other micro services would cause complete system meltdown. There was also the case of an “accidental” dependency loop (S1 -> S2 -> S3 -> S1). Company didn’t have an easy way to trace logs across different services (way before distributed tracing was a thing). Turns out only a specific condition would trigger the dependency loop (maybe, 1 in 100 requests?).

Good times. Also, job safety.

pram
27 replies
1d18h

The best part is when it’s all architected to depend on something that becomes essentially a single point of failure, like Kafka.

lmm
19 replies
1d13h

At least Kafka can be properly master-master HA. How people ever got away with building massively redundant fault-tolerant applications that were completely dependent on a single SQL server, I'll never understand.

lelanthran
13 replies
1d11h

How people ever got away with building massively redundant fault-tolerant applications that were completely dependent on a single SQL server, I'll never understand.

It works, with a lower cognitive burden than that of horizontally scaling.

For the loading concern (i.e. is this enough to handle the load):

For most businesses, being able to serve 20k concurrent requests is way more than they need anyway: an internal app used by 500k users typically has fewer than 20k concurrent requests in flight at peak.

A cheap VPS running PostgreSQL can easily handle that.[1]

For the "if something breaks" concern:

Each "fault-tolerance" criteria added adds some cost. At some point the cost of being resistant to errors exceeds the cost of downtime. The mechanisms to reduce downtime when the single large SQL server shits the bed (failovers, RO followers, whatever) can reduce that downtime to mere minutes.

What is the benefit to removing 3 minutes of downtime? $100? $1k? $100k? $1m? The business will have to decide what those 3 minutes are worth, and if that worth exceeds the cost of using something other than a single large SQL server.

Until and unless you reach the load and downtime-cost of Google, Amazon, Twitter, FB, Netflix, etc, you're simply prematurely optimising for a scenario that, even in the businesses best-case projections, might never exist.

The best thing to do, TBH, is ask the business for their best-case projections and build to handle 90% of that.

[1] An expensive VPS running PostgreSQL can handle a lot more than you think.

zelphirkalt
10 replies
1d9h

The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.

lelanthran
3 replies
1d9h

The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.

That's still a business decision.

Customers don't vote with their feet based on what tech stack the business chose, they vote based on a range of other factors, few if any of which are related to 3m of downtime.

There are little to no services I know off that would lose customers over 3m of downtime per week.

IOW, 3m of downtime is mostly an imaginary problem.

zelphirkalt
2 replies
1d7h

That's really too broad a generalization.

Services that people might leave because of downtime are, for example, a git hoster or a password manager. When people cannot push their commits and this happens multiple times, they may leave for another git hoster. I have seen this very example when GitLab was less stable and often unreachable for a few minutes. When people need some credentials but cannot reach their online password manager, they cannot work. They cannot trust that service to be available in critical moments. Not being able to access your credentials leaves a very bad impression. Some will look for more reliable ways of storing their credentials.

skydhash
0 replies
1d2h

Why does a password manager need to be online? I understand the need for synchronization, but being exclusively online is a very bad decision. And git synchronization is basically ssh, and if you mess that up on a regular basis, you have no business being in business in the first place. These are examples, but there are a few things that do not need to be online unless your computer is a thin client or you don't trust it at all.

Bjartr
0 replies
1d6h

The user experience of "often unreachable" means way more than 3m per week in practice.

scott_w
2 replies
1d4h

> What is the benefit to removing 3 minutes of downtime?

The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.

What do you think the business is doing when it evaluates what 3 minutes are worth?

zelphirkalt
1 replies
1d4h

There is no "the business". Businesses do all kinds of f'ed up things and lie to themselves all the time as well.

I don't understand, what people are arguing about here. Are we really arguing about customers making their own choice? Since that is all I stated. The business can jump up and down all it wants, if the customers decide to leave. Is that not very clear?

lelanthran
0 replies
1d2h

The business can jump up and down all it wants, if the customers decide to leave.

I think the point is that, for a few minutes of downtime, businesses lose so few customers that it's not worth avoiding that downtime.

Just now, we had a 5m period where Disney+ stopped responding. We aren't going to cut off our toddler from Peppa Pig and Bluey for 5m of downtime per day, never mind per week.

You appeared to be under the impression that 3m downtime/week is enough to make people leave. This is simply not true, especially for internet services where the users are conditioned to simply wait.

consteval
2 replies
1d3h

True, but what people should understand about databases is they're incredibly mature software. They don't fail, they just don't. It's not like the software we're used to using where "whoopsie! Something broke!" is common.

I've never, in my life, seen an error in SQL Server related to SQL Server. It's always been me, the app code developer.

Now, to be fair, the server itself or the hardware CAN fail. But having active/passive database configurations is simple, tried and tested.

skydhash
1 replies
1d2h

And the server itself can be very resilient if you run something like Debian or FreeBSD. Even on Arch, I've seen things fail rarely unless it's fringe/proprietary code (bluetooth, nvidia, the browser and 3D-accelerated graphics, ...). That presumes you use boring tech that is heavily tested by people around the world, not something "new" and "hyped" which is still on 0.x

consteval
0 replies
1d2h

I agree 100%. Unfortunately my company is pretty tied to windows and windows server, which is a pain. Upgrading and sysadmin-type work is still very manual and there's a lot of room for human error.

I wish we would use something like Debian and take advantage of tech like systemd. But alas, we're still using COM and Windows Services and we still need to remote desktop in and click around on random GUIs to get stuff to work.

Luckily, SQL Server itself is very stable and reliable. But even SQL Server runs on Linux.

brazzy
0 replies
1d10h

Each "fault-tolerance" criteria added adds some cost. At some point the cost of being resistant to errors exceeds the cost of downtime.

Not to forget: those costs are not just in money and time, but also in complexity. And added complexity comes with its own downtime risks. It's not that uncommon for systems to go down due to problems with mechanisms or components that would not exist in a simpler, "not fault tolerant" system.

ClumsyPilot
0 replies
1d3h

For most businesses, being able to serve 20k concurrent requests is way more than they need anyway: an internal app used by 500k

This is a very simple distinction and I am not sure why it is not understood

For some reason people design public apps the same as internal apps

The largest companies employ circa 1 million people - that’s Walmart, Amazon, etc. Most giants, like Shell, have ~100k tops. That can be handled by 1 beefy server.

Successful consumer-facing apps have hundreds of millions to billions of users. That’s 3 orders of magnitude of difference.

I have seen a company with 5k employees invest in a mega-scalable microservice event-driven architecture, and I was thinking: I hope they realise what they are doing and it’s just CV-driven development

slt2021
1 replies
1d11h

stack overflow has no problem serving entire planet from just four SQL Servers (https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...)

There is really nothing wrong with a large, vertically scaled-up SQL server. You need to be either at really, really large scale - or really, really UNSKILLED in SQL, keeping your relational model and working set so bad that you reach its limits.

HideousKojima
0 replies
1d3h

or really really UNSKILLED in sql as to keep your relational model and working set in SQL so bad that you reach its limits

Sadly that's the case at my current job. Zero thought put into table design, zero effort into even formatting our stored procedures in a remotely readable way, zero attempts to cache data on the application side even when it's glaringly obvious. We actually brought in a consultant to diagnose our SQL Server performance issues (I'm sure we paid a small fortune for that) and the DB team and all of the other higher ups capable of actually enforcing change rejected every last one of his suggestions.

trog
0 replies
1d12h

I think because it works pretty close to 100 percent of the time with only the most basic maintenance and care (like making sure you don't run out of disk space and keeping up with security updates). You can go amazingly far with this, and adding read only replicas gets you a lot further with little extra effort.

hylaride
0 replies
1d4h

Because people are beholden to costs and it's often out of our hands when to spend money on redundancy (or opportunity costs elsewhere).

It's less true today when redundancy is baked into SaaS products (like AWS Aurora, where even if you have a single database instance, it's easy to spin up a replacement one if the hardware on the one running fails).

ClumsyPilot
0 replies
1d3h

Ow yeah, I am looking at that problem right now

arctek
2 replies
1d13h

Isn't this somewhat better, at least when it fails it's in a single place?

As someone using Kafka, I'd like to know what the (good) alternatives are if you have suggestions.

happymellon
0 replies
1d11h

It really depends on what your application is.

Where I'm at, most of Kafka usage adds nothing of note and could be replaced with a rest service. It sounds good that Kafka makes everything execute in order, but honestly just making requests block does the same thing.

At least then I could autoscale, which Kafka prevents.

NortySpock
0 replies
1d

NATS JetStream seemed to support horizontal scaling (either hierarchical via leaf nodes or a flat RAFT quorum) and back pressure when I played with it.

I found it easy to get up and running, even as a RAFT cluster, but I have not tried to use JetStream mode heavily yet.

vergessenmir
1 replies
1d6h

I'm sorry, but I can't tell if you're being serious or not since you commented without qualification.

One of the most stable system architectures I've built was on Kafka AND it was running with microservices managed by teams across multiple geographies and time zones. It was one of the most reliable systems in the bank. There are situations where it isn't appropriate, which can be said for most tech (e.g. K8s vs ECS vs Nomad vs bare metal).

Every system has failure characteristics. Kafka's is defined as Consistent and Available and your system architecture needs to take that into consideration. Also the transactionality of tasks across multiple services and process boundaries is important.

Let's not pretend that kubernetes (or the tech of the day) is at fault while completely ignoring the complex architectural considerations that are being juggled

pram
0 replies
1d2h

Basically because people end up engineering their microservices as a shim to funnel data into the “magical black hole”

From my experience most microservices aren’t engineered to handle back pressure. If there is a sudden upsurge in traffic or data the Kafka cluster is expected to absorb all of the throughput. If the cluster starts having IO issues then literally everything in your “distributed” application is now slowly failing until the consumers/brokers can catch up.

intelVISA
0 replies
1d15h

Shhh, you're ruining the party!

bamboozled
0 replies
1d17h

Not saying you're wrong, but what is your grand plan ? I've never seen anything perfect.

jiggawatts
23 replies
1d18h

Even if the microservices platform is running at 1% capacity, it's guaranteed to have worse performance than almost any monolith architecture.

It's very rare to meet a developer who has even the vaguest notion of what an RPC call costs in terms of microseconds.

Fewer still that know about issues such as head-of-line blocking, the effects of load balancer modes such as hash versus round-robin, or the CPU overheads of protocol ser/des.

If you have an architecture that involves about 5 hops combined with sidecars, envoys, reverse proxies, and multiple zones you're almost certainly spending 99% to 99.9% of the wall clock time just waiting. The useful compute time can rapidly start to approach zero.

This is how you end up with apps like Jira taking a solid minute to show an empty form.

throwaway48540
15 replies
1d18h

Are you talking about cloud Jira? I use it daily and it's very quick, even search results appear immediately...

abrookewood
8 replies
1d17h

Maybe you're being honest, but you're using a throwaway account and I use Cloud Jira every day. It's slow and bloated and drives me crazy.

kahmeal
7 replies
1d16h

Has a LOT to do with how it's configured and what plugins are installed.

jiggawatts
6 replies
1d15h

It really doesn't. This is the excuse trotted out by Atlassian staff when defending their products in public forums, essentially "corporate propaganda". They have a history of gaslighting users, either telling them to disregard the evidence of their own lying eyes, or that it's their own fault for using the product wrong somehow.

I tested the Jira cloud service with a new, blank account. Zero data, zero customisations, zero security rules. Empty.

Almost all basic operations took tens of seconds to run, even when run repeatedly to warm up any internal caches. Opening a new issue ticket form was especially bad, taking nearly a minute.

Other Atlassian excuses included: corporate web proxy servers (I have none), slow Internet (gigabit fibre), slow PCs (gaming laptop on "high performance" settings), browser security plugins (none), etc...

ffsm8
3 replies
1d14h

> Opening a new issue ticket form was especially bad, taking nearly a minute.

At that point, something must've been wrong with your instance. I'd never call Jira fast, but the new ticket dialog on an unconfigured instance opens within <10s (which is absolutely horrendous performance, to be clear. Anything more than 200-500ms is.)

jiggawatts
2 replies
1d12h

Cloud Jira is notably slower than on-prem Jira, which takes on the order of 10 seconds like you said.

ffsm8
1 replies
1d12h

That does not mirror my own experience. And it's very easy to validate. Just create a free jira cloud instance, takes about 1 minute ( https://www.atlassian.com/software/jira/try ) and click new issue.

It opens within 1-2 sec (which is still bad performance, objectively speaking. It's an empty instance after all and already >1s).

jiggawatts
0 replies
1d12h

Ah, there's a new UI now. The last time I tested this, the entire look & feel was different and everything was in slow motion.

It's still sluggish compared to a desktop app from the 1990s, but it's much faster than just a couple of years ago.

throwaway48540
0 replies
1d11h

This is just not true. The create new issue form appears nearly immediately. I have created two tickets right now - in less than a minute, including writing a few sentences.

hadrien01
0 replies
1d5h

Atlassian themselves don't use JIRA Cloud. They use the datacenter edition (on-premise) for their public bug tracker, and it's sooooo much faster than the Cloud version: https://jira.atlassian.com/browse/

nikau
3 replies
1d18h

Have you ever used an old ticketing system like Remedy? Ugly as sin, but screens appear instantly.

I think web apps have been around so long now that people have forgotten how unresponsive things are vs old two-tier stuff.

throwaway48540
0 replies
1d11h

The SPA Jira cloud is faster than anything server-rendered for me; my connection is shit. In Jira I can at least move to static forms quickly and I'm not downloading the entire page on every move.

chuckadams
0 replies
1d4h

I’ve used Remedy in three different shops, and the experience varied dramatically. The entry screen might have popped instantly, but RPC timeouts on submission were common. Tab order for controls was the order they were added, not position on the screen. Remedy could be pleasant, but it was very dependent on a competent admin to set up and maintain it.

BeefWellington
0 replies
1d14h

There's a definite vibe these days of "this is how it's always been" when it really hasn't.

lmm
0 replies
1d13h

With cloud Jira you get thrown on a shared instance with no control over who you're sharing with, so it's random whether yours will be fast or extremely slow.

Lucasoato
0 replies
1d11h

I’m starting to think that these people praising Jira are just part of an architected PR campaign that tries to deny what’s evident to the end users: Jira is slow, bloated, and in many cases almost unusable.

randomdata
2 replies
1d16h

> It's very rare to meet a developer who has even the vaguest notion of what an RPC call costs in terms of microseconds.

To be fair, small time units are difficult to internalize. Just look at what happens when someone finds out that it takes tens of nanoseconds to call a C function in Go (gc). They regularly conclude that it's completely unusable, and not just in a tight loop with an unfathomable number of calls, but even for a single call in their program that runs once per day. You can flat out tell another developer exactly how many microseconds the RPC is going to add and they still aren't apt to get it.

It is not rare to find developers who understand that RPC has a higher cost than a local function, though, and with enough understanding of that to know that there could be a problem if overused. Where they often fall down, however, is when the tools and frameworks try to hide the complexity by making RPC look like a local function. It then becomes easy to miss that there is additional overhead to consider. Make the complexity explicit and you won't find many developers oblivious to it.
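
The cgo overhead being referred to is easy to measure directly. A rough Go sketch comparing a trivial Go call with the same call through cgo (the function and iteration count are arbitrary, a proper benchmark would use the testing package, and this needs cgo enabled to build):

    package main

    /*
    static int add(int a, int b) { return a + b; }
    */
    import "C"

    import (
    	"fmt"
    	"time"
    )

    func addGo(a, b int) int { return a + b }

    func main() {
    	const n = 1_000_000

    	sum := 0
    	start := time.Now()
    	for i := 0; i < n; i++ {
    		sum += addGo(i, i)
    	}
    	goPer := time.Since(start) / n

    	csum := C.int(0)
    	start = time.Now()
    	for i := 0; i < n; i++ {
    		csum += C.add(C.int(i), C.int(i))
    	}
    	cgoPer := time.Since(start) / n

    	fmt.Printf("plain Go call: ~%v/op (sum %d)\n", goPer, sum)
    	fmt.Printf("cgo call:      ~%v/op (sum %d)\n", cgoPer, csum)
    }

Even at tens of nanoseconds per call, that only matters in tight loops; a single RPC hop is typically several orders of magnitude more expensive.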

j16sdiz
0 replies
1d14h

Those time costs need to be contextualized with time budgets for each service. Without that, it is always somebody else's problem in an RPC world.

HideousKojima
0 replies
1d2h

> Just look at what happens when someone finds out that it takes tens of nanoseconds to call a C function in Go (gc).

I'm not too familiar with Go but my default assumption is that it's just used as a convenient excuse to avoid learning how to do FFI.

otabdeveloper4
1 replies
1d12h

It's actually worse than what you said. In 2024 the network is the only resource we can't upgrade on demand. There are physical limits we can't change. (I.e., there are only so many wires connecting your machines, and any significant upgrade involves building your own data center.)

So really eventually we'll all be optimizing around network interfaces as the bottleneck.

jiggawatts
0 replies
1d12h

cough speed of light cough

lmm
1 replies
1d13h

> This is how you end up with apps like Jira taking a solid minute to show an empty form.

Nah. Jira was horrible and slow long before the microservice trend.

p_l
0 replies
1d11h

JIRA slowness usually involved under-provisioned server resources, in my experience.

jamesfinlayson
8 replies
1d18h

> There was a saturation point where a single micro service called by all of the other micro services would cause complete system meltdown.

Yep - saw that at a company recently - something in AWS was running a little slower than usual which cascaded to cause massive failures. Dozens of people were trying to get to the bottom of it, it mysteriously fixed itself and no one could offer any good explanation.

spyspy
7 replies
1d13h

Any company that doesn’t have some form of distributed tracing in this day and age is acting with pure negligence IMO. Literally flying blind.

consteval
2 replies
1d3h

Solution: don't build a distributed system. Just have a computer somewhere running .NET or Java or something. If you really want data integrity and safety, just make the data layer distributed.

There's very little reason to distribute application code. It's very, very rare that the limiting factor in an application is compute. Typically, it's the data layer, which you can change independently of your application.

MaKey
1 replies
1d3h

I have yet to personally see an application where distribution of its parts was beneficial. For most applications a boring monolith works totally fine.

consteval
0 replies
1d2h

I'm sure it exists when the problem itself is distributed. For example, I can imagine something like YouTube would require a complex distributed system.

But I think very few problems fit into that archetype. Instead, people build distributed systems for reliability and integrity. But it's overkill, because you bring all the baggage and complexity of distributed computing. This area is incredibly difficult. I view it similar to parallelism. If you can avoid it for your problem, then avoid it. If you really can't, then take a less complex approach. There's no reason to jump to "scale to X threads and every thread is unaware of where it's running" type solutions, because those are complex.

bboygravity
1 replies
1d10h

Distributed tracing?

Is that the same as what Elixir/Erlang call supervision trees?

williamdclt
0 replies
1d7h

Likely about OpenTelemetry
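
For reference, a minimal OpenTelemetry tracing setup in Go looks roughly like this: install a tracer provider, then wrap units of work in spans so a backend can stitch the request path back together. The exporter here writes to stdout, and the service and span names are placeholders; a real deployment would export to a collector instead:

    package main

    import (
    	"context"
    	"log"

    	"go.opentelemetry.io/otel"
    	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    	sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
    	// Export spans to stdout; a real deployment would point at a collector.
    	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
    	if err != nil {
    		log.Fatal(err)
    	}
    	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    	defer tp.Shutdown(context.Background())
    	otel.SetTracerProvider(tp)

    	// Wrap units of work in spans; context propagation ties them into one trace.
    	tracer := otel.Tracer("checkout-service")
    	ctx, parent := tracer.Start(context.Background(), "HandleCheckout")
    	_, child := tracer.Start(ctx, "ChargeCard")
    	child.End()
    	parent.End()
    }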

ahoka
1 replies
1d9h

Oh, come on! Just make sure your architecture is sound. If you need to run an expensive data analysis cluster connected to massive streams of collected call information to see that you have loops in your architecture, then you have a bigger issue.

hylaride
0 replies
1d4h

I don't know if you're being sarcastic, which if you are: heh.

But to the point, if you're going to build a distributed system, you need tools to track problems across the distributed system that also works across teams. A poorly performing service could be caused by up/downstream components and doing that without some kind of tracing is hard even if your stack is linear.

The same is true for a giant monolithic app, but the sophisticated tools are just different.

api
5 replies
1d19h

This is a very fad driven industry. One of the things you earn after being in it for a long time is intuition for spotting fads and gratuitous complexity traps.

makeitdouble
2 replies
1d19h

I think that aspect is indirectly covered, as one of the main motivations was to get on a popular platform that helps hiring.

I agree on how it's technically a waste of time to pursue fads, but it's also a huge PITA to have a platform that good engineers actively try to avoid, as their careers would stagnate (even as they themselves know that it's half a fad)

sangnoir
1 replies
1d18h

I avoid working at organisations with NIH syndrome - if they are below a certain size (i.e. they lack a standing dev-eng team to support their homegrown K8s "equivalent"). Extra red flags if said homegrown system was developed by that guy[1] who's ostensibly a genius and has very strong opinions about his system. Give me k8s' YAML-hell any day instead; at least that bloat has transferable skills, and I can actually Google common resolutions.

1. Has been at the org for so long that management condones them flouting the rules, like pushing straight to prod. Hates the "inefficiency" of open source platforms and purpose-built something "suitable for the company" by themselves, with no documentation; you have to ask them to fix issues because they don't accept code or suggestions from others. The DSL they developed is inconsistent and has no parser/linter.

fragmede
0 replies
1d17h

yeah. If you think Kubernetes is too complicated, the flip side of that is someone built the simpler thing, but then unfortunately it grew and grew, and now you've got this mess of a system. You could have just used a hosted k8s or k3s system from the start instead of reinventing the wheel.

Absolutely start as simple as you can, but plan to move to a hosted kube something ASAP instead of writing your own base images, unless that's a differentiator for your company.

keybored
1 replies
1d8h

1. You have to constantly learn to keep up!

2. Fad-driven

I wonder why I don’t often see (1) critiqued on the basis of (2).

api
0 replies
1d4h

There’s definitely a connection. Some of the change is improvement, like memory safe languages, but a ton of it is fads and needless complexity invented to provide a reason for some business or consultancy or group in a FAANG to exist. The rest of the industry cargo cults the big companies.

alexpotato
2 replies
1d5h

As often, Grug has a great line about microservices:

    grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too

    seem very confusing to grug
https://grugbrain.dev/#grug-on-microservices

kiesel
0 replies
1d4h

Thanks for the link to this awesome page!

Escapado
0 replies
1d2h

Thanks for that link, I genuinely laughed out loud while reading some of those points! Love the presentation, and what a wonderful reality check. I couldn't agree more.

tsss
1 replies
1d4h

> single micro service called by all of the other micro services

So they didn't do microservices correctly. Big surprise.

zbentley
0 replies
1d1h

I mean, that pattern is pretty common in the micro service world. Services for things like authz, locking, logging/tracing, etc. are often centralized SPOFs.

There are certainly ways to mitigate the SPOFiness of each of those cases, but that doesn’t make having them an antipattern.

ellieh
1 replies
1d10h

I imagine because the article mentions:

> More broadly, we’re not a microservices company, and we don’t plan to become one

dgb23
0 replies
1d5h

It seems the "micro" implies that services are separated by high level business terms, like "payment" or "inventory" with each having their own databases instead of computational terms like "storage", "load balancing" or "preprocessing" etc.

Is this generally correct? How well is this term defined?

If yes, then I'm not surprised this type of design has become a target for frustration. State is smeared across the system, which implies a lot of messaging and arbitrary connections between services.

That type of design is useful if you are an application platform (or similar) where you have no say in what the individual entities are, and actually have no idea what they will be.

But if you have the birds-eye view and implement all of it, then why would you do that?

trhway
0 replies
1d14h

> there’s no mention of performance loss or gain after migration

To illustrate performance cost I usually ask people what ping they have, say, from one component/pod/service to another, and to compare that value to the ping they'd get between 2 Linux boxes sitting on that gorgeous 10Gb/40Gb/100Gb or even 1000Gb network that they are running their modern microservices architecture over.

teleforce
0 replies
1d19h

> There was also the case of an “accidental” dependency loop (S1 -> S2 -> S3 -> S1).

The classic dependency loop example that you thought you would never encounter again for the rest of your life after OS class.

rco8786
0 replies
1d7h

Or cost increase or decrease!

okr
0 replies
1d3h

And if it was a single application, how would that have solved it? You still would have that loop, no?

Personally, I think it does not have to be hundreds of microservices, basically each function a service. But I see it more as being like the web on the internet. Things are sometimes not reachable or overloaded. I think that is normal life.

jknoepfler
0 replies
1d3h

One of the points of a microservice architecture (on k8s or otherwise) is that you can easily horizontally scale the component that was under pressure without having to scale out a monolithic application... that just sounds like people being clueless, not a failure of microservice architecture...

ec109685
0 replies
1d16h

Why would moving to Kubernetes from ECS introduce performance issues?

They already had their architecture, largely, and just moved it over to K8s.

They even mention they aren’t a micro service company.

dangus
0 replies
1d17h

This article specifically mentions that they are not running microservices and has pretty clearly defined motivations for making the migration.

crossroadsguy
0 replies
1d6h

I see this in Android. Every few years (sometimes multiple times in a year) a new architecture becomes the fad and every TDH dev starts hankering for one way or the other along the lines of "why are you not doing X..". Problem is, at immature firms (especially startups) the director- and senior-manager-level leaders happily agree to the rewrite, i.e. re-architectures, because they usually leave every season for the next place and they get to talk about that new thing, which probably didn't last beyond their stay, or might have conked off before that.

And the "test cases" porn! Goodness! People want "coverage" and that's it. It's a box to tick, irrespective of whether those test cases and the way they are written are actually meaningful in any way. There are more, like the "let's have dependency injection everywhere" charade.

beeandapenguin
0 replies
1d17h

Extremely slow times - from development to production, backend to frontend. Depending on how bad things are, you might catch the microservice guys complaining over microseconds from a team downstream, in front of a FE dev who’s spent his week optimizing the monotonically-increasing JS bundles with code splitting heuristics.

Of course, it was because the client app recently went over the 100MB JS budget - a budget they set because the last time that happened, customers abroad reported seeing “white screens”. International conversion dropped sharply not long after that.

It’s pretty silly. So ya, good times indeed. Time to learn k8s.

vouwfietsman
43 replies
2d

Maybe its normal for a company this size, but I have a hard time following much of the decision making around these gigantic migrations or technology efforts because the decisions don't seem to come from any user or company need. There was a similar post from Figma earlier, I think around databases, that left me feeling the same.

For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

When a decision is founded on a desire of the user, its easy to validate that downstream decisions make sense. When a decision is founded on a technological desire, downstream decisions may make sense in the context of the technical desire, but do they make sense in the context of the user, still?

Either I don't understand organizations of this scale, or it is fundamentally difficult for organizations of this scale to identify and reason about valuable work.

WaxProlix
25 replies
2d

People move to K8s (specifically from ECS) so that they can use cloud provider agnostic tooling and products. I suspect a lot of larger company K8s migrations are fueled by a desire to be multicloud or hybrid on-prem, mitigate cost, availability, and lock-in risk.

zug_zug
20 replies
2d

I've heard all of these lip-service justifications before, but I've yet to see anybody actually publish data showing how they saved any money. Would love to be proven wrong by some hard data, but something tells me I won't be.

Alupis
10 replies
1d21h

Why would you assume it's lip-service?

Being vendor-locked into ECS means you must pay whatever ECS wants... using k8s means you can feasibly pick up and move if you are forced.

Even if it doesn't save money today it might save a tremendous amount in the future and/or provide a much stronger position to negotiate from.

greener_grass
3 replies
1d20h

Great in theory but in practice when you do K8s on AWS, the AWS stuff leaks through and you still have lock-in.

Alupis
1 replies
1d20h

Then don't use the AWS stuff. You can bring your own anything that they provide.

greener_grass
0 replies
1d10h

This requires iron discipline. Maybe with some kind of linter for Terraform / kubectl it could be done.

cwiggs
0 replies
1d20h

It doesn't have to be that way though. You can use the AWS ingress controller, or you can use ingress-nginx. You can use external secrets operator and tie it into AWS Secrets manager, or you can tie it into 1pass, or Hashicorp Vault.

Just like picking EKS, you have to be aware of the pros and cons of picking the cloud provider tool or not. Luckily the CNCF is doing a lot to reduce vendor lock-in and I think it will only continue.

elktown
3 replies
1d20h

I don't understand why this "you shouldn't be vendor-locked" rationalization is taken at face value at all?

1. The time it will take to move to another cloud is proportional to the complexity of your app. For example, if you're a Go shop using managed persistence, are you more vendor-locked in any meaningful way than with k8s? What's the delta here?

2. Do you really think you can haggle with the fuel producers like you're Maersk? No, you're more likely just a car driving around looking for a gas station, with increasingly diminishing returns.

Alupis
2 replies
1d20h

This year alone we've seen significant price increases from web services, including critical ones such as Auth. If you are vendor-locked into, say Auth0, and they increase their price 300%[1]... What choice do you have? What negotiation position do you have? None... They know you cannot leave.

It's even worse when your entire platform is vendor-locked.

There is nothing but upside to working towards a vendor-neutral position. It gives you options. Even if you never use those options, they are there.

> Do you really think you can haggle

At the scale of someone like Figma? Yes, they do negotiate rates - and a competent account manager will understand Figma's position and maximize the revenue they can extract. Now, if the account rep doesn't play ball, Figma can actually move their stuff somewhere else. There's literally nothing but upside.

I swear, it feels like some people are just allergic to anything k8s and actively seek out ways to hate on it.

[1] https://auth0.com/blog/upcoming-pricing-changes-for-the-cust...

elktown
1 replies
1d19h

Why skip point 1 and do some strange tangent on a SaaS product unrelated to using k8s or not?

Most people looking into (and using) k8s that are being told the "you must avoid vendor lock-in!" selling point are nowhere near the size where it matters. But I know there's essentially bulk pricing, as we have it where I work as well. That it's because of picking k8s or not, however, is an extremely long stretch, and imo mostly rationalization. There's nothing saying that a cloud move without k8s couldn't be done within the same amount of time. Or that k8s is even the main problem; I imagine it isn't, since it's usually supposed to be stateless apps.

Alupis
0 replies
1d19h

The point was about vendor lock, which you asserted is not a good reason to make a move such as this. The "tangent" about a SaaS product was to make it clear what happens when you build your system in such a way as to become entirely dependent on that vendor. Just because Auth0 is not part of one of the big "cloud" providers doesn't make it any less vendor-locky. Almost all of the vendor services offered on the big clouds are extremely vendor-locked and non-portable.

Where you buy compute from is just as big of a deal as where you buy your other SaaS' from. In all of the cases, if you cannot move even if you had to (ie. it'll take 1 year+ to move), then you are not in a good position.

Addressing your #1 point - if you use a regular database that happens to be offered by a cloud provider (e.g. Postgres, MySQL, MongoDB, etc.) then you can pick up and move. If you use something proprietary like Cosmos DB, then you are stuck or face significant efforts to migrate.

With k8s, moving to another cloud can be as simple as creating an account and updating your configs to point at the new cluster. You can run every service you need inside your cluster if you wanted. You have freedom of choice and mobility.

> Most people looking into (and using) k8s that are being told the "you must avoid vendor lock-in!" selling point are nowhere near the size where it matters.

This is simply wrong, as highlighted by the SaaS example I provided. If you think you are too small for it to matter, and decide to embrace all of the cloud vendor's proprietary services... what happens to you when that cloud provider decides to change their billing model, or dramatically increases prices? You are screwed and have no option but to cough up more money.

There's more decisions to make and consider regarding choosing a cloud platform and services than just whatever is easiest to use today - for any size of business.

I have found that, in general, people are afraid of using k8s because it isn't trivial to understand for most developers. People often mistakenly believe k8s is only useful when you're "google scale". It solves a lot of problems, including reduced vendor-lock.

watermelon0
0 replies
1d20h

I would assume that the migration from ECS to something else would be a lot easier, compared to migrating from other managed services, such as S3/SQS/Kinesis/DynamoDB, and especially IAM, which ties everything together.

otterley
0 replies
1d20h

Amazon ECS is and always has been free of charge. You pay for the underlying compute and other resources (just like you do with EKS, too), but not the orchestration service.

jgalt212
2 replies
1d23h

True, but if AWS knows your lock-in is less locked-in, I'd bet they'd be more flexible when contracts are up for renewal. I mean, it's possible the blog post's primary purpose was a shot across the bow to their AWS account manager.

logifail
1 replies
1d21h

> it's possible the blog post's primary purpose was a shot across the bow to their AWS account manager

Isn't it slightly depressing that this explanation is fairly (the most?) plausible?

jiggawatts
0 replies
1d20h

Our state department of education is one of the biggest networks in the world with about half a million devices. They would occasionally publicly announce a migration to Linux.

This was just a Microsoft licensing negotiation tactic. Before he was CEO, Ballmer flew here to negotiate one of the contracts. The discounts were epic.

tengbretson
1 replies
1d23h

There are large swaths of the b2b space where (for whatever reason) being in the same cloud is a hard business requirement.

imtringued
0 replies
5m

There are good technical reasons for this. Anything latency or throughput sensitive is better done within the same datacenter. There have been submissions about an ffmpeg as a service company and a GPU over TCP company on HN recently that would significantly benefit from 'same cloud'.

vundercind
0 replies
1d23h

The vast majority of corporate decisions are never justified by useful data analysis, before or after the fact.

Many are so-analyzed, but usually in ways that anyone who paid attention in high school science or stats classes can tell are so flawed that they’re meaningless.

We can’t even measure manager efficacy to any useful degree, in nearly all cases. We can come up with numbers, but they don’t mean anything. Good luck with anything more complex.

Very small organizations can probably manage to isolate enough variables to know how good or bad some move was in hindsight, if they try and are competent at it (… if). Sometimes an effect is so huge for a large org that it overwhelms confounders and you can be pretty confident that it was at least good or bad, even if the degree is fuzzy. Usually, no.

Big organizations are largely flying blind. This has only gotten worse with the shift from people-who-know-the-work-as-leadership to professional-managers-as-leadership.

nailer
0 replies
1d23h

Likewise. I'm not sure Kubernetes' famous complexity (and the resulting staff requirements) is worth it to preemptively avoid vendor lock-in, or whether the problem wouldn't be solved more efficiently by migrating to another cloud provider's native tools if the need arises.

bryanlarsen
0 replies
1d23h

I'm confident Figma isn't paying published rates for AWS. The transition might have helped them in their rate negotiations with AWS, or it might not have. Hard data on the money saved would be difficult to attribute.

WaxProlix
0 replies
1d21h

It looks like I'm implying that companies are successful in getting those things from a K8s transition, but I wasn't trying to say that, just thinking of the times when I've seen these migrations happen and relaying the stated aims. I agree, I think it can be a burner of dev time and a burden on the business as devs acquire the new skillset instead of doing more valuable work.

timbotron
0 replies
2d

there's a pretty direct translation from ECS task definition to docker-compose file

teyc
0 replies
1d21h

People move to K8s so that their resumes and job ads are cloud provider agnostic. People's careers stagnate when their employer's platform is built on home-baked tech, or on specific offerings from cloud providers. Employers find moving to a common platform makes recruiting easier.

fazkan
0 replies
1d23h

This, most of it, I think, is to support on-prem and cloud flexibility. Also, from the customer's point of view, they can now sell the entire Figma "box" to controlled industries for a premium.

OptionOfT
0 replies
1d23h

Flexibility was a big thing for us. Many different jurisdictions required us to be conscious of where exactly data was stored & processed.

K8s makes this really easy. Don't need to worry whether country X has a local Cloud data center of Vendor Y.

Plus it makes hiring so much easier as you only need to understand the abstraction layer.

We don't hire people for ARM64 or x86. We have abstraction layers. Multiple even.

We'd be fooling ourselves not to use them.

wg0
7 replies
1d22h

If you haven't broken down your software into 50+ different separate applications written in 15 different languages using 5 different database technologies - you'll find very little use for k8s.

All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations, whether to data or schema.

imiric
2 replies
1d21h

> All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations, whether to data or schema.

Easier said than done.

You can start by implementing this yourself and thinking how simple it is. But then you find that you also need to decide how to handle different environments, configuration and secret management, rollbacks, failover, load balancing, HA, scaling, and a million other details. And suddenly you find yourself maintaining a hodgepodge of bespoke infrastructure tooling instead of your core product.

K8s isn't for everyone. But it sure helps when someone else has thought about common infrastructure problems and solved them for you.

mattmanser
1 replies
1d20h

You need to remove a lot of things from that list. Almost all of that functionality is available in build tools that have been available for decades. I want to emphasize the DECADES.

And then all you're left with is scaling. Which most businesses do not need.

Almost everything you've written there is a standard feature of almost any CI toolchain, teamcity, Jenkins, Azure DevOps, etc., etc.

We were doing it before k8s was even written.

imiric
0 replies
1d10h

> Almost all of that functionality is available in build tools that have been available for decades.

Build tools? These are runtime and operational concerns. No build tool will handle these things.

> And then all you're left with is scaling. Which most businesses do not need.

Eh, sure they do. They might not need to hyperscale, but they could sure benefit from simple scaling, autoscaling at peak hours, and scaling down to cut costs.

Whether they need k8s specifically to accomplish this is another topic, but every business needs to think about scaling in some way.

> Almost everything you've written there is a standard feature of almost any CI toolchain, teamcity, Jenkins, Azure DevOps, etc., etc.

Huh? Please explain how a CI pipeline will handle load balancing, configuration and secret management, and other operational tasks for your services. You may use it for automating commands that do these things, but CI systems are entirely decoupled from core infrastructure.

> We were doing it before k8s was even written.

Sure. And k8s isn't the absolute solution to these problems. But what it does give you is a unified set of interfaces to solve common infra problems. Whatever solutions we had before, and whatever you choose to compose from disparate tools, will not be as unified and polished as what k8s offers. It's up to you to decide the right trade-off, but I find the head-in-the-sand dismissal of it equally as silly as cargo culting it.

mplewis
1 replies
1d20h

Yeah, all you need is a rollout system that supports blue-green! Very easy to homeroll ;)

wg0
0 replies
1d9h

Not easy, but already a solved problem.

javaunsafe2019
1 replies
1d22h

But you do know which problems the k8s abstraction solves, right? Cause it has nothing to do with many languages nor many services but things like discovery, scaling, failover and automation …

wg0
0 replies
1d9h

If all you have is one single application listening on port 8080 with SSL terminated elsewhere, why would you need so many abstractions in the first place?

ianvonseggern
4 replies
1d22h

Hey, author here, I think you ask a good question and I think you frame it well. I agree that, at least for some major decisions - including this one, "it is fundamentally difficult for organizations of this scale to identify and reason about valuable work."

At its core we are a platform team building tools, often for other platform teams, that are building tools that support the developers at Figma creating the actual product experience. It is often harder to reason about what the right decisions are when you are further removed from the end user, although it also gives you great leverage. If we do our jobs right, the multiplier effect of getting this platform right impacts the ability of every other engineer to do their job efficiently and effectively (many indirectly!).

You bring up good examples of why this is hard. It was certainly an alternative to say sorry we can't support etcd and helm and you will need to find other ways to work around this limitation. This was simply two more data points helping push us toward the conclusion that we were running our Compute platform on the wrong base building blocks.

While difficult to reason about, I do think it's still very worth trying to do this reasoning well. It's how, as a platform team, we ensure we are tackling the right work to get to the best platform we can. That's why we spent so much time making the decision to go ahead with this, and part of why I thought it was an interesting topic to write about.

vouwfietsman
1 replies
1d11h

Hi! Thanks for the thoughtful reply.

I understand what you're saying, the thing that worries me though is that the input you get from other technical teams is very hard to verify. Do you intend to measure the development velocity of the teams before and after the platform change takes effect?

In my experience it is extremely hard to measure the real development velocity (in terms of value-add, not arbitrary story points) of a single team, not to mention a group of teams over time, not to mention as a result of a change.

This is not necessarily criticism of Figma, as much as it is criticism of the entire industry maybe.

Do you have an approach for measuring these things?

felixgallo
0 replies
1d3h

You're right that the input from other technical teams is hard to verify. On the other hand, that's fundamental table stakes, especially for a platform team that has a broad impact on an organization. The purpose of the platform is to delight the paying customer, and every change should have a clear and well documented and narrated line of sight to either increasing that delight or decreasing the frustration.

The canonical way to do that is to ensure that the incoming demand comes with both the ask and also the solid justification. Even at top tier organizations, frequently asks are good ideas, sensible ideas, nice ideas, probably correct ideas -- but none of that is good enough/acceptable enough. The proportion of good/sensible/nice/probably correct ideas that are justifiable is about 5% in my lived experience of 38 years in the industry. The onus is on the asking team to provide that full true and complete justification with sufficiently detailed data and in the manner and form that convinces the platform team's leadership. The bar needs to be high and again, has to provide a clear line of sight to improving the life of the paying customer. The platform team has the authority and agency necessary to defend the customer, operations and their time, and can (and often should) say no. It is not the responsibility of the platform team to try to prove or disprove something that someone wants, and it's not 'pushing back' or 'bureaucracy', it's basic sober purpose-of-the-company fundamentals. Time and money are not unlimited. Nothing is free.

Frequently the process of trying to put together the justification reveals to the asking team that they do not in fact have the justification, and they stop there and a disaster is correctly averted.

Sometimes, the asking team is probably right but doesn't have the data to justify the ask. Things like 'Let's move to K8s because it'll be better' are possibly true but also possibly not. Vibes/hacker news/reddit/etc are beguiling to juniors but do not necessarily delight paying customers. The platform team has a bunch of options if they receive something of that form. "No" is valid, but also so is "Maybe" along with a pilot test to perform A/B testing measurements and to try to get the missing data; or even "Yes, but" with a plan to revert the situation if it turns out to be too expensive or ineffective after an incrementally structured phase 1. A lot depends on the judgement of the management and the available bandwidth, opportunity cost, how one-way-door the decision is, etc.

At the end of the day, though, if you are not making a data-driven decision (or the very closest you can get to one) and doing it off naked/unsupported asks/vibes/resume enhancement/reddit/hn/etc, you're putting your paying customer at risk. At best you'll be accidentally correct. Being accidentally correct is the absolute worst kind of correct, because inevitably there will come a time when your luck runs out and you just killed your team/organization/company because you made a wrong choice, your paying customers got a worse/slower-to-improve/etc experience, and they deserted you for a more soberly run competitor.

felixgallo
0 replies
1d22h

I have a constructive recommendation for you and your engineering management for future cases such as this.

First, when some team says "we want to use helm and etcd for some reason and we haven't been able to figure out how to get that working on our existing platform," start by asking them what their actual goal is. It is obscenely unlikely that helm (of all things) is a fundamental requirement to their work. Installing temporal, for example, doesn't require helm and is actually simple, if it turns out that temporal is the best workflow orchestrator for the job and that none of the probably 590 other options will do.

Second, once you have figured out what the actual goal is, and have a buffet of options available, price them out. Doing some napkin math on how many people were involved and how much work had to go into it, it looks to me that what you have spent to completely rearchitect your stack and operations and retrain everyone -- completely discounting opportunity cost -- is likely not to break even in even my most generous estimate of increased productivity for about five years. More likely, the increased cost of the platform switch, the lack of likely actual velocity accrual, and the opportunity cost make this a net-net bad move except for the resumes of all of those involved.

Spivak
0 replies
1d22h

> we can't support etcd and helm and you will need to find other ways to work around this limitation

So am I reading this right: either downstream platform teams or devs wanted to leverage existing helm templates to provision infrastructure, being on ECS locked you out of those, and the water eventually boiled over? If so, that's a pretty strong statement about the platform effect of k8s.

samcat116
0 replies
1d23h

> I have a hard time following much of the decision making around these gigantic migrations or technology efforts because the decisions don't seem to come from any user or company need

I mean the blog post is written by the team deciding the company needs. They explained exactly why they can't easily use etcd on ECS due to technical limitations. They also talked about many other technical limitations that were causing them issues and increasing cost. What else are you expecting?

lmm
0 replies
1d13h

> For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

I'm no fan of Helm, but there are surprisingly few good alternatives to etcd (i.e. highly available but consistent datastores, suitable for e.g. the distributed equivalent of a .pid file) - Zookeeper is the only one that comes to mind, and it's a real pain on the ops side of things, requiring ancient JVM versions and being generally flaky even then.
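
To make the ".pid file" analogy concrete, the usual pattern with the etcd v3 client is a lease-backed session plus a mutex keyed on a prefix, so only one process runs the work at a time. A minimal sketch, with the endpoint and lock key as placeholders:

    package main

    import (
    	"context"
    	"log"
    	"time"

    	clientv3 "go.etcd.io/etcd/client/v3"
    	"go.etcd.io/etcd/client/v3/concurrency"
    )

    func main() {
    	cli, err := clientv3.New(clientv3.Config{
    		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
    		DialTimeout: 5 * time.Second,
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer cli.Close()

    	// The session holds a lease; if this process dies, the lock is released automatically.
    	session, err := concurrency.NewSession(cli)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer session.Close()

    	mutex := concurrency.NewMutex(session, "/locks/nightly-batch") // placeholder key
    	if err := mutex.Lock(context.Background()); err != nil {
    		log.Fatal(err)
    	}
    	defer mutex.Unlock(context.Background())

    	log.Println("holding the lock; only one instance runs the work at a time")
    }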

Flokoso
0 replies
1d23h

Managing 500 or more VMs is a lot of work.

The VM upgrades, auth, backups, log rotation, etc. alone.

With k8s I can give everyone a namespace, policies, and volumes, and have automatic log aggregation thanks to DaemonSets and k8s/cloud native stacks.

Self healing and more.

It's hard to describe how much better it is.

xiwenc
38 replies
1d22h

I’m baffled to see so many anti-k8s sentiments on HN. Is it because most commenters are developers used to services like Heroku, fly.io, render.com, etc., or run their apps on VMs?

elktown
19 replies
1d21h

I think some are just pretty sick and tired of the explosion of needless complexity we've seen in the last decade or so in software, and rightly so. This is an industry-wide problem of deeply misaligned incentives (& some amount of ZIRP gold rush), not specific to this particular case - if this one is even a good example of this to begin with.

Honestly, as it stands, I think we'd be seen as pretty useless craftsmen in any other field due to an unhealthy obsession with our tooling and meta-work - consistently throwing any kind of sensible resource usage out of the window in favor of just getting to work with certain tooling. It's some kind of a "Temporarily embarrassed FAANG engineer" situation.

cwiggs
7 replies
1d20h

I agree with this somewhat. The other day I was driving home and I saw a sprinkler head that had broken on the side of the road and was spraying water everywhere. It made me think: why aren't sprinkler systems designed with HA in mind? Why aren't there dual water lines with dual sprinkler heads everywhere, with an electronic component that detects a break in a line and automatically switches to the backup water line? It's because the downside of having the water spray everywhere and the grass become unhealthy or die is less than how much it would cost to deploy it HA.

In the software/tech industry it's commonplace to just accept that your app can't be down for any amount of time, no matter what. No one checked to see how much more it would cost (engineering time & infra costs) to deploy the app so it would be HA, so no one checked to see if it would be worth it.

I blame this logic on the low interest rates for a decade. I could be wrong.

fragmede
3 replies
1d16h

Why would wanting redundancy be a ZIRP? Is blaming everything on ZIRP like Mercury was in retrograde but for economics dorks?

felixgallo
0 replies
1d9h

Because the company overhired to the point where people were sitting around dreaming up useless features just to justify their workday.

consteval
0 replies
1d3h

It depends on the cost of complexity you're adding. Adding another database or whatever is really not that complex so yeah sure, go for it.

But a lot of companies are building distributed systems purely because they want this ultra-low downtime. Distributed systems are HARD. You get an entire set of problems you don't get otherwise, and the complexity explodes.

Often, in my opinion, this is not justified. Saving a few minutes of downtime in exchange for making your application orders of magnitude more complex is just not worth it.

Distributed systems solve distributed problems. They're overkill if you just want better uptime or crisis recovery. You can do that with a monolith and a database and get 99.99% of the way there. That's good enough.

addaon
0 replies
1d

Redundancy, like most engineering choices, is a cost/benefit tradeoff. If the costs are distorted, the result of the tradeoff study will be distorted from the decisions that would be made in "more normal" times.

loire280
1 replies
1d17h

This week we had a few minutes of downtime on an internal service because of a node rotation that triggered an alert. The responding engineer started to put together a plan to make the service HA (which would have tripled the cost to serve). I asked how frequently the service went down and how many people would be inconvenienced if it did. They didn't know, but when we checked the metrics it had single-digit minutes of downtime this year and fewer than a dozen daily users. We bumped the threshold on the alert to longer than it takes for a pod to be re-scheduled and resolved the ticket.

jack_riminton
0 replies
5h25m

This is most sensible thing I’ve read on here in a while. Engineers’ obsession with tinkering and perfection is the slow death of many startups. If you’re doing something important like banking or air traffic control fair enough but a CRUD app for booking hair appointments will survive a bit of downtime

zerkten
0 replies
1d

You assume that the teams running these systems achieve acceptable uptime and companies aren't making refunds for missed uptime targets when contracts enforce that, or losing customers. There is definitely a vision for HA at many companies, but they are struggling with and without k8s.

bobobob420
7 replies
1d18h

Any software engineer who thinks k8s is complex shouldn’t be a software engineer. It’s really not that hard to manage.

LordKeren
5 replies
1d18h

I think the key word is “needless” in terms of complexity. There are a lot of k8s projects that probably could benefit from a simpler orchestration system, especially at smaller firms.

fragmede
3 replies
1d16h

do you have a simpler orchestration system you'd recommend?

ahoka
1 replies
1d3h

How is it more simple?

Ramiro
0 replies
19h39m

Every time I read about Nomad, I wonder the same. I swear I'm not trolling here, I honestly don't get how running Nomad is simpler than Kubernetes. Especially considering that there are substantially more resources and help on Kubernetes than Nomad.

javadevmtl
0 replies
1d3h

For me it was DC/OS with Marathon and Mesos! It worked, it was a tank, and its model was simple. There were also some nice 3rd party open source systems around Mesos that were also simple to use. Unfortunately Kube won.

While Nomad can be interesting, again it's a single "smallish" vendor pushing an "open" source project (see the debacle with Terraform).

javadevmtl
0 replies
1d5h

No, it just looks and feels like enterprisy SOAP XML

methodical
1 replies
1d6h

Fair point but I think the key point here is unnecessary complexity versus necessary complexity. Are zero-downtime deployments and load balancing unnecessary? Perhaps for a personal project, but for any company with a consistent userbase I'd argue these are a non-negotiable, or should be anyways. In a situation where this is the expectation, k8s seems like the simplest answer, or near enough to it.

iTokio
0 replies
4h1m

There are many ways to do deployments without downtime, and load balancing is easy to configure without k8s.

darby_nine
0 replies
1d1h

> It's some kind of a "Temporarily embarrassed FAANG engineer" situation.

FAANG engineers made the same mistake, too, even though the analogy implies comparative competency or value.

moduspol
12 replies
1d17h

For me personally, I get a little bit salty about it due to imagined, theoretical business needs of being multi-cloud, or being able to deploy on-prem someday if needed. It's tough to explain just how much longer it'll take, how much more expertise is required, how much more fragile it'll be, and how much more money it'll take to build out on Kubernetes instead of your AWS deployment model of choice (VM images on EC2, or Elastic Beanstalk, or ECS / Fargate, or Lambda).

I don't want to set up or maintain my own ELK stack, or Prometheus. Or wrestle with CNI plugins. Or Kafka. Or high availability Postgres. Or Argo. Or Helm. Or control plane upgrades. I can get up and running with the AWS equivalent almost immediately, with almost no maintenance, and usually with linear costs starting near zero. I can solve business problems so, so much faster and more efficiently. It's the difference between me being able to blow away expectations and my whole team being quarters behind.

That said, when there is a genuine multi-cloud or on-prem requirement, I wouldn't want to do it with anything other than k8s. And it's probably not as bad if you do actually work at a company big enough to have a lot of skilled engineers that understand k8s--that just hasn't been the case anywhere I've worked.

drawnwren
8 replies
1d12h

Genuine question: how are you handling load balancing, log aggregation, failure restart + readiness checks, deployment pipelines, and machine maintenance schedules with these “simple” setups?

Because as annoying as getting the Prometheus + Loki + Tempo + Promtail stack going on k8s is, I don’t really believe that writing it from scratch is easier.

moduspol
4 replies
1d5h

* Load balancing is handled pretty well by ALBs, and there are integrations with ECS autoscaling for health checks and similar

* Log aggregation happens out of the box with CloudWatch Logs and CloudWatch Log Insights. It's configurable if you want different behavior

* On ECS, you configure a "service" which describes how many instances of a "task" you want to keep running at a given time. It's the abstraction that handles spinning up new tasks when one fails

* ECS supports ready checks, and (as noted above) integrates with ALB so that requests don't get sent to containers until they pass a readiness check

* Machine maintenance schedules are non-existent if you use ECS / Fargate, or at least they're abstracted from you. As long as your application is built such that it can spin up a new task to replace your old one, it's something that will happen automatically when AWS decommissions the hardware it's running on. If you're using ECS without Fargate, it's as simple as changing the autoscaling group to use a newer AMI. By default, this won't replace all of the old instances, but will use the new AMI when spinning up new instances

But again, though: the biggest selling point is the lack of maintenance / babysitting. If you set up your stack using ECS / Fargate and an ALB five years ago, it's still working, and you've probably done almost nothing to keep it that way.

You might be able to do the same with Kubernetes, but your control plane will be out of date, your OSes will have many missed security updates. Might even need a major version update to the next LTS. Prometheus, Loki, Tempo, Promtail will be behind. Your helm charts will be revisions behind. Newer ones might depend on newer apiVersions that your control plane won't support until you update it. And don't forget to update your CNI plugin across your cluster, too.

It's at least one full time job just keeping all that stuff working and up-to-date. And it takes a lot more know-how than just ECS and ALB.
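
For what it's worth, the readiness contract is the same in either world (an ALB target-group health check or a Kubernetes probe): an HTTP endpoint that starts returning 200 once the app can take traffic. A minimal Go sketch, with the path and port as placeholders:

    package main

    import (
    	"log"
    	"net/http"
    )

    func main() {
    	// The load balancer or orchestrator probe just needs an endpoint that
    	// returns 200 once the app is ready to take traffic.
    	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    		w.WriteHeader(http.StatusOK)
    	})
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }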

drawnwren
1 replies
1d1h

(Apologies for the snark, someone else made a short snarky comment that I felt was also wrong and I thought this thread was in reply to them before I typed it out -- thank you for the reply)

- ALBs -- yeah this is correct. However ALBs have much longer startup/health check times than Envoy/Traefik

- CloudWatch - this is true, however the "configurable" behavior makes CloudWatch trash out of the box. You get, e.g., exceptions split across multiple log entries with the default configuration

- ECS tasks - yep, but the failure behavior of tasks is horrible because there're no notifications out of the box (you can configure it)

- Fargate does allow you to avoid maintenance, however it has some very hairy edges, e.g. you can't use any container that expects to know its own IP address on a private VPC without writing a custom script. Networking in general is pretty arcane on Fargate and you're going to have to manually write and maintain workarounds for all the breakages from this

> You might be able to do the same with Kubernetes, but your control plane will be out of date, your OSes will have many missed security updates. Might even need a major version update to the next LTS. Prometheus, Loki, Tempo, Promtail will be behind. Your helm charts will be revisions behind. Newer ones might depend on newer apiVersions that your control plane won't support until you update it. And don't forget to update your CNI plugin across your cluster, too.

I think maybe you haven't used K8S in years. Karpenter, EKS, and a GitOps tool (Flux or Argo) give you the same machine-maintenance feeling as ECS, but on K8S, without any of the annoyances of dealing with ECS. All your app versions can be pinned or set to follow latest, as you prefer. You get rolling updates each time you switch machines (same as ECS, and if you really want to, you can run on top of Fargate).

By contrast, if your ECS/Fargate instance fails you haven't mentioned any notifications in your list -- so if you forgot to configure and test that correctly, your ECS could legitimately be stuck on a version of your app code that is 3 years old and you might not know if you haven't inspected the correct part of amazon's arcane interface.

By the way, you're paying per use for all of this.

At the end of the day, I think modern Kubernetes is strictly simpler, cheaper, and better than ECS/Fargate out of the box and has the benefit of not needing to rely on 20 other AWS specific services that each have their own unique ways of failing and running a bill up if you forget to do "that one simple thing everyone who uses this niche service should know".

mrgaro
0 replies
23h46m

ECS+Fargate does give you zero maintenance, both in theory and in practice. As someone who runs k8s at home and manages two clusters at work, I still recommend our teams use ECS+Fargate+ALB if it satisfies their requirements for stateless apps, and they all love it because it is literally zero maintenance, unlike what you just described k8s requiring.

Sure, there are a lot of great features with k8s which ECS cannot do, but when ECS does satisfy the requirements, it will require less maintenance, no matter what kind of k8s you compare it against.

NewJazz
1 replies
1d1h

It seems like you are comparing ECS to a self-managed Kubernetes cluster. Wouldn't it make more sense to compare to EKS or another managed Kubernetes offering? Many of your points don't apply in that case, especially around updates.

moduspol
0 replies
21h15m

A managed Kubernetes offering removes only some of the pain, and adds more in other areas. You're still on the hook for updating whatever add-ons you're using, though yes, it'll depend on how many you're using, and how painful it will be varies depending on how well your cloud provider handles it.

Most of my managed Kubernetes experience is through Amazon's EKS, and the pain I remember included frustration from the supported Kubernetes versions being behind the upstream versions, lack of visibility for troubleshooting control nodes, and having to explain / understand delays in NIC and EBS appropriation / attachments for pods. Also the ALB ingress controller was something I needed to install and maintain independently (though that may be different now).

Though that was also without us going neck-deep into being vendor agnostic. Using EKS just for the Kubernetes abstractions without trying hard to be vendor agnostic is valid--it's just not what I was comparing above because it was usually that specific business requirement that steered us toward Kubernetes in the first place.

If you ARE using EKS with the intention of keeping as much as possible vendor agnostic, that's also valid, but then now you're including a lot of the stuff I complained about in my other comment: your own metrics stack, your own logging stack, your own alarm stack, your own CNI configuration, etc.

felixgallo
1 replies
1d9h

He named the services. Go read about them.

drawnwren
0 replies
1d9h

I’m not sure which services you think were named that solve the problems I mentioned, but none were. You’re welcome to go read about them, I do this for a living.

Bjartr
0 replies
1d5h

Depending on use case specifics, Elastic Beanstalk can do that just fine.

angio
1 replies
1d9h

I think you're just used to AWS services and don't see the complexity there. I tried running some stateful services on ECS once and it took me hours to have something _not_ working. In Kubernetes it takes me literally minutes to achieve the same task (+ automatic chart updates with renovatebot).

moduspol
0 replies
1d6h

I'm not saying there's no complexity. It exists, and there are skills to be learned, but once you have the skills, it's not that hard.

Obviously that part's not different from Kubernetes, but here's the part that is: maintenance and upgrades are either completely out of my scope or absolutely minimal. On ECS, it might involve switching to a more recently built AMI every six months or so. AWS is famously good about not making backward incompatible changes to their APIs, so for the most part, things just keep working.

And don't forget you'll need a lot of those AWS skills to run Kubernetes on AWS, too. If you're lucky, you'll get simple use cases working without them. But once PVCs aren't getting mounted, or pods are stuck waiting because you ran out of ENI slots on the box, or requests are timing out somewhere between your ALB and your pods, you're going to be digging into the layer between AWS and Kubernetes to troubleshoot those things.

I run Kubernetes at home for my home lab, and it's not zero maintenance. It takes care and feeding, troubleshooting, and resolution to keep things working over the long term. And that's for my incredibly simple use cases (single node clusters with no shared virtualized network, no virtualized storage, no centralized logs or metrics). I've been in charge of much more involved ones at work and the complexity ceiling is almost unbounded. Running a distributed, scalable container orchestration platform is a lot more involved than piggy backing on ECS (or Lambda).

mountainriver
0 replies
1d

I hear a lot of comments that sound like people who used K8s years ago and not since. The clouds have made K8s management stupid simple at this point, you can absolutely get up and running immediately with no worry of upgrades on a modern provider like GKE

tryauuum
1 replies
1d20h

For me it is about VMs. I feel uneasy knowing that any kernel vulnerability will allow malicious code to escape the container and explore the kubernetes host.

There are kata-containers I think, they might solve my angst and make me enjoy k8s

Overall... There's just nothing cool in kubernetes to me. Containers, load balancers, megabytes of yaml -- I've seen it all. Nothing feels interesting enough to try

stackskipton
0 replies
1d20h

vs the application getting hacked and running loose on the VM?

If you have never dealt with "I have to run these 50 containers plus Nginx/Certbot while figuring out which node is best to run them," then yeah, I can see you not being thrilled about Kubernetes. For the rest of us though, Kubernetes helps out with that easily.

maayank
0 replies
1d21h

It’s one of those technologies where there’s merit to use them in some situations but are too often cargo culted.

caniszczyk
0 replies
1d21h

Hating is a sign of success in some ways :)

In some ways, it's nice to see companies move to use mostly open source infrastructure, a lot of it coming from CNCF (https://landscape.cncf.io), ASF and other organizations out there (on top of the random things on github).

archenemybuntu
0 replies
22h4m

Kubernetes itself is built around mostly solid distributed system principles.

It's the ecosystem around it which turns things needlessly complex.

Just because you have Kubernetes, you don't necessarily need Istio, Helm, Argo CD, Cilium, and whatever half-baked thing the CNCF pushed yesterday.

For example, take a look at Helm. Its templating is atrocious, and unless something has changed, it has no way to order resources properly other than hooks. Sometimes resource A (a Deployment) depends on resource B (some CRD).
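
For anyone who has not had the pleasure, this is the flavor of templating being complained about -- a fragment modeled on the default "helm create" scaffolding ("mychart" and the values keys are placeholders):

    # templates/service.yaml -- whitespace is managed by hand via nindent
    apiVersion: v1
    kind: Service
    metadata:
      name: {{ include "mychart.fullname" . }}
      labels:
        {{- include "mychart.labels" . | nindent 4 }}
    spec:
      type: {{ .Values.service.type }}
      ports:
        - port: {{ .Values.service.port }}
          targetPort: http
          protocol: TCP
          name: http
      selector:
        {{- include "mychart.selectorLabels" . | nindent 4 }}

Get one nindent count wrong and you find out at render time, if you're lucky.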

The culture around Kubernetes dictates that you bring in everything the CNCF pushes. And most of that stuff is half-baked MVPs.

---

The word "devops" has created the expectation that backend developers should be the ones fighting Kubernetes when something goes wrong.

---

Containerization is done poorly by many orgs, with no care for security or image size. That's a rant for another day; I suspect it isn't a big reason for the Kubernetes hate here.

julienmarie
18 replies
1d20h

I personally love k8s. I run multiple small but complex custom e-commerce shops and handle all the tech on top of marketing, finance and customer service.

I was running on dedicated servers before. My stack is quite complicated and deploys were a nightmare. In the end the dread of deploying was slowing down the little company.

Learning and moving to k8s took me a month. I run around 25 different services ( front ends, product admins, logistics dashboards, delivery routes optimizers, orsm, ERP, recommendation engine, search, etc.... ).

It forced me to clean my act and structure things in a repeatable way. Having all your cluster config in one place allows you to exactly know the state of every service, which version is running.

It allowed me to do rolling deploys with no downtime.

Yes it's complex. As programmers we are used to complex. An Nginx config file is complex as well.

But the more you dive into it, the more you understand the architecture of k8s and how it makes sense. It forces you to respect the twelve factors to the letter.

And yes, HA is more than nice, especially when your income is directly linked to the availability and stability of your stack.

And it's not that expensive. I pay around 400 USD a month in hosting.

maccard
16 replies
1d19h

Figma were running on ECS before, so they weren't just running dedicated servers.

I'm a K8S believer, but it _is_ complicated. It solves hard problems. If you're multi-cloud, it's a no brainer. If you're doing complex infra that you want a 1:1 mapping of locally, it works great.

But if you're less than 100 developers and are deploying containers to just AWS, I think you'd be insane to use EKS over ECS + Fargate in 2024.

epgui
12 replies
1d17h

I don’t know if it’s just me, but I really don’t see how kubernetes is more complex than ECS. Even for a one-man show.

mrgaro
11 replies
1d

Kubernetes needs regular updates, just as everything else (unless you carefully freeze your environment and somehow manage the vulnerability risks) and that requires manual work.

ECS+Fargate however does not. If you are a small team managing the entire stack, you need to take this into account. For example, EKS forces you to upgrade the cluster to keep up with the main Kubernetes release cycle, albeit you can delay it somewhat.

I personally run k8s at home and another two at work and I recommend our teams to use ECS+Fargate+ALB if it is enough for them.

metaltyphoon
10 replies
21h1m

Kubernetes needs regular updates, just as everything else (unless you carefully freeze your environment and somehow manage the vulnerability risks) and that requires manual work

Just use a managed K8s solution that deals with this? AKS, EKS and GKE all do this for you.

ttymck
9 replies
20h38m

It doesn't do everything for you. You still need to update applications that use deprecated APIs.

This sort of "just" thinking is a great way for teams to drown in ops toil.

metaltyphoon
5 replies
19h48m

Are you assuming the workloads have to use K8s APIs? Where is this coming from? If that’s not the case can you actually explain with a concrete example?

epgui
1 replies
16h32m

You mean operators?

(genuine tone, not rhetorical)

ttymck
0 replies
15h42m

Sure, an operator is likely to use a wide array of APIs.

But, to reiterate, everything uses APIs. The beta (v1betaX-style) APIs are of course likely to be deprecated and replaced with stable APIs after a few versions.

Too
1 replies
12h53m

Man, you don't need to use a service mesh just because you use k8s. Istio is a very advanced component that 99% of users don't need.

So if you are going to compare with a managed solution, compare with something equivalent. Take a bare managed cluster and add a single Deployment to it; it will be no more complex than ECS, while giving you much better developer ergonomics.
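
For reference, a "single Deployment" really is about this much YAML (a minimal sketch; the image and names are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: nginx:1.27   # placeholder image
              ports:
                - containerPort: 80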

ttymck
0 replies
12h25m

99% of users don't need kubernetes. Just deploy to heroku, and you'll have a much better developer experience.

epgui
1 replies
17h53m

My experience with k8s has been very much “just”, and I’ve never really had issues or experienced any real friction with updates. shrugs

ttymck
0 replies
17h8m

That's great. I guess I've somehow been making things harder than they need to be.

Ramiro
0 replies
19h41m

I agree with @metaltyphoon on this. Even for small teams, a managed version of Kubernetes takes away most of the pain. I've used both ECS+Fargate and Kubernetes, but these days I prefer Kubernetes, mainly because the ecosystem is way bigger, both vendor and open source. Most of the problems we run into are usually one search or one open source project away from a solution.

mountainriver
2 replies
1d

This just feels like a myth to me at this point. Kubernetes isn't hard; the clouds have made it so simple now that it's in no way more difficult than ECS, and it's way more flexible.

davewritescode
1 replies
1d

I'm not saying I agree with the comment above you, but Kubernetes upgrades and keeping all your add-ons/VPC stuff up to date can be a never-ending slog of one-way upgrades that, when they go wrong, can cause big issues.

organsnyder
0 replies
23h59m

Those are all issues that should be solved by the managed provider.

It's been a while since I spun up a k8s instance on AWS, Azure, or the like, but when I did I was astounded at how many implementation decisions I had to make and how much toil I had to handle myself. Hosted k8s should be plug-and-play unless you have a very specialized use-case.

belter
0 replies
5h0m

I run multiple small but complex custom e-commerce shops

How do you handle the lack of multi tenancy in Kubernetes?

wrs
16 replies
1d23h

A migration with the goal of improving the infrastructure foundation is great. However, I was surprised to see that one of the motivations was to allow teams to use Helm charts rather than converting to Terraform. I haven’t seen in practice the consistent ability to actually use random Helm charts unmodified, so by encouraging its use you end up with teams forking and modifying the charts. And Helm is such a horrendous tool, you don’t really want to be maintaining your own bespoke Helm charts. IMO you’re actually better off rewriting in Terraform so at least your local version is maintainable.

Happy to hear counterexamples, though — maybe the “indent 4” insanity and multi-level string templating in Helm is gone nowadays?

cwiggs
5 replies
1d20h

Helm charts and Terraform are different things IMO. Terraform is better suited to deploying cloud resources (S3 bucket, EKS cluster, EKS workers, RDS, etc). Sure you can manage your k8s workloads with Terraform, but I wouldn't recommend it. Terraform having its own state when your state already lives in k8s makes working with Terraform + k8s a pain. Helm is purpose-built for k8s; Terraform is not.

I'm not a fan of Helm either though; templated YAML sucks, and you still have the "indent 4" insanity too. Kustomize is nice when things are simple, but once your app is complex, Kustomize is worse than Helm IMO. Try deploying an app that has an Ingress, a TLS cert, and external-dns with Kustomize across multiple environments; you have to patch the same value into three places instead of setting one variable and using it in three places.
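
To illustrate the complaint (a sketch; the hostnames and names are made up), the per-environment overlay ends up repeating the same hostname three times, where a Helm chart would template a single value:

    # overlays/prod/ingress-patch.yaml -- the same hostname appears three times
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app
      annotations:
        external-dns.alpha.kubernetes.io/hostname: app.prod.example.com  # 1
    spec:
      tls:
        - hosts:
            - app.prod.example.com                                       # 2
          secretName: app-prod-tls
      rules:
        - host: app.prod.example.com                                     # 3
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app
                    port:
                      number: 80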

Helm is popular and Terraform is popular, so they both get talked about a lot, but IMO there is a tool, yet to become popular, that will replace both of them.

stackskipton
1 replies
1d20h

Lack of Variable substitution in Kustomize is downright frustrating. We use Flux so we have the feature anyways, but I wish it was built into Kustomize.
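
For anyone unfamiliar, the Flux feature in question is (as I understand it) post-build variable substitution on the Kustomization object, roughly:

    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: app
      namespace: flux-system
    spec:
      interval: 10m
      path: ./deploy/overlays/prod
      prune: true
      sourceRef:
        kind: GitRepository
        name: app
      postBuild:
        substitute:
          cluster_domain: prod.example.com   # available as ${cluster_domain} in manifests
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars

Any ${cluster_domain}-style placeholder in the rendered manifests gets filled in after the kustomize build, which is exactly the variable substitution plain Kustomize refuses to add.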

no_circuit
0 replies
1d1h

I don't miss variable substitution at all.

For my setup, anything that needs to be variable or secret gets specified in a custom json/yaml file which is read by a plugin, which in turn outputs the rendered manifest if I can't write it as a "patch". That way the CI/CD runner can access things like resolved production secrets without those being accessible to developers who lack elevated access. It requires some digging, but there are even annotations that control things like whether Kustomize should add a hash suffix to ConfigMap or Secret manifests you generate with plugins.

wrs
0 replies
1d20h

I agree, I wouldn’t generate k8s from Terraform either, that’s just the alternative I thought the OP was presenting. But I’d still rather convert charts from Helm to pretty much anything else than maintain them.

tionate
0 replies
1d13h

Re your Kustomize complaint: just create a complete env-specific Ingress for each env instead of patching.

- It is not really any more lines.
- It doesn't break if dev upgrades to a different version of the resource (has happened before).
- It allows you to experiment in dev with other setups (e.g. additional ingresses, different paths, etc.) instead of changing a base config that will impact other envs.

TLDR patch things that are more or less the same in each env; create complete resources for things that change more.

There is a bit of duplication, but it is a lot simpler (see 'Simple Made Easy' by Rich Hickey) than tracing through patches/templates.

3np
0 replies
12h5m

You can deploy your Helm charts through Terraform, even. It's been several years since, so the situation might have improved, but last I worked this way it was OK except for state drift due to gaps in the Helm TF provider. Still found it better than either by itself.

solatic
2 replies
1d12h

My current employer (BigCo) manages both infra and deployments with Terraform, at (ludicrous) scale. It's a nightmare. The problem with Terraform is that you must plan your workspaces so that you don't exceed the best-practice number of resources per workspace (~100-200), or else plans will drastically slow down your time-to-deploy, checking stuff like databases and networking that you haven't touched and have no desire to touch. In practice this means creating a latticework of Terraform workspaces that trigger each other, and there are currently no good open-source tools that support it.

Best practice as I can currently see it is to have Terraform set up what you need for continuous delivery (e.g. ArgoCD) as part of the infrastructure, then use the CD tool to handle day-to-day deployments. Most CD tooling then asks you to package your deployment in something like Helm.

chucky_z
1 replies
1d5h

You can setup dependent stacks in CDKTF. It’s far from as clean as a standard TF DAG plan/apply but I’m having a lot of success with it right now. If I were actively using k8s at the moment I would probably setup dependent cluster resources using this method, e.g: ensure a clean, finished CSI daemon deployment before deploying a deployment using that CSI provider :)

solatic
0 replies
1d4h

You're right that CDKTF with dependent stacks is probably better than nothing, but (a) CDKTF's compatibility with OpenTofu depends on a lack of breaking changes in CDKTF, since the OpenTofu team didn't fork CDKTF, so this is a little hairy for building production infrastructure; (b) CDKTF stacks, even when they can run in parallel, still run on the same machine that invoked CDKTF. When you have (ludicrous scale) X number of "stacks", this isn't a good fit. It's something that should be doable in one of the managed Terraform services, but the pricing if you try to do (ludicrous scale) parallelism gets to be insane.

BobbyJo
1 replies
1d21h

Helm is quite often the default supported way of launching containerized third-party products. I have worked at two separate startups whose 'on prem' product was offered this way.

freedomben
0 replies
1d21h

Indeed. I try hard to minimize the amount of Helm we use, but a significant amount of tools are only shipped as Helm charts. Fortunately I'm increasingly seeing people provide "raw k8s" yaml, but it's far from universal.

smellybigbelly
0 replies
1d23h

Our team also suffered from the problems you described with public helm charts. There is always something you need to customise to make things work in your own environment. Our approach has been to use the public helm chart as-is and do any customisation with `kustomize --enable-helm`.

mnahkies
0 replies
22h16m

Whilst I'll agree that writing helm charts isn't particularly delightful, consuming them can be.

In our case we have a single application/service base helm chart that provides sane defaults and all our deployments extend from. The amount of helm values config required by the consumers is minimal, and there has been very little occasion for a consumer to include their own templates - the base chart exposes enough knobs to avoid this.
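
To give a feel for the consumer side of that (these keys are hypothetical, not any particular base chart's actual interface), a service's entire deployment config can boil down to something like:

    # values.yaml for a service extending the shared base chart (illustrative keys)
    image:
      repository: registry.example.com/payments-api
      tag: "1.42.0"
    replicaCount: 3
    ingress:
      host: payments.internal.example.com
    resources:
      requests:
        cpu: 250m
        memory: 256Mi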

When it comes to third-party charts, many we've been able to deploy as is (sometimes with some PRs upstream to add extra functionality), and occasionally we've needed to wrap/fork them. We've deployed far more third-party charts as-is than not though.

One thing probably worth mentioning w.r.t. maintaining our custom charts is the use of helm unittest (https://github.com/helm-unittest/helm-unittest) - it's been a big help to avoid regressions.

We do manage a few kubernetes resources through terraform, including Argo CD (via the helm provider, which is rather slow when you have a lot of CRDs), but generally we've found helm charts deployed through Argo CD to be much more manageable and productive.

gouggoug
0 replies
1d20h

Talking about helm - I personally have come to profoundly loathe it. It was amazing when it came out and filled a much needed gap.

However, it is loaded with so many footguns that I spend my time redoing and debugging other engineers' work.

I'm hoping this new tool called "timoni" picks up steam. It fixes pretty much every qualm I have with helm.

So if, like me, you're looking for a better solution, go check out timoni.

brainzap
0 replies
56m

Helm charts are a bit painful, but they have a few critical features: atomic deploys that roll back on failure, the ability to generate the full kubernetes definition with helm template, and the ability to print out all configuration values with descriptions.

At our company we have all deployments wrapped into a flat helm chart with as little variables as possible. (I always have to fight for that because devs like to abstract helm 100 levels and end up with nothing)

JohnMakin
0 replies
1d19h

It's completely cursed, but I've started deploying helm via terraform lately. Many people, ironically me included, find that managing deployments via terraform is an anti pattern.

I'm giving it a try and I don't despise it yet, but it feels gross - application configs are typically far more mutable and dynamic than cloud infrastructure configs, and IME, terraform does not like super dynamic configs.

kachapopopow
10 replies
1d16h

It appears to me that people don't really understand kubernetes here.

Kubernetes does not mean microservices, it does not mean containerization and isolation, hell, it doesn't even mean service discovery most of the time.

The default smallest kubernetes installation provides you two things: kubelet (the scheduling agent) and kubeapi.

What do these two allow you to do? KubeApi provides an API to interact with kubelet instances by telling them what to do with manifests.

That's all, that's all kubernetes is: just a dumb agent with some default bootstrap behavior that allows you to interact with a backend database.

Now, let's get into kubernetes default extensions:

- CoreDNS - linking service names to service addresses.

- KubeProxy - routing traffic from host to services.

- CNI(many options) - Networking between various service resources.

After that, kubernetes is whatever you want it to be. It can be something you use to spawn a few test databases. Deploy an entire production-certified clustered database. Run a full distributed fs with automatic device discovery. Deploy backend services if you want to take advantage of service discovery, autoscaling and networking. Or it can be something as small as deploying monitoring (such as node-exporter) to every instance.

And as a bonus, it allows you to do it from the comfort of your own local computer.
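
To make the "monitoring on every instance" example above concrete, that really is just one DaemonSet (a minimal sketch; the image tag is illustrative):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: node-exporter
      template:
        metadata:
          labels:
            app: node-exporter
        spec:
          hostNetwork: true   # expose host-level metrics
          hostPID: true
          containers:
            - name: node-exporter
              image: quay.io/prometheus/node-exporter:v1.8.1  # illustrative tag
              ports:
                - containerPort: 9100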

This article says that Figma migrated the necessary services to kubernetes to improve developer experience, and clearly said that things that don't need to be on kubernetes aren't. For all we know they still run some services on raw instances and only use kubernetes where it helps. And to add to all of that, kubernetes doesn't care where it runs, which is a great way to increase competition between cloud providers, lowering costs for everyone.

tbrownaw
4 replies
1d14h

Is it very common to use it without containers?

otabdeveloper4
1 replies
1d7h

It is impossible.

lisnake
0 replies
19h39m

it's possible with virtlet or kubevirt

kachapopopow
0 replies
19h33m

I run it that way on my Windows machines; the image is downloaded and executed directly.

This ties into a funny example: k8s manages my VMs via KubeVirt, and those then have a minimal k8s version installed that runs my jobs. The implementation simply mounts the extracted image to a virtual fs, executes it there, then deletes the file system.

darby_nine
0 replies
1d1h

You can use VMs instead. I don't think the distinction matters very much though.

consteval
2 replies
1d3h

To be fair, at the small scales you're talking about (maybe 1-2 machines) systemd does the same stuff, just better with less complexity. And there's various much simpler ways to automate your deployments.

If you don't have a distributed system then personally I think k8s makes no sense.

kachapopopow
0 replies
19h31m

I did say that with a few machines it can be overkill, but once you go beyond a handful of 2-3 machine setups, or 6+ machines, it gets overwhelming really fast. Kubernetes in its smallest form is around 50MiB of memory and 0.1 CPU.

Too
0 replies
12h14m

How do you deploy to systemd? How do you run a container in systemd? Now you need a second and third system, perhaps Ansible and docker-compose, which is simple on the surface but quickly grows in complexity, with homemade glue to keep all the loose components together.

I agree that for a handful of pet servers and a team with more existing Linux experience than k8s experience, this is a better starting point, because of the shorter learning curve. Let's just not kid ourselves that the end product has any less complexity; it's only a different skill set.

lmm
1 replies
1d13h

Kubernetes absolutely means containerisation in practice. There is no other supported way of doing things with it. And "fake" abstraction where you pretend something is generic but it's actually not is one of the easiest ways to overcomplicate anything.

kachapopopow
0 replies
19h33m

If you disable the security policy and remount to PID 1, you escape any encapsulation. Or you can use a k8s implementation that just extracts the image and runs it.

But that's assuming you're running containerd or something similar. There are dozens of k8s implementations, some as light as only providing you with manifests, which external schedulers (called controllers) then subscribe to and act on.

jb1991
9 replies
2d

Can anyone advise what is the most common language used in enterprise settings for interfacing with K8s?

JohnMakin
2 replies
2d

IME almost exclusively golang.

roshbhatia
0 replies
2d

++, most controllers are written in go, but there's plenty of client libraries for other languages.

A common pattern you'll see though is skipping writing any sort of code and instead using a higher level dsl-ish configuration usually via yaml, using tools like Kyverno.
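
To give a flavor of that pattern: a Kyverno policy is just more YAML (a sketch; the require-a-team-label rule is only an illustration):

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-team-label
    spec:
      validationFailureAction: Enforce
      rules:
        - name: check-team-label
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          validate:
            message: "The label `team` is required."
            pattern:
              metadata:
                labels:
                  team: "?*"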

angio
0 replies
1d9h

I'm seeing a lot of custom operators written in Rust nowadays. Obviously biased because I do a lot of rust myself so people I'm talking to also do rust.

cortesoft
1 replies
2d

A lot of yaml

yen223
0 replies
1d20h

The fun kind of yaml that has a lot of {{ }} in it, which breaks your syntax highlighter.

mplewis
0 replies
1d20h

I have seen more Terraform than anything else.

gadflyinyoureye
0 replies
2d

Depends on what you mean. Helm will control a lot. You can generate the yaml files from any language. You can also administer it with command-line tools, so again any language, but often zsh or bash.

bithavoc
0 replies
1d15h

If you're talking about connecting to Kubernetes and creating resources programmatically, Pulumi allows you to interface with it from all the languages they support (JS, TS, Go, C#, Python), including wrapping up Helm charts and injecting secrets (my personal favorite).

If you want to build your own Kubernetes Custom Resources and Controllers, Go works pretty well for that.

akdor1154
0 replies
1d20h

On the platform consumer side (app infra description) - well schema'd yaml, potentially orchestrated by helm ("templates to hellish extremes") or kustomize ("no templates, this is the hill we will die on").

On the platform integration/hook side (app code doing specialised platform-specific integration stuff, extensions to k8s itself), golang is the lingua franca but bindings for many languages are around and good.

Aeolun
9 replies
2d

Honestly, I find the reasons they name for using Kubernetes flimsy as hell.

“ECS doesn’t support helm charts!”

No shit sherlock, that's a thing literally built on Kubernetes. It's like a government RFP written so that only a single vendor can fulfill it.

Carrok
4 replies
2d

We also encountered many smaller paper cuts, like attempting to gracefully terminate a single poorly behaving EC2 machine when running ECS on EC2. This is easy on Amazon’s Elastic Kubernetes Service (EKS), which allows you to simply cordon off the bad node and let the API server move the pods off to another machine while respecting their shutdown routines.

I dunno, that seems like a very good reason to me.

watermelon0
3 replies
1d20h

I assume that ECS Fargate would solve this, because one misbehaving ECS task would not affect others, and stopping it should still respect the shutdown routines.

ko_pivot
2 replies
1d19h

Fargate is very expensive at scale. Great for small or bursty workloads, but when you’re at Figma scale, you almost always go EC2 for cost-effectiveness.

ihkasfjdkabnsk
0 replies
1d12h

this isn't really true. It was very expensive when it was first released but now it's pretty cost competitive with EC2, especially when you consider easier scale down/up.

Aeolun
0 replies
1d6h

I think when you are at Figma scale you should have learned to keep things simpler. At this point I don’t think the (slightly) lower costs of EC2 weigh up against the benefits of Fargate.

liveoneggs
0 replies
1d19h

recipes and tutorials say "helm" so we need "helm"

cwiggs
0 replies
1d20h

I think what they should have said is "there isn't a tool like Helm for ECS." If you want to deploy a full Prometheus, Grafana, Alertmanager, etc. stack on ECS, good luck with that; no one has written the task definitions for you to consume and override values on.

With k8s you can easily deploy a helm chart that will deploy lots of things that all work together fairly easily.
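
For example, pulling in that whole monitoring stack can be a single Helm chart reference, here expressed as an Argo CD Application since Argo CD comes up elsewhere in the thread (the chart version is illustrative):

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: monitoring
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://prometheus-community.github.io/helm-charts
        chart: kube-prometheus-stack      # bundles Prometheus, Grafana, Alertmanager
        targetRevision: "61.3.0"          # illustrative version
      destination:
        server: https://kubernetes.default.svc
        namespace: monitoring
      syncPolicy:
        automated: {}
        syncOptions:
          - CreateNamespace=true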

aranelsurion
0 replies
1d22h

To be fair there are many benefits of running on the platform that has the most mindshare.

Unless they are in this space competing against k8s, it's reasonable for them, if they want to use Helm charts, to move to where they can use them.

Also, Helm doesn't work with ECS, and neither do <50 other tools and tech from the CNCF map>.

JohnMakin
0 replies
1d19h

It's almost like people factor in a piece of software's tooling environment before they use the software - wild.

dijksterhuis
5 replies
2d

When applied, Terraform code would spin up a template of what the service should look like by creating an ECS task set with zero instances. Then, the developer would need to deploy the service and clone this template task set [and do a bunch of manual things]

This meant that something as simple as adding an environment variable required writing and applying Terraform, then running a deploy

This sounds less like a problem with ECS and more like an overcomplication in how they were using terraform + ECS to manage their deployments.

I get the generating templates part for verification prior to live deploys. But this seems... dunno.

wfleming
1 replies
2d

Very much agree. I have built infra on ECS with terraform at two companies now, and we have zero manual steps for actions like this, beyond “add the env var to a terraform file, merge it and let CI deploy”. The majority of config changes we would make are that process.

dijksterhuis
0 replies
2d

Yeah... thinking about it a bit more, I just don't see why they didn't set up their CI to deploy a short-lived environment on a push to a feature branch.

To me that seems like the simpler solution.

roshbhatia
1 replies
2d

I'm with you here -- ECS deploys are pretty painless and uncomplicated, but I can picture a few scenarios where this ends up being necessary. For example, if they have a lot of services deployed on ECS, it ends up bloating the size of the Terraform state. That'd slow down plans and applies significantly, which makes sharding the Terraform state by literally cloning the configuration based on a template a lot safer.

freedomben
0 replies
1d21h

ECS deploys are pretty painless and uncomplicated

Unfortunately in my experience, this is true until it isn't. Once it isn't true, it can quickly become a painful blackbox debugging exercise. If your org is big enough to have dedicated AWS support then they can often get help from engineers, but if you aren't then life can get really complicated.

Still not a bad choice for most apps though, especially if it's just a run-of-the-mill HTTP-based app

ianvonseggern
0 replies
1d22h

Hey, author here, I totally agree that this is not a fundamental limitation of ECS and we could have iterated on this setup and made something better. I intentionally listed this under work we decided to scope into the migration process, and not under the fundamental reasons we undertook the migration because of that distinction.

JohnMakin
3 replies
2d

I like how this article clearly and articulately states what it stands to gain from Kubernetes. Many make the jump without knowing what they even stand to gain, or whether they need to in the first place - the reasons given here are good.

samcat116
0 replies
1d23h

They're quite specific in that they mention that teams would like to make use of existing helm charts for other software products. Telling them to build and maintain definitions for those services from scratch is added work in their mind.

JohnMakin
0 replies
2d

I don't really see those rebuttals as all that valid. The reasons given in this article are completely valid, from my perspective of someone who's worked heavily with Kubernetes/ECS.

Helm, for instance, is a great time saver for installing software. Often software will support nothing but helm. Ease of deployment is a good consideration. Their points on networking are absolutely spot on. The scaling considerations are spot on. Killing/isolating unhealthy containers is completely valid. I could go on a lot more, but I don't see a single point listed as invalid.

tedunangst
2 replies
2d

How long will it take to migrate off?

hujun
0 replies
1d19h

Depends on how much "k8s native" code you have. There are applications designed to run on k8s that use a lot of the k8s API, and if your app is already micro-serviced, it is also not straightforward to change it back.

codetrotter
0 replies
1d22h

It’s irreversible.

rayrrr
2 replies
1d4h

Just out of curiosity, is there any other modern system or service that anyone here can think of, where anyone in their right mind would brag about migrating to it in less than a year?

therealdrag0
0 replies
20h56m

I’ve seen many migrations take over a year. It’s less about the technology and more about your tech debt, integration complexity, and resourcing.

jjice
0 replies
1d4h

It's a hard question to answer. Not all systems are equal in size, scope, and impact. K8s as a system is often the core of your infra, meaning everything running will be impacted. That coupled with their team constraints in the article make it sound like a year isn't awful.

One system I can think of off the top of my head is when Amazon moved away from Oracle to fully Amazon/OSS RDBMSs a while ago, but that was multi year I think. If they could have done it in less than a year, they'd definitely be bragging.

ko_pivot
2 replies
1d20h

I’m not surprised that the first reason they state for moving off of ECS was the lack of support for stateful services. The lack of integration between EBS and ECS has always felt really strange to me, considering that AWS already built all the logic to integrate EKS with EBS in a StatefulSet compliant way.
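
For context, the StatefulSet-style pattern being referred to on EKS is per-pod EBS-backed volumes via the EBS CSI driver, roughly (storage class name and image are placeholders):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: queue
    spec:
      serviceName: queue
      replicas: 3
      selector:
        matchLabels:
          app: queue
      template:
        metadata:
          labels:
            app: queue
        spec:
          containers:
            - name: queue
              image: rabbitmq:3.13   # placeholder stateful workload
              volumeMounts:
                - name: data
                  mountPath: /var/lib/rabbitmq
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: gp3    # assumes an EBS CSI StorageClass
            resources:
              requests:
                storage: 20Gi

Each replica gets a volume of its own that follows it across reschedules (within the volume's AZ).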

ko_pivot
0 replies
1d19h

This adds support for ephemeral EBS volumes. When a task is created a volume gets created, and when the task is destroyed, for whatever reason, the volume is destroyed too. It has no concept of task identity. If the task needs to be moved to a new host, the volume is destroyed.

datadeft
2 replies
1d11h

Migrating onto Kubernetes can take years

What the heck am I reading? For whom? I am not sure why companies even bother with such migrations. Where is the business value? Where is the gain for the customer? Is this one of those "l'art pour l'art" projects that Figma does just because it can?

xorcist
0 replies
1d7h

It solves the "we have recently been acquired and have a lot of resources that we must put to use" problem.

kevstev
0 replies
1d3h

FWIW... I was pretty taken aback by this statement as well- and also the "brag" that they moved onto K8s in less than a year. At a very well established firm ~30 years old and with the baggage that came with it, we moved to K8s in far less time- though we made zero attempt to move everything to k8s, just stuff that could benefit from it. Our pitch was more or less- move to k8s and when we do the planned datacenter move at the end of the year, you don't have to do anything aside from a checkout. Otherwise you will have to redeploy your apps to new machines or VMs and deal with all the headache around that. Or you could just containerize now if you aren't already and we take care of the rest. Most migrated and were very happy with the results.

There were plenty of services that were latency sensitive or in the HPC realm where it made no sense to force a migration, though, and there was no attempt to shoehorn them in.

_pdp_
2 replies
1d21h

In my own experience, AWS Fargate is easier, more secure, and way more robust than running your own K8s, even with EKS.

watermelon0
1 replies
1d20h

Do you mean ECS Fargate? Because you can use AWS Fargate with EKS, with some limitations.

_pdp_
0 replies
1d8h

Yes, ECS Fargate.

breakingcups
1 replies
1d22h

I feel so out of touch when I read a blog post which casually mentions 6 CNCF projects with kool names that I've never heard of, for gaining seemingly simple functionality.

I'm really wondering if I'm aging out of professional software development.

renewiltord
0 replies
1d21h

Nah, there's lots of IC work. It just means that you're unfamiliar with one approach to org scaling: hardware abstraction, logging, and retries handled by a platform team.

It’s not the only approach so you may well be familiar with others.

twodave
0 replies
1d21h

TL;DR because they already ran everything in containers. Having performed a migration where this wasn’t the case, the path from non-containerized to containerized is way more effort than going from containerized non-k8s to k8s.

syngrog66
0 replies
1d1h

k8s and "12 months" -> my priors likely confirmed. ha

surfingdino
0 replies
1d22h

ECS makes sense when you are building and breaking stuff. K8s makes sense when you are mature (as an org).

strivingtobe
0 replies
1d17h

At the time we did not auto-scale any of our containerized services and were spending a lot of unnecessary money to keep services provisioned such that they could always handle peak load, even on nights and weekends when our traffic is much lower.

Huh? You've been running on AWS for how long and haven't been using auto scaling AT ALL? How was this not priority number one for the company to fix? You're just intentionally burning money at that point!

While there is some support for auto-scaling on ECS, the Kubernetes ecosystem has robust open source offerings such as Keda for auto-scaling. In addition to simple triggers like CPU utilization, Keda supports scaling on the length of an AWS Simple Queue Service (SQS) queue as well as any custom metrics from Datadog.

ECS autoscaling is easy, and supports these things. Fair play if you just really wanted to use CNCF projects, but this just seems like you didn't really utilize your previous infrastructure very well.
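
(For reference, the KEDA side of what's being described is only a few lines of YAML -- a sketch, with placeholder queue URL, names, and thresholds:)

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: worker
    spec:
      scaleTargetRef:
        name: worker                 # Deployment to scale
      minReplicaCount: 1
      maxReplicaCount: 50
      triggers:
        - type: aws-sqs-queue
          metadata:
            queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder
            queueLength: "100"       # target messages per replica
            awsRegion: us-east-1
            identityOwner: operator  # use the KEDA operator's IAM role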

sjkoelle
0 replies
1d1h

The title alone is a teensie bit hilarious

ravedave5
0 replies
1d2h

Completely left out of this post and most of the conversation is that being on K8s makes it much, much easier to go multi-cloud. K8s is K8s.

jstrong
0 replies
12h47m

enjoyed the article - provides a comprehensive tour of the benefits of deploying in a tmux session (seamless restarts via ctrl-c, up, enter). for super important stuff, a systemd unit is worth considering.

jokethrowaway
0 replies
1d17h

In which universe migrating from docker containers in ECS to Kubernetes is an effort measured in years?

jmdots
0 replies
1d18h

Please just use it as a warehouse-scale computer and don't make node groups into pets.

ec109685
0 replies
1d16h

At high-growth companies, resources are precious

Yeah, at those low growth companies, you have unlimited resources /s

andrewguy9
0 replies
1d22h

I look forward to the blog post where they get off K8, in just 18 months.

Ramiro
0 replies
19h37m

I love reading these "reports from the field"; I always pick up a thing or two. Thanks for sharing @ianvonseggern!

05bmckay
0 replies
1d3h

I don't think this is the flex they think it is...