This was a huge technical problem I worked on at Google, and is sort of fundamental to a cloud. I believe this is actually a big deal that drives people's technology directions.
SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSDs.
According to the submitted article, the numbers are from AWS instance types where the SSD is "physically attached" to the host, not about SSD-backed NAS solutions.
Also, the article isn't just about SSDs being no faster than a network. It's about SSDs being two orders of magnitude slower than datacenter networks.
It's because the "local" SSDs are not actually physically attached and there's a network protocol in the way.
I think you're wrong about that. AWS calls this class of storage "instance storage" [0], and defines it as:
There might be some wiggle room in "physically attached", but there's none in "storage devices located inside the host computer". It's not some kind of AWS-only thing either. GCP has "local SSD disks"[1], which I'm going to claim are likewise local, not over-the-network block storage. (Though the language isn't as explicit as for AWS.)
[0] https://aws.amazon.com/ec2/instance-types/
[1] https://cloud.google.com/compute/docs/disks#localssds
That's the abstraction they want you to work with, yes. That doesn't mean it's what is actually happening - at least not in the same way that you're thinking.
As a hint for you, I said "a network", not "the network." You can also look at public presentations about how Nitro works.
it sounds like you're trying to say "PCI switch" without saying "PCI switch" (I worked at Google for over a decade, including hardware division).
That is what I am trying to say without actually spelling it out. PCIe switches are very much not transparent devices. Apparently AWS has not published anything about this, though, and doesn't have Nitro mediating access to "local" SSDs - I did get that confused with EBS.
Why are you acting as if PCIe switches are some secret technology? It was extremely grating for me to read your comments.
Although it had used them for years, the first public mention by Google of PCIe switches was probably in the 2022 Aquila paper, which doesn't really talk about storage anyway...
I don't understand why you would expect Google to state that. They have been standard technology for almost two decades. You don't see Google claiming they use JTAG or SPI flash or whatever. It's just not special.
Because the parent works/worked for Google, so obviously it must be super secret sauce that nobody has heard of. /s
Next up they're going to explain to us that iSCSI wants us to think it's SCSI but it's actually not!
AWS has stated that there is a "Nitro Card for Instance Storage"[0][1] which is a NVMe PCIe controller that implements transparent encryption[2].
I don't have access to an EC2 instance to check, but you should be able to see the PCIe topology to determine how many physical cards are likely in i4i and im4gn and their PCIe connections. i4i claims to have 8 x 3,750 AWS Nitro SSD, but it isn't clear how many PCIe lanes are used.
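If anyone does have an instance handy, a rough sysfs walk like this (plain Linux, nothing AWS-specific; device names are just examples) shows the chain of PCI hops between each NVMe controller and the root complex, which is exactly the "is there a switch in the way" question:

    # List NVMe controllers and the PCI addresses on their sysfs path.
    # Each component is a hop (root port, bridge/switch, endpoint);
    # a long chain suggests a bridge or switch in the path.
    import glob, os

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        dev = os.path.realpath(os.path.join(ctrl, "device"))
        hops = [p for p in dev.split("/") if p.count(":") == 2]
        print(os.path.basename(ctrl), " -> ".join(hops))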
Also, AWS claims "Traditionally, SSDs maximize the peak read and write I/O performance. AWS Nitro SSDs are architected to minimize latency and latency variability of I/O intensive workloads [...] which continuously read and write from the SSDs in a sustained manner, for fast and more predictable performance. AWS Nitro SSDs deliver up to 60% lower storage I/O latency and up to 75% reduced storage I/O latency variability [...]"
This could explain the findings in the article - they only measured peak r/w, not predictability.
[0] https://perspectives.mvdirona.com/2019/02/aws-nitro-system/ [1] https://aws.amazon.com/ec2/nitro/ [2] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Power...
Like many other people in this thread, I disagree that a PCI switch means that an SSD "is connected over a network" to the host bus.
Now if you can show me two or more hosts connected to a box of SSDs through a PCI switch (and some sort of cool tech for coordinating between the hosts), that's interesting.
I've linked to public documentation that is pretty clearly in conflict with what you said. There's no wiggle room in how AWS describes their service without it being false advertising. There's no "ah, but what if we define the entire building to be the host computer, then the networked SSDs really are inside the host computer" sleight of hand to pull off here.
You've provided cryptic hints and a suggestion to watch some unnamed presentation.
At this point I really think the burden of proof is on you.
You are correct, and the parent you’re replying to is confused. Nitro is for EBS, not the i3 local NVMe instances.
Those i3 instances lose your data whenever you stop and start them again (ie migrate to a different host machine), there’s absolutely no reason they would use network.
EBS itself uses a different network than the “normal” internet, if I were to guess it’s a converged Ethernet network optimized for iSCSI. Which is what Nitro optimizes for as well. But it’s not relevant for the local NVMe storage.
The argument could also be resolved by just getting the latency numbers for both cases and comparing them; on bare metal it shouldn't be more than a few hundred nanoseconds.
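Something like this (a rough sketch; Linux-only, needs root, and the device path is a placeholder) gives comparable uncached 4 KiB random-read latencies for an instance-store device vs an EBS volume:

    # Time uncached 4 KiB random reads against a block device.
    # O_DIRECT bypasses the page cache; buffer and offsets must be aligned.
    import mmap, os, random, statistics, time

    DEV = "/dev/nvme1n1"        # placeholder: the device under test
    BLOCK = 4096
    SAMPLES = 1000

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap gives a page-aligned buffer

    lat_us = []
    for _ in range(SAMPLES):
        off = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], off)
        lat_us.append((time.perf_counter() - t0) * 1e6)
    os.close(fd)

    lat_us.sort()
    print(f"p50={statistics.median(lat_us):.0f}us p99={lat_us[int(SAMPLES * 0.99)]:.0f}us")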
I see wiggle room in the statement you posted in that the SSD storage that is physically inside the machine hosting the instance might be mounted into the hypervised instance itself via some kind of network protocol still, adding overhead.
At minimum, the entire setup will be virtualized, which does add overhead.
Nitro "virtual NVME" device are mostly (only?) for EBS -- remote network storage, transparently managed, using a separate network backbone, and presented to the host as a regular local NVME device. SSD drives in instances such as i4i, etc. are physically attached in a different way -- but physically, unlike EBS, they are ephemeral and the content becomes unavaiable as you stop the instance, and when you restart, you get a new "blank slate". Their performance is 1 order of magnitude faster than standard-level EBS, and the cost structure is completely different (and many orders of magnitude more affordable than EBS volumes configured to have comparable I/O performance).
This is the way Azure temporary volumes work as well. They are scrubbed off the hardware once the VM that accesses them is dead. Everything else is over the network.
Both the documentation and Amazon employees are in here telling you that you're wrong. Can you resolve that contradiction or do you just want to act coy like you know some secret? The latter behavior is not productive.
The parent thinks that AWS' i3 NVMe local instance storage is using a PCIe switch, which is not the case. EBS (and the AWS Nitro card) use a PCIe switch, and as such all EBS storage is exposed as e.g. /dev/nvmeXnY . But that's not the same as the i3 instances are offering, so the parent is confused.
If the SSD is installed in the host server, doesn't that still allow for it to be shared among many instances running on said host? I can imagine that a compute node has just a handful of SSDs and many hundreds of instances sharing the I/O bandwidth.
How do these machines manage the sharing of one local SSD across multiple VMs? Is there some wrapper around the I/O stack? Does it appear as a network share? Genuinely curious...
With Linux and KVM/QEMU, you can map an entire physical disk, disk partition, or file to a block device in the VM. For my own VM hosts, I use LVM and map a logical volume to the VM. I assumed cloud providers did something conceptually similar, only much more sophisticated.
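For the curious, the bare-QEMU version of that is just pointing -drive at the logical volume (a minimal sketch; the paths, memory size, and LV name are made up, and real hosts usually drive this through libvirt instead):

    # Boot a KVM guest whose second disk is a raw LVM logical volume,
    # exposed to the guest as a virtio block device.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64", "-enable-kvm", "-m", "4096",
        "-drive", "file=guest-root.qcow2,format=qcow2,if=virtio",
        "-drive", "file=/dev/vg0/guest-data,format=raw,if=virtio,cache=none",
    ], check=True)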
Heh, you'd probably be surprised, there's some really cool cutting edge stuff being done in those data centers but a lot of what is done is just plain old standard server management without much in the way of tricks. It's just that someone else does it instead of you and the billing department is counting milliseconds.
Do cloud providers document these internals anywhere? I'd love to read about that sort of thing.
Not generally, especially not the super generic stuff. Where they really excel is having the guy that wrote the kernel driver or hypervisor on staff. But a lot of it is just an automated version of what you'd do on a smaller scale
Files with reflinks are a common choice, the main benefit being: only storing deltas. The base OS costs basically nothing
LVM/block like you suggest is a good idea. You'd be surprised how much access time is trimmed by skipping another filesystem like you'd have with a raw image file
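(On a reflink-capable filesystem like XFS or btrfs, that clone is a one-liner; a tiny sketch with made-up image paths:)

    # Copy-on-write clone of a base image: only blocks that later diverge
    # consume new space. Needs XFS/btrfs or another reflink-capable filesystem.
    import subprocess

    subprocess.run(
        ["cp", "--reflink=always", "/var/lib/images/base.img", "/var/lib/images/vm42.img"],
        check=True,
    )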
In say VirtualBox you can create a file backed on the physical disk, and attach it to the VM so the VM sees it as a NVMe drive.
In my experience this is also orders of magnitude slower than true direct access, i.e. PCIe pass-through, as all access has to pass through the VM storage driver, and so could explain what is happening.
The storage driver may have more impact on VBox. You can get very impressive results with 'virtio' on KVM
Yeah I've yet to try that. I know I get a similar lack of performance with Bhyve (FreeBSD) using VirtIO, so it's not a given it's fast.
I have no idea how AWS run their VMs, was just saying a slow storage driver could give such results.
Oh, absolutely - not to contest that! There's a whole lot of academia on 'para-virtualized' and so on in this light.
That's interesting to hear about FreeBSD; basically all of my experience has been with Linux/Windows.
Probably NVME namespaces [0]?
[0]: https://nvmexpress.org/resource/nvme-namespaces/
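On the host side the split is visible in sysfs; a quick sketch (standard Linux layout, nothing cloud-specific) that shows which namespaces each controller exposes:

    # Print each NVMe controller and its namespaces, e.g. nvme0 -> ['nvme0n1', 'nvme0n2'].
    import glob, os, re

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        name = os.path.basename(ctrl)
        namespaces = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(ctrl, name + "n*"))
            if re.fullmatch(name + r"n\d+", os.path.basename(p))
        )
        print(name, namespaces)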
Less fancy, quite often... at least on VPS providers [1]. They like to use reflinked files off the base images. This way they only store what differs.
1: Which is really a cloud without a certain degree of software defined networking/compute/storage/whatever.
AWS have custom firmware for at least some of their SSDs, so could be that
Instance storage is not networked. That's why it's there.
PCI bus, etc too
If you have one of the metal instance types, then you get the whole host, e.g. i4i.metal:
https://aws.amazon.com/ec2/instance-types/i4i/
On AWS yes, the older instances which I am familiar with had 900GB drives and they sliced that up into volumes of 600, 450, 300, 150, 75GB depending on instance size.
But they also tell you how much IOPS you get: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...
The tests were for these local (metal direct-connect) SSDs. The issue is not network overhead -- it's that, just like everything else in cloud, the performance of 10 years ago was used as the baseline that carries over today, with upcharges to buy back the gains.
There is a reason why vCPU performance is still locked to the typical core from 10 years ago when every core on a machine in those data centers today is 3-5x faster or more. It's because they can charge you for 5x the cores to get that gain.
vcpu performance is still locked to the typical core from 10 years ago
No. In some cases I think AWS actually buys special processors that are clocked higher than the ones you can buy.
You are talking about real CPU not virtual cpu
Generally each vCPU is a dedicated hardware thread, which has gotten significantly faster in the last 10 years. Only lambdas, micros, and nanos have shared vCPUs and those have probably also gotten faster although it's not guaranteed.
In fairness, there are a not insignificant number of workloads that do not benefit from hardware threads on CPUs [0], instead isolating processes along physical cores for optimal performance.
[0] Assertion not valid for barrel processors.
The parent claims that though aws uses better hardware, they bill in vcpus whose benchmarks are from a few years ago, so that they can sell more vcpu units per performant physical cpu. This does not contradict your claim that aws buys better hardware.
It's so obviously wrong that I can't really explain it. Maybe someone else can. To believe that requires a complete misunderstanding of IaaS.
That is transparently nonsense.
You can disprove that claim in 5 minutes, and it makes literally zero sense for offerings that aren't oversubscribed
AWS is so large, every concept of hardware is virtualized over a software layer. “Instance storage” is no different. It’s just closer to the edge with your node. It’s not some box in a rack where some AWS tech slots in an SSD. AWS has a hardware layer, but you’ll never see it.
Local SSD is part of the machine, not network attached.
You’re wrong. Instance local means SSD is physically attached to the droplet and is inside the server chassis, connected via PCIe.
Source: I work on Nitro cards.
"Attached to the droplet"?
Droplets are what EC2 calls their hosts. Confusing? I know.
Yes! That is confusing! Tell them to stop it!
FYI it's not a AWS term, it's a DigitalOcean term.
I could not be more confused. Does EC2 quietly call their hosting machines "droplets"? I knew "droplet" to be a DigitalOcean term, but DigitalOcean doesn't have Nitro cards.
Now I'm wondering if that's where DO got the name in the first place
Surely "droplet" is a derivative of "ocean?"
Clouds (like, the big fluffy things in the sky) are made up of many droplets of liquid. Using "droplet" to refer to the things that make up cloud computing is a pretty natural nickname for any cloud provider, not just DO. I do imagine that DO uses "droplet" as a public product branding because it works well with their "Ocean" brand, though.
...now I'm actually interested in knowing if "droplet" is derived from "ocean", or if "Digital Ocean" was derived from having many droplets (which was derived from cloud). Maybe neither.
Clouds are water vapor, not droplets.
“Cloud: Visible mass of liquid droplets or frozen crystals suspended in the atmosphere“
https://en.wikipedia.org/wiki/Cloud
I believe AWS was calling them droplets prior to digital ocean.
digitalocean squad
No, that’s AWS.
That is more than likely a team-specific term being used outside of its context. FYI, the only place where you will find the term "droplet" used is in the public-facing AWS EC2 API documentation under InstanceTopology:networkNodeSet[^1]. Even that reference seems like a slip of the tongue, but the GP did mention working on the Nitro team, which makes sense when you look at the EC2 instance topology[^2].
[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_I... [^2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-ec2-...
Depends on the cloud provider. Local SSDs are physically attached to the host on GCP, but that makes them only useful for temporary storage.
If you're at G, you should read the internal docs on exactly how this happens and it will be interesting.
Why would I lose all data on these SSDs when I initiate a power off of the VM on console, then?
I believe local SSDs are definitely attached to the host. They are just not exposed via NVMe ZNS hence the performance hit.
Your EC2 instance with instance-store storage, when stopped, can be launched on any other random host in the AZ when you power it back on. Your root disk is an EBS volume attached across the network, so when you start your instance back up you're likely going to be launched somewhere else, with an empty slot and empty local storage. This is why there is always a disclaimer that this local storage is ephemeral and you shouldn't count on it being around long-term.
I think the parent was agreeing with you. If the “local” SSDs _weren’t_ actually local, then presumably they wouldn’t need to be ephemeral since they could be connected over the network to whichever host your instance was launched on.
It is because on reboot you may not get the same physical server. They are not rebooting the physical server for you, just the VM.
You are not allocated the same machine for a variety of reasons: scheduled maintenance, proximity to other hosts on the VPC, balancing quiet and noisy neighbors, and so on.
It is not that the disk will always be wiped; sometimes the data is still there on reboot. It's just that there is no guarantee, which allows them to freely move instances between hosts.
Data persists between reboots, but not shutdowns:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
Why are you projecting Google's internal architecture onto AWS? Your Google mental model is not correct here.
In most cases, they're physically plugged into a PCIe CEM slot in the host.
There is no network in the way, you are either misinformed or thinking of a different product.
Which is a weird sort of limitation. For any sort of you-own-the-hardware arrangement, NVMe disks are fine for long term storage. (Obviously one should have backups, but that’s a separate issue. One should have a DR plan for data on EBS, too.)
You need to migrate that data if you replace an entire server, but this usually isn’t a very big deal.
This is Hyrum’s law at play: AWS wants to make sure that the instance stores aren’t seen as persistent, and therefore enforce the failure mode for normal operations as well.
You should also see how they enforce similar things for their other products and APIs, for example, most of their services have encrypted pagination tokens.
Yes, that's what their purpose is in cloud applications: temporary high performance storage only.
If you want long term local storage you'll have to reserve an instance host.
Do they do this because they want SSDs to be in a physically separate part of the building for operational reasons? Otherwise, what's the point in giving you a "local" SSD that isn't actually plugged into the real machine?
The reason for having most instances use network storage is that it makes possible migrating instances to other hosts. If the host fails, the network storage can be pointed at the new host with a reboot. AWS sends out notices regularly when they are going to reboot or migrate instances.
There probably should be more local instance storage types for use with instances that can be recreated without loss. But it is simple for them to have a single way of doing things.
At work, someone used fast NVMe instance storage for Clickhouse which is a database. It was a huge hassle to copy data when instances were going to be restarted because the data would be lost.
Sure, I understand that, but this user is claiming that on GCP even local SSDs aren't really local, which raises the question of why not.
I suspect the answer is something to do with their manufacturing processes/rack designs. When I worked there (pre GCP) machines had only a tiny disk used for booting and they wanted to get rid of that. Storage was handled by "diskful" machines that had dedicated trays of HDDs connected to their motherboards. If your datacenters and manufacturing processes are optimized for building machines that are either compute or storage but not both, perhaps the more normal cloud model is hard to support and that pushes you towards trying to aggregate storage even for "local" SSD or something.
The GCE claim is unverified. OP seems to be referring to PD-SSD and not LocalSSD
GCE local SSDs absolutely are on the same host as the VM. The docs [0] are pretty clear on this, I think:
Disclosure: I work on GCE.
[0] https://cloud.google.com/compute/docs/disks/local-ssd
They're claiming so, but they're wrong.
This post on how Discord RAIDed local NVMe volumes with slower remote volumes might be of interest: https://discord.com/blog/how-discord-supercharges-network-di...
We moved to running Clickhouse on EKS with EBS volumes for storage. It can better survive instances going down. I didn't work on it, so I don't know how much slower it is. Lowering the management burden was a big priority.
Are you saying that a reboot wipes the ephemeral disks? Or a stop and then start of the instance from the AWS console/API?
Reboot keeps the instance storage volumes. Stopping and starting wipes them. Starting frequently migrates you to a new host. And the "restart" notices AWS sends are likely because the host has a problem and they need to migrate it.
The comment you’re responding to is wrong. AWS offers many kinds of storage. Instance local storage is physically attached to the droplet. EBS isn’t but that’s a separate thing entirely.
I literally work in EC2 Nitro.
That seems like a big opportunity for other cloud providers. They could provide SSDs that are actually physically attached and boast (rightfully) that their SSDs are a lot faster, drawing away business from older cloud providers.
For what kind of workloads would a slower SSD be a significant bottleneck?
I run very large database-y workloads. Storage bandwidth is by far the throughput rate limiting factor. Cloud environments are highly constrained in this regard and there is a mismatch between the amount of CPU you are required to buy to get a given amount of bandwidth. I could saturate a much faster storage system with a fraction of the CPU but that isn’t an option. Note that latency is not a major concern here.
This has an enormous economic impact. I once did a TCO study with AWS to run data-intensive workload running on purpose-built infrastructure on their cloud. AWS would have been 3x more expensive per their own numbers, they didn’t even argue it. The main difference is that we had highly optimized our storage configuration to provide exceptional throughput for our workload on cheap hardware.
I currently run workloads in the cloud because it is convenient. At scale though, the cost difference to run it on your own hardware is compelling. The cloud companies also benefit from a learned helplessness when it comes to physical infrastructure. Ironically, it has never been easier to do a custom infrastructure build, which companies used to do all the time, but most people act like it is deep magic now.
Thanks for the details!
Does this mean you're colocating your own server in a data center somewhere? Or do you have your own data center/running it off a bare metal server with a business connection?
Just wondering if the TCO included the same levels of redundancy and bandwidth, etc.
We were colocated in large data centers right on the major IX with redundancy. All of this was accounted for in their TCO model. We had a better switch fabric than is typical for the cloud but that didn’t materially contribute to cost. We were using AWS for overflow capacity when we exceeded the capacity of our infrastructure at the time; they wanted us to move our primary workload there.
The difference in cost could be attributed mostly to the server hardware build, and to a lesser extent the better scalability with a better network. In this case, we ended up working with Quanta on servers that had everything we needed and nothing we didn’t, optimizing heavily for bandwidth/$. We worked directly with storage manufacturers to find SKUs that stripped out features we didn’t need and optimized for cost per byte given our device write throughput and durability requirements. They all have hundreds of custom SKUs that they don’t publicly list, you just have to ask. A hidden factor is that the software was designed to take advantage of hardware that most enterprises would not deign to use for high-performance applications. There was a bit of supply chain management but we did this as a startup buying not that many units. The final core server configuration cost us just under $8k each delivered, and it outperformed every off-the-shelf server for twice the price and essentially wasn’t something you could purchase in the cloud (and still isn’t). These servers were brilliant, bulletproof, and exceptionally performant for our use case. You can model out the economics of this and the zero-crossing shows up at a lower burn rate than I think many people imagine.
We were extremely effective at using storage, and we did not attach it to expensive, overly-powered servers where the CPUs would have been sitting idle anyway. The sweet spot was low-clock high-core CPUs, which are typically at a low-mid price point but optimal performance-per-dollar if you can effectively scale software to the core count. Since the software architecture was thread-per-core, the core count was not a bottleneck. The economics have not shifted much over time.
AWS uses the same pricing model as everyone else in the server leasing game. Roughly speaking, you model your prices to recover your CapEx in 6 months of utilization. Ignoring overhead, doing it ourselves pulled that closer to 1.5-2 months for the same burn. This moves a lot of the cost structure to things like power, space, and bandwidth. We definitely were paying more for space and power than AWS (usually less for bandwidth) but not nearly enough to offset our huge CapEx advantage relative to workload.
All of this can be modeled out in Excel. No one does it anymore but I am from a time when it was common, so I have that skill in my back pocket. It isn’t nearly as much work as it sounds like, much of the details are formulaic. You do need to have good data on how your workload uses hardware resources to know what to build.
And this is one of the big "secrets" of AWS' success: shifting a lot of resource allocation and power from people with budgeting responsibility to developers who have usually never seen the budget or accounts, don't keep track, and at most retrospectively get pulled in to explain line items in expenses, and obscuring it (to the point where I know people who've spent 6-figure amounts worth of dev time building analytics to figure out where their cloud spend goes... tooling has gotten better but is still awful).
I believe a whole lot of tech stacks would look very different if developers and architects were more directly involved in budgeting, and bonuses etc. were linked at least in part to financial outcomes affected by their technical choices.
A whole lot of claims of low cloud costs come from people who have never done actual comparisons and who seem to have a pathological fear of hardware, even though for most people you don't need to ever touch a physical box yourself - you can get maybe 2/3 of the savings with managed hosting as well.
You don't get the super-customized server builds, but you do get far more choice than with cloud providers, and you can often make up for the lack of fine-grained control by being able to rent/lease them somewhere where the physical hosting is cheaper (e.g. at a previous employer, what finally made us switch to Hetzner for most new capacity was that while we didn't get exactly the hardware we wanted, we got "close enough", coupled with data centre space in their German locations costing far less than data centre space in London - it didn't make them much cheaper, but it did make them sufficiently cheaper to outweigh the hardware differences, with enough margin for us to deploy new stuff there while still keeping some of our colo footprint).
I tend some workloads that transform data grids of varying sizes. The grids are anon mmaps, so that when memory runs out they get paged out. This means processing stays mostly in-mem yet won't abort when memory runs tight. The processes that get hit by paging slow to a crawl, though. Getting faster SSDs means they're still crawling, but crawling faster. Doubling SSD throughput would pretty much halve the tail latency.
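A stripped-down sketch of that pattern (the grid size is made up): anonymous mappings spill to swap instead of the process dying when RAM runs tight, so a faster SSD under swap directly shortens the paging tail.

    # Back a large working grid with an anonymous mapping. Pages are
    # allocated on first touch and can be paged out to swap under pressure.
    import mmap

    CELL = 8                       # bytes per cell (example)
    ROWS, COLS = 50_000, 10_000    # ~4 GB in this toy example; real grids exceed RAM

    grid = mmap.mmap(-1, ROWS * COLS * CELL)   # anonymous, demand-paged

    def cell(r, c):
        return (r * COLS + c) * CELL

    grid[cell(123, 456):cell(123, 456) + CELL] = (42).to_bytes(CELL, "little")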
I see. Thanks for explaining!
Pretty much all workloads; workloads that are not affected would be the exception.
This is already a thing. AWS instance store volumes are directly attached to the host. I’m pretty sure GCP and Azure also have an equivalent local storage option.
Next thing the other clouds will offer is cheaper bandwidth pricing, right?
I suspect you must be conflating several different storage products. Are you saying https://cloud.google.com/compute/docs/disks/local-ssd devices talk to the host through a network (say, ethernet with some layer on top)? Because the documentation very clearly says otherwise, "This is because Local SSD disks are physically attached to the server that hosts your VM. For this same reason, Local SSD disks can only provide temporary storage." (at least, I'm presuming that by physically attached, they mean it's connected to the PCI bus without a network in between).
I suspect you're thinking of SSD-PD. If "local" SSDs are not actually local and go through a network, I need to have a discussion with my GCS TAM about truth in advertising.
I don’t really agree with assuming the form of physical attachment and interaction unless it is spelled out.
If that’s what’s meant it will be stated in some fine print, if it’s not stated anywhere then there is no guarantee what the term means, except I would guess they may want people to infer things that may not necessarily be true.
"Physically attached" has had a fairly well defined meaning and i don't normally expect a cloud provider to play word salad to convince me a network drive is locally attached (like I said, if true, I would need to have a chat with my TAM about it).
Physically attached for servers, for the past 20+ years, has meant a direct electrical connection to a host bus (such as the PCI bus attached to the front-side bus). I'd like to see some alternative examples that violate that convention.
Ethernet cables are physical...
If that’s the game we’re going to play then technically my driveway is on the same road as the White House.
exactly. it's not about what's good for the consumer, it's about what they can do without losing a lawsuit for false advertising.
The NIC is attached to the host bus through the north bridge. But other hosts on the same ethernetwork are not considered to be "local". We don't need to get crazy about the semantics to know that when a cloud provider says an SSD is locally attached, it's closer than an ethernetwork away.
Believe it or not, superglue and a wifi module! /s
Local SSD is part of the machine.
For AWS there are EBS volumes attached through a custom hardware NVMe interface and then there's Instance Store which is actually local SSD storage. These are different things.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance...
EBS is also slower than local NVMe mounts on i3's.
Also, both features use Nitro SSD cards, according to AWS docs. The Nitro architecture is all locally attached -- instance storage to the instance, EBS to the EBS server.
What makes you think that?
I can attest to the fact that on EC2, "instance store" volumes are actually physically attached.
Do you have a link to explain this? I don't think it's true.
This is incorrect.
Amazon offers both locally-attached storage devices as well as instance-attached storage devices. The article is about the latter kind.
Nope! Well not as advertised. There are instances, usually more expensive ones, where there are supposed to be local NVME disks dedicated to the instance. You're totally right that providing good I/O is a big problem! And I have done studies myself showing just how bad Google Cloud is here, and have totally ditched Google Cloud for providing crappy compute service (and even worse customer service).
Instances can have block storage, which is network attached, or locally attached SSD/NVMe. It's 2 separate things.
At first you'd think maybe they could do a volume copy from a snapshot to a local drive on instance creation, but even at 100 Gbps you're looking at almost 3 minutes for a 2 TB drive.
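Back-of-the-envelope for that, assuming you could actually hold line rate the whole time:

    # 2 TB over a 100 Gbps link, ignoring protocol overhead.
    bytes_total = 2e12            # 2 TB
    link_bytes_per_s = 100e9 / 8  # 100 Gbps = 12.5 GB/s
    seconds = bytes_total / link_bytes_per_s
    print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # 160 s, roughly 2.7 minutes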
Could that have to do with every operation requiring a round trip, rather than being able to queue up operations in a buffer to saturate throughput?
It seems plausible if the interface protocol was built for a device it assumed was physically local and so waited for confirmation after each operation before performing the next.
In this case it's not so much the throughput rate that matters, but the latency -- which can also be heavily affected by buffering of other network traffic.
Underlying protocol limitations wouldn't be an issue - the cloud provider's implementation can work around that. They're unlikely to be sending sequential SCSI/NVMe commands over the wire - instead, the hypervisor pretends to be the NVME device, but then converts to some internal protocol (that's less chatty and can coalesce requests without waiting on individual ACKs) before sending that to the storage server.
The problem is that ultimately your application often requires the outcome of a given IO operation to decide which operation to perform next - let's say when it comes to a database, it should first read the index (and wait for that to complete) before it knows the on-disk location of the actual row data which it needs to be able to issue the next IO operation.
In this case, there's no other solution than to move that application closer to the data itself. Instead of the networked storage node being a dumb blob storage returning bytes, the networked "storage" node is your database itself, returning query results. I believe that's what RDS Aurora does for example, every storage node can itself understand query predicates.
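A toy illustration of that dependency (the on-disk format and offsets are invented): the second read can't be issued until the first one returns, so every lookup pays at least two full storage round trips, wherever the storage actually lives.

    # B-tree-ish lookup: read an index block, decode a pointer, then read the row.
    # The reads are strictly serial, so total latency ~= 2x the storage round trip.
    import os, struct

    def read_block(fd, offset, size=4096):
        return os.pread(fd, size, offset)

    def lookup(fd, index_offset):
        index_block = read_block(fd, index_offset)             # round trip #1
        (row_offset,) = struct.unpack_from("<Q", index_block)  # invented format: first 8 bytes
        return read_block(fd, row_offset)                      # round trip #2 depends on #1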
I've run CI/CD pipelines on EC2 machines with local storage, typically running RAID-0, btrfs, noatime. I didn't care if the filesystem got corrupt or whatever; I had a script that would rebuild it in under 30 mins. In addition to the performance, you're not paying by IOPS.
Why do they fundamentally need to be network attached storage instead of local to the VM?
They don't. Some cloud providers (e.g. Hetzner) let you rent VMs with locally attached NVMe, which is dramatically faster than network-attached even factoring in the VM tax.
Of course then you have a single point of failure, in the PCIe fabric of the machine you're running on if not the NVMe itself. But if you have good backups, which you should, then the juice really isn't worth the squeeze for NAS storage.
A network adds more points of failure. It does not reduce them.
A network attached, replicated storage hedges against data loss but increases latency; however most customers usually prefer higher latency to data loss. As an example, see the highly upvoted fly.io thread[1] with customers complaining about the same thing.
[1] https://news.ycombinator.com/item?id=36808296
Locally-attached, replicated storage also hedges against data loss.
RAID rebuild times make it an unviable option and customers typically expect problematic VMs to be live-migrated to other hosts with the disks still having their intended data.
The self hosted version of this is GlusterFS and Ceph, which have the same dynamics as EBS and its equivalents in other cloud providers.
With NVMe SSDs? What makes RAID unviable in that environment?
This depends, like all things.
When you say RAID, what level? Software-raid or hardware raid? What controller?
Let's take best-case:
RAID10, small enough (but many) NVMe drives and an LVM/Software RAID like ZFS, which is data aware so only rebuilds actual data: rebuilds will degrade performance enough potentially that your application can become unavailable if your IOPS are 70%+ of maximum.
That's an ideal scenario, if you use hardware raid which is not data-aware then your rebuild times depend entirely on the size of the drive being rebuilt and it can punish IOPs even more during the rebuild. But it will affect your CPU less.
There's no panacea. Most people opt for higher latency distributed storage where the RAID is spread across an enormous amount of drives, which makes rebuilds much less painful.
What I used to do was swap machines over from the one with failing disks to a live spare (slave in the old frowned upon terminology), do the maintenance and then replicate from the now live spare back if I had confidence it was all good.
Yes, it's costly having the hardware to do that, as it mostly meant multiple machines: I always wanted to be able to rebuild one whilst having at least two machines online.
If you are doing this with your own hardware it is still less costly than cloud even if it mostly sits idle.
Cloud is approx 5x sticker cost for compute if it's sustained.
Your discounts may vary, rue the day those discounts are taken away because we are all sufficiently locked in.
A network adds more points of failures but also reduces user-facing failures overall when properly architected.
If one CPU attached to storage dies, another can take over and reattach -- or vice-versa. If one network link dies, it can be rerouted around.
Using a SAN (which is what networked storage is, after all) also lets you get various "tricks" such as snapshots, instant migration, etc for "free".
Because even if you can squeeze 100TB or more of SSD/NVMe in a server, and there are 10 tenants using the machine, you're limited to 10TB as a hard ceiling.
What happens when one tenant needs 200TB attached to a server?
Cloud providers are starting to offer local SSD/NVMe, but you're renting the entire machine, and you're still limited to exactly what's installed in that server.
How is that different from how cores, mem and network bandwidth is allotted to tenants?
Because a fair number of customers spin up another image when cores/mem/bandwidth run low. Dedicated storage breaks that paradigm.
Also, to add: if I am on an 8-core machine and need 16, network storage can be detached from host A and connected to host B. With dedicated storage it must be fully copied over first.
It isn't. You could ask for network-attached CPUs or RAM. You'd be the only one, though, so in practice only network-attached storage makes sense business-wise. It also makes sense if you need to provision larger-than-usual amounts like tens of TB - these are usually hard to come by in a single server, but quite mundane for storage appliances.
Given AWS and GCP offer multiple sizes for the same processor version with local SSDs, I don't think you have to rent the entire machine.
Search for i3en API names and you'll see:
i3en.large, 2x CPU, 1250GB SSD
i3en.xlarge, 4x CPU, 2500GB SSD
i3en.2xlarge, 8x CPU, 2x2500GB SSD
i3en.3xlarge, 12x CPU, 7500GB SSD
i3en.6xlarge, 24x CPU, 2x7500GB SSD
i3en.12xlarge, 48x CPU, 4x7500GB SSD
i3en.24xlarge, 96x CPU, 8x7500GB SSD
i3en.metal, 96x CPU, 8x7500GB SSD
So they've got servers with 96 CPUs and 8x7500GB SSDs. You can get a slice of one, or you can get the whole one. All of these are the ratio of 625GB of local SSD per CPU core.
https://instances.vantage.sh/
On GCP you can get a 2-core N2 instance type and attach multiple local SSDs. I doubt they have many physical 2-core Xeons in their datacenters.
Link to this mythical hosting service that expects far less than 200TB of data per client but just pulls a sad face and takes the extra cost on board when a client demands it. :D
Redundancy, local storage is a single point of failure.
You can use local SSDs as slow RAM, but anything on them can go away at any moment.
I've seen SANs get nuked by operator error or by environmental issues (overheated DC == SAN shuts itself down).
Distributed clusters of things can work just fine on ephemeral local storage (aka local storage). A kafka cluster or an opensearch cluster will be fine using instance local storage, for instance.
As with everything else.... "it depends"
Sure distributed clusters get back to network/workload limitations.
These days it's likely that your SAN is actually just a cluster of commodity hardware where the disks/SSDs have custom firmware and some fancy block shoveling software.
Reliability. SSDs break and screw up a lot more frequently and more quickly than CPUs. Amazon has published a lot on the architecture of EBS, and they go through a good analysis of this. If you have a broken disk and you locally attach, you have a broken machine.
RAID helps you locally, but fundamentally relies on locality and low latency (and maybe custom hardware) to minimize the time window where you get true data corruption on a bad disk. That is insufficient for cloud storage.
Sure, but there's plenty of software that's written to use distributed unreliable storage similar to how cloud providers write their own software (e.g. Kafka). I can understand if many applications are just need something like EBS that's durable but looks like a normal disk, but not so sure it's a fundamentally required abstraction.
The major clouds do offer VMs with fast local storage, such as SSDs connected by NVMe connections directly to the VM host machine:
- https://cloud.google.com/compute/docs/disks/local-ssd
- https://learn.microsoft.com/en-us/azure/virtual-machines/ena...
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst...
They sell these VMs at a higher cost because it requires more expensive components and is limited to host machines with certain configurations. In our experience, it's also harder to request quota increases to get more of these VMs -- some of the public clouds have a limited supply of these specific types of configurations in some regions/zones.
As others have noted, instance storage isn't as dependable. But it can be the most performant way to do IO-intense processing or to power one node of a distributed database.
So much of this. The number of times I've seen someone complain about slow DB performance when they're trying to connect to it from a different VPC, bottlenecking themselves to 100 Mbit, is stupidly high.
It literally depends on where things are in a data center... If you're closely coupled, on a 10G line on the same switch, going to the same server rack, I bet performance will be so much more consistent.
I thought cloud was supposed to abstract this away? That's a bit of a sarcastic question from a long-time cloud skeptic, but... wasn't it?
Reality always beats the abstraction. After all, it's just somebody else's computer in somebody else's data center.
Which can cause considerable "amusement" depending on the provider - one I won't name directly, but which is much more centered on actually renting racks than on their (now) cloud offering - if you had a virtual machine older than a year or so, deleting and restoring it would get you on a newer "host" and you'd be faster for the same cost.
Otherwise it'd stay on the same physical piece of hardware it was allocated to when new.
Amusing is a good description.
"Hardware degradation detected, please turn it off and back on again"
I could do a migration with zero downtime in VMware for a decade but they can't seamlessly move my VM to a machine that works in 2024? Great, thanks. Amusing.
Cloud providers have live migration now but I guess they don't want to guarantee anything.
It's better (and better still with other providers) but I naively thought that "add more RAM" or "add more disk" was something they would be able to do with a reboot at most.
Nope, some require a full backup and restore.
Resizing VMs doesn't really fit the "cattle" thinking of public cloud, although IMO that was kind of a premature optimization. This would be a perfect use case for live migration.
I have always been incredibly saddened that apparently the cloud providers usually have nothing as advanced as old VMware was.
Cloud makes provisioning more servers quicker because you are paying someone to basically have a bunch of servers ready to go right away with an API call instead of a phone call, maintained by a team that isn’t yours, with economies of scale working for the provider.
Cloud does not do anything else.
None of these latency/speed problems are cloud-specific. If you have on-premise servers and you are storing your data on network-attached storage, you have the exact same problems (and also the same advantages).
Unfortunately the gap between local and network storage is wide. You win some, you lose some.
Oh, I'm not a complete neophyte (in what seems like a different life now, I worked for a big hosting provider actually), I was just surprised that there was a big penalty for cross-VPC traffic implied by the parent poster.
It's more of a matter of adding additional abstraction layers. For example, in most public clouds the best you can hope for is to place two things in the same availability zone to get the best performance. But when I worked at Google, internally they had more sophisticated colocation constraints than that: for example, you could require two things to be on the same rack.
Aren’t 10G and 100G connections standard nowadays in data centers? Heck, I thought they were standard 10 years ago.
Bandwidth-delay product does not help serialized transactions. If you're reaching out to disk for results, or if you have locking transactions on a table, the achievable operation rate drops dramatically as latency between the host and the disk increases.
The typical way to trade bandwidth away for latency would, I guess, be speculative requests. In the CPU world at least. I wonder if any cloud providers have some sort of framework built around speculative disk reads (or maybe it is a totally crazy trade to make in this context)?
Often times it’s the app (or something high level) that would need speculative requests, which it may not be possible in the given domain.
I don’t think it’s possible in most domains.
I mean we already have readahead in the kernel.
This said the problem can get more complex than this really fast. Write barriers for example and dirty caches. Any application that forces writes and the writes are enforced by the kernel are going to suffer.
The same is true for SSD settings. There are a number of tweakable values on SSDs when it comes to write commit and cache usage which can affect performance. Desktop OS's tend to play more fast and loose with these settings and servers defaults tend to be more conservative.
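And beyond the kernel's automatic readahead, an application that knows what it will need next can ask for prefetch explicitly; a small sketch (the file path and range are placeholders):

    # Hint the kernel to start reading a range we'll need soon, then keep working;
    # later reads of that range should mostly hit the page cache.
    import os

    fd = os.open("/var/lib/myapp/table.dat", os.O_RDONLY)    # placeholder path
    os.posix_fadvise(fd, 0, 64 * 1024 * 1024, os.POSIX_FADV_WILLNEED)
    # ... unrelated work happens here ...
    data = os.pread(fd, 4096, 10 * 1024 * 1024)
    os.close(fd)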
You'd need the whole stack to understand your data format in order to make speculative requests useful. It wouldn't surprise me if cloud providers indeed do speculative reads but there isn't much they can do to understand your data format, so chances are they're just reading a few extra blocks beyond where your OS read and are hoping that the next OS-initiated read will fall there so it can be serviced using this prefetched data. Because of full-disk-encryption, the storage stack may not be privy to the actual data so it couldn't make smarter, data-aware decisions even if it wanted to, limiting it to primitive readahead or maybe statistics based on previously-seen patterns (if it sees that a request for block X is often followed by block Y, it may choose to prefetch that next time it sees block X accessed).
A problem in applications such as databases is when the outcome of an IO operation is required to initiate the next one - for example, you must first read an index to know the on-disk location of the actual row data. This is where the higher latency absolutely tanks performance.
A solution could be to make the storage drives smarter - have an NVME command that could say like "search in between this range for this byte pattern" and one that can say "use the outcome of the previous command as a the start address and read N bytes from there". This could help speed up the aforementioned scenario (effectively your drive will do the index scan & row retrieval for you), but would require cooperation between the application, the filesystem and the encryption system (typical, current FDE would break this).
400G is a fairly normal thing in DCs nowadays.
Datacenters are up to 400 Gbps and beyond (many places are adopting 1+ Tbps on core switching).
However, individual servers may still operate at 10, 25, or 40 Gbps to save cost on the thousands of NICs in a row of racks. Alternatively, servers with multiple 100G connections split that bandwidth allocation up among dozens of VMs so each one gets 1 or 10G.
Yes, but you have to think about contention. Whilst the top-of-rack might have 2x400 gig links to the core, that's shared with the entire rack, and all the other machines trying to shout at the core switching infra.
Then stuff goes away, or route congested, etc, etc, etc.
Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This ~2020-I-think version of the "Latency Numbers Everyone Should Know" says 0.5 ms round trip (and mentions "10 Gbps network" on another line). [1] It was the same thing in a 2012 version (that only mentions "1 Gbps network"). [2] Why no improvement? I think that 2020 version might have been a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but still I think the round trip actually is strangely bad.
I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something at the machine/OS interaction area. I would design large distributed systems significantly differently (be much more excited about extra tiers in my stack) if the standard RPC system offered say 10 µs typical round trip latency.
[1] https://static.googleusercontent.com/media/sre.google/en//st...
[2] https://gist.github.com/jboner/2841832
That document is probably deliberately on the pessimistic side to encourage your code to be portable across all kinds of "data centers" (however that is defined). When I previously worked at Google, the standard RPC system definitely offered 50 microseconds of round trip latency at the median (I measured it myself in a real application), and their advanced user-space implementation called Snap could offer about 10 microseconds of round trip latency. The latter figure comes from page 9 of https://storage.googleapis.com/gweb-research2023-media/pubto...
Google exceeded 100Gbps per machine long before 2024. IIRC it had been 400Gbps for a while.
Interesting. I worked at Google until January 2021. I see 2019 dates on that PDF, but I wasn't aware of snap when I left. There was some alternate RPC approach (Pony Express, maybe? I get the names mixed up) that claimed 10 µs or so but was advertised as experimental (iirc had some bad failure modes at the time in practice) and was simply unavailable in many of the datacenters I needed to deploy in. Maybe they're two names for the same thing. [edit: oh, yes, starting to actually read the paper now, and: "Through Snap, we created a new communication stack called Pony Express that implements a custom reliable transport and communications API."]
Actual latency with standard Stubby-over-TCP and warmed channels...it's been a while, so I don't remember the number I observed, but I remember it wasn't that much better than 0.5 ms. It was still bad enough that I didn't want to add a tier that would have helped with isolation in a particularly high-reliability system.
Snap was the external name for the internal project known as User Space Packet Service (abbreviated USPS) so naturally they renamed it prior to publication. I deployed an app using Pony Express in 2023 and it was available in the majority of cells worldwide. Pony Express supported more than just RPC though. The alternate RPC approach that you spoke of was called Void. It had been experimental for a long time and indeed it wasn't well known even inside Google.
If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…
Interesting, thanks!
I believe you, and I think in principle we should all be getting the 50 µs latency you're describing within a datacenter with no special effort.
...but it doesn't match what I observed, and I'm not sure why. Maybe difference of a couple years. Maybe I was checking somewhere with older equipment, or some important config difference in our tests. And obviously my memory's a bit fuzzy by now but I know I didn't like the result I got.
50 microseconds is a lot. I'm looking at disk read latency on a bunch of baremetal servers (nothing fancy, just node_disk_read.* metrics from node-exporter) and one of the slowest fleets has a median disk read latency barely above 1 microsecond. (And that's with rather slow HDDs.)
Samsung 980 Pro SSDs, the early-generation ones (they seem to have later replaced them with a different, likely worse architecture), have average 4k read latencies of roughly 30 / 70 / 120 microseconds for, respectively: a single queue, 70% of the IOPS that the 120 us case gets you, and maximum parallelism before latency goes through the roof.
The metrics you mention have to be pagecache hits. Basically all MLC NAND is in the double digit microseconds for uncontended random reads.
They likely are cache hits, indeed (any suggestion what other metrics would be more comparable?). Still, at the end of the day I don't care whether a disk operation was made fast by kernel caching or by some other optimization at a lower level, I only care about the final result. With public cloud virtualization there are more layers where something may go wrong, and good luck finding answers from Amazon or Microsoft if your performance turns out to be abysmal.
With such speeds, and CXL gaining traction (think RAM and GPUs over the network), why is networked SSD still an issue? You could have one storage server per rack that would serve storage only for that particular rack.
You could easily have ~40 GB/s with some overprovisioning/bucketing.
Networks are not reliable, despite what you hear, so latency is used to mask re-tries and delays.
The other thing to note is that big inter-DC links are heavily QoS'd and contended, because they are both expensive and a bollock to maintain.
Also, from what I recall, 40 gig links are just parallel 10 gig links, so have no lower latency. I'm not sure if 100/400 gig are ten/forty lanes of ten gig in parallel or actually able to issue packets at 10/40 times the rate of a ten gig link. I've been away from networking too long.
Of course, but even the 50%ile case is strangely slow, and if that involves retries something is deeply wrong.
You're right, but TCP doesn't like packets being dropped halfway through a stream. If you have a highly QoS'd link then you'll see latency spikes.
Again, I'm not talking about spikes (though better tail latency is always desirable) but poor latency in the 50%ile case. And for high-QoS applications, not batch stuff. The snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect close to that with standard kernel networking and TCP.
You can get similar results by looking at comparisons between DPDK and kernel networking. Most of the usual gap comes from not needing to context-switch for kernel interrupt handling, zero-copy abstractions, and busy polling (wherein you trade CPU for lower latency instead of sleeping between iterations if there's no work to be done).
https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc... goes into some amount of detail comparing request throughput of an unoptimized kernel networking stack, optimized kernel networking stack, and DPDK. I'm not aware of any benchmarks (public or private) comparing Snap vs DPDK vs Linux, so that's probably as close as you'll get.
Thanks for the link. How does this compare to the analogous situation for SSD access? I know there are also userspace IO stacks for similar reasons, but it seems like SSD-via-kernel is way ahead of network-via-kernel in the sense that it adds less latency per operation over the best userspace stack.
Great reading, thanks for the link on vanilla vs dpdk
40gig links are just parallel 10 gig links, so have no lower latency
That's not correct. Higher link speeds do have lower serialization latency, although that's a small fraction of overall network latency.
Modern data center networks don't have full cross connectivity. Instead they are built using graphs and hierarchies that provide less than the total bandwidth required for all pairs of hosts to be communicating. This means, as workloads start to grow and large numbers of compute hosts demand data IO to/from storage hosts, the network eventually gets congested, which typically exhibits as higher latencies and more dropped packets. Batch jobs are often relegated to "spare" bandwidth while serving jobs often get dedicated bandwidth
At the same time, ethernetworks with layered network protocols on top typically have a fair amount of latency overhead, that makes it much slower than bus-based direct-host-attached storage. I was definitely impressed at how quickly SSDs reached and then exceeded SATA bandwidth. nvme has made a HUGE difference here.
How much faster would the network need to get, in order to meet (or at least approach) the speed of a local SSD? are we talking about needing to 2x or 3x the speed, or by factors of hundreds or thousands?
The problem isn't necessarily speed, it's random access latency. What makes SSDs fast and "magical" is their low random-access latency compared to a spinning disk. The sequential-access read speed is merely a bonus.
Networked storage negates that significantly, absolutely killing performance for certain applications. You could have a 100Gbps network and it still won't match a direct-attached SSD in terms of latency (it can only match it in terms of sequential access throughput).
For many applications, such as databases, random access is crucial, which is why today's mid-range consumer hardware often outperforms hosted databases such as RDS, unless they're so overprovisioned on RAM that the dataset is effectively always in memory.
Um... why the hell does the network care whether I am doing random or sequential access? You left that part out of your argument.
Ah sorry, my bad. You are correct that you can fire off many random access operations in parallel and get good throughput that way.
The problem is that this is not possible when the next IO request depends on the result of a previous one, like in a database where you must first read the index to know the location of the row data itself.
OK thanks yes that makes sense. Pipelining problems are real.
(The network indeed doesn't care, but the bandwidth of dependent, as opposed to independent, accesses depends on latency.)
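A back-of-the-envelope model of why that matters, using assumed (not measured) latencies:

    # Assumed latencies for illustration only.
    local_s = 100e-6    # ~100 us local NVMe random read
    remote_s = 600e-6   # assumed network-attached random read
    queue_depth = 64    # independent requests kept in flight

    print(queue_depth / remote_s)  # independent reads: ~107k IOPS, latency is hidden
    print(1 / remote_s)            # dependent reads (index, then row): ~1.7k IOPS
    print(1 / local_s)             # same dependent pattern locally: ~10k IOPS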
100 Gbps direct shouldn't be too bad, but it might be difficult to get anyone to sell it to you for exclusive usage in a VM...
The Samsung 990 in my desktop provides ~3.5 GB/s streaming reads, ~2 GB/s 4k random-access reads, all at a latency measured at around 20-30 microseconds. My exact numbers might be a little off, but that's the ballpark you're looking at, and a 990 is a relatively cheap device.
10GbE is about the best you can hope for from a local network these days, but that's 1/5th the bandwidth and many times the latency. 100GbE would work, except the latency would still mean any read dependencies would be far slower than local storage, and I'm not sure there's much to be done about that; at these speeds the physical distance matters.
In practice I'm having to architect the entire system around the SSD just to not bottleneck it. So far ext4 is the only filesystem that even gets close to the SSD's limits, which is a bit of a pity.
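As a rough sanity check on the numbers above (treating the ~2 GB/s of 4 KiB random reads and ~25 µs latency as approximate), Little's law says the drive needs roughly a dozen reads in flight to hit that throughput:

    throughput = 2e9      # ~2 GB/s of 4 KiB random reads (quoted above)
    block = 4096          # 4 KiB per read
    latency = 25e-6       # ~25 us per read (quoted above)

    iops = throughput / block      # ~488k IOPS
    in_flight = iops * latency     # ~12 outstanding reads (Little's law)
    print(round(iops), round(in_flight, 1))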
Networking doesn't have to have high latency. You can buy network hardware that is able to provide sub-microsecond latency. Physical distance still matters, but 10% of typical NVMe latency gets you through a kilometer of fiber.
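A quick check on that kilometer figure, assuming light in fiber travels at roughly 2/3 c and an NVMe random read takes on the order of 100 µs:

    c = 3e8                    # m/s in vacuum
    v_fiber = 2 / 3 * c        # ~2e8 m/s in glass
    one_way = 1000 / v_fiber   # ~5 us per km
    round_trip = 2 * one_way   # ~10 us
    print(round_trip / 100e-6) # ~0.1 of an assumed 100 us NVMe random read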
Around 4x-10x depending on how many SSDs you want. A single SSD is around the speed of a 100 Gbps Ethernet link.
SATA3 is 6 Gbit/s, so multiply that by the number of VMs on a machine. For NVMe, it's probably closer to 4-5x that. You'd need some serious interconnects to give a server rack access to un-bottlenecked SSD storage.
Dumb question. Why does the network have to be slow? If the SSDs are two feet away from the motherboard and there's an optical connection to it, shouldn't it be fast? Are data centers putting SSDs super far away from motherboards?
This is the theory that I would bet on because it lines up with their bottom line.
But the sentence right after undermines it.
Faster read speeds would give them a more enticing product without wearing drives out.
They may be limiting read speeds artificially to increase your resource utilization elsewhere. If you have a disk bottleneck, you are more likely to use more instances. It is still about the bottom line.
That could be. But it's a completely different reason. If you summarize everything as "bottom line", you lose all the valuable information.
It's not the network being slow; it's that dividing the available network bandwidth among all users, while also reliably distributing the written data to multiple nodes so that one tenant doesn't hog resources, is quite challenging. The pricing structure is meant to control resource usage; the exact prices and how much profit AWS or any other cloud provider makes is a separate discussion.
What happens when your vm is live migrated 1000 feet away or to a different zone?
This is untrue of Local SSD (https://cloud.google.com/local-ssd) in Google Cloud. Local SSDs are PCIe peripherals, not network attached.
There are also multiple Persistent Disk (https://cloud.google.com/persistent-disk) offerings that are backed by SSDs over the network.
(I'm an engineer on GCE. I work directly on the physical hardware that backs our virtualization platform.)
It's notable that your second link has a screenshot for 24(!) NVMe SSDs totalling 9 terabytes, but the aggregate performance is 2.4M IOPS and 9.3 GB/s for reads. In other words, just ~100K IOPS and ~400 MB/s per individual SSD, which is very low these days.
For comparison, a single 1 TB consumer SSD can deliver comparable numbers (lower IOPS but higher throughput).
If I plugged 24 consumer SSDs into a box, I would expect over 30M IOPS and near the memory bus limit for throughput (>50 GB/s).
There's a quadrant of the market which is poorly served by the Cloud model of elastic compute: persistent local SSDs across shutdown and restart.
Elastic compute means you want to be able to treat compute hardware as fungible. Persistent local storage makes that a lot harder because the Cloud provider wants to hand out that compute to someone else after shutdown, so the local storage needs to be wiped.
So you either get ephemeral local SSDs (and have to handle rebuild on restart yourself) or network-attached SSDs with much higher reliability and persistence, but a fraction of the performance.
Active instances can be migrated, of course, with sufficient cleverness in the I/O stack.
GCE fares a little better than this:
- VMs with SSDs can (in general -- there are exceptions for things like GPUs and exceptionally large instances) live migrate with contents preserved.
- GCE supports a timeboxed "restart in place" feature where the VM stays in limbo ("REPAIRING") for some amount of time waiting for the host to return to service: https://cloud.google.com/compute/docs/instances/host-mainten.... This mostly only applies to transient failures like power-loss beyond battery/generator sustaining thresholds, software crashes, etc.
- There is a related feature, also controlled by the `--discard-local-ssd=` flag, which allows preservation of local SSD data on a customer initiated VM stop.
I should've aimed for more clarity in my original comment -- the first link is to locally attached storage. The second is network attached storage (what the GP was likely referring to, but not what is described in the article).
Persistent Disk is not backed by single devices (even for a single NVMe attachment), but by multiple redundant copies spread across power and network failure domains. Those volumes will survive the failure of the VM to which they are attached as well as the failure of any individual volume or host.
Makes me wonder if we're on the crux of a shift back to client-based software. Historically changes in the relative cost of computing components have driven most of the shifts in the computing industry. Cheap teletypes & peripherals fueled the shift from batch-processing mainframes to timesharing minicomputers. Cheap CPUs & RAM fueled the shift from minicomputers to microcomputers. Cheap and fast networking fueled the shift from desktop software to the cloud. Will cheap SSDs & TPU/GPUs fuel a shift back toward thicker clients?
There are a bunch of supporting social trends toward this as well. Renewed emphasis on privacy. Big Tech canceling beloved products, bricking devices, and generally enshittifying everything - a lot of people want locally-controlled software that isn't going to get worse at the next update. Ever-rising prices which make people want to lock in a price for the device and not deal with increasing rents for computing power.
I think a major limiting factor here for many applications is that mobile users are a huge portion of the user base. In that space storage, and more importantly battery life, are still at a premium. Granted the storage cost just seems to be gouging from my layman’s point of view, so industry needs might force a shift upwards.
Mobile devices are the desktop computers of the 2010s though. They are mostly used with very thick clients.
Not on AWS. Instance stores (what the article is about) are physical local disks.
Same for Google (the GP is incorrect about how GCE Local SSD works).
*As implemented in the public cloud providers.
You can absolutely get better than local disk speeds from SAN devices and we've been doing it for decades. To do it on-prem with flash devices will require NVMe over FC or Ethernet and an appropriate storage array. Modern all-flash array performance is measured in millions of IOPS.
Will there be a slight uptick in latency? Sure, but it's well worth it for the data services and capacity of an external array for nearly every workload.
A quick comparison of marketing slides lands you at 0.5 MOPS for FC-NVMe on a QLogic 2770 [2] and 0.7 MOPS for a PCIe Micron 9100 PRO [1], so the better speeds are not quite there, although spending quite a lot on server gear lands you near the performance of a workstation-grade drive from 2017.
Which is still not bad, when I was shopping around in 2018 no money could buy performance comparable to a locally-attached NVMe in a more professional/datacenter-ready form.
[1] https://media-www.micron.com/-/media/client/global/documents...
[2] https://www.marvell.com/content/dam/marvell/en/public-collat...
If the local drives are network drives (e.g. SAN), then why are they ephemeral?
live vm migrations, perhaps
No they don't. I work for a cloud provider and I can guarantee that your SSD is local to your VM.
This is also true for GCE Local SSD: https://cloud.google.com/local-ssd
The GP is incorrect.
No, the i3/i4 VMs discussed in the blog have local SSDs. The network isn't the reason local SSDs are slow.
In other words, they're saving money. This is a fundamental problem with cloud providers. Value created from technological innovation is captured by the cloud provider and bare minimum is shared to reduce prices. The margins are ridiculous.
It does not fundamentally have to be. That's an architectural choice driven by cloud providers backing off from the "instances can die on you" choice that AWS started with and then realized customers struggled with, towards attempting to keep things running in the face of many types of hardware failures.
Which is ironic given that when building on-prem/colo'ed setups you'll replicate things to be prepared for unknown lengths of downtime while equipment is repaired or replaced, so this was largely cloud marketing coming back to bite cloud providers' collective asses. Not wanting instances to die "randomly" for no good reason does not always mean wanting performance sacrifices for the sake of more resilient instances.
But AWS at least still offers plenty of instances with instance storage.
If I'm setting up my own database cluster, while I don't want it running on cheap consumer-grade hardware without dual power supplies and RAID, I also don't want to sacrifice SSD speed for something network-attached to survive a catastrophic failure when I'm going to have both backups, archived shipped logs and at least one replica anyway.
I'd happily pick network-attached storage for many things if it gets me increased resilience, but selling me a network-attached SSD, unless it replicates local SSD performance characteristics, is not competitive for applications where performance matters and I'm set up to easily handle system-wide failures anyway.
SANs can still be quite fast, and instance storage is fast; both are available from cloud providers.
Forgive me if this is stupid, but when Netflix was doing all their edge content boxes, putting machines much more latency-close to customers... could this model work for SSDs in an SScDN type of network (client -> CDN -> SScDN -> SSD)?
Can you expound?
Once again getting a sensible chuckle on HN listening to the cloud crowd whine and complain about what we used to call "the SAN being slow", with the architects' argument that "the SSDs will make things faster".
So do cloud vendors simply not use fast SSDs? If so I would expect the SSD manufacturers themselves to work on this problem. Perhaps they already are.
This is an interesting benchmark that compares disk read latency across different clouds and various disk configurations, including SSDs and EBS: https://github.com/scylladb/diskplorer
Google's disks perform quite poorly.
And how Discord worked around it: https://discord.com/blog/how-discord-supercharges-network-di...
I can see network attached SSDs having poor latency, but shouldn’t the networking numbers quoted in the article allow for higher throughput than observed?
Yeah this was my impression.
I am but an end user, but I noticed that disk IO for a certain app was glacial compared to a local test deployment, and I chalked it up to networking/VM overhead
I'm not sure which external or internal product you're talking about, but there are no networks involved for Local SSD on GCE: https://cloud.google.com/compute/docs/disks/local-ssd
Are you referring to PD-SSD? Internal storage usage?
Kind of funny, but we use a similar idea in Azure.
They don't have to be. Architecturally there are many benefits to storage area networks, but I have built plenty of systems which are self-contained: download a dataset to a cloud instance with a direct-attached SSD, load it into a database, and serve it a different way.
Even assuming that "local" storage is a lie, hasn't the network gotten a lot faster? The author is only asking for a 5x increase at the end of the post.