This was a huge technical problem I worked on at Google, and is sort of fundamental to a cloud. I believe this is actually a big deal that drives people's technology directions.
SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSDs.
According to the submitted article, the numbers are from AWS instance types where the SSD is "physically attached" to the host, not about SSD-backed NAS solutions.
Also, the article isn't just about SSDs being no faster than a network. It's about SSDs being two orders of magnitude slower than datacenter networks.
It's because the "local" SSDs are not actually physically attached and there's a network protocol in the way.
I think you're wrong about that. AWS calls this class of storage "instance storage" [0], and defines it as:
There might be some wiggle room in "physically attached", but there's none in "storage devices located inside the host computer". It's not some kind of AWS-only thing either. GCP has "local SSD disks"[1], which I'm going to claim are likewise local, not over-the-network block storage. (Though the language isn't as explicit as for AWS.)
[0] https://aws.amazon.com/ec2/instance-types/
[1] https://cloud.google.com/compute/docs/disks#localssds
That's the abstraction they want you to work with, yes. That doesn't mean it's what is actually happening - at least not in the same way that you're thinking.
As a hint for you, I said "a network", not "the network." You can also look at public presentations about how Nitro works.
it sounds like you're trying to say "PCI switch" without saying "PCI switch" (I worked at Google for over a decade, including hardware division).
That is what I am trying to say without actually spelling it out. PCIe switches are very much not transparent devices. Apparently AWS has not published anything about this, though, and doesn't have Nitro mediating access to "local" SSDs - I did get that confused with EBS.
Why are you acting as if PCIe switches are some secret technology? It was extremely grating for me to read your comments.
Although it had used them for years, the first public mention by Google of PCIe switches was probably in the 2022 Aquila paper, which doesn't really talk about storage anyway...
I don't understand why you would expect Google to state that. They have been standard technology for almost two decades. You don't see Google claiming they use JTAG or SPI flash or whatever. It's just not special.
Because the parent works/worked for Google, so obviously it must be super secret sauce that nobody has heard of. /s
Next up they're going to explain to us that iSCSI wants us to think it's SCSI but it's actually not!
AWS has stated that there is a "Nitro Card for Instance Storage"[0][1] which is a NVMe PCIe controller that implements transparent encryption[2].
I don't have access to an EC2 instance to check, but you should be able to see the PCIe topology to determine how many physical cards are likely in i4i and im4gn and their PCIe connections. i4i claims to have 8 x 3,750 AWS Nitro SSD, but it isn't clear how many PCIe lanes are used.
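If anyone does have an instance handy, a rough sysfs walk like this (plain Linux, nothing AWS-specific; device names are just examples) shows the chain of PCI hops between each NVMe controller and the root complex, which is exactly the "is there a switch in the way" question:

    # List NVMe controllers and the PCI addresses on their sysfs path.
    # Each component is a hop (root port, bridge/switch, endpoint);
    # a long chain suggests a bridge or switch in the path.
    import glob, os

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        dev = os.path.realpath(os.path.join(ctrl, "device"))
        hops = [p for p in dev.split("/") if p.count(":") == 2]
        print(os.path.basename(ctrl), " -> ".join(hops))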
Also, AWS claims "Traditionally, SSDs maximize the peak read and write I/O performance. AWS Nitro SSDs are architected to minimize latency and latency variability of I/O intensive workloads [...] which continuously read and write from the SSDs in a sustained manner, for fast and more predictable performance. AWS Nitro SSDs deliver up to 60% lower storage I/O latency and up to 75% reduced storage I/O latency variability [...]"
This could explain the findings in the article - they only measured peak r/w, not predictability.
[0] https://perspectives.mvdirona.com/2019/02/aws-nitro-system/ [1] https://aws.amazon.com/ec2/nitro/ [2] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Power...
Like many other people in this thread, I disagree that a PCI switch means that an SSD "is connected over a network" to the host bus.
Now if you can show me two or more hosts connected to a box of SSDs through a PCI switch (and some sort of cool tech for coordinating between the hosts), that's interesting.
I've linked to public documentation that is pretty clearly in conflict with what you said. There's no wiggle room in how AWS describes their service without it being false advertising. There's no "ah, but what if we define the entire building to be the host computer, then the networked SSDs really are inside the host computer" sleight of hand to pull off here.
You've provided cryptic hints and a suggestion to watch some unnamed presentation.
At this point I really think the burden of proof is on you.
You are correct, and the parent you’re replying to is confused. Nitro is for EBS, not the i3 local NVMe instances.
Those i3 instances lose your data whenever you stop and start them again (ie migrate to a different host machine), there’s absolutely no reason they would use network.
EBS itself uses a different network than the “normal” internet, if I were to guess it’s a converged Ethernet network optimized for iSCSI. Which is what Nitro optimizes for as well. But it’s not relevant for the local NVMe storage.
The argument could also be resolved by just getting the latency numbers for both cases and comparing them; on bare metal it shouldn't be more than a few hundred nanoseconds.
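Something like this (a rough sketch; Linux-only, needs root, and the device path is a placeholder) gives comparable uncached 4 KiB random-read latencies for an instance-store device vs an EBS volume:

    # Time uncached 4 KiB random reads against a block device.
    # O_DIRECT bypasses the page cache; buffer and offsets must be aligned.
    import mmap, os, random, statistics, time

    DEV = "/dev/nvme1n1"        # placeholder: the device under test
    BLOCK = 4096
    SAMPLES = 1000

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap gives a page-aligned buffer

    lat_us = []
    for _ in range(SAMPLES):
        off = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], off)
        lat_us.append((time.perf_counter() - t0) * 1e6)
    os.close(fd)

    lat_us.sort()
    print(f"p50={statistics.median(lat_us):.0f}us p99={lat_us[int(SAMPLES * 0.99)]:.0f}us")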
I see wiggle room in the statement you posted in that the SSD storage that is physically inside the machine hosting the instance might be mounted into the hypervised instance itself via some kind of network protocol still, adding overhead.
At minimum, the entire setup will be virtualized, which does add overhead.
Nitro "virtual NVME" device are mostly (only?) for EBS -- remote network storage, transparently managed, using a separate network backbone, and presented to the host as a regular local NVME device. SSD drives in instances such as i4i, etc. are physically attached in a different way -- but physically, unlike EBS, they are ephemeral and the content becomes unavaiable as you stop the instance, and when you restart, you get a new "blank slate". Their performance is 1 order of magnitude faster than standard-level EBS, and the cost structure is completely different (and many orders of magnitude more affordable than EBS volumes configured to have comparable I/O performance).
This is the way Azure temporary volumes work as well. They are scrubbed off the hardware once the VM that accesses them is dead. Everything else is over the network.
Both the documentation and Amazon employees are in here telling you that you're wrong. Can you resolve that contradiction or do you just want to act coy like you know some secret? The latter behavior is not productive.
The parent thinks that AWS' i3 NVMe local instance storage is using a PCIe switch, which is not the case. EBS (and the AWS Nitro card) use a PCIe switch, and as such all EBS storage is exposed as e.g. /dev/nvmeXnY . But that's not the same as the i3 instances are offering, so the parent is confused.
If the SSD is installed in the host server, doesn't that still allow for it to be shared among many instances running on said host? I can imagine that a compute node has just a handful of SSDs and many hundreds of instances sharing the I/O bandwidth.
How do these machines manage the sharing of one local SSD across multiple VMs? Is there some wrapper around the I/O stack? Does it appear as a network share? Genuinely curious...
With Linux and KVM/QEMU, you can map an entire physical disk, disk partition, or file to a block device in the VM. For my own VM hosts, I use LVM and map a logical volume to the VM. I assumed cloud providers did something conceptually similar, only much more sophisticated.
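For the curious, the bare-QEMU version of that is just pointing -drive at the logical volume (a minimal sketch; the paths, memory size, and LV name are made up, and real hosts usually drive this through libvirt instead):

    # Boot a KVM guest whose second disk is a raw LVM logical volume,
    # exposed to the guest as a virtio block device.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64", "-enable-kvm", "-m", "4096",
        "-drive", "file=guest-root.qcow2,format=qcow2,if=virtio",
        "-drive", "file=/dev/vg0/guest-data,format=raw,if=virtio,cache=none",
    ], check=True)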
Heh, you'd probably be surprised, there's some really cool cutting edge stuff being done in those data centers but a lot of what is done is just plain old standard server management without much in the way of tricks. It's just that someone else does it instead of you and the billing department is counting milliseconds.
Do cloud providers document these internals anywhere? I'd love to read about that sort of thing.
Not generally, especially not the super generic stuff. Where they really excel is having the guy that wrote the kernel driver or hypervisor on staff. But a lot of it is just an automated version of what you'd do on a smaller scale
Files with reflinks are a common choice, the main benefit being: only storing deltas. The base OS costs basically nothing
LVM/block like you suggest is a good idea. You'd be surprised how much access time is trimmed by skipping another filesystem like you'd have with a raw image file
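(On a reflink-capable filesystem like XFS or btrfs, that clone is a one-liner; a tiny sketch with made-up image paths:)

    # Copy-on-write clone of a base image: only blocks that later diverge
    # consume new space. Needs XFS/btrfs or another reflink-capable filesystem.
    import subprocess

    subprocess.run(
        ["cp", "--reflink=always", "/var/lib/images/base.img", "/var/lib/images/vm42.img"],
        check=True,
    )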
In say VirtualBox you can create a file backed on the physical disk, and attach it to the VM so the VM sees it as a NVMe drive.
In my experience this is also orders of magnitude slower than true direct access, i.e. PCIe pass-through, as all access has to pass through the VM storage driver, and so could explain what is happening.
The storage driver may have more impact on VBox. You can get very impressive results with 'virtio' on KVM
Yeah I've yet to try that. I know I get a similar lack of performance with Bhyve (FreeBSD) using VirtIO, so it's not a given it's fast.
I have no idea how AWS run their VMs, was just saying a slow storage driver could give such results.
Oh, absolutely - not to contest that! There's a whole lot of academia on 'para-virtualized' and so on in this light.
That's interesting to hear about FreeBSD; basically all of my experience has been with Linux/Windows.
Probably NVME namespaces [0]?
[0]: https://nvmexpress.org/resource/nvme-namespaces/
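On the host side the split is visible in sysfs; a quick sketch (standard Linux layout, nothing cloud-specific) that shows which namespaces each controller exposes:

    # Print each NVMe controller and its namespaces, e.g. nvme0 -> ['nvme0n1', 'nvme0n2'].
    import glob, os, re

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        name = os.path.basename(ctrl)
        namespaces = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(ctrl, name + "n*"))
            if re.fullmatch(name + r"n\d+", os.path.basename(p))
        )
        print(name, namespaces)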
Less fancy, quite often... at least on VPS providers [1]. They like to use reflinked files off the base images. This way they only store what differs.
1: Which is really a cloud without a certain degree of software defined networking/compute/storage/whatever.
AWS have custom firmware for at least some of their SSDs, so could be that
Instance storage is not networked. That's why it's there.
PCI bus, etc too
If you have one of the metal instance types, then you get the whole host, e.g. i4i.metal:
https://aws.amazon.com/ec2/instance-types/i4i/
On AWS yes, the older instances which I am familiar with had 900GB drives and they sliced that up into volumes of 600, 450, 300, 150, 75GB depending on instance size.
But they also tell you how much IOPS you get: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...
The tests were for these local (metal direct-connect) SSDs. The issue is not network overhead -- it's that, just like everything else in cloud, the performance of 10 years ago was used as the baseline that carries over today, with upcharges to buy back the gains.
There is a reason why vCPU performance is still locked to the typical core from 10 years ago when every core on a machine in those data centers today is 3-5x faster or more. It's because they can charge you for 5x the cores to get that gain.
vcpu performance is still locked to the typical core from 10 years ago
No. In some cases I think AWS actually buys special processors that are clocked higher than the ones you can buy.
You are talking about real CPU not virtual cpu
Generally each vCPU is a dedicated hardware thread, which has gotten significantly faster in the last 10 years. Only lambdas, micros, and nanos have shared vCPUs and those have probably also gotten faster although it's not guaranteed.
In fairness, there are a not insignificant number of workloads that do not benefit from hardware threads on CPUs [0], instead isolating processes along physical cores for optimal performance.
[0] Assertion not valid for barrel processors.
The parent claims that though aws uses better hardware, they bill in vcpus whose benchmarks are from a few years ago, so that they can sell more vcpu units per performant physical cpu. This does not contradict your claim that aws buys better hardware.
It's so obviously wrong that I can't really explain it. Maybe someone else can. To believe that requires a complete misunderstanding of IaaS.
That is transparently nonsense.
You can disprove that claim in 5 minutes, and it makes literally zero sense for offerings that aren't oversubscribed
AWS is so large, every concept of hardware is virtualized over a software layer. “Instance storage” is no different. It’s just closer to the edge with your node. It’s not some box in a rack where some AWS tech slots in an SSD. AWS has a hardware layer, but you’ll never see it.
Local SSD is part of the machine, not network attached.
You’re wrong. Instance local means SSD is physically attached to the droplet and is inside the server chassis, connected via PCIe.
Source: I work on Nitro cards.
"Attached to the droplet"?
Droplets are what EC2 calls their hosts. Confusing? I know.
Yes! That is confusing! Tell them to stop it!
FYI it's not a AWS term, it's a DigitalOcean term.
I could not be more confused. Does EC2 quietly call their hosting machines "droplets"? I knew "droplet" to be a DigitalOcean term, but DigitalOcean doesn't have Nitro cards.
Now I'm wondering if that's where DO got the name in the first place
Surely "droplet" is a derivative of "ocean?"
Clouds (like, the big fluffy things in the sky) are made up of many droplets of liquid. Using "droplet" to refer to the things that make up cloud computing is a pretty natural nickname for any cloud provider, not just DO. I do imagine that DO uses "droplet" as a public product branding because it works well with their "Ocean" brand, though.
...now I'm actually interested in knowing if "droplet" is derived from "ocean", or if "Digital Ocean" was derived from having many droplets (which was derived from cloud). Maybe neither.
Clouds are water vapor, not droplets.
“Cloud: Visible mass of liquid droplets or frozen crystals suspended in the atmosphere“
https://en.wikipedia.org/wiki/Cloud
I believe AWS was calling them droplets prior to digital ocean.
digitalocean squad
No, that’s AWS.
That is more than likely a team-specific term being used outside of its context. FYI, the only place where you will find the term "droplet" used is in the public-facing AWS EC2 API documentation under InstanceTopology:networkNodeSet[^1]. Even that reference seems like a slip of the tongue, but the GP did mention working on the Nitro team, which makes sense when you look at the EC2 instance topology[^2].
[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_I... [^2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-ec2-...
Depends on the cloud provider. Local SSDs are physically attached to the host on GCP, but that makes them only useful for temporary storage.
If you're at G, you should read the internal docs on exactly how this happens and it will be interesting.
Why would I lose all data on these SSDs when I initiate a power off of the VM on console, then?
I believe local SSDs are definitely attached to the host. They are just not exposed via NVMe ZNS hence the performance hit.
Your EC2 instance with instance-store storage, when stopped, can be launched on any other random host in the AZ when you power it back on. Your root disk is an EBS volume attached across the network, so when you start your instance back up you're likely going to be launched somewhere else, with an empty slot and empty local storage. This is why there is always a disclaimer that this local storage is ephemeral and you shouldn't count on it being around long-term.
I think the parent was agreeing with you. If the “local” SSDs _weren’t_ actually local, then presumably they wouldn’t need to be ephemeral since they could be connected over the network to whichever host your instance was launched on.
It is because on reboot you may not get the same physical server. They are not rebooting the physical server for you, just the VM.
You are not allocated the same machine for a variety of reasons: scheduled maintenance, proximity to other hosts on the VPC, balancing quiet and noisy neighbors, and so on.
It is not that the disk will always be wiped; sometimes the data is still there on reboot. It's just that there is no guarantee, which allows them to freely move instances between hosts.
Data persists between reboots, but not shutdowns:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
Why are you projecting Google's internal architecture onto AWS? Your Google mental model is not correct here.
In most cases, they're physically plugged into a PCIe CEM slot in the host.
There is no network in the way, you are either misinformed or thinking of a different product.
Which is a weird sort of limitation. For any sort of you-own-the-hardware arrangement, NVMe disks are fine for long term storage. (Obviously one should have backups, but that’s a separate issue. One should have a DR plan for data on EBS, too.)
You need to migrate that data if you replace an entire server, but this usually isn’t a very big deal.
This is Hyrum’s law at play: AWS wants to make sure that the instance stores aren’t seen as persistent, and therefore enforce the failure mode for normal operations as well.
You should also see how they enforce similar things for their other products and APIs, for example, most of their services have encrypted pagination tokens.
Yes, that's what their purpose is in cloud applications: temporary high performance storage only.
If you want long term local storage you'll have to reserve an instance host.
Do they do this because they want SSDs to be in a physically separate part of the building for operational reasons? Otherwise, what's the point in giving you a "local" SSD that isn't actually plugged into the real machine?
The reason for having most instances use network storage is that it makes possible migrating instances to other hosts. If the host fails, the network storage can be pointed at the new host with a reboot. AWS sends out notices regularly when they are going to reboot or migrate instances.
There probably should be more local instance storage types for use with instances that can be recreated without loss. But it is simple for them to have a single way of doing things.
At work, someone used fast NVMe instance storage for Clickhouse which is a database. It was a huge hassle to copy data when instances were going to be restarted because the data would be lost.
Sure, I understand that, but this user is claiming that on GCP even local SSDs aren't really local, which raises the question of why not.
I suspect the answer is something to do with their manufacturing processes/rack designs. When I worked there (pre GCP) machines had only a tiny disk used for booting and they wanted to get rid of that. Storage was handled by "diskful" machines that had dedicated trays of HDDs connected to their motherboards. If your datacenters and manufacturing processes are optimized for building machines that are either compute or storage but not both, perhaps the more normal cloud model is hard to support and that pushes you towards trying to aggregate storage even for "local" SSD or something.
The GCE claim is unverified. OP seems to be referring to PD-SSD and not LocalSSD
GCE local SSDs absolutely are on the same host as the VM. The docs [0] are pretty clear on this, I think:
Disclosure: I work on GCE.
[0] https://cloud.google.com/compute/docs/disks/local-ssd
They're claiming so, but they're wrong.
This post on how Discord RAIDed local NVMe volumes with slower remote volumes might be of interest: https://discord.com/blog/how-discord-supercharges-network-di...
We moved to running Clickhouse on EKS with EBS volumes for storage. It can better survive instances going down. I didn't work on it, so I don't know how much slower it is. Lowering the management burden was a big priority.
Are you saying that a reboot wipes the ephemeral disks? Or a stop and then start of the instance from the AWS console/API?
Reboot keeps the instance storage volumes. Stopping and starting wipes them. Starting frequently migrates you to a new host. And the "restart" notices AWS sends are likely because the host has a problem and they need to migrate it.
The comment you’re responding to is wrong. AWS offers many kinds of storage. Instance local storage is physically attached to the droplet. EBS isn’t but that’s a separate thing entirely.
I literally work in EC2 Nitro.
That seems like a big opportunity for other cloud providers. They could provide SSDs that are actually physically attached and boast (rightfully) that their SSDs are a lot faster, drawing away business from older cloud providers.
For what kind of workloads would a slower SSD be a significant bottleneck?
I run very large database-y workloads. Storage bandwidth is by far the throughput rate limiting factor. Cloud environments are highly constrained in this regard and there is a mismatch between the amount of CPU you are required to buy to get a given amount of bandwidth. I could saturate a much faster storage system with a fraction of the CPU but that isn’t an option. Note that latency is not a major concern here.
This has an enormous economic impact. I once did a TCO study with AWS to run data-intensive workload running on purpose-built infrastructure on their cloud. AWS would have been 3x more expensive per their own numbers, they didn’t even argue it. The main difference is that we had highly optimized our storage configuration to provide exceptional throughput for our workload on cheap hardware.
I currently run workloads in the cloud because it is convenient. At scale though, the cost difference to run it on your own hardware is compelling. The cloud companies also benefit from a learned helplessness when it comes to physical infrastructure. Ironically, it has never been easier to do a custom infrastructure build, which companies used to do all the time, but most people act like it is deep magic now.
Thanks for the details!
Does this mean you're colocating your own server in a data center somewhere? Or do you have your own data center/running it off a bare metal server with a business connection?
Just wondering if the TCO included the same levels of redundancy and bandwidth, etc.
We were colocated in large data centers right on the major IX with redundancy. All of this was accounted for in their TCO model. We had a better switch fabric than is typical for the cloud but that didn’t materially contribute to cost. We were using AWS for overflow capacity when we exceeded the capacity of our infrastructure at the time; they wanted us to move our primary workload there.
The difference in cost could be attributed mostly to the server hardware build, and to a lesser extent the better scalability with a better network. In this case, we ended up working with Quanta on servers that had everything we needed and nothing we didn’t, optimizing heavily for bandwidth/$. We worked directly with storage manufacturers to find SKUs that stripped out features we didn’t need and optimized for cost per byte given our device write throughput and durability requirements. They all have hundreds of custom SKUs that they don’t publicly list, you just have to ask. A hidden factor is that the software was designed to take advantage of hardware that most enterprises would not deign to use for high-performance applications. There was a bit of supply chain management but we did this as a startup buying not that many units. The final core server configuration cost us just under $8k each delivered, and it outperformed every off-the-shelf server for twice the price and essentially wasn’t something you could purchase in the cloud (and still isn’t). These servers were brilliant, bulletproof, and exceptionally performant for our use case. You can model out the economics of this and the zero-crossing shows up at a lower burn rate than I think many people imagine.
We were extremely effective at using storage, and we did not attach it to expensive, overly-powered servers where the CPUs would have been sitting idle anyway. The sweet spot was low-clock high-core CPUs, which are typically at a low-mid price point but optimal performance-per-dollar if you can effectively scale software to the core count. Since the software architecture was thread-per-core, the core count was not a bottleneck. The economics have not shifted much over time.
AWS uses the same pricing model as everyone else in the server leasing game. Roughly speaking, you model your prices to recover your CapEx in 6 months of utilization. Ignoring overhead, doing it ourselves pulled that closer to 1.5-2 months for the same burn. This moves a lot of the cost structure to things like power, space, and bandwidth. We definitely were paying more for space and power than AWS (usually less for bandwidth) but not nearly enough to offset our huge CapEx advantage relative to workload.
All of this can be modeled out in Excel. No one does it anymore but I am from a time when it was common, so I have that skill in my back pocket. It isn’t nearly as much work as it sounds like, much of the details are formulaic. You do need to have good data on how your workload uses hardware resources to know what to build.
And this is one of the big "secrets" of AWS' success: shifting a lot of resource allocation and power from people with budgeting responsibility to developers who have usually never seen the budget or accounts, don't keep track, and at most retrospectively get pulled in to explain line items in expenses, and obscuring it (to the point where I know people who've spent 6-figure amounts worth of dev time building analytics to figure out where their cloud spend goes... tooling has gotten better but is still awful).
I believe a whole lot of tech stacks would look very different if developers and architects were more directly involved in budgeting, and bonuses etc. were linked at least in part to financial outcomes affected by their technical choices.
A whole lot of claims of low cloud costs come from people who have never done actual comparisons and who seem to have a pathological fear of hardware, even though for most people you don't need to ever touch a physical box yourself - you can get maybe 2/3 of the savings with managed hosting as well.
You don't get the super-customized server builds, but you do get far more choice than with cloud providers, and you can often make up for the lack of fine-grained control by being able to rent/lease them somewhere where the physical hosting is cheaper (e.g. at a previous employer, what finally made us switch to Hetzner for most new capacity was that while we didn't get exactly the hardware we wanted, we got "close enough", coupled with data centre space in their German locations costing far less than data centre space in London - it didn't make them much cheaper, but it did make them sufficiently cheaper to outweigh the hardware differences, with enough margin for us to deploy new stuff there while still keeping some of our colo footprint).
I tend some workloads that transform data grids of varying sizes. The grids are anon mmaps, so that when memory runs out they get paged out. This means processing stays mostly in-mem yet won't abort when memory runs tight. The processes that get hit by paging slow to a crawl, though. Getting faster SSDs means they're still crawling, but crawling faster. Doubling SSD throughput would pretty much halve the tail latency.
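A stripped-down sketch of that pattern (the grid size is made up): anonymous mappings spill to swap instead of the process dying when RAM runs tight, so a faster SSD under swap directly shortens the paging tail.

    # Back a large working grid with an anonymous mapping. Pages are
    # allocated on first touch and can be paged out to swap under pressure.
    import mmap

    CELL = 8                       # bytes per cell (example)
    ROWS, COLS = 50_000, 10_000    # ~4 GB in this toy example; real grids exceed RAM

    grid = mmap.mmap(-1, ROWS * COLS * CELL)   # anonymous, demand-paged

    def cell(r, c):
        return (r * COLS + c) * CELL

    grid[cell(123, 456):cell(123, 456) + CELL] = (42).to_bytes(CELL, "little")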
I see. Thanks for explaining!
Pretty much all workloads; workloads that are not affected would be the exception.
This is already a thing. AWS instance store volumes are directly attached to the host. I’m pretty sure GCP and Azure also have an equivalent local storage option.
Next thing the other clouds will offer is cheaper bandwidth pricing, right?
I suspect you must be conflating several different storage products. Are you saying https://cloud.google.com/compute/docs/disks/local-ssd devices talk to the host through a network (say, ethernet with some layer on top)? Because the documentation very clearly says otherwise, "This is because Local SSD disks are physically attached to the server that hosts your VM. For this same reason, Local SSD disks can only provide temporary storage." (at least, I'm presuming that by physically attached, they mean it's connected to the PCI bus without a network in between).
I suspect you're thinking of SSD-PD. If "local" SSDs are not actually local and go through a network, I need to have a discussion with my GCS TAM about truth in advertising.
I don’t really agree with assuming the form of physical attachment and interaction unless it is spelled out.
If that’s what’s meant it will be stated in some fine print, if it’s not stated anywhere then there is no guarantee what the term means, except I would guess they may want people to infer things that may not necessarily be true.
"Physically attached" has had a fairly well defined meaning and i don't normally expect a cloud provider to play word salad to convince me a network drive is locally attached (like I said, if true, I would need to have a chat with my TAM about it).
Physically attached for servers, for the past 20+ years, has meant a direct electrical connection to a host bus (such as the PCI bus attached to the front-side bus). I'd like to see some alternative examples that violate that convention.
Ethernet cables are physical...
If that’s the game we’re going to play then technically my driveway is on the same road as the White House.
exactly. it's not about what's good for the consumer, it's about what they can do without losing a lawsuit for false advertising.
The NIC is attached to the host bus through the north bridge. But other hosts on the same ethernetwork are not considered to be "local". We don't need to get crazy about the semantics to know that when a cloud provider says an SSD is locally attached, it's closer than an ethernetwork away.
Believe it or not, superglue and a wifi module! /s
Local SSD is part of the machine.
For AWS there are EBS volumes attached through a custom hardware NVMe interface and then there's Instance Store which is actually local SSD storage. These are different things.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance...
EBS is also slower than local NVMe mounts on i3's.
Also, both features use Nitro SSD cards, according to AWS docs. The Nitro architecture is all locally attached -- instance storage to the instance, EBS to the EBS server.
What makes you think that?
I can attest to the fact that on EC2, "instance store" volumes are actually physically attached.
Do you have a link to explain this? I don't think it's true.
This is incorrect.
Amazon offers both locally-attached storage devices as well as instance-attached storage devices. The article is about the latter kind.
Nope! Well not as advertised. There are instances, usually more expensive ones, where there are supposed to be local NVME disks dedicated to the instance. You're totally right that providing good I/O is a big problem! And I have done studies myself showing just how bad Google Cloud is here, and have totally ditched Google Cloud for providing crappy compute service (and even worse customer service).
Instances can have block storage, which is network attached, or locally attached SSD/NVMe. It's 2 separate things.
At first you'd think maybe they could do a volume copy from a snapshot to a local drive on instance creation, but even at 100 Gbps you're looking at almost 3 minutes for a 2 TB drive.
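Back-of-the-envelope for that, assuming you could actually hold line rate the whole time:

    # 2 TB over a 100 Gbps link, ignoring protocol overhead.
    bytes_total = 2e12            # 2 TB
    link_bytes_per_s = 100e9 / 8  # 100 Gbps = 12.5 GB/s
    seconds = bytes_total / link_bytes_per_s
    print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # 160 s, roughly 2.7 minutes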
Could that have to do with every operation requiring a round trip, rather than being able to queue up operations in a buffer to saturate throughput?
It seems plausible if the interface protocol was built for a device it assumed was physically local and so waited for confirmation after each operation before performing the next.
In this case it's not so much the throughput rate that matters, but the latency -- which can also be heavily affected by buffering of other network traffic.
Underlying protocol limitations wouldn't be an issue - the cloud provider's implementation can work around that. They're unlikely to be sending sequential SCSI/NVMe commands over the wire - instead, the hypervisor pretends to be the NVME device, but then converts to some internal protocol (that's less chatty and can coalesce requests without waiting on individual ACKs) before sending that to the storage server.
The problem is that ultimately your application often requires the outcome of a given IO operation to decide which operation to perform next - let's say when it comes to a database, it should first read the index (and wait for that to complete) before it knows the on-disk location of the actual row data which it needs to be able to issue the next IO operation.
In this case, there's no other solution than to move that application closer to the data itself. Instead of the networked storage node being a dumb blob storage returning bytes, the networked "storage" node is your database itself, returning query results. I believe that's what RDS Aurora does for example, every storage node can itself understand query predicates.
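A toy illustration of that dependency (the on-disk format and offsets are invented): the second read can't be issued until the first one returns, so every lookup pays at least two full storage round trips, wherever the storage actually lives.

    # B-tree-ish lookup: read an index block, decode a pointer, then read the row.
    # The reads are strictly serial, so total latency ~= 2x the storage round trip.
    import os, struct

    def read_block(fd, offset, size=4096):
        return os.pread(fd, size, offset)

    def lookup(fd, index_offset):
        index_block = read_block(fd, index_offset)             # round trip #1
        (row_offset,) = struct.unpack_from("<Q", index_block)  # invented format: first 8 bytes
        return read_block(fd, row_offset)                      # round trip #2 depends on #1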
I've run CI/CD pipelines on EC2 machines with local storage, typically running RAID-0, btrfs, noatime. I didn't care if the filesystem got corrupt or whatever; I had a script that would rebuild it in under 30 mins. In addition to the performance, you're not paying by IOPS.
Why do they fundamentally need to be network attached storage instead of local to the VM?
They don't. Some cloud providers (e.g. Hetzner) let you rent VMs with locally attached NVMe, which is dramatically faster than network-attached even factoring in the VM tax.
Of course then you have a single point of failure, in the PCIe fabric of the machine you're running on if not the NVMe itself. But if you have good backups, which you should, then the juice really isn't worth the squeeze for NAS storage.
A network adds more points of failure. It does not reduce them.
A network attached, replicated storage hedges against data loss but increases latency; however most customers usually prefer higher latency to data loss. As an example, see the highly upvoted fly.io thread[1] with customers complaining about the same thing.
[1] https://news.ycombinator.com/item?id=36808296
Locally-attached, replicated storage also hedges against data loss.
RAID rebuild times make it an unviable option and customers typically expect problematic VMs to be live-migrated to other hosts with the disks still having their intended data.
The self hosted version of this is GlusterFS and Ceph, which have the same dynamics as EBS and its equivalents in other cloud providers.
With NVMe SSDs? What makes RAID unviable in that environment?
This depends, like all things.
When you say RAID, what level? Software-raid or hardware raid? What controller?
Let's take best-case:
RAID10, small enough (but many) NVMe drives and an LVM/Software RAID like ZFS, which is data aware so only rebuilds actual data: rebuilds will degrade performance enough potentially that your application can become unavailable if your IOPS are 70%+ of maximum.
That's an ideal scenario, if you use hardware raid which is not data-aware then your rebuild times depend entirely on the size of the drive being rebuilt and it can punish IOPs even more during the rebuild. But it will affect your CPU less.
There's no panacea. Most people opt for higher latency distributed storage where the RAID is spread across an enormous amount of drives, which makes rebuilds much less painful.
What I used to do was swap machines over from the one with failing disks to a live spare (slave in the old frowned upon terminology), do the maintenance and then replicate from the now live spare back if I had confidence it was all good.
Yes, it's costly having the hardware to do that, as it mostly meant multiple machines: I always wanted to be able to rebuild one whilst having at least two machines online.
If you are doing this with your own hardware it is still less costly than cloud even if it mostly sits idle.
Cloud is approx 5x sticker cost for compute if it's sustained.
Your discounts may vary, rue the day those discounts are taken away because we are all sufficiently locked in.
A network adds more points of failures but also reduces user-facing failures overall when properly architected.
If one CPU attached to storage dies, another can take over and reattach -- or vice-versa. If one network link dies, it can be rerouted around.
Using a SAN (which is what networked storage is, after all) also lets you get various "tricks" such as snapshots, instant migration, etc for "free".
Because even if you can squeeze 100TB or more of SSD/NVMe in a server, and there are 10 tenants using the machine, you're limited to 10TB as a hard ceiling.
What happens when one tenant needs 200TB attached to a server?
Cloud providers are starting to offer local SSD/NVMe, but you're renting the entire machine, and you're still limited to exactly what's installed in that server.
How is that different from how cores, mem and network bandwidth is allotted to tenants?
Because a fair number of customers spin up another image when cores/mem/bandwidth run low. Dedicated storage breaks that paradigm.
Also, to add: if I am on an 8-core machine and need 16, network storage can be detached from host A and connected to host B. With dedicated storage it must be fully copied over first.
It isn't. You could ask for network-attached CPUs or RAM. You'd be the only one, though, so in practice only network-attached storage makes sense business-wise. It also makes sense if you need to provision larger-than-usual amounts like tens of TB - these are usually hard to come by in a single server, but quite mundane for storage appliances.
Given AWS and GCP offer multiple sizes for the same processor version with local SSDs, I don't think you have to rent the entire machine.
Search for i3en API names and you'll see:
i3en.large, 2x CPU, 1250GB SSD
i3en.xlarge, 4x CPU, 2500GB SSD
i3en.2xlarge, 8x CPU, 2x2500GB SSD
i3en.3xlarge, 12x CPU, 7500GB SSD
i3en.6xlarge, 24x CPU, 2x7500GB SSD
i3en.12xlarge, 48x CPU, 4x7500GB SSD
i3en.24xlarge, 96x CPU, 8x7500GB SSD
i3en.metal, 96x CPU, 8x7500GB SSD
So they've got servers with 96 CPUs and 8x7500GB SSDs. You can get a slice of one, or you can get the whole one. All of these are the ratio of 625GB of local SSD per CPU core.
https://instances.vantage.sh/
On GCP you can get a 2-core N2 instance type and attach multiple local SSDs. I doubt they have many physical 2-core Xeons in their datacenters.
Link to this mythical hosting service that expects far less than 200TB of data per client but just pulls a sad face and takes the extra cost on board when a client demands it. :D
Redundancy, local storage is a single point of failure.
You can use local SSDs as slow RAM, but anything on them can go away at any moment.
I've seen SANs get nuked by operator error or by environmental issues (overheated DC == SAN shuts itself down).
Distributed clusters of things can work just fine on ephemeral local storage (aka local storage). A kafka cluster or an opensearch cluster will be fine using instance local storage, for instance.
As with everything else.... "it depends"
Sure distributed clusters get back to network/workload limitations.
These days it's likely that your SAN is actually just a cluster of commodity hardware where the disks/SSDs have custom firmware and some fancy block shoveling software.
Reliability. SSDs break and screw up a lot more frequently and more quickly than CPUs. Amazon has published a lot on the architecture of EBS, and they go through a good analysis of this. If you have a broken disk and you locally attach, you have a broken machine.
RAID helps you locally, but fundamentally relies on locality and low latency (and maybe custom hardware) to minimize the time window where you get true data corruption on a bad disk. That is insufficient for cloud storage.
Sure, but there's plenty of software that's written to use distributed unreliable storage similar to how cloud providers write their own software (e.g. Kafka). I can understand if many applications are just need something like EBS that's durable but looks like a normal disk, but not so sure it's a fundamentally required abstraction.
The major clouds do offer VMs with fast local storage, such as SSDs connected by NVMe connections directly to the VM host machine:
- https://cloud.google.com/compute/docs/disks/local-ssd
- https://learn.microsoft.com/en-us/azure/virtual-machines/ena...
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst...
They sell these VMs at a higher cost because it requires more expensive components and is limited to host machines with certain configurations. In our experience, it's also harder to request quota increases to get more of these VMs -- some of the public clouds have a limited supply of these specific types of configurations in some regions/zones.
As others have noted, instance storage isn't as dependable. But it can be the most performant way to do IO-intense processing or to power one node of a distributed database.
So much of this. The number of times I've seen someone complain about slow DB performance when they're trying to connect to it from a different VPC, bottlenecking themselves to 100 Mbit, is stupidly high.
It literally depends on where things are in a data center... If you're closely coupled, on a 10G line on the same switch, going to the same server rack, I bet performance will be so much more consistent.
I thought cloud was supposed to abstract this away? That's a bit of a sarcastic question from a long-time cloud skeptic, but... wasn't it?
Reality always beats the abstraction. After all, it's just somebody else's computer in somebody else's data center.
Which can cause considerable "amusement" depending on the provider - one I won't name directly, but which is much more centered on actually renting racks than on their (now) cloud offering - if you had a virtual machine older than a year or so, deleting and restoring it would get you on a newer "host" and you'd be faster for the same cost.
Otherwise it'd stay on the same physical piece of hardware it was allocated to when new.
Amusing is a good description.
"Hardware degradation detected, please turn it off and back on again"
I could do a migration with zero downtime in VMware for a decade but they can't seamlessly move my VM to a machine that works in 2024? Great, thanks. Amusing.
Cloud providers have live migration now but I guess they don't want to guarantee anything.
It's better (and better still with other providers) but I naively thought that "add more RAM" or "add more disk" was something they would be able to do with a reboot at most.
Nope, some require a full backup and restore.
Resizing VMs doesn't really fit the "cattle" thinking of public cloud, although IMO that was kind of a premature optimization. This would be a perfect use case for live migration.
I have always been incredibly saddened that apparently the cloud providers usually have nothing as advanced as old VMware was.
Cloud makes provisioning more servers quicker because you are paying someone to basically have a bunch of servers ready to go right away with an API call instead of a phone call, maintained by a team that isn’t yours, with economies of scale working for the provider.
Cloud does not do anything else.
None of these latency/speed problems are cloud-specific. If you have on-premise servers and you are storing your data on network-attached storage, you have the exact same problems (and also the same advantages).
Unfortunately the gap between local and network storage is wide. You win some, you lose some.
Oh, I'm not a complete neophyte (in what seems like a different life now, I worked for a big hosting provider actually), I was just surprised that there was a big penalty for cross-VPC traffic implied by the parent poster.
It's more of a matter of adding additional abstraction layers. For example, in most public clouds the best you can hope for is to place two things in the same availability zone to get the best performance. But when I worked at Google, internally they had more sophisticated colocation constraints than that: for example, you could require two things to be on the same rack.
Aren’t 10G and 100G connections standard nowadays in data centers? Heck, I thought they were standard 10 years ago.
Bandwidth-delay product does not help serialized transactions. If you're reaching out to disk for results, or if you have locking transactions on a table, the achievable operation rate drops dramatically as latency between the host and the disk increases.
The typical way to trade bandwidth away for latency would, I guess, be speculative requests. In the CPU world at least. I wonder if any cloud providers have some sort of framework built around speculative disk reads (or maybe it is a totally crazy trade to make in this context)?
Often times it’s the app (or something high level) that would need speculative requests, which it may not be possible in the given domain.
I don’t think it’s possible in most domains.
I mean we already have readahead in the kernel.
This said the problem can get more complex than this really fast. Write barriers for example and dirty caches. Any application that forces writes and the writes are enforced by the kernel are going to suffer.
The same is true for SSD settings. There are a number of tweakable values on SSDs when it comes to write commit and cache usage which can affect performance. Desktop OS's tend to play more fast and loose with these settings and servers defaults tend to be more conservative.
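And beyond the kernel's automatic readahead, an application that knows what it will need next can ask for prefetch explicitly; a small sketch (the file path and range are placeholders):

    # Hint the kernel to start reading a range we'll need soon, then keep working;
    # later reads of that range should mostly hit the page cache.
    import os

    fd = os.open("/var/lib/myapp/table.dat", os.O_RDONLY)    # placeholder path
    os.posix_fadvise(fd, 0, 64 * 1024 * 1024, os.POSIX_FADV_WILLNEED)
    # ... unrelated work happens here ...
    data = os.pread(fd, 4096, 10 * 1024 * 1024)
    os.close(fd)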
You'd need the whole stack to understand your data format in order to make speculative requests useful. It wouldn't surprise me if cloud providers indeed do speculative reads but there isn't much they can do to understand your data format, so chances are they're just reading a few extra blocks beyond where your OS read and are hoping that the next OS-initiated read will fall there so it can be serviced using this prefetched data. Because of full-disk-encryption, the storage stack may not be privy to the actual data so it couldn't make smarter, data-aware decisions even if it wanted to, limiting it to primitive readahead or maybe statistics based on previously-seen patterns (if it sees that a request for block X is often followed by block Y, it may choose to prefetch that next time it sees block X accessed).
A problem in applications such as databases is when the outcome of an IO operation is required to initiate the next one - for example, you must first read an index to know the on-disk location of the actual row data. This is where the higher latency absolutely tanks performance.
A solution could be to make the storage drives smarter - have an NVME command that could say like "search in between this range for this byte pattern" and one that can say "use the outcome of the previous command as a the start address and read N bytes from there". This could help speed up the aforementioned scenario (effectively your drive will do the index scan & row retrieval for you), but would require cooperation between the application, the filesystem and the encryption system (typical, current FDE would break this).
400G is a fairly normal thing in DCs nowadays.
Datacenters are up to 400 Gbps and beyond (many places are adopting 1+ Tbps on core switching).
However, individual servers may still operate at 10, 25, or 40 Gbps to save cost on the thousands of NICs in a row of racks. Alternatively, servers with multiple 100G connections split that bandwidth allocation up among dozens of VMs so each one gets 1 or 10G.
Yes, but you have to think about contention. Whilst the top-of-rack might have 2x400 gig links to the core, that's shared with the entire rack, and all the other machines trying to shout at the core switching infra.
Then stuff goes away, or route congested, etc, etc, etc.
Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This ~2020-I-think version of the "Latency Numbers Everyone Should Know" says 0.5 ms round trip (and mentions "10 Gbps network" on another line). [1] It was the same thing in a 2012 version (that only mentions "1 Gbps network"). [2] Why no improvement? I think that 2020 version might have been a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but still I think the round trip actually is strangely bad.
I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something at the machine/OS interaction area. I would design large distributed systems significantly differently (be much more excited about extra tiers in my stack) if the standard RPC system offered say 10 µs typical round trip latency.
[1] https://static.googleusercontent.com/media/sre.google/en//st...
[2] https://gist.github.com/jboner/2841832
That document is probably deliberately on the pessimistic side to encourage your code to be portable across all kinds of "data centers" (however that is defined). When I previously worked at Google, the standard RPC system definitely offered 50 microseconds of round trip latency at the median (I measured it myself in a real application), and their advanced user-space implementation called Snap could offer about 10 microseconds of round trip latency. The latter figure comes from page 9 of https://storage.googleapis.com/gweb-research2023-media/pubto...
Google exceeded 100Gbps per machine long before 2024. IIRC it had been 400Gbps for a while.
Interesting. I worked at Google until January 2021. I see 2019 dates on that PDF, but I wasn't aware of snap when I left. There was some alternate RPC approach (Pony Express, maybe? I get the names mixed up) that claimed 10 µs or so but was advertised as experimental (iirc had some bad failure modes at the time in practice) and was simply unavailable in many of the datacenters I needed to deploy in. Maybe they're two names for the same thing. [edit: oh, yes, starting to actually read the paper now, and: "Through Snap, we created a new communication stack called Pony Express that implements a custom reliable transport and communications API."]
Actual latency with standard Stubby-over-TCP and warmed channels...it's been a while, so I don't remember the number I observed, but I remember it wasn't that much better than 0.5 ms. It was still bad enough that I didn't want to add a tier that would have helped with isolation in a particularly high-reliability system.
Snap was the external name for the internal project known as User Space Packet Service (abbreviated USPS) so naturally they renamed it prior to publication. I deployed an app using Pony Express in 2023 and it was available in the majority of cells worldwide. Pony Express supported more than just RPC though. The alternate RPC approach that you spoke of was called Void. It had been experimental for a long time and indeed it wasn't well known even inside Google.
If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…
Interesting, thanks!
I believe you, and I think in principle we should all be getting the 50 µs latency you're describing within a datacenter with no special effort.
...but it doesn't match what I observed, and I'm not sure why. Maybe difference of a couple years. Maybe I was checking somewhere with older equipment, or some important config difference in our tests. And obviously my memory's a bit fuzzy by now but I know I didn't like the result I got.
50 microseconds is a lot. I'm looking at disk read latency on a bunch of baremetal servers (nothing fancy, just node_disk_read.* metrics from node-exporter) and one of the slowest fleets has a median disk read latency barely above 1 microsecond. (And that's with rather slow HDDs.)
Samsung 980 Pro SSDs, the early-generation ones (they seem to have later replaced them with a different, likely worse architecture), have average 4k read latencies of roughly 30 / 70 / 120 microseconds for, respectively: a single queue, 70% of the IOPS that the 120 us case gets you, and maximum parallelism before latency goes through the roof.
The metrics you mention have to be pagecache hits. Basically all MLC NAND is in the double digit microseconds for uncontended random reads.
They likely are cache hits, indeed (any suggestion what other metrics would be more comparable?). Still, at the end of the day I don't care whether a disk operation was made fast by kernel caching or by some other optimization at a lower level, I only care about the final result. With public cloud virtualization there are more layers where something may go wrong, and good luck finding answers from Amazon or Microsoft if your performance turns out to be abysmal.
With such speeds, and CXL gaining traction (think RAM and GPUs over the network), why is networked SSD still an issue? You could have one storage server per rack that would serve storage only for that particular rack.
You could easily have ~40 GB/s with some overprovisioning/bucketing.
Networks are not reliable, despite what you hear, so latency is used to mask re-tries and delays.
The other thing to note is that big inter-DC links are heavily QoS'd and contended, because they are both expensive and a bollock to maintain.
Also, from what I recall, 40 gig links are just parallel 10 gig links, so have no lower latency. I'm not sure if 100/400 gig are ten/forty lanes of ten gig in parallel or actually able to issue packets at 10/40 times the rate of a ten gig link. I've been away from networking too long.
Of course, but even the 50%ile case is strangely slow, and if that involves retries something is deeply wrong.
You're right, but TCP doesn't like packets being dropped halfway through a stream. If you have a highly QoS'd link then you'll see latency spikes.
Again, I'm not talking about spikes (though better tail latency is always desirable) but poor latency in the 50%ile case. And for high-QoS applications, not batch stuff. The snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect close to that with standard kernel networking and TCP.
You can get similar results by looking at comparisons between DPDK and kernel networking. Most of the usual gap comes from not needing to context-switch for kernel interrupt handling, zero-copy abstractions, and busy polling (wherein you trade CPU for lower latency instead of sleeping between iterations if there's no work to be done).
https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc... goes into some amount of detail comparing request throughput of an unoptimized kernel networking stack, optimized kernel networking stack, and DPDK. I'm not aware of any benchmarks (public or private) comparing Snap vs DPDK vs Linux, so that's probably as close as you'll get.
Thanks for the link. How does this compare to the analogous situation for SSD access? I know there are also userspace IO stacks for similar reasons, but it seems like SSD-via-kernel is way ahead of network-via-kernel in the sense that it adds less latency per operation over the best userspace stack.
Great reading, thanks for the link on vanilla vs dpdk
40gig links are just parallel 10 gig links, so have no lower latency
That's not correct. Higher link speeds do have lower serialization latency, although that's a small fraction of overall network latency.
Modern data center networks don't have full cross connectivity. Instead they are built using graphs and hierarchies that provide less than the total bandwidth required for all pairs of hosts to be communicating. This means, as workloads start to grow and large numbers of compute hosts demand data IO to/from storage hosts, the network eventually gets congested, which typically exhibits as higher latencies and more dropped packets. Batch jobs are often relegated to "spare" bandwidth while serving jobs often get dedicated bandwidth
At the same time, ethernetworks with layered network protocols on top typically have a fair amount of latency overhead, that makes it much slower than bus-based direct-host-attached storage. I was definitely impressed at how quickly SSDs reached and then exceeded SATA bandwidth. nvme has made a HUGE difference here.
How much faster would the network need to get, in order to meet (or at least approach) the speed of a local SSD? are we talking about needing to 2x or 3x the speed, or by factors of hundreds or thousands?
The problem isn't necessarily speed, it's random access latency. What makes SSDs fast and "magical" is their low random-access latency compared to a spinning disk. The sequential-access read speed is merely a bonus.
Networked storage negates that significantly, absolutely killing performance for certain applications. You could have a 100Gbps network and it still won't match a direct-attached SSD in terms of latency (it can only match it in terms of sequential access throughput).
For many applications, such as databases, random access is crucial, which is why today's mid-range consumer hardware often outperforms hosted databases such as RDS, unless they're so overprovisioned on RAM that the dataset is effectively always in memory.
Um... why the hell does the network care whether I am doing random or sequential access? You left that part out of your argument.
Ah sorry, my bad. You are correct that you can fire off many random access operations in parallel and get good throughput that way.
The problem is that this is not possible when the next IO request depends on the result of a previous one, like in a database where you must first read the index to know the location of the row data itself.
OK thanks yes that makes sense. Pipelining problems are real.
(The network indeed doesn't care, but the bandwidth of dependent, as opposed to independent, accesses depends on latency.)
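A back-of-the-envelope model of why that matters, using assumed (not measured) latencies:

    # Assumed latencies for illustration only.
    local_s = 100e-6    # ~100 us local NVMe random read
    remote_s = 600e-6   # assumed network-attached random read
    queue_depth = 64    # independent requests kept in flight

    print(queue_depth / remote_s)  # independent reads: ~107k IOPS, latency is hidden
    print(1 / remote_s)            # dependent reads (index, then row): ~1.7k IOPS
    print(1 / local_s)             # same dependent pattern locally: ~10k IOPS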
100 Gbps direct shouldn't be too bad, but it might be difficult to get anyone to sell it to you for exclusive usage in a VM...
The Samsung 990 in my desktop provides ~3.5 GB/s streaming reads, ~2 GB/s 4k random-access reads, all at a latency measured at around 20-30 microseconds. My exact numbers might be a little off, but that's the ballpark you're looking at, and a 990 is a relatively cheap device.
10GbE is about the best you can hope for from a local network these days, but that's 1/5th the bandwidth and many times the latency. 100GbE would work, except the latency would still mean any read dependencies would be far slower than local storage, and I'm not sure there's much to be done about that; at these speeds the physical distance matters.
In practice I'm having to architect the entire system around the SSD just to not bottleneck it. So far ext4 is the only filesystem that even gets close to the SSD's limits, which is a bit of a pity.
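As a rough sanity check on the numbers above (treating the ~2 GB/s of 4 KiB random reads and ~25 µs latency as approximate), Little's law says the drive needs roughly a dozen reads in flight to hit that throughput:

    throughput = 2e9      # ~2 GB/s of 4 KiB random reads (quoted above)
    block = 4096          # 4 KiB per read
    latency = 25e-6       # ~25 us per read (quoted above)

    iops = throughput / block      # ~488k IOPS
    in_flight = iops * latency     # ~12 outstanding reads (Little's law)
    print(round(iops), round(in_flight, 1))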
Networking doesn't have to have high latency. You can buy network hardware that is able to provide sub-microsecond latency. Physical distance still matters, but 10% of typical NVMe latency gets you through a kilometer of fiber.
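A quick check on that kilometer figure, assuming light in fiber travels at roughly 2/3 c and an NVMe random read takes on the order of 100 µs:

    c = 3e8                    # m/s in vacuum
    v_fiber = 2 / 3 * c        # ~2e8 m/s in glass
    one_way = 1000 / v_fiber   # ~5 us per km
    round_trip = 2 * one_way   # ~10 us
    print(round_trip / 100e-6) # ~0.1 of an assumed 100 us NVMe random read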
Around 4x-10x depending on how many SSDs you want. A single SSD is around the speed of a 100 Gbps Ethernet link.
SATA3 is 6 Gbit/s, so multiply that by the number of VMs on a machine. For NVMe, it's probably closer to 4-5x that. You'd need some serious interconnects to give a server rack access to un-bottlenecked SSD storage.
Dumb question. Why does the network have to be slow? If the SSDs are two feet away from the motherboard and there's an optical connection to it, shouldn't it be fast? Are data centers putting SSDs super far away from motherboards?
This is the theory that I would bet on because it lines up with their bottom line.
But the sentence right after undermines it.
Faster read speeds would give them a more enticing product without wearing drives out.
They may be limiting read speeds artificially to increase your resource utilization elsewhere. If you have a disk bottleneck, you are more likely to use more instances. It is still about the bottom line.
That could be. But it's a completely different reason. If you summarize everything as "bottom line", you lose all the valuable information.
It's not the network being slow; it's that dividing the available network bandwidth among all users, while also reliably distributing the written data to multiple nodes so that one tenant doesn't hog resources, is quite challenging. The pricing structure is meant to control resource usage; the exact prices and how much profit AWS or any other cloud provider makes is a separate discussion.
What happens when your vm is live migrated 1000 feet away or to a different zone?
This is untrue of Local SSD (https://cloud.google.com/local-ssd) in Google Cloud. Local SSDs are PCIe peripherals, not network attached.
There are also multiple Persistent Disk (https://cloud.google.com/persistent-disk) offerings that are backed by SSDs over the network.
(I'm an engineer on GCE. I work directly on the physical hardware that backs our virtualization platform.)
It's notable that your second link has a screenshot for 24(!) NVMe SSDs totalling 9 terabytes, but the aggregate performance is 2.4M IOPS and 9.3 GB/s for reads. In other words, just ~100K IOPS and ~400 MB/s per individual SSD, which is very low these days.
For comparison, a single 1 TB consumer SSD can deliver comparable numbers (lower IOPS but higher throughput).
If I plugged 24 consumer SSDs into a box, I would expect over 30M IOPS and near the memory bus limit for throughput (>50 GB/s).
There's a quadrant of the market which is poorly served by the Cloud model of elastic compute: persistent local SSDs across shutdown and restart.
Elastic compute means you want to be able to treat compute hardware as fungible. Persistent local storage makes that a lot harder because the Cloud provider wants to hand out that compute to someone else after shutdown, so the local storage needs to be wiped.
So you either get ephemeral local SSDs (and have to handle rebuild on restart yourself) or network-attached SSDs with much higher reliability and persistence, but a fraction of the performance.
Active instances can be migrated, of course, with sufficient cleverness in the I/O stack.
GCE fares a little better than this:
- VMs with SSDs can (in general -- there are exceptions for things like GPUs and exceptionally large instances) live migrate with contents preserved.
- GCE supports a timeboxed "restart in place" feature where the VM stays in limbo ("REPAIRING") for some amount of time waiting for the host to return to service: https://cloud.google.com/compute/docs/instances/host-mainten.... This mostly only applies to transient failures like power-loss beyond battery/generator sustaining thresholds, software crashes, etc.
- There is a related feature, also controlled by the `--discard-local-ssd=` flag, which allows preservation of local SSD data on a customer initiated VM stop.
I should've aimed for more clarity in my original comment -- the first link is to locally attached storage. The second is network attached storage (what the GP was likely referring to, but not what is described in the article).
Persistent Disk is not backed by single devices (even for a single NVMe attachment), but by multiple redundant copies spread across power and network failure domains. Those volumes will survive the failure of the VM to which they are attached as well as the failure of any individual volume or host.
Makes me wonder if we're on the crux of a shift back to client-based software. Historically changes in the relative cost of computing components have driven most of the shifts in the computing industry. Cheap teletypes & peripherals fueled the shift from batch-processing mainframes to timesharing minicomputers. Cheap CPUs & RAM fueled the shift from minicomputers to microcomputers. Cheap and fast networking fueled the shift from desktop software to the cloud. Will cheap SSDs & TPU/GPUs fuel a shift back toward thicker clients?
There are a bunch of supporting social trends toward this as well. Renewed emphasis on privacy. Big Tech canceling beloved products, bricking devices, and generally enshittifying everything - a lot of people want locally-controlled software that isn't going to get worse at the next update. Ever-rising prices which make people want to lock in a price for the device and not deal with increasing rents for computing power.
I think a major limiting factor here for many applications is that mobile users are a huge portion of the user base. In that space storage, and more importantly battery life, are still at a premium. Granted the storage cost just seems to be gouging from my layman’s point of view, so industry needs might force a shift upwards.
Mobile devices are the desktop computers of the 2010s though. They are mostly used with very thick clients.
Not on AWS. Instance stores (what the article is about) are physical local disks.
Same for Google (the GP is incorrect about how GCE Local SSD works).
*As implemented in the public cloud providers.
You can absolutely get better than local disk speeds from SAN devices and we've been doing it for decades. To do it on-prem with flash devices will require NVMe over FC or Ethernet and an appropriate storage array. Modern all-flash array performance is measured in millions of IOPS.
Will there be a slight uptick in latency? Sure, but it's well worth it for the data services and capacity of an external array for nearly every workload.
A quick comparison of marketing slides lands you at 0.5 MOPS for FC-NVMe on a QLogic 2770 [2] and 0.7 MOPS for a PCIe Micron 9100 PRO [1], so the better speeds are not quite there, although spending quite a lot on server gear lands you near the performance of a workstation-grade drive from 2017.
Which is still not bad, when I was shopping around in 2018 no money could buy performance comparable to a locally-attached NVMe in a more professional/datacenter-ready form.
[1] https://media-www.micron.com/-/media/client/global/documents...
[2] https://www.marvell.com/content/dam/marvell/en/public-collat...
If the local drives are network drives (e.g. SAN), then why are they ephemeral?
live vm migrations, perhaps
No they don't. I work for a cloud provider and I can guarantee that your SSD is local to your VM.
This is also true for GCE Local SSD: https://cloud.google.com/local-ssd
The GP is incorrect.
No, the i3/i4 VMs discussed in the blog have local SSDs. The network isn't the reason local SSDs are slow.
In other words, they're saving money. This is a fundamental problem with cloud providers. Value created from technological innovation is captured by the cloud provider and bare minimum is shared to reduce prices. The margins are ridiculous.
It does not fundamentally have to be. That's an architectural choice driven by cloud providers backing off from the "instances can die on you" choice that AWS started with and then realized customers struggled with, towards attempting to keep things running in the face of many types of hardware failures.
Which is ironic given that when building on-prem/colo'ed setups you'll replicate things to be prepared for unknown lengths of downtime while equipment is repaired or replaced, so this was largely cloud marketing coming back to bite cloud providers' collective asses. Not wanting instances to die "randomly" for no good reason does not always mean wanting performance sacrifices for the sake of more resilient instances.
But AWS at least still offers plenty of instances with instance storage.
If I'm setting up my own database cluster, while I don't want it running on cheap consumer-grade hardware without dual power supplies and RAID, I also don't want to sacrifice SSD speed for something network-attached to survive a catastrophic failure when I'm going to have both backups, archived shipped logs and at least one replica anyway.
I'd happily pick network-attached storage for many things if it gets me increased resilience, but selling me a network-attached SSD, unless it replicates local SSD performance characteristics, is not competitive for applications where performance matters and I'm set up to easily handle system-wide failures anyway.
SANs can still be quite fast, and instance storage is fast; both are available from cloud providers.
Forgive me if this is stupid, but when Netflix was doing all their edge content boxes, putting machines much more latency-close to customers... could this model work for SSDs in an SScDN type of network (client -> CDN -> SScDN -> SSD)?
Can you expound?
Once again getting a sensible chuckle on HN listening to the cloud crowd whine and complain about what we used to call "the SAN being slow", with the architects' argument that "the SSDs will make things faster".
So do cloud vendors simply not use fast SSDs? If so I would expect the SSD manufacturers themselves to work on this problem. Perhaps they already are.
This is an interesting benchmark that compares disk read latency across different clouds and various disk configurations, including SSDs and EBS: https://github.com/scylladb/diskplorer
Google's disks perform quite poorly.
And how Discord worked around it: https://discord.com/blog/how-discord-supercharges-network-di...
I can see network attached SSDs having poor latency, but shouldn’t the networking numbers quoted in the article allow for higher throughput than observed?
Yeah this was my impression.
I am but an end user, but I noticed that disk IO for a certain app was glacial compared to a local test deployment, and I chalked it up to networking/VM overhead
I'm not sure which external or internal product you're talking about, but there are no networks involved for Local SSD on GCE: https://cloud.google.com/compute/docs/disks/local-ssd
Are you referring to PD-SSD? Internal storage usage?
Kind of funny, but we use a similar idea in Azure.
They don't have to be. Architecturally there are many benefits to storage area networks, but I have built plenty of systems which are self-contained: download a dataset to a cloud instance with a direct-attached SSD, load it into a database, and serve it a different way.
Even assuming that "local" storage is a lie, hasn't the network gotten a lot faster? The author is only asking for a 5x increase at the end of the post.