
S3 is files, but not a filesystem

leetrout
61 replies
12h37m

My big pet peeve is AWS adding buttons in the UI to make "folders".

It is also a fiction! There are no folders in S3.

When you create a folder in Amazon S3, S3 creates a 0-byte object with a key that's set to the folder name that you provided. For example, if you create a folder named photos in your bucket, the Amazon S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
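
A minimal boto3 sketch of what that "Create folder" button amounts to (bucket name hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # The "folder" is nothing more than an empty object whose key ends in "/".
    s3.put_object(Bucket="my-bucket", Key="photos/", Body=b"")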

riehwvfbk
37 replies
12h28m

Is that really so different from how folders work on other systems? A directory inode is just an inode.

klodolph
30 replies
12h14m

Yes. It is, in practice, incredibly different.

Imagine you have a file named /some/dir/file.jpg.

In a filesystem, there’s an inode for /some. It contains an entry for /some/dir, which is also an inode, and then in the very deepest level, there is an inode for /some/dir/file.jpg. You can rename /some to /something_else if you want. Think of it kind of like a table:

  +-------+--------+----------+-------+
  | inode | parent |     name |  data |
  +-------+--------+----------+-------+
  |     1 | (null) |     some | (dir) |
  |     2 |      1 |      dir | (dir) |
  |     3 |      2 | file.jpg |  jpeg |
  +-------+--------+----------+-------+
In S3 (and other object stores), the table is like this:

  +-------------------+------+
  | key               | data |
  +-------------------+------+
  | some/dir/file.jpg | jpeg |
  +-------------------+------+
The kinds of queries you can do are completely different. There are no inodes in S3. There is just a mapping from keys to objects. There’s an index on these keys, so you can do queries—but the / character is NOT SPECIAL and does not actually have any significance to the S3 storage system and API. The / character only has significance in the UI.

You can, if you want, use a completely different character to separate “components” in S3, rather than using /, because / is not special. If you want something like “some:dir:file.jpg” or “some.dir.file.jpg” you can do that. Again, because / is not special.
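
To make that concrete, here is a hedged boto3 sketch (bucket and keys are hypothetical) that uses ":" as the separator and still gets "folder"-style grouping back from the API:

    import boto3

    s3 = boto3.client("s3")

    # Keys that use ":" instead of "/" as the separator.
    for key in ["some:dir:file.jpg", "some:dir:other.jpg", "some:misc.txt"]:
        s3.put_object(Bucket="my-bucket", Key=key, Body=b"data")

    # Ask the API to group on ":" -- S3 neither knows nor cares that these
    # aren't "real" directories.
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="some:", Delimiter=":")
    print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])  # ['some:dir:']
    print([o["Key"] for o in resp.get("Contents", [])])           # ['some:misc.txt']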

riehwvfbk
10 replies
12h1m

Thank you, now I understand what the special 0-byte object refers to. It represents an empty folder.

Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.

inkyoto
5 replies
10h55m

[…] what the special 0-byte object refers to. It represents an empty folder.

Alas, no. It represents a tag, e.g. «folder/», that points to a zero byte object.

You can then upload two files, e.g. «folder/file1.txt» and «folder/file2.txt», delete the «folder/», being a tag, and still have the «folder/file1.txt» and «folder/file2.txt» files intact in the S3 bucket.

Deleting «folder/» in a traditional file system, on the other hand, will also delete «file1.txt» and «file2.txt» in it.

dchest
3 replies
9h49m

It's a matter of client UI implementation. You can't delete a non-empty folder with the POSIX API on common filesystems, or over FTP, either.

However, there are file managers, FTP clients, and S3 clients that will do that for you by deleting individual files.

_flux
2 replies
7h51m

But if the S3 semantics are not helping you, e.g. with multiple clients doing copy/move/delete operations in the hierarchy, you could still end up with files that are not in "directories".

So essentially an S3 file manager must be able to handle the situation where there are files without a "directory"—and that, I assume, is also the most common case for S3. You might just not have the "directories" in the first place.

klodolph
1 replies
4h30m

I have personally never seen the 0-byte files people keep talking about here. In every S3 bucket I’ve ever looked at, the “directories” don’t exist at all. If you have a dir/file1.txt and dir/file2.txt, there is NO such object as dir. Not even a placeholder.

_flux
0 replies
4h9m

Yeah, this post was the first one I had even heard of them.

cwillu
0 replies
8h5m

Deleting folder/ in a traditional file system will _fail_ if the folder is not empty. Userspace needs to recurse over the directory structure to unlink everything in it before unlinking the actual folder.

klodolph
2 replies
11h55m

Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.

What, exactly, is inefficient about it?

Think for a moment about the data structures you would use to represent a directory structure in a filesystem, and the data structures you would use to represent a key/value store.

With a filesystem, if you split a string /some/dir/file.jpg into three parts, “some”, “dir”, “file.jpg”, then you are actually making a decision about the tree structure. And here’s a question—is that a balanced tree you got there? Maybe it’s completely unbalanced! That’s actually inefficient.

Let’s suppose, instead, you treat the key as a plain string and stick it in a tree. You have a lot of freedom now, in how you balance the tree, since you are not forced to stick nodes in the tree at every / character.

It’s just a different efficiency tradeoff. Certain operations are now much less efficient (like “rename a directory” which, on S3, is actually “copy a zillion objects”). Some operations are more efficient, like “store a file” or “retrieve a file”.

umanwizard
0 replies
9h40m

I think what you’re describing is simply not a hierarchical file system. It’s a different thing that supports different operations and, indeed, is better or worse at different operations.

afiori
0 replies
8h53m

I think it is fair to say that S3 (as named files) is not a filesystem and it is inefficient to use it directly as such for common filesystem use cases; the same way that you could say it for a tarball[0].

This does not make S3 bad storage, just a bad filesystem; not everything needs to be a filesystem.

Arguably it is good that S3 is not a filesystem, as the filesystem can be a leaky abstraction, e.g. in git you cannot have two tags named "v2" and "v2/feature-1", because you cannot have both a file and a folder with the same name.

For something more closely related to URLs than filenames, forcing a filesystem abstraction is a limitation, as "/some/url", "/some/url/", and "/some/url/some-default-name-decided-by-the-webserver" can be different.[1]

[0] where a different tradeoff is that searching a file by name is slower but reading many small files can be faster.

[1] maybe they should be the same, but enforcing it is a bad idea

gjvc
0 replies
4h46m

"folders" do not exist in S3 -- why do you keep insisting that they do?

They appear to exist because the key is split on the slash character for navigation in the web front-end. This gives the familiar appearance of a filesystem, but the implementation is at a much higher level.

fiddlerwoaroof
9 replies
12h9m

Except, S3 does let you query by prefix and so the keys have more structure than the second diagram implies: they’re not just random keys, the API implies that common prefixes indicate related objects.

klodolph
8 replies
12h6m

That’s kind of stretching the idea of “more structure” to the breaking point, I think. The key is just a string. There is no entry for directories.

the API implies that common prefixes indicate related objects.

That’s something users do. The API doesn’t imply anything is related.

And prefixes can be anything, not just directories. If you have /some/dir/file.jpg, then you can query using /some/dir/ as a prefix (like a directory!) or you can query using /so as a prefix, or /some/dir/fil as a prefix. It’s just a string. It only looks like a directory when you, the user, decide to interpret the / in the file key as a directory separator. You could just as easily use any other character.

hiyer
4 replies
11h58m

One operation where this difference is significant is renaming a "folder". In UNIX (and even UNIX-y distributed filesystems like HDFS) a rename operation at "folder" level is O(1) as it only involves metadata changes. In S3, renaming a "folder" is O(number of files).

pepa65
1 replies
7h40m

From reading the above, if you have a folder 'dir' and a file 'dir/file', after renaming 'dir' to 'folder', you would just have 'folder' and 'dir/file'.

klodolph
0 replies
2h40m

There is really no such thing as a folder in S3.

If you have something which is dir/file, then NORMALLY “dir” does not exist at all. Only dir/file exists. There is nothing to rename.

If you happen to have something which is named “dir”, then it’s just another file (a.k.a. object). In that scenario, you have two files (objects) named “dir” and “dir/file”. Weird, but nothing stopping you from doing that. You can also have another object named “dir///../file” or something, although that can be inconvenient, for various reasons.

okr
0 replies
8h4m

Imho, renaming "folders" on S3 results in copying and deleting O(number of files)

Someone
0 replies
7h1m

In S3, renaming a "folder" is O(number of files).

More like O(max(number of files, total file size)). You can’t rename objects in S3. To simulate a rename, you have to copy an object and then delete the old one.

Unlike renames in typical file systems, that isn’t atomic (there will be a time period in which both the old and the new object exist), and it becomes slower the larger the file.
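
For illustration, a sketch of what "moving" a single object involves with boto3 (names hypothetical; objects over 5 GB would additionally need a multipart copy):

    import boto3

    s3 = boto3.client("s3")

    # There is no rename: copy the bytes to the new key, then delete the old
    # key. For a moment both objects exist, and neither step is atomic with
    # the other.
    s3.copy_object(
        Bucket="my-bucket",
        Key="new-folder/file.jpg",
        CopySource={"Bucket": "my-bucket", "Key": "old-folder/file.jpg"},
    )
    s3.delete_object(Bucket="my-bucket", Key="old-folder/file.jpg")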

fiddlerwoaroof
2 replies
11h45m

That’s something users do. The API doesn’t imply anything is related.

Querying ids by prefix doesn’t make any sense for a normal ID type. Just making this operation available and part of your public API indicates that prefixes are semantically relevant to your API’s ID type.

klodolph
0 replies
11h39m

“Prefix” is not the same thing as “directory”.

I can look up names with the prefix “B” and get Bart, Bella, Brooke, Blake, etc. That doesn’t imply that there’s some kind of semantics associated with prefixes. It’s just a feature of your system that you may find useful. The fact that these names have a common prefix, “B”, is not a particularly interesting thing to me. Just like if I had a list of files, 1.jpg, 10.jpg, 100.jpg, it’s probably not significant that they’re being returned sequentially (because I probably want 2.jpg after 1.jpg).

afiori
0 replies
8h48m

By this logic the file "foo/bar/" corresponds to the filename "f:o:o:/:b:a:r:/" (using a different character as separator).

tuwtuwtuwtuw
3 replies
11h19m

"filesystem" is not a name reserved for Unix-style file systems. There are many types of file system which is not built on according to your description. When I was a kid, I used systems which didn't support directories, but it was still file systems.

It's an incorrect take that a system to manage files must follow a set of patterns like the ones you mentioned to be called "file system".

afiori
2 replies
8h45m

Terms evolve, and now "filesystem" and "system of files" mean different things.

I would argue that not supporting folders or many other file operations makes something not a filesystem today.

tuwtuwtuwtuw
0 replies
7h16m

You're free to argue whatever you want, but claiming that a file system should have folders, as the parent commenter did, or support specific operations, seems a bit meaningless.

I could create a system not supporting folders because it relies on tags or something else. Or I could create a system which is write-only and doesn't support rename or delete.

These systems would be file systems according to how the term has been used for 40 (?) years at least. Just don't see any point in restricting the term to exclude random variants.

quickthrower2
0 replies
7h29m

Yeah, "hacker" used to not mean someone hacking into a computer and breaking a password; then it did, and now it means both that and a tech tinkerer.

Demiurge
3 replies
11h56m

Let’s start with the fact that you’re talking to an HTTP api… Even if S3 had web3.0 inodes, the querying semantics would not make sense. It’s a higher level API, because you don’t deal with blocks of magnetic storage and binary buffers. Of course s3 is not a filesystem, that is part of its definition, and reason to be…

klodolph
2 replies
11h52m

I think if you focus too narrowly on the details of the wire protocol, you’ll lose sight of the big picture and the semantics.

S3 is not a filesystem because the semantics are different from the kind of semantics we expect from filesystems. You can’t take the high-level API provided by a filesystem, use S3 as the backing storage, and expect to get good performance out of it unless you use a ton of translation.

Stuff like NFS or CIFS are filesystems. They behave like filesystems, in practice. You can rename files. You can modify files. You can create directories.

Demiurge
1 replies
5h16m

Right, NFS/CIFS support writing blocks, but S3 basically does HTTP GET and POST verbs. I would say that these concepts are the defining difference. To call S3 a filesystem is not wrong in the abstract, but it's no different than calling WordPress a filesystem, or DNS, or anything that stores something for you. Of course, it will be inefficient to implement a block write on top of any of these; that's because you have to literally do it yourself. As in: download the file, edit it, upload it again.

klodolph
0 replies
4h33m

I think the blocks are one part of it, and the other part is that S3 doesn’t support renaming or moving objects, and doesn’t have directories (just prefixes). Whenever I’ve seen something with filesystem-like semantics on top of S3, it’s done by using S3 as a storage layer, and building some other kind of view of the storage on top using a separate index.

For example, maybe you have a database mapping file paths to S3 objects. This gives you a separate metadata layer, with S3 as the storage layer for large blocks of data.
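
A minimal sketch of that separate-index idea, using SQLite purely for illustration (the schema and names are made up):

    import sqlite3

    db = sqlite3.connect("metadata.db")
    db.execute(
        """CREATE TABLE IF NOT EXISTS files (
               path    TEXT PRIMARY KEY,  -- the "filesystem" view
               s3_key  TEXT NOT NULL,     -- where the bytes actually live
               size    INTEGER
           )"""
    )

    # A "rename" or "move" is now a metadata-only update; the S3 object
    # itself is never touched.
    db.execute("UPDATE files SET path = ? WHERE path = ?",
               ("/new/dir/file.jpg", "/some/dir/file.jpg"))
    db.commit()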

keithalewis
0 replies
11h49m

Even youngsters are yelling at clouds now. Just a different kind of cloud.

ithkuil
1 replies
9h44m

In S3 each file is identified with a full path.

Not only can you not rename a single file, you also cannot rename a "folder" (because that would imply a bulk rename of a large number of children of that "folder").

This is the fundamental difference between a first class folder and just a convention on prefixes of full path names.

If you don't allow renames, it doesn't really make sense to have each "folder" store the list of the children.

You can instead have a giant ordered map (some kind of b-tree) that allows for efficient lookup and scanning of neighbouring nodes.

lukeh
0 replies
9h2m

The UMich LDAP server, upon which many were based, stored entries' hierarchical (distinguished) names with each entry, which I always found a bit weird. AD, eDirectory, and the OpenLDAP HDB backend don't have this problem.

8organicbits
1 replies
10h13m

Another challenge is directory flattening. On a file system "a/b" and "a//b" are usually considered the same path. But on S3 the slash isn't a directory separator, so the paths are distinct. You need to be extra careful when building paths not to include double slashes.

Many tools end up handling this by showing a folder named "a" containing a folder named "" (empty string). This confuses users quite a bit. It's more than the inodes, it's how the tooling handles the abstraction.
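
A small boto3 sketch of that pitfall (bucket name hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # No path normalization: "a/b" and "a//b" are two distinct keys.
    s3.put_object(Bucket="my-bucket", Key="a/b", Body=b"one")
    s3.put_object(Bucket="my-bucket", Key="a//b", Body=b"two")

    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="a/")
    print([o["Key"] for o in resp["Contents"]])  # ['a//b', 'a/b']

    # Grouped on "/", the second key shows up under the prefix "a//",
    # which UIs tend to render as a folder named "" inside "a".
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="a/", Delimiter="/")
    print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])  # ['a//']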

hnlmorg
0 replies
10h8m

Coincidentally I ran into an issue just like this a week ago. A customer facing application failed because there was an object named “/foo/bar” (emphasis on the leading slash).

This created a prefix named “/“ which confused the hell out of the application.

erik_seaberg
0 replies
12h13m

You can create a simulated directory, and write a bunch of files in it, but you can't atomically rename it--behind the scenes each file needs to be copied from old name to new.

daynthelife
0 replies
12h21m

The payload still contains a list of other inodes though

solumunus
14 replies
12h21m

What exactly do you think a folder is? It’s just an abstraction for organising data.

klodolph
11 replies
12h3m

S3 doesn’t have that abstraction.

The console UI shows folders but they don’t actually exist in S3. They’re made up by the UI.

3weeksearlier
5 replies
11h53m

It sounds like they have that abstraction in the UI. But if the CLI and API don't have it too, that's weird.

klodolph
4 replies
11h43m

Yeah, the UI and CLI show you “folders”. It’s a client-side thing that doesn’t exist in the actual service. Behind the scenes, the clients are making specific types of queries on the object keys.

You can’t examine when a folder was created (it doesn’t exist in the first place), you can’t rename a folder (it doesn’t exist), you can’t delete a folder (again, it doesn’t exist).

throwitaway222
3 replies
11h22m

That's just an implementation detail of well known filesystems.

dathery
1 replies
10h44m

Yes, which is why it's not ideal to reuse the folder metaphor here. Users have an idea how directories work on well-known filesystems and get confused when these fake folders don't behave the same way.

throwitaway222
0 replies
7m

Are all your S3 keys opaque strings (like UUIDs)? Do you use / (slash) in your keys?

If you truly believe S3 has absolutely no connection to folders, you would answer Yes and No.

klodolph
0 replies
4h37m

I don’t think that’s a defensible standpoint.

Folders are an important part of the way most people use filesystems.

throwitaway222
3 replies
11h23m

Similarly, the UI in Linux is making up the notion of folders and files in them, but we don't say they don't exist.

kelnos
1 replies
10h40m

No, they're not made up. A folder (or directory) is a specific type of inode, just as a file is.

S3 doesn't have folders. The UI fakes them by creating a 0-byte object (or file, if you will). It's a kludge.

klodolph
0 replies
4h36m

The UI will fake them without even creating the 0-byte object.

dathery
0 replies
10h47m

Directories actually exist on the filesystem, which is why you have to create them before use and they can exist and be empty. They don't exist in S3 and neither of those properties do, either. Similarly, common filesystem operations on directories (like efficiently renaming them, and thus the files under them) are not possible in S3.

Of course it can still be useful to group objects in the S3 UI, but it would probably be better to use some kind of prefix-centric UI rather than reusing the folder metaphor when it doesn't match the paradigm people are used to.

DonHopkins
0 replies
9h45m

Speaking of user interfaces with optical illusions about directory separators:

On the Mac, the Finder lets you have files with slashes in their names, even though it's a Unix file system underneath. Don't believe me? Go try to use the Finder to make a directory whose name is "Reports from 2024/03/10". See?

But as everyone knows, slash is the ONLY character you're not allowed to have in a file or directory name under Unix. It's enforced in the kernel at the system call interface. There is absolutely no way to make a file with a slash in it. Yet there it is!

The original MacOS operating system used the ":" character to delimit directory names, instead of "/", so you could have files and directories with slashes in their names, just not with colons in their names.

When Apple transitioned from MacOS to Unix, they did not want to freak out their users by renaming all their files.

So now try to use the Finder (or any app that uses the standard file dialog) to make a folder or file with a ":" in its name on a modern Mac. You still can't!

So now go into the shell and list out the parent directory containing the directory you made with a slash in its name. It's actually called "Reports from 2024:03:10"!

The Mac Finder and system file dialog user interfaces actually switch "/" and ":" when they show paths on the screen!

Try making a file in the shell with colons in it, then look at it in the finder to see the slashes.

However, back in the days of the old MacOS that permitted slashes in file names, there was a handy network gateway box called the "Gatorbox" that was a Localtalk-to-Ethernet AFP/NFS bridge, which took a subtly different approach.

https://en.wikipedia.org/wiki/GatorBox

It took advantage of the fact (or rather it triggered the bug) that the Unix NFS implementation boldly made an end-run around the kernel's safe system call interface that disallowed slashes in file names. So any NFS client could actually trick Unix into putting slashes into file names via the NFS protocol!

It appeared to work just fine, but then down the line the Unix "restore" command would totally shit itself! Of course "dump" worked just fine, never raising an error that it was writing corrupted dumps that you would not be able to read back in your time of need, so you'd only learn that you'd been screwed by the bug and lost all your files months or years later!

So not only does NFS stand for "No File Security", it also stands for "Nasty Forbidden Slashes"!

https://news.ycombinator.com/item?id=31820504

NFS originally stood for "No File Security".

The NFS protocol wasn't just stateless, but also securityless!

Stewart, remember the open secret that almost everybody at Sun knew about, in which you could tftp a host's /etc/exports (because tftp was set up by default in a way that left it wide open to anyone from anywhere reading files in /etc) to learn the name of all the servers a host allowed to mount its file system, and then in a root shell simply go "hostname foo ; mount remote:/dir /mnt ; hostname `hostname`" to temporarily change the CLIENT's hostname to the name of a host that the SERVER allowed to mount the directory, then mount it (claiming to be an allowed client), then switch it back?

That's right, the server didn't bother checking the client's IP address against the host name it claimed to be in the NFS mountd request. That's right: the protocol itself let the client tell the server what its host name was, and the server implementation didn't check that against the client's ip address. Nice professional protocol design and implementation, huh?

Yes, that actually worked, because the NFS protocol laughably trusted the CLIENT to identify its host name for security purposes. That level of "trust" was built into the original NFS protocol and implementation from day one, by the geniuses at Sun who originally designed it. The network is the computer is insecure, indeed.

[...]

From the Unix-Haters Handbook:

https://archive.org/stream/TheUnixHatersHandbook/ugh_djvu.tx...

Don't Touch That Slash!

UFS allows any character in a filename except for the slash (/) and the ASCII NUL character. (Some versions of Unix allow ASCII characters with the high-bit, bit 8, set. Others don't.)

This feature is great — especially in versions of Unix based on Berkeley's Fast File System, which allows filenames longer than 14 characters. It means that you are free to construct informative, easy-to-understand filenames like these:

1992 Sales Report

Personnel File: Verne, Jules

rt005mfkbgkw0.cp

Unfortunately, the rest of Unix isn't as tolerant. Of the filenames shown above, only rt005mfkbgkw0.cp will work with the majority of Unix utilities (which generally can't tolerate spaces in filenames).

However, don't fret: Unix will let you construct filenames that have control characters or graphics symbols in them. (Some versions will even let you build files that have no name at all.) This can be a great security feature — especially if you have control keys on your keyboard that other people don't have on theirs. That's right: you can literally create files with names that other people can't access. It sort of makes up for the lack of serious security access controls in the rest of Unix.

Recall that Unix does place one hard-and-fast restriction on filenames: they may never, ever contain the magic slash character (/), since the Unix kernel uses the slash to denote subdirectories. To enforce this requirement, the Unix kernel simply will never let you create a filename that has a slash in it. (However, you can have a filename with the 0200 bit set, which does list on some versions of Unix as a slash character.)

Never? Well, hardly ever.

    Date: Mon, 8 Jan 90 18:41:57 PST 
    From: sun!wrs!yuba!steve@decwrl.dec.com (Steve Sekiguchi) 
    Subject: Info-Mac Digest V8 #3 5 

    I've got a rather difficult problem here. We've got a Gator Box run- 
    ning the NFS/AFP conversion. We use this to hook up Macs and 
    Suns. With the Sun as a AppleShare File server. All of this works 
    great! 

    Now here is the problem, Macs are allowed to create files on the Sun/ 
    Unix fileserver with a "/" in the filename. This is great until you try 
    to restore one of these files from your "dump" tapes, "restore" core 
    dumps when it runs into a file with a "/" in the filename. As far as I 
    can tell the "dump" tape is fine. 

    Does anyone have a suggestion for getting the files off the backup 
    tape? 

    Thanks in Advance, 

    Steven Sekiguchi Wind River Systems 

    sun!wrs!steve, steve@wrs.com Emeryville CA, 94608

Apparently Sun's circa 1990 NFS server (which runs inside the kernel) assumed that an NFS client would never, ever send a filename that had a slash inside it and thus didn't bother to check for the illegal character. We're surprised that the files got written to the dump tape at all. (Then again, perhaps they didn't. There's really no way to tell for sure, is there now?)

winwang
0 replies
12h12m

I'm having a lot of fun imagining this being said to a kid who's trying to buy some folders for school.

ahepp
0 replies
2h20m

Is it an abstraction for requesting the data you want, or an abstraction for storing the data in a retrievable manner?

nostrebored
2 replies
12h19m

Weird that it says folders now. I remember it being very strictly called a prefix when I was at AWS.

Izkata
0 replies
11h18m

The web console even collapses them like folders on slashes, further obfuscating how it actually works. I remember having to explain to coworkers why it was so slow to load a large bucket.

wkat4242
1 replies
8h34m

Hmm, well, there are no folders, but if you interact with the object the URL does become nested. So in a sense it does behave exactly like a folder for all intents and purposes when you deal with it that way. It depends what API you use, I guess.

I use S3 just as a web bucket of files (I know it's not the best way to do that, but it's what I could easily obtain through our company's processes). In this case it makes a lot of sense, though I try to avoid making folders. Other people using the same hosting do use them, though.

raverbashing
0 replies
8h5m

Except stuff like the s3 CLI has all these weird names for normal filesystem items, and you have to bang your head trying to figure out what it all means

(also don't get me started on the whole s3api thing)

klodolph
0 replies
12h1m

I see you getting downvotes, but you’re speaking the honest truth, here.

highwaylights
0 replies
8h41m

This!

I’m fine with it, I actually appreciate the logic and simplicity behind it, but the number of times I’ve tried to explain why “folders” on S3 keep disappearing, while people stare at me like I’m an idiot, is really frustrating.

(When you remove the last file in a “folder” on S3, the “folder” disappears, because that pattern no longer appears in the bucket’s k/v dictionary, so there’s no reason to show it: it never existed in the first place.)

halayli
0 replies
11h51m

I don't know why you are being downvoted, what you said is true and confuses many newcomers.

orf
56 replies
6h14m

And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem

This misses something critical. Yes, s3 has fast reading and writing, but that’s not really what makes it useful.

What makes it useful is listing. In an unversioned bucket (or one with no delete markers), listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

What’s more, using “/“ as a delimiter is just the default - you can use any character you want and get a set of common prefixes. There are no “directories”, ”directories” are created out of thin air on demand.

This is super powerful, and it’s the thing that lets you partition your data in various ways, using whatever identifiers you need, without worrying about performance.

If listing were just “slow”, couldn’t list on key prefixes, and got slower in proportion to the number of keys (i.e. a traditional Unix file system), then it wouldn’t be useful at all.
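
The call in question, roughly, with boto3 (bucket and key string hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # "Give me the next 1000 keys alphabetically after this string."
    # StartAfter can be any string; it does not have to be an existing key.
    resp = s3.list_objects_v2(
        Bucket="my-bucket",
        StartAfter="customers/0042/2024-01-",
        MaxKeys=1000,
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"])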

tjoff
13 replies
5h25m

Is listing really such a key feature that people use it as a database to find objects?

Have not used S3, but that is not how I imagined using it.

orf
8 replies
5h15m

Sure. It's kind of an index - limited to prefix-only searching, but useful.

Say you store uploads associated with a company and a user. You'd maybe naively store them as `[company-uuid]/[user-id].[timestamp]`.

If you need to list a given user's (123) uploads after a given date, you'd list keys after `[company-uuid]/123.[date]`. If you need to list all of a user's uploads, you'd list `[company-uuid]/123.`. If you need to get the set of all users who have photos, you'd list `[company-uuid]/` with a Delimiter set to `.`

The point is that it's flexible, and with a bit of thought it allows you to "remove all of a user's uploads between two dates", "remove all of a company's uploads" or "remove all of a user's uploads" with a single call. Or whatever specific stuff is important to your use case that might otherwise need a separate DB.

It's not perfect - you can't reverse the listing (i.e. you can't get the latest photo for a given user by sorting descending, for example), and it needs some thought about your key structure.
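
A rough sketch of those three queries against that key scheme, using boto3 (all names hypothetical):

    import boto3

    s3 = boto3.client("s3")
    bucket, company = "uploads-bucket", "company-uuid"  # hypothetical

    # User 123's uploads after a given date (Prefix keeps us inside user 123,
    # StartAfter skips ahead to the date).
    s3.list_objects_v2(Bucket=bucket,
                       Prefix=f"{company}/123.",
                       StartAfter=f"{company}/123.2024-01-01")

    # All of user 123's uploads.
    s3.list_objects_v2(Bucket=bucket, Prefix=f"{company}/123.")

    # The set of users with uploads: one CommonPrefixes entry per user.
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{company}/", Delimiter=".")
    users = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]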

tjoff
7 replies
5h5m

But surely you need to track that elsewhere anyway?

That some niche edge case runs efficiently doesn't sound like a defining feature of S3. On the contrary, many common operations map terribly to S3, so you kind of need the logic to be elsewhere.

kbolino
3 replies
3h10m

But surely you need to track that elsewhere anyway?

Why? If the S3 structure and listing is sufficient, I don't need to store anything else anywhere else.

Many use cases may involve other requirements that S3 can't meet, such as being able to find the same object via different keys, or being able to search through the metadata fields. However, if the requirements match up with S3's structure, then additional services are unnecessary and keeping them in sync with S3 is more hassle than it's worth.

tjoff
2 replies
2h47m

I agree, but something as simple (in functionality) as that ought to be an edge-case. Not a defining feature of S3.

orf
0 replies
2h21m

It’s fundamental to how S3 works and its ability to scale, so it is a defining feature of S3.

If you think wider, a bucket itself is just a prefix.

dekhn
0 replies
2h27m

it's a property of the system that I, as an architect, would seriously consider as part of my system's design. I've worked with many systems where iterating over items in order starting from a prefix is extremely cheap (sstables).

orf
2 replies
4h13m

My overall point can be summarised as this:

- Listing things is a very common operation to do.

- The POSIX api and the directory/file hierarchy it provides is a restrictive one.

- S3 does not suffer from this, you can recursively list and group keys into directories at “list time”.

- If you find yourself needing to list gigantic numbers of keys in one go, you can do better by only listing a subset. S3 isn’t a filesystem, you shouldn’t need to list 1k+ keys sequentially apart from during maintenance tasks.

- This is actually quite fast, compared to alternatives.

Whether or not you see a use case for this is sort of irrelevant: they exist. It's what allows you to easily put data into S3 and flexibly group/scan it by specific attributes.

tjoff
1 replies
3h54m

Listing things is very common, so why would you outsource that to S3 when all your bookkeeping is elsewhere? It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

For sure, for maintenance tasks etc. it sounds quite useful. And good hygiene with prefixes sounds like a sane idea. But listing being a critical part of what "makes S3 useful"? That seems like a huge stretch that your points don't seem to address.

orf
0 replies
13m

It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

Because there is no POSIX api for this. Depending on your requirements and query patterns, you may not need a completely separate database that you need to keep in sync.

belter
3 replies
4h16m

No. The standard practice is to use a DynamoDB table as the index for your objects in S3.

This article misunderstood S3 and could just as well have the title "An Airplane is not a Car" :-)

macintux
2 replies
2h47m

I don't know that you can characterize that as a "standard practice".

Maybe it's widespread, but I've not encountered it.

ianburrell
0 replies
45m

That article is old. DynamoDB was used because of the old, weak consistency model of S3. Writes were atomic, but lists could return stale results, so you needed a consistent list of objects elsewhere.

But in 2020, S3 changed to a strong consistency model. There is no need to use DynamoDB for that now.

calpaterson
12 replies
5h9m

I have to say that I'm not hugely convinced. I don't really think that being able to pull out the keys before or after a prefix is particularly impressive. That is the basis for database indices going back to the 1970s after all.

Perhaps the use-cases you're talking about are very different from mine. That's possible of course.

But for me, often the slow speed of listing the bucket gets in the way. Your bucket doesn't have to get very big before listing the keys takes longer than reading them. I seem to remember that listing operations ran at sub-1mbps, but admittedly I don't have a big bucket handy right now to test that.

cuno
10 replies
4h1m

We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem in many benchmarks. For listing directories we find it faster than Lustre (a real high-performance filesystem). Our approach is to first try listing directories with a single ListObjectsV2 (which on AWS S3 returns keys in lexicographic order), and if that hasn't made much progress, we start listing with parallel ListObjectsV2 calls. Once you start parallelising ListObjectsV2 (rather than sequentially "continuing") you get massive speedups.

crabbone
6 replies
3h32m

find it faster than a local filesystem for many benchmarks.

What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...

Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?

----

The thing is: you should expect, in the best case, something like 5 ms latency on network calls over the Internet. Within the datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within a region but across different zones tends to be around 1 ms latency.

Meanwhile, NVMe latency, even on consumer products, is 10-20 microseconds, i.e. roughly 100 times faster than anything going through the network can offer.

hnlmorg
2 replies
3h0m

EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.

dekhn
1 replies
2h29m

if you turn all the EFS performance knobs up (at a high cost), it's quite fast.

hnlmorg
0 replies
1h58m

Faster, sure. But I wouldn't go so far as to say it is fast

wenc
0 replies
47m

S3 is really high latency though. I store Parquet files on S3, and querying them through DuckDB is much slower than on a file system because of the random access patterns. I can see S3 being decent for bulk access but definitely not for random access.

This is why there’s a new S3 Express offering that is low latency (but costs more).

crabbone
0 replies
19m

The tests are very weird...

Normally, from someone working in storage, you'd expect tests to be in IOPS, and the go-to tool for reproducible tests is FIO. I mean, of course "reproducibility" is a very broad subject, but people are so used to this tool that they develop a certain intuition and interpretation for it / its results.

On the other hand, seeing throughput figures is kinda... it tells you very little about how the system performs. Just to give you some reasons: a system can be configured to do compression or deduplication on the client / server, and this will significantly impact your throughput, depending on what you actually measure: the amount of useful information presented to the user or the amount of information transferred. Also, throughput at the expense of higher latency may or may not be a good thing... Really, if you ask anyone who ever worked on a storage product how they could crank up throughput numbers, they'd tell you: "write bigger blocks asynchronously". This is the basic recipe, if that's what you want. Whether this makes a good all-around system or not... I'd say, probably not.

Of course, there are many other concerns. Data consistency is a big one, and this is a typical tradeoff when it comes to choosing between object store and a filesystem, since filesystem offers more data consistency guarantees, whereas object store can do certain things faster, while breaking them.

BTW, I don't think most readers would understand Lustre and similar to be the "local filesystem", since it operates over network and network performance will have a significant impact, of course, it will also put it in the same ballpark as other networked systems.

I'd also say that Ceph is kinda missing from this benchmark... Again, if we are talking about filesystem on top of object store, it's the prime example...

supriyo-biswas
2 replies
3h9m

Once you start parallelising the ListObjectV2 (rather than sequentially "continuing")

How are you "parallelizing" the ListObjectsV2? The continuation token can only be fed in once the previous ListObjectsV2 response has completed, unless you know the name or structure of the keys ahead of time, in which case listing objects isn't necessary.

johnmaguire
0 replies
3h2m

You're right that it won't work for all use cases, but starting two threads with prefixes A and M, for example, is one way you might achieve this.

cuno
0 replies
3h2m

For example, you can do separate parallel ListObjectsV2 calls for keys starting with a-f, g-k, etc., covering the whole key space. You can parallelize recursively based on what is found in the first 1000 entries so that it matches the statistics of the keys. Yes, there may be pathological cases, but in practice we find this works very well.
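
A rough boto3 sketch of that kind of key-space split (bucket and split points are hypothetical; real code would pick the ranges from the statistics of the first page of results):

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")

    def list_range(start, end):
        """List keys in [start, end) by paginating with StartAfter."""
        keys, after = [], start
        while True:
            kwargs = {"Bucket": "my-bucket"}
            if after:
                kwargs["StartAfter"] = after
            resp = s3.list_objects_v2(**kwargs)
            contents = resp.get("Contents", [])
            batch = [o["Key"] for o in contents if o["Key"] < end]
            keys.extend(batch)
            # Stop when we ran past the end of the range or the listing ended.
            if not batch or len(batch) < len(contents) or not resp.get("IsTruncated"):
                break
            after = batch[-1]
        return keys

    # Carve the key space into independent ranges and walk them in parallel.
    ranges = [("", "g"), ("g", "n"), ("n", "t"), ("t", "\U0010ffff")]
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(lambda r: list_range(*r), ranges)
    all_keys = [key for chunk in chunks for key in chunk]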

orf
0 replies
4h24m

It depends on a few factors. The list objects call hides deleted and noncurrent versions, but it has to skip over them. Grouping prefixes also takes time, if they contain a lot of noncurrent or deleted keys.

A pathological case would be a prefix with 100 million deleted keys, and 1 actual key at the end. Listing the parent prefix takes a long time in this case - I’ve seen it take several minutes.

If your bucket is pretty “normal” and doesn’t have this, or isn’t versioned, then you can do 4-5 thousand list requests a second, at any given key/prefix, in constant time. Or you can explicitly list object versions (and not skip deleted keys), also in constant time.

It all depends on your data: if you need to list all objects then yeah it’s gonna be slow because you need to paginate through all the objects. But the point is that you don’t have to do that if you don’t want to, unlike a traditional filesystem with a directory hierarchy.

And this enables parallelisation: why list everything sequentially, when you can group the prefixes by some character (e.g. “-”), then process each of those prefixes in parallel.

The world is your oyster.

adrian_b
11 replies
5h45m

For 30 years now (starting with XFS in 1993, which was inspired by HPFS), all the good UNIX file systems have implemented directories as some kind of B-tree.

Therefore they do not get slower in proportion to the number of entries, and listing based on file prefixes is extremely fast.

orf
8 replies
5h39m

Yes they do. What APIs does Linux offer that allow you to list a directory’s contents alphabetically, starting at a specific filename, in constant time? You have to iterate over the directory contents.

You can maybe use “d_off” with readdir in some way, but that’s specific to the filesystem. There’s no portable way to do this with POSIX.

Regardless of whether you can do it with a single directory, you can’t do it for all files recursively under a given prefix. You can’t just ignore directories, or say that “for this list request, ‘-’ is my directory separator”.

The use of b-trees in file systems is completely beside the point.

adrian_b
7 replies
5h33m

The POSIX API is indeed even older, so it is not helpful.

But as you say, there are filesystem-specific methods or operating-system specific methods to reach the true performance of the filesystem.

It is likely that for maximum performance one would have to write custom directory search functions using the Linux syscalls directly, instead of using the standard libc functions, but I would rather do that than pay for S3 or something like it.

orf
5 replies
5h28m

Yes. You could also just use a SQLite table with two columns (path, contents), then just query that. Or do any number of other things.

The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…

anamexis
4 replies
5h22m

The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…

Where did this goalpost come from? S3 is not portable or POSIX compliant.

orf
3 replies
5h14m

From the article we're commenting on, which is comparing the interface of S3 to the POSIX interface. Not any given filesystem + platform specific interface.

anamexis
2 replies
5h6m

The article does not mention POSIX, or anything about listing files, at all.

zaphar
0 replies
4h43m

The article starts out by making a comparison between the POSIX filesystem API calls and S3's API. The context is very much a comparison between those two API surface areas.

orf
0 replies
4h17m

It mistakenly mentions UNIX whilst referencing the POSIX filesystem API, and I literally quoted where it talks about listing in my original comment.

justincormack
0 replies
5h13m

There are no specific syscalls that you can use for this. The libc functions and the syscalls are extremely similar.

nh2
0 replies
5h4m

listing based on file prefixes is extremely fast

This functionality does not exist to my knowledge.

ext4 and XFS return directory entries in pseudo-random order (due to hashing), not lexicographically.

For an example, see e.g. https://righteousit.wordpress.com/2022/01/13/xfs-part-6-btre...

If you know a way to return lexicographical order directly from the file system, without the need to sort, please link it.

kbolino
0 replies
2h58m

Resolving random file system paths still gets slower proportional to their depth, which is not the case for S3, where the prefix is on the entire object key and not just the "basename" part of it, like in a filesystem.

jacobsimon
6 replies
5h44m

What is it about S3 that enables this speed, and why can’t traditional Unix file systems do the same?

orf
5 replies
5h30m

S3 doesn’t have directories; it can be thought of as a flat, sorted list of keys.

UNIX (and all operating systems) differentiate between a file and a directory. To list the contents of a directory, you need to make an explicit call. That call might return files or directories.

So to list all files recursively, you need to “list, sort, check if an entry is a directory, recurse”. This isn’t great.
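
For comparison, a sketch of the recursive walk a POSIX-style API forces on you (Python standard library only; the root path is hypothetical):

    import os

    def list_recursive(root):
        """List all file paths under root: list, sort, check type, recurse."""
        paths = []
        for entry in sorted(os.scandir(root), key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                paths.extend(list_recursive(entry.path))  # one extra listing per directory
            else:
                paths.append(entry.path)
        return paths

    print(list_recursive("/some"))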

bradleyjg
3 replies
5h20m

Code written against S3 is not portable either. It doesn’t support Azure or GCP, much less some random proprietary cloud.

zaphar
0 replies
4h41m

GCP storage buckets implement the S3 API. You can treat them as if they were an S3 bucket. Something I do all the time.

cuno
0 replies
3h47m

Actually we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" vendors (including on-prem versions). Although there's documentation on S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things. This includes giant public cloud companies. In one case a giant vendor only failed at high loads, making it appear to "work" until it didn't, because its backoff response was not what the AWS SDK expected. It's been a headache that we've had to deal with for cunoFS, as well as making it work with GCP and Azure. At the big HPC conference Supercomputing 2023, when we mentioned supporting "S3 compatible" systems, we would often be told stories about applications not working with their supposedly "S3 compatible" one (from a mix of vendors).

arcfour
0 replies
4h56m

I've seen several S3-compatible APIs and there are open-source clients. If anything it's the de-facto standard.

mechanicalpulse
0 replies
1h11m

Isn't that a limitation imposed by the POSIX APIs, though, as a direct consequence of the interface's representation of hierarchical filesystems as trees? As you've illustrated, that necessitates walking the tree. Many tools, I suppose, walk the tree via a single thread, further serializing the process. In an admittedly haphazard test, I ran `find(1)` on ext4, xfs, and zfs filesystems and saw only one thread.

I imagine there's at least one POSIX-compatible file system out there that supports another, more performant method of dumping its internal metadata via some system call or another. But then we would no longer be comparing the S3 and POSIX APIs.

gamache
3 replies
2h36m

...listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

I'm not sure we agree on the definition of "constant time" here. Just because you get 1000 keys in one network call doesn't imply anything about the complexity of the backend!

orf
2 replies
2h27m

Constant time regardless of the number of objects in the bucket and regardless of the initial starting position of your list request.

hobobaggins
1 replies
2h6m

It is indeed impressive that the implementation operates in more or less constant time, but probably very few use cases actually fit that narrow window, so this technical strength is moot when it comes to actual usage.

Since each request depends on the position received in the last request, 1000 arbitrary keys on your 3rd or 1000th attempt don't really help unless you found your needle in the haystack in that request (and in that case the rest of that 1000-key listing was wasted).

orf
0 replies
1h34m

You’re assuming you’re paginating through all objects from start to finish.

A request to list objects under “foo/” is a request to list all objects starting with “foo/”, which is constant time regardless of the number of keys before it. The same applies to “foo/bar-”, or any other list request for any given prefix. There are no directories on S3.

nh2
1 replies
1h50m

The key difference between lexicographically keyed flat hierarchies, and directory-nested filesystem hierarchies, becomes clear based on this example:

    dir1/a/000000
    dir1/a/...
    dir1/a/999999
    dir1/b
On a proper hierarchical file system with directories as tree interior nodes, `ls dir1/` needs to traverse and return only 2 entries ("a" and "b").

A flat string-indexed KV store that only supports lexicographic order, without special handling of delimiters, needs to traverse 1 million dirents ("a/000000" through "a/999999") before arriving at "b".

Thus, simple flat hierarchies are much slower at listing the contents of a single dir: O(all recursive children), vs. O(immediate children) on a "proper" filesystem.

Lexicographic strings cannot model multi-level tree structures with the same complexities; this may be what gives S3 the reputation that "listing files is slow".

UNLESS you tell the listing algorithm what the delimiter character is (e.g. `/`). Then a lexicographical prefix tree can efficiently skip over all subtrees at the next `/`.

Amazon S3 supports that, with the docs explicitly mentioning "skipping over and summarizing the (possibly millions of) keys nested at deeper levels" in the `CommonPrefixes` field: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...

I have not tested whether Amazon's implementation actually saves the traversal (or whether it traverses and just returns fewer results), but I'd hope so.
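
The delimiter-aware call against the example keys above would look roughly like this in boto3 (bucket hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # With Delimiter="/", the million "dir1/a/..." keys collapse into a single
    # CommonPrefixes entry rather than being returned one by one.
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="dir1/", Delimiter="/")
    print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])  # ['dir1/a/']
    print([o["Key"] for o in resp.get("Contents", [])])           # ['dir1/b']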

nh2
0 replies
36m

For completeness: The orignal post says:

    S3 has no rename or move operation.
    Renaming is CopyObject and then DeleteObject.
    CopyObject takes linear time to the size of the file(s).
    This comes up fairly often when someone has written a lot of files
    to the wrong place - moving the files back is very slow.
This is right:

In a normal file system, renaming a directory is fast, O(1); in S3 it's slow, O(all recursive children).

And Amazon S3 has not added a delimiter-based function to reduce its complexity, even though that would be easily possible in a lexicographic prefix tree (re-rooting the subtree).

So here the original post has indeed found a case where S3 is much slower than a normal file system.

foldr
1 replies
4h0m

What makes it useful is listing.

I think 99% of S3 usage just consists of retrieving objects with known keys. It seems odd to me to consider prefix listing as a key feature.

bostik
0 replies
3h52m

When you embed the relevant timestamp (not necessarily that of object creation) as a prefix, it sure becomes one. Whether that prefix is part of the "path" (object/path/prefix/with/<4-digit year>/) or directly part of the basename (object/path/prefix/to/app-specific/files/<4-digit year>-<2-digit month>-....), being able to limit the search space server-side becomes incredibly useful.

You can try it yourself: list objects in a bucket prefix with lots of files, and measure the time it takes to list all of them vs. the time it takes to list only a subset of them that share a common prefix.

hayd
0 replies
4h40m

You can set up CloudWatch events to trigger a Lambda function to store metadata about the S3 object in a regular database. That way you can index it however you expect to list it.

Very effective for our use case.
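
A rough sketch of such a handler (DynamoDB chosen arbitrarily as the "regular database"; the table name and event wiring are assumptions):

    import urllib.parse

    import boto3

    table = boto3.resource("dynamodb").Table("s3-file-index")  # hypothetical table

    def handler(event, context):
        # S3 "ObjectCreated" notifications deliver one record per new object.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in event notifications.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            size = record["s3"]["object"].get("size", 0)
            table.put_item(Item={"bucket": bucket, "key": key, "size": size})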

aeyes
0 replies
5h27m

And if for some reason you need a complete listing along with object sizes and other attributes, you can get one every 24 hours with an S3 Inventory report.

That has always been good enough for me.

breckognize
25 replies
2h45m

I haven't heard of people having problems [with S3's Durability] but equally: I've never seen these claims tested. I am at least a bit curious about these claims.

Believe the hype. S3's durability is industry leading and traditional file systems don't compare. It's not just the software - it's the physical infrastructure and safety culture.

AWS' availability zone isolation is better than the other cloud providers. When I worked at S3, customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building (or maybe different rooms of the same building) - not with the separation AWS did.

The entire organization was unbelievably paranoid about data integrity (checksum all the things) and bigger events like natural disasters. S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc). We even measured failure rates by hard drive vendor/vintage to minimize the chance of data loss if a batch of disks went bad.

I wouldn't store critical data anywhere else.

Source: I wrote the S3 placement system.

rsync
9 replies
2h20m

"AWS' availability zone isolation is better than the other cloud providers."

Not better than all of them.

A geo-redundant rsync.net account exists in two different states (or countries) - for instance, primary in Fremont[1] and secondary in Denver.

"S3 even operates at a scale where we could detect "bitrot""

That is not a function of scale. My personal server running ZFS detects bitrot just fine - and the scale involved is tiny.

[1] he.net headquarters

breckognize
4 replies
2h7m

Backing up across two different regions is possible for any provider with two "regions" but requires either doubling your storage footprint or accepting a latency hit because you have to make a roundtrip from Fremont to Denver.

The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle. They're far enough apart for good isolation, which provides durability and availability, but close enough that the network round trip time is negligible compared to the disk seek.

Re: bit rot, I mean the frequency of events. If you've got a few disks, you may see one flip every couple years. They happen frequently enough in S3 that you can have expectations about the arrival rate and alarm when that deviates from expectations.

logifail
2 replies
59m

The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle

What may be less of a sweet spot is AWS' pricing.

emodendroket
1 replies
51m

Sending the data to /dev/null is the cheapest option if that’s all you care about.

logifail
0 replies
37m

Seems the snark detector just went off :)

Back on topic, I'd hope all of us would expect value for money for any and all services we recommend or purchase. Search for "site:news.ycombinator.com Away From AWS" to find dozens of discussions on how to save money by leaving AWS.

EDIT: just one article of the many I've read recently:

"What I’ve always found surprising about egress is just how expensive it is. On AWS, downloading a file from S3 to your computer once costs 4 times more than storing it for an entire month"

https://robaboukhalil.medium.com/youre-paying-too-much-for-e...

senderista
0 replies
31m

the network round trip time is negligible compared to the disk seek

Only for spinning rust, right?

Helmut10001
2 replies
2h18m

Agree.

S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc).

I would expect any cloud provider to be able to detect bitrot these days.

senderista
1 replies
29m

I think the point the OP was trying to make is that they regularly detected bitrot due to their scale, not that they were merely capable of doing so.

Helmut10001
0 replies
8m

Ah, thank you. This makes more sense. And I think I remember reading about it once. Apologies for the misinterpretation!

medler
4 replies
2h14m

customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building

I don’t think this is true. Per the Google Cloud Storage docs, data is replicated across multiple zones, and each zone maps to a different cluster. https://cloud.google.com/compute/docs/regions-zones/zone-vir...

singron
2 replies
1h57m

Google puts multiple clusters in a single building.

navaati
0 replies
44m

Flashback to that Clichy datacenter fire near Paris...

medler
0 replies
1h39m

Seems you’re right. They say each zone is a separate failure domain but you kind of have to trust their word on that.

treflop
3 replies
1h33m

What’s your experience like at other storage outfits?

I only ask because your post is a bit like singing Cinnabon's praises for making their own dough.

The things that you mentioned are standard storage company activities.

Checksum-all-the-things is a basic feature of a lot of file systems. If you can already set up your home computer to detect bitrot and alert you, you can bet big storage vendors do it.

Keeping track of hard drive failure rates by vendor is normal. Storage companies publicly publish their own reports. The tiny 6-person IT operation I was in had a spreadsheet. Hell, I toured a friend’s friend’s major data center last year and he managed to find time to talk hard drive vendors. Now you. I get it — y’all make spreadsheets.

There are a lot of smart people working on storage outside AWS and long before AWS existed.

pclmulqdq
1 replies
41m

When I worked at Google in storage, we had our own figures of merit that showed that we were the best and Amazon's durability was trash in comparison to us.

As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.

breckognize
0 replies
20m

9's are overblown. When cloud providers report that, they're really saying "Assuming random hard drive failure at the rates we've historically measured and how quickly we detect and fix those failures, what's the mean time to data loss".

But that's burying the lede. By far the greatest risks to a file's durability are:

1. Bugs (which aren't captured by a durability model). This is mitigated by deploying slowly and having good isolation between regions.

2. An act of God that wipes out a facility.

The point of my comment was that it's not just about checksums. That's table stakes. The main driver of data loss for storage organizations with competent software is safety culture and physical infrastructure.

My experience was that S3's safety culture is outstanding. In terms of physical separation and how "solid" the AZs are, AWS is overbuilt compared to the other players.

fierro
0 replies
29m

it's well known and not debatable that Cinnabon is fire

supriyo-biswas
1 replies
2h34m

Checksumming the data isn't done out of paranoia but simply as a result of having to detect which blocks are unusable in order to run the Reed-Solomon algorithm.

I'd also assume that a sufficient number of these corruption events are used as a signal to "heal" the system by migrating the individual data blocks onto different machines.

Overall, I'd say the things that you mentioned are pretty typical of a storage system, and are not at all specific to S3 :)

catlifeonmars
0 replies
1h25m

The S3 checksum feature applies to the objects, so that’s entirely orthogonal to erasure codes. Unless you know something I don’t and SHA256 has commutative properties. You’d still need to compute the object hash independent of any blocks.

Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...
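
For reference, a minimal boto3 sketch of that object-level checksum feature (bucket and key names here are made up):

  import boto3

  s3 = boto3.client("s3")

  # Ask S3 to compute and store a SHA-256 checksum for the whole object at upload time.
  s3.put_object(
      Bucket="example-bucket",
      Key="reports/2024/data.bin",
      Body=b"hello world",
      ChecksumAlgorithm="SHA256",
  )

  # Later, read the stored checksum back without downloading the body.
  head = s3.head_object(
      Bucket="example-bucket",
      Key="reports/2024/data.bin",
      ChecksumMode="ENABLED",
  )
  print(head["ChecksumSHA256"])  # base64-encoded digest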

tracerbulletx
0 replies
2h36m

My first job was at a startup in 2012 where I was expected to build things at a scale way over what I really had the experience to do. Anyways the best choice I ever made was using RDS and S3 (and django).

staunch
0 replies
2h21m

Believe the hype.

I'd rather believe the test results.

Is there a neutral third-party that has validated S3's durability/integrity/consistency? Something as rigorous as Jepsen?

It'd be really neat if someone compared all the S3 compatible cloud storage systems in a really rigorous way. I'm sure we'd discover that there are huge scary problems. Or maybe someone already has?

spintin
0 replies
5m

Correct me if I'm wrong but bitrot only affects spinning rust since NAND uses ECC?

loeg
0 replies
1h12m

Not a public cloud, but storage at Facebook is similar in terms of physical infrastructure, safety culture, and scale.

donatj
11 replies
2h47m

And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely high bandwidths, listing out what is there is much, much slower. Slower than a slow local filesystem.

I was taken aback by this recently. At my coworker's request, I was putting some work into a script we have to manage assets in S3. It has a cache for the file listing, and my coworker who wrote it sent me his pre-populated cache. My initial thought was “this can’t really be necessary”, so I started poking.

We have ~100,000 root-level directories for our individual assets. Each of those has five or six directories with a handful of files. Probably less than a million files total, maybe 3 levels deep at its deepest.

Recursively listing these files takes literally fifteen minutes. I poked and prodded suggestions from stack overflow and ChatGPT at potential ways to speed up the process and got nothing notable. That’s absurdly slow. Why on earth is it so slow?

Why is this something Amazon has not fixed? From the outside it really seems like they could slap some B-trees on the individual buckets and call it a day.

If it is a difficult problem, I’m sure it would be for fascinating reasons I’d love to hear about.

catlifeonmars
5 replies
1h19m

S3 is fundamentally a key value store. The fact that you can view objects in “directories” is nothing more than a prefix filter. It is not a file system and has no concept of directories.

anonymous-panda
2 replies
32m

Directories make up a hierarchical filesystem, but they're not a necessary condition. A filesystem at its core is just a way of organizing files. If you're storing and organizing files in S3, then it's a filesystem for you. Saying it's "fundamentally a key value store" as if that were something different is confusing, because a filesystem is just a key-value store mapping paths to file contents.

Indeed, there's every reason to believe that a modern file system would perform significantly faster (at least for most operations) if the hierarchy were implemented as a prefix filter rather than by maintaining actual hierarchical data structures. One hint that this might be the case: file creation is extremely slow on modern file systems (on the order of hundreds or maybe thousands per second on an NVMe disk that can otherwise do millions of IOPS), and listing the contents of an extremely large directory is exceedingly slow.

senderista
0 replies
21m

A real hierarchy makes global constraints easier to scale, e.g. globally unique names or hierarchical access controls. These policies only need to scale to a single node rather than to the whole namespace (via some sort of global index).

catlifeonmars
0 replies
16m

In context of the comment I was addressing, it’s clear that filesystem means more than just a key value store. I’d argue that this is generally true in common vernacular.

Spivak
1 replies
39m

If I wanted to use S3 as a filesystem in the manner people are describing, I would probably start by storing filesystem metadata in a sidecar database, so you can get directory listings, permission bits, and xattrs locally and only have to round-trip to S3 when you need the content.
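
A rough sketch of that sidecar idea, assuming SQLite and a made-up schema; listings and stat-style metadata come from the local index, and S3 only gets hit for the content:

  import sqlite3

  db = sqlite3.connect("metadata.db")
  db.execute("""
      CREATE TABLE IF NOT EXISTS entries (
          path   TEXT PRIMARY KEY,  -- logical path, e.g. 'photos/2024/cat.jpg'
          parent TEXT,              -- logical directory, e.g. 'photos/2024'
          s3_key TEXT,              -- where the bytes actually live
          mode   INTEGER,           -- permission bits
          mtime  REAL,
          size   INTEGER
      )
  """)
  db.execute("CREATE INDEX IF NOT EXISTS by_parent ON entries (parent)")

  def listdir(parent: str) -> list[str]:
      # A directory listing is a local index lookup; no S3 round trip needed.
      rows = db.execute("SELECT path FROM entries WHERE parent = ?", (parent,))
      return [path for (path,) in rows]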

SOLAR_FIELDS
0 replies
20m

Isn't this essentially what systems like Minio and SeaweedFS do with their S3 integrations/mirroring/caching? What you describe sounds a lot like SeaweedFS Filer when backed by S3

anonymous-panda
1 replies
1h53m

I think the reason is far more mundane. You can list at most 1,000 objects per request, and getting the next 1,000 requires the result of the previous request, so it's all serial. That means that to list 1M files, you're looking at roughly 1,000 back-to-back requests. Assuming a round-trip time of 50ms, that's easily 50s of just going back and forth, not including the cost of doing the listing itself on a flat iteration. The cost of a 1,000-item list is about the cost of a write, which is kinda slow. Additionally, I suspect each listing is a strongly consistent snapshot, which adds to the cost of the operation (it can be hard to provide an inconsistent view).

I don't think B-trees would help unless you're doing directory traversals, and even then I suspect they're not that beneficial, as your bottleneck is going to be the network round trips and the operations the API exposes. Ultimately, file listing isn't that critical a use case, and most use cases are accomplished through things like object lifecycles, where you tell S3 what you want done and it does it efficiently at the storage layer for you.
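
To make the serial part concrete, a rough boto3 sketch (bucket name is made up); each page can only be requested once the previous page's continuation token has arrived:

  import boto3

  s3 = boto3.client("s3")
  paginator = s3.get_paginator("list_objects_v2")

  count = 0
  # S3 returns at most 1,000 keys per page, and each page request needs the
  # continuation token from the previous response, so this loop is inherently serial.
  for page in paginator.paginate(Bucket="example-bucket"):
      count += len(page.get("Contents", []))
  print(count)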

tsimionescu
0 replies
21m

That's under a minute of a 15m duration. I don't think it matters much.

returningfory2
0 replies
2h3m

Are you performing list calls sequentially? If you have O(100k) directories and are doing O(100k) requests sequentially, 15 minutes works out at O(10ms) per request which doesn’t seem that bad? (assuming my math is correct…)

perryizgr8
0 replies
33m

It's not a good model to think of S3 as having directories in a bucket. It's all objects. The web interface has a visual way of representing prefixes separated by slashes, but that's just a nice way to present the objects. Each object has a key, that key can contain slashes, and you can think of each segment as a directory for your ease of mind.

But that illusion breaks when you try to do operations you usually do with/on directories.

jamesrat
0 replies
49m

I implemented a solution by threading the listing: get the entries in the root, then spin up a separate worker to do the recursion for each directory.
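
Something along these lines, as a rough sketch with boto3 and threads (bucket name and worker count are made up); the top-level "directories" come back as CommonPrefixes when you list with a delimiter:

  from concurrent.futures import ThreadPoolExecutor
  import boto3

  s3 = boto3.client("s3")  # boto3 clients are thread-safe
  BUCKET = "example-bucket"

  def count_under(prefix: str) -> int:
      paginator = s3.get_paginator("list_objects_v2")
      return sum(len(page.get("Contents", []))
                 for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix))

  # Delimiter="/" makes S3 return the top-level "directories" as CommonPrefixes.
  pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Delimiter="/")
  prefixes = [cp["Prefix"] for page in pages for cp in page.get("CommonPrefixes", [])]

  with ThreadPoolExecutor(max_workers=32) as pool:
      print(sum(pool.map(count_under, prefixes)))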

zmmmmm
9 replies
10h6m

The limitations of S3 (and all the cloud "file systems") are quite astonishing when you consider you're paying for it as a premium service.

Try to imagine your astonishment if a traditional storage vendor showed up and told you that their very expensive premium file system they had just sold you:

    - can't store log files because it can't append
      anything to an existing file
    - can't copy files larger than 5GB
    - can't rename or move a file
 
When challenged on how you are supposed to make all your applications work with limitations like that, they glibly told you "oh you're supposed to rewrite them all".

umanwizard
2 replies
9h44m

Amazon doesn’t market S3 as a replacement for file systems, that’s why EBS exists.

Also, is S3 really “very expensive”? Relative to what?

vbezhenar
1 replies
9h9m

S3 usually is the cheapest storage, not only for Amazon, but for other clouds. I don’t understand why.

Cthulhu_
1 replies
9h16m

They're not filesystems though, they're object storage or key/value storage if you will. It's intended to store the log files for long term once they're full.

You can rename / move a file, but it involves copying and deleting the original; I don't understand why they don't have a shortcut for that, but it probably makes sense that the user of the service is aware of the process instead of hiding it.

I'm not sure about the 5GB limit; it's probably documented somewhere why that is. Possibly, like tweets, having an upper limit helps them optimize things. Anyway, there are tools for that too: you can do multipart copies, and there's an official blog post on the subject: https://aws.amazon.com/blogs/storage/copying-objects-greater...
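
Putting those two points together, a minimal sketch of the copy-then-delete "rename" with boto3 (bucket and keys are made up); note that it isn't atomic, so a concurrent reader can briefly see both keys or neither:

  import boto3

  s3 = boto3.client("s3")
  BUCKET = "example-bucket"

  def rename(old_key: str, new_key: str) -> None:
      # S3 has no rename: copy to the new key, then delete the old one.
      # The managed copy() helper switches to multipart copy for large objects,
      # which is how you get past the single-request copy size limit.
      s3.copy({"Bucket": BUCKET, "Key": old_key}, BUCKET, new_key)
      s3.delete_object(Bucket=BUCKET, Key=old_key)

  rename("logs/2024-03-10.txt", "archive/2024-03-10.txt")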

Interesting to note maybe in the context of the post; copy, rename, moving large files, all that could be abstracted away, but that would hide the underlying logic - which might lead to inefficient usage of the service - and worse, make users think it's just a filesystem and use it accordingly, but it's not intended or designed for that use case.

gray_-_wolf
0 replies
5h4m

The current limit is 5TB. The 5GB limit is for a single upload; you can, however, do a multipart upload to get up to the maximum size of 5TB.

https://aws.amazon.com/s3/faqs/

throwaway290
0 replies
9h41m

It's for building things on top. If you want to rename/move/copy data, implement a layer that maps objects to "filenames" or any metadata you like (or use some lib). If you want to write logs, implement append and rotation. But I for example don't and won't need any of that and if it helps keep the API simpler and more reliable then I benefit.

Making S3 behave like a conventional filesystem would be either a very leaky abstraction or a completely different product.

pjc50
0 replies
3h59m

It's not a filesystem, but it has better semantics for distributed operation because of it. Nobody talks about the locking semantics of S3 because it's at the blob level; that rules out whole categories of problems.

And that's also why you can't append. If you had multiple readers while appending, and appending to multiple replicas, guaranteeing that each reader would see a consistent only-forwards read of the append is extremely hard. So simply ban people from doing that and force them to use a different system designed for the purpose of logging.

Microservices. S3 is for blobs. If you want something that isn't a blob, use a different microservice.

ozim
0 replies
8h50m

These “file systems” are not file systems and I don’t understand why people expect them to be.

Some people are creating tools that make those services easier to sync with file systems, but that is not the intended use anyway.

inkyoto
0 replies
8h22m

S3 is an object storage, not a file system. The file system in AWS is called EFS. S3 is not positioned as a substitute for file systems, either.

3weeksearlier
9 replies
12h41m

I dunno, are features like partial file overwrites necessary to make something a filesystem? This reminds me of how there are lots of internal systems at Google which their maintainers keep asserting are not filesystems, but everyone considers them so, to the point where "_____ is not a filesystem" has become an inside joke.

fiddlerwoaroof
4 replies
12h11m

Yeah, it’s sort of funny how “POSIXish semantics” has become our definition of these things, when it’s just one kind of thing that’s been called a filesystem historically.

mickael-kerjean
2 replies
12h2m

Fun experiment I did with my mum: building a storage-independent, Dropbox-like UI [1] for anything that implements this interface:

  type IBackend interface {
    Ls(path string) ([]os.FileInfo, error)
    Cat(path string) (io.ReadCloser, error)
    Mkdir(path string) error
    Rm(path string) error
    Mv(from string, to string) error
    Save(path string, file io.Reader) error
    Touch(path string) error
  }
My mum really couldn't care less about the POSIX semantics as long as she can see the pictures of my kid, which happen to be on S3.

[1] https://github.com/mickael-kerjean/filestash

wwalexander
0 replies
11h54m

Reducing things to basically the interface you laid out is the point of 9p [1], and is what Plan 9’s UNIX-but-distributed design was built on top of. Same inventor as Go! If you haven’t dived down the Plan 9 rabbit hole yet, it’s a beautiful and haunting vision of how simple cloud computing could have been.

[1] https://9fans.github.io/plan9port/man/man9/intro.html

MrJohz
0 replies
5h45m

I think this interface is less interesting than the semantics behind it, particularly when it comes to concurrency: what happens when you delete a folder, and then try and create a file in that folder at the same time? What happens when you move a folder to a new location, and during that move, delete the new or old folders?

Like yes, for your mum's use case, with a single user, it's probably not all that important that you cover those edge cases, but every time I've built pseudo-filesystems on top of non-filesystem storage APIs, those sorts of semantic questions have been where all the problems have hidden. It's not particularly hard to implement the interface you've described, but it's very hard to do it in such a way that, for example, you never have dangling files that exist but aren't contained in any folder, or that you never have multiple files with the same path, and so on.

CobrastanJorji
2 replies
10h44m

They are necessary because as soon as someone decides that S3 is a filesystem, they will look at the other cloud "filesystems," notice that S3 is cheaper than most of them, and then for some reason they will decide to run giant Hadoop fs stuff on it or mount a relational database on it or all other manner of stupidity. I guarantee you S3's customer-facing engineers are fielding multiple calls per week from customers who are angry that S3 isn't as fast as some real filesystem solution that the customer migrated from because S3 was cheaper.

When people decide that X is a filesystem, they try to use it like it's a local, POSIX filesystem, and that's terrible because it won't be immediately obvious why it's a stupid plan.

albert_e
1 replies
9h42m

If a customer makes an IT decision as big as running Hadoop or an RDBMS with S3 as storage ... but does not consult at least an Associate-level AWS Certified architect (who are a dime a dozen) for at least one day's worth of advice, which is probably a couple of hundred dollars at most ...

Can we really blame AWS?

I am sure none of official AWS documentations or examples show such an architecture.

----

Amazon EMR can run Hadoop and use Amazon S3 as storage via EMR FS.

"S3 mountpoints" are a feature specifically for workloads that need to see S3 as a file system.

For block storage workloads there is EBS, and for file storage there are EFS and FSx, which AWS heavily advertises.

albert_e
0 replies
4h17m

*dime a dozen

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment. And it is now too late, not editable)

karmasimida
0 replies
8h47m

Exactly, especially since the concept of a filesystem was defined before internet scale became a thing.

Maybe S3 isn't a filesystem according to this definition, but does it really matter to make it one? I doubt it. The Elastic File System is also an AWS product, but you can't really work with it the way you do with a local one: any folder with over 20k files will basically time out if you do an ls. Does that make EFS a filesystem or not?

YouWhy
7 replies
11h59m

The article is well written, but I am annoyed at the attempt to gatekeep the definition of a filesystem.

Like literally any abstraction out there, filesystems are associated with a multitude of possible approaches with conceptually different semantics. It's a bit sophistic to say that Postgres cannot be run on S3 because S3 is not a filesystem; a better choice would have been to explore the underlying assumptions; (I suspect latency would kill the hypothetical use case of Postgres over S3 even if S3 had incorporated the necessary API semantics - could somebody more knowledgeable chime in?).

A more interesting avenue to pursue would be: what other additions could be made to the S3 API to make it more usable in its own right? For example, why doesn't S3 offer more than one filename per blob (similar to what links do in POSIX)?

zX41ZdbW
3 replies
11h31m

ClickHouse can work with S3 as a main storage. This is possible because a table is a set of immutable data parts. Data parts can be written once and deleted, possibly as a result of a background merge operation. S3 API is almost enough, except for cases of concurrent database updates. In this case, it is not possible to rely on S3 only because it does not support an atomic "write if not exists" operation. That's why external, strongly consistent metadata storage is needed, which is handled by ClickHouse Keeper.

afiori
1 replies
8h40m

Is a "write if not exists" atomic operation enouhg as a concurrency primitive for database locks?

mlhpdx
0 replies
2h4m

Conditional PUT would be a great addition to S3, indeed.

jillesvangurp
0 replies
6h56m

The notion of postgres not being able to run on s3 has more to do with the characteristics of how it works than with it not being a filesystem. After all, people have developed fuse drivers for s3 so they can actually pretend it's a filesystem. But using that to store a database is going to end in tears for the same reasons that using e.g. NFS for this is also likely to end in tears. You might get it to work but it won't be fast or even reliable. And since NFS actually stands for networked file system, it's hard to argue that NFS isn't a filesystem.

Whether something is or isn't a filesystem requires defining what that actually is. A system that stores files would be a simple explanation. Which is clearly something S3 is capable of. This probably upsets the definition gatekeepers for whatever more specific definitions they are guarding. But it has a nice simple logic to it.

It's worth considering that file systems have had a long history, weren't always the way they are now, and predate the invention of relational databases (like Postgres). Technically, before hard disks were invented in the fifties, we had no file systems. Just tapes and punch cards. A tape would contain a single blob of bits, which you'd load in memory. Or it would have multiple such blobs at known offsets. I had cassettes full of games for my Commodore 64, but no disk drive. These blobs were called files, but there was no file system. Some time after the invention of disks, file systems were invented, in the early sixties.

Hierarchical databases were common before relational databases, and filesystems with directories are basically a hierarchical database. S3, lacking hierarchy as a simpler key-value store, clearly isn't a hierarchical database. But of course it's easy to mimic one simply by using / characters in the keys, which is how the FUSE driver probably fakes directories. And S3 even has APIs to list files with a common prefix. A bigger deal is the inability to modify files. You can only replace them with other files (delete and add). That kind of is a show stopper for a database. Replacing the entire database on every write isn't very practical.

defaultcompany
0 replies
1h36m

I’ve wondered this also because it can be handy to have multiple ways of accessing the same file. For example to obfuscate database uuids if they are used in the key. In theory you could implement soft links in AWS by just storing a file with the path to the linked file. But it would be a lot of manual work.

wodenokoto
5 replies
8h35m

Is there a generic name for these distributed cloud file storages?

AWS is S3, google is buckets, Azure is blob storage, the open source version is … ?

dexwiz
2 replies
8h30m

Object Storage

jeffbr13
1 replies
7h5m

I tend to go by Binary Large OBject (BLOB) storage to distinguish between this kind of object storage and "object" as in OOP. BLOB is also what databases call files stored in columns.

OJFord
0 replies
6h8m

When would that be confusing? As in what would an AWS service offering OOP object storage be/mean?

surajrmal
0 replies
3h42m

Google buckets is a bit off - the product is called Google Cloud Storage. Buckets are also a term used by S3 and are equivalent to Azure blob storage containers. They are an intermediary layer that determines attributes for the objects stored within it, such as ACLs and storage class (and therefore cost and performance).

As to your question, object storage[1] seems to be the generic term for the technology. Internally they all rely on naming files based on the hash of their contents for quick lookup, deduplication, and avoiding name clashes.

1: https://en.wikipedia.org/wiki/Object_storage

gilbetron
0 replies
3h46m

"blob storage" is the usual generic term, even though Azure uses it explicitly. It's like calling adhesive bandages, "bandaids" even though that is a specific company's term.

hn72774
5 replies
10h6m

Filesystem software, especially databases, can't be ported to Amazon S3

Hudi, Delta Lake, and Iceberg bridge that gap now. Databricks built a company around it.

Don't try to do relational on object storage on your own. Use one of those libraries. It seems simple but it's not. Late arriving data, deletes, updates, primary key column values changing, etc.

albert_e
3 replies
9h49m

There is specifically a block storage service (EBS), and flavors of it like EBS multi-attach, plus EFS, that can be used if there is a need to port software/databases to the cloud with low-level filesystem support.

Why would we need to do it on object storage, which addresses a different type of storage need?

Nevertheless, there are projects like EMRFS and S3 file system mount points that try to provide file system interfaces to workloads that need to see S3 as a filesystem.

hn72774
1 replies
3h14m

S3 is better for large datasets. It's cheaper and handles large file sizes with ease.

It has become a de-facto standard for distributed, data-intensive workloads like those common with spark.

A key benefit is decoupling the data from the compute so that they can scale independently. EBS is tightly coupled to iops and you pay extra for that.

(Source: a long time working in data engineering)

albert_e
0 replies
1h21m

Yes and I also believe:

Experienced Spark / Data Engineering teams would not assume S3 is readily useable as a filesystem.

This [1] seems like a good guide on how to configure spark for working with Cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html

---

Amazon EMR offers a managed way to run hadoop or spark clusters and it implements an "EMR FS" [2] system to interface with S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...

AWS Glue is another option, which is "serverless" ETL. Source and destination can be S3 data lakes read through a data catalog (Hive or Glue data catalog). During processing, AWS Glue can optionally use S3 [3,4,5] for shuffle partitions.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...

albert_e
0 replies
4h15m

*flavors

*can be used

*file system

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment. And it is now too late, not editable)

8n4vidtmkvmk
0 replies
9h4m

I still don't understand why you'd want to do it in the first place. Just buy some contiguous storage.

cynicalsecurity
5 replies
6h46m

Backblaze B2 is worth mentioning while we are speaking of S3. I'm absolutely in love with their prices (3 times lower than S3's). (I'm not their representative.)

silvertaza
2 replies
6h32m

With every alternative, the prevailing issue is the fact that your data is only as safe as the company your data is with. But I think this can be remedied by keeping a second backup with another provider.

didgeoridoo
0 replies
6h21m

B2 having an S3-compatible API available makes this particularly easy :)

OJFord
0 replies
6h9m

Backblaze is like if Amazon spun AWS S3 out as its own business (and it added some backup helper tooling as a result) though, I wouldn't really worry any more about it. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you wanted to restore or on a new device) and still be much cheaper.

overstay8930
1 replies
2h3m

We liked B2, but not enough to pay for IPv4 addresses. It's insane that they advertise as a multi-cloud solution but basically kill any chance of adoption when NAT gateways and IPv4 charges are everywhere. We would literally save money paying B2 bandwidth fees (high read, low write), but not when being pushed through a NAT64 gateway or paying an hourly charge just to be able to access B2.

Kwpolska
0 replies
1h51m

How could they launch a cloud service like this and not have IPv6 in 2015? What other basic things did they cheap out on?

nickcw
4 replies
8h59m

Great article - would have been useful to read before starting out on the journey of making rclone mount (mount your cloud storage via fuse)!

After a lot of iterating we eventually came up with the VFS layer in rclone which adapts S3 (or any other similar storage system like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle Object Storage, etc) into a POSIX-ish file system layer in rclone. The actual rclone mount code is quite a thin layer on top of this.

The VFS layer has various levels of compatibility. The lowest, "off", just does directory caching. In this mode, as the article states, you can't read and write to a file simultaneously, you can't write to the middle of a file, and you can only write files sequentially. Surprisingly, quite a lot of things work OK with these limitations. The next level up is "writes" - this supports nearly all the POSIX features that applications want, like being able to read and write to the same file at the same time, write to the middle of the file, etc. The cost for that, though, is a local copy of the file, which is uploaded asynchronously when it is closed.

Here are some docs for the VFS caching modes - these mirror the limitations in the article nicely!

https://rclone.org/commands/rclone_mount/#vfs-file-caching

By default S3 doesn't have real directories either. This means you can't have a directory with no files in it, and directories don't have valid metadata (like modification time). You can create zero-length files ending in /, which are known as directory markers, and a lot of tools (including rclone) support these. Not being able to have empty directories isn't too much of a problem normally, as the VFS layer fakes them and most apps then write something into their empty directories pretty quickly.

So it is really quite a lot of work trying to convert something which looks like S3 into something which looks like a POSIX file system. There is a whole lot of smoke and mirrors behind the scene when things like renaming an open file happens and other nasty corner cases like that.

Rclone's lower level move/sync/copy commands don't bother though and use the S3 API pretty much as-is.

If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

klauspost
1 replies
7h0m

If I could change one thing about S3's API I would like an option to read the metadata with the listings.

Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

You can create zero length files ending in /

Yeah. Though you could also consider "shared prefixes" in listings as directories by itself. That of course makes directories "stateless" and unable to exist if there are no objects in there - which has pros and cons.

Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

Yes, that gives severe limitations to clients. However it does make the "server" time the reference. But we have to deal with the same limitation for client side replication/mirroring.

My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key. AWS recently added "GetObjectAttributes" - but it doesn't add version information, which would have fit in nicely there.

nickcw
0 replies
6h31m

Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

Is this "secret" parameter documented somewhere? Sounds very useful :-) Rclone knows when it is talking to Minio so we could easily wedge that in.

My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key

Yes that is annoying having to do a List just to figure out which object Version is being referred to. (Rclone has this problem when using --s3-list-version).

Hakkin
1 replies
7h57m

If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

I wonder if you couldn't hack this in by storing the metadata in the key name itself? Obviously, with the key length limit of 1024 you would be limited in how much metadata you could store, but it's still quite a lot of space, even taking into account the file path. You could use a delimiter that would be invalid in a normalized path, like '//', for example: /path/to/file.txt//mtime=1710066090

You would still be able to fetch "directories" via prefixes and direct files by using '<filename>//' as the prefix.

This kind of formatting would probably make it pretty incompatible with other software though.
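
A toy sketch of that convention, purely hypothetical, with '//' as the metadata delimiter:

  # Hypothetical key convention: '<path>//mtime=<unix timestamp>'
  DELIM = "//"

  def encode_key(path: str, mtime: int) -> str:
      # 'photos/cat.jpg' -> 'photos/cat.jpg//mtime=1710066090'
      return f"{path}{DELIM}mtime={mtime}"

  def decode_key(key: str) -> tuple:
      # Returns (path, mtime or None) for keys that follow the convention.
      if DELIM not in key:
          return key, None
      path, meta = key.rsplit(DELIM, 1)
      mtime = int(meta[len("mtime="):]) if meta.startswith("mtime=") else None
      return path, mtime

  assert decode_key(encode_key("photos/cat.jpg", 1710066090)) == ("photos/cat.jpg", 1710066090)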

nickcw
0 replies
6h26m

I think that is a nice idea - maybe something we could implement in an overlay backend. However, people really like the fact that the objects they upload with rclone arrive on S3 with the filenames they had originally, so I think the incompatibility with other software would make it unattractive for most users.

throwaway892238
3 replies
10h49m

The "simple" in S3 is a misnomer. S3 is not actually simple. It's deep.

Simple doesn't mean "not deep". It means having the fewest parts needed in order to accomplish your requirements.

If you require a distributed, centralized, replicated, high-availability, high-durability, high-bandwidth, low-latency, strongly-consistent, synchronous, scalable object store with HTTP REST API, you can't get much simpler than S3. Lots of features have been added to AWS S3 over the years, but the basic operation has remained the same.

svat
1 replies
2h28m

It means having the fewest parts needed in order to accomplish your requirements.

That is exactly what "deep" means, in the terminology of this post (from Ousterhout's book A Philosophy of Software Design). Simple means "not complex" (see also Rich Hickey's talk Simple Made Easy: https://www.infoq.com/presentations/Simple-Made-Easy/), while "deep" means providing/having a lot of internally-complex functionality via a small interface. The latter is a better description of S3 (which is what you seem to be saying too) than "simple" which would mean there isn't much to it.

throwaway892238
0 replies
35m

Hickey's definition of simple is wrong. It's not the opposite of complex at all. They are not opposites, nor mutually exclusive.

  - Easy is when something does not require much effort.
  - Simple means the least complex it can be and still work.
  - Complex means there are lots of components.
These are all quite different concepts:

  - Easy is a concept that distinguishes the amount of work needed to use a solution
  - Simple is a concept that distinguishes whether or not there is an excess number of interacting properties in a system
  - Complex is a concept describing the quality of having a number of interacting properties in a system
Hickey's talk is useful in terms of thinking about software, but it also contains many over-generalizations which are incorrect and lead to incorrect thinking about things that aren't software. (Even some of his declarations about software are wrong)

"Deep", in the context of software complexity, probably only makes sense in terms of describing the number of layers involved in a piece of technology. You could make something have many layers, and it could still be simple, or be complex, or easy.

ahepp
0 replies
2h27m

In terms the article puts forth, I would almost argue that simple implies deep (and the associated “narrow” interface).

dmarinus
3 replies
12h0m

I talked to people at AWS who work in RDS Aurora and they hinted they use S3 internally as a backend for MySQL and PostgreSQL.

readyman
1 replies
11h23m

Big if true. That was definitely not in the AWS cert I took lol.

WatchDog
0 replies
29m

Maybe for snapshots, but certainly not for live data.

type_Ben_struct
2 replies
10h19m

Tools like LucidLink and Weka go some way toward making S3 even more of a “file system”. They break files into smaller chunks (S3 objects), which helps with partial writes, reads, and performance, alongside tiering of data from S3 to disk when needed for performance.

hnlmorg
0 replies
9h38m

I don’t know a whole lot about LucidLink but Weka basically uses S3 as a dataplane for their own file system.

hiAndrewQuinn
2 replies
10h2m

I feel like I understand the lasting popularity of the humble FTP fileserver a bit better now. Thank you.

jugg1es
1 replies
3h8m

oh but amazon offers SFTP on top of S3 so you don't have to miss out.

hiAndrewQuinn
0 replies
2h53m

If it's offered on top of S3, though, doesn't it still have all the same issues of needing to totally overwrite files?

chubot
2 replies
2h14m

Filesystem software, especially databases, can't be ported to Amazon S3

This seems mistaken. Porting databases that run on local disk to S3 seems like a good way to get a lashing from https://aphyr.com/

Can any databases do it correctly?

If so, I doubt they work with the model of partial overwrites. They probably have to do something very custom, and either sacrifice a lot of tail latency, or their uptime is capped by the uptime of a single AWS availability zone. Doesn't seem like a great design.

(copy of lobste.rs comment)

est31
1 replies
1h47m

My employer (Neon) offers Postgres databases that run on top of a couple of caching layers at the end of which there is S3: https://neon.tech/docs/introduction/architecture-overview

Directly exposing every write to S3 gives you the partial overwrite issues as described, but one can collect a bunch of traffic and push state to S3 once it reaches a threshold. In the meantime, recent writes to the Postgres WAL are held outside of S3 in a replicated on-disk cache.

chubot
0 replies
1h7m

Thanks for the link.

But I searched the docs for "durability" and got zero results. Before I use anything like this, I'd like to see what durability settings are used:

https://www.postgresql.org/docs/current/non-durability.html

Litestream documents their data loss window; it seems like Neon should too:

https://litestream.io/tips/

By default, Litestream will replicate new changes to an S3 replica every second. During this time where data has not yet been replicated, a catastrophic crash on your server will result in the loss of data in that time window.

I also searched for "data loss" and got zero results -- this is important because Neon is almost certainly sacrificing durability for performance.

chrisblackwell
2 replies
3h3m

Random note: Has anyone noticed how fast the author's webpage is? I know it's static, but I mean it's fast even for the DNS lookup. I would love to know what they have on.

overstay8930
0 replies
2h2m

Full stack Cloudflare is really fast

adverbly
0 replies
2h52m

The response headers include

server: cloudflare

You said it though - the reason is that it's static, without any JS/framework/SPA round-trip requests.

alphazard
2 replies
3h1m

S3 is not even files, and definitely not a filesystem.

The thing I would expect from a file abstraction is mutability. I should be able to edit pieces of a file, grow it, shrink it, read and write at random offsets. I shouldn't have to go back up to the root, or a higher level concept once I have the file in hand. S3 provides a mutable listing of immutable objects, if I want to do any of the mutability business, I need to make a copy and re-upload. As originally conceived, the file abstraction finds some sectors on disk, and presents them to the client as a contiguous buffer. S3 solves a different problem.

Many people misinterpret the Good Idea from UNIX "everything is a file" to mean that everything should look like a contiguous virtual buffer. That's not what the real Good Idea is. Really: everything can be listed in a directory, including directories. There will be base leaves, which could be files, or any object the system wants to present to a process, and there will be recursive trees (which are directories). The directories are what make the filesystem, not the type of a particular leaf. Adding a new type of leaf, like a socket or a frame buffer, or whatever, is almost boring, and doesn't erode the integrity of the real good idea. Adding a different kind of container like a list, would make the structure of the filesystem more complex, and that would erode the conceptual integrity.

S3 doesn't do any of these things, and that's fine. I just want a place to put things that won't fit in the database, and know they won't bitrot when I'm not looking. The desire to make S3 look more like a filesystem comes from client misunderstanding of what it's good at/for, and poor product management indulging that misunderstanding instead of guarding the system from it.

thinkharderdev
0 replies
52m

S3 is not even files, and definitely not a filesystem.

I agree. To me the correct analog for S3 is a block storage device (a very weird one where blocks can be any size and can have a key associated with them) and not a filesystem. A filesystem is an abstraction that sits on top of a block storage device and so an "S3 filesystem" would have to be an abstraction that sits on top of S3 as the underlying block storage.

akerl_
0 replies
2h55m

How do read-only filesystems align with your definition?

MatthiasPortzel
2 replies
4h12m

This article was an epiphany for me because I realized I've been thinking of the Unix filesystem as if it has two functions: read_file and write_file. (And then getting frustrated with the filesystem APIs in programming languages.)

markhahn
1 replies
54m

So you came from an S3 or other put-get world, and found actual filesystems odd?

I suppose that's not so different from a WMP user's epiphany when they discover processes, shells, etc.

MatthiasPortzel
0 replies
18m

Well I’m used to an application-level view of the file system.

A document editor or text editor opens files and saves files, but these are whole-document operations. I can’t open a document in Sublime Text without reading it, and I can’t save part of a file without saving all of it. So it’s not obvious that these would be different at an OS level.

As the post points out, there are uses for Unix’s sub-file-level read-and-write commands, but I’ve never needed them.

tison
1 replies
6h55m

This was discussed in https://github.com/apache/arrow-rs/issues/3888, which compares object_store in Apache Arrow to the APIs provided by Apache OpenDAL.

Briefly, Apache OpenDAL is a library providing FS-like APIs over multiple storage backends, including S3 and many other cloud storage services.

A few database systems, such as GreptimeDB and Databend, use OpenDAL as a better S3 SDK to access data on cloud storage.

Other solutions exist to manage filesystem-like interfaces over S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio and JuiceFS need to be deployed standalone and have a dedicated internal metadata service.

Lucasoato
0 replies
5h27m

I'm not sure if Alluxio could be substituted by OpenDAL as a local cache layer for TrinoDB.

remram
1 replies
2h48m

I am currently pondering this exact problem. I want to run a file-sharing web application (think: NextCloud) but I don't want to use expensive block storage or the dedicated server's disk space for the files, as some of them will be accessed infrequently.

I am wondering if s3fs/rclone-mount is sufficient, or if I should use something like JuiceFS that adds random-access, renaming, etc on top of it. Are those really necessary APIs for my use case? Is there only one way to find out?

(The app doesn't have native S3 support)

cuno
0 replies
1h12m

It depends on if you want to expose filesystem semantics or metadata to applications using it. For example random access writes are done by ffmpeg, which is a workhorse of the media industry, but most things can't handle that or are too slow. We had to build our own solution cunoFS to make it work properly at high speeds.

d-z-m
1 replies
2h15m

S3 is a cloud filesystem, not an object-whatever. [...]I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit of a load bearing fiction.

Does anyone actually think this? I have never encountered anyone who has described S3 in these terms.

teaearlgraycold
0 replies
47m

Not sure if the author is aware of EFS

ahepp
1 replies
2h17m

Are filesystems the correct abstraction to build databases on? Isn’t a filesystem a database in a way? Is there a reason to build a database on top of a filesystem abstraction rather than a block abstraction?

To say you can’t build an efficient database on top of S3 makes sense to me. S3 is already a certain kind of data-storing abstraction optimized for certain usages. If you try and build another data-storing abstraction optimized for incompatible usages on top of that, you are going to have a difficult time.

d0gsg0w00f
0 replies
2h12m

In my $dayjob as a cloud architect, I sometimes suggest S3 as an alternative to pulling massive JSON blobs from RDS Postgres/Redis etc. As long as their latency requirements are loose enough, there's no reason you can't.
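
A minimal sketch of that pattern with boto3 (bucket, prefix, and IDs are made up):

  import json
  import boto3

  s3 = boto3.client("s3")
  BUCKET = "example-bucket"

  def save_blob(entity_id: str, payload: dict) -> None:
      # Store the JSON blob as an object instead of a huge column in the database.
      s3.put_object(
          Bucket=BUCKET,
          Key=f"blobs/{entity_id}.json",
          Body=json.dumps(payload).encode(),
          ContentType="application/json",
      )

  def load_blob(entity_id: str) -> dict:
      obj = s3.get_object(Bucket=BUCKET, Key=f"blobs/{entity_id}.json")
      return json.loads(obj["Body"].read())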

BirAdam
1 replies
1h29m

Underneath the software, there’s still a filesystem with files.

If you stand up an S3-compatible store with Ceph, you still have a filesystem on spinning rust or fancy SSDs. There's just a bunch of stuff on top of that. It's cool, but saying "there's no filesystem" describes only what the customer or intermediary sees, not what is actually happening.

seabrookmx
0 replies
45m

S3 actually uses a completely custom system[1] for writing bytes to disk. I haven't seen much in the way of details on the on-disk format but I certainly wouldn't assume it resembles a normal filesystem.

[1]: https://aws.amazon.com/blogs/storage/how-automated-reasoning...

tutfbhuf
0 replies
54m

S3 is obviously not a filesystem in the sense of a POSIX filesystem. And I would argue it is not a filesystem, even if we were to relax POSIX filesystem semantics (do not implement the full spec). But what is certainly possible is to span a filesystem on top of S3. It is basically possible to span a filesystem on anything that can store data. You can even go crazy for demonstration purposes and put a filesystem on top of YouTube (there are some tech demos for that on GitHub).

I think a better question is whether there are any good filesystem implementations on top of S3. There are many attempts like s3fs-fuse[^1] or seaweedfs[^2], but I have not heard many stories about their use at scale from big companies. Just recently there was a post here about cunoFS[^3]. It is a startup that implements a POSIX-compliant (supports symlinks, hard links (emulated), UIDs & GIDs, permissions, random writes, etc.) filesystem on top of S3/AZ/GCP storage and claims to have really good performance. I think only time will tell if it works out in practice for companies to use S3 as a filesystem through fs implementations on top of S3.

[^1]: https://github.com/s3fs-fuse/s3fs-fuse

[^2]: https://github.com/seaweedfs/seaweedfs

[^3]: https://news.ycombinator.com/item?id=39640307

svat
0 replies
2h25m

It's nice to see Ousterhout's idea of module depth (the main idea from his A Philosophy of Software Design) getting more mainstream — mentioned in this article with attribution only in "Other notes", which suggests the author found it natural enough not to require elaboration. Being obvious-in-hindsight like this is a sign of a good idea. :-)

The concept of deep vs shallow modules comes from John Ousterhout's excellent book. The book is [effectively] a list of ideas on software design. Some are real hits with me, others not, but well worth reading overall. Praise for making it succinct.
somedudetbh
0 replies
59m

Amazon S3 is the original cloud technology: it came out in 2006. "Objects" were popular at the time and S3 was labelled an "object store", but everyone really knows that S3 is for files.

Alternative theory: everyone who worked on this knew that it was not a filesystem and "object store" is a description intended to describe everything else pointed out in this post.

"Objects were really popular" is about objects as software component that combines executable code with local state. None of the original S3 examples were about "hey you can serialize live objects to this store and then deserialize them into another live process!" It was all like "hey you know how you have all those static assets for your website..." "Objects" was used in this sense in databases at the time in the phrase "binary large object" or "blob". S3 was like "hey, stuff that doesn't fit in your database, you know...objects...this is a store for them."

This is meant to describe precisely things like "listing is slow", because when S3 was designed, the launch use cases assumed an index of contents existed _somewhere else_. Because, yeah, it's not a filesystem. It's an object store.

sbussard
0 replies
2h52m

It’s been a while, but I really like the way google handles its file system internally. No confusion.

jkoudys
0 replies
2h48m

I absolutely loved this article. Super well written with interesting insights.

inkyoto
0 replies
12h28m

S3 is a tagged, versioned object storage with file-like semantics implemented in the AWS SDK (via the AWS S3 APIs). The S3 object key is the tag.

Files and folders are used to make S3 buckets more approachable to those who either don't know or don't want to know what it actually is, and one day they get a surprise.

igtztorrero
0 replies
5h12m

Check out kopia.io, backup software that uses S3 to store files as blocks or pages.

You can browse, search, and sort the files and directories of the different snapshots or versions of the files.

I love it !

For me it's a file system in S3.

Bonus: you must use a key to encrypt the files.

globular-toast
0 replies
9h27m

A filesystem is an abstraction built on a block device. A block device just gives you a massive array of bytes and lets you read/write from them in blocks (e.g. write these 300 bytes at position 273041).

A block device itself is an abstraction built on real hardware. "Write these 300 bytes" really means something like "move needle on platter 2 to position 6... etc"

S3 is just a different abstraction that is also built on raw storage somehow. It's a strictly flat key-object store. That's it. I don't know why people have a problem with this. If you need "filesystem stuff" then implement it in your app, or use a filesystem. You only need to append? Use a database to keep track of the chain of appends and store the chunks in S3. Doesn't work for you? Use something else. Need to "copy"? Make a new reference to the same object in your db. Doesn't work for you? Use something else.

S3 works for a lot of people. Stop trying to make it something else.

And stop trying to change the meaning of super well-established names in your field. A filesystem is described in text books everywhere. S3 is not a filesystem and never claimed to be one.

Oh and please study a bit of operating system design. Just a little bit. It really helps and is great fun too.

gjvc
0 replies
9h11m

JFC, the number of people on this thread missing the difference between object storage and a blocks-and-inodes filesystem is alarming.

finalhacker
0 replies
6h7m

S3 does not implement the VFS API, but you can treat it as a software-defined storage system, just like Ceph.

There are so many applications that depend on file storage, such as MySQL, but horizontal scaling for those apps is still difficult in many cases. Replacing the VFS API with S3 storage is, in my experience, perhaps becoming a trend.

arvindamirtaa
0 replies
3h58m

Like Gmail is emails but not IMAP. It's fine. We have seen that these kinds of wrappers work pretty well most of the time considering the performance and simplicity they bring in building and managing these systems.

Twirrim
0 replies
12h4m

S3 is a key value store. Just happens to be able to store really large values.