
Things you wish you didn't need to know about S3

coolgoose
114 replies
1d13h

A lot of them are interesting points, but I am not sure I agree with the complaint that the file system is case sensitive.

That's how it should be and I am annoyed at macos for not having it.

smt88
72 replies
1d13h

That's how it should be

Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

Case sensitivity in file names is surprising even to non-technical people. If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"

Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.

lultimouomo
17 replies
1d12h

Case insensitive matching is a surprisingly complicated, locale-dependent affair. Should İ.txt and i.txt match? (Note that the first file is not named I.txt).

Case insensitive filesystems make about as much sense as ASCII-only filenames.

jodrellblank
11 replies
1d6h

And yet case insensitive file name matching / string matching is one of my favourite windows features. It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character, that they are different ASCII codes is a behind the scenes implementation detail.

(That said, S3 isn’t a filesystem, it’s more like a web hashtable key-to-blob storage)

kaashif
5 replies
1d5h

“e” and “E” are the same character

They don't look like the same character to me. A character is a written symbol. These are different symbols.

What definition of "character" are you using where they're the same character?

I haven't ruled out that I am wrong, this is a naive comment.

tacostakohashi
2 replies
1d5h

You are confusing characters with glyphs. A glyph is a written symbol.

taeric
0 replies
1d3h

And you seem to be conflating characters and letters. There are fewer letters in the standard alphabet than characters we use to write them, largely because we do distinguish between some letter forms.

I suppose you could imagine a world where we don't, in fact, do this with just the character code. Seems fairly different from where we are, though?

kaashif
0 replies
1d4h

I thought that if they're different glyphs they're different characters.

Surely the fact that they're represented differently in ASCII means ASCII regards them as different characters?

Whether they're different glyphs or not depends on the font.

jodrellblank
0 replies
16h54m

When you press the "E" key on a US keyboard and "e" comes out, do you return the keyboard because it's broken? If not, then you know what definition I'm using even if I misnamed it.

Zambyte
0 replies
19h47m

Are the words hello and HELLO spelled differently? I am pretty squarely in the camp that filesystems should be case sensitive (perhaps with an insensitive shell on top), but I would not consider those two words as having a different spelling. To me that means they are the same sequence of characters.

Zambyte
2 replies
1d4h

It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

Can you elaborate on this?

jodrellblank
1 replies
16h23m

Every single time I type a path or filename (or server name) in the shell, or in Windows explorer, or in a file -> open or save dialog, I don't trip over capitalization. If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

When I look at a directory listing and it has "XF86Config", I read it in my head as "ecks eff eight six config" not "caps X caps F num eight num six initial cap Config" and I can type what I read and don't have to double-check if it's config or Config.

Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Case sensitivity is like walking down a corridor and someone hitting you to a stop every few steps and saying "you're walking Left Right Left Right but you should be walking Right Left Right Left".

Case insensitivity is like walking down a corridor.

In PowerShell, some cmdlets are named like Add-VpnConnection where the initialism drops to lowercase after the first letter, others like Get-VMCheckpoint where the initialism stays capitalised, others mixed like Add-NetIPHttpsCertBinding where IP is caps but HTTPS isn't - any capitalisation works for running them or searching them with get-command or tab-completing them. I don't have to care. I don't have to memorise it, type it, pay attention to it, trip over it, I don't have to care!

"A programming language is low level when its programs require attention to the irrelevant." - Alan Perlis.

DNS names - ping GOOGLE.COM works, HTTPS://NEWS.YCOMBINATOR.COM works in a browser, MAC addresses are rendered with caps or lowercase hex on different devices, so are IPv6 addresses in hex format, email addresses - firstname.lastname or Firstname.Lastname is likely to work. File and directory access behaving the same means it's less bother. In Vim I :set ignorecase.

In PowerShell even string equality checks are case insensitive by default, and string match and split too. When I'm doing something like searching a log I want to see the English word 'error' whether it's 'error' or 'ERROR' or 'Error', and I don't know in advance which it will be.

If I say the name of a document to a person I don't spell out the capitalisation. I don't want to have to do that to the computer, especially because there is almost no reason to have "Internal site 2 Network Diagram" and "INTERNAL site 2 network diagram" and "internal site 2 NETWORK DIAGRAM" in the same folder (and if there were, I couldn't easily keep them apart in my head).

All the time in command prompt shell, I press shift less often, type less, change directories and work with files more smoothly with less tripping over hurdles and being forced to stop and doublecheck what I'm tripping over when I read "word" and typed "word" and it didn't work.

On the other hand, the edge cases it causes me are ... well, I can't think of any because I don't want to put many files differing only by case in one directory. Maybe uncompressing an archive which has two files which clash? I can't remember that happening. Maybe moving a script to a case sensitive system? I don't do that often. In PowerShell, method calls are case insensitive. C# has "string".StartsWith() and JavaScript has .startsWith() and PowerShell will take .startswith() or .StartsWith or .Startswith or anything else. That occasionally clashes if there's a class with the same name in different case but that's rare, even.

In short, the computer pays attention to trivia so I don't have to. That's the right way round. It's about the best/simplest implementation of Do What I Mean (DWIM) that's almost always correct and almost never wrong.

jasomill
0 replies
13h7m

If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

Adding

  shopt -s nocaseglob
to ~/.bashrc makes globbing case-insensitive in bash[1].

Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Adding

  set completion-ignore-case on
to ~/.inputrc makes completion case-insensitive in bash (and other programs that use libreadline)[2].

Both options are independent of file system case-sensitivity.

[1] https://www.gnu.org/software/bash/manual/html_node/The-Shopt...

[2] https://tiswww.cwru.edu/php/chet/readline/readline.html#inde...

lultimouomo
1 replies
1d6h

People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character

They are the same character to you, a native speaker of a Western language written in a Latin script. They are the same to you because you are, in fact, an ASCII machine. Many, many people in the world are not.

jodrellblank
0 replies
16h58m

They are the same to me, they are different in ASCII, therefore I am not an ASCII machine. To me, the person using the computer to do work. Not the person wanting to do extra work to support the computer's internal leaky abstractions of data storage.

Your position, the position of too many people, is that I, a native speaker of English, should not be allowed to have a computer working how English works because somewhere, someone else is different. This is like saying I shouldn't be allowed an English spell checker because there are other people who speak other languages.

HeatrayEnjoyer
3 replies
1d11h

How would locale matter?

ckolkey
1 replies
1d10h

Off the top of my head, in Turkish, `i` doesn't become `I`, it becomes `İ`. And `ı` is the lower case version of `I`.

Dylan16807
0 replies
17h16m

You don't need to decide how to upper or lower case a character to be insensitive to case, though. Treating them all as matching isn't a terrible option.

GolDDranks
0 replies
1d10h

For example, it depends on the locale if the capitalized form of ß is ß or SS.

segfaltnh
0 replies
1d3h

Complicated for whom? As a user, I have little pity for making life easier for developers and kernels.

josephcsible
13 replies
1d13h

If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"

But you also wouldn't say that if they sent "Book - Draft 1.docx", "Book Draft I.docx", "BookDraft1.docx", "Book_Draft_1.docx", or "Book Draft 1.doc", and surely you wouldn't want a filesystem to treat all of them as the same.

quickslowdown
4 replies
1d12h

This is a personal reason, but the reason I prefer case sensitive directory names is I can make "logical groupings" for things. So, my python git directory might have "Projects/" and "Packages/," and the capitalization not only makes them stand out as a sort of "root path" for whatever's underneath, but the capitalization makes me conscious of the commands I'm typing with that path. I can't just autopilot a path name, I have to consciously hit shift when tab completion stops working.

That might sound like a dumb reason, but it's kept me from moving things into the wrong directory, or accidentally removing a directory multiple times in the past.

I also use Windows regularly and it really isn't a hindrance, so maybe I wouldn't actually be bothered if everything was case sensitive.

pwagland
2 replies
1d5h

TBF, you don't need case sensitive FS for that, just case retaining is enough. And then have the option on how to sort it.

josephcsible
1 replies
1d2h

Don't you need case sensitivity for this part?

I can't just autopilot a path name, I have to consciously hit shift when tab completion stops working.

On a system that's case retaining but not case sensitive, wouldn't "pr" autocomplete to "Projects"?

zoky
0 replies
17h29m

No, MacOS doesn’t do that. `cat Foo` and `cat foo` will both work, but only the first one will tab complete if the file is called `Foo`.

notjoemama
0 replies
1d2h

I like it! That's a great idea.

To me, this sounds like a great practice for terminal environments but may be less intuitive when using file system apps. I could easily overlook a single letter capitalization in a GUI view of many directories. Maybe it's because at a terminal the "view" into the file system is narrow?

Now I'm wondering how I can use this in my docker images. I mean that might irritate devops. Well, maybe they'll like it too. Man, thanks for posting this.

stronglikedan
3 replies
1d3h

Capitalization isn't part of grammar. Those examples are different strings of characters altogether.

otteromkram
1 replies
1d2h

I'll augment your statement by noting that punctuation is also not part of grammar.

selenography
0 replies
1d1h

Another classic counterexample: "This book is dedicated to my parents, Ayn Rand, and God." "This book is dedicated to my parents, Ayn Rand and God."

selenography
0 replies
1d1h

The classic, if crude, counterexample: "I helped my uncle Jack off a horse."

(The uncapitalized version doesn't just have different semantics; it has a completely different parse-tree!)

ooterness
2 replies
1d2h

You have to draw the line somewhere, but I do appreciate when the UI sorts "Book draft 2" before "Book draft 11". That requires nontrivial tokenization logic and inference, but simple heuristics can be right often enough to be useful.

On that note, ASCIIbetical sort is never the right answer. There is a special place in hell for any human-facing UI that sorts "Zook draft 1" between "Book draft 1" and "book draft 1".
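
For the curious, the kind of heuristic such a UI might use is a "natural sort" key - a rough sketch in Python, nothing S3-specific, with made-up sample names:

  import re

  def natural_key(name: str):
      # Split into text and number chunks so "draft 2" sorts before "draft 11";
      # folding case also keeps "Book" and "book" together instead of ASCIIbetical order.
      return [int(chunk) if chunk.isdigit() else chunk.casefold()
              for chunk in re.split(r"(\d+)", name)]

  names = ["Book draft 11", "book draft 2", "Zook draft 1", "Book draft 1"]
  print(sorted(names, key=natural_key))
  # ['Book draft 1', 'book draft 2', 'Book draft 11', 'Zook draft 1']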

yencabulator
0 replies
21h5m

And that line, at least for sorting, belongs firmly outside the filesystem.

Sorting is locale-dependent. Whether a letter-with-dots sorts next to letter-without-dots or somewhere completely different has no correct global answer.

saghm
0 replies
1d1h

I think there's a pretty big difference between how the UI orders things and how the filesystem treats things as equivalent. A filesystem treating names case sensitively doesn't prevent the UI from tokenizing the names in any other arbitrary way

cmcconomy
0 replies
1d12h

you called it - those are different situations all right

Lutger
11 replies
1d9h

They are just not the same characters. A filesystem should not have an opinion on what strings of characters _mean_ the same. It is the wrong level of abstraction.

Filenames might not even be words at all, and they're surely not limited to English. We shouldn't implement rules and conventions from spoken English at the filesystem level, certainly not in S3.

MacOS and Windows are just wrong about this.

jodrellblank
4 replies
1d7h

Windows doesn’t have it at the file system layer, NTFS is case sensitive. Windows has it at the Win32 subsystem layer, see replies and comments here:

https://superuser.com/questions/364057/

marcosdumay
1 replies
1d3h

That's way worse than just putting it on the file system.

Now you have hidden information that you can't ever change, and that may or may not impact whatever you are doing.

jodrellblank
0 replies
16h12m

What hidden information that you can't ever change?

lelanthran
1 replies
1d6h

Windows doesn’t have it at the file system layer, NTFS is case sensitive.

I think the common phrasing is "case-aware, not case-sensitive".

jasomill
0 replies
20h29m

No, NTFS has always been at least optionally case sensitive; current Windows versions even allow case-sensitivity to be controlled on a per-directory basis[1], which even works for (some) Win32 programs:

  Microsoft Windows [Version 10.0.22631.3593]
  (c) Microsoft Corporation. All rights reserved.
  
  C:\Users\jtm>mkdir foo
  
  C:\Users\jtm>fsutil file setCaseSensitiveInfo foo
  Case sensitive attribute on directory C:\Users\jtm\foo is enabled.
  
  C:\Users\jtm>echo bar > foo\bar.txt
  
  C:\Users\jtm>echo Bar > foo\Bar.txt
  
  C:\Users\jtm>dir foo
   Volume in drive C is Aristotle-Win
   Volume Serial Number is E4AE-428B
  
   Directory of C:\Users\jtm\foo
  
  2024-05-31  17:55    <DIR>          .
  2024-05-31  17:55    <DIR>          ..
  2024-05-31  17:55                 6 Bar.txt
  2024-05-31  17:55                 6 bar.txt
                 2 File(s)             12 bytes
                 2 Dir(s)  41,524,133,888 bytes free
  
  C:\Users\jtm>type foo\bar.txt
  bar
  
  C:\Users\jtm>type foo\Bar.txt
  Bar
[1] https://learn.microsoft.com/en-us/windows/wsl/case-sensitivi...

ozim
3 replies
1d8h

You're looking at this from a technical perspective. From the average person's perspective, even files are too much technicality to deal with.

As a user I want my work to be preserved, I want to view my photos, and I want the system to know where the funny photo of my dog from last Christmas is.

As a developer I need an identifier for a resource, and I am not going to let the user decide on the ID of that resource; I put files in the system under a GUID and keep whatever the user cares about as metadata.

Exposing average people to the filesystem is the wrong level of abstraction. That is why iOS and Android apps are going that way - and since I myself am used to dealing with files, it annoys me that I cannot have that level of control, but I accept that I am quite technical.

graemep
2 replies
1d7h

Dealing with files used to be something everyone interacting with computers had to do. It is something average people can do.

I think too much abstraction is a mistake and adds a lot of unneeded complexity.

People should learn something about technology they use. If you want to drive, you need to understand how steering wheels work; if you want to drive a manual car (usual where I live and have lived) then you need to know how to work a gear stick and the effect of changing gear.

mgkimsal
1 replies
1d3h

used to be something everyone interacting with computers had to do

There were far fewer people 'interacting with computers' at that level years ago.

graemep
0 replies
19h37m

Everyone with an office job was still a lot of people though.

frizlab
1 replies
1d8h

And so should we be able to have “é.txt” and “é.txt” in the same directory (with a different UTF-8 normalization?) What encoding should we use BTW?

I’m not advocating for case-insensitive fs (literally the first thing I do when I get a Mac is reformat it to be on a case-sensitive fs), but things are not that simple either.

marcosdumay
0 replies
1d3h

And so should we be able to have “é.txt” and “é.txt” in the same directory

That's what Linux does.

It does create some problems that seem to never happen in practice, while it avoids some problems that seem to happen once in a while. So yeah, I'd say it's a good idea.
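
A quick way to see the two spellings involved - a throwaway Python sketch that writes two files, with names that differ only in Unicode normalization, to a temp directory:

  import os, tempfile

  d = tempfile.mkdtemp()
  for name in ("\u00e9.txt", "e\u0301.txt"):   # precomposed é vs. e + combining acute accent
      open(os.path.join(d, name), "w").close()

  print(sorted(os.listdir(d)))
  # On a byte-oriented filesystem like ext4 this prints two visually identical
  # names, i.e. two distinct files; a filesystem that normalizes names would
  # collapse them into one.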

JohnFen
6 replies
1d2h

Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

Not sure why what Windows does is relevant to this, honestly. Personally, I strongly prefer case sensitivity with filenames, but the lack of it isn't a dealbreaker or anything.

CamperBob2
5 replies
1d2h

What are some of the advantages of case sensitivity? Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files? That just sounds like asking for trouble.

JohnFen
4 replies
1d2h

The advantages that I value are that case sensitivity means I can use shorter filenames, it makes it easier to generate programmatic filenames, and I can use case to help in organizing my files.

Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files?

That's a situation where sensitivity can cause difficulty, yes, but for me personally, that's a minor confusion that is easy to avoid or correct. Everything is a tradeoff, and for me, putting up with that annoyance is well worth the benefits of case sensitivity.

I do totally understand that others will have different tradeoffs that fit them better. I'm not taking away from that at all. But saying "case sensitivity is undesirable" in a broad sense is no more accurate than saying "case sensitivity is desirable" in a broad sense.

Personally, I think the ideal tradeoff is for the filesystem to be case sensitive, but have the user interfaces to that file system be able to make everything behave as case-insensitive if that's what the user prefers.

Dylan16807
3 replies
17h5m

Even with only one case, just four characters is enough for a million files. How much benefit are you really getting from case sensitivity?

marssaxman
1 replies
14h54m

Unicode case folding is a complicated algorithm, and its definition is subject to change with updated Unicode versions. It's nice not to have to worry about that.

Dylan16807
0 replies
13h39m

Okay, but I don't think this has anything to do with the use case JohnFen mentioned or my questions about it.

If your goal is super easy filename generation then you're probably not going to leave ASCII.

And if you do go beyond ASCII for filename packing/generating, then you should instead use many thousands of CJK characters that don't have any concept of case at all. Bypass the question of case sensitivity entirely.

JohnFen
0 replies
5h52m

Enough that I prefer it. If that were the only advantage, I'd only slightly prefer it. But being able to use case as a differentiator in filenames intended for me to read is something I find even more valuable.

A filesystem not being case sensitive isn't a dealbreaker or anything. I just prefer case sensitivity because it increases flexibility and readability for me, and has no downsides that I consider significant.

blahgeek
4 replies
1d12h

No offense, but I think that's a very Western-centric view. Your example only makes sense when the user is familiar with English (or other Western languages, I guess). To me personally, I find it strange that "D.txt" and "d.txt" mean the same file, since they are two very different characters. Likewise, I think you would also go crazy if I told you "ア.txt" and "あ.txt" mean the same file (those are katakana and hiragana for A, respectively, which in a sense is equivalent to uppercase and lowercase in Japanese), or that "一.txt" and "壹.txt" mean the same file (both mean the number 1 in Chinese; we literally call the latter an "uppercase number").

ClumsyPilot
1 replies
1d10h

Those are all the same, I don’t see an issue

n_plus_1_acc
0 replies
1d7h

What if Unicode updates some capitalization rules in the next version, and after an OS update some filenames now collide and one of them is inaccessible?

taeric
0 replies
1d3h

Agreed, and you could even take this into "1.txt" being the same as "One.txt". Which, I mean, fair that I would expect a speech system to find either if I speak "One dot t x t". But, it would also find "Won.txt" and trying to bridge the phonetic to the symbolic is going to be obviously fraught with trouble.

JohnFen
0 replies
1d2h

To me personally, I find it strange that "D.txt" and "d.txt" mean the same file, since they are two very different characters.

As a native English speaker, I agree with this.

koolba
3 replies
1d7h

Casing is usually not meaningful even in written language. "Hi, how are you?"

How about: “pay bill” vs “pay Bill”?

“Usually” in the context of automated systems design is a recipe for disaster.

Computers store bytes, not characters that may just happen to mean similar things. Shall we merge ümlauts? How to handle ß?

lathiat
1 replies
1d5h

Case preserving and case sensitive are two subtly different things. Most case-insensitive file systems are case preserving, and there's a UTF-8 equivalent of that too whose name I forget.

nostrebored
0 replies
1d3h

But the GP's point is that assuming you know the semantic meaning of the case, and that retention is enough, is silly.

Assuming case insensitivity is bizarre.

ooterness
0 replies
1d3h

Perfect is the enemy of good. It is quite acceptable to streamline the easy cases now and handle the hard ones later, or never.

sergeykish
2 replies
1d8h

If someone says they sent you "Book Draft 1.docx" and you check your email to find "Ⓑⓞⓞⓚ Ⓓⓡⓐⓕⓣ ①.ⓓⓞⓒⓧ", "฿ØØ₭ ĐⱤ₳₣₮ 1.ĐØ₵Ӿ" - these are different files.

throwaway211
1 replies
1d4h

I have a feeling you enjoyed that character set lookup. I know I did seeing it.

Izkata
0 replies
1d2h

Ages ago on Flowdock at work (a chat webapp kind of like Slack that no longer exists), I used the circle ones for a short time as my nickname, and no one could @ me.

zarzavat
1 replies
1d9h

File systems are not user interfaces. They are interfaces between programs and storage. Case insensitive is much better for programs.

The user shell can choose however it wants to handle file names, a case sensitive file system does not prevent the shell from handling file names case insensitively.

zarzavat
0 replies
1d6h

case insensitive is much better for programs

Can’t edit my comment. I mean case sensitive is better for programs, of course.

up2isomorphism
0 replies
1d12h

Then why don’t you just always write in lower case?

rzwitserloot
0 replies
1d4h

Also note that 'are these 2 words case insensitively equal' is impossible to answer without knowing what locale rules to apply. And given that people's personal names tend to have the property that any locale rules that must be applied are _the locale that their name originates from_, and that no repository of names I am aware of stores a locale along with the name, that means what you want is impossible.

In line with case insensitivity, do you think `müller` and `muller` should boil down to for example the same username for login purposes?

That's... tricky. In german, the standard way to transliterate names to strict ASCII would be to turn `müller` into `mueller`. In swiss german that is in fact mandatory. Nobody in Switzerland is named `müller` but you'll find loads of `mueller`s. Except... there _are_ `müller`s in Switzerland - probably german citizens living there.

So, just normalize `ü` to `ue`, easy, right? Except that one doesn't reverse all that well, but that's probably all right. But - no. In other locales, the asciification of `ü` is not `ue`. For example, `Sjögren` is swedish and that transliterates to `sjogren`, not `sjoegren`.

Bringing it back to casing: Given the string `IJSSELMEER`, if I want to title case that, the correct output is presumably `IJsselmeer`. Yes, that's an intentional capital I capital J. Because it's a dutch word and that's how it goes. In an optimal world, there is a separate unicode glyph for the dutch IJ as a single letter so we can stick with the simple rule of 'to title case a string, upper case the first glyph and lowercase all others, until you see a space glyph, in which case, uppercase the next'. But the dutch were using computers fairly early on and went with using the I and the J (plain ascii) for this stuff.

And then we get into well trodden ground: In turkish, there is both a dotted and a dotless i. For... reasons they use plain jane ascii `i` for lowercase dotted i and plain jane ascii `I` for uppercase dotless I. But they have fancy non-ascii unicode glyphs for 'dotted capital I' and 'dotless lowercase i'.

So, __in turkish__, `IZMIR` is not case-insensitive equal to `izmir`. Instead, `İZMİR` and `izmir` are equal.

I don't know how to solve this without either bringing in hard AI (as in, a system that recognizes 'müller' as a common german surname and treats it as equal to 'mueller', but it would not treat `xyzmü` equal to `xyzmue` - and treats IZMIR as not equal to izmir, because it recognizes it as the name of a major turkish city and thus applies turkish locale rules), or decreeing to the internet: "get lost with your fancypants non-US/UKian weird word stuff. Fix your language or something" - which, well, most cultures aren't going to like.

'files are case sensitive' sidesteps alllllll of this.
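
For what it's worth, locale-unaware Unicode casing (Python's built-in default rules, used here purely as an illustration) reproduces exactly these problems:

  print("ijsselmeer".title())  # 'Ijsselmeer' -- Dutch orthography wants 'IJsselmeer'
  print("IZMIR".lower())       # 'izmir'      -- under Turkish rules each I should become dotless 'ı'
  print("İZMİR".lower())       # the default mapping of 'İ' leaves a stray combining dot, not a plain 'i'
  print("Straße".upper())      # 'STRASSE'    -- ß has no uppercase form in the default mapping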

re-thc
0 replies
1d12h

you don't say, "Hey! I think you sent me the wrong file!"

You do! Why not?

It's a big trap. A lot of counterfeit, spam, phishing etc go by this method. You end up buying a fake brand or getting tricked.

raincole
0 replies
1d10h

Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.

The number of spaces is usually not meaningful in written language. "Hi, how are you?" means the same thing as "Hi, how are you ?". I don't think that's a good reason to make the file system ignore space characters.

pkulak
0 replies
1d12h

Yeah, but that little bit of user friendliness ruins the file system for file system things. Now you need “registries” and other, secondary file systems to do file system things because you can’t even use base64 in file names. Make your file browsing app case insensitive, if that’s what you want. Don’t build inferiority down to the core.

paulddraper
0 replies
1d5h

Why?

Because it introduces extra complexity.

Now, "Cache" and "cache" are the same, but also...different because you'd care if Cache suddenly became cache.

fortran77
0 replies
1d12h

Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

You can enable case sensitivity for directories or disks, but this is usually done for special cases, like git repos

pavlov
14 replies
1d13h

Case insensitive is how humans think about names. “John” and “New York” are the same identifiers as “john” and “new york”. It would be pretty weird if someone insisted that their passport is invalid because the name is printed in all caps and that’s not their preferred spelling.

IMO the best thing would be to call Unix-style case-sensitive file names something else. But it’s obviously too late for that.

bigstrat2003
5 replies
1d11h

Agreed. I think case sensitivity in Unix filesystems is actually a pretty poor design decision. It prioritizes what is convenient for the computer (easy to compare file paths) over what makes sense for the user (treating file paths the same way human intuition does).

sham1
0 replies
1d4h

But the thing is that the file system doesn't need to be case-insensitive for your system to support human intuition! As others have said, people don't look at and use filesystems, they use programs that interface with the filesystem. You can absolutely have a case-sensitive system that nonetheless lets you search files in a case-insensitive manner, for example. After all, to make searches efficient, you might want to index your file structure, and while doing that, you might as well also have a normalised file name within the index you search against.

Now, as you said, UNIX made the choice that's easier for computers. And for computers, case-insensitive filesystems would be worse. There are things that are definitely strange about UNIX filesystems (who doesn't love linefeeds in file names!?), but case-sensitivity is not one of them.
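
A case-insensitive lookup on top of a case-sensitive filesystem can be as small as this Python sketch (no index, just a walk; the "/etc" and "xf86config" arguments are only examples):

  import os

  def find_insensitive(root, wanted):
      # Yield paths under `root` whose file name equals `wanted`, ignoring case.
      # The filesystem itself stays case-sensitive; only the search folds case.
      target = wanted.casefold()
      for dirpath, _dirs, files in os.walk(root):
          for name in files:
              if name.casefold() == target:
                  yield os.path.join(dirpath, name)

  # Finds "XF86Config" whether you ask for "xf86config" or "XF86CONFIG":
  for path in find_insensitive("/etc", "xf86config"):
      print(path)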

kaashif
0 replies
1d5h

I don't know if that's right. The most obvious way two characters can be the same is if they actually look exactly the same i.e. are homoglyphs https://en.wikipedia.org/wiki/Homoglyph

But no filesystem I am aware of is actually homoglyph insensitive.

Case insensitive filesystems picked one arbitrary form of intuition (and not even the most obvious one) in one language (English) and baked that into the OS at a somewhat deep level.

You say "human intuition" - are those using different writing systems nonhuman then?

howerj
0 replies
1d5h

Except that is not true: it is sometimes convenient, and sometimes very inconvenient and not wanted. My reasoning for file systems being case sensitive is the following:

1. Some people want file systems to be case sensitive.

2. Case sensitivity is easier to implement. This is very much not a trivial thing; case insensitivity only really makes sense for ASCII.

In the camp of wanting case insensitivity:

1. Some people want file systems to be case insensitive.

There is more in favor of case sensitivity.

JackSlateur
0 replies
1d7h

But end users do not speak to filesystems.

Programs speak to filesystems.

7bit
0 replies
1d2h

In Germany there is a lowercase letter ß. It is actually a ligature of the letters s and z. It does not have an uppercase variant, because there is no word that begins with it. One example is the word Straße. If you write that all in uppercase, it technically becomes STRASZE, although you almost always see STRASSE. But if you then write it all in lowercase without substituting the SS with ß, you are making a mistake. And although Switzerland is a German-speaking country, they have different spelling and rarely, if ever, use ß.

This is just one of many cases where case-insensitivity would give more trouble than it's worth. And others have pointed out similar cases with the Turkish language in this post.

lalaithion
3 replies
1d11h

The word “Turkey” is not the same as “turkey”, “August” is not the same as “august”, and “Muse” is not the same as “muse”. https://en.m.wikipedia.org/wiki/Capitonym

scosman
0 replies
1d6h

They might be at the beginning of a sentence (depends on the reason for capitalization).

It’s more like identifier reuse, on a case insensitive “system”.

“John” isn’t the same as “John” if I’m talking about two separate Johns.

pavlov
0 replies
1d10h

Yet "TURKEY" is not a separate word from "Turkey" and "turkey". Ultimately context disambiguates these words, not capitalization.

justincormack
0 replies
1d9h

And “polish” and “Polish” are not even pronounced the same.

Hamuko
3 replies
1d11h

Humans will also treat "Jyväskylä", "Jyvaskyla" and "Jyvaeskylae" as the same identifiers but I don't think that's a good basis for file storage to have those be the same filenames.

pavlov
1 replies
1d10h

In the era of Unicode, this battle is pretty much lost. Several different code point sequences can produce the glyph 'ä', and user input can contain any of these. You need to normalize anyway.

glandium
0 replies
1d7h

And macOS does that normalization at the filesystem level.

justincormack
0 replies
1d9h

Passport offices care and may object.

sholladay
6 replies
1d13h

macOS is case preserving, though. To me, it’s the best of both worlds. You can stylize your file names and they will be respected, but you don’t have to remember how they are stylized when you are searching for or processing them because search is case insensitive.

self_awareness
2 replies
1d12h

Maybe macOS is case-preserving, but it's not encoding-preserving. If you create a file using a composed UTF-8 "A", the filesystem layer will decompose the string to another form and create the filename using that decomposed form "B". Of course, "A" and "B" when compared can be completely different (even when compared with case insensitivity enabled), yet they will point to the same file.

More info here: https://eclecticlight.co/2021/05/08/explainer-unicode-normal...
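
The practical upshot, sketched in Python: the name you wrote and the name the filesystem hands back may be different code point sequences, so compare them only after normalizing both sides.

  import unicodedata

  requested = "caf\u00e9.txt"                        # composed "é" (NFC), as typed
  stored = unicodedata.normalize("NFD", requested)   # the decomposed form an HFS+-style driver keeps

  print(requested == stored)       # False: the strings differ byte for byte
  print(stored.encode("utf-8"))    # b'cafe\xcc\x81.txt'

  # A robust comparison normalizes both sides to one form first:
  print(unicodedata.normalize("NFC", requested) ==
        unicodedata.normalize("NFC", stored))        # True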

halostatue
1 replies
1d1h

macOS (Darwin) has always written filenames as NFD via the macOS APIs. The underlying POSIX-ish APIs may not do NFD, but Finder and every native macOS GUI program gets files in NFD format.

self_awareness
0 replies
21h32m

And this is different from what I wrote how exactly?

Btw, this has nothing to do with POSIX vs Finder, it's a filesystem driver trait, at least for HFS+, but probably for APFS as well.

alt227
1 replies
1d6h

IMO this is the worst possible solution, as what you are seeing is not what you are getting. You do not actually know what is being stored on the file system, and your searches are fuzzy rather than precise.

filleduchaos
0 replies
1d2h

You do not actually know what is being stored on the file system

This makes no sense to me. Did the user's file explorer (whether GUI or via commands like `ls`) suddenly disappear?

smt88
0 replies
1d13h

Windows is also case-insensitive but case-preserving

josephcsible
4 replies
1d13h

macOS has case sensitivity. It's just off by default and is a major pain to turn on. You have to either reinstall from scratch onto a case-sensitive partition, or change the "com.apple.backupd.VolumeIsCaseSensitive" xattr from 0 to 1 in a Time Machine backup of your whole system and then restore everything from it.

JonathonW
2 replies
1d12h

You shouldn't do this if you value things working, though-- this is a pretty rare configuration (you have to go way out of your way to get it), so many developers won't test with it and it's not unheard of for applications to break on case-sensitive filesystems.

If you absolutely need case-sensitivity for a specific application or a specific project, it's worth seeing if you can do what you need to do within a case-sensitive disk image. It may not work for every use-case where you might need a case-sensitive FS, but if it does work for you, it avoids the need to reinstall to make the switch to a case-sensitive FS, and should keep most applications from misbehaving because the root FS is case-sensitive.

josephcsible
0 replies
1d12h

Most things work fine, but it will break (or at least did break at one point) Steam, Unreal Engine, Microsoft OneDrive, and Adobe Creative Cloud. I'm rather surprised about the first two, since they both support Linux with case-sensitive filesystems. I took the opposite approach as you, though: making my root filesystem case-sensitive and creating a case-insensitive disk image if I ever needed those broken programs.

danielheath
0 replies
1d12h

I keep a case sensitive volume around to check out code repositories into. For everything else I prefer it insensitive, but my code is being deployed to a case sensitive fs.

jimvdv
0 replies
1d7h

I just mount a case sensitive Apple File System disk image at ~/code, works well

ivanhoe
4 replies
1d7h

That's how it should be

Why exactly? I'm not aware of any benefits of filenames being case-sensitive; it just opens the door to tons of very common mistakes that literally can't happen otherwise. It's not like in coding, where it helps enforce code style and thus aids readability - and even in programming, typos in var names were a PITA to debug before IDEs became smart enough to catch them. One thing I loved about Pascal the most is that it didn't care about case, unlike C.

capitol_
2 replies
1d5h

The case-sensitivity algorithm needs a locale as input in order to correctly calculate the case conversion rules.

The most common example is probably that i (U+0069 LATIN SMALL LETTER I) and I (U+0049 LATIN CAPITAL LETTER I) transform into each other in most locales, but not all. In locales az and tr (the Turkic languages), i uppercases to İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and I lowercases to ı (U+0131 LATIN SMALL LETTER DOTLESS I).

Case-insensitivity is all fine if you only handle text that consists of A-Za-z, but as soon as you want to write software that works for all languages it becomes a mess.
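
To make that concrete, here is what a locale-unaware comparison (Python's default Unicode case folding, used only as an illustration) does with exactly those characters:

  def same_name(a, b):
      # Unicode *default* case folding -- there is no locale input available.
      return a.casefold() == b.casefold()

  print(same_name("I.txt", "i.txt"))  # True  -- right for English, wrong for Turkish (I pairs with ı)
  print(same_name("İ.txt", "i.txt"))  # False -- wrong for Turkish, where İ pairs with i
  print(same_name("I.txt", "ı.txt"))  # False -- wrong for Turkish
  # Getting the Turkish behaviour needs locale-aware folding (e.g. via ICU),
  # i.e. an extra input a filesystem simply doesn't have.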

yencabulator
0 replies
20h36m

Minor nitpick: case-insensitive comparison is a separate problem from case conversion, and IIRC a little simpler. Still locale-specific.

jerf
0 replies
1d4h

This is the main point, and almost all the other chatter is not particularly relevant. A dumb computer and a human can agree with "files are case sensitive and sometimes that's a bit weird but computers are weird sometimes". If there was indeed exactly one universal way to have case insensitivity it would be OK. Case insensitive file systems date from when there was. Everything was English and case folding in English is easy. Problem solved. But that doesn't work today. And having multiple case folding rules is essentially as unsolvable a problem as the problems that arise from case sensitivity, except they're harder for humans to understand, including programmers.

Simple and wrong is better than complicated and wrong and also the wrong is shoved under the carpet until it isn't.

Though you still ought to declare a Unicode normalization on the file system. Which would be perfectly fine if it weren't for backwards compatibility.

codeflo
0 replies
1d6h

Except at the UI layer (where you can easily offer suggestions and do fuzzy search), the opposite is true. There are so many different ways to do case-insensitive string comparisons, and it's so easy to forget to do that in one place, that case-insensitivity just leads to a ton of bugs (some of which will be security critical).

For example, did you know that Microsoft SQL Server treats the columns IS_ADMIN and is_admin as either the same or two different columns depending on the database locale (because e.g. Turkish distinguishes between i and I)? That's at least a potential security bug right there.

pjmlp
3 replies
1d12h

UNIX is one of the few OSes that went down that path.

Others do offer the option if one is so inclined, and also prepared to deal with legacy software that expects otherwise.

Which is also the case with macOS, because although it is a UNIX, OS X had to cater to the Mac OS developer community used to HFS and HFS+.

codeflo
2 replies
1d6h

Curiously, iOS and iPadOS file systems are case-sensitive. There's less legacy there, so they opted to do the correct thing.

larkost
1 replies
1d1h

Please don't call this "the correct thing". Please recognize that there are multiple, valid, points of view. What you meant is "the thing I like".

codeflo
0 replies
20h55m

It's not "the thing I like", it's the better tradeoff. It's less complex and thus more secure (due to reduced API surface and fewer opportunities to make lookup mistakes or to mistakenly choose the wrong out of dozens of kinds of case-insensitive comparison in a security decision). It's also potentially faster, and more compatible with other Unixes.

segfaltnh
0 replies
1d3h

Case sensitive filesystems are a mistake.

imadj
0 replies
1d1h

I agree 100%.

From a technical implementation POV, 'A' and 'a' are well established as different characters (ASCII, Unicode, etc.). Regardless of personal preference, I don't understand how a developer/sysadmin can be surprised and even frustrated that a file system is case sensitive.

The developer is still free to abstract this away for the end user when it makes sense, such as in search results.

dagrz
0 replies
1d11h

Author here. There's no complaint. It's an observation rather than an absolute good or bad. It's something you have to consider in designing your application.

andruby
0 replies
1d3h

You can format disks in MacOS to be case sensitive.

pachico
9 replies
1d12h

I have the feeling that the entire case (in)-sensitive discussions are usually too much English-centric.

Allow me to iterate: I have the feeling way too many language discussions, especially in IT, are too much English-centric.

teitoklien
5 replies
1d11h

too much English-centric.

Pretty glad about it, considering how much simpler ASCII was to work with compared to Unicode.

I say this as a non-native English speaker: programming has so many concepts and so much stuff already, it's best not to make it more complex by adding 101 different languages to account for.

Unicode and time zones are the two things that try to bring more languages and cultures into account while programming, and look what happens: they create the most pain for everyone, including non-native English programmers.

I don't want to write computer programs in my non-English native tongue if that means I'll have to start accounting for every major language while I'm programming.

It's fine that IT discussions are so English-centric. Diversity is more complexity, and no one owns the English language; it's just a tool used by people to communicate, and thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.

They all own the English language too, the moment they decided to speak it.

No need to bring diversity politics in IT.

Best to keep it technical.

regentbowerbird
1 replies
1d8h

No need to bring diversity politics in IT.

Politics is just "how people think things should be". Therefore politics are everywhere not because people _bring_ them everywhere but because they arise from everything.

Your comment is in fact full of politics, down to your opinion that politics shouldn't be included in this discussion.

**

thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language

Personally my impression is that native speakers just run circles around everyone else during meetings and such. Being _truly_ comfortable in the language, mastering social cues, being able to confidently and fluently express complex ideas, mean that they effectively take over the room. In turn that means they will hold more power in the company, rise in rank more quickly, get paid more, etc.. There's an actual, significant consequence here.

Plus, anglos usually can't really speak another language, so since they don't realize how hard it is, they tend to think their coworkers are idiots and will stick to doing things with other anglos rather than include everyone.

Diversity is more complexity

In a vacuum I agree, but within the context of your comment this is kinda saying "your existence makes my life too complex, please stop being different and join the fold"; and I can't agree with that sentiment.

BirAdam
0 replies
1d6h

You raise an interesting point about the nature of politics. I've been thinking about this a bit, and it seems to me that radical/revolutionary politics are about how people want things to be, while quotidian political ideas are more about how people ought to do a few things. The distinction here being people's timelines and depth of thought. If a policy has some seriously bad consequences, people may not notice because they weren't really thinking of how things should be, just the narrower thought of how a thing ought to be done (think minimum wage driving automation rather than getting people a better standard of living, or immigration control driving police militarization). Of course, for most politicians, I am not sure either of these are correct. I think for politicians, politics is just the study of their own path to power; they likely don't care much whether it's how things are done or how things ought to be, so long as they are the ones with the power.

I don't know that this comment really adds anything to the conversation, but I do find it all interesting.

Edit: also, on topic, languages are fun. The world is boring when everything is in one language. Languages also hold information in how they structure things, how speakers of that language view the world, and so on, and in those ways they are important contributors to diversity of thought.

squaresmile
0 replies
1d7h

thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.

My lazy ass wishes that English were enough to access those communities too. There are many cool and interesting developers, projects and communities only or mostly working in their native languages. One of the major motivations for me to learn Chinese now is to access those communities.

orphea
0 replies
1d9h

  > considering how much simpler ASCII was to work with compared to Unicode.
And elementary algebra is simpler than differential calculus.

ASCII being simpler just means it is not adequate to represent the innate complexity that human languages have. Unicode is not complex because of "diversity politics", whatever that means. It is complex because languages are complex.

It's the same story with time zones: they are as complex as time is.

mirchibajji
0 replies
1d5h

I'm not sure why you characterise this as political.

I wish "case" were a modifier, like italic or bold. It would have been easier to _not_ have separate ASCII codes for upper- and lower-case letters in the first place. What are your thoughts on MS Word using different characters for opening and closing quotes?

reddalo
2 replies
1d11h

Speaking of non-English cultures, do Japanese case insensitive systems differentiate between hiragana and katakana?

Because, in some ways, the two syllabaries remind me of uppercase and lowercase alphabets.

thaumasiotes
0 replies
1d9h

They're more like distinguishing between "o" and “ℴ”.

Which is where the European idea of "capital letters" originates, but not how we think about them today.

presentation
0 replies
1d3h

A lot of the time forms will specifically ask for hiragana or katakana, or specify full width or half width characters.

But basically it’s a mess there too

dale_glass
9 replies
1d11h

The case sensitivity one is easy, here's a thing that's more likely to be entirely unintuitive:

S3 paths are fake. Yes, it accepts uploads to "/builds/1/installer.exe", and yes, you can list what's in /builds, but all of that is a simulation. What you actually did was to upload a file literally named '/builds/1/installer.exe' with the '/' as part of the name.

So, "/builds/1//installer.exe" and "/builds//1/installer.exe" are also possible to upload and entirely different files. Because it's the name of a key, there's no actual directories.

cnity
3 replies
1d9h

Aside from resolution of paths to some canonical version (e.g. collapsing redundant /'s in your example), what actually is an "actual directory" other than a prefix?

ncruces
1 replies
1d9h

An actual directory knows its direct children.

An actual directory A with 2 child directories A/B and A/C, each of which with 1 million files, doesn't need to go through millions of entries in an index to figure out the names of B and C.

It's also not possible for A and A/ to exist independently, and for A to be a PDF, A/ an MP4 and A/B a JPG under the A/ ”directory". All of which are possible, simultaneously, with S3.

marcosdumay
0 replies
1d3h

Just to add, "directory" is something you use to look-up information.

The thing is called "directory" exactly because you go there to look what its children are.

mock-possum
0 replies
23h34m

An actual directory is a node in a tree. It may be the child of a parent and it may have siblings - its children may in turn be parents with children of their own.

temac
0 replies
1d1h

And which also, according to the command docs I read recently, add random limitations to maybe half of the S3 API.

I'm not sure reusing (parts of) the protocol was a good idea, given that an unrelated half of it was already legacy and discouraged (e.g. the insane permission model with policies and ACLs) or, even when not, let's say... weird and messy.

hnlmorg
1 replies
1d11h

Yeah, the prefix thing is a source of so many bugs. I get why AWS took that approach and it’s actually a really smart approach but it still catches so many developers out.

Just this year our production system was hit by a weird bug that took 5 people to find. Turned out the issue was an object literally just named “/“ and the software was trying to treat it as a path rather than a file.

BirAdam
0 replies
1d6h

Happened at my job yesterday.

isoprophlex
7 replies
1d13h

That bit on failed multipart uploads invisibly sticking around (and incurring storage cost, unless you explicitly specify some lifecycle magic)... just ugh.

And I thought one of the S-es was for 'simple'.
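
For reference, the "lifecycle magic" in question is roughly the following rule - a boto3 sketch with a placeholder bucket name:

  import boto3

  s3 = boto3.client("s3")

  # Abort (and stop paying for) multipart uploads that haven't completed
  # within 7 days of being started.
  s3.put_bucket_lifecycle_configuration(
      Bucket="example-bucket",  # placeholder
      LifecycleConfiguration={
          "Rules": [{
              "ID": "abort-stale-multipart-uploads",
              "Status": "Enabled",
              "Filter": {"Prefix": ""},  # whole bucket
              "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
          }]
      },
  )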

e____g
2 replies
1d12h

Yes, that sucks. Blame ahenry@, then GM for S3.

My proposal was that parts of incomplete uploads would stick around for only 24 hours after the most recent activity on the upload, and you wouldn't be charged for storage during that time. ahenry@ vetoed that.

vbezhenar
1 replies
1d10h

Why would you propose something that makes the company earn less money? I'm sure that at Amazon's scale, this misfeature has earned millions of dollars.

spacebanana7
0 replies
1d9h

Customer relationships. I recall a Bezos quote along the lines of "It's better to lose a refund than to lose a customer".

teitoklien
0 replies
1d12h

That S for simple stands for simple ways to skyrocket your expenses

scosman
0 replies
1d6h

“Simple” was coined when the alternative was managing a fleet of servers with disks. Time changes everything.

notatoad
0 replies
1d3h

this one has cost us many thousands of dollars.

we had a cron script on a very old server running for almost a decade, starting a multipart upload every night, pushing what was supposed to be backups to a bucket that also stored user-uploaded content, so it was normal for the bucket to grow in size by a bit every day. the script was 'not working' so we never relied on the backup data it was supposed to be pushing, never saw the files in s3, and the bucket grew at a steady and not unreasonable pace. and then this spring i discovered that we were storing almost 3TB of incomplete multipart uploads.

and yes, i know that anecdote is just chock full of bad practices.

Hamuko
0 replies
1d11h

Yeah, I've definitely trodden on the storage cost landmine. Thankfully it was just some cents in my case, but it's really infuriating how badly the console exposes the information.
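
If you want to check a bucket for this without relying on the console, a short boto3 sketch (placeholder bucket name) lists whatever incomplete uploads are quietly sitting there:

  import boto3

  s3 = boto3.client("s3")

  # Every multipart upload that was started but never completed or aborted.
  paginator = s3.get_paginator("list_multipart_uploads")
  for page in paginator.paginate(Bucket="example-bucket"):  # placeholder
      for upload in page.get("Uploads", []):
          print(upload["Initiated"], upload["Key"], upload["UploadId"])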

willsoon
6 replies
1d13h

My wife - she's a future developer, I'm a lawyer - has had enough of me explaining some of these details in the form of concerns. I am somewhat satisfied to know that my perplexities are not imaginary.

trallnag
5 replies
1d6h

What's a future developer?

cynicalsecurity
4 replies
1d5h

A developer that is learning programming or a specific technology? Isn't it self-explanatory or what problem did you have understanding it?

krisoft
2 replies
1d4h

Isn't it self-explanatory

No.

what problem did you have understanding it?

It is simply not a phrase I commonly read. When I google it I find a visa consultancy under the name, and not much else. Curiously your comment is also among the results.

The problem is that the phrase "<something> developer" is used to describe what the person develops. A "real-estate developer" invests in real estate as a business. A "web developer" develops web applications, a "game developer" develops games, and so on and so on. So reading the word I immediately thought they mean someone who is developing the future? Like idk Douglas Engelbart or Tim Berners-Lee or someone like that.

If you want to write that someone is learning to become a developer I would recommend the much less confusing "developer in training" phrase, or even better if you just write "they are learning to become a developer".

willsoon
1 replies
1d1h

Bad English case. It just is. I pay a lot of attention when I write English, but you can always tell when someone isn't a native English speaker in about two words. Je suis désolé.

krisoft
0 replies
1d1h

No worries! Glad that trallnag asked so we could clear it up.

Wishing your wife the best of luck with her career! (and to you too!)

willsoon
0 replies
1d1h

Wants to be. She reads a lot of code and tries to understand it and write her own. She will probably never be one of you, like an expert. But she's interested, and she works laterally in the field - she's done a lot of hours on the intranet at her job. So I think she can be a developer in the future, a good developer. Now she has a degree in the field and... Wait. But then... she is already a developer. I don't know, man. It's a philosophical question.

scosman
6 replies
1d6h

Here is a good one: deleting billions of objects can be expensive if you call delete APIs.

However you can set a wildcard or bucket wide object expiry of time=now for free. You’ll immediately stop being charged for storage, and AWS will manage making sure everything is deleted.
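
Lifecycle expiration takes a day count (minimum 1) or an absolute date rather than a literal "now", so the rule in question looks roughly like this boto3 sketch (placeholder bucket name):

  import boto3

  s3 = boto3.client("s3")

  # Expire every object; S3's daily lifecycle run then deletes them in the
  # background, with no per-object DeleteObject calls, and storage charges
  # stop once objects are past their expiration date.
  s3.put_bucket_lifecycle_configuration(
      Bucket="bucket-to-empty",  # placeholder
      LifecycleConfiguration={
          "Rules": [{
              "ID": "empty-the-bucket",
              "Status": "Enabled",
              "Filter": {"Prefix": ""},  # match everything
              "Expiration": {"Days": 1},
              "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
              # Versioned buckets would also need a NoncurrentVersionExpiration action.
          }]
      },
  )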

bushbaba
2 replies
1d2h

Because AWS gets to choose when the actual deletes happen. The object's metadata is marked as deleted, but AWS can process the actual delete off-peak. It also avoids the S3 API servers being hammered with QPS.

yencabulator
1 replies
20h21m

An explicit delete operation can mark metadata the same way.

The real difference is likely more along the lines of LSM compaction: with expiry, they likely use a moral equivalent of https://github.com/facebook/rocksdb/wiki/Compaction-Filter to actually do the deletion.

est31
1 replies
21h56m

You’ll immediately stop being charged for storage

The effect of lifecycle rules is not immediate: they get applied in a once-per-day batch job.

electroly
0 replies
18h54m

That's true but OP clearly knows that already. You stop getting charged for storage as soon as the object is marked for expiration, not when the object is finally removed at AWS's leisure. You can check the expiration status of objects in the metadata panel of the S3 web console.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecy...

There may be a delay between the expiration date and the date at which Amazon S3 removes an object. You are not charged for expiration or the storage time associated with an object that has expired.

ericpauley
0 replies
1d6h

Nit: the delete call is free, it’s the list call to get the objects that costs money. In theory if you know what objects you have from another source it’s free.

nh2
6 replies
1d12h

How about the fact that S3 is not suitable for web serving due to high latencies (in standard storage classes)?

Many people think you can just host the resources for your websites, such as images or fonts, straight on S3. But that can make for a shitty experience:

applications can achieve consistent small object latencies (and first-byte-out latencies for larger objects) of roughly 100–200 milliseconds.

From: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...

weird-eye-issue
0 replies
1d5h

This has been well known for over a decade

teitoklien
0 replies
1d12h

Most folks use S3 as an origin for AWS CloudFront when serving content.

You can even use CloudFront signed cookies to give specific users CDN access to only the specific content they own on S3. How cool is that.

metadat
0 replies
1d12h

Put a memcached instance in front and you're good.

mcqueenjordan
0 replies
1d9h

s3 is not optimized to directly serve websites, but to durably store and retrieve ~unlimited data.

lulznews
0 replies
1d12h

Pretty much everyone knows this …

WatchDog
0 replies
1d12h

Typically you would use cloudfront with S3 if you want to use it for serving web assets.

It will cache frequently accessed assets, and in addition to reducing latency may reduce cost quite a bit.

reddalo
5 replies
1d12h

I can't trust myself using S3 (or any other AWS service). Nothing is straightforward, there are too many things going on, there is too much documentation that I should read, and even then (as OP shows) I may accidentally and unknowingly expose everything to the world.

I think I'll stick to actually simple services, such as Hetzner Storage Boxes or DigitalOcean Spaces.

slig
2 replies
1d5h

The way that DO handles secrets should scare anyone. Did you know that if you use their Container Registry and set it up so that your K8s cluster automatically has access to it, their service will create a secret that has full access to your Spaces?

marcosdumay
1 replies
1d3h

Hum... Kubernetes is not on the GP's list...

slig
0 replies
23h20m

Fair enough, but not having scoped secrets is a red flag.

maximinus_thrax
0 replies
1d2h

Nothing is straightforward, there are too many things going on, too much documentation that I should read, and even then (as OP shows) I may be accidentally and unknowingly expose everything to the world.

I took a break from cloud development for a couple of years (working mostly on client stuff) and just recently got back. I am shocked at the amount of complexity built up over the years, along with the cognitive load required for someone to build an ironclad solution in the public cloud. So many features and quirks that were originally designed to help some fringe scenario are now part of the regular surface area, because the business wants to make sure nobody is turned away.

chamomeal
0 replies
1d11h

I like digital ocean spaces, but it has its own annoying quirks.

Like I recently found out that if you pipe a video file larger than a few MB, it'll drop the https:// from the returned Location. So on every file upload I have to check whether the location starts with https, and add it if it's not there.

Of course the S3 node client GitHub issue says “sounds like a digital ocean bug”, and the digital ocean forums say “sounds like an S3 node client bug” lol

gymbeaux
4 replies
1d4h

No mention of how AWS/S3 approximates the size of a file to save CPU cycles. It used to drive me up a wall seeing S3 show a file size as slightly different than what it was.

If I recall correctly, S3 uses 1000 rather than 1024 to convert bytes to KB and MB and GB. This saves CPU cycles but results in “rounding errors” with the file’s reported size.

It’s discussed here, although they’re talking about it being “CLI vs console” which may be true?

https://stackoverflow.com/questions/57201659/s3-bucket-size-...

abadpoli
1 replies
1d4h

This almost certainly has nothing to do with "saving CPU cycles." Most likely, whoever created the CloudWatch console just used the same rounding that is used for all other metrics in CW rather than the binary calculation conventional for storage sizes, and it was a small enough issue that it wasn't caught until it was too late to change, because changing it would disrupt the customers who had gotten used to it.

presentation
0 replies
1d3h

If anything, a bit shift would make division by 1024 faster.

Self-Perfection
0 replies
1d3h

But KB is indeed 1000 bytes, and MB is indeed 1000 KB.

In case of 2^10 units the correct names are kibibyte (KiB) and mebibyte (MiB). Check https://en.wikipedia.org/wiki/Mebibyte#Multiple-byte_units

Yes, we have a long-standing confusion: for historical reasons KB and MB often mean the 2^10-based units, so when you see KB you really don't know which is meant. That's why I am a staunch supporter of the unambiguous KiB and MiB.
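
To make the difference concrete, a quick worked example in Python:

  size_bytes = 5_242_880          # a 5 MiB object
  print(size_bytes / 1024 ** 2)   # 5.0      -> "5.0 MiB"
  print(size_bytes / 1000 ** 2)   # 5.24288  -> "5.24 MB"

Same object, two perfectly correct numbers, and a confused user comparing the two.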

phrotoma
3 replies
1d7h

Here's another fun one that took a colleague and me several days of analysis to diagnose. S3 will silently drop all requests once a single TCP connection has sent 100 HTTP requests.

https://github.com/aws/aws-sdk-go/issues/2825

seabrookmx
2 replies
1d2h

It doesn't silently drop; it sends a header indicating it has closed the TCP connection.

This is a pretty common pattern whereby you want keep-alive for performance reasons, but you don't want clients holding connections open _too long_ and creating hot spots on your load balancers.

temac
0 replies
1d1h

Instead of closing the connection it sends a message stating that it is closed? Wow.

phrotoma
0 replies
1d1h

What header? Our client was envoy, which is pretty standards-compliant, and it just kept trying to use the connection.

Edit: I see that it is `connection: close`. I wonder if that is new behaviour or if envoy did not honour it at the time we encountered the issue.

Thanks for the info!

julik
1 replies
1d5h

A few more:

* Multipart uploads cannot be performed from multiple machines having instance credentials (as the principal will be different and they don't have access to each other's multipart uploads). You need an actual IAM user if you want to assemble a multipart upload from multiple machines.

* LIST requests are not only slow, but also very expensive if done in large numbers. There are workarounds ("bucket inventory") but they are neither convenient nor cheap

* Bucket creation is not read-after-write consistent, because it uses DNS under the hood. So it is possible that you can't access a bucket right after creating it, or that you can't delete a bucket you just created until you waited enough for the changes to propagate. See https://github.com/julik/talks/blob/master/euruko-2019-no-su...

* You can create an object called "foo" and an object called "foo/bar". This will make the data in your bucket unportable into a filesystem structure (it will be a file clobbering a directory)

* S3 is case-sensitive, meaning that you can create objects whose names will not port cleanly onto a case-insensitive filesystem (Rails file storage assumed a case-sensitive storage system, which made it break badly on macOS - this was fixed by always using lowercase identifiers)

* Most S3 configurations will allow GETs, but will not allow HEADs. Apparently this is their way to prevent probing for object existence, I am not sure. Either way - cache-honoring flows involving, say, a HEAD request to determine how large an object is will not work (with presigned URLs for sure!). You have to work around this by doing a GET with a very small Range: (say, the first byte only) - see the sketch after this list

* If you do a lot of operations using pre-signed URLs, it is likely you can speed up the generation of these URLs by a factor of 10x-40x (see https://github.com/WeTransfer/wt_s3_signer)

* You still pay for storage of unfinished multipart uploads. If you are not careful and, say, these uploads can be initiated by users, you will be paying for storing them - there is a setting for deleting unfinished MP uploads automatically after some time. Do enable it if you don't want to have a bad time.
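
Re the HEAD point above, a minimal boto3 sketch of the GET-with-Range workaround (bucket and key names hypothetical):

  import boto3

  s3 = boto3.client("s3")

  # Fetch only the first byte; the total size comes back in Content-Range.
  resp = s3.get_object(Bucket="my-bucket", Key="big-file.bin", Range="bytes=0-0")
  total_size = int(resp["ContentRange"].split("/")[-1])  # e.g. "bytes 0-0/1048576"
  print(total_size)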

These are just off the top of my head :-) Paradoxically, S3 used to be revolutionary and still is, on multiple levels, a great product. But: plenty of features, plenty of caveats.

dugmartin
0 replies
1d3h

The one that caught me a couple of weeks ago is that multipart uploads have a minimum part size of 5 MiB for every part except the last (https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts...). I built a streaming CSV post-processing pipeline in Elixir that uses Stream.transform (https://hexdocs.pm/elixir/Stream.html#transform/3) to modify and inject columns. The Elixir AWS and CSV modules handle streaming data in, but the AWS module throws an error (from S3) if you stream "out" a total of less than 5 MiB, as it uses multipart uploads - which made me sad.
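
For anyone hitting the same thing, a rough Python sketch of the usual workaround (buffer until you have at least 5 MiB, then ship a part; bucket and key names hypothetical) - essentially what a streaming pipeline has to do internally:

  import boto3

  s3 = boto3.client("s3")
  MIN_PART = 5 * 1024 * 1024  # every part except the last must be at least 5 MiB

  def stream_to_s3(chunks, bucket, key):
      # Buffer an iterable of byte chunks so each uploaded part meets the minimum.
      mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
      upload_id = mpu["UploadId"]
      parts, buf, part_no = [], bytearray(), 1
      try:
          for chunk in chunks:
              buf.extend(chunk)
              if len(buf) >= MIN_PART:
                  r = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                     PartNumber=part_no, Body=bytes(buf))
                  parts.append({"ETag": r["ETag"], "PartNumber": part_no})
                  part_no += 1
                  buf.clear()
          if buf:  # the final part is allowed to be smaller than 5 MiB
              r = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                 PartNumber=part_no, Body=bytes(buf))
              parts.append({"ETag": r["ETag"], "PartNumber": part_no})
          s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                       MultipartUpload={"Parts": parts})
      except Exception:
          s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
          raise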

jimbobthrowawy
1 replies
1d4h

Those "uploader decides" rules are wild. Does that mean someone with a sufficiently poorly configured website can have user content uploaded to, and later served from, Amazon Glacier? (assuming a sufficiently motivated user)

anonymouse008
1 replies
1d1h

I’ve become old. If you look at these things in disgust (ACLs vs policies, deleting a bucket with s3:*, etc.), you’re missing the point of (deterministic?) software. It does what it is written to do, faithfully, and errors out when it can’t. When it doesn’t do as written or as documented, then yes… go full bore.

temac
0 replies
1d

The doc is huge, and the principle of least astonishment is often not respected.

Also, third-party providers support a random subset of it, given that the protocol has nothing simple about it anymore (or maybe never did).

yencabulator
0 replies
20h4m

The S3 API, and most of AWS, is a kludgy legacy mess.

Is there any chance of getting a new, industry-standard (not just industry-adopted), simpler but more usable[1] common API?

We already have some client libraries that try to paper over the differences (https://gocloud.dev/, https://crates.io/crates/object_store), but wouldn't it be nice to have just one wire protocol?

[1]: E.g. standardize create-if-not-exist, even if S3 doesn't implement that.

yencabulator
0 replies
20h27m

Object lock until 2099 that can only be cancelled by deleting the AWS account is *nasty*.

xarope
0 replies
1d10h

There are only losers in this game, but at least we’ve all got a participation ribbon to comfort us in moments of angst.

I sense the frustration...

unethical_ban
0 replies
36m

Schrodinger’s cat is the one that’s both alive and de-lifed at the same time, right?

It was alive and dead at the same time. Don't use censor-approved language outside TikTok!

swiftcoder
0 replies
1d10h

S3 isn’t the only service that works this way; hosted Cognito UI endpoints do something similar (https://[your-user-pool-domain]/login).

Basically every new service that expects to deal with a lot of traffic should work this way (we did this also for AWS IoT). It's a hell of a lot easier to deal with load balancing and request routing if you can segment resources starting at the DNS level...

prpl
0 replies
1d12h

Not one thing about the different API rate limits for operations depending on key prefixes, and how you need to contact support and provide them the prefixes if you need partitioning, huh?

perpil
0 replies
1d5h

A workaround for some of the limits of presigned URLs, like not being able to specify a max file size, is to front your uploads with CloudFront OAC and CloudFront Functions. It costs more ($0.02/GB) but you can run a little JavaScript code to validate/augment headers between your user and S3, and you don't need to expose your bucket name. https://speedrun.nobackspacecrew.com/blog/2024/05/22/using-c...

nathants
0 replies
22h20m

a fun read!

i just built a live streaming platform[1]. chat, m3u8, ts, all objects. list operations used to concat chat objects. works perfectly, object storage is such a great api.

it uses r2 as the only data store, which is a delightfully simple s3-alike. the only thing i miss are some of the advanced listv2 features.

to anyone not enjoying s3 complexity, go try r2.

1. https://nathants.com/live

est31
0 replies
21h51m

Regarding the deletion point, note that you cannot delete S3 buckets that are non-empty, so in order to actually delete a bucket you first have to delete its objects. Of course, if any action is allowed then that is allowed as well. But still, it's not a single request away for any non-trivial bucket.

croes
0 replies
1d6h

Things you wish you didn't need to know about S3

A time travel paradox in the title is a good place to start a blog post, don’t you think?

Where is the paradox?

buggythebug
0 replies
1d6h

Audi S3?

CaliforniaKarl
0 replies
1d12h

S3 buckets are the S3 API

… a relatively small part of the API requires HTTP requests to be sent to generic S3 endpoints (such as s3.us-east-2.amazonaws.com), while the vast majority of requests must be sent to the URL of a target bucket.

I believe this is talking about virtual-hosted style and path-style methods for accessing the S3 API.

From what I can see [0], at least for the REST API, the entire API works both with virtual-hosted style (where the bucket name is in the host part of the URL) and with path-style (where the bucket name is in the path part of the URL). Amazon has been wanting folks to move over to the virtual-hosted style for a long time, but (as of 3+ years ago) the deprecation of path-style has been delayed[1].

This delay in deprecating path-style requests has been extremely important for products implementing the S3 API. For example…

* MinIO uses path-style requests by default, requiring you set a configuration variable[2] (and set up DNS appropriately) to handle the virtual-hosted style.

* Wasabi supports both path-style and virtual-hosted style, but "Wasabi recommends using path-style requests as shown in all examples in this guide (for example, http://s3.wasabisys.com/my-bucket/my-object) because the path-style offers the greatest flexibility in bucket names, avoiding domain name issues."[3].

Now here's the really annoying part: the REST API examples show virtual-hosted style, but path-style works too!

For example, take the GetBucketTagging example. Let's say you have bucket "karl123456" in region US-West-2. The example would have you do this:

GET /?tagging HTTP/1.1

Host: karl123456.s3.amazonaws.com

But instead, you can do this:

GET /karl123456?tagging HTTP/1.1

Host: s3.us-west-2.amazonaws.com

!!

How do I know this? I tried it! I constructed a `curl` command to do the path-style request, and it worked! (I didn't use "karl123456", though.)

So hopefully that helps resolve at least one of your S3 annoyances :-)
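
If you'd rather not hand-roll the request, a small boto3 sketch that (if I remember the botocore config right) forces the same path-style addressing, reusing the hypothetical bucket from the example above:

  import boto3
  from botocore.config import Config

  # Force path-style requests: https://s3.us-west-2.amazonaws.com/<bucket>/...
  s3 = boto3.client(
      "s3",
      region_name="us-west-2",
      config=Config(s3={"addressing_style": "path"}),
  )
  print(s3.get_bucket_tagging(Bucket="karl123456"))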

[0]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAP...

[1]: https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-...

[2]: https://min.io/docs/minio/linux/reference/minio-server/setti...

[3]: https://docs.wasabi.com/docs/rest-api-introduction