Too bad we now have Unicode, an elegant castle covered with ugly graffiti and ramshackle addons. For example:
1. normalization
2. backwards running text (hey, why not add spiral running text?)
3. fonts
4. invisible characters
5. multiple code points with the same glyph
6. glyphs defined by multiple code points (gee, I thought Unicode was to get away from that mess we had with code pages!)
7. made up languages (Elvish? Come on!)
8. you vote for my made-up emoticon, and I'll vote for yours!
How to say you don't know what Unicode is for without saying it.
1, 2, 4, 5, 6, and, unfortunately, 8 all fall under "ability to encode written text from all human languages". And that includes historical ones. Some of the issues (5 & 6) are due to semantic differences even when the resulting glyph looks the same. Unfortunately you can't expect programmers to understand pesky little things like languages having different writing systems, so you end up with normalisation to handle the fact that one system sent "a + ogonek accent" and another (properly) sent "a with ogonek" (these print the same but are semantically different!), and now you need to figure out normalisation in order to be able to compare strings.
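To make the comparison problem concrete, here is a minimal Python sketch (just the standard unicodedata module; the code points shown are the usual ones for "ą"):

    import unicodedata

    precomposed = "\u0105"    # U+0105 LATIN SMALL LETTER A WITH OGONEK
    combining = "a\u0328"     # U+0061 'a' + U+0328 COMBINING OGONEK

    print(precomposed == combining)     # False: the code point sequences differ
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combining))   # True: canonically equivalent

Both strings display as "ą"; only after normalising both to a common form (NFC or NFD) does a naive comparison treat them as equal.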
7, just like 8, comes down to proposals for specific new forms of writing to add to Unicode. Elvish has had one since 1997 but only now got a tentative "we will talk about it". Klingon, which is IIRC a more complete language, including native speakers (...weird things happen sometimes), has nothing outside the Private Use Area.
Emoji were added because they were already in use with incompatible encodings first, even before Unicode happened, and without including something like SIXEL in Unicode they were unrepresentable (and with SIXEL they would lose semantic information).
How can these possibly be semantically different? Isn’t the point of combining characters to create semantic characters that are the combination of those parts?
There's a semantic difference between "accented letter" and "different letter that happens to visually look like another language's accented letter".
"Ą" in Polish is not "A" with some accent. And the idea behind Unicode was to preserve human written text, including keeping track of things like "this is letter A1 with an accent, but this is letter A2 that looks visually similar to A1 with an accent but is semantically different". Of course, then worries about code page size resulted in the stupidity of Han unification, so Unicode is a bit broken.
But it is precisely "a with some accent"; you just have two ways to encode it.
"Ą" is a separate letter in the Polish alphabet, not an accented variant of "A".
There are writing systems where combining accents are used to represent just variation on a letter. Use of combining characters for "Ą" (and "Ć" and "Ł" and many other so-called "polish letters") is, at best, a historical artefact of trying to write them in deficient encodings.
It doesn't matter that it's a separate letter in an alphabet, you're denying the obvious - it IS an accented (or ogonek'ed) variant of A, and you can achieve this in Unicode in 2 ways: having one id for a precomposed variant and composing the variant from two ids.
There is no semantic difference, just an encoding one. The end result looks the same and means the same thing (well, to a point, since it still depends on the context, like what language you mean; but within the same context it's the same thing, and there are even Unicode rules to treat it the same, e.g. in search).
And the precomposed form is just the same historical deficiency: you could just as well have designed a more compact encoding with no precomposed letters, only combinations.
Unless there's some nuance I'm missing, I think you're reading too much into the word "accent".
Especially because the codepoint is actually called "Combining Ogonek".
And for anyone writing in Cyrillic, it's actually more accurate to use the combining form, even as its own letter, because the only precomposed form technically uses a Latin A.
But my main point is that I do not think there is supposed to be any semantic difference in Unicode based on whether you use precomposed or decomposed code points.
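That last point is easy to check (a quick Python sketch, assuming I'm right that Unicode has no precomposed Cyrillic A-with-ogonek): NFC composes the Latin sequence into U+0104 but leaves the Cyrillic sequence decomposed.

    import unicodedata

    latin = "\u0041\u0328"      # LATIN CAPITAL LETTER A + COMBINING OGONEK
    cyrillic = "\u0410\u0328"   # CYRILLIC CAPITAL LETTER A + COMBINING OGONEK

    print([hex(ord(c)) for c in unicodedata.normalize("NFC", latin)])     # ['0x104']  -> precomposed Ą
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", cyrillic)])  # ['0x410', '0x328'] -> stays decomposed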
I know what its original mission was, which was a character set.
It's been mangled beyond recognition: by including semantic information, which is in the purview of context; presentation information (italics, fonts), which is in the purview of markup languages; and layout information (backwards text), which is also in the purview of markup.
But you're requiring programmers to understand all the complicated normalization rules? Normalization is a totally unnecessary feature. Just use the normalized code points. Done.
Think about what this means. How ever did people manage to read and understand printed books? The semantic meaning comes from the context, not the glyph. For example, I can use `a` to mean the `ahh` sound, or the `ayy` sound, or mean a variable in algebra. How can I know which? The context.
It is totally impossible to add every meaning a glyph has.
Unicode is supposed to be a character set. That's it. Characters do not have semantic information without context.
Oh, and here's some drawkcab text I wrote without any help from Unicode at all.
I had to add some code into the D compiler to reject Unicode text direction "characters" because they can be used to invisibly insert malware into ordinary code.
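(For reference, the characters in question are the explicit bidirectional controls, U+202A..U+202E and U+2066..U+2069. A rough sketch of that kind of check, written here in Python, not the actual D compiler code:)

    # Hypothetical sketch of rejecting explicit bidi controls in source text;
    # not the actual D compiler implementation.
    BIDI_CONTROLS = {
        "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
        "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
    }

    def check_source(text: str) -> None:
        for lineno, line in enumerate(text.splitlines(), start=1):
            for ch in line:
                if ch in BIDI_CONTROLS:
                    raise ValueError(
                        f"line {lineno}: bidi control U+{ord(ch):04X} not allowed")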
Adding toy "languages" should be for people having fun, not Unicode.
As someone whose native language isn't representable purely in ASCII, I celebrate it. Plus, the first 128 code points are the same as ASCII in UTF-8.
Is Unicode kind of messy? Sure, but that's just a natural consequence of writing systems being messy. Every point you made has a sensible reason within the scope of Unicode's mission (representing all text in all writing systems).
I'm sure that books can be printed in your language without any need for semantic information in the characters.
Yes, they can.
Is it a problem that they do? I don't think so. Using semantic symbols seems like the far better option. Most fonts simply map multiple code points to a single glyph while dealing with all the fun stuff like ligatures and everything from GSUB tables (and their companion tables in fonts).
Honestly, I see the semantic information as an absolute win and a good choice. If Unicode didn't contain it, it would have to live somewhere else (or you'd make rather unpleasant choices, like having "fj" together as one character). It's an illusion that it wouldn't. People want pretty text. The rest of the world doesn't care about the details; they want pretty text everywhere.
Instead of hating Unicode, people would be hating "glyph points" plus "markup" (which would be literally everywhere, from email to form editors), and that combination has all kinds of problems.
Except it doesn't actually work. 'a' has a zillion different semantic meanings, all dependent on context. There is no crisis with somebody reading a book and misunderstanding which particular semantic meaning it has, because it is inferred from the context.
Semantic meaning always comes from context, and Unicode cannot fix that. People can use the mathematical code point for 'a' instead of the text 'a', and the semantic Unicode meaning is meaningless, because the reader will see it (like the letter 'a' in "because") as ordinary text.
The only thing you get with multiple code points for 'a' is that you can send out multiple identical-looking texts that are different Unicode, so you can determine who leaked the memo.
Unicode's extremely limited markup ability helps nobody.
How do you Ctrl+F in a printed book? Why printed, when we're talking about digital?
If you search for 'a', which one of the Unicode 'a's will it find?
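As a concrete illustration (a minimal Python sketch; U+1D44E is MATHEMATICAL ITALIC SMALL A), a plain substring search misses the mathematical variant unless you first apply compatibility normalization:

    import unicodedata

    text = "let \U0001d44e = 5"   # uses U+1D44E MATHEMATICAL ITALIC SMALL A

    print("a" in text)                                  # False: plain search misses it
    print("a" in unicodedata.normalize("NFKC", text))   # True: NFKC folds it back to 'a'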
I can't wait for when the majority of Unicode codepoints/glyphs are emojis that are no longer fashionable! That'll be a really weird relic of history, later.
It would probably be like other letters like þ that are no longer fashionable in some languages. Or not-so-small parts of Hanzi. Or completely dead scripts.
That being said, emoji are a drop in the bucket when it comes to the number of encoded code points. Nicely enough, by encoding emoji outside the BMP, you can now use characters from the astral planes in a lot more places without software breaking.
All languages are made up. For that matter, all glyphs are made up, too.
There is not only a quantitative difference between a conlang designed by a small group (or one person) and a "human" language developed organically over centuries by millions of speakers, but also a qualitative one.
Unfortunately there is plenty of precedent for this ramshacklism. Like ACK/NAK: those are protocol signals, not characters! ENQ? What even is Shift In/Shift Out (SI/SO)? Then there are the database characters toward the end: FS, RS, GS, US.
You jest, but you do have ANSI cursor-positioning sequences which are designed to let text draw anywhere on your screen. And make it blink! And you don't find it weird to have a destructive "clear-screen" sequence?
I wonder when they started putting the slash across the 0 to differentiate from the O.
I mean, you do have the Unicode Private Use Area where you can actually do that. But before that, SIXEL graphics.
American Standard Code for Information Interchange
Unicode encodes code points in logical order rather than visual order: the order in which text is supposed to be collated and spoken rather than the visual order.
One tricky issue is when both directions exist in the same text. Unicode can encode nesting of text in one direction within another. For example, text consisting of an English word and a Hebrew word can be encoded as either the English embedded in Hebrew or the Hebrew embedded in English: both would render the same but collate differently.
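As a rough illustration (a Python sketch, using a Hebrew word picked just for the example; actual display depends on the renderer's bidi algorithm):

    import unicodedata

    # Logical (reading) order: shin, lamed, vav, final mem.  A bidi-aware renderer
    # draws this right-to-left, so the first code point appears rightmost on screen.
    shalom = "\u05e9\u05dc\u05d5\u05dd"
    for ch in shalom:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # Mixed-direction text can mark which run is embedded in which, e.g. with the
    # directional isolates U+2066 (LRI), U+2067 (RLI) and U+2069 (PDI).
    english_inside_hebrew = shalom + " \u2066HELLO\u2069"
    hebrew_inside_english = "HELLO \u2067" + shalom + "\u2069"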
Is there a better way?
I've seen newspapers with txet sdrawkcab in them. Note that the last sentence has text in both directions.
I didn't need Unicode for that - nobody does.
Why is there no Unicode markup for that?
What would be the alternative? I think Unicode is pretty great.
You can pretty easily imagine a world where we had a bunch of different encodings with none being dominant.
Unicode is quite elegant in its encoding too. If you're going to criticize it for its content, maybe start by talking about how ASCII also has invisible characters and ones that people rarely use.
Hey at least we got the astral planes. https://justine.lol/dox/unicode.txt
9. color variants
10. code points where the appropriate glyph depends on the language (CJK unification)
Language itself is a pile of ugly graffiti and ramshackle addons. It would be weird if Unicode didn't reflect this.
They're all made-up languages, some were just made-up a little bit more transparently.
Most of this is pretty useful for reproducing a wide gamut of human language. It gets completely fucked when it comes to fonts with PNGs embedded in SVGs and other INSANE matryoshka-doll nesting of bitmap/vector rendering technologies.
I also half hate emoji, as they pollute human-writable text with bitmaps that are difficult to reproduce by hand on paper with a writing instrument; that's not text. I say half hate because they give us a standard set of icons that can be easily rendered in line with text or on their own.
For me it's how they inconsistently and backwards-incompatibly make some existing characters outside the emoji plane (especially those in technical/mathematical blocks) render colored by default, rather than keeping everything color-related in the emoji plane (making copies if needed rather than affecting old characters; the semantics are very different anyway), e.g. https://imgur.com/a/Ugi7K1i and https://imgur.com/a/UMppZHG