Too bad we now have Unicode, an elegant castle covered with ugly graffiti and ramshackle addons. For example:
1. normalization
2. backwards running text (hey, why not add spiral running text?)
3. fonts
4. invisible characters
5. multiple code points with the same glyph
6. glyphs defined by multiple code points (gee, I thought Unicode was to get away from that mess we had with code pages!)
7. made up languages (Elvish? Come on!)
8. you vote for my made-up emoticon, and I'll vote for yours!
How to say you don't know what Unicode is for without saying it.
1, 2, 4, 5, 6, and, unfortunately, 8 all fall under "ability to encode written text from all human languages". And that includes historical ones. Some of the issues (5 & 6) are due to semantic differences even when the resulting glyph looks the same. Unfortunately you can't expect programmers to understand pesky little things like languages having different writing systems, so you end up with normalisation to handle the fact that one system sent "a + ogonek accent" and another (properly) sent "a with ogonek" (these print the same but are semantically different!), and now you need to figure out normalisation in order to be able to compare strings.
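To make the comparison problem concrete, here is a minimal Python sketch (just the standard unicodedata module; the code points shown are the usual ones for "ą"):

    import unicodedata

    precomposed = "\u0105"    # U+0105 LATIN SMALL LETTER A WITH OGONEK
    combining = "a\u0328"     # U+0061 'a' + U+0328 COMBINING OGONEK

    print(precomposed == combining)     # False: the code point sequences differ
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combining))   # True: canonically equivalent

Both strings display as "ą"; only after normalising both to a common form (NFC or NFD) does a naive comparison treat them as equal.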
7, just like 8, comes down to proposals for specific new forms of writing to add to Unicode. Elvish has had one since 1997 but only now got a tentative "we will talk about it". Klingon, which is IIRC a more complete language, including native speakers (...weird things happen sometimes), has nothing outside the Private Use Area.
Emoji were added because they were already in use with incompatible encodings first, even before Unicode happened, and without including something like SIXEL in Unicode they were unrepresentable (and with SIXEL they would lose semantic information).
How can these possibly be semantically different? Isn’t the point of combining characters to create semantic characters that are the combination of those parts?
There's a semantic difference between "accented letter" and "different letter that happens to visually look like another language's accented letter".
"Ą" in Polish is not "A" with some accent. And the idea behind Unicode was to preserve human written text, including keeping track of things like "this is letter A1 with an accent, but this is letter A2 that looks visually similar to A1 with an accent but is semantically different". Of course, then worries about code page size resulted in the stupidity of Han unification, so Unicode is a bit broken.
But it is precisely "a with some accent"; you just have two ways to encode it.
"Ą" is a separate letter in the Polish alphabet, not an accented variant of "A".
There are writing systems where combining accents are used to represent just variation on a letter. Use of combining characters for "Ą" (and "Ć" and "Ł" and many other so-called "polish letters") is, at best, a historical artefact of trying to write them in deficient encodings.
It doesn't matter that it's a separate letter in an alphabet, you're denying the obvious - it IS an accented (or ogonek'ed) variant of A, and you can achieve this in Unicode in 2 ways: having one id for a precomposed variant and composing the variant from two ids.
There is no semantic difference, just an encoding one. The end result looks the same and means the same thing (well, to a point, since it still depends on the context, like what language you mean; but within the same context it's the same thing, and there are even Unicode rules to treat it the same, e.g. in search).
And the precomposed form is just the same historical deficiency: you could just as well have designed a more compact encoding with no precomposed letters, only combinations.
Unless there's some nuance I'm missing, I think you're reading too much into the word "accent".
Especially because the codepoint is actually called "Combining Ogonek".
And for anyone writing in Cyrillic, it's actually more accurate to use the combining form, even as its own letter, because the only precomposed form technically uses a Latin A.
But my main point is that I do not think there is supposed to be any semantic difference in Unicode based on whether you use precomposed or decomposed code points.
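That last point is easy to check (a quick Python sketch, assuming I'm right that Unicode has no precomposed Cyrillic A-with-ogonek): NFC composes the Latin sequence into U+0104 but leaves the Cyrillic sequence decomposed.

    import unicodedata

    latin = "\u0041\u0328"      # LATIN CAPITAL LETTER A + COMBINING OGONEK
    cyrillic = "\u0410\u0328"   # CYRILLIC CAPITAL LETTER A + COMBINING OGONEK

    print([hex(ord(c)) for c in unicodedata.normalize("NFC", latin)])     # ['0x104']  -> precomposed Ą
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", cyrillic)])  # ['0x410', '0x328'] -> stays decomposed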
I know what its original mission was, which was a character set.
It's been mangled beyond recognition: by including semantic information, which is in the purview of context; presentation information (italics, fonts), which is in the purview of markup languages; and layout information (backwards text), which is also in the purview of markup.
But you're requiring programmers to understand all the complicated normalization rules? Normalization is a totally unnecessary feature. Just use the normalized code points. Done.
Think about what this means. How ever did people manage to read and understand printed books? The semantic meaning comes from the context, not the glyph. For example, I can use `a` to mean the `ahh` sound, or the `ayy` sound, or mean a variable in algebra. How can I know which? The context.
It is totally impossible to add every meaning a glyph has.
Unicode is supposed to be a character set. That's it. Characters do not have semantic information without context.
Oh, and here's some drawkcab text I wrote without any help from Unicode at all.
I had to add some code into the D compiler to reject Unicode text direction "characters" because they can be used to invisibly insert malware into ordinary code.
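(For reference, the characters in question are the explicit bidirectional controls, U+202A..U+202E and U+2066..U+2069. A rough sketch of that kind of check, written here in Python, not the actual D compiler code:)

    # Hypothetical sketch of rejecting explicit bidi controls in source text;
    # not the actual D compiler implementation.
    BIDI_CONTROLS = {
        "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
        "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
    }

    def check_source(text: str) -> None:
        for lineno, line in enumerate(text.splitlines(), start=1):
            for ch in line:
                if ch in BIDI_CONTROLS:
                    raise ValueError(
                        f"line {lineno}: bidi control U+{ord(ch):04X} not allowed")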
Adding toy "languages" should be for people having fun, not Unicode.
As someone whose native language isn't representable purely in ASCII, I celebrate it. Plus, the first 128 code points are the same as ASCII in UTF-8.
Is Unicode kind of messy? Sure, but that's just a natural consequence of writing systems being messy. Every point you made has a sensible reason within the scope of Unicode's mission (representing all text in all writing systems).
I'm sure that books can be printed in your language without any need for semantic information in the characters.
Yes, they can.
Is it a problem that they do? I don't think so. Using semantic symbols seems like the far better option. Most fonts simply map multiple code points to a single glyph while dealing with all the fun stuff like ligatures and everything from GSUB tables (and their companion tables in fonts).
Honestly, I see the semantic information as an absolute win and a good choice. If Unicode didn't contain it, it would have to live somewhere else (or you'd make rather unpleasant choices, like having "fj" together as one character). It's an illusion that it wouldn't. People want pretty text. The rest of the world doesn't care about the details; they want pretty text everywhere.
Instead of hating Unicode, people would be hating "glyph points" plus "markup" (which would be literally everywhere, from email to form editors), and that combination has all kinds of problems.
Except it doesn't actually work. 'a' has a zillion different semantic meanings, all dependent on context. There is no crisis with somebody reading a book and misunderstanding which particular semantic meaning it has, because it is inferred from the context.
Semantic meaning always comes from context, and Unicode cannot fix that. People can use the mathematical code point for 'a' instead of the text 'a', and the semantic Unicode meaning is meaningless, because the reader will see it (like the letter 'a' in "because") as ordinary text.
The only thing you get with multiple code points for 'a' is that you can send out multiple identical-looking texts that are different Unicode, so you can determine who leaked the memo.
Unicode's extremely limited markup ability helps nobody.
How do you Ctrl+F in a printed book? Why printed, when we're talking about digital?
If you search for 'a', which one of the Unicode 'a's will it find?
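As a concrete illustration (a minimal Python sketch; U+1D44E is MATHEMATICAL ITALIC SMALL A), a plain substring search misses the mathematical variant unless you first apply compatibility normalization:

    import unicodedata

    text = "let \U0001d44e = 5"   # uses U+1D44E MATHEMATICAL ITALIC SMALL A

    print("a" in text)                                  # False: plain search misses it
    print("a" in unicodedata.normalize("NFKC", text))   # True: NFKC folds it back to 'a'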
I can't wait for when the majority of Unicode codepoints/glyphs are emojis that are no longer fashionable! That'll be a really weird relic of history, later.
It would probably be like other letters like þ that are no longer fashionable in some languages. Or not-so-small parts of Hanzi. Or completely dead scripts.
That being said, emoji are a drop in the bucket when it comes to the number of encoded code points. Nicely enough, by encoding emoji outside the BMP, you can now use characters from the astral planes in a lot more places without software breaking.
All languages are made up. For that matter, all glyphs are made up, too.
There is not only a quantitative difference between a conlang designed by a small group (or one person) and a "human" language developed organically over centuries by millions of speakers, but also a qualitative one.
Unfortunately there is plenty of precedent for this ramshacklism. Like ACK/NAK: those are protocol signals, not characters! ENQ? What even is Shift In/Shift Out (SI/SO)? Then there are the database characters toward the end: FS, RS, GS, US.
You jest, but you do have ANSI cursor-positioning sequences which are designed to let text draw anywhere on your screen. And make it blink! And you don't find it weird to have a destructive "clear-screen" sequence?
I wonder when they started putting the slash across the 0 to differentiate from the O.
I mean, you do have the Unicode Private Use Area where you can actually do that. But before that, SIXEL graphics.
American Standard Code for Information Interchange
Unicode encodes code points in logical order rather than visual order: the order in which text is supposed to be collated and spoken rather than the visual order.
One tricky issue is when both directions exist in the same text. Unicode can encode nesting of text in one direction within another. For example, text consisting of an English word and a Hebrew word can be encoded as either the English embedded in Hebrew or the Hebrew embedded in English: both would render the same but collate differently.
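As a rough illustration (a Python sketch, using a Hebrew word picked just for the example; actual display depends on the renderer's bidi algorithm):

    import unicodedata

    # Logical (reading) order: shin, lamed, vav, final mem.  A bidi-aware renderer
    # draws this right-to-left, so the first code point appears rightmost on screen.
    shalom = "\u05e9\u05dc\u05d5\u05dd"
    for ch in shalom:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # Mixed-direction text can mark which run is embedded in which, e.g. with the
    # directional isolates U+2066 (LRI), U+2067 (RLI) and U+2069 (PDI).
    english_inside_hebrew = shalom + " \u2066HELLO\u2069"
    hebrew_inside_english = "HELLO \u2067" + shalom + "\u2069"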
Is there a better way?
I've seen newspapers with txet sdrawkcab in them. Note that the last sentence has text in both directions.
I didn't need Unicode for that - nobody does.
Why is there no Unicode markup for that?
What would be the alternative? I think Unicode is pretty great.
You can pretty easily imagine a world where we had a bunch of different encodings with none being dominant.
Unicode is quite elegant in its encoding too. If you're going to criticize it for its content, maybe start by talking about how ASCII also has invisible characters and ones that people rarely use.
Hey at least we got the astral planes. https://justine.lol/dox/unicode.txt
9. color variants
10. code points where the appropriate glyph depends on the language (CJK unification)
Language itself is a pile of ugly graffiti and ramshackle addons. It would be weird if Unicode didn't reflect this.
They're all made-up languages, some were just made-up a little bit more transparently.
Most of this is pretty useful for reproducing a wide gamut of human language. It gets completely fucked when it comes to fonts with PNGs embedded in SVGs and other INSANE matryoshka-doll nesting of bitmap/vector rendering technologies.
I also half hate emoji, as they pollute human-writable text with bitmaps that are difficult to reproduce by hand on paper with a writing instrument; that's not text. I say half hate because they give us a standard set of icons that can be easily rendered in line with text or on their own.
For me it's how they inconsistently and backwards-incompatibly make some existing characters outside the emoji plane (especially those in technical/mathematical blocks) render colored by default, rather than keeping everything color-related in the emoji plane (making copies if needed rather than affecting old characters; the semantics are very different anyway), e.g. https://imgur.com/a/Ugi7K1i and https://imgur.com/a/UMppZHG