
The Elegance of the ASCII Table

WalterBright
30 replies
1d12h

Too bad we now have Unicode, an elegant castle covered with ugly graffiti and ramshackle addons. For example:

1. normalization

2. backwards running text (hey, why not add spiral running text?)

3. fonts

4. invisible characters

5. multiple code points with the same glyph

6. glyphs defined by multiple code points (gee, I thought Unicode was to get away from that mess from code pages!)

7. made up languages (Elvish? Come on!)

8. you vote for my made-up emoticon, and I'll vote for yours!

p_l
7 replies
1d12h

How to say you don't know what Unicode is for without saying it.

1, 2, 4, 5, 6, and, unfortunately, 8 all fall under "ability to encode written text from all human languages". And that includes historical ones. Some of the issues (5 & 6) are due to semantic differences even when the resulting glyph looks the same. Unfortunately you can't expect programmers to understand pesky little things like languages having different writing systems, so you end up with normalisation to handle the fact that one system sent "a + ogonek accent" and another (properly) sent "a with ogonek" (these print the same but are semantically different!), and now you need to figure out normalisation in order to be able to compare strings.
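The normalisation dance is easy to see with Python's `unicodedata` module (a minimal sketch):

```python
import unicodedata

precomposed = "\u0105"   # U+0105 LATIN SMALL LETTER A WITH OGONEK
decomposed = "a\u0328"   # 'a' followed by U+0328 COMBINING OGONEK

# Byte-for-byte the strings differ, yet they render identically.
assert precomposed != decomposed

# NFC folds the pair into the precomposed letter; NFD does the reverse.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```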

7, just like 8, comes down to proposals of specific new forms of writing to add to Unicode. Elvish has had one since 1997 but only now got a tentative "we will talk about it". Klingon, which is IIRC a more complete language, including native speakers (...weird things happen sometimes), does not have an encoding outside of the private use area.

Emoji were added because they were already in use with incompatible encodings, even before Unicode existed, and without including something like SIXEL in Unicode they were unrepresentable (and with SIXEL they would lose semantic information).

news_to_me
5 replies
1d3h

"a + ogonek accent" and another (properly) sent "a with ogonek" (these print the same but are semantically different!)

How can these possibly be semantically different? Isn’t the point of combining characters to create semantic characters that are the combination of those parts?

p_l
4 replies
1d

There's a semantic difference between "accented letter" and "different letter that happens to visually look like another language's accented letter".

"Ą" in Polish is not "A" with some accent. And the idea behind Unicode was to preserve human written text, including keeping track of things like "this is letter A1 with an accent, but this is letter A2 that looks visually similar to A1 with an accent but is semantically different". Of course then worries about code page size resulted in the stupidity of Han unification, so Unicode is a bit broken.

eviks
2 replies
7h48m

But it is precisely "a with some accent"; you just have two ways to encode it.

p_l
1 replies
3h44m

"Ą" is a separate letter in the Polish alphabet, not an accented variant of "A".

There are writing systems where combining accents are used to represent mere variations on a letter. Use of combining characters for "Ą" (and "Ć" and "Ł" and many other so-called "Polish letters") is, at best, a historical artefact of trying to write them in deficient encodings.

eviks
0 replies
3h37m

It doesn't matter that it's a separate letter in an alphabet; you're denying the obvious - it IS an accented (or ogonek'ed) variant of A, and you can achieve this in Unicode in two ways: having one id for a precomposed variant, or composing the variant from two ids.

There is no semantic difference, just an encoding one: the end result looks the same and means the same thing (well, to a point - it still depends on the context, like which language you mean - but within the same context it's the same thing, and there are even Unicode rules to treat the two the same, e.g. in search).

And the precomposed form is just the same kind of historical deficiency - you could just as well have designed a more compact encoding with no precomposed letters, only combinations.

Dylan16807
0 replies
22h54m

Unless there's some nuance I'm missing, I think you're reading too much into the word "accent".

Especially because the codepoint is actually called "Combining Ogonek".

And for anyone writing in Cyrillic, it's actually more accurate to use the combining form, even as its own letter, because the only precomposed form technically uses a latin A.

But my main point is that I do not think there is supposed to be any semantic difference in Unicode based on whether you use precomposed or decomposed code points.

WalterBright
0 replies
21h39m

How to say you don't know what Unicode is for without saying it.

I know what its original mission was, which was a character set.

It's been mangled beyond recognition - by including semantic information, which is in the purview of context; presentation information (italics, fonts), which is in the purview of markup languages; and layout information (backwards text), which is also in the purview of markup.

you can't expect programmers to understand pesky little thing like languages having different writing,

But you're requiring programmers to understand all the complicated normalization rules? Normalization is a totally unnecessary feature. Just use the normalized code points. Done.

these print the same but are semantically different!

Think about what this means. How ever did people manage to read and understand printed books? The semantic meaning comes from the context, not the glyph. For example, I can use `a` to mean the `ahh` sound, or the `ayy` sound, or mean a variable in algebra. How can I know which? The context.

It is totally impossible to add every meaning a glyph has.

would lose semantic information

Unicode is supposed to be a character set. That's it. Characters do not have semantic information without context.

Oh, and here's some drawkcab text I wrote without any help from Unicode at all.

I had to add some code into the D compiler to reject Unicode text direction "characters" because they can be used to invisibly insert malware into ordinary code.
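A sketch of such a check (the code points listed are the explicit bidirectional controls from the Unicode standard; the function name is made up for illustration):

```python
# Explicit bidirectional-control code points: LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI -- the characters behind "Trojan Source" attacks.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

def has_bidi_controls(source: str) -> bool:
    # Hypothetical lexer hook: flag source text containing any of them.
    return any(ch in BIDI_CONTROLS for ch in source)

assert has_bidi_controls("admin\u202e }// check")   # reordered on screen
assert not has_bidi_controls("plain old code")
```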

Adding toy "languages" should be for people having fun, not Unicode.

mnau
5 replies
1d9h

As someone whose native language isn't representable purely in ASCII, I celebrate it. Plus, the first 128 code points are the same as ASCII in UTF-8.

Is Unicode kind of messy? Sure, but that's just a natural consequence of writing systems being messy. Every point you made is there for a sensible reason that is within the scope of Unicode's mission (representing all text in all writing systems).

WalterBright
4 replies
21h35m

I'm sure that books can be printed in your language without any need for semantic information in the characters.

mnau
1 replies
19h48m

Yes, they can.

Is it a problem that they do? I don't think so. Using semantic symbols seems like the far better option. Most fonts simply map multiple code points to a single glyph while dealing with all the fun stuff like ligatures and everything in the GSUB tables (and their companion tables in fonts).

Honestly, I see the semantic information as an absolute win and a good choice. If Unicode didn't contain it, it would have to live somewhere else (or we'd be making rather unpleasant choices, like encoding "fj" together). It's an illusion that it wouldn't. People want pretty text. The rest of the world doesn't care about the mechanism; they want pretty text everywhere.

Instead of hating Unicode, there would be hatred for "glyph points" plus "markup" (which would be literally everywhere, from email to form editors) and all the problems that combination brings.

WalterBright
0 replies
16h8m

Using semantic symbols seems far better option

Except it doesn't actually work. 'a' has a zillion different semantic meanings, all dependent on context. There is no crisis with somebody reading a book and misunderstanding which particular semantic meaning it has, because it is inferred from the context.

Semantic meaning always comes from context, and Unicode cannot fix that. People can use the mathematical code point for 'a' instead of the text 'a', and the semantic Unicode meaning is meaningless, because the reader will see - like the letter 'a' in "because" - that it is being used as text.

The only thing you get with multiple code points for 'a' is that you can send out multiple identical-appearing texts that are different Unicode, so you can determine who leaked the memo.

Unicode's extremely limited markup ability helps nobody.

eviks
1 replies
7h45m

How do you Ctrl+F in a printed book? And why printed, when we're talking about digital?

WalterBright
0 replies
52m

If you search for 'a', which one of the Unicode 'a's will it find?
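To make the search question concrete: a plain string comparison won't match the mathematical 'a' (U+1D44E), while an NFKC compatibility fold will (a Python sketch):

```python
import unicodedata

math_a = "\U0001d44e"   # U+1D44E MATHEMATICAL ITALIC SMALL A

assert math_a != "a"                                  # a byte-wise search misses it
assert unicodedata.normalize("NFKC", math_a) == "a"   # a compatibility fold finds it
```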

yencabulator
1 replies
1d2h

I can't wait for when the majority of Unicode codepoints/glyphs are emojis that are no longer fashionable! That'll be a really weird relic of history, later.

ygra
0 replies
12h17m

It would probably be like other letters, such as þ, that are no longer fashionable in some languages. Or the not-so-small parts of Hanzi. Or completely dead scripts.

That being said, emoji are a drop in the bucket when it comes to the number of encoded code points. Nicely enough, by encoding emoji outside the BMP, you can now use characters from the astral planes in a lot more places without software breaking.

chthonicdaemon
1 replies
1d12h

All languages are made up. For that matter, all glyphs are made up, too.

bandie91
0 replies
1d9h

There is not only a quantitative difference between a conlang designed by a small group (or one person) and a "human" language developed organically over centuries by millions of speakers, but also a qualitative one.

RiverCrochet
1 replies
1d6h

covered with ugly graffiti and ramshackle addons

Unfortunately there is plenty of precedent for this ramshacklism. Like ACK/NAK - those are protocol signals, not characters! ENQ? What even is Shift In/Shift Out (SI/SO)? Then there are the database characters toward the end: FS, GS, RS, US.

backwards running text (hey, why not add spiral running text?)

You jest, but we do have cursor-positioning ANSI sequences, which are designed to let text draw anywhere on your screen. And make it blink! Don't you also find it weird to have a destructive "clear-screen" sequence?

glyphs defined by multiple code points

I wonder when they started putting the slash across the 0 to differentiate from the O.

you vote for my made-up emoticon, and I'll vote for yours!

I mean, you do have the Unicode Private Use Area, where you can actually do that. But before that, SIXEL graphics.

kps
0 replies
1d4h

Like ACK/NAK - those are protocol signals, not characters!

American Standard Code for Information Interchange

Findecanor
1 replies
22h42m

2. backwards running text (hey, why not add spiral running text?)

Unicode encodes code points in logical order rather than visual order: the order in which text is supposed to be collated and spoken rather than the visual order.

One tricky issue is when both directions exist in the same text. Unicode can encode nesting of text in one direction within another. For example, text consisting of an English word and a Hebrew word can be encoded as either the English embedded in Hebrew or the Hebrew embedded in English: both would render the same but collate differently.

Is there a better way?

WalterBright
0 replies
21h30m

I've seen newspapers with txet sdrawkcab in them. Note that the last sentence has text in both directions.

I didn't need Unicode for that - nobody does.

    Uni
    ! c
    edo
Why is there no Unicode markup for that?

shepherdjerred
0 replies
1d

What would be the alternative? I think Unicode is pretty great.

You can pretty easily imagine a world where we had a bunch of different encodings with none being dominant.

saagarjha
0 replies
1d4h

Unicode is quite elegant in its encoding too. If you're going to criticize it for its content, maybe start with talking about how ASCII also has invisible characters and those that people rarely use.

account42
0 replies
4h49m

9. color variants

10. code points where the appropriate glyph depends on the language (CJK unification)

Retr0id
0 replies
1d12h

Language itself is a pile of ugly graffiti and ramshackle addons. It would be weird if Unicode didn't reflect this.

NoMoreNicksLeft
0 replies
1d2h

They're all made-up languages, some were just made-up a little bit more transparently.

MisterTea
0 replies
21h57m

Most of this is pretty useful for reproducing a wide gamut of human language. It gets completely fucked when it comes to fonts, with PNGs embedded in SVGs and other INSANE matryoshka-doll nesting of bitmap/vector rendering technologies.

I also half hate emoji, as they pollute human-writable text with bitmaps that are difficult to reproduce by hand on paper with a writing instrument - they're not text. I say half hate because they give us a standard set of icons that can be easily rendered inline with text or on their own.

Aardwolf
0 replies
1d9h

For me it's how they inconsistently and backwards-incompatibly make some existing characters outside the emoji plane (especially those in the technical/mathematical blocks) render colored by default, rather than keeping everything color-related in the emoji plane (making copies if needed rather than affecting old characters - the semantics are very different anyway), e.g. https://imgur.com/a/Ugi7K1i and https://imgur.com/a/UMppZHG

BobbyTables2
29 replies
1d17h

I always lament that, since at least the 1980s or so, the vast majority of the control characters were never used for their intended purpose.

Instead, we crudely use commas and tabs as delimiters instead of something like RS (#30).
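For illustration, here's what delimiting with the ASCII separator characters looks like (a Python sketch; `US` and `RS` are just code points 31 and 30):

```python
US, RS = "\x1f", "\x1e"   # ASCII unit separator (31) and record separator (30)

rows = [["id", "color"], ["1", "red, crimson"], ["2", "blue"]]

# Serialize: fields joined by US, records joined by RS -- no quoting needed,
# because commas and tabs inside the data are just ordinary characters.
blob = RS.join(US.join(fields) for fields in rows)

parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```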

thaumasiotes
19 replies
1d17h

That's because the intended purpose is either useless (for machine control characters) or useless and logically impossible (for delimiters).

What do you do if you have a record that includes a record separator character? Given that you have this problem anyway, why do you want a character dedicated to achieving the same thing that a comma achieves?

penteract
15 replies
1d17h

The record separator isn't on people's keyboards, so it's less likely to show up where it's not expected. Also it's less likely to legitimately occur in something like a name, so there are many users of CSVs who can say they will never need to consider data containing a record separator, and they will be right more often than those who never consider data containing a comma.

Of course, the fact that record separators aren't on keyboards is probably why CSVs use commas.

yardshop
5 replies
1d16h

In the DOS days, you could "type" control characters by pressing Ctrl and the corresponding letter key, Ctrl+M is Carriage Return, Ctrl+H is Backspace, Ctrl+Z is End Of File, etc.

It was probably possible to type an RS with Ctrl+^ (i.e. Ctrl+Shift+6) and the others with similar combos.

penteract
1 replies
1d15h

In a desktop linux terminal, Ctrl-^ or Ctrl-~ work for me. In a tty, I need to press Ctrl-V before them.

jart
0 replies
1d12h

Yeah Linux still works exactly this way. The modern WIN32 API even works that way too. When you ReadConsoleInput() it gives you teletypewriter style keyboard codes. When I wrote a termios driver for Cosmopolitan to have a Linux-style shell in CMD it really didn't take much to translate them into the Linux style. We're all still using glorified teletypes at the end of the day. It will always be the substrate of our world. One system built upon another older system.

jki275
1 replies
1d16h

You can still type them -- Alt + 030 (for instance) on the numeric keypad will insert that RS character. In Windows at least -- not sure about the other OSes.

Symbiote
0 replies
1d12h

On Linux terminals entering control characters is done with the control key, Ctrl-G for example, but they will often be intercepted by the program that is running.

Bash will insert the control character (rather than interpret it) if you prefix it with Ctrl-V.

_flux
0 replies
1d10h

I think it's worth mentioning that Ctrl-A is ascii 1, Ctrl-B ascii 2, etc, as it is in Unix today.
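That mapping is just bit arithmetic: Ctrl clears bits 6 and 5 of the character code (a quick Python check):

```python
# Terminals compute Ctrl+<key> by masking the ASCII code with 0x1F,
# i.e. clearing bits 6 and 5 -- hence Ctrl-A is 1, Ctrl-B is 2, etc.
assert ord("A") & 0x1F == 1    # Ctrl-A -> SOH
assert ord("G") & 0x1F == 7    # Ctrl-G -> BEL
assert ord("M") & 0x1F == 13   # Ctrl-M -> CR
assert ord("^") & 0x1F == 30   # Ctrl-^ -> RS (record separator)
```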

thaumasiotes
5 replies
1d16h

Also it's less likely to legitimately occur in something like a name, so there are many users of CSVs who can say they will never need to consider data containing a record separator, and they will be right more often than those who never consider data containing a comma.

No, they'll be right exactly as often, 0% of the time.

But their mistake will show up less frequently, causing more problems when it does.

As soon as it's possible for some of your data to come from someone else's dataset, you're guaranteed to have to accommodate record separators within your data as well as within the metadata. You're better off using a system that plans for this inevitability than one that pretends it can't happen at all.

penteract
4 replies
1d16h

No, they'll be right exactly as often, 0% of the time.

But their mistake will show up less frequently, causing more problems when it does.

Enough people use CSVs (and have limited, small-scale use-cases) that I'd be willing to bet "less frequently" means never for at least 1% of people who use CSVs.

I don't know whether the chance of no problems is worth the increased difficulty of problems that do occur - considering that balance feels a bit silly because if you're aware there could be a problem in a context where you could choose between commas and unit separators, you could just add validation or escaping.

thaumasiotes
3 replies
1d16h

considering that balance feels a bit silly because if you're aware there could be a problem in a context where you could choose between commas and record separators, you could just add validation or escaping.

As soon as you have validation or escaping, having a record separator character loses its entire purpose. The existence of the character is predicated on the idea that you don't have to do that, and that idea is false.

That's why the character is never used. It's a conceptual mistake that was accidentally enshrined in a series of encoding standards that had enough free space to accommodate it.

penteract
1 replies
1d15h

As soon as you have validation or escaping, having a record separator character loses its entire purpose. The existence of the character is predicated on the idea that you don't have to do that, and that idea is false.

I disagree with this - the data needs to be stored somehow, and while other characters (like comma) can be used, having a dedicated character can help - for example if the data might legitimately contain commas or newlines but not unit separators or record separators, then escaping isn't needed if you use unit/record separators (although validation is still necessary).

Symbiote
0 replies
1d12h

I agree.

TSV is widely used, but lacks a way to escape the tab and newline characters. RS-V would be the same, but would allow including tabs and newlines in records.

Dylan16807
0 replies
22h33m

As soon as you have validation or escaping, having a record separator character loses its entire purpose.

Not true. Validation is easier than escaping.

keybored
2 replies
1d11h

I can’t think of a case where someone would write a control character like that into something intended for text on purpose. So you might as well disallow it.

jerf
1 replies
1d3h

The situation that comes up most often is when someone embeds the same sort of file into itself, or chunks of the same sort of file into itself. If using the ASCII characters to delimit fields were common, you'd need to consider that, over the lifetime of some moderately interesting system, the odds of someone copying and pasting something from an encoded file into the spreadsheet application, and picking up the ASCII control characters with it, are basically 100%. And while we may be able to say with some confidence that nobody is going to embed a CSV file into a CSV file (and I say only some confidence; the world is weird and I'm sure someone reading this has actually seen it done), there are other situations, like HTML-in-HTML (for example, every HTML tutorial ever), that are guaranteed by their nature.

It is still valid to disallow the ASCII control characters, one just has to make sure that it is done comprehensively, in all places users may input them. But that's not created by using ASCII control characters, that's a consequence of the "ban the control characters entirely" approach regardless of what the control characters are.

It's neat when you can get away with it, but I generally prefer to define a robust encoding scheme instead. A minimal one - "replace backslash with double backslash, replace control characters with backslashed characters" for encoding, and "replace backslash sequences with their control characters, including backslash-backslash with a single backslash" for decoding - can be inserted almost anywhere in just a few lines of string replacement (or stream processing if you need the speed). The only tricky bit is that you need to get the order correct or you corrupt data, and while I've done this enough to have it almost memorized, I recall feeling like the correct order was backwards from what I naturally wanted the first few times. But it is simple and robust if you get it right.
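A minimal sketch of such a scheme in Python (the `\r`/`\u` sequence names for RS/US are illustrative, not any standard). Note that escaping must double backslashes first, and that decoding uses a single left-to-right pass to dodge the ordering trap:

```python
RS, US = "\x1e", "\x1f"   # record and unit separators

def escape(field: str) -> str:
    # Backslash must be doubled FIRST, or the later escapes get re-escaped.
    return (field.replace("\\", "\\\\")
                 .replace(RS, "\\r")
                 .replace(US, "\\u"))

def unescape(field: str) -> str:
    # A single pass avoids ambiguity: backslash-backslash-u must decode to
    # a literal backslash followed by "u", not to a unit separator.
    out, i = [], 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            out.append({"\\": "\\", "r": RS, "u": US}.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Round-trips cleanly, and escaped text never contains a raw separator.
assert unescape(escape("\\u" + RS)) == "\\u" + RS
assert RS not in escape("a" + RS + "b")
```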

keybored
0 replies
1d2h

Someday I will create both formats: a control-characters are banned format (and never accepted) and one where they are escaped. That ought to be good enough for all needs!

(A trivial evening project for some; not for all of us)

keybored
0 replies
1d11h

What do you do if you have a record that includes a record separator character?

This comes up every time. Options:

1. You disallow it. And you might as well disallow all the control codes except the carriage return, line feed, and other “spacing” characters. Because what are they doing in the data proper? They are in-band signals.

2. You use the Escape character to escape them

3. Weirdest option: if you really want to nest in a limited way you can still use the group and file separator characters

NoMoreNicksLeft
0 replies
1d2h

Well, that's what an escape is for. Are we really having a serious discussion in 2024, where someone is suggesting that it's not the responsibility of the software engineer to sanitize inputs before chucking the data into some sort of database?

AdamH12113
0 replies
1d15h

> What do you do if you have a record that includes a record separator character?

You use the ASCII escape character (0x1B), which is designed for exactly that purpose.

EvanAnderson
4 replies
1d16h

I did some ETL work that used the ASCII delimiter characters. It was very enjoyable. I didn't have to worry about escaping or parsing escaped strings. The control codes were guaranteed to be illegal in input. It was refreshing.

theamk
3 replies
1d16h

Could you do the same with TSV? A lot of datasets can either prohibit tabs in the data, or convert them to spaces during early ingestion.

EvanAnderson
1 replies
1d14h

TSV is a joy compared to CSV, for sure. CLI tools that output TSV are what immediately spring to mind.

red_admiral
0 replies
1d8h

Yes, and as long as you remember to turn off the "TAB produces 4 spaces" thing in your editor (grumble makefiles grumble) it's really nice to work with.

yencabulator
0 replies
1d1h

Ah, Deborah␞ Records. Little Debbie Records, we call her.

tracker1
0 replies
1d3h

That's my thought as well... I remember using them in the pre-XHR web era to send data from the server to JS, which I could then split up pretty easily on the client side. I still don't know why we are so tethered to CSV.

red_admiral
0 replies
1d8h

As long as your data is not binary, so does not contain record separators itself, this would be a thousand times better than CSV (because text data _does_ often contain commas and double quotes).

The only thing you'd need is for editors to support some way of entering and displaying the RS, and Ctrl+^ is a bit of a kludge, as it ends up being Ctrl+Shift+6.

Of course, if a record itself can contain RS for subrecords, things become more complicated. I guess you could use `\^`.

fukawi2
0 replies
1d9h

I recall working on a PICK D3 system, which was a "multivalue" database. Each field could have multiple values, those values could have sub values, and a third level beyond that.

Values were separated with char(254), subvalues were separated with char(253), and the third level were char(252) separated.

It was... unique, but it worked. And to be fair, PICK originated in the '60s, so this method probably evolved in parallel with the ASCII table!

jolmg
16 replies
1d17h

So when you’re reading 7-bit ASCII, if it starts with 00, it’s a non-printing character. Otherwise it’s a printing character.

The first printing character is space; it’s an invisible character, but it’s still one that has meaning to humans, so it’s not a control character (this sounds obvious today, but it was actually the source of some semantic argument when the ASCII standard was first being discussed).

Hmm... interesting that space is considered a printing character while horizontal tab and newline are control characters. They're all invisible and move the cursor, but I guess it makes sense. Space is uniquely specific in that it moves the cursor exactly one character cell, so it's like an invisible printed character. Newline can imply movement either straight down, or down and to the left, depending on configuration or platform (e.g. DOS vs UNIX line endings). Horizontal tab can also move you a configurable amount rightwards, so perhaps it was thought of a bit differently, given there's also a vertical tab, which I've got no idea how it was used. Maybe it's the newline equivalent for tables, e.g. "id\tcolor\v1\tred\v2\tblue\v" or something like that.

Interesting also that BS is a control char while DEL is a printing(?) char. I guess that's because BS implies just movement leftwards over the text, while DEL is all ones like running a black sharpie through text. Guess that's what makes it printing. Wonder if there were DEL keys on typewriters that just stamped a black square, and on keypunchers that just punched 7 holes, so people would press "backspace" to go back then "delete" to overwrite.

I've used ASCII a lot, but even after so many years, I'm getting moments where it's like "oh this piece isn't just here, it needs to be here for a deep reason". It's like a jigsaw puzzle.

pwg
5 replies
1d17h

You also have to keep in mind the "interface" for 1962-1968. The printer teletype machine.

The "control codes" were to "control" the printhead. So "carriage return" meant move the "print carriage" back to the left margin. "New line" meant move the paper platen one line height of rotation to move the paper to the next line. In that context, "back space" was "move print head one space left" (rather more like a "reverse space"). The article does mention that there was some debate about whether space should be considered "printable", but if you consider a mechanical printer, as the head is moving to the right and banging out characters onto the paper, the spaces between words do, sort of, look like "printables" (of a sort, a "print nothing" character as it were).

Tab's being control characters then make a bit more sense, in that they cause the printhead to jump some fixed distance to the right.

The article stated why DEL is where it is (all ones) -- so that for punched paper tape, one could get a punch-out of every position, which was then interpreted as "nothing here" by the tape reading machine.

As for typewriters, no, none had a "black box" blot out key. Correction (for typewriters without built in correction tape) was one of: retype the page, apply an eraser (and hopefully not damage the paper surface too much) then retype character and continue, or apply correction fluid (white-out) and retype character and continue.

For those typewriters with built in correction tape options (at least some IBM Selectric models, possibly more) the typewriter would retype the character using the "white-out" ribbon, then retype the replacement character using the normal "typewriting" ribbon.

tivert
1 replies
1d11h

Tab's being control characters then make a bit more sense, in that they cause the printhead to jump some fixed distance to the right.

Isn't that incorrect? Tab doesn't jump a "fixed distance to the right," it jumps a variable distance to the next tab-stop to the right.

bandie91
0 replies
1d9h

Yeah, he must have meant that it jumps to a fixed position.

EvanAnderson
1 replies
1d16h

The article stated why DEL is where it is (all ones) -- so that for punched paper tape, one could get a punch-out of every position...

I saw an analogous use of backspace on some OS I ran into 30 years ago cruising around either Tymnet or TELENET. (I wish I could remember the OS...)

The password prompt assumed local echo. After entering a password the host would send a series of backspaces and various patterns of characters (####, **, etc) to overprint the locally-echoed (and printed) characters.

kmoser
0 replies
1d13h

On the login to the first timesharing system I used, it would prompt for your password, then type eight M's, W's, and X's on top of each other (on paper, of course, since this was using a Teletype terminal), so when you actually typed your password the characters would be printed on top of those already obscured lines.

rob74
0 replies
1d13h

For those typewriters with built in correction tape options (at least some IBM Selectric models, possibly more) the typewriter would retype the character using the "white-out" ribbon

there was also a solution for cheaper typewriters: small sheets of "white-out" paper (known under the genericized brand name "Tipp-Ex" here in Germany) that you could hold between the ink ribbon and the paper to "overwrite" a typo.

kragen
5 replies
1d15h

del is not a printing character. it's a control character. if you run a paper tape full of del characters through a teletype it does not print anything. it has to have that bit pattern, even though it greatly complicates the mechanics of the teletype (which has to do all the digital logic with cams and levers) because that way it can be punched over any character on the paper tape to delete it

a figure caption in this page says 'This is a historical throwback to paper tape, where the keyboard would punch some permutation of seven holes to represent the ones and zeros of each character. You can’t delete holes once they’ve been punched, so the only way to mark a character as invalid was to rewind the tape and punch out all the holes in that position: i.e. all 1s.' which is mostly correct, except that it wasn't a historical throwback; paper tape was perhaps the most important medium for ascii not just in 01963 and 01967 but probably in 01973, maybe even in 01977. teletype owners today are still using paper tape that was manufactured during the vietnam war, where it was used in unprecedented volume for routing teletype messages by hand

the dominant early pc operating system, cp/m (if it's not overly grandiose to call it an 'operating system') had system calls for reading and writing the console, the disk, and the paper tape punch and reader. when i hooked up a modem to my cp/m system to call bbses, i hooked it up as the punch and reader

91bananas
2 replies
1d

just... this is why this forum exists. thank you

kragen
0 replies
4h58m

wow, i sure didn't have to wait long; in this case it's someone who's harassed me repeatedly and who uses the site mostly for political flamewars: https://news.ycombinator.com/item?id=41056718

jart
1 replies
1d12h

so the only way to mark a character as invalid was to rewind the tape and punch out all the holes in that position

So that's why \177 (DEL) is the loneliest control character. Wow. Thank you!

kragen
0 replies
1d11h

happy to help

nikau
0 replies
1d10h

Logically, space maps to a character people use with pen and paper, unlike tab.

layer8
0 replies
1d8h

Space is what is represented in the output, i.e. in one cell of the terminal grid, whereas control characters like Tab and CR/LF don’t map onto such an output representation. If you want to represent the printed contents of each “grid cell” of a printout or a textmode screen buffer, you don’t need the control characters, only the printable characters. The printable characters are what you’d need in a screen font.

kazinator
0 replies
1d14h

Space doesn't just move the cursor on a display; it will obliterate a character cell with a space glyph.

When a display terminal has nondestructive backspace (backspace character doesn't erase), it can be software emulated with BS-SPACE-BS.

At your Linux terminal, you can do "stty echoe" (echo erase) to turn this on (affecting the echoing of backspace characters that are input, not all backspace characters).

Dial-up BBSes had this as a configurable setting also.

california-og
0 replies
1d10h

While DEL didn't stamp a black square on typewriters, it sometimes did so (or something similar, like diagonal stripes) in various digital character sets. ISO 2047[0] established the graphical representations for the control characters of the 7-bit coded character set in 1975, mainly for debugging reasons. This graphical representation for DEL was used by the Apple IIGS, TRS-80 and even the Amiga!

[0]: https://en.m.wikipedia.org/wiki/ISO_2047

augusto-moura
12 replies
1d11h

Useful tip, on linux (not sure about other *nixes) you can view the ascii table by opening its manpage:

  man ascii
It's been useful to me more than once every year, mostly to know about shell escape codes and when doing weird character ranges in regex and C.

It can be a bit confusing, but the gist is that you have 2 chars being shown in each line. I would prefer a view where you see the same char with shift and/or ctrl flags, but you can only ask so much

fitsumbelay
3 replies
1d1h

strange: on MacOS 14.5 I get output for `man ascii` but `ascii` goes "command not found"

augusto-moura
1 replies
18h29m

Not sure why anyone would downvote your comment, because it is a genuine question

`man` is basically manual documentation on anything on your system, not only commands. Most commands do have a manpage for them, but it is not a requirement. The argument of the command is just the file name for the document

aff0
0 replies
17h42m

Indeed `man ascii` (on MacOS but the same for Linux for the most part) shows the manpage for 'ASCII(7)' - the '7' denotes the section of the manual the manpage is from. If you use `man man`, you can see the section numbers and names, e.g. Section 1 General Commands, 5 File Formats, 7 Misc Info, 8 System Manager's Manual. If a word, e.g. 'crontab', has multiple entries in different sections, then you might have to specify the section you want, e.g.`man crontab` shows the crontab(1) (General Command) and use `man -s 5 crontab` to see the crontab(5) (File Format). `apropos crontab` will show entries related to crontab, i.e. cron(8), crontab(1), and crontab(5).

AnimalMuppet
0 replies
1d1h

On my Linux VM, it's the same, and it's because 'man ascii' comes from man(7), not man(1). It's not a man page for a program. It's just a man page.

dailykoder
1 replies
1d10h

Damn, thanks!

Why the hell did I never try this? Maybe because typing ascii table into my favorite search engine and clicking one of the first links was fast enough

omnicognate
0 replies
1d8h

I used to do that until the experience became degraded enough, reflecting the general state of the web, that I took the time to look for a better way and found `man ascii`.

INTPenis
1 replies
1d9h

The reason I know this is because in 2004 I was squatting in an apartment with no TV and no internet. So each day after work I would go home and just read manpages for fun.

Ended up learning ipfw through the firewall manpage on FreeBSD, and using my skills to setup and manage an IPFW at work.

It's amazing how much you get done with no TV and no internet. Also played a lot of nethack.

w0m
0 replies
1d

I learned vim proper by reading :help on an eeepc while flying back and forth over the Atlantic alone one year.

irrational
0 replies
22h51m

Works on mac

bell-cot
0 replies
1d10h

Similar in FreeBSD. It has octal, hex, decimal, and binary ASCII tables, along with the full names of the control characters.

KingOfCoders
8 replies
1d14h

For everyone who doesn't need ä,ü,ö. Or software that needs to take ä,ü,ö. For everyone else, UTF is a blessing.

lmm
4 replies
1d12h

For everyone else, UTF is a blessing.

Except people who want to use Japanese and not have it render weirdly, something that was easy in any internationalised software that used the traditional codepage system, but is practically impossible in Unicode-based software.

Retr0id
2 replies
1d12h

Where can I learn more about this issue?

p_l
0 replies
1d12h

Probably referring to so-called "Han unification" which tried to use same codepoints for different glyphs to reduce code space for ideograms derived from Chinese ones.

But that only causes confusion because you need to provide external information which way to interpret them, just like a code page

eviks
0 replies
7h30m

How is it impossible if Unicode has language tags?

bigstrat2003
1 replies
1d13h

Which, given the people who designed this and the time they were designing for, was most of them (and most of their audience). Don't confuse "this old standard doesn't adequately cover all cases today" with "this old standard sucked at the time".

eviks
0 replies
7h31m

Don't confuse "this old standard sucked at the time for missing the obvious at the time" with "I can come up with some excuse"

account42
0 replies
3h48m

Here, take these: ae ue oe

zokier
6 replies
1d11h

I think that adopting ASCII as the general purpose text encoding was one of the great mistakes of early computing. It originated as control interface for teletypes and such, and that's arguably where it should have remained. For storing and processing (plain) text ASCII doesn't really fit that well, control characters are a hindrance and the code space would have been useful for additional characters. The ASCII set of printables was definitely a compromise formed by the limited code space.

ddingus
2 replies
1d10h

No way!

No amount of extra characters was going to address what Unicode did.

ASCII was not a mistake at all. Adopting it unified what was surely going to be a real mess.

At the time it made sense, and the control functions were needed. Still are.

zokier
1 replies
1d6h

At the time it made sense, and the control functions were needed. Still are.

Control characters were needed for terminals. They never made sense for text. Mixing the two matters is the problem.

ddingus
0 replies
1d2h

It isn't a problem. The text is the UX.

What else would you have proposed, or would propose?

kstenerud
1 replies
1d11h

It's one of the greatest triumphs of early computing. Not only did it harmonize text representation and transmission in a backwards compatible manner; the fact that they deliberately kept it 7 bit for so long also helped for developing a sane set of other language character sets (ISO-8859), and paved the way for a smooth transition to Unicode (UTF-8) - which is now the dominant encoding worldwide.

ddingus
0 replies
1d10h

Yes, seconded easily

hackit2
0 replies
1d11h

Yeah, you were not around when a kb of memory took up half your room. Looking back it doesn't make sense, but at the time a byte was whatever you wanted it to be. Considering the English alphabet has 26 characters, it was reasonable for a byte to be 5 bits, giving you a total of 32 possible states, which leaves you with 6 values that could be used as control characters. However, let's not forget there are 7,164 other languages in the world, and they all have their own unique way of doing things.

Oh yeah, let's not forget that at the time you had nationalistic countries/territories/people with their own superior technology all vying for the top position, all while trying to outdo each other. Then you also had manipulative monopolies, trade embargoes and wars.

It isn't perfect but people aren't perfect.

lucasoshiro
6 replies
1d19h

Once I saw a case-insensitive switch in C using that pattern of letters:

  switch (my_char | 0x20) {
      case 'a': ...
          break;
      case 'b': ...
          break;
  }

Sharlin
4 replies
1d11h

Yes, that’s very intentional and just masking (or setting) the bit is the intended way to do case-insensitive comparison of the letter range in ASCII (eg. stricmp in C), or to transform text to lower or upper case (tolower, toupper).

But what’s more, ever wondered whence the control (Ctrl) key presses like Ctrl-H to backspace, or Ctrl-M for carriage return? Well, inspecting the ASCII chart it becomes evident: the Ctrl key simply masks bit 6 (0x40), turning a letter into its respective control character!

lucasoshiro
1 replies
1d4h

Nice!

I'm an emacs user, and when I use a readline-based REPL I use ctrl-M a lot. I thought it was inherited from the emacs keybindings, like many other shortcuts from GNU readline

jerf
0 replies
1d3h

Then an additional useful command: In the out-of-the-box emacs bindings, C-q is the "quoted insert" command. It will take the next character and directly insert it into the buffer. This is useful for things like tab or control characters where emacs would normally use the keystroke to do something else. I've been working in an email-related space lately so I've been doing a good amount of C-q C-m for inserting literal CRs, and C-q TAB for a few places where I want a literal tab in the source, in a buffer that interprets a normal TAB as a command to indent the current row. I mention this because you can use the ASCII table to work out how to insert a particular control character with your keyboard literally, if you need to insert one of the handful of other characters you may be interested in every so often, like C-l for "form feed" (now used for "page feed" in some older printer-related contexts) or C-@ for NUL if you're doing something weird with binary files in a "text" buffer.

flohofwoe
1 replies
1d1h

...it's a bit of a shame that the same upper/lowercase trick doesn't apply to all UNICODE codepoints (at least those that have upper/lower variants).

It seems to work for codepoints up to U+00FF, for instance:

    - Å (U+00C5) vs å (U+00E5)
...but above 0xFF lowercase follows uppercase:

    - Ă (U+0102) vs ă (U+0103)
Typical for UNICODE though, nothing makes sense ;)

Findecanor
0 replies
23h32m

That's because U+00A0–U+00FF are encoding an earlier character set: "ISO Latin-1" (ISO 8859-1), itself based on DEC's "Multinational Character Set". The upper/lowercase trick does not apply to ß/ÿ but does in MCS where Ÿ/ÿ are at a different pair of code points.

ISO Latin-1 was the character set on many Unix systems, Amiga OS, MS-Windows (as "Windows-1252" with extra chars), and was for many years the default character set on the web.

mananaysiempre
0 replies
1d18h

This can be made to work for ASCII and EBCDIC simultaneously for extra esoterica points:

  switch (my_char | 'A' ^ 'a') {
  case 'A' | 'a': /* ... */ break;
  /* ... */
  }
I don’t know if this is too fancy to have ever made it into real code, but I believe I’ve seen places in the ICU source that still say ('A' <= x <= 'I' || 'J' <= x <= 'R' || 'S' <= x <= 'Z') instead of just ('A' <= x <= 'Z'), EBCDIC letters being arranged in those three contiguous ranges.

gumby
5 replies
1d15h

I wish the author had included the full ascii chart in 4 bits across / 4 bits down. You can mask a single bit to change case and that is super obvious that way.

The charts that simply show you the assignments in hex and octal obscure the elegance of the design.

kalleboo
2 replies
1d15h

It was at some point looking at a chart like that where it also dawned on me where the control codes like ^D, ^H, ^[ etc came from

kr2
1 replies
1d11h

I was going to ask you to please explain as I didn't understand, but I am guessing you are talking about the same thing as this comment[1] right? That's super cool

https://news.ycombinator.com/item?id=41042570

AdamH12113
1 replies
1d15h

The third and fourth columns of the table are only a single bit apart from each other. If you mentally swap the first two columns, you get a Gray code ordering of the most significant bits, which is pretty close to what you're looking for.

ggm
5 replies
1d18h

  man ascii
is never far from my fingers. combined with od -c and od -x it gets the job done. I don't think as fluently in Octal as I used to. Hex has become ubiquitous.

fsckboy
4 replies
1d13h

you mean ?

    ascii

ggm
2 replies
1d13h

No I don't -I live in a different universe to you:

  % (uname; cd /usr/ports; ls -d */ascii)
  FreeBSD
  zsh: no matches found: */ascii
  % which ascii
  ascii not found
  %
It's the same on OSX and debian by default doesn't install that command. If you live inside a POSIX/IEEE 1003 system and want to know the ascii table reliably then the command I run is the one which works. If your distribution doesn't ship manuals by default you have bigger problems.

amszmidt
1 replies
1d10h

”man ascii” has as much guarantee to work on a POSIX system as a command called ”ascii”, since neither (specifically, a man page called ”ascii”) is part of the standard.

So you will either get command not found, or man page not found.

account42
0 replies
3h55m

Neither being guaranteed does not mean both have the same likelihood of existing.

The man page comes preinstalled on most modern non-embedded POSIX systems. The command does not.

bandie91
0 replies
1d1h

man 7 ascii

snvzz
4 replies
1d11h

The ASCII table is defective; it is missing a dedicated code for newline.

CR and LF aren't dedicated, and have precise cursor movement meanings, rather than being a logical line ender.

There was a proposal in the 80s to reassign the (otherwise useless) VT (vertical tab) character for the purpose. Unfortunately unfruitful.

Gormo
2 replies
1d

Unfortunately unfruitful.

Fortunately unfruitful, since if it had gained adoption, there'd be a mix of three different line endings (and combinations thereof) in widespread use, instead of two.

snvzz
0 replies
19h21m

widespread use

We are talking about the early 80s at worst. It would have consolidated by now.

snvzz
0 replies
11h8m

there'd be a mix of three

We already got four. \n, \r, \n\r and \r\n.

bregma
0 replies
1d7h

A separate control character was not needed to indicate where your Hollerith string ended: it ended at the end of the Hollerith string. If you wanted to render a Hollerith string onto print media, you'd often want to feed the line and then return the carriage before printing the next Hollerith string. Of course, that wasn't strictly necessary if you were using a line printer, which would just print the line and advance.

The filesystems I used had 5 kinds of file: random-access, sequential, ISAM, Fortran-carriage-control and carriage-return-carriage-control. The only people who used the latter were the eggheads that used that new-fangled C programming language brought over from Bell Labs' experimental Unix system.

You're probably just looking for the record separator (036). If you are storing multiple text records in a block of memory, that would be the ideal ASCII code to separate them.

theamk
2 replies
1d15h

It's really not.

In base-2 machines, the letters are mixed with punctuation, which is pretty horrible design which makes simple things complex, and does not actually bring anything new to the table.

In BCD machines it is slightly better, except letters aren't contiguous either - row 0 is bad, but it's the extra space between R and S which is really ugly. And it's unusable with BCD operations anyway, as high nibble values are used extensively.

Naive sorting simply does not work... lowercase before uppercase, punctuation in the middle of the alphabet, numbers after letters.

I see no elegance there, it's like the worst example of legacy code.

jiveturkey
1 replies
1d

It was designed for a specific purpose ... elegance in context

theamk
0 replies
23h44m

Which would that "specific purpose" be? Even punch cards have the alphabet interrupted between R and S.

And the whole punch card -> 8-bit mapping is pretty illogical, just like the cards themselves. How come "no punches in the zone rows" doesn't correspond to 0 high bits?

(and don't get me started on punch cards.. it started with "let's do 1 hole per column for digits" - OK, makes sense; then "let's do 2 holes/column for uppercase" - I guess OK, but why did you put an extra char in the middle... but then it's 4 holes/column for superscripts? 3-6 holes/col for punctuation? If someone were to design punch cards today using the same requirements, they could easily come up with a much more logical scheme)

gerdesj
0 replies
1d17h

It's shit if you don't routinely speak or write English. On those grounds, I'll decry it as not only shit but purposely shit.

OK, a bit over the top ... the designers of EBCDIC had a rather tight set of constraints to deal with, none of which included: "be inclusive". Again, if I really had to be charitable (I looked after a System/36, back in the day), the hardware was rather shit too, sorry ... constrained. Yes, constrained. Why should six inch fans fire up reliably after a few years of use and not need a poke after an IPL? No real dust snags, and I carefully sprayed some WD40 on the one that I could get at. I have modern Dells and HPs in horrid environments that do better with shitty plastic fans.

EBCDIC is not elegant at all unless excluding non English characters in an encoding system is your idea of elegant.

According to this: https://en.wikipedia.org/wiki/EBCDIC it expended loads of effort with dealing with control eg: "SM/SW" instead of language.

ASCII and EBCDIC and that basically say: fuck you foreigners!

We now have hardware that is apparently capable of messianic feats. Let's do the entirety of humanity some justice and really do something elegant. It won't involve EBCDIC.

thristian
3 replies
1d16h

That, I’m afraid, is because ASCII was based not on modern computer keyboards but on the shifted positions of a Remington No. 2 mechanical typewriter – whose shifted layout was the closest compromise we could find as a standard at the time, I imagine.

According to Wikipedia¹, American typewriters were pretty consistent with keyboard layout until the IBM Selectric electric typewriter. Apparently "small" characters (like apostrophe, double-quote, underscore, and hyphen) should be typed with less pressure to avoid damaging the platen, and IBM decided the Selectric could be simpler if those symbols were grouped on dedicated keys instead of sharing keys with "high pressure" symbols, so they shuffled the symbols around a bit, resulting in a layout that would look very familiar to a modern PC user.

Because IBM electric typewriters were so widely used (at least in English speaking countries), any computer company that wanted to sell to businesses wanted a Selectric-style layout, including the IBM PC.

Meanwhile, in other countries where typewriters in general weren't so popular or useful, the earliest computers had ASCII-style punctuation layout for simplicity, and later computers didn't have any pressing need to change, so they stuck with it. Japanese keyboards, for example, are still ASCII-style to this day.

¹: https://en.wikipedia.org/wiki/IBM_Selectric#Keyboard_layout

Mountain_Skies
0 replies
1d7h

So many things on the CoCo turned out to be the way they were for cost saving reasons. Tandy was good at saving pennies everywhere it could. When I took 'Typing' in high school, my muscle memory was in a constant fight between the IBM Selectric layout of the typewriters at school and the CoCo at home.

th0ma5
3 replies
1d17h

I heard someone describe the ASCII table as a state machine. Guess I could understand that as a state machine needed to parse it? This is surprisingly hard to search for but I was wondering if anyone knows what they were talking about.

kevin_thibedeau
0 replies
1d16h

Bespoke hardware for text handling isn't a thing these days but would have been in the 60's and 70's. A table layout that can be easily decoded in hardware simplifies the necessary circuitry for responding to control characters or converting binary numbers to/from decimal when the microprocessor hadn't been invented yet.

gumby
0 replies
1d7h

It was originally implemented in actual hardware (rods and bars). Just look inside a teletype, like a KSR-23 (pre ascii, but similar)

senkora
3 replies
1d2h

Fun fact: sorting ASCII numerically puts all the uppercase letters first, followed by all the lowercase letters (ABC... abc...). A more typical dictionary ordering would be more like AaBbCc... (or to even consider A and a at the same sort level and only use them to break ties if the words are otherwise identical).

The order used by ASCII is sometimes called "ASCIIbetical", which I think is wonderful.

https://en.wiktionary.org/wiki/ASCIIbetical

NoMoreNicksLeft
2 replies
1d2h

I thought the point of that was that a single bitflip makes an uppercase lower, or vice versa...

Dylan16807
1 replies
22h28m

It can't be "the" point, because AaBbCc would also let you use a single bit to control case, the bottom bit.

NoMoreNicksLeft
0 replies
2h11m

The first bit was already taken at that point. The original 6-bit code (or whatever that predecessor was called) only did uppercase. A 7th (and later, 8th) bit was added to give it lowercase.

georgehotelling
2 replies
1d5h

Dark grey #303030 text on slightly darker grey #1B1C21 background is really hard to read. Maybe I'm just getting old, but I also assume the audience for a blog post about the ASCII table was born in a year that starts with 19.

Retr0id
1 replies
1d5h

The background is white on my machine, are you using some kind of extension to force "dark mode"?

georgehotelling
0 replies
1d5h

I'm using pi-hole, uBlock Origin, and Privacy Badger on Firefox. I checked my network tab before complaining and didn't see any resources that failed to load.

userbinator
1 replies
1d14h

You might be familiar with carriage return (0D) and line feed (10)

You mean 0D and 0A, or 13 and 10, but that mix of bases really stood out to me in an otherwise good article. I'm one of numerous others who have memorised most of the base ASCII table, and quite a few of the symbols as well as extended ASCII (CP437), mainly because it comes in handy for reading programs without needing a disassembler. Those who do a lot of web development may find the sequence 3A 2F 2F familiar too, as well as 3Ds and 3Fs.

I can see the rationale for <=> being in that order, but [\] and {|} are less obvious, as well as why their position is 1 column to the left of <=>.

yardshop
0 replies
1d6h

You mean 0D and 0A, or 13 and 10

He fixed it.

rrwo
0 replies
1d6h

Thanks for posting that.

People tend to overlook that the technologies we use today have a much older history.

wduquette
0 replies
1h45m

Regarding paper tape, our first home computer (this was in the mid-to-late 70's) had a paper tape reader and punch. I do not miss paper tape as a storage medium, but I have to say the little punched-out dots were fun to use as confetti at high school football games.

transfire
0 replies
1d18h

One downside of ASCII is the lack of two extra “letters” (whatever they might be, e.g. perhaps German ß), as it makes it impossible to represent base 64 alphanumerically. So we ended up with many alternatives picking two arbitrary punctuation marks.

renox
0 replies
10h17m

I still think that they made a big mistake in not having the letters immediately following the numbers, this would have made printing numbers in hexadecimal much more efficient.

red_admiral
0 replies
1d8h

The "16 rows x 8 columns" version, with the lowercase letters added, seems the most elegant one to me because it makes the internal structure of the thing visible. For example, to lowercase a letter, you set bit 6; a decimal digit is the prefix 011 followed by the binary encoding of the digit etc.

It also makes clear why ESC can be entered as `^[` or ENTER (technically CR) as `^M` on some terminals (still works in my xterm), because the effect of the control key is to unset bits 6 and 7 in the original set-up.

Of course you can color in the fields too, if you want.

netcraft
0 replies
1d15h

I've searched off and on for a great stylistic representation of the ASCII table, id love a poster to hang on my wall, or possibly even something I could get as a tattoo.

kragen
0 replies
1d15h

unfortunately this page is based on mackenzie's book. mackenzie is the ibm guy who spent decades trying to kill ascii, promoting its brain-damaged ebcdic as a superior replacement (because it was more compatible, at least if you were already an ibm customer). he spends most of his fucking book trumpeting the virtues of ebcdic actually

bob bemer more or less invented ascii. he was also an ibm guy before mackenzie's crowd pushed him out of ibm for promoting it. he wrote a much better book about the history of ascii which is also freely available online, really more a pamphlet than a book, called "a story of ascii": https://archive.org/details/ascii-bemer/page/n1/mode/2up

tom jennings, who invented fido, also wrote a history of ascii, called 'an annotated history of some character codes or ascii: american standard code for information infiltration'; it's no longer online at his own site, but for the time being the archive has preserved it: https://web.archive.org/web/20100414012008/http://wps.com/pr...

jennings's history is animated by a palpable rage at mackenzie's self-serving account of the history of ascii, partly because bemer hadn't really told his own story publicly. so jennings goes so far as to write punchcard codes (and mackenzie) out of ascii's history entirely, deriving it purely from teletypewriter codes—from which it does undeniably draw many features, but after all, bemer was a punchcard guy, and ascii's many excellent virtues for collation show it

as dwheeler points out, the accomplished informatics archivist eric fischer has also written an excellent history of the evolution of ascii. though, unlike bemer, fischer wasn't actually at the standardization meetings that created ascii, he is more careful and digs deeper than either bemer or jennings, so it might be better to read him first: https://archive.org/details/enf-ascii/

it would be a mistake to credit ascii entirely to bemer; aside from the relatively minor changes in 01967 (including making lowercase official), the draft was extensively revised by the standards committees in the years leading up to 01963, including dramatic improvements in the control-character set

for the historical relationship between ascii character codes and keyboard layouts, see https://en.wikipedia.org/wiki/Bit-paired_keyboard

johanneskanybal
0 replies
1d7h

Kind of hard to read something where the author considers every non-English language about as worthy as emojis. It was good in the '50s but stayed important like 4-5 decades too long.

eviks
0 replies
7h41m

The first 32 “characters” (and, arguably, the final one) aren’t things that you can see, but commands sent between machines to provide additional instructions

Such a waste, and no extensibility, kills any claim to elegance for some shifted binary numbers; that's the wrong end to focus your optimization efforts on

bloak
0 replies
1d5h

Vaguely related: Apart from £ and €, a typical GB keyboard has a couple of non-ASCII characters printed on it: ¬ and ¦. The key labelled ¦ is usually mapped to |, but the key labelled ¬ often gives you an actual ¬, though I can't remember many occasions on which I've wanted one of them. Apparently the characters ¬ and ¦ are in EBCDIC.

blahedo
0 replies
1d13h

Another piece of elegance: by putting the uppercase letters in the block beginning at 0x40 (well, 0x41) it means that all the control codes at the start of the table line up with a letter (or one of a small set of other punctuation: @[\]^_), giving both a natural shorthand visual representation and a way to enter them with an early keyboard, by joining the pressing of the letter with... the Control key. Control-M (often written ^M) is carriage return because carriage return is 0x0D and M is 0x4D.

aronhegedus
0 replies
1d1h

Was a really fun article to read/podcast to listen to.

Favorite fact is that 127 is the DEL because for hole punching it removes all the info. I love those little nuggets of history

PaulHoule
0 replies
1d4h

Beats EBCDIC

https://en.wikipedia.org/wiki/EBCDIC

On the 4th floor of my building the computer systems lab has a glass front that has what looks like a punch card etched in frosted glass but if you look closer it was made by sticking stickers on the glass.

I made a "punchcard decoder" on a 4x6 card to help people decode the message on the wall

https://mastodon.social/@UP8/112836035703067309

The EBCDIC code was designed to be compatible with this encoding which has all sorts of weird features, for instance the "/" right between "R" and "Z"; letters don't form a consecutive block so testing to see if a char is a letter is more complex than in ASCII.

I am thinking of redoing that card to put the alphabet in order. A column in a punched card has between 0 and 3 punches: 0 is a space, 1 is a letter or a symbol in the first column; if one of the rows at the top is punched, you combine that with the number of the other punched row on the left 3x9 grid. If three holes are punched, one of them is an 8 (unless you've got one of the extended charsets) and you have one of the symbols in the right 3x6. Note the ¬ and ¢, which are not in ASCII but are in latin-1.

Dwedit
0 replies
1d19h

Many old NES/SNES games had a simpler character encoding system, with 0-9 and A-Z at the beginning of the table. No conversion require to display hex.

DonHopkins
0 replies
1d2h

The Apple ][ and TTYs and other old computers had "bit pairing keyboards", where the punctuation marks above the digits were aligned with the ASCII values of the corresponding digits, different by one bit.

    Typewriter: !@#$%^&*()
    Apple:      !"#$%&'()
    Digits:     1234567890
https://en.wikipedia.org/wiki/Bit-paired_keyboard

A bit-paired keyboard is a keyboard where the layout of shifted keys corresponds to columns in the ASCII (1963) table, archetypally the Teletype Model 33 (1963) keyboard. This was later contrasted with a typewriter-paired keyboard, where the layout of shifted keys corresponds to electric typewriter layouts, notably the IBM Selectric (1961). The difference is most visible in the digits row (top row): compared with mechanical typewriters, bit-paired keyboards remove the _ character from 6 and shift the remaining &'() from 7890 to 6789, while typewriter-paired keyboards replace 3 characters: ⇧ Shift+2 from " to @, ⇧ Shift+6 from _ to ^, and ⇧ Shift+8 from ' to *. An important subtlety is that ASCII was based on mechanical typewriters, but electric typewriters became popular during the same period that ASCII was adopted, and made their own changes to layout.[1] Thus differences between bit-paired and (electric) typewriter-paired keyboards are due to the differences of both of these from earlier mechanical typewriters.

[...] Bit-paired keyboard layouts survive today only in the standard Japanese keyboard layout, which has all shifted values of digits in the bit-paired layout.

[...] For this reason, among others (such as ease of collation), the ASCII standard strove to organize the code points so that shifting could be implemented by simply toggling a bit. This is most conspicuous in uppercase and lowercase characters: uppercase characters are in columns 4 (100) and 5 (101), while the corresponding lowercase characters are in columns 6 (110) and 7 (111), requiring only toggling the 6th bit (2nd high bit) to switch case; as there are only 26 letters, the remaining 6 points in each column were occupied by symbols or, in one case, a control character (DEL, in 127).

[...] In the US, bit-paired keyboards continued to be used into the 1970s, including on electronic keyboards like the HP 2640 terminal (1975) and the first model Apple II computer (1977).