This is really great, I didn't know you could switch encoding schemes within the same QR code. There's a nifty visualization tool [0] that shows how this can reduce QR code sizes. It can determine the optimal segmentation strategy for any string and display a color-coded version with statistics. Very nice!
0: https://www.nayuki.io/page/optimal-text-segmentation-for-qr-...
Seems like the lede was buried in the article; I know a bit about QR codes: there are different modes for alphanumeric, binary, kanji, etc., and error-correcting capacity...but being able to switch character sets in the middle was new to me.
I am not entirely sure why you would want to switch encodings for URLs, personally. If you use alphanumeric encoding and a URL in Base36, you are pretty much information-theoretically optimal.
The issue is that QR's alphanumeric segments are uppercase only. While browsers will automatically lowercase the protocol and domain name, you'd have to either make all your paths uppercase or lowercase them automatically server-side. On top of that, when someone scans the code they'll likely be shown an uppercase URL (if it doesn't automatically open in a browser), and that may look suspicious to anyone who doesn't already know that uppercase domains are equivalent to lowercase ones.
Ideally QR codes would have had a segment type to encode URIs more efficiently (an alphabet of 73-82 characters, depending on how the implementation decided to handle the "unreserved marks"), but that ship sailed long ago.
Many QR code readers will auto-lowercase URLs that are encoded in alphanumeric encoding. The rest will recognize uppercase URLs just fine. Alphanumeric encoding was basically made for URLs.
The QR alphanumeric input alphabet does not include basic URL query-string characters like '?', '&', or '='.
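A quick way to see this: check a URL against the 45-character alphanumeric alphabet from the QR spec. This is just an illustrative sketch (the example URL is made up), but the alphabet itself is the standard one:

```python
# QR alphanumeric mode's 45-character alphabet (digits, uppercase letters,
# space, and the symbols $ % * + - . / :) -- note no '?', '&', or '='.
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")

def unsupported_chars(text: str) -> set:
    """Return the characters that force a fallback out of alphanumeric mode."""
    return {c for c in text if c not in ALPHANUMERIC}

# Even a fully uppercased URL is disqualified as soon as it has a query string:
print(unsupported_chars("HTTPS://EXAMPLE.COM/SEARCH?Q=QR&PAGE=2"))
```

The path and scheme characters (':', '/', '.') are all fine; it's only the query-string delimiters that fall outside the set.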
I've been putting URLs in QR codes for like a decade, mixed case and query strings included. How has it never been an issue?
Because you used bytes mode, not alphanumeric mode
base36 with alphanumeric mode encoding has around 6.38% overhead compared to base10's 0.34% overhead in numeric mode. So numeric mode gets you closer to optimal.
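Those percentages fall straight out of the bit-packing rules: alphanumeric mode packs 2 characters into 11 bits (5.5 bits/char) while a base-36 character only carries log2(36) bits of information; numeric mode packs 3 digits into 10 bits against log2(10) bits per decimal digit. A quick check:

```python
import math

# Overhead = (bits spent per symbol) / (bits of information per symbol) - 1.
# Alphanumeric mode: 11 bits per 2 chars; a base-36 char carries log2(36) bits.
alnum_overhead = (11 / 2) / math.log2(36) - 1
# Numeric mode: 10 bits per 3 digits; a base-10 digit carries log2(10) bits.
num_overhead = (10 / 3) / math.log2(10) - 1

print(f"base36 in alphanumeric mode: {alnum_overhead:.2%} overhead")
print(f"base10 in numeric mode:      {num_overhead:.2%} overhead")
```

This ignores the per-segment header (mode indicator plus character count), which is a fixed cost that matters more for short payloads.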
Speaking of visualization...that last figure in this post is super interesting in part because you can actually see some of the redundancy in the base64 encoding on the left, in the patterns of vertical lines.
In general, better compression means output that looks more like "randomness"—any redundancy implies there was room for more compression—and that figure makes this quite clear visually!
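The same intuition is easy to demonstrate with any general-purpose compressor: structured input shrinks dramatically, while already-compressed (random-looking) input has no redundancy left to exploit. A small sketch using zlib:

```python
import zlib

# 16 KiB with an obvious repeating structure compresses to almost nothing...
patterned = bytes(range(256)) * 64
once = zlib.compress(patterned, 9)
print(len(patterned), "->", len(once))

# ...but compressing the (random-looking) compressed output again gains nothing.
twice = zlib.compress(once, 9)
print(len(once), "->", len(twice))
```

The first pass removes essentially all the redundancy; the second pass has nothing left to find, which is exactly why well-compressed data looks like noise.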
That’s undoubtedly redundancy in the underlying data, not in the encoding itself.
Yes, the data is the bytes 00, 01, …, FF repeating, and that pattern is highly visible with power-of-2 encodings, but not visible with other bases (for similar reasons that 0.1 as a (binary) float doesn’t behave as people expect).
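The alignment argument can be checked directly: base64 maps every 3 input bytes to 4 output characters, so once the 256-byte input period and the 3-byte grouping realign, after lcm(256, 3) = 768 bytes, the encoded text itself repeats exactly. A small sketch:

```python
import base64

# Two full 768-byte "super-periods" of the repeating 00..FF pattern.
pattern = bytes(range(256)) * 6
text = base64.b64encode(pattern).decode()

# Each 768-byte super-period encodes to the same 1024 characters,
# so the two halves of the base64 output are identical.
half = len(text) // 2
print(text[:half] == text[half:])  # → True
```

A base-10 rendering of the same bytes has no such short period, which is why the pattern washes out in non-power-of-2 encodings.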