Mediocre Engineer's Guide to HTTPS

jessriedel
18 replies
1d

Tangential question from a layman: when I lose access to a particular website, or the internet as a whole, why is it so hard to tell where in the chain the failure is occurring? Like it’s often unclear whether

* I’ve got a network misconfiguration on my local machine;

* My wifi connection to the router is down;

* The cable between my router and ISP is cut;

* My ISP is having large scale issues; or

* The website I’m trying to reach is down.

I’ve been given the vague impression that it has something to do with a non-deterministic path by which requests are routed, but this seems unconvincing. If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

treflop
6 replies
22h27m

It’s possible to figure out exactly what failed if you know how it all works.

But writing a tool that provides a useful description to the user is nearly impossible because no two setups are the same, it’s not possible to know whether something is intentional, and it can be dangerous to make an assumption based on the common causes and suggest a completely wrong answer to the user.

For example, let’s say you can’t connect to a website because the DNS server isn’t responding and the host isn’t responding. You could tell the user that something is probably misconfigured at your router or your ISP is having some issues.

However, it turns out that the actual reason was that your VPN client updated your local routing tables and DNS server but failed to remove the changes when you quit the client. How is a troubleshooter supposed to know that the settings were temporarily changed versus it being the permanent ones?

Once you try to start to write a troubleshooter that can identify the actual cause, you realize that it’s very difficult due to the complexity and variation. At best you can write something that usually spits out a correct answer but also sometimes suggests something totally wrong and leads people down a completely wrong path.

jessriedel
5 replies
20h19m

If Google dedicated 10 engineers full time to this problem for 3 years, could they solve it?

jimkoen
1 replies
19h10m

Yes, and they partially have. Browsers are great at telling you where the chain has failed or been cut, though some error messages seem to be intentionally uninformative, as the details would be meaningless to your average user.

That said, from an enthusiast perspective, running traceroute to the nearest google service (1e100.net for example) will already give you a huge tip on where things went wrong.

chrismorgan
0 replies
7h31m

I regularly run `mtr 1.1` for monitoring network condition. One of its display modes gives you a 3D view: x-axis is time, y-axis is the hops, and each cell’s colour and character indicates how long the ping took (or if it got no response). This is frequently very valuable at identifying where a problem is, which is generally one of these three: between computer and router, router and ISP, ISP and public internet. It can show also where packet loss or latency jumps are occurring, and patterns where something goes wrong for a few seconds so that you can determine where the problem is (this is where the time axis is crucial).

One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: in a normally-functioning situation, although most hops will have 0% loss, some will have absolutely any value from 0%–100%. The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, the rest are all 0% loss. It does host name lookup as well which can be a little bit useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.

mtr: <https://en.wikipedia.org/wiki/MTR_(software)>

1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.

You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).
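
The loss pattern described above (lossy middle hops, clean final hop) can be read mechanically. What follows is only a minimal sketch, assuming the common rule of thumb that loss at an intermediate hop matters only if it carries through to the last hop, since many routers rate-limit the replies they send to probes. The function name and the 1% threshold are made up for illustration and are not part of mtr.

```python
from typing import Optional, Sequence


def first_hop_with_real_loss(loss_pct: Sequence[float],
                             threshold: float = 1.0) -> Optional[int]:
    """Return the 1-based hop where loss starts and persists to the destination.

    Loss that shows up at a middle hop but disappears at later hops is treated
    as the router limiting its own replies, not as dropped traffic.
    Returns None when the final hop is clean.
    """
    if not loss_pct or loss_pct[-1] < threshold:
        return None
    # Walk backwards while each earlier hop is still lossy.
    hop = len(loss_pct)
    while hop > 1 and loss_pct[hop - 2] >= threshold:
        hop -= 1
    return hop


# The ten-hop path from the comment: hops 5-7 look terrible, the rest are clean.
print(first_hop_with_real_loss([0, 0, 0, 0, 100, 45, 61, 0, 0, 0]))  # None
# A hypothetical path where loss begins at hop 3 and never recovers.
print(first_hop_with_real_loss([0, 0, 30, 35, 40]))                  # 3
```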

otabdeveloper4
0 replies
13h2m

As long as you only ever visit Google web properties, yes.

evilDagmar
0 replies
49m

Short answer: No.

avoid3d
0 replies
5h28m

I work for an acquired startup that tried to solve this problem.

It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.

We haven’t gotten fundamentally better over time recently, it’s more like there is some asymptote of how much you can really tell with a certain amount of insight into the systems between source and destination.

The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.

harry_ord
3 replies
1d

Not a network person, I only played with traceroute a long time ago, but I'm pretty sure that only really happens if you explicitly ask for information about all the middle men.

Most of the time a lot of software kinda doesn't care about what's happening just if it can do what it's told.

For Websites you often get more informative errors like 404, 500 or something else.

recursive
2 replies
1d

If you're getting a status code like 404 or 500, it means there's no problem between you and the web server. The status codes come from the server. The exception is when you get a gateway/reverse proxy error. Usually 503 I think. That means the web server is down, but there's another server in front of it reporting that it's down.
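
As a rough illustration of that distinction, here is a minimal sketch that sorts a status code by who most likely produced it. It assumes only the two status codes that HTTP reserves for gateways (502 and 504); the function name and the wording of the results are hypothetical, and 503 is left in the general bucket because an origin server can send it too.

```python
from http import HTTPStatus

# Status codes that HTTP defines specifically for gateways and proxies.
GATEWAY_STATUSES = {HTTPStatus.BAD_GATEWAY, HTTPStatus.GATEWAY_TIMEOUT}


def who_answered(status: int) -> str:
    """Guess which party produced an HTTP status code."""
    if status in GATEWAY_STATUSES:
        return "gateway: a proxy could not get a valid answer from the server behind it"
    if 100 <= status <= 599:
        return "server: an HTTP server answered, so the network path to it works"
    raise ValueError(f"not an HTTP status code: {status}")


for code in (404, 500, 502, 504):
    print(code, who_answered(code))
```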

harry_ord
0 replies
1d

True, I thought of those as they're just more informative about why you're not getting what you're looking for.

YZF
0 replies
23h28m

502 Bad Gateway.

nurple
0 replies
1d

If ICMP is allowed into your network, your machine will most likely receive a Destination Unreachable response from the host that can't forward the packet further.

Your application won't see the ICMP message unless you configure the socket to report them (these are considered "transient" errors). On Linux this is done via the socket option IP_RECVERR.
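
A minimal sketch of turning that option on from Python, assuming Linux and an IPv4 datagram socket. Older Python versions do not export the constant, so the fallback value 11 is taken from the Linux headers; the helper name is made up for this example. As far as I understand it, once the option is set the queued errors are read with recvmsg() and the MSG_ERRQUEUE flag, which is not shown here.

```python
import socket
import sys

# Value of IP_RECVERR in <linux/in.h>; used when the socket module lacks it.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)


def enable_icmp_error_reporting(sock: socket.socket) -> bool:
    """Ask the Linux kernel to queue ICMP errors for this IPv4 socket.

    Returns False on other platforms or if the kernel rejects the option.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
    except OSError:
        return False
    return True


with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    print("IP_RECVERR enabled:", enable_icmp_error_reporting(s))
```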

ETA: there's not a ton of value collecting errors at this layer when you're working at L7. The errors that _do_ get surfaced for DU at your layer will be appropriate for the failure handling logic you'll inevitably have already. In this case I think it'd be a timeout, as other layers implement retries in the face of unreachable destinations.

I found these RFCs helpful re: how the TCP layer handles ICMP errors: https://www.rfc-editor.org/rfc/rfc1122#page-103

Section 4.2.3.9:

Since these Unreachable messages indicate soft error conditions, TCP MUST NOT abort the connection, and it SHOULD make the information available to the application.

DISCUSSION: TCP could report the soft error condition to the application layer with an upcall to the ERROR_REPORT routine, or it could merely note the message and report it to the application only when and if the TCP connection times out.

This one gets into the nitty gritty of how the stacks interact in order to study ICMP as vector for TCP attacks.

https://www.rfc-editor.org/rfc/rfc5927

itscrush
0 replies
5h40m

Much of this problem space I've solved with running MTR to the destination when troubleshooting to see each hop's detail.

It's like ping + traceroute in a live running session with each hop broken down.

It's quite consistent: I am often the first to notice a node down on the Xfinity network, and the same mtr run shows that my network is good at least as far as my modem. It also shows when a hop beyond my ISP adds 100s of ms of latency, which I haven't seen other tools do as well as MTR can.

Won't solve everything, but might be worth your checking in your case as it breaks down per-hop providing latency for each.

cancerhacker
0 replies
1d

The browser reports the error closest to what it was doing at the time. Host not found? Well, the network was reliable enough to reach a DNS server, which reported that there is no address for that name. But if the DNS server itself can’t be reached, it’s some sort of network error between you and that server. The typical way to diagnose that kind of problem is to perform all the steps yourself: can I ping the DNS server address? Can I resolve this host with that DNS server? What about a different DNS server? Maybe that particular name is being excluded because of corporate policy. The command line tools ping, traceroute and dig are useful if you want to get into it.
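
The "perform all the steps yourself" idea can be sketched in a few lines. This is only an illustration under stated assumptions: it checks name resolution and then a TCP connection, nothing more, and the function name, the messages and the injectable `resolve`/`connect` parameters are invented here so the logic can be exercised without a network.

```python
import socket
from typing import Callable


def diagnose(host: str, port: int = 443, timeout: float = 3.0,
             resolve: Callable = socket.getaddrinfo,
             connect: Callable = socket.create_connection) -> str:
    """Run the checks in order and name the first stage that fails."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns: could not resolve {host} ({exc})"
    address = infos[0][4][:2]  # (ip, port) of the first result
    try:
        conn = connect(address, timeout=timeout)
    except ConnectionRefusedError:
        return f"tcp: {address[0]} actively refused the connection on port {port}"
    except OSError as exc:  # includes timeouts and "network unreachable"
        return f"tcp: could not reach {address[0]} ({exc})"
    conn.close()
    return f"ok: resolved {host} and connected to {address[0]}:{port}"


if __name__ == "__main__":
    print(diagnose("example.com"))
```

Note that a "dns" result here cannot tell a missing name apart from an unreachable resolver, which is exactly the ambiguity described above; trying a second DNS server by hand is still the way to separate the two.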

boffinAudio
0 replies
10h6m

Cyclomatic Complexity is why your Operating System can't do this for you.

https://en.wikipedia.org/wiki/Cyclomatic_complexity

There are so many different paths for an error case to follow.

You can of course debug this by reducing the complexity - for example, by watching one of the links in the chain (say, DNS) and seeing if it is failing - but this is the realm of network engineers who get paid mightily to get through this cyclomatic complexity and work at the relevant layers, all the way down to the atoms in the pipe ..

If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

In fact, the links all do this, but there is simply no provision in your OS - no fancy GUI, perhaps - that allows you to fully understand this without getting overwhelmed by the cyclomatic complexity. Tools exist, and once you learn to use them to tame the complexity - congrats, you're now worth $300k/yr and can go work in San Francisco .. /s ;)

arccy
0 replies
23h43m

http(s) is built on top of multiple layers (HTTP, TLS, TCP, Ethernet...). A broken link in the lower layers can't really be presented as a higher level message (because it has no access to it).

YZF
0 replies
23h15m

For most people, most issues would be in their home network. So that's a good first guess for any connectivity problems. Rarely it would be somewhere between your home and the ISP. If it's a small rural ISP then it might be ISP->Internet though I'd think that's rare. Most large scale ISPs have enough redundancy and capacity.

As someone else mentioned ICMP addresses certain classes of failures if enabled but I think the historical reason is more along the lines of the Internet was meant to run over lossy connections. For example, when a certain link is saturated routers will just start dropping packets. Reporting each dropped packet back to the sender is just not a good idea, it adds load to a system already potentially operating at capacity. TCP assumes packets can get lost and retransmits them. When a link goes down routing protocols will potentially send those retransmitted packets over a different link/path. I.e. there's no real concept of "connection down" other than the application layer or TCP eventually giving up (which can take a very long time). The kind of ICMP message that will immediately terminate a connection is when the server machine doesn't have anything listening on the destination port.
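
To put a number on "a very long time", here is a small worked example for the connection attempt alone. It assumes Linux defaults as I understand them (six SYN retransmissions, a 1 second initial retransmission timeout that doubles after each attempt); other systems and tuned hosts will differ, and the function name is made up.

```python
def connect_give_up_seconds(retries: int = 6, initial_rto: float = 1.0) -> float:
    """Total wait before a TCP connect gives up when nothing ever answers.

    One initial SYN plus `retries` retransmissions, with the wait after each
    send doubling: 1 + 2 + 4 + ... seconds.
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(retries + 1))


print(connect_give_up_seconds())  # 127.0
```

So an application that sets no timeout of its own can sit for about two minutes before the stack reports a failure.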

AlienRobot
0 replies
20h18m

How are you trying to tell that?

If a web browser can't access a URL, it won't tell you why exactly because there's a chance it diagnoses the reason wrong and most users will be confused by that. I assume most diagnosis tools work the same way. You need to make assumptions about how the OS, hardware, and network are configured to be able to say "the problem is here."

For example, when you access a website, the first thing that needs to be done is check a domain name server (DNS) to get the IP address of the web server. But where does the web browser get the DNS IPs from? You can configure it in the browser. Or in the OS. Or in your router. Or in your modem. And if you don't, it gets them from the DHCP server the router connects to, which could be your ISP's DHCP server (then you get your ISP's default DNS) or it could also be some other router in an organization's network.
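
For the OS-level part of that chain, a minimal sketch: on Linux the configured resolvers are usually listed in /etc/resolv.conf, and the parser below just pulls out the `nameserver` lines. The sample text is invented for illustration. An address like 127.0.0.53 is typically a local stub resolver, so the file only shows the last link in the chain, not where the upstream address came from.

```python
from typing import List


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Extract the addresses on `nameserver` lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        line = line.split("#", 1)[0].split(";", 1)[0]
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


sample = """\
# Generated by a network manager (made-up sample)
nameserver 127.0.0.53
nameserver 192.168.1.1  # router
search lan
"""
print(parse_nameservers(sample))  # ['127.0.0.53', '192.168.1.1']
```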

If the DNS seems wrong it's easy to tell the IP is wrong but it gets hard to say where that IP came from.

Even SSL could be a problem with the server having the wrong certificates or it could be your computer having the wrong certificates.

_ache_
2 replies
1d

Everything in that article is a little outdated: 30% of web requests use HTTP3 nowadays, and there is CORS as well. There is also no date of publication.

recursive
1 replies
1d

30% of requests are CORS? Surely this depends on what type of development you're doing. I'm doing SaaS development for systems generally deployed inside corporate networks. Very close to 0% of requests are CORS. Same for HTTP3.

_ache_
0 replies
19h20m

I said 30% of the requests on the web use HTTP3. And nowadays there are CORS and other mechanisms that are not mentioned in the article.

wonnage
1 replies
23h9m

This reads like an AI summary of an actual HTTPS explainer. Terms get introduced with no context - no explanation of what a certificate is or how the chain of trust works, assumes the reader knows about public key cryptography, describes six out of the seven OSI layers (RIP presentation layer) without mentioning that term at all, etc.

TBF it is titled as mediocre!

MediumD
0 replies
22h28m

To be fair, I also didn’t include the session layer!

My writing isn’t a strength of mine, so I appreciate the criticism. My writing going from “bad” -> “is it AI?” is progress.

I struggled with where to “cutoff” the explanation and public key cryptography seemed like a good boundary and better explained elsewhere, as did various OSI layers.

I probably should have gone over the cert and potentially the full chain of trust, I’ll give you that.

jonwest
1 replies
20h45m

Does anyone have more examples of articles written in this perspective? Regardless of my experience level I love diving through “ELI(a mediocre engineer)” type explanations as I either learn another piece that wasn’t completely clear, or gives me another set of examples to help explain it to other people. Either way they’re generally very helpful.

Snawoot
1 replies
23h56m

The client generates a premaster secret, encrypts it with the server’s public key, and sends it to the server.

It's already not true for, like, ages.

Operyl
0 replies
22h4m

Down below it says this:

Everything you’ve learned here is a lie.

The process we just describe is for the original version of TLS, which is outdated compared to the more modern version of TLS 1.3.

raxxorraxor
0 replies
8h42m

Current version of TLS (>1.3) do not support RSA (and various other cipher suites) for security reasons.

That is true for the key exchange part because RSA does not offer forward security. For signatures RSA is still used and probably still the most widely spread type of x509 certs.
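
One way to see this locally, as a minimal sketch using Python's ssl module: TLS 1.3 cipher suite names only describe the symmetric cipher and the hash, because the key exchange is always (EC)DHE or PSK based and the certificate's signature algorithm is negotiated separately. The helper name is made up, and the exact list printed depends on the OpenSSL build behind your Python.

```python
import ssl
from typing import List, Optional


def tls13_suite_names(ctx: Optional[ssl.SSLContext] = None) -> List[str]:
    """Names of the TLS 1.3 cipher suites enabled in an SSL context."""
    ctx = ctx or ssl.create_default_context()
    return [c["name"] for c in ctx.get_ciphers() if c.get("protocol") == "TLSv1.3"]


for name in tls13_suite_names():
    print(name)
```

None of the printed names mention RSA, yet a server using these suites can still present an RSA-signed certificate.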

I know Safari just upped the requirements to 2048bit keys for RSA not too long ago (for signatures).

pietrod
0 replies
11h57m

I'm unable to find any code that shows how to verify the signature of SHA256(client_hello_random + server_hello_random + curve_info + public_key). I know the theory, but I keep running into issues implementing it. Can anybody link an actual toy program showing how to do this in practice?
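
Not a link, but a toy sketch of that check, under several assumptions: a TLS 1.2 handshake with ECDHE key exchange, an RSA certificate, and an RSA PKCS#1 v1.5 signature with SHA-256. The byte layout of the signed parameters follows my reading of RFC 8422, and all function names are hypothetical. It uses only the standard library, so the RSA step is done with plain modular exponentiation; real code should use a vetted crypto library instead.

```python
import hashlib
import hmac

# DER prefix of the DigestInfo structure for SHA-256 (RFC 8017).
SHA256_DIGEST_INFO = bytes.fromhex("3031300d060960864801650304020105000420")


def server_key_exchange_params(named_curve: int, public_key: bytes) -> bytes:
    """curve_type (0x03 = named_curve) + curve id (2 bytes) + length + point."""
    return (bytes([0x03]) + named_curve.to_bytes(2, "big")
            + bytes([len(public_key)]) + public_key)


def signed_data(client_random: bytes, server_random: bytes, params: bytes) -> bytes:
    """The exact bytes that the server's signature covers."""
    return client_random + server_random + params


def verify_rsa_pkcs1v15_sha256(n: int, e: int, signature: bytes, message: bytes) -> bool:
    """Check an RSA PKCS#1 v1.5 signature by rebuilding the padded hash."""
    k = (n.bit_length() + 7) // 8          # modulus length in bytes
    if len(signature) != k:
        return False
    s = int.from_bytes(signature, "big")
    if s >= n:
        return False
    recovered = pow(s, e, n).to_bytes(k, "big")
    digest_info = SHA256_DIGEST_INFO + hashlib.sha256(message).digest()
    if k < len(digest_info) + 11:
        return False
    expected = b"\x00\x01" + b"\xff" * (k - len(digest_info) - 3) + b"\x00" + digest_info
    return hmac.compare_digest(recovered, expected)
```

To use it, take the two 32-byte randoms from the ClientHello and ServerHello, the curve id and public key from the ServerKeyExchange message (0x001d is x25519), and the modulus and exponent from the server certificate, then call `verify_rsa_pkcs1v15_sha256(n, e, signature, signed_data(...))`. A common pitfall is hashing the data yourself and then passing the hash to a library that hashes again.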

debo_
0 replies
22h25m

aka. Writing HTTP requests from San Francisco for $300K/year

Best part of the article!

deathanatos
0 replies
13h12m

By agreeing on all these algorithms, exchanging random seeds, and the server’s SSL certificate containing the private key;

I sure hope not. But I suppose it is titled "Mediocre Engineer".

$300K/year

… I'll undercut you by $50k/y; where do I apply?

(There are just more and more errors. TLS <1.3 doesn't even work the way it describes, even though it tries to throw newer stuff into 1.3. The DNS section describes a recursive resolver, but the client isn't going to do that. It is probably talking to a stub resolver, too. "Internet Layer". The implication of "brotli" being a widely used algorithm in a ciphersuite/in TLS's compression, "Current version of TLS (>1.3) do not support RSA" …

… these sorts of blogspam are why I sometimes wish there was a downvote. The advert isn't obnoxious enough to make me want to flag it. I guess I should write the less mediocre article and make the HN frontpage. If only I made $300K/y, I'd have more time.)

StrLght
0 replies
20h33m

Might be relevant: there's also detailed and somewhat interactive byte-by-byte example of TLS for TLSv1.2[0] and TLSv1.3[1]. I absolutely love it and highly recommend checking it out if you want to learn more about TLS.

[0]: https://tls12.xargs.org/

[1]: https://tls13.xargs.org/