Tangential question from a layman: when I lose access to a particular website, or the internet as a whole, why is it so hard to tell where in the chain the failure is occurring? Like it’s often unclear whether
* I’ve got a network misconfiguration on my local machine;
* My wifi connection to the router is down;
* The cable between my router and ISP is cut;
* My ISP is having large scale issues; or
* The website I’m trying to reach is down.
I’ve been given the vague impression that it has something to do with a non-deterministic path by which requests are routed, but this seems unconvincing. If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”
It’s possible to figure out exactly what failed if you know how it all works.
But writing a tool that provides a useful description to the user is near impossible, because no two setups are the same, it’s not possible to know whether something is intentional or not, and it can be dangerous to just assume the common causes and suggest a completely wrong answer to the user.
For example, let’s say you can’t connect to a website because the DNS server isn’t responding and the host isn’t responding. You could tell the user that something is probably misconfigured at their router, or that their ISP is having issues.
However, it turns out that the actual reason was that their VPN client updated the local routing tables and DNS server but failed to revert the changes when they quit the client. How is a troubleshooter supposed to know whether those settings were temporary changes or the permanent ones?
Once you try to start to write a troubleshooter that can identify the actual cause, you realize that it’s very difficult due to the complexity and variation. At best you can write something that usually spits out a correct answer but also sometimes suggests something totally wrong and leads people down a completely wrong path.
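To make that concrete, here’s roughly what the first pass of such a troubleshooter looks like in Python. This is just a sketch under the assumption that DNS and a TCP connect are the only things worth checking, which is exactly the kind of assumption that goes wrong in the VPN example above; the function name is mine.

```python
import socket

def where_does_it_break(host: str, port: int = 443) -> str:
    """Rough localization: check DNS first, then a TCP connect.
    (A real troubleshooter would also probe the default gateway, routes,
    proxy settings, etc., and could still guess wrong.)"""
    try:
        # Step 1: can we even resolve the name?
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as e:
        return f"DNS lookup failed ({e}): resolver, router or ISP territory"
    ip = addrs[0][4][0]
    try:
        # Step 2: does a TCP connection to the resolved address work?
        with socket.create_connection((ip, port), timeout=3):
            return f"TCP to {ip}:{port} works: the problem is above layer 4"
    except OSError as e:
        return f"resolved to {ip} but connect failed ({e}): path or host down"
```

Even this tiny version can mislead: a “DNS lookup failed” result might be your router, your ISP, or a leftover VPN resolver setting.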
If Google dedicated 10 engineers full time to this problem for 3 years, could they solve it?
Yes, and they partially have. Browsers are fairly good at telling you where the chain has failed or been cut, though some error messages seem intentionally uninformative, since the full details would be meaningless to your average user.
That said, from an enthusiast perspective, running traceroute to the nearest google service (1e100.net for example) will already give you a huge tip on where things went wrong.
I regularly run `mtr 1.1` for monitoring network conditions. One of its display modes gives you a sort of 3D view: the x-axis is time, the y-axis is the hops, and each cell’s colour and character indicate how long the ping took (or whether it got no response). This is frequently very valuable for identifying where a problem is, which is generally one of three places: between computer and router, between router and ISP, or between ISP and the public internet. It can also show where packet loss or latency jumps are occurring, and patterns where something goes wrong for a few seconds at a time, so that you can determine where the problem is (this is where the time axis is crucial).
One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: in a normally-functioning situation, although most hops will have 0% loss, some can have absolutely any value from 0%–100% (intermediate routers often rate-limit or deprioritize responding to these probes, so loss at a middle hop that doesn’t persist to the final hop is usually harmless). The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, and the rest are all 0% loss. It does hostname lookups as well, which can be a little useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.
mtr: <https://en.wikipedia.org/wiki/MTR_(software)>
1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.
You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).
As long as you only ever visit Google web properties, yes.
Short answer: No.
I work for an acquired startup that tried to solve this problem.
It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.
We haven’t gotten fundamentally better recently; it’s more like there’s some asymptote on how much you can really tell with a given amount of insight into the systems between source and destination.
The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.
Not a network person, and I only played with traceroute a long time ago, but I'm pretty sure that only really happens if you explicitly ask for information about all the middlemen.
Most of the time, software kinda doesn't care what's happening along the way, just whether it can do what it's told.
For Websites you often get more informative errors like 404, 500 or something else.
If you're getting a status code like 404 or 500, it means there's no problem between you and the web server. The status codes come from the server. The exception is when you get a gateway/reverse proxy error. Usually 503 I think. That means the web server is down, but there's another server in front of it reporting that it's down.
True. I mentioned those because they're at least more informative about why you're not getting what you're looking for.
502 Bad Gateway.
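The distinction is easy to demonstrate with Python's urllib: an HTTPError means a server actually answered (the status code travelled all the way back to you), while a URLError means the failure happened somewhere on the way. A small sketch; the function name is just for illustration.

```python
import urllib.error
import urllib.request

def classify(url: str) -> str:
    """Did we reach a web server at all? HTTPError means yes (the status
    code came from the server); URLError means something failed en route."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return f"server answered: {resp.status}"
    except urllib.error.HTTPError as e:
        return f"server answered with an error: {e.code}"   # e.g. 404, 500
    except urllib.error.URLError as e:
        return f"never got a server response: {e.reason}"   # DNS, TCP, TLS...
```

So a 404 lands in the second branch (the path to the server is fine), while a cut cable or dead host lands in the third.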
If ICMP is allowed into your network, your machine will most likely receive a Destination Unreachable response from the host that can't forward the packet further.
Your application won't see the ICMP message unless you configure the socket to report such errors (they're considered "transient" errors). On Linux this is done via the socket option IP_RECVERR.
ETA: there's not a ton of value in collecting errors at this layer when you're working at L7. The errors that _do_ get surfaced for Destination Unreachable at your layer will be appropriate for the failure-handling logic you'll inevitably already have. In this case I think it'd be a timeout, as other layers implement retries in the face of unreachable destinations.
I found these RFCs helpful re: how the TCP layer handles ICMP errors. RFC 1122, section 4.2.3.9: https://www.rfc-editor.org/rfc/rfc1122#page-103
This one gets into the nitty-gritty of how the stacks interact, in order to study ICMP as a vector for TCP attacks.
https://www.rfc-editor.org/rfc/rfc5927
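Here's a minimal Linux-only sketch of that IP_RECVERR mechanism from Python: send a UDP datagram to a closed local port, then pull the resulting ICMP "port unreachable" details off the error queue with MSG_ERRQUEUE. The IP_RECVERR fallback value and the struct layout come from the Linux headers; older Pythons may not expose the constant.

```python
import errno
import socket
import struct
import time

# IP_RECVERR may not be exposed by the socket module; 11 is its value in <linux/in.h>.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)

# Find a local UDP port with no listener: bind an ephemeral port, note it, close it.
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
s.connect(("127.0.0.1", dead_port))
s.send(b"anyone there?")  # elicits an ICMP "port unreachable" from the kernel
time.sleep(0.2)           # give the error time to land in the queue

# MSG_ERRQUEUE reads the error queue without blocking (EAGAIN if empty).
data, ancdata, msg_flags, addr = s.recvmsg(1024, 512, socket.MSG_ERRQUEUE)
ee_errno = ee_origin = None
for level, ctype, payload in ancdata:
    if level == socket.IPPROTO_IP and ctype == IP_RECVERR:
        # First fields of struct sock_extended_err (<linux/errqueue.h>)
        ee_errno, ee_origin, ee_type, ee_code = struct.unpack_from("=IBBB", payload)
s.close()
print(ee_errno == errno.ECONNREFUSED, ee_origin)  # origin 2 means ICMP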
Much of this problem space I've solved by running MTR to the destination when troubleshooting, to see each hop in detail.
It's like ping + traceroute in a live running session with each hop broken down.
Quite consistently I'm the first to notice a node down on the Xfinity network, while the same mtr output shows that my own network, at least up to my modem, is fine. Or I'll see a hop beyond my ISP with hundreds of ms of added latency, which I haven't seen other tools surface as well as MTR can.
It won't solve everything, but it might be worth checking in your case, since it breaks things down per hop, with latency for each.
The browser reports the error closest to what it was doing at the time. Host not found? Well, the network was reliable enough to reach a DNS server, which returned that there's no address for that name. But if the DNS server itself can't be reached, it's some sort of network error between you and that server. The typical way to diagnose that kind of problem is to perform all the steps yourself: can I ping the DNS server address? Can I resolve this host with that DNS server? What about a different DNS server? Maybe that particular name is being blocked by corporate policy. The command-line tools ping, traceroute and dig are useful if you want to get into it.
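The "can I resolve this host with that DNS server" step can even be done by hand: a DNS query is just a small UDP packet (RFC 1035 wire format), so a rough sketch in Python looks like this. dig does all of it properly; the server addresses in the comment are only examples.

```python
import random
import socket
import struct

def build_query(name: str) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: random ID, flags with just RD=1, QDCOUNT=1, other counts 0
    header = struct.pack(">HHHHHH", random.randint(0, 0xFFFF), 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def answer_count(server: str, name: str, timeout: float = 2.0) -> int:
    """Ask a specific DNS server and return ANCOUNT from its reply header."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_query(name), (server, 53))
        reply, _ = s.recvfrom(512)
    finally:
        s.close()
    return struct.unpack(">H", reply[6:8])[0]

# Needs network access, so not run here; compare two resolvers:
# answer_count("1.1.1.1", "example.com")  vs  answer_count("8.8.8.8", "example.com")
```

If one resolver returns answers and another doesn't, you've localized the problem to that resolver rather than to the path or the site.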
Cyclomatic Complexity is why your Operating System can't do this for you.
https://en.wikipedia.org/wiki/Cyclomatic_complexity
There are so many different paths for an error case to follow.
You can of course debug this by reducing the complexity - for example, by watching one of the links in the chain (say, DNS) and seeing if it is failing - but this is the realm of network engineers who get paid mightily to get through this cyclomatic complexity and work at the relevant layers, all the way down to the atoms in the pipe ..
In fact, the links all do this, but there is simply no provision in your OS - no fancy GUI, perhaps - that allows you to fully understand this without getting overwhelmed by the cyclomatic complexity. Tools exist, and once you learn to use them to tame the complexity - congrats, you're now worth $300k/yr and can go work in San Francisco .. /s ;)
HTTP(S) is built on top of multiple layers (HTTP, TLS, TCP, Ethernet...). A broken link in the lower layers can't really be presented as a higher-level message, because the lower layer has no access to that higher-level context.
For most people, most issues will be in their home network, so that's a good first guess for any connectivity problem. More rarely it's somewhere between your home and the ISP. If it's a small rural ISP, it might be ISP->internet, though I'd think that's rare; most large ISPs have enough redundancy and capacity.
As someone else mentioned, ICMP addresses certain classes of failures if it's enabled, but I think the historical reason is more along the lines of the internet being designed to run over lossy connections. For example, when a link is saturated, routers will just start dropping packets. Reporting each dropped packet back to the sender is just not a good idea: it adds load to a system already potentially operating at capacity. TCP assumes packets can get lost and retransmits them. When a link goes down, routing protocols will potentially send those retransmitted packets over a different link/path. I.e. there's no real concept of "connection down" other than the application layer or TCP eventually giving up (which can take a very long time). The kind of ICMP message that will immediately terminate a connection is when the destination machine doesn't have anything listening on the destination port.
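That last case is easy to see with a connected UDP socket on Linux: the ICMP port-unreachable comes back as ConnectionRefusedError on the next socket call. A small sketch, assuming nothing is listening on the chosen port:

```python
import socket

# Grab a local UDP port number that has no listener: bind an ephemeral
# port to learn a free number, then close it again.
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
s.connect(("127.0.0.1", dead_port))  # "connect" only fixes the peer address
s.send(b"hello?")                    # send succeeds; ICMP comes back afterwards
refused = False
try:
    s.recv(1024)                     # the queued ICMP error surfaces here
except ConnectionRefusedError:
    refused = True                   # kernel translated ICMP port unreachable
s.close()
print(refused)
```

Note the asymmetry: the send "works", and only the next operation on the socket reports the failure, which is exactly why applications often just see a timeout instead.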
How are you trying to tell that?
If a web browser can't access a URL, it won't tell you exactly why, because there's a chance it diagnoses the reason wrong, and most users would be confused by that. I assume most diagnostic tools work the same way. You need to make assumptions about how the OS, hardware, and network are configured to be able to say "the problem is here."
For example, when you access a website, the first thing that needs to happen is a lookup against a domain name server (DNS) to get the IP address of the web server. But where does the web browser get the DNS server IPs from? You can configure them in the browser. Or in the OS. Or in your router. Or in your modem. And if you don't, they come from a DHCP server: typically your router's, which in turn may hand out your ISP's default DNS servers, or some other DHCP server in an organization's network.
If the DNS answer seems wrong, it's easy to tell that the IP is wrong, but it gets hard to say where that IP came from.
Even SSL could be a problem: the server might have the wrong certificates, or your computer might have the wrong root certificates.
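On that "where did the DNS server setting come from" point, on Linux you can at least dump what the stub resolver file says, though with systemd-resolved this often just shows 127.0.0.53 and pushes the question one layer down. A sketch (Linux-specific path):

```python
def nameservers(path: str = "/etc/resolv.conf") -> list[str]:
    """Return the nameserver entries from a resolv.conf-style file."""
    servers = []
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2 and parts[0] == "nameserver":
                    servers.append(parts[1])
    except FileNotFoundError:
        pass  # not a Linux-style system
    return servers

print(nameservers())
```

Comparing this against what your router's admin page and your browser's settings say is exactly the kind of cross-checking a human does and a generic troubleshooter can't safely assume.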