You cannot go wrong with the most popular choice: Prometheus/Grafana stack. That includes node_exporter for anything host related, and optionally Loki (and one of its agents) for logs. All this can run anywhere, not just on k8s.
For my homeserver I just have a small python script dumping metrics (CPU, RAM, disk, temperature and network speed) into a database (timescaleDB).
Then I visualize it with Grafana. It's actually live here if you want to check it out: https://grafana.dahl.dev
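For anyone curious what such a script can look like, here is a minimal sketch (not the author's actual code): it assumes a hypothetical TimescaleDB table `metrics(time, name, value)` and a placeholder connection string, and uses psutil plus psycopg2 to sample and insert a handful of host metrics.

```python
# Minimal sketch of a homeserver metrics dumper (hypothetical table/DSN, not the author's script).
import psutil
import psycopg2

DSN = "dbname=metrics user=metrics host=localhost"  # placeholder connection string

def sample():
    """Collect a few basic host metrics as (name, value) pairs."""
    net = psutil.net_io_counters()
    return [
        ("cpu_percent", psutil.cpu_percent(interval=1)),
        ("ram_percent", psutil.virtual_memory().percent),
        ("disk_percent", psutil.disk_usage("/").percent),
        ("net_bytes_sent", net.bytes_sent),
        ("net_bytes_recv", net.bytes_recv),
    ]

def main():
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        for name, value in sample():
            # Assumes: CREATE TABLE metrics(time timestamptz, name text, value double precision);
            cur.execute(
                "INSERT INTO metrics (time, name, value) VALUES (now(), %s, %s)",
                (name, value),
            )

if __name__ == "__main__":
    main()  # run from cron or a systemd timer; Grafana can read the table directly
```

Run it every minute or so and point Grafana's PostgreSQL data source at the table.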
having this up seems ill-advised. posting a link on HN seems crazy
How does the threat model change when exposing Grafana to the public, apart from vulnerabilities in Grafana itself? Perhaps hackers will be extra motivated to cause blips in those graphs? Exposing Grafana publicly is unusual, but I don't see an obvious failure mode.
It’s really not abnormal. GrafanaLabs does this all the time with their IaaS product.
There’s nothing wrong with exposing Grafana as long as you’re following security best practices.
It’s 2024, zero trust networking is where you want to be. Real zero trust networking is NOT adding a VPN to access internal services. It’s doing away with the notion of internal services altogether and securing them for exposure on the internet.
Really? So in 2024, folks are only deploying services that have excellent security, and not anything else? This seems like a high bar to clear but I'm curious to learn.
Article and discussion from earlier today that’s relevant: https://news.ycombinator.com/item?id=41274932
It’s about zero trust, includes the claim “VPNs are Deprecated!”
Those companies can afford letting people try "Denial of Wallet" attacks on them, though.
I, for one, will still keep using VPNs as an additional layer of security and expose only a single UDP port (WireGuard), to at least reduce the chances of that happening.
Thanks!
The implementations of zero trust that I have seen involve exposing your service to the public internet with an Authenticating Proxy on top. So instead of trusting the network implicitly you trust the caller’s auth token before they can connect to the server.
So you might have an internal service that has passed a minimal security bar that you can only establish an https connection with if you have a valid SSO token.
I mean this is true but the key part is “securing them for exposure on the internet.” Adding a simple 2FA layer (I think google calls this the Access Proxy or Identity Aware Proxy) on top is usually the way you secure zero trust services.
I don’t think it is advisable to directly expose your Grafana to the public internet where you can hit it with dictionary attacks.
What’s the benefit of having this exposed to the web? Given it’s monitoring a homeserver, seems like overkill.
Why not secure it behind a VPN or tailscale if it’s just for personal use?
None really, I had an idea that my friends could check it out if they notice a service disruption. They don't though, so it's just for fun!
Would you mind sharing the python script?
My two cents: monitoring RAM usage is completely useless, as whatever number you consider “used/free RAM” is meaningless (and the ideal state is that all of the RAM is somehow “used” anyway). You should monitor for page faults and cache misses in block device reads.
Depends. "free" reports the area used for disk buffers and programs, hence "available" and "free" numbers.
On my servers I want some available RAM which means "used - buffers", because this means I configured my servers correctly and nothing is running away, or nothing is using more than it should.
On the other hand, you want "free" almost zero on a warmed up server (except some cases which hints that heaps of memory has been recently freed) since the rest is always utilized as disk cache.
Similarly having some data on swap state doesn't harm as long as it's spilled there because some process has ran away and used more memory than it should be.
So, RAM usage metrics carry a ton of nuance and can mean totally different things depending on how you use that particular server.
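To surface that nuance instead of a single "used" figure, a minimal Linux-only sketch that reads /proc/meminfo directly and reports free, available, and cache/buffers separately:

```python
# Rough sketch: report the memory numbers that actually matter, not just "used".
def meminfo():
    """Parse /proc/meminfo into a dict of kB values."""
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            out[key] = int(rest.strip().split()[0])  # values are in kB
    return out

m = meminfo()
total = m["MemTotal"]
free = m["MemFree"]            # near zero on a warmed-up box is fine
available = m["MemAvailable"]  # what the kernel thinks it can reclaim for new allocations
cache = m.get("Cached", 0) + m.get("Buffers", 0)

print(f"free:      {100 * free / total:5.1f}%")
print(f"available: {100 * available / total:5.1f}%")
print(f"cache/buf: {100 * cache / total:5.1f}%")
```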
One of the older arguments I get to keep having over and over is No, You May Not Put Another Service on These Servers. We are using those disk caches thank you very much.
I do not enjoy showing up to yet another discussion of why our response times just went up “for no reason”. Learn your latency tables people.
Yeah, people tend to think of server utilization as black and white.
Look, we're using just 50% of that RAM. Look, there are two cores that are almost idle.
No & no. The rest of the RAM is your secret for instant responses, and that spare CPU is for me to do system management without you noticing, or to front the odd torrent of requests we get semi-regularly (e.g. the /. hug of death. Remember?).
I need to find a really good intro to queuing theory to send people to. A full queue is a slow queue. You actually want to aim for about 65% utilization.
If the numbers from the phoenix project are to be trusted, a loose estimate is the time spent in queue is proportional to the ratio of utilized to unutilized resources. For example, 50% used & 50% unused is 50:50 = 1 unit of time. 99% used is 99:1 = 99 units of time.
Also, there was a formula for determining the optimal cache size. I forget the name all the time. IIRC, in the end, caching the 10 most popular items was enough to answer 95% of your queries without hitting the disk.
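For what it's worth, that busy/idle rule of thumb takes only a few lines of Python. It is a heuristic illustration, not a real queueing-theory derivation: relative wait time is taken as utilized divided by unutilized.

```python
# Rule of thumb from The Phoenix Project: wait time ~ busy% / idle%.
def relative_wait(utilization: float) -> float:
    """Relative time spent waiting in queue at a given utilization (0..1)."""
    return utilization / (1.0 - utilization)

for u in (0.50, 0.65, 0.80, 0.90, 0.95, 0.99):
    print(f"{u:4.0%} utilized -> {relative_wait(u):6.1f}x wait")
# 50% -> 1x, 65% -> ~1.9x, 99% -> 99x: which is why aiming for ~65% leaves useful slack.
```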
This might be too basic, but I found this blog post to be an incredible introduction to queues: https://encore.dev/blog/queueing
Correct identification but wrong prescription.
Cache misses don't have anything to do with memory pressure, they're related to caching effectiveness.
Production systems shouldn't be seeing swap-related (major) page faults, because they shouldn't be using swap at all.
The traditional way Linux memory pressure was measured was with a very small swap file, checking for any usage of it. Modern Linux has the PSI subsystem.
Also, monitoring for OOM events tells you when a system needs more RAM or a workload needs to be tuned or spread out.
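Reading the PSI files mentioned above is about as simple as it gets; a sketch for kernels 4.20+ that grabs the 10-second averages from /proc/pressure:

```python
# Sketch: read Linux PSI (Pressure Stall Information) for cpu/memory/io.
def read_pressure(resource: str) -> dict:
    """Return the avg10 value per line kind ("some"/"full") for /proc/pressure/<resource>."""
    result = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            # Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
            kind, *fields = line.split()
            stats = dict(field.split("=") for field in fields)
            result[kind] = float(stats["avg10"])
    return result

for res in ("cpu", "memory", "io"):
    print(res, read_pressure(res))
# Alerting when memory "full" avg10 stays above a few percent means tasks are stalled on reclaim.
```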
What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing (see the sketch further down)
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a bit of time and then flame graph that
* DNS lookup response times/errors
Do you also keep tabs on network performance, processes, services, or other metrics?
Per-process and over time, yes, which are useful for post-mortem analysis
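To make one of the items above concrete, TCP connection counts by state, here is a small sketch that parses /proc/net/tcp directly instead of shelling out to ss (IPv4 only, no incoming/outgoing split):

```python
# Sketch: count TCP connections by state from /proc/net/tcp.
from collections import Counter

TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def tcp_state_counts() -> Counter:
    counts = Counter()
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            state_hex = line.split()[3]  # 4th column is the state code
            counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return counts

print(tcp_state_counts())  # e.g. Counter({'ESTABLISHED': 42, 'LISTEN': 12, 'TIME_WAIT': 7})
```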
How to identify a mistuned vm.swappiness?
I rely on a heuristic approach which is to track the rate of change in key metrics like swap usage, disk I/O, and memory pressure over time. The idea is to calculate these rates at regular intervals and use moving averages to smooth out short-term fluctuations.
By observing trends rather than a static value (a data point at a specific time), you can get a better sense of whether your system is under- or over-utilizing swap space. For instance, if swap usage rates are consistently low but memory is under pressure, you might have vm.swappiness set too low. Conversely, if swap I/O is high, it could indicate that swappiness is too high.
This is a poor man’s approach, and there are definitely more sophisticated ways to handle this task, but it’s a quick solution if you just need to get some basic insights without too much work.
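A rough version of that rate-of-change heuristic in Python, sampling the swap counters in /proc/vmstat and smoothing them with an exponential moving average (the interval, smoothing factor, and any alert threshold are made-up placeholders that would need tuning):

```python
# Poor man's swap-pressure trend: EMA of page swap-in/out rates from /proc/vmstat.
import time

def vmstat_swap():
    """Return cumulative (pswpin, pswpout) page counts."""
    vals = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                vals[key] = int(value)
    return vals["pswpin"], vals["pswpout"]

INTERVAL = 10   # seconds between samples (placeholder)
ALPHA = 0.2     # EMA smoothing factor (placeholder)
ema_in = ema_out = 0.0

prev_in, prev_out = vmstat_swap()
while True:
    time.sleep(INTERVAL)
    cur_in, cur_out = vmstat_swap()
    rate_in = (cur_in - prev_in) / INTERVAL     # pages/s swapped in
    rate_out = (cur_out - prev_out) / INTERVAL  # pages/s swapped out
    prev_in, prev_out = cur_in, cur_out
    ema_in = ALPHA * rate_in + (1 - ALPHA) * ema_in
    ema_out = ALPHA * rate_out + (1 - ALPHA) * ema_out
    print(f"swap-in {ema_in:.1f} pg/s, swap-out {ema_out:.1f} pg/s")
    # Sustained high swap-out alongside memory pressure hints vm.swappiness is too high.
```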
That is a good list, now just need to prioritize (after finding the ICP).
Before you start adding all of that make sure you have customers like parent poster.
For example I monitor disk space, RAM, CPU and that’s it for external tooling.
If any of that goes above thresholds, someone will log into the server and use Windows or Linux tooling to check what is going on.
I mostly monitor services' health-check endpoints, i.e. HTTP calls to our own services, which also catches the network being down or shoddy response times.
So, all in all, not much about the servers themselves.
Those are some great ideas for Prometheus alert rules. If they aren't already added here: https://samber.github.io/awesome-prometheus-alerts/
With all that, might want some good automatic anomaly detection. While at IBM's Watson lab, I worked out something new, gave an invited talk on the work at the NASDAQ server farm, and published it.
With a lot of monitoring in place, someone here might be interested.
Perhaps I can hijack this post to ask some advice on how to monitor servers.
I don't do this professionally. I have a small homelab that is mostly one router running opnsense, one fileserver running TrueNAS, and one container host running Proxmox.
Proxmox does have about 10-15 containers though, almost all Debian, and I feel like I should be doing more to keep an eye on both them and the physical servers themselves. Any suggestions?
Nope, please don't.
Almost all the free stuff has disappeared, plus you'll drive yourself mad when you later want to remove their "client" (a huge bunch of stuff spread out everywhere)... I am talking about Linux, of course.
Before installing anything, make sure you read the "uninstall" page, if there is one.
Or yes, install it only as a separate container, as somebody suggested, but again: it costs a lot.
I run Uptime Kuma, it's a simple but extensible set of monitors but perfect for a home network.
- Prometheus/Grafana/Alertmanager
- Zabbix
What are your goals though? I.e. why are you monitoring? Do you really care that much about getting alerted when something is down? Are you going to be looking at dashboards? Or do you just want to experiment/learn the technologies?
SNMP v3 (v3, because anything lower than this is unencrypted and therefore not advised unless it's on a trusted network).
For most Linux distros and network equipment, this will get you CPU/disk/RAM/network statistics. Use LibreNMS for this; it runs in a VM or CT and can alert you if metrics go out of spec.
Why would you not use OTel?
This is clearly the industry standard protocol and the present and future of o11y.
The whole point is that o11y vendors can stop reinventing lower level protocols and actually offer unique value props to their customers.
So why would you want to waste your time on such an endeavor?
OTEL was designed like it’s a Java app, and I mean that as a condemnation.
The errors are opaque and the API is obtuse. It’s also not quite done yet. The Node implementation has had major major bugs only fixed in the last twelve months, and necessary parts of the ecosystem (across all languages) are still experimental.
Overengineered and underdelivering. I’m using Prometheus on my next project. Though I’m tempted to go back to StatsD if I can find a good tagged version. Don’t hold data in a process with an unknown lifetime. Send it over loopback to a sidecar immediately and let it sort aggregation out.
Datadog agent supports StatsD with tags (aka DogStatsD - https://docs.datadoghq.com/developers/dogstatsd/ ). You can push StatsD data to Datadog Agent, which will aggregate the data and push it to some centralized tsdb such as VictoriaMetrics according to the following docs - https://docs.victoriametrics.com/#how-to-send-data-from-data...
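For reference, the tagged (DogStatsD-style) StatsD line format is simple enough to emit by hand over loopback to whatever local agent or sidecar you run; the metric names and tags below are made up for illustration.

```python
# Sketch: push a tagged StatsD metric over UDP loopback to a local agent/sidecar.
# The DogStatsD tag extension appends "|#key:value,..." to the plain StatsD line.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def statsd(name, value, mtype="c", tags=None):
    """Emit one StatsD line, with DogStatsD-style tags if given."""
    line = f"{name}:{value}|{mtype}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    sock.sendto(line.encode(), ("127.0.0.1", 8125))  # default StatsD port on loopback

# Hypothetical metric names and tags:
statsd("http.requests", 1, "c", {"status": "200", "route": "/checkout"})
statsd("http.latency_ms", 42, "ms", {"route": "/checkout"})
```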
What the hell is o11y? It's so annoying when people use weird abbreviations rather than just typing a few extra characters.
Observability.
If I were in your position I would craft my own OTel distribution and ship it.
This is very easy to do: https://github.com/open-telemetry/opentelemetry-collector-bu...
With this approach you’re standing on the shoulders of giants, compatible with any agent that speaks OTLP, and can market your distribution as an ecosystem tool.
When it comes to "what" to monitor, many usual suspects already posted in this thread, so in an attempt not to repeat what's there already, I will mention just the following (will somewhat assume Linux/systemd):
- systemd unit failures - I install a global OnFailure hook that applies for all the units, to trigger an alert via a mechanism of choice for a given system,
- restarts of key services - you typically don't want to miss those, but if they are silent, then you quite likely will,
- netfilter reconfigurations - nftables cli has useful `monitor` subcommand for this,
- unexpected ingress or egress connection attempts,
- connections from unknown/unexpected networks (if you can't just outright block them for whatever reason).
Can I bother you for a rough how wrt unexpected ingress/egress and unknown connections?
I'm not aware of any tooling that'd enable such monitoring without massively impacting performance - but I'm not particularly knowledgeable in this field either.
Just asking as someone who'd be interested in improving the monitoring on my homelab server.
eBPF is the way to monitor anything around network connections with minimal performance overhead.
eBPF is actually much more than that, and not limited to network only.
Falco
You are right, there is a performance and resources aspect to this.
When I gave those two particular monitoring examples, I should probably have put more emphasis on the word "unexpected", which by its nature keeps the cost close to zero for day-to-day operations. A problem may occur if something wrong is not only actually happening, but happening on a massive scale, in which case paying a price for a short moment hopefully makes sense. Although the cost/benefit ratio may vary, depending on the specifics, sure.
Just to illustrate the point, staying in the context of the transport layer. Let's say I don't expect a particular db server inside internal network to make egress connections to anything other than:
- a local http proxy server to port x1 to fetch os updates and push to external db backups (yes, this proxy then needs source/target policy rules and logging for this http traffic, but that's the same idea, just on the higher layer),
- a local backup server to port x2 to push internal backups,
- a local time server to port x3 for time sync,
- and a local monitoring server to port x4 for logs forwarding.
Depending on the specifics, I may not even need outgoing DNS traffic.
For ingress, I may expect the following connections only:
- to port y1 for db replication, but only from the server running authoritative hot standby,
- to port y2 for SSH access, but only from a set of local bastion hosts,
- to port y3 for metrics polling, but only from local metric servers.
In the case above, I would log any other egress or ingress attempts to this host. I can do it with some sanity, because those would be, if anything, low-frequency events stemming from some misconfiguration on my part (a VERY good way of detecting those), some software exhibiting behavior that is unexpected (to me) by design, important misconceptions in my mental model of this network as a whole, or an actual intentional unauthorized attempt (hopefully never!). In all of those cases I want to know and intervene, if only to update the whitelisting setup or maybe reduce the logging severity of some events, if they are rare and interesting enough that I can still accept logging them.
On the other hand, if I were to directly expose something to the public internet, as a rule of thumb, I would not log every connection attempt to every port, as those would be more or less constant and more than "expected".
As for the tooling, I believe anything that you use for traffic policing will do, as under those particular assumptions we don't need to chase any unusual performance characteristics.
For example, in the context of Linux and netfilter, you can put a logging-only rule at the end of related chain with default drop policy set and have some useful semantics in the message contents, so that logs monitoring will be easier to configure to catch those up, categorize (direction, severity, class, ...) and act upon it (alerting).
And when it comes to monitoring and logging around relatively high volume, "expected" network traffic (I'm taking a guess you were thinking about something like port 22 on your homelab perimeter as an example), I guess you either don't do it at all, or it's crucial enough that you have (1) compute resources to do it and, equally important, (2) a good idea what to actually do with this data. And then you probably enter into the space of network IDS, IPS, SIEM, WAF and similar acronyms, inspecting application layer traffic etc, but I don't have enough experience to recommend anything particular here.
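To make the "logs monitoring catches those" part concrete, here is a rough sketch of turning such netfilter log lines into alerts. It assumes the nftables logging rule was given a prefix such as "UNEXPECTED-EGRESS: " and that the kernel ring buffer is readable via journalctl; the prefixes and the alert mechanism are placeholders.

```python
# Sketch: watch the kernel log for netfilter drops tagged with a known prefix and alert on them.
# Assumes nftables rules along the lines of: log prefix "UNEXPECTED-EGRESS: " drop
import subprocess

PREFIXES = ("UNEXPECTED-EGRESS:", "UNEXPECTED-INGRESS:")  # placeholder log prefixes

def alert(message: str):
    # Placeholder: swap in mail, webhook, or whatever mechanism the host uses.
    print("ALERT:", message)

proc = subprocess.Popen(
    ["journalctl", "--kernel", "--follow", "--output=cat"],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    if any(p in line for p in PREFIXES):
        alert(line.strip())
```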
I'm actually considering Linux+k8s log/audit consulting or a SaaS (there still aren't any minimally decent journald log collectors), but I'm not sure who would even pay for it... as you can see from the low attention this will get.
I would recommend you try Vector. They have a Source specifically for JournalD with a good amount of customisation: https://vector.dev/docs/reference/configuration/sources/jour...
Seconding the vector recommendation. My (admittedly, homelab) infra log shipping is done completely through vector into redpanda. Vector even has a simple remapping language [1] that I use to enrich the data in line with the collection
Vector is GOATed, also recommend.
Vector is both amazing and terrible. The configuration language is just awful to work with, but the functionality is amazing and I don't know what I'd replace Vector with.
If you do a lot with logs, with different sources and destinations, it's just a must-have tool.
One thing to be aware of is that up/down alerting bakes downtime into the incident detection and response process, so literally anything anyone can do to get away from that will help.
A lot of the details are pretty application-specific, but the metrics I care about can be broadly classified as "pressure" metrics: CPU pressure, memory pressure, I/O pressure, network pressure, etc.
Something that's "overpressure" can manifest as, e.g., excessively paging in and out, a lot of processes/threads stuck in "defunct" state, DNS resolutions failing, and so on.
I don't have much of an opinion about push versus pull metrics collection as long as it doesn't melt my switches. They both have their place. (That said, programmable aggregation on the metrics exporter is something that's nice to have.)
What you call pressure is often called saturation. Saturation means the resource is at 100% utilization.
But saturation is not the same as errors.
I'm talking beyond saturation.
There are actually quite a few resources for which I'd like to maintain something resembling steady-state saturation, like CPU and RAM utilization. However, it's when I've overcommitted those resources (e.g., for RAM, when there are no more cache pages that can simply be purged to make more room for RSS) that I start to see problems. (Of note, if I start paging in and out too much, that can also affect task switching, which leaves the kernel doing way more work, which itself can lead to a fun cascade of problems.)
Saturation is not a boolean, it's how beyond 100% utilization the resource is.
Don't monitor your servers, monitor your application.
No. Your application will crash in mysterious ways and you'll scratch your head for months unless you look at the system logs and see: MCE on CPU #4, recovered from internal exception <insert details here>, or any similar hardware-breakage log. It can be DIMMs, CPU, PCIe, thermal, anything basically.
Monitor your servers. If you can't verify that the foundation is sound, you can't trust anything on that system even if it's formally verified to be bug free.
Monitor both.
Exactly, it's incredibly useful to see the environment in which the application operates and how it reacts.
E.g. does your application spike in memory usage as network latency goes up? How does the application react when /var fills up (you should have an alert for that).
What I've frequently seen, running externally developed applications for customers, is applications that haven't been designed with monitoring in mind, so until you can guide the developers into adding all the metrics you'd need, you have to rely on what the operating system can tell you, perhaps with whatever agents are available for that particular programming language.
Q1: Can some ELI5 when you’d use:
- nagios
- Victoria metrics
- monit
- datadog
- prometheus grafana
- etc …
Q2: Also, is there something akin to “SQLite” for monitoring servers. Meaning, simple / tested / reliable tool to use.
Q3: if you ran a small saas business, which simple tool would you use to monitor your servers & services health?
Q1:
- Nagios: if it is already set up and covering all requirements in a legacy system.
- Prometheus + Grafana (+ Loki + ...): modern and extensible ecosystem with a unix philosophy, my default choice these days.
- Victoria Metrics: up and coming player that combines several features of the promstack into one but is also compatible to be used as part of the mix. Less moving parts and supposedly more performant, but also less accumulated knowledge in forums and the workforce.
- Datadog: if in a team of mainly developers who want good insights out of the box, ootb being the main feature, and if money is of no concern. Complete vendor lock-in and their sales reps will call your private number at 3 am on a sunday if they have it.
- Monit: if I'm a one-man army (RoR-on-Heroku type) who needs some basic monitoring and, more importantly, supervisor capabilities (e.g. restart service on failure).
Q2. That depends on your requirements (Scale? Logs? UI? Tracing? Environment (e.g. container)?), the less you need the simpler you can go. Some bash, systemd unit files and htop could be all you need.
Q3: I'd go with Prometheus and Grafana, easy enough to get going, extend as needed (features and scale) and hire for.
Q3: if you ran a small saas business, which simple tool would you use to monitor your servers & services health?
It seems like Datadog is an extremely obvious choice for this?
Unless you have someone you're paying to do devops full time it's almost certainly not worth the time you'll have to invest in setting up any of the open source tools.
At small scales Datadog is fairly cheap and will almost certainly give you everything you need out of the box.
I try to monitor everything, because it makes it much easier to debug weird issues when sh*t hits the fan.
Do you also keep tabs on network performance, processes, services, or other metrics?
Everything :)
What's your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
I went with collectd [1] and Telegraf [2] simply because they support tons of modules and are very stable. However, I have a couple of bespoke agents where neither collectd nor Telegraf fits.
Lastly, what's your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
We can argue to death, but I'm for push-based agents all the way down. It is much easier to scale, and things are painless to manage when the right tool is used (I'm using Riemann [3] for shaping, routing, and alerting). I used to run a Zabbix setup, and scaling was always the issue (Zabbix is pull-based). I'm still baffled how pull-based monitoring gained traction, probably because modern gens need to repeat mistakes from the past.
[2] https://www.influxdata.com/time-series-platform/telegraf/
I'm still baffled how pull-based monitoring gained traction, probably because modern gens need to repeat mistakes from the past.
For us, knowing immediately who should have had data on the last scrape but didn't respond is the value. What mistakes are you referring to?
(I am genuinely curious, not baiting you into an argument!)
For us, knowing immediately who should have had data on the last scrape but didn't respond is the value.
Maybe I don't understand your use case well, but with tools like Riemann, you can detect stalled metrics (per host or service), who didn't send the data on time, etc.
What mistakes are you referring to?
Besides the scaling issues and the simpler architecture, in Zabbix's case there were issues with predictability: when the server would actually pull the metrics (different metrics could have different cadences), whether the main Zabbix service had been reloaded, had connection issues, or was oversaturated with stuck threads because some agents took longer to respond than others. This is not only Zabbix-specific but a common challenge when a central place has to go around, query things, and wait for a response.
None of the things you list are for logs; metrics are a different use case. Do not use OpenTelemetry or you will suffer (and everyone who has suffered will try to bring you into their hell).
Look for guides written before 2010. Seriously, it's that bad. Then, after you have everything in one syslog somewhere, dump it into a fancy dashboard like o2.
What are you using?
I haven't seen the bad parts of OpenTelemetry yet. I'm thinking about it for a k8s cluster, but I'm curious what you're using.
What are some of the issues you have seen with OTEL?
Available file descriptors
And available inodes, many forget about those. Suddenly you can't make new files, but you have tons of available disk space. Then you realise that some application creates an ungodly amount of tiny files in a cache directory.
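Checking inode headroom is cheap with os.statvfs, so there is little excuse not to alert on it; the path and threshold below are placeholders.

```python
# Sketch: warn when free inodes on a filesystem drop below a threshold.
import os

def inode_usage(path: str = "/"):
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    used_pct = 100.0 * (total - free) / total if total else 0.0
    return total, free, used_pct

total, free, used_pct = inode_usage("/")
print(f"inodes: {used_pct:.1f}% used ({free} of {total} free)")
if used_pct > 90:  # arbitrary threshold
    print("WARNING: running out of inodes even though 'df -h' looks fine")
```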
Sounds like you're reinventing Nagios, which has addressed all of the above well. If nothing else, there are lots of good solutions in that ecosystem, covering both push and pull.
Or Checkmk [1], which grew out of Nagios and brings thousands of plugins for nearly every piece of hardware and service you can think of.
I think this is a chance for me to go somewhat off topic and ask people how they handle combining the monitoring of different logs in one place. There are many solutions, but most of them are geared toward the enterprise. What do people use for a poor man's approach (personal usage like self-hosting/homelab) that doesn't require you to be VC funded or take a lot of time to actually implement?
I like the Grafana stack so far; it seems more lightweight and suitable for homelab scale than something like Elastic. Grafana Alloy's remote_write functionality pushes logs from every machine/VM/node/pod to a central Grafana/Loki and metrics to Prometheus. Optionally visualize both in a Grafana dashboard.
Look at major cloud providers and what they offer in monitoring, such as AWS CloudWatch, etc.
Be warned though there are a ton of monitoring solutions already. Hopefully yours has something special to bring to the table.
Be careful with hosted services like Amazon CloudWatch. They can get quite expensive if you log too many details. I think you can really save a lot of money with your own solution, eg using Prometheus / Grafana.
With Icinga, for webservers:
- apt status (for security/critical updates that haven't been run yet)
- reboot needed (presence of /var/run/reboot-required)
- fail2ban jail status (how many are in each of our defined jails)
- CPU usage
- MySQL active, long-running processes, number of queries
- iostat numbers
- disk space
- SSL cert expiration date
- domain expiration date
- reachability (ping, domain resolution, specific string in an HTTP request)
- Application-specific checks (WordPress, Drupal, CRM, etc)
- postfix queue size
First mention of [ -f /var/run/reboot-required ], which is important to ensure security updates have taken effect
Tasks are the more annoying things to track.
Did it run? Is it still running? Did it have any errors? Why did it fail? Which step did it fail on?
My last job built a job tracker for "cron" tasks that supported an actual crontab plus scheduled checks hitting an HTTPS endpoint.
Of course it requires code modification to ensure it writes something so you can tell it ran in the first place. But that was part of modernizing a 10 year old LAMP stack.
Whatever directly and materially affects cost, and that's it.
For some of my services on DigitalOcean for instance, I monitor RAM because using a smaller instance can dramatically save money.
But for the most part I don't monitor anything - if it doesn't make me money why do I care?
I have a list of things I look at every few weeks to see if we’ve developed any new bad habits that are going to bite us.
Open feedback loops make it difficult for people to adjust their behavior. Catching the bad habits early makes them a lot more shallow.
Check out Coroot - with use of eBPF and other modern technologies it can do advanced monitoring with zero configuration
Will they actually last though? We've seen monitoring tech come and go, wherever it's inherently linked to the financial success of the mother company.
I monitor periods between naps. The longer I get naps the happier I am :)
Seriously though, the server itself is not the part that matters, what matters is the application(s) running on the server. So it depends heavily on what the application(s) care about.
If I'm doing some CPU heavy calculations on one server and streaming HTTPS off a different server, I'm going to care about different things. Sure there are some common denominators, but for streaming static content I barely care about CPU stuff, but I care a lot about IO stuff.
I'm mostly agnostic to push vs pull, they both have their weaknesses. Ideally I would get to decide given my particular use case.
The lazy metrics you mentioned are not that useful; as another commenter noted, "free" RAM is mostly a pointless number, since these days most OSes wisely use it for caching. But information on OS-level caching can be very useful, depending on the workloads I'm running on the system.
As for agents, what I care about is how stable, reliable, and resource-intensive the agent is. I want it to take close to zero resources and be rock solid and reliable. Many agents fail spectacularly at all three of those things. CrowdStrike is the most recent example of failure here with agent-based monitoring.
The point of monitoring systems to me are two-fold:
* Trying to spot problems before they become problems(i.e. we have X days before disk is full given current usage patterns).
* Trying to track down a problem as it is happening(i.e. App Y is slow in X scenario all of a sudden, why?).
Focus on the point of monitoring and keep your agent as simple, solid, and idiot-proof as possible. CrowdStrike's recent failure mode was completely preventable had the agent been written differently. Architect your agent as much as possible to never be another CrowdStrike. Yes, I know CrowdStrike hit user machines, not servers, but server agent failures happen all the time too, in roughly the same ways; they just don't make the news quite as often.
Active monitoring is a different animal from passive metrics collection. Which is different from log transport.
The Nagios ecosystem was fragmented for the longest time, but now it seems most users have drifted towards Icinga, so that is what I use for monitoring. There is some basic integration with Grafana for metrics, so that is what I use for metrics panels. There is good reason not to spend your innovation budget on monitoring; instead use simple software that will continue to be around for a long time.
As for what to monitor, that is application specific and should go into the application manifest or configuration management. But generally there should be some sort of active operation that touches the common data path, such as a login, creation of a dummy object such as for example an empty order, validation of said object, and destruction/clean up.
Outside the application there should be checks for whatever the application relies on. Working DNS, NTP drift, Ansible health, certificate validity, applicable APT/RPM packages, database vacuums, log transport health, and the exit status or last file date of scheduled or backgrounded jobs.
Metrics should be collected for total connections, their return status, all types of I/O latency and throughput, and system resources such as CPU, memory, disk space.
node_exporter all the way: https://github.com/prometheus/node_exporter
netdata on all our boxes. It’s incredible. Provides automagic statsd capture, redis, identifies systemd services, and all the usual stuff like network performance, memory, cpu, etc. recently they introduced log capture which is also great, broken down by systemd service too.
Something that I've noticed a need for is usage vs. requested utilization. Since we roll our own kube cluster, I'm trying to right-size our pods, and that hasn't been as straightforward as it could be, since I have to do a lot of the math and recalculations myself.
node_exporter ( https://github.com/prometheus/node_exporter ) and process_exporter ( https://github.com/ncabatoff/process-exporter ) expose most of the useful metrics needed for monitoring server infrastructure together with the running processes. I'd also recommend taking a look at the Coroot agent, which uses eBPF for exporting the essential host and process metrics - https://github.com/coroot/coroot-node-agent .
As for the agent, it is better from an operations perspective to run a single observability agent per host. This agent should be small in size and lightweight on CPU and RAM usage, should have no external dependencies, and should have close to zero configs that need to be tuned, i.e. it should automatically discover all the apps and metrics that need to be monitored and send them to the centralized observability database.
If you don't want to write the agent yourself, take a look at vmagent ( https://docs.victoriametrics.com/vmagent/ ), which scrapes metrics from the exporters mentioned above. vmagent satisfies most of the requirements stated above except for configuration - you need to provide configs for scraping metrics from separately installed exporters.
For monitoring Proxmox host(s), I use InfluxDB to store all the Proxmox metrics and then Grafana for a beautiful display.
As for the servers, I use Uptime Kuma (notifies whenever a service goes down), glance (htop in a web UI), vnstat (for network traffic usage) and Loki (for log monitoring).
using datadog these days, newrelic previously - basically every metric you can collect.
Disk I/O and network I/O are particularly important, but most of the information you truly care about lies in application traces and application logs. Database metrics are a close second, particularly cache/index usage, disk activity, and query profiling. Network too, if your application is bandwidth heavy.
I might sound weird but I got tired of the whole Prometheus thing so I just put my hosts on a NATS cluster and push the metrics I really care about there.
Latencies. This is a sure fire flag that something is amiss.
Good luck with your project, @gorkemcetin! I hope you achieve your goals. While I’m not a server manager, I’ve read through most of the comments in this thread and would like to suggest a few features that might help evolve your project:
- I noticed some discussions about alarm systems. It could be beneficial to integrate with alarm services like AWS SES and SNS, providing a seamless alerting mechanism.
- Consider adding a feature that allows users to compare their server metrics with others. This could provide valuable context and benchmarking capabilities.
- Integrating AI for log analysis could be a game-changer. An AI-powered tool that reads, analyzes, and reports on logs could help identify configuration errors that might be easily overlooked.
I hope these suggestions help with the development of BlueWave Uptime Manager!
Is this from a sysops perspective? Because Nagios and its fork Icinga are still a thing.
Grafana/Prometheus stack
It's free if you don't have too many servers - 15 uptime monitors (the most useful) and 32 blacklist monitors (useful for e-mail, but don't know why you'd need so many compared to uptime).
It's fairly easy to reach the free limits with not many servers if you're also monitoring VMs, but I've found it reliable so far. It's nice you can have ping tests from different locations, and it collects pretty much any metrics that are useful such as CPU, RAM, network, disk. The HTTP and SMTP tests are good too.
Don't forget the "CPU steal" state, and AWS CPU burst credits.
In general I would also suggest monitoring server costs (AWS EC2 costs, e.g.).
For example, you should be aware that T3 AWS EC2 instances can simply cost double if your CPU actually gets used, since the "unlimited" credit flag is ON by default. I personally hate the whole "CPU credit" AWS model... it is an instrument totally in their (AWS) hands to just make more money...
I could not find a satisfying way to detect an unusual log line, qualitatively (a new message) or quantitatively (an abnormal number of occurrences of a given message, neglecting any variable part), and therefore developed a dirty hack, and it works quite well for me: https://gitlab.com/natmaka/jrnmnt
I think OOM kills are an important one, especially with containerized workloads. I've found that RAM used/limit metrics aren't sufficient, as often the spike that leads to the OOM event happens faster than the metric resolution, giving misleading charts.
Ideally I'd see these events overlaid with the time series to make it obvious that a restart was caused by OOM as opposed to other forms of crash.
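One way to capture the event itself rather than hoping a metric sample catches the spike: on cgroup v2, each container's or service's cgroup exposes an oom_kill counter in memory.events. A sketch, assuming the unified hierarchy mounted at /sys/fs/cgroup:

```python
# Sketch: surface OOM-kill counts per cgroup (cgroup v2), so the event itself is recorded
# even if the RAM spike fell between two metric samples.
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumes a unified (v2) hierarchy

def oom_kills():
    results = {}
    for dirpath, _dirnames, filenames in os.walk(CGROUP_ROOT):
        if "memory.events" not in filenames:
            continue
        with open(os.path.join(dirpath, "memory.events")) as f:
            for line in f:
                key, value = line.split()
                if key == "oom_kill" and int(value) > 0:
                    results[os.path.relpath(dirpath, CGROUP_ROOT)] = int(value)
    return results

for cgroup, count in oom_kills().items():
    print(f"{cgroup}: {count} OOM kill(s)")  # export as a counter and overlay it on dashboards
```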
I use netdata, works like a charm https://github.com/netdata/netdata
Suggestion: If you can adapt your monitoring servers to push data out through a data diode, you might be able to make some unique security guarantees with respect to ingress of control.
I like icinga's model, which can run a small agent on the server, but it doesn't run as root. I grant specific sudo rules for checks that need elevated permissions.
I find it easier to write custom checks for things where I don't control the application. My custom checks often do API calls for the applications they monitor (using curl locally against their own API).
There are also lots of existing scripts I can re-use, either from the Icinga or from Nagios community, so that I don't write my own.
For example, recently I added systemd monitoring. There is a package for the check (monitoring-plugins-systemd). So I used Ansible to install it everywhere, and then "apply" a conf to all my Debian servers. It helped me find a bunch of failing services or timers which previously went unnoticed, including things like backups, where my backup monitoring said everything was OK, but the systemd service for borgmatic was running a "check" and found some corruption.
For logs I use promtail/loki. Also very much worth the investment. Useful to detect elevated error rates, and also for finding slow http queries (again, I don't fully control the code of applications I manage).
1. System temperatures with a custom little python server I wrote that gets polled by HomeAssistant (for all machines on my tailnet, thanks Tailscale).
2. Hard drive health monitoring with Scrutiny.
https://github.com/AnalogJ/scrutiny
Everything else, doesn't matter to me for home use.
Good luck with your endeavor!
Make sure whatever information provided can be actionable.
For example, providing a CPU metric alone is just for alerting. If it exceeds a threshold, make sure it gives insights into which process/container was using how much CPU at a given moment. Bonus points if you can link logs from that process/container at that time.
For disks, tell which directory is large, and what kind of file types are using much space.
Pretty graphs that don't tell you what to look for next are nothing.
Take a look at Vector; I personally prefer it over Fluentd, and I don't think you'll need a custom monitoring agent with it.
The number of 4xx, 500, and 2xx responses from an HTTP application can tell you a lot about application anomalies. Other protocols also have their error responses.
I also keep a close eye on the throughput vs. response time ratio, especially the 95th percentile of the response time.
It’s also great to have this same ratio measurement for the DBs you might use.
Those are my go to daily metrics, the rest can be zoomed in their own dashboards after I first check this.
My servers send a lot of emails, so postfix failures.
Don’t reinvent the wheel - there are many mature monitoring agents out there that you could ingest from, and it allows easy migration for customers.
As to what I monitor - normally, as little as humanly possible, and when needed, everything possible.
Sounds like you might be reinventing a wheel...
Can you simply include some existing open source tooling into your uptime monitor, and then contribute to those open source projects any desired new features?
Counter question: Why do you think another product is needed in this space?
Nothing at all. And why should I waste energy on this and storage and bandwidth? To watch a few graphs when bored?
In addition to the things already mentioned, there are a few higher level things which I find helpful:
- http req counts vs total non-200-response count vs. 404-and-30x count.
- whatever asynchronous jobs you run, a graph of jobs started vs. jobs finished will show you rough resource utilization and highlight gross bottlenecks.
I use monit and m/monit server to measure CPU/load/memory/disk, processes, and HTTP endpoints.
Essentially at a very generic level (from SOHO to not that critical services at SME level):
- automated alerts on unusual loads, meaning I do not care about CPU/RAM/disk usage as long as there are no specific spikes, so the monitor just sends alerts (mails) in case of significant/protracted spikes, tuned after a bit of experience. There is no need to collect such data over long periods: you size your infra for the expected load, you deploy and see whether you got it right, and if so you just need to learn the usual values so you can filter them out, keeping alerts only for anomalies;
- log alerts for errors, warnings, access logs etc.: same principle, you deploy and collect a bit, then you have "the normal logs" and you create alerts for unusual things; retention depends on the log types and services you run, and some retention may be constrained by law;
Performance metrics are a totally different thing that should be decided more by the devs than by operations, and much of their design depends on the kind of development and services you have. It's much more complex, because the monitoring itself affects the performance of the system MUCH more than generic alerting or the casual ping and the like to check service availability. Push and pull are mixed: for alerts, push is the obvious go-to; for availability, pull is much more sound; etc. There is no single choice.
Personally I tend to go easy on fine-grained monitoring to start. It's important, of course, but it should not become an analysis-paralysis trap, nor waste too many human and IT resources collecting potential garbage in potentially non-marginal batches...
Unless your resources are ephemeral, there is no need to push metric data anywhere; collecting (pulling) makes more sense.
Shitheads trash-talking Lisp, followed by disk space, followed by unexplained CPU spikes and suspicious network activity.
disk usage
There's a bunch of ways of measuring "usage" for disks, apart from the "how much space is used". There's "how many iops" (vs the total available for the disk), there's how much wear % is used/left for the disk (specific to flash), how much read/write bandwidth is being used (vs the maximum for the disk), and so on.
I'm not in that game at the moment. I used to run some background services that could be down for an hour without causing major difficulty (by design), so I was very focused on checking that the application was running rather than the server.
A lot of people here are suggesting metrics that are easy to collect but nearly useless for troubleshooting a problem, or even detecting it.
CPU and Memory are the easiest and most obvious to collect but the most irrelevant.
If nobody’s looked at any metrics before on the server fleet, then basic metrics have some utility: you can find the under- or over- provisioned servers and fix those issues… once. And then that well will very quickly run dry. Unfortunately, everyone will have seen this method “be a success” and will then insist on setting up dashboards or whatever. This might find one issue annually, if that, at great expense.
In practice, modern distributed tracing or application performance monitoring (APM) tools are vastly more useful for day-to-day troubleshooting. These things can find infrequent crashes, expired credentials, correlate issues with software versions or users, and on and on.
I use Azure Application Insights in Azure because of the native integration but New Relic and DataDog are also fine options.
Some system admins might respond to suggestions like this with: “Other people manage the apps!” not realising that therein lies their failure. Apps and their infrastructure should be designed and operated as a unified system. Auto scale on metrics relevant to the app, monitor health relevant to the app, collect logs relevant to the app, etc…
Otherwise when a customer calls about their failed purchase order the only thing you can respond with is: “From where I sit everything is fine! The CPUs are nice and cool.”
For servers, I think the single most important statistic to monitor is percent of concurrent capacity in use, that is, the percent of your thread pool or task pool that's processing requests. If you could only monitor one metric, this is the one to monitor.
For example, say a synchronous server has 100 threads in its thread pool, or an asynchronous server has a task pool of size 100; then Concurrent Capacity is an instantaneous measurement of what percentage of these threads/tasks are in use. You can measure this when requests begin and/or end. If when a request begins, 50 out of 100 threads/tasks are currently in-use, then the metric is 0.5 = 50% of concurrent capacity utilization. It's a percentage measurement like CPU Utilization but better!
I've found this is the most important to monitor and understand because it's (1) what you have the most direct control over, as far as tuning, and (2) its behavior will encompass most other performance statistics anyway (such as CPU, RAM, etc.)
For example, if your server is overloaded on CPU usage, and can't process requests fast enough, then they will pile up, and your concurrent capacity will begin to rise until it hits the cap of 100%. At that point, requests begin to queue and performance is impacted. The same is true for any other type of bottleneck: under load, they will all show up as unusually high concurrent capacity usage.
Metrics that measure 'physical' (ish) properties of servers like CPU and RAM usage can be quite noisy, and they are not necessarily actionable; spikes in them don't always indicate a bottleneck. To the extent that you need to care about these metrics, they will be reflected in a rising concurrent capacity metric, so concurrent capacity is what I prefer to monitor primarily, relying on these second metrics to diagnose problems when concurrent capacity is higher than desired.
Concurrent capacity most directly reflects the "slack" available in your system (when properly tuned; see next paragraph). For that reason, it's a great metric to use for scaling, and particularly automated dynamic auto-scaling. As your system approaches 100% concurrent capacity usage in a sustained way (on average, fleet wide), then that's a good sign that you need to scale up. Metrics like CPU or RAM usage do not so directly indicate whether you need to scale, but concurrent capacity does. And even if a particular stat (like disk usage) reflects a bottleneck, it will show up in concurrent capacity anyway.
Concurrent capacity is also the best metric to tune. You want to tune your maximum concurrent capacity so that your server can handle all requests normally when at 100% of concurrent capacity. That is, if you decide to have a thread pool or task pool of size 100, then it's important that your server can handle 100 concurrent tasks normally, without exhausting any other resource (such as CPU, RAM, or outbound connections to another service). This tuning also reinforces the metric's value as a monitoring metric, because it means you can be reasonably confident that your machines will not exhaust their other resources first (before concurrent capacity), and so you can focus on monitoring concurrent capacity primarily.
Depending on your service's SLAs, you might decide to set the concurrent capacity conservatively or aggressively. If performance is really important, then you might tune it so that at 100% of concurrent capacity, the machine still has CPU and RAM in reserve as a buffer. Or if throughput and cost are more important than performance, you might set concurrent capacity so that when it's at 100%, the machine is right at its limits of what it can process.
And it's a great metric to tune because you can adjust it in a straightforward way. Maybe you're leaving CPU on the table with a pool size of 100, so bump it up to 120, etc. Part of the process for tuning your application for each hardware configuration is determining what concurrent capacity it can safely handle. This does require some form of load testing to figure out though.
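As an illustration of measuring this (a sketch, not a drop-in implementation): keep an in-flight counter next to a fixed-size pool and export in_flight / pool_size as the gauge.

```python
# Sketch: track concurrent-capacity utilization of a fixed-size worker pool.
import threading

POOL_SIZE = 100                      # tune so the host copes at 100% of this
_in_flight = 0
_lock = threading.Lock()

def capacity_in_use() -> float:
    """Instantaneous fraction of the pool that is busy (the metric to export)."""
    with _lock:
        return _in_flight / POOL_SIZE

def handle_request(process):
    """Wrap request handling so every request updates the in-flight counter."""
    global _in_flight
    with _lock:
        _in_flight += 1
        utilization = _in_flight / POOL_SIZE   # sampled at request start
    try:
        # export/record `utilization` here, e.g. as a gauge or histogram
        return process()
    finally:
        with _lock:
            _in_flight -= 1
```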
*looks around* I use `htop`
For any web apps or API services, we monitor:
- Uptime
- Error Rate
- Latency
Prometheus/Grafana
Used PRTG for many years. Works ok. Has a free offering too. It's a bit of an artistic process figuring what to log and how to interpret it in an actionable way. Good luck and try to have fun.
* Services that should be running (enabled/autostart) but aren't. This is easier and more comprehensive than stuff like "monitor httpd on webservers", because all necessary services should be on autostart anyways, and all stuff that autostarts should work or be disabled.
* In our setup, container status is included in this thanks to quadlets. However, if using e.g. docker, separate container monitoring is necessary, but complex.
* apt/yum/fwupd/... pending updates
* mailqueue length, root's mailbox size: this is an indicator for stuff going wrong silently
* pending reboot after kernel update
* certain kinds of log entries (block device read error, OOMkills, core dumps).
* network checksum errors, dropped packets, martians
* presence or non-presence of USB devices: desktops should have keyboard and mouse. servers usually shouldn't. usb storage is sometimes forbidden.
Syslog, kern.log, messages, htop, iotop, df -h, fail2ban.log, pm2 logs, netstat -tulpn
Honestly? Look at Netdata for comparison. Everything from nginx hostname requests (we run web hosting servers) to cpu/ram/disk data but also network data and more. If you can do better than that somehow, by all means do it and make it better.
But there's more to it than just collecting data in a dashboard. Having a reliable agent and being able to monitor the agent itself (for example, not just saying "server down!" if the agent is offline, but checking the server remotely for verification) would be nice.
For data collection, Veneur is pretty nice, and it is open source and vendor agnostic. By Stripe.
RAID health
Traces are valuable. But otherwise, I feel like most monitoring information is noise, unactionable or better collected elsewhere.
In my limited experience in a small biz running some SaaS web apps with New Relic for monitoring:
What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
Not much tbh. Those were the key things. Alerts for high CPU and memory. Being able to track those per container etc was useful.
Do you also keep tabs on network performance, processes, services, or other metrics?
Services 100%. We did containerised services with Docker Swarm, and one of the bugbears with New Relic was having to sort out container label names and such to be able to filter things in the UI. That took me a day or two to standardise (along with the fluentd logging labels, so everything had the same labels).
Background Linux processes less so, but they were still useful, although we had to turn them off in New Relic as they significantly increased the data ingestion (I tuned NR agent configs to minimise the data we sent just so we could stick with the free tier as best we could).
Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
I like fluentd, but I hate setting it up. Like I can never remember the filter and match syntax. Once it’s running I just leave it though so that’s nice
never used open telemetry.
Not sure how useful that info is for you.
What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
Ehhhh, it depends. New Relic was pretty established with a bunch of useful features, but it definitely felt like overkill for what was essentially two containerised Django apps with some extra backend services. There was a lot of bloat in NR we probably didn't ever touch, including in the agent itself, which took up quite a bit of memory.
Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
Personally push, mostly because I can set it up and probably forget about it - run it and add egress firewall rules. Job done. It probably helps with the network effect too, since it's easy to get started.
I can see pull being the preference for bigger enterprises, though, who would only want to allow x, y, z data out to a third party, especially for security etc., because running a New Relic agent with root access to the host (like the New Relic container agent asks for) is probably never gonna fly in that environment.
What New Relic kind of got right with their pushing agent was the configs. But finding out the settings was a bear, as the docs are a bit of a nightmare.
I dunno if it’s just me, but I would never buy a monitoring solution from a company that has to ask a web forum this kind of question.
If you’re building a product from scratch you must have some kind of vision based on deficiencies in existing solutions that are motivating you to build a new product, right?
I used to have a Nagios setup, but after years of continuous uptime (except for planned maintenance) I felt it was not worth it. If your tech stack is simple enough and runs on VPSes (whose physical availability is the responsibility of your hoster), there isn't much that can happen.
If i were to setup metrics, the first thing i would go for is the pressure stall information.
What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
No, and I have specifically tried to push back against monitoring offerings like Datadog and Dynatrace, especially the second, because running OneAgent and the Dynakube CRDs means doing things like downloading tarballs from Dynatrace and listening to absolutely everything they can, from processes to network.
Just echoing some of what others have said... iostat... temperature (sometimes added boards have temperature readings as well as the machine)... plus just hitting web pages or REST APIs and searching the response for expected output... file descriptors...
In addition to disk space, running out of inodes on your disk, even if you don't plan to. If you have swap, seeing if you are swapping more than expected. Other things people said make sense depending on your needs as well.
For web-application monitoring, we've [1] gone with an outside-in monitoring approach. There are many approaches to monitoring, and depending on your role in a team, you might care more about the individual health of each server, or about the application as a whole, independent of its underlying (virtual) hardware.
For web applications for instance, we care about uptime & performance, tls certificates, dns changes, crawled broken links/mixed content & seo/lighthouse metrics.
Server infrastructure is mostly a solved problem - hardware (snmp/ipmi etc) and OS layer.
I think it'd be very hard at this point to come up with compelling alternatives to the incumbents in this space.
I'd certainly not want a non-free, non-battle-tested, potentially incompatible & lock-in agent that wouldn't align with the agents I currently utilise (all free in the good sense).
Push vs pull is an age-old conundrum - at dayjob we're pull - Prometheus scraping Telegraf - for OS metrics.
Though traces, front-end, RUM, SaaS metrics, logs, etc, are obviously more complex.
Whether to pull or push often comes down to how static your fleet is, but mostly to whether you've got a decent CMDB that you can rely on to tell you the state of all your endpoints - registering and decommissioning endpoints, as well as coping with scheduled outages.
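For a concrete feel of the two models, here's a small sketch using prometheus_client (assumed installed): the same metric is exposed for scraping (pull) and also pushed to a Pushgateway whose address below is a placeholder.

    # The same metric exposed both ways with prometheus_client.
    # Pull: Prometheus scrapes :8000/metrics. Push: we send it to a Pushgateway ourselves.
    import shutil
    import time
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway, start_http_server

    registry = CollectorRegistry()
    disk_free = Gauge("disk_free_bytes", "Free bytes on /", registry=registry)

    if __name__ == "__main__":
        start_http_server(8000, registry=registry)  # pull model: expose /metrics
        while True:
            disk_free.set(shutil.disk_usage("/").free)
            # push model: only egress from the host is needed
            push_to_gateway("pushgateway.example.internal:9091",
                            job="homeserver", registry=registry)
            time.sleep(60)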
For those of you managing server infrastructure,
As a developer who has often had to look into problems and performance issues, rather than an infrastructure person, this is basically the bare minimum of what I want to see (a rough psutil sketch of collecting these follows the list):
* CPU usage
* RAM breakdown by at least Used/Disk cache/Free
* Disk fullness (preferably in absolute numbers, percents get screwy when total size changes)
* Disk reads/writes
* Network reads/writes
And this is high on the list but not required:
* Number of open TCP connections, possibly broken down by state
* Used/free inodes (for relevant filesystems); we have actually used them up before (thanks npm)
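As mentioned above, here's a rough psutil mapping of that list (psutil assumed installed; note that virtual_memory().cached is Linux-only and net_connections may need extra privileges on some platforms):

    # Rough psutil sketch of the metrics listed above.
    from collections import Counter
    import psutil

    cpu = psutil.cpu_percent(interval=1)          # CPU usage %
    mem = psutil.virtual_memory()                 # used / cached / free breakdown
    disk = psutil.disk_usage("/")                 # absolute numbers, not just %
    dio = psutil.disk_io_counters()               # cumulative disk reads/writes
    net = psutil.net_io_counters()                # cumulative bytes sent/received
    tcp_by_state = Counter(c.status for c in psutil.net_connections(kind="tcp"))

    print(f"cpu={cpu}% used={mem.used >> 20}MiB cached={mem.cached >> 20}MiB free={mem.free >> 20}MiB")
    print(f"disk {disk.used >> 30}GiB of {disk.total >> 30}GiB, read={dio.read_bytes >> 20}MiB written={dio.write_bytes >> 20}MiB")
    print(f"net sent={net.bytes_sent >> 20}MiB recv={net.bytes_recv >> 20}MiB")
    print(f"tcp connections by state: {dict(tcp_by_state)}")
    # inodes aren't covered by psutil; os.statvfs() works, as in a comment further up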
Follow Brendan Gregg's USE method: for every resource, check Utilization, Saturation, and Errors.
Related, have there been any 'truly open-source' forks of Grafana since their license change? Or does anyone know of good Grafana alternatives from FOSS devs in general? My default right now is to just use Prometheus itself, but I miss some of the dashboard functionality etc. from Grafana.
Grafana's license change to AGPLv3 (I suspect to drive their enterprise sales), combined with an experience I had reporting security vulnerabilities, combined with seeing changes like this[1] not get integrated has left a bad taste in my mouth.
[1] https://github.com/grafana/grafana/pull/6627
The AGPL is probably the best option for a FOSS license. Why do you consider it "not truly open-source"?
AGPL hinders wide product adoption, since corporate lawyers caution against relying on AGPL products: it is easy to violate the license terms and get sued for it.
It's not possible to sell non-FOSS modifications to AGPL-licensed software. I think that's intended. It's not antithetical to Open Source, quite the opposite in fact.
Good. It doesn't prevent you from using it (as opposed to selling it) in your company.
Can you explain the problem with that GitHub pull request? I didn't get it.
AGPLv3 is a completely valid choice for an open source license, and (not that it was necessarily questioned, but since critique of pushing enterprise sales comes up,) having a split open source/enterprise license structure is not particularly egregious and definitely not new. Some people definitely don't like it, but even Richard Stallman is generally approving of this model[1]. It's hard to find someone more ideologically-oriented towards the success and proliferation of free and open source software, though that obviously doesn't mean everyone agrees.
I'm not saying, FWIW, that I think AGPL is "good", but it is at least a perfectly valid open source license. I'm well aware of the criticisms of it in general. But if you're going to relicense an open source project to "defend" it against abuse, AGPL is probably the most difficult to find any objection to. It literally exists for that reason.
I don't necessarily think that Grafana is the greatest company ever or anything, but I think these gripes are relatively minor in the grand scheme of things. (Well, the security issue might be a bit more serious, but without context I can't judge that one.)
[1]: https://www.fsf.org/blogs/rms/selling-exceptions
The FOSS alternative to Grafana is Grafana, which is FOSS. More FOSS than it was before, actually.
I think the word to use is rather copyleft! AGPL is fully open source in its truest sense! It's so open that it ensures it always stays open!
What did they do wrong with this PR? It seems they eventually realized the scope was much bigger, requiring changes on both the frontend and backend, and asked potential contributors to reach out if they're interested in contributing that particular feature (implying between the lines that they themselves don't have a use for it, but won't reject a PR).
Seems like they didn't need it themselves, and asked the community to contribute it if someone really wanted it, but no one has stepped up since then.
To be fair, AGPLv3 is a very valid open source licence.
Now, poor and bad behaviour from the Prom maintainers is a very fertile subject. If you want to see some real spicy threads, check out the one where people raised that Prom's calculation of rate is incorrect, or the thread where people asked for Prom to interpolate secrets into its config from env vars - like every other bit of common cloud-adjacent software.
Both times the Prom devs behaved pretty poorly and left a really bad taste in my mouth. VictoriaMetrics seems like a much better replacement.
I'm using VictoriaMetrics instead of Prometheus, am I doing something wrong? I have Zabbix as well as node_exporter and Percona PMM for MySQL servers, because sometimes it is hard to configure the Prometheus stack for SNMP when Zabbix covers this case out of the box.
Well, they claim superior performance (which might be true), but the costs are high and include a small community, low quality APIs, best effort correctness/PromQL compatibility, and FUD marketing, so I decided to go with the de-facto standard without all of the issues above.
Could you provide more details regarding low quality APIs and PromQL compatibility issues? The following article explains "issues" with PromQL compatibility in VictoriaMetrics - https://medium.com/@romanhavronenko/victoriametrics-promql-c... . See also https://docs.victoriametrics.com/metricsql/ . TL;DR: MetricsQL fixes PromQL issues with rate() and increase() functions. That's why it is "incompatible" with PromQL.
Could you provide examples of FUD marketing from VictoriaMetrics?
I am on mobile, so cannot really link GitHub for examples, but I'd recommend anyone considering using VM over Prometheus to take a cursory look into how similar things are implemented in both projects, and what shortcuts were made in the name of getting "better performance".
Performance-wise, for example, VictoriaMetrics' prometheus-benchmark only covered instant queries without lookback the last time I checked.
Regarding FUD marketing: all Prometheus community channels (mailing lists, StackOverflow, Reddit, GitHub, etc.) are full of VM devs pushing VM and bashing everything from the ecosystem without mentioning any of the tradeoffs. I am also not aware of VictoriaMetrics giving back anything to the Prometheus ecosystem (can you maybe link some examples if I am wrong?), which is very similar to Microsoft's embrace, extend, and extinguish strategy. As for recent concrete examples, here are 2 submissions of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531, https://news.ycombinator.com/item?id=39391208, but it's really hard to avoid all the rest in the places mentioned above.
No costs if you're hosting everything. It does scale better and has better performance. Used it and have nothing bad to say about it. For the most part a drop-in replacement that just performs better. Didn't run into PromQL compatibility issues with off-the-shelf Grafana dashboards.
Prometheus itself is pretty simple, fairly robust, but doesn’t necessarily scale for long-term storage as well. Things like VictoriaMetrics, Mimir, and Thanos tend to be a bit more scalable for longer term storage of metrics.
For a few hundred gigs of metrics, I’ve been fine with Prometheus and some ZFS-send backups.
Just to expand upon some experiences with some of the listed software.
The architecture is quite different between Thanos and the others you've listed: unlike the others, Thanos queries fan out to remote Prometheus instances for hot data, while data (typically older than 2 hours) is shipped via a sidecar to S3 storage. As the routing of a query depends on the Prometheus external labels, our developer queries would often fan out unnecessarily to multiple Prometheus instances. This is because our developers often search for metrics by a service name or some service-related label rather than by an external label describing the location of the workload, which is what Thanos uses for routing.
Upon identifying this, I migrated to Mimir and we saw immediate drops in query response times for developer queries, which no longer have to wait for the slowest Prometheus instances before displaying the data.
We've also since adopted OpenTelemetry in our workloads and directly ingest OTLP into Mimir (which VictoriaMetrics also supports).
I wrote an extensive reply to this but unfortunately the HN servers restarted and lost it.
The TL;DR was that from where I stand, you’re doing nothing wrong.
In a previous client we ran Prometheus for months, then Thanos, and eventually we implemented Victoria Metrics and everyone was happy. It became an order of magnitude cheaper due to using spinning rust for storage and still getting better performance. It was infinitely and very easily scalable, mostly automatically.
The “non-compliant” bits of the query language turned out to be fixes to the UX and other issues. Lots of new functions and features.
Support was always excellent.
I’m not affiliated with them in any way. Was always just a very happy freeloading user.
I have deployed lots of metrics systems, starting with cacti and moving through graphite, kairosdb (which used Cassandra under the hood), Prometheus, Thanos and now Mimir.
What I've realised is that they're all painful to scale 'really big'. One Prometheus server is easy, and you can scale vertically and go pretty big. But you need to think about redundancy, and you want to avoid ending up accidentally running 50 Prometheus instances, because that becomes a pain for the Grafana people - unless you use an aggregating proxy like Promxy, and even then you have issues running aggregating functions across all of the instances. You need to think about expiring old data and possibly aggregating it down into a smaller set so you can still look at certain charts over long periods. What's the Prometheus solution here? MOAR INSTANCES. And reads need to be performant or you'll have very angry engineers during the next SEV1 because their dashboards aren't loading, so you throw in an additional caching solution like Trickster (which rocks!) between Grafana and the metrics. Back in the KairosDB days you had to know a fair bit about running Cassandra clusters, but these days it's all baked into Mimir.
I'm lucky enough to be working for a smaller company right now, so I don't have to spend a lot of time tending to the monitoring systems. I love that Mimir is basically Prometheus backed by S3, with all of the scalability and redundancy features built in (though you still have to configure them in large deployments). As long as you're small enough to run their monolithic mode you don't have to worry about individually scaling half a dozen separate components. The actual challenge is getting the $CLOUD side of it deployed, and then passing roles and buckets to the nasty helm charts while still making it easy to configure the ~10 things that you actually care about. Oh and the helm charts and underlying configs are still not rock solid yet, so upgrades can be hairy.
Ditto all of that for logging via Loki.
It's very possible that Mimir is no better than Victoria Metrics, but unless it burns me really badly I think I'll stick with it for now.
Not doing anything wrong. It scales better and has better performance. Works well. Prometheus is also fine.
Yeah, I've been working on deploying such a stack with added txtai indexing so I can just ask my stack questions - set up txtai workflows and be able to slice questions across what you're monitoring.
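Not the parent's actual setup, just a guess at the shape of it: index short text summaries of your monitoring state with txtai and query them in natural language (txtai assumed installed; exact API details vary by version, and the summaries below are hypothetical).

    # Very rough sketch: semantic search over monitoring summaries with txtai.
    from txtai import Embeddings

    # Hypothetical summaries you might generate from your metrics/alerts pipeline
    docs = [
        "host web-1: disk 92% full on /var, inode usage normal",
        "host db-1: swap usage climbing since 02:00, no OOM kills",
        "service api: p99 latency 850ms, error rate 0.4%",
    ]

    embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
    embeddings.index([(i, text, None) for i, text in enumerate(docs)])

    for hit in embeddings.search("which host is running out of disk?", 1):
        print(hit["text"], hit["score"])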
That's how to monitor, not what to monitor.