Part one of a series objectively examining important topics in contemporary data center monitoring, including observability, automation, and cost...
Free Monitoring Solutions: Hard to Scale
Large enterprise hybrid IT estates can comprise anywhere from the high hundreds to tens of thousands of hosts. That’s an intimidating number. But it’s not the number you should be most concerned with when determining monitoring performance requirements.
A much more important figure is the number of service checks. Between eight and two dozen service checks might cover monitoring host hardware and basic operating system health. Run a MySQL instance on that host, and you can add 25 or so more checks -- a best-practice, must-watch subset of important database health info. Webserver? Half a dozen or so checks. Load balancer? 20-ish checks. And so on.
This stuff adds up. A single LEMP server, running on your premises, can easily demand 100 service checks for full functional insight. Add further sets of checks for every important application or service, every cloud platform component or resource, every cluster, every unit of network gear. An enterprise data center or cloud estate can easily end up requiring hundreds of thousands of service checks. Each check has to complete within a desired time interval. On top of that comes additional computation in many parallel channels to determine, for example, the current health and capacity of business services spread across multiple hosts.
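To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The per-role check counts are taken loosely from the figures above (the base host value is an assumed midpoint of "eight to two dozen"), and the estate sizes and the five-minute interval are purely hypothetical.

```python
# Rough per-role check counts, based on the approximate figures in the text.
CHECKS_PER_ROLE = {
    "base_host": 16,       # assumed midpoint of "eight to two dozen"
    "mysql": 25,
    "webserver": 6,
    "load_balancer": 20,
}


def total_checks(inventory):
    """Sum service checks for a mapping of role -> number of instances."""
    return sum(CHECKS_PER_ROLE[role] * count for role, count in inventory.items())


def checks_per_second(total, interval_seconds):
    """Average executions per second needed to run every check once per interval."""
    return total / interval_seconds


# Hypothetical estate, for illustration only.
inventory = {
    "base_host": 5000,
    "mysql": 500,
    "webserver": 2000,
    "load_balancer": 100,
}

total = total_checks(inventory)
rate = checks_per_second(total, interval_seconds=300)
print(f"{total} checks, about {rate:.0f} per second at a 5-minute interval")
```

Even this modest hypothetical estate needs several hundred check executions per second, sustained, before any aggregation or business-service computation is counted.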
That’s a lot of work. So being able to scale monitoring is important. But that can be hard with some free monitoring solutions.
Some free monitoring engines have legacy-style “monolithic” architectures
Monoliths are much easier to write than distributed solutions. The trouble is that software monoliths can only scale vertically: i.e., you can only increase application throughput and performance by deploying the app, its database(s) and other components on more and more powerful, and necessarily expensive, hardware or virtual machines.
Many alternatives for “distributed architectures”
While some popular free monitoring solutions can be deployed in distributed fashion to increase capacity (for example, by spreading service checks across multiple monitoring servers), the community and marketplace may provide several toolkits for doing so. It can be difficult to evaluate and choose from among these, and most have one or more gotchas, meaning that they’re okay for some use cases, but not for others (see below).
Depending on which of several methods you might employ to distribute Nagios Core, for example, you might need to plan to perform configuration work on multiple machines (or all on the central server). In some cases, graphing and I/O intensive tasks might only be performed on the central server, not the satellites, limiting scalability.
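As a minimal sketch of what "spreading service checks across multiple monitoring servers" means in principle, the snippet below assigns each host to one of several satellites using a stable hash. This is an illustration only, not how Nagios Core or any particular distribution toolkit works; the function names and the host and server names are hypothetical.

```python
import hashlib


def assign_server(host_name, servers):
    """Pick a monitoring server for a host using a stable hash of its name."""
    digest = hashlib.sha256(host_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def partition(hosts, servers):
    """Return a mapping of server -> list of hosts whose checks it should run."""
    plan = {server: [] for server in servers}
    for host in hosts:
        plan[assign_server(host, servers)].append(host)
    return plan


# Hypothetical names, for illustration only.
servers = ["satellite-a", "satellite-b", "satellite-c"]
hosts = [f"web-{n:02d}" for n in range(1, 13)]

plan = partition(hosts, servers)
for server, assigned in plan.items():
    print(server, len(assigned))
```

A simple modulo scheme like this reshuffles most hosts whenever a satellite is added or removed, which is one example of the kind of gotcha that has to be weighed when choosing a distribution method.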
Configuring at scale can be hard, too
Some popular free monitoring solutions are configured by editing human-readable text files (instead of a database), and changes take effect only when the software is restarted or reloaded. While this isn’t difficult in a single-server environment, performing these steps (without automation) across multiple servers in a distributed environment can be more difficult, time-consuming, and potentially risky -- creating periods during which parts of the estate aren’t observable. In highly dynamic environments (e.g., many VMs or other resources deployed and retired each shift) it may be unworkable.
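To show what text-file configuration looks like when it is generated rather than hand-edited, here is a small sketch that renders Nagios-style service definitions from a list of checks. It assumes the `generic-service` template and the `check_http` and `check_ping` commands are already defined in your configuration (as they are in the Nagios Core sample files); the host name and helper functions are hypothetical.

```python
def render_service(host_name, description, check_command, template="generic-service"):
    """Render one Nagios-style service object definition as text."""
    return (
        "define service {\n"
        f"    use                     {template}\n"
        f"    host_name               {host_name}\n"
        f"    service_description     {description}\n"
        f"    check_command           {check_command}\n"
        "}\n"
    )


def render_host_services(host_name, checks):
    """Render all service definitions for one host, separated by blank lines."""
    return "\n".join(
        render_service(host_name, description, command)
        for description, command in checks
    )


# Assumed check set for a web host; adjust to the commands your setup defines.
WEB_CHECKS = [
    ("HTTP", "check_http"),
    ("PING", "check_ping!100.0,20%!500.0,60%"),
]

config_text = render_host_services("web-01", WEB_CHECKS)
print(config_text)
```

Generating the files is only half the job: each monitoring server still has to receive the new files, validate them, and reload, which is where the coordination risk described above comes in.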