Free Monitoring Solutions: Special Snowflakes
Though zero cost and community support can seem appealing, free monitoring software can be a bad deal for organizations hoping to use monitoring at scale, to make IT more agile, and to control associated costs. We recently asked sysadmins and CIOs to share some of the challenges they encounter with free IT monitoring solutions. We’ve summarized those results in a solution brief and in a series of blogs.
Our first installment discussed why free monitoring solutions are often difficult to deploy in production. To summarize: building a full-featured, web UI-accessible monitoring solution around a free monitoring engine (e.g., Nagios Core, Prometheus) can mean selecting, integrating, and configuring back ends (e.g., specialized databases like InfluxDB or CouchDB), search engines (e.g., Elastic), community-provided front-end webUIs, and other add-ons.
Assuming the resulting assemblage works, congratulations! You’ve created a “special snowflake”: a complex application with many moving parts and, potentially, many subtle but critical dependencies. It is also one that you now need to validate, operate, and maintain.
Testing a production-scale snowflake is hard
Clustering and failover setups can be complicated. And as hard as they can be to deploy, they may be even more difficult (not to mention time-consuming) to test exhaustively. In-service testing of production setups is harder yet, and potentially more dangerous, since it can mean inducing failures on live infrastructure. Not testing, however, can leave you vulnerable to monitoring service outages or data loss -- a major problem when monitoring is the tool whose availability and data integrity you most need to depend on, to keep other critical enterprise systems available.
Dependencies can make a snowflake fragile
Dependency management is the bane of all “big software” implementations: issues arise as fast-moving component projects get out of sync with one another. Even something as apparently straightforward as building a Galera cluster around a popular open source SQL implementation (e.g., MariaDB) can turn tricky, requiring version “pinning” and special procedures to prevent updates from being missed or, worse, to prevent an accidental update to incompatible component versions that breaks your cluster. The more such dependencies exist within your monitoring solution, the more documenting you need to do, the more know-how staff need to perform operations safely, and the more manual processes you need to manage, each with its attendant risk of human error. The O2 network outage on December 6, 2018, which took out mobile data service for over 32 million UK subscribers for most of a day and is expected to cost the carrier over $125 million, was caused simply by letting a security certificate expire.
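Certificate expiry is one of the easier manual processes to automate. The following is a minimal sketch, not a production check: it uses only Python's standard library, and the function names, the 30-day warning threshold, and the sample dates are assumptions chosen for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between 'now' and a notAfter string such as 'Jan  1 00:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate as OK, WARNING, or EXPIRED (threshold is an assumption)."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining < warn_days:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical certificate end date, checked against a fixed point in time.
    sample = "Jan  1 00:00:00 2030 GMT"
    print(expiry_status(sample, datetime(2029, 12, 20, tzinfo=timezone.utc)))
```

In practice, `fetch_not_after` would be called for each monitored endpoint and the result passed to `expiry_status` using `datetime.now(timezone.utc)`. The point is that a check like this is yet another piece someone has to write, schedule, and maintain in a self-assembled solution.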
You can easily end up frozen in time
Where uncurated dependencies exist, updates and upgrades become complex, time-consuming, and risky. But stasis (keeping the same version of a free monitoring solution running, perhaps for years on end) is also problematic. Users who adopt this strategy don’t get the benefit of new features, become progressively more vulnerable to security issues, and gradually lose the ability to call on the community for help as their implementation becomes more and more obsolete.
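One way to see how far a setup has drifted is to compare the versions you run against the newest releases you know of. The sketch below is illustrative only: the component names and version numbers are hypothetical, it assumes simple dotted numeric versions, and you would have to gather the "latest" data yourself for each component.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '4.3.1' into (4, 3, 1)."""
    return tuple(int(part) for part in text.split("."))


def staleness_report(installed: dict, latest: dict) -> dict:
    """Label each installed component by how it compares with the newest known release."""
    report = {}
    for name, have in installed.items():
        newest = latest.get(name)
        if newest is None:
            # No upstream release information: possibly an orphaned component.
            report[name] = "unknown"
            continue
        have_v, newest_v = parse_version(have), parse_version(newest)
        if have_v >= newest_v:
            report[name] = "current"
        elif have_v[0] < newest_v[0]:
            report[name] = "major version behind"
        else:
            report[name] = "minor or patch behind"
    return report


if __name__ == "__main__":
    # Hypothetical inventory of a self-assembled monitoring stack.
    installed = {"engine": "4.3.1", "webui": "1.2.0", "plugin": "0.9.0"}
    latest = {"engine": "4.4.6", "webui": "2.0.1"}
    for component, status in staleness_report(installed, latest).items():
        print(f"{component}: {status}")
```

A report like this does not make upgrading any safer, but it does make the cost of standing still visible before it turns into a security or support problem.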
The community may be of little practical help
While open source user communities can be of great help with popular, widely used components of a complete monitoring solution (e.g., the database back end), answers to questions about the more “bespoke” aspects of your custom free monitoring setup can be much harder to obtain. Popular web UIs, dashboard add-ons, plugins, and other components tend to be the never-monetized creations of individuals or small teams, who (understandably) go on to new jobs and projects, leaving “orphan” components behind.
Your snowflake just gets flakier as time goes by
Elaborating, scaling out, and stabilizing your free monitoring solution is only the beginning. Now begins the ongoing work of configuring your solution to monitor fast-evolving hybrid infrastructure and applications. This can mean going beyond project-provided plugins: selecting from among community-supported solutions of unknown quality, and building complex sets of service checks, dashboards, alerting protocols, and other customizations around these components. Continued customization requires specialized knowledge and significant time, and may entail several kinds of risk, all of which we’ll discuss in our next blog.