Part one of a series objectively examining important topics in contemporary data center monitoring, including observability, automation, and cost...
Free Monitoring Solutions: Visibility, Cloudy
Effective monitoring requires the ability to visualize information: first, to survey the state of many systems at once and highlight issues; then to drill into them from several angles, exposing more detail and illuminating root causes.
Enterprise IT operations staffs typically spend a great deal of time in this forensic workflow. Its usability -- how information is organized, summarized, and surveyed; how issues are flagged; how details are exposed, explored, and collaborated over -- is critical to morale and efficiency.
Getting it right (a tall order) lets individual operators do more on their own, and do it faster -- shortening resolution times, enabling SLA compliance, and freeing up staff time for the valuable work of architecting and automating more-resilient solutions. The need for expensive and potentially disruptive escalations (to in-house experts, to coding teams, to vendor support, etc.) is reduced significantly. When escalation is required, the right shared monitoring and visualization toolkit helps bring experts up to speed more quickly, and helps teams coordinate for faster problem-solving and mitigation.
Our previous post discussed how free monitoring solutions sometimes fail to provide effective tools for visualizing whole business services and the resilient clusters comprising them, and for determining whether (and by how much) infrastructure-layer failures impact business service availability and performance. Without such tools, the result can be over-alerting: forcing Ops to respond with all-hands urgency to low-risk/low-impact problems that could more efficiently be fixed during routine maintenance.
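The impact question above (does an infrastructure failure actually hurt the business service, and by how much?) can be made concrete with a small sketch. This is an illustrative model only, not the mechanism of any particular monitoring product: the `Cluster` type, the `min_healthy` threshold, and the three status names are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical status names, ordered by severity for roll-up purposes.
SEVERITY = {"OK": 0, "DEGRADED": 1, "CRITICAL": 2}


@dataclass
class Cluster:
    """A resilient group of hosts backing one tier of a business service."""
    name: str
    hosts: List[str]
    min_healthy: int  # assumed threshold: fewest healthy hosts that still serve traffic


def cluster_status(cluster: Cluster, host_up: Dict[str, bool]) -> str:
    """Evaluate one cluster from per-host up/down state."""
    # Hosts missing from host_up are treated as down (a conservative assumption).
    healthy = sum(1 for h in cluster.hosts if host_up.get(h, False))
    if healthy == len(cluster.hosts):
        return "OK"
    if healthy >= cluster.min_healthy:
        return "DEGRADED"   # redundancy absorbed the failure; fix during maintenance
    return "CRITICAL"       # the cluster can no longer do its job; page someone


def service_status(clusters: List[Cluster], host_up: Dict[str, bool]) -> str:
    """A service is only as healthy as its worst cluster."""
    statuses = (cluster_status(c, host_up) for c in clusters)
    return max(statuses, key=SEVERITY.__getitem__, default="OK")


if __name__ == "__main__":
    web = Cluster("web", ["web1", "web2", "web3"], min_healthy=2)
    db = Cluster("db", ["db1", "db2"], min_healthy=1)
    state = {"web1": True, "web2": False, "web3": True, "db1": True, "db2": True}
    print(service_status([web, db], state))
```

In this sketch, losing one of three web hosts yields a degraded but still-available service, which is the kind of low-impact problem that need not trigger an all-hands response. A real implementation would also weigh performance metrics, not just up/down state.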
Part of the reason free software solutions may lack business service monitoring (BSM) is that it’s a meta-capability: built atop a kit of tools that includes system modeling, object grouping, conditional metrics evaluation, and visualization. For some free monitoring platforms, such tightly-orchestrated toolkits just don’t exist. In other cases, paradoxically, the opposite problem arises: tools exist in abundance, an embarrassment of riches.
Nagios, for example, whose array of CGIs has tempted many organizations to try their hands at innovation, offers well over 100 community-supported projects and forks providing alternative WebUIs, cosmetic skins, dashboard capability, diagrams and network maps -- even attempts to cobble together BSM outside the Nagios engine. There are dozens of variant interfaces for mobile, for special-size screens, and for viewing critical status information at a distance in expansive NOCs. There are higher-order solutions that purport to aggregate status information from multiple, distributed Nagios instances on a single display.
Most solutions require a separate database -- MySQL is popular, though other, more obscure databases also show up as critical dependencies. Many solutions are built on PHP -- capable and trustworthy, and arguably state-of-the-art in the late 2000s, when many of these utilities first appeared. Some visualization solutions extend the functionality of earlier-released and simpler dashboards or utilities, creating an open-source flavor of ‘vendor lock-in.’ Some utilities generate output in XML or other intermediate forms that can then be visualized using other tools.
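To give a sense of the glue work that intermediate formats imply, here is a minimal sketch of consuming an XML status dump and reducing it to counts a dashboard could plot. The XML layout (a `status` root with `host` elements carrying `name` and `state` attributes) is invented for illustration and does not reflect the output of any specific Nagios add-on; only Python's standard `xml.etree.ElementTree` and `collections.Counter` are used.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical intermediate output; real utilities each define their own schema.
SAMPLE = """<status>
  <host name="web1" state="UP"/>
  <host name="web2" state="DOWN"/>
  <host name="db1" state="UP"/>
</status>"""


def summarize(xml_text: str) -> Counter:
    """Count hosts per state so the result can be handed to a charting tool."""
    root = ET.fromstring(xml_text)
    # Hosts without a state attribute are bucketed as UNKNOWN rather than dropped.
    return Counter(host.get("state", "UNKNOWN") for host in root.iter("host"))


if __name__ == "__main__":
    for state, count in sorted(summarize(SAMPLE).items()):
        print(f"{state}: {count}")
```

Even this small step has to be written, tested, and kept in sync with whatever schema the upstream utility emits, which is part of the maintenance burden at issue here.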
Using any of these components requires reading reviews (where available; many products are unreviewed), doing due diligence, picking from among competing offerings, setting up VMs, installing and tuning databases, integrating, testing, and maintaining the result attentively to avoid breaking the solution during upgrades. It may also require the additional, complex step of creating workflows (manual or automated) for extracting data, dashboards and displays to visualize it, and forward integrations to enable operators to act on it.
The result is the wrong sort of freedom: intoxicating, perhaps, to those who enjoy tinkering and have few competing priorities, but lengthening time-to-value, increasing maintenance work, and making monitoring ever more of a “special snowflake” (or snowstorm) of technical debt. This is inefficient for teams and organizations that need a more curated, thoughtfully integrated, well-documented, supported, and aggressively updated solution stack for visualization, investigation, and collaboration.