Free Monitoring Solutions: Limited Insight
In our previous blog, we discussed some of the challenges free monitoring solutions present in basic implementation. Lack of CMDB and ITOM integration can slow you down by preventing easy ingestion of pre-existing, authoritative infrastructure configuration data. Lack of automation and unknown plugin quality can add further friction -- requiring more coding, more expertise, more care in solution validation, and more careful manual steps before you can feel confident that basic monitoring is working well.
Assuming your technical prowess is such that you can help your organization vault these barriers, you now have visibility into the health of individual hosts, VMs, and application instances. But modern enterprise IT systems and applications -- particularly the mission-critical kind -- don’t usually live on single hosts. Instead, they achieve resilience through clustering: distributing hosts across multiple fault domains and using clever networked software to enable failover. They achieve performance by load balancing and parallelization: scaling out monolithic application servers or individual microservices across available infrastructure, sometimes even doing so dynamically. And they share dependencies: for example, many enterprise applications may share a single authentication mechanism, access a centralized enterprise database cluster, or run on a single cloud or container orchestration framework, like OpenStack or Kubernetes.
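To make the shared-dependency point concrete, here is a minimal sketch of the kind of model a host-level monitor typically lacks: a map from each service to what it depends on, plus a walk that finds everything affected when one shared component fails. The service names and the `DEPENDS_ON` map are hypothetical, invented for illustration; a real deployment would more likely pull this data from a CMDB.

```python
# Hypothetical dependency map: service -> components it depends on directly.
# These names are illustrative assumptions, not taken from any real system.
DEPENDS_ON = {
    "crm": ["auth", "db-cluster"],
    "billing": ["auth", "db-cluster"],
    "wiki": ["auth"],
    "auth": ["db-cluster"],
}


def affected_by(component, depends_on):
    """Return every service that depends on `component`, directly or transitively."""
    affected = set()
    frontier = [component]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected


# A problem in the shared database cluster touches every service in this map,
# including "wiki", which only reaches it through "auth".
print(sorted(affected_by("db-cluster", DEPENDS_ON)))
```

A monitor that only tracks individual hosts would report the database nodes as unhealthy but would not say which business services sit downstream of them.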
Free monitoring solutions often can’t model resilient clusters and complex business services with many interdependent elements. Part of the purpose of distributed application architectures is that they avoid single points of failure. Your enterprise CRM solution doesn’t keel over when one web server in a multi-server, load-balanced tier goes south, or when one server in a database cluster dies.
Free monitoring software tends to focus on individual components, and may not be easily configurable to model system-level resiliency and service interdependence, or to estimate the performance impacts of partial cluster failures. That means you need to choose between two unsatisfactory options. You can configure the monitoring platform to alert on failure of any individual cluster component, risking over-alerting, operator exhaustion, and the costs that come with them. Or you can suppress alerts on clustered elements and trust that the clustering technology will keep critical services available 100% of the time. Do you feel lucky?
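A middle path between those two options is to evaluate the cluster as a unit and let severity depend on how many members remain healthy. The sketch below is one assumed way to express that; the `Cluster` type, the `min_healthy` threshold, and the status names are hypothetical and not drawn from any particular monitoring product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    members: dict[str, bool]  # member name -> currently healthy?
    min_healthy: int          # assumed: fewest members needed to carry the load


def cluster_status(cluster: Cluster) -> str:
    """Summarize cluster health instead of alerting on each member separately."""
    healthy = sum(cluster.members.values())
    if healthy < cluster.min_healthy:
        return "critical"   # the service itself is at risk: page someone
    if healthy < len(cluster.members):
        return "degraded"   # redundancy is reduced: ticket, not a page
    return "ok"


web_tier = Cluster(
    name="crm-web",
    members={"web1": True, "web2": False, "web3": True},
    min_healthy=2,
)
print(cluster_status(web_tier))
```

With one of three web servers down and two required, this reports `degraded` rather than either paging an operator or staying silent.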
Free monitoring platforms often can’t be configured to estimate performance impacts of infrastructure-level failures on business service availability. Though distributed applications are designed to minimize single points of failure, any infrastructure-level issue will typically have performance impacts. Sometimes those will be enough to affect service levels in meaningful ways. But free monitoring platforms are seldom configurable to measure these impacts, determine whether SLAs or SLOs are in jeopardy, or provide an unambiguous record of success or failure in meeting service-level obligations over particular spans of time. That makes it harder to assert compliance or limit the reputational or financial impacts of availability issues, particularly if these don’t result in applications being completely down.
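As a rough illustration of what such a record could look like, the sketch below scores a window of synthetic probes against a service-level objective. It counts a probe as good only if it succeeded and stayed under a latency threshold, so a slow-but-up service still registers as a miss. The `Probe` type, the 99.9% target, and the 500 ms threshold are all assumptions chosen for the example.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    succeeded: bool
    latency_ms: float


def slo_report(probes, target=0.999, max_latency_ms=500.0):
    """Score a window of probes against an availability target.

    `target` and `max_latency_ms` are illustrative defaults, not recommendations.
    """
    if not probes:
        raise ValueError("no probes in the reporting window")
    good = sum(1 for p in probes if p.succeeded and p.latency_ms <= max_latency_ms)
    observed = good / len(probes)
    return {
        "good": good,
        "total": len(probes),
        "observed": observed,
        "met": observed >= target,
    }


# 990 fast responses plus 10 that succeeded but took two seconds each:
# nothing was ever down, yet the window misses a 99.9% objective.
window = [Probe(True, 120.0)] * 990 + [Probe(True, 2000.0)] * 10
report = slo_report(window)
print(report["good"], report["total"], report["met"])
```

Keeping one such report per period gives a simple, auditable trail of whether the objective was met, including periods where service was degraded rather than unavailable.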