The Value of Visibility in Your Data Center Monitoring

This article, written by Opsview Content Lead, John Jainschigg, originally appeared in InformationAge.com. The series objectively examines important topics in contemporary data center monitoring, including observability, automation, and cost control.

Keeping key enterprise applications up and running well is an absolute requirement for modern business. As estimated by Gartner, IDC and others, the cost of IT downtime averages out to around $5,600 per minute. A simple infrastructure failure might cost around $100,000, while failure of a critical, public-facing application costs more like $500,000 to $1 million per hour. When failures impact large-scale global logistics and cause widespread inconvenience to customers, as in last May’s failure of British Airways’ operations systems, costs can quickly become staggering. BA estimated losing $102.19 million in hard costs, including airfare refunds to stranded passengers, plus incalculable damage to its reputation. BA’s parent company, IAG, subsequently lost $224 million in value, based on its then-current stock valuation.

Preventing such disasters, or intervening effectively and rapidly when they occur, means giving developers and operations staff (DevOps) visibility into IT infrastructure, networks, and applications. Modern IT monitoring solutions provide this visibility in many ways, including:

Issue: Ingest and Discovery - Manually configuring monitoring for hundreds or thousands of hosts is a time-consuming and potentially error-prone process. Operators sometimes lack a complete picture of all the hosts, apps, and business services in their purview. Solution: Data center monitoring systems are increasingly able to automate configuration or infer information from discovery, configuration management databases (CMDBs), deployment tools, cloud APIs and other sources. This helps operators identify and label entities, visualize dependencies, and configure monitoring, quickly and accurately, throughout the hybrid (i.e., on-premises and cloud-based) data center. Discovery may be done using WMI (Windows Management Instrumentation), SNMP network discovery, and other technologies.
Issue: Summary status display - Operators need ‘single panes of glass’ that aggregate lots of status information on monitored systems, letting them spot issues quickly and drill down to determine root causes. Solution: Mature IT monitoring platforms provide collapsible, outline-style summary displays or scheduled reports that let operators hide or reveal meaningful subsets of information about monitored hosts and systems. Color-coded popups draw attention to issues. Clickable labels offer quick access to details of individual service checks, graphs, raw event logs and troubleshooting tools.
Issue: Dashboards - Too much monitoring data, too densely aggregated, can be hard to work with. Operators need to be able to quickly visualize key metrics and status information. Solution: Valuable IT monitoring systems let you create customizable dashboards with graphical widgets isolating specific hosts, metrics, and KPIs. Read-only access to prepared dashboards can be distributed to keep key stakeholders aware of application status, SLA compliance, etc.
Issue: Business service monitoring - IT and DevOps need to be able to visualize the status of all infrastructure elements and systems involved in providing key business services. Solution: Business Service Monitoring (BSM) is an enhanced dashboarding capability that lets operators create interactive views of complex application ‘stacks’ (e.g., the load balancers, web/application servers, database clusters, network gear and other elements supporting a typical, scaled-out, highly available, tiered application). It’s ideal for keeping responsible developers, product managers, and others apprised of the status of applications they own, and for empowering them to help effectively if system status begins to degrade.
Issue: Reporting - Realtime status visualizations don’t tell the whole story. Proactive management and planning also means being able to view system-wide status, resource consumption trends and other information. Solution: Comprehensive reporting lets operators track compliance. It offers insight into service level agreements and objectives and into scheduled maintenance and upgrades, helps keep track of costs, and supports budgeting for scale-out, among many other uses.
Issue: Alerting - Severe issues may require immediate operator attention, around the clock. Solution: Almost all IT monitoring solutions offer alerting via pager, email, and text messages. Many also integrate directly with on-call management systems and services. The ability to properly route the right alert to the right person at the right time is vitally important. Enterprise monitoring platforms either have this capability or integrate with proven solutions that ensure the right people have insights at the right time.
Issue: Mobility - Tying down operators to NOCs and desks is bad for morale and productivity. Solution: The best IT monitoring solutions offer useful mobile applications that let operators view status, key business services and other dashboards, and respond to alerts and notifications from anywhere.
Issue: Notifications and outbound integration - Once status info is aggregated from monitored systems, how can issues be originated, tracked, assigned, collaborated over, and resolved? Solution: Top monitoring platforms offer an increasingly broad set of integrations with popular enterprise and SMB issue-tracking, service desk, and IT process management solutions. Look for integration with Slack, ServiceNow, Puppet, Ansible, etc., in an enterprise monitoring platform. Ask about extensibility: “Can the platform easily extend its capabilities to integrate with future solutions?”
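
To make the business service monitoring item above more concrete, here is a minimal Python sketch of how a service-level status might be rolled up from the hosts behind it. The `Tier` class, the `min_healthy` rule, and the host names are all hypothetical and chosen for illustration. Real BSM products define their own rollup rules.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Tier:
    """A redundant group of hosts, e.g. web servers behind a load balancer."""
    name: str
    host_status: Dict[str, Status]  # host name -> latest check result
    min_healthy: int                # hosts that must be OK for full service

    def status(self) -> Status:
        healthy = sum(1 for s in self.host_status.values() if s == Status.OK)
        if healthy == 0:
            return Status.CRITICAL   # nothing left to serve traffic
        if healthy < self.min_healthy:
            return Status.WARNING    # still serving, but redundancy is gone
        return Status.OK


def service_status(tiers: List[Tier]) -> Status:
    """Assumed rule: a service is only as healthy as its worst tier."""
    return max((t.status() for t in tiers), default=Status.OK)


# Hypothetical three-tier application.
shop = [
    Tier("load-balancers", {"lb1": Status.OK, "lb2": Status.OK}, min_healthy=1),
    Tier("web", {"web1": Status.OK, "web2": Status.CRITICAL,
                 "web3": Status.CRITICAL}, min_healthy=2),
    Tier("database", {"db1": Status.OK, "db2": Status.OK}, min_healthy=1),
]
print(service_status(shop).name)  # WARNING
```

Under this rule, a single failed host in a redundant tier does not turn the whole service red, so a degraded but functioning service is easy to tell apart from an outage.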

Minimum Signals in Data Center Monitoring

Doing data center monitoring right means not seeking to visualize every possible signal. Ideally, monitoring makes visible a minimum subset of signals giving maximum actionable insight:
Each metric collected comes with associated hard and soft costs. As IT estates grow in size and complexity, overheads associated with gathering, processing, storing, analyzing, displaying, querying, and reporting on metrics all increase. This eventually impacts application, network, and/or monitoring system performance.
Excess visibility also imposes significant cognitive load on operators. Too many complex, rarely used, or operationally irrelevant metrics can camouflage important signals (alerts), slowing effective incident response. Lack of selectivity about what signals to make visible, and how to evaluate and call attention to them, can easily lead to excessive alerting. That promotes alert fatigue and burnout, and eventually causes real incidents to be ignored through a sort of “crying wolf” syndrome.
Operator time consumed investigating non-critical incidents is time lost to more important and impactful work, such as building automation. Put simply: getting visibility wrong costs real money, and can hamstring innovation.
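
One common way to limit the excessive alerting described above is to notify only after a check has failed several times in a row, and only once per incident. The Python sketch below illustrates that idea. The `AlertGate` class, its default threshold of three, and the check name are assumptions made for this example, not the behavior of any particular product.

```python
from collections import defaultdict
from typing import Optional


class AlertGate:
    """Notify only after `threshold` consecutive failures, once per incident."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = defaultdict(int)  # check name -> consecutive failures
        self.alerted = set()              # checks with an open incident

    def record(self, check: str, ok: bool) -> Optional[str]:
        """Return "ALERT" or "RECOVERY" when a notification is due, else None."""
        if ok:
            self.failures[check] = 0
            if check in self.alerted:
                self.alerted.discard(check)
                return "RECOVERY"
            return None
        self.failures[check] += 1
        if self.failures[check] >= self.threshold and check not in self.alerted:
            self.alerted.add(check)
            return "ALERT"
        return None


# One transient blip, then a sustained failure, then recovery.
results = [False, True, False, False, False, False, True]
gate = AlertGate(threshold=3)
events = [gate.record("web1/http", ok) for ok in results]
print(events)  # [None, None, None, None, 'ALERT', None, 'RECOVERY']
```

The transient failure at the start never reaches an operator, and the sustained failure produces exactly one alert and one recovery notice instead of one message per failed check.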

Maximum Insight

Enormous knowledge and experience are needed to identify the necessary-and-sufficient signals to optimally monitor a given type of infrastructure, application, or business service. Without the proper tooling, understaffed, time-stretched IT teams are often hard-pressed to provide this level of assurance.
Top-flight IT monitoring solutions bridge the knowledge gap by packaging up optimal metric sets in modules or plugins, enabling best-practice-compliant monitoring to be set up quickly and with confidence. For example, using a plugin, an operator can quickly implement the 20 to 40 service checks needed to monitor health, performance, and resource consumption by a MySQL database.
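
As a rough illustration of what one such service check might look like, the Python sketch below assumes the widely used Nagios-style plugin convention, in which exit codes 0, 1, 2, and 3 mean OK, WARNING, CRITICAL, and UNKNOWN. The article does not specify a plugin format, so that convention is an assumption here. The connection-usage check, its thresholds, and its input numbers are hypothetical, and a real plugin would query the database rather than take values as arguments.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a metric value to a status code (higher value = worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_connection_usage(active, max_connections, warn=80.0, crit=95.0):
    """Hypothetical check: percentage of allowed database connections in use."""
    if not max_connections:
        return UNKNOWN, "UNKNOWN - max_connections not available"
    pct = 100.0 * active / max_connections
    code = evaluate(pct, warn, crit)
    # Text after '|' is performance data that a monitoring server can graph.
    message = (f"{LABELS[code]} - {pct:.1f}% of connections in use"
               f" | conn_pct={pct:.1f}%;{warn};{crit}")
    return code, message


# Illustrative numbers only. A real plugin would print the message
# and then exit with `code` as its process exit status.
code, message = check_connection_usage(active=130, max_connections=151)
print(code, message)
```

A plugin that bundles 20 to 40 checks is, in effect, a curated collection of functions like this one, each shipped with thresholds that someone experienced has already chosen.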
Less mature application performance monitoring (APM) systems and open source toolchains are used by developers to instrument software under construction, and to visualize application state in test and production environments. APM solutions are usually not very helpful for operators who know little about application specifics, and whose job is to keep numerous complex applications running smoothly. Unlike IT operations monitoring, APM systems are diverse, and hew to a wide range of standards. For example, there are literally dozens of open source servers, exporters, scrapers, and other tools designed to extract metrics from HAProxy (a popular open source proxy server/load balancer) for consumption by Prometheus (a popular metrics monitoring and time-series database system).
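
For readers unfamiliar with what such exporters produce, the sketch below parses a small sample in the Prometheus text exposition format and pulls out one operationally meaningful signal. The sample data and metric names are illustrative rather than the output of any specific HAProxy exporter, and the parser is deliberately simplified (it does not handle escaped quotes or timestamps).

```python
import re

# Illustrative sample; metric names and values are assumptions.
SAMPLE = '''\
# HELP haproxy_frontend_current_sessions Current number of active sessions.
# TYPE haproxy_frontend_current_sessions gauge
haproxy_frontend_current_sessions{frontend="www"} 42
haproxy_backend_up{backend="app"} 1
haproxy_backend_up{backend="static"} 0
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Yield (name, labels, value) per sample line; skip comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            yield m.group("name"), labels, float(m.group("value"))


# Reduce the raw metrics to one actionable signal: which backends are down?
down = [labels["backend"] for name, labels, value in parse_metrics(SAMPLE)
        if name == "haproxy_backend_up" and value == 0]
print(down)  # ['static']
```

The format is easy to emit and consume, which helps explain why so many competing tools exist. The work of deciding which of the resulting metrics matter is left to whoever consumes them.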

Observability

Monitoring and visibility deal with “known unknowns” -- the well-understood performance characteristics/indicators and known hard failure modes of applications and components. Meanwhile, the term observability is now used in discussing the superset of visibility that includes “unknown unknowns.” In particular, this refers to challenges in understanding and managing the behavior of dynamic, self-scaling, resilient, distributed applications. Basically, visibility is knowing which of a set of predictable issues might be occurring (enabling remediation), whereas observability gives you the insight to figure out what’s going on (enabling further inquiry).
Enterprise monitoring solutions are working hard to provide plugins and modules that make the inner workings of container orchestration and related systems more visible. At the same time, top players are evaluating a range of strategies for extracting a few important signals from distributed and containerized applications, making them observable. We’ll discuss some of these methods in upcoming columns.