Part one of a series objectively examining important topics in contemporary data center monitoring, including observability, automation, and cost...
Keep Apps Available and Fix them Faster with Infrastructure Monitoring and APM
With rapid evolution happening across the software and operations spectrum, most larger organizations today find themselves managing technologically diverse IT estates. These may, and often do, comprise the following (an incomplete list):
- Multiple on-premises data centers, running a mix of:
  - Bare-metal servers running key enterprise messaging, authentication, document management, central virtualized desktop, backup/archive and other apps and services, many in three-tier clusters for resiliency, some spread across multiple availability zones or mirrored across several locations.
  - Infrastructure-as-a-Service frameworks (e.g., VMware, perhaps OpenStack in some cases, many topologies now possible, including hosted control plane/on-premises compute/storage nodes), hosting a mix of:
    - Long-running VMs hosting further enterprise and intranet apps and services in three- and two-tier arrangements
    - Ephemeral (perhaps self-serviced) developer platforms, delivered singly or multiply
    - (Possibly) Developer or experimental container orchestration frameworks on VM substrates
  - Core network, power management, security and other infrastructure
- Public cloud-based data centers, running:
  - Dev/Test and (possibly) production VM application clusters
  - Elastic extensions of core enterprise applications
  - (Possibly) Dev/Test and Production container workloads on hosted orchestration
  - Cloud object storage, cloud SQL or other database, PaaS, functions-as-a-service (e.g., Lambda) and other advanced application hosting and support services
  - Offsite backup and other business continuity services
- Remote, co-located equipment aggregations, managed equipment, and other things you need to think about monitoring because they’re not full-service SaaS, etc.
- Remote site equipment (e.g., IP PBX analog port extenders, voicemail systems, etc.), facilities, security and other dispersed systems
Subsystems running within and across these complex physical and virtual domains may represent many different generations of IT software evolution, a range of deployment architectures and operational models, a mix of Linux and Windows platforms, and a plethora of components, some proprietary, some open source, some well-supported, and others hardly supported at all. Some of these components and their architected interoperations are intimately understood by your current IT and DevOps teams, others less so. Some of the older stuff may not even be comprehensively documented.
Different parts of this complex environment are evolving at different speeds, and for very different reasons. Some are powerfully strategic, exciting, new and challenging, so are receiving current investment and high levels of IT/DevOps attention. Others are arguably mission-critical, but maintained in a “keep-the-lights-on-but-don’t-waste-cycles-babysitting-this” mode. Still others are less than mission-critical, but represent time spent developing and maintaining what may be productive and important solutions at the departmental level. They’re mostly cared about by the people who use them.
Multiple Viewpoints on Value
The gradual redefining of IT into DevOps, and the shift to cloud have encouraged much recent discussion about what needs to be monitored, and how monitoring should work. In the new view, attention paid to underlying infrastructure is disparaged (some pundits even suggest deprecating it) in favor of monitoring applications in order to optimize their performance and cost-efficiency. This is the province of Application Performance Monitoring (APM).
This viewpoint makes perfect sense the closer you get to a fully-evolved, container-oriented, cloud-only world such as described by Google in their (excellent, and free to read) book on Site Reliability Engineering. Inarguably, applications are what produce value. In a public-cloud-modeled world where resources are closely metered (and perhaps billed for based on usage), applications generate costs directly. And if the underlying cloud substrate is fully abstracted and converged, or you purchase cloud from a provider willing to maintain the fiction that all units of a given resource type have equivalent value, cost accounting and optimization can even be conceptually simple: more about real efficiency than clever arbitrage.
In such advanced environments, mere availability of apps is no longer the focus of monitoring. Availability is mostly ensured by the infrastructure, which does its auto-restarting, self-healing thing automatically. While Google’s Site Reliability Engineers (SREs) do sometimes need to deal with apps going completely dark, this is (by design and serious investment) quite rare for them, and tends to be caused by outside forces (like widespread internet outages) beyond even their control.
Costs, meanwhile, are now at least partly incurred based on decisions the infrastructure makes autonomously, scaling up or back based on resource or performance metrics. The possibility that significant charges might be incurred as the result of an unanticipated traffic spike (or a bug in a scaling algorithm, or a bad metric) obliges serious concern about cost/performance optimization. Impacts of inefficiency and error are no doubt magnified by the increased speed of release cycles and the difficulty (not going away soon) of testing cloud-scale apps at actual cloud scales prior to production release. DevOps thought leader Cindy Sridharan has recently written about this issue.
Highly dynamic containerized workloads, meanwhile, can be hard to monitor by traditional methods. Depending on application and platform dynamics, traffic conditions and other variables, a given container may exist only for seconds, meaning that it can’t be remotely polled at rates used for long-running infrastructure and components. For this reason, modern APM systems heavily favor embedding monitoring agents directly into components or building them into containers. When containers start, the agent can then negotiate with the monitoring platform to report the new entity’s status and begin transmitting desired metrics, commonly expressed as time-series data.
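As a rough illustration of the push model described above, the sketch below shows an agent that announces itself when its container starts and then pushes time-series points without waiting to be polled. Everything here is hypothetical: the `EmbeddedAgent` class, the sample fields, and the `sink` callable (which stands in for whatever ingest endpoint a real monitoring platform exposes) are assumptions made for illustration, not any vendor's actual agent API.

```python
import time
from typing import Callable, Dict, List

# Hypothetical sample shape: one event or time-series point tagged with its source.
Sample = Dict[str, object]


class EmbeddedAgent:
    """Illustrative push agent living inside a short-lived container."""

    def __init__(self, entity_id: str, sink: Callable[[Sample], None]):
        self.entity_id = entity_id
        # 'sink' stands in for the monitoring platform's ingest endpoint (assumed).
        self.sink = sink

    def register(self) -> None:
        # Announce the new entity as soon as the container starts.
        self.sink({"entity": self.entity_id, "event": "registered", "ts": time.time()})

    def report(self, metric: str, value: float) -> None:
        # Push one time-series point immediately; nothing waits for a poll cycle.
        self.sink({"entity": self.entity_id, "metric": metric,
                   "value": value, "ts": time.time()})


# An in-memory list plays the role of the monitoring platform.
received: List[Sample] = []
agent = EmbeddedAgent("container-7f3a", received.append)
agent.register()
agent.report("requests_total", 12.0)
agent.report("latency_ms", 48.5)
```

Because the agent initiates every exchange, a container that lives for only a few seconds can still leave a usable record behind, which a fixed polling interval could easily miss.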
It can also be difficult to determine what contribution a containerized component is actually making to anything you really care about (e.g., to application performance). Even an application’s developers may be hard-pressed to identify metrics that are truly meaningful. As a result, DevOps teams sometimes tend to monitor too many things, creating increased monitoring traffic, ballooning databases, and (on pay-by-service-check SaaS monitoring platforms) increasing costs. The Google SRE book contains a very interesting exposition of how dynamic distributed systems can be minimalistically but usefully monitored, involving use of four metrics they term “the four golden signals” (latency, traffic, errors, and saturation).
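To make the four signals concrete, here is a minimal sketch that derives them from a window of request records. The `Request` type, the choice of a nearest-rank 95th percentile for latency, and the use of CPU used versus CPU limit as the saturation measure are all assumptions made for this example; the SRE book does not prescribe these particular calculations.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_limit: float) -> Dict[str, float]:
    """Summarize one observation window as latency, traffic, errors, saturation."""
    if not requests or window_s <= 0 or cpu_limit <= 0:
        raise ValueError("need at least one request and positive window/limit")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile (an assumed choice; other percentiles work too).
    rank = max(1, math.ceil(0.95 * len(latencies)))
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests),
        # Saturation approximated as CPU used over CPU limit (assumption).
        "saturation": cpu_used / cpu_limit,
    }


# Ten requests over a five-second window, one of which failed.
window = [Request(latency_ms=10.0 * i, ok=(i != 4)) for i in range(1, 11)]
signals = golden_signals(window, window_s=5.0, cpu_used=1.5, cpu_limit=2.0)
```

Four numbers per service per window is a far smaller monitoring footprint than instrumenting every internal counter, which is the point of the minimalist approach.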
Almost certainly – now or soon – some of the operations of most larger organizations will take on this new emphasis, and will adopt APM as an important part of a proactive monitoring strategy. It’s a mistake, however, to think that APM alone is a fully-adequate solution, or that what it brings to the table is entirely new.
Distilled Operational Intelligence
In most organizations, as distinct from Google, diversity and complexity are the rule: lots of different components working in lots of different stacks in lots of different places. Realistically, only relatively few of these monitoring targets – as critical as each may be in practice – will be deeply understood by DevOps practitioners or considered worthy of serious investment to understand them better in the numerous contexts in which they may be applied.
For this reason, it’s vital that organizations be enabled to monitor this huge variety of things easily, and to do so in ways that express evolved best practice, both about how to monitor and about what metrics are helpful. At Opsview, we call this “distilled operational intelligence,” and we provide it by shipping our product with several hundred built-in monitoring packs (called “OpsPacks”) onboard, and by curating a collection of several hundred additional, purpose-engineered OpsPacks.
Using built-ins and drawing on our OpsPack library, new users can quickly (but also adequately) monitor compute, storage, and network hardware, Windows and Linux operating systems, popular open source and proprietary databases, public cloud platforms like Amazon and Azure, on-premises cloud frameworks like VMware and OpenStack, web and application servers, container engines like Docker, orchestrators like Kubernetes, plus dozens of specialized applications and components that make up part of modern stacks. Each unit of monitoring intelligence represents the accumulated knowledge and experience of our own engineers, our customers, and a broad community of allied users. Beyond this, Opsview Monitor is compatible with the open source Nagios platform, placing several thousand more community-supported plugins at users’ disposal.
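For readers unfamiliar with what a Nagios-style plugin looks like, the sketch below follows the widely used plugin convention: a check reports state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and prints one status line, optionally followed by performance data after a `|`. The `check_usage` function, its thresholds, and the `DISK` label are hypothetical and are not taken from any shipped OpsPack.

```python
from typing import Tuple

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_usage(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Return (exit_code, status_line) for a percentage-used style check."""
    if not 0 <= used_pct <= 100:
        return UNKNOWN, "DISK UNKNOWN - usage value out of range"
    if used_pct >= crit:
        state = CRITICAL
    elif used_pct >= warn:
        state = WARNING
    else:
        state = OK
    # Text before '|' is for humans; text after it is performance data
    # in the form label=value[unit];warn;crit.
    line = (f"DISK {LABELS[state]} - {used_pct:.0f}% used"
            f" | used_pct={used_pct:.0f}%;{warn:.0f};{crit:.0f}")
    return state, line


code, line = check_usage(85.0)
```

A real plugin would print `line` and exit with `code`; the monitoring server reads both to decide what to display and whether to notify.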
Collaborating with APM
Mature IT infrastructure and application monitoring will notify effectively about system health, impending issues, and outages – using messaging to summon aid, presenting summary information in easy-to-interpret, color-coded dashboards, and permitting rapid, point-and-click drill-down for graphics and interactive root cause analysis. Some products, including Opsview Monitor, can also be quickly customized to aggregate and display information about collections of infrastructure and stacks of components supporting specific business services and applications. Called Business Service Monitoring, or BSM, this provides the complement to APM’s internal view of distributed service availability.
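One way to picture this kind of business-service roll-up is as a two-step aggregation: each redundant component is judged by how many of its members are healthy, and the service takes the worst state among its components. The sketch below is a generic illustration under those assumptions; the function names and the `min_healthy` rule are hypothetical and do not describe how Opsview Monitor's BSM is actually implemented.

```python
from typing import Dict, List

OK, WARNING, CRITICAL = 0, 1, 2


def component_state(member_states: List[int], min_healthy: int) -> int:
    """Judge a redundant component by how many of its members are OK."""
    healthy = sum(1 for s in member_states if s == OK)
    if healthy < min_healthy:
        return CRITICAL          # not enough members left to deliver the function
    if healthy < len(member_states):
        return WARNING           # degraded, but redundancy is absorbing the failure
    return OK


def service_state(components: Dict[str, int]) -> int:
    """A business service is only as healthy as its worst component."""
    return max(components.values())


# Hypothetical service: a three-node web tier (needs two) and a mirrored database (needs one).
components = {
    "web": component_state([OK, CRITICAL, OK], min_healthy=2),
    "db": component_state([OK, OK], min_healthy=1),
}
overall = service_state(components)
```

Here one failed web node degrades the service to WARNING rather than CRITICAL, which matches how operators usually want to be told about lost redundancy as opposed to lost service.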
IT infrastructure monitoring can serve an important function by effectively monitoring the databases and storage systems that stateless apps use to store transactions or persist state. Monitoring connected databases (themselves still rarely containerized) can be a simple way of obtaining useful information about container apps that are otherwise difficult to monitor directly.
By interrogating the APIs of public cloud providers (e.g., Amazon CloudWatch), an IT infrastructure monitoring solution can summarize reported metrics from (portions of) your cloud estate, optionally combining these with metrics from on-premises systems to provide a coherent picture of how fully-hybrid or cloud-resident applications are performing. With equal ease, it can monitor cloud-based virtual machines and applications directly, providing a back-check on cloud-provider-reported performance and utilization data.
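The back-check idea can be sketched as a simple comparison between what the provider reports and what direct monitoring measures for the same hosts. In the example below, the host names, the CPU figures, and the 10-point tolerance are invented for illustration, and the two dictionaries stand in for data that would really come from the provider's API and from directly installed checks.

```python
from typing import Dict, List


def back_check(provider: Dict[str, float], direct: Dict[str, float],
               tolerance: float = 10.0) -> List[str]:
    """List hosts whose provider-reported and directly measured values diverge."""
    flagged = []
    # Only hosts present in both sources can be compared.
    for host in sorted(provider.keys() & direct.keys()):
        if abs(provider[host] - direct[host]) > tolerance:
            flagged.append(host)
    return flagged


# Hypothetical CPU utilization percentages from the two sources.
provider_cpu = {"vm-a": 40.0, "vm-b": 35.0, "vm-c": 80.0}
direct_cpu = {"vm-a": 42.5, "vm-b": 71.0}
suspect = back_check(provider_cpu, direct_cpu)
```

Hosts that appear in only one source (such as `vm-c` here) are skipped by this sketch; in practice they may deserve their own report, since a host the provider bills for but that is not directly monitored is itself worth knowing about.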
Container engines and orchestrators can be monitored both at the API level and in terms of underlying hardware state (be this physical or virtual). Metrics returned from container orchestrators include most of the “four golden signals” Google considers important, including time-series measurements of requests, latency, and indications of saturation. Normally, these metrics pertain to the entire orchestrator cluster, but they can be customized to reflect only a given container namespace. It’s therefore possible to tune these derived metrics, without modifying containerized code, to reflect only the performance and health of a given target application.
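Scoping cluster-wide metrics down to one namespace amounts to filtering on a label before aggregating. The sketch below assumes each sample carries `namespace`, `metric`, and `value` fields; that shape, the `summarize` function, and the namespace names are hypothetical and are not the output format of any particular orchestrator API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional


def summarize(samples: Iterable[dict], namespace: Optional[str] = None) -> Dict[str, float]:
    """Sum metric values cluster-wide, or for a single namespace if one is given."""
    totals: Dict[str, float] = defaultdict(float)
    for s in samples:
        if namespace is None or s["namespace"] == namespace:
            totals[s["metric"]] += s["value"]
    return dict(totals)


# Hypothetical samples as they might be collected from an orchestrator.
samples = [
    {"namespace": "shop", "metric": "requests", "value": 120.0},
    {"namespace": "shop", "metric": "errors", "value": 3.0},
    {"namespace": "billing", "metric": "requests", "value": 40.0},
]
cluster_view = summarize(samples)
shop_view = summarize(samples, namespace="shop")
```

Note that the filtering happens entirely on the monitoring side, so the containerized application itself does not need to change.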
By providing easily configured, flexible monitoring that captures significant information and makes it readily available for review, IT infrastructure monitoring closes the loop with APM, enabling informed interpretation of application-level metrics that would otherwise be produced in a vacuum.