Monitoring Kubernetes with Opsview: Part 1
Kubernetes is an open source container orchestration framework inspired by “Borg,” the cluster management system that drives Google’s vast data centers. Drop Kubernetes onto a set of networked physical or virtual machines running Linux (or, for worker nodes, Windows Server), and it turns them into a resilient fabric for hosting and managing containerized workloads, scaling them manually and automatically, and keeping them (and Kubernetes itself) available despite software crashes and hardware failures.
Kubernetes is hugely customizable, providing interfaces for plugging in different container runtimes (e.g., Docker, containerd, rkt, LXD, Windows Containers), container networks (e.g., Flannel, Calico, Weave), volume storage, and other components. It’s DevOps-friendly, providing a simple but useful web dashboard, a powerful CLI, and a standardized REST API for home-grown automation.
Engineered to run at ‘planet scales,’ Kubernetes minimizes pointless labor (what Google, in their famous SRE Book, call “toil”) and works to enhance operator effectiveness in a host of ways. No surprise that Kubernetes is now explosively popular. It’s becoming a platform of choice for enterprise software development, continuous delivery, large-scale cloud-native application hosting, and as an underlay for PaaS and serverless computing frameworks.
As Kubernetes becomes more mission-critical, the need to monitor it increases. But the way Kubernetes works (and the kinds of workloads it hosts) changes the classic monitoring game in interesting ways.
For starters, Kubernetes works very hard (and very well) at keeping cloud-native applications available. Container storage is ephemeral and vanishes if a container crashes, so apps need to save critical data (and, as needed, their state) to a resilient/distributed database cluster or persistent volume store. So long as they do, Kubernetes will keep them available despite significant challenges.
If a containerized workload crashes, Kubernetes will restart its pod. If a worker node fails, it will redeploy affected pods on healthy nodes. Once a new node is provided and integrated into the cluster (a matter of minutes if done manually from a prepared base image, mere seconds if automated), Kubernetes will rebalance into the new capacity. If the entire cluster is suddenly powered down, Kubernetes will typically resume gracefully as soon as power is restored, bringing itself back up, reintegrating nodes, and starting up all the workloads again. In the meantime, advanced features like Kubernetes Federation can be exploited to bring the app up on another cluster, in a healthy availability zone.
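To make the "redeploy affected pods on healthy nodes" behavior concrete, here is a toy sketch in Python. It is not how the Kubernetes scheduler actually works: the function name `reschedule`, the pod and node names, and the "least-loaded node wins" rule are all assumptions made for illustration.

```python
from typing import Dict, Set


def reschedule(placement: Dict[str, str], healthy_nodes: Set[str]) -> Dict[str, str]:
    """Return a new pod -> node placement with pods moved off failed nodes.

    Toy model only: each pod on an unhealthy node goes to whichever healthy
    node currently hosts the fewest pods. The real scheduler weighs far more
    (resource requests, affinity rules, taints, and so on).
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes left to host pods")

    # Count pods already running on each healthy node.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = {}
    for pod, node in sorted(placement.items()):
        if node not in healthy_nodes:
            # Pick the least-loaded healthy node; break ties by node name.
            node = min(load, key=lambda n: (load[n], n))
            load[node] += 1
        new_placement[pod] = node
    return new_placement


# Hypothetical cluster: node-b has failed, node-c is a healthy spare.
before = {"web-1": "node-a", "web-2": "node-b", "web-3": "node-b"}
after = reschedule(before, healthy_nodes={"node-a", "node-c"})
```

From a monitoring point of view, the interesting part is that the workload stays up while the cluster quietly runs with one node fewer.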
Prometheus, a cloud-native monitoring framework, is here shown displaying a graph of gateway invocations against a function running on OpenFaaS, itself hosted on Kubernetes. APM gives insight into specifics of application-layer performance.
The Crisis of Observability
Vendors of application performance monitoring (APM) solutions have argued that this means the infrastructure doesn’t matter, and monitoring needs to refocus on workloads and their performance. From one perspective, this is (sort of) true, but brings new challenges. Containers in highly-dynamic, self-scaling environments (for example, the containers that scale out to run ‘functions’ in a Kubernetes-hosted Functions-as-a-Service platform like OpenFaaS) may be very numerous, live very short lives, and be hard to keep track of and connect to via ssh or other ‘agentless’ strategies that poll infrequently.
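A little arithmetic shows why short-lived containers slip past infrequent polling. The sketch below assumes a poller that fires on a fixed interval and a container whose start time is uniformly random relative to that schedule; the function name and the example numbers are hypothetical.

```python
def chance_seen_by_poller(lifetime_s: float, poll_interval_s: float) -> float:
    """Probability that at least one poll lands inside a container's lifetime.

    Assumes polls fire every `poll_interval_s` seconds and the container
    starts at a uniformly random offset within that interval.
    """
    if lifetime_s < 0 or poll_interval_s <= 0:
        raise ValueError("lifetime must be >= 0 and poll interval must be > 0")
    return min(1.0, lifetime_s / poll_interval_s)


# A function container that lives 5 seconds, polled every 5 minutes.
short_lived = chance_seen_by_poller(lifetime_s=5, poll_interval_s=300)

# A long-running service container polled on the same schedule.
long_lived = chance_seen_by_poller(lifetime_s=3600, poll_interval_s=300)
```

Under these assumptions the five-second container is seen by fewer than two polls in a hundred, while the long-running service is always seen.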
They’re not, in other words, very observable. Monitoring them, therefore, tends to require embedded agents that ‘phone home,’ sidecar containers with onboard agents (e.g., telegraf), or other highly-custom solutions that, in many cases, can only be implemented and maintained by the app’s developers (and/or by the developers of frameworks like OpenFaaS that host apps on top of Kubernetes). These, naturally, need to be paired with specialized charts, graphs, and visualizations, at the cost of even greater development effort.
Clearly, there are many cases where custom-crafted APM is required and important. Giving developers the job of monitoring, however, isn’t guaranteed to be productive for them, useful to operations, or indicative of real end-user experience. As Baron Schwartz, CEO of APM innovators VividCortex, noted in a presentation at April’s Percona Live event, some efforts end up collecting what he calls ‘vanity metrics’ instead.
The Opsview Kubernetes Opspack lets Opsview display analogous metrics from Kubernetes (shown here) and lower in the stack. Complementing APM, monitoring the full stack gives equally-important performance insight and proves fundamental availability, while requiring less knowledge about application internals and less customization of metrics take-offs and visualizations.
Four Golden Signals
Even Google, whose infrastructure is arguably the most commodified and gargantuan of any contemporary entity, doesn’t seem to believe in infinitely-customized and -differentiated APM. Their SRE Book describes a generic approach to monitoring that uses what they call “the four golden signals” (Latency, Traffic, Errors, and Saturation) to figure out how well or poorly a service is operating, and roughly why. Interestingly, those signals can often be monitored from outside the low-observability domain of the containers themselves.
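As a rough illustration of how the four signals can be derived from data gathered outside the containers (for example, a gateway's request log plus a resource reading), consider the sketch below. The `Request` record, the `golden_signals` function, the choice of a 95th-percentile latency, and the treatment of HTTP 5xx responses as errors are all assumptions for this example, not definitions from the SRE Book.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests: List[Request], window_s: float,
                   used: float, capacity: float) -> Dict[str, float]:
    """Summarize one observation window as the four golden signals.

    `used` and `capacity` describe whatever resource is most constrained
    (CPU cores, memory, connections); which one to pick is up to the operator.
    """
    if window_s <= 0 or capacity <= 0:
        raise ValueError("window_s and capacity must be positive")

    count = len(requests)
    if count:
        durations = sorted(r.duration_ms for r in requests)
        # Latency: nearest-rank 95th percentile of request durations.
        p95 = durations[math.ceil(0.95 * count) - 1]
        # Errors: share of requests that ended in a server error (5xx).
        error_rate = sum(r.status >= 500 for r in requests) / count
    else:
        p95 = 0.0
        error_rate = 0.0

    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_s,   # Traffic: requests per second
        "error_rate": error_rate,
        "saturation": used / capacity,     # Saturation: how full the resource is
    }


# Hypothetical two-second window with four requests, one of which failed.
sample = [Request(10, 200), Request(20, 200), Request(30, 500), Request(400, 200)]
signals = golden_signals(sample, window_s=2.0, used=1.5, capacity=2.0)
```

None of these inputs requires an agent inside the workload's containers, which is the point made above.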
It’s also important to remember that most of us aren’t Google. We don’t have infinite hardware, so hardware failures may reduce our cluster’s capacity by critical amounts, and may take us real time to mitigate. For most of us, therefore, the infrastructure does matter; though Kubernetes’ extraordinary resilience tends, here as well, to change the emphasis of monitoring from alerting (to trigger rapid disaster-mitigation) to resource and performance management.
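The capacity concern can be checked with simple arithmetic. The sketch below asks whether a cluster's requested CPU would still fit after losing its largest nodes. The function name and the node sizes are hypothetical, and the check ignores scheduling details such as fragmentation across nodes, so treat a "yes" as a necessary condition rather than a guarantee.

```python
from typing import List


def survives_node_loss(node_capacity_cpu: List[float], requested_cpu: float,
                       failures: int = 1) -> bool:
    """True if total requested CPU still fits after the largest nodes fail.

    Worst case is assumed: the `failures` biggest nodes are the ones lost.
    """
    if failures < 0:
        raise ValueError("failures must be >= 0")
    keep = max(0, len(node_capacity_cpu) - failures)
    # Sorting ascending and keeping the first `keep` drops the largest nodes.
    remaining = sorted(node_capacity_cpu)[:keep]
    return sum(remaining) >= requested_cpu


# Hypothetical three-node cluster with 4 CPU cores per node.
tight = survives_node_loss([4, 4, 4], requested_cpu=9)  # needs more than 8 cores
roomy = survives_node_loss([4, 4, 4], requested_cpu=8)  # fits on two nodes
```

Tracking this kind of headroom over time is an example of the shift from alerting toward resource and performance management.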
Regardless of whether (and how much) you invest in APM, we think it’s important to monitor the full Kubernetes stack as infrastructure -- doing so in a way that’s serviceable to operations and that leverages expert best-practice instead of requiring custom development.