Part one of a series objectively examining important topics in contemporary data center monitoring, including observability, automation, and cost...
Avoid Data Disasters with Observability
There’s a lot being said about observability these days, particularly about how it differs from monitoring. What isn’t consistently highlighted is the importance of having both and, better still, of getting both in one place.
Monitoring is the systematic watchdog that ensures you know when things aren’t running well or aren’t running at all. Thoughtful monitoring has operational intelligence built in (like Opspacks) to determine what’s relevant for alerting and what’s not. Observability is the ability to see or work out what caused the event that triggered an alert, as with the Investigate option in Opsview.
As seen in Figure 1 below, the Opsview Host Navigator lets you investigate directly from the monitored system.
Data disasters typically take the form of corruption or total data loss. They are almost always preceded by an event, or series of events, that a monitoring implementation would have notified you about. Suppose you had monitoring in place and received alerts. Without some way to investigate (observe) the details behind an alert, many admins and operators are tempted to dismiss it as low priority at the time.
Corruption can come from failing hard disks, bad power supplies, or any of a multitude of other issues with physical systems. Data can also be corrupted by misbehaving applications. Knowing when your CPU, RAM or disk usage exceeds its thresholds can be the first preventative measure you take against data corruption.
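To make the idea of a threshold concrete, here is a minimal sketch of a disk usage check in Python. It is not an Opsview Opspack; it assumes the common Nagios-style plugin convention of status codes 0 (OK), 1 (WARNING) and 2 (CRITICAL), and the function names and the 80%/90% thresholds are hypothetical values chosen for illustration.

```python
import shutil

# Nagios-style status codes (an assumed convention for this sketch).
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a utilisation percentage onto a status code.

    The warn/crit defaults are illustrative, not recommended values.
    """
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def disk_percent_used(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/"):
    """Return (status_code, message) for the filesystem at `path`."""
    pct = disk_percent_used(path)
    status = classify(pct)
    return status, f"DISK {LABELS[status]} - {pct:.1f}% used on {path}"
```

A wrapper script could print the message from `check_disk("/")` and exit with the returned status code, so that a monitoring platform raises an alert before the disk fills and writes start to fail.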
There’s also the issue of outright data loss, which is usually caused by a total system failure or a combination of events. If your monitoring system only monitors infrastructure, it might not represent the whole picture of your estate’s status at a given time. The same is true if it only monitors applications. There are a lot of great monitoring tools out there, but it’s important to have a monitoring platform that can unify all of the metrics and alerts you’re receiving, particularly during a crisis.
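As a rough illustration of what unifying alerts means in practice, the sketch below merges alerts from two separate feeds into one time-ordered view per host. The `Alert` fields, the `unify` function and the sample data are all hypothetical; a real platform would do this with far more context.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    source: str   # which tool raised it, e.g. "infrastructure"
    host: str
    check: str
    state: str
    at: datetime


def unify(*feeds):
    """Merge any number of alert feeds into {host: [alerts in time order]}."""
    merged = sorted((a for feed in feeds for a in feed), key=lambda a: a.at)
    by_host = defaultdict(list)
    for alert in merged:
        by_host[alert.host].append(alert)
    return dict(by_host)


# Hypothetical sample data: two tools reporting on the same host.
infra_feed = [
    Alert("infrastructure", "db01", "disk /var", "WARNING", datetime(2024, 1, 1, 9, 0)),
    Alert("infrastructure", "db01", "disk /var", "CRITICAL", datetime(2024, 1, 1, 9, 20)),
]
app_feed = [
    Alert("application", "db01", "query latency", "WARNING", datetime(2024, 1, 1, 9, 10)),
]

timeline = unify(infra_feed, app_feed)
```

Read in order, the entries in `timeline["db01"]` show the application slowdown arriving between the two disk alerts, a relationship that is easy to miss when each tool reports in isolation.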
There are also management tools, such as Microsoft System Center Operations Manager, that have some monitoring capability, but that sort of tool is designed for management, not for monitoring. The comprehensive view, meaning the ability to monitor applications, infrastructure and network devices all in one place, is what brings the proper level of detail to an event. The ability to investigate the event directly from within the monitoring platform is what gives you observability.
When you’ve received an alert, you’ll want to investigate the cause of the event. In the Event Viewer you can directly investigate both the host that generated the event and the event itself (in Opsview terms, the Service Check). See Figure 2 below.
Perspective is important, whether you pass the alert to someone else or take action yourself. But most people would agree that maintaining healthy data is not a matter of perspective; it’s mandatory. Ongoing data integrity is a matter of insights, alerts and being proactive. Your data isn’t the only thing affected, though. We’ll be talking more about the intersection of monitoring and observability in upcoming posts, so be sure to subscribe and keep in touch.
If you’d like to review your current toolset or are interested in how Opsview can give you both unified insights and observability, contact us today. We love talking to people.