SLA monitoring with Nagios and Opsview

This blog is written by Sam Marsh, Product Manager at Opsview.

One word that comes up again and again in discussions with enterprise customers is availability. But what does it actually mean?

Availability is a measure of how much of a given period (hours in a day, weeks in a year, and so forth) my IT was operational and functional, e.g. 'my website was accessible and people could buy products from it for 99.5% of the year', with the other 0.5% accounted for by outages and planned downtime.

In IT, we talk about SLAs: Service Level Agreements. This is where a vendor (hosting company, cloud provider or managed service provider) agrees with the customer, as part of the contract, what the availability of that service will be. In essence, the tighter the SLA, the more money you’ll usually pay. For example, a 97.5% SLA may cost you £100 a year (arbitrary figure), whereas a 99.999% (“five nines”) SLA could cost you upwards of £100,000 a year, as this level of agreement essentially says:

The maximum allowable interruption of service will be 5.26 minutes per year or less.

An MSP agreeing to a 99.999% SLA, and thus tying themselves into providing a service so resilient that it will be interrupted for little more than five minutes per year, will require a high contract value to offset this risk.
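To see where the 5.26 minutes comes from, here is a minimal sketch of the arithmetic in Python. It assumes an average year of 365.25 days; the function name and the list of example SLA levels are illustrative only and are not part of Opsview.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year, leap years included


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, that an SLA leaves over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (100 - sla_percent) / 100


# Print the yearly downtime budget for a few illustrative SLA levels.
for sla in (97.5, 99.5, 99.999):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.2f} minutes per year")
```

Passing a different period_minutes (for example 7 * 24 * 60) gives the budget for a week instead of a year.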

Whatever the agreed level of service, availability of key IT infrastructure and services still needs to be monitored. In Opsview, we can use SLA reports for intervals including daily, weekly, monthly and yearly, with these emailed automatically to you or the end customer, showing the performance against SLA of that service over that time period. If you have these in Nagios they are easily migrated to Opsview.

However, that's the easy bit: it's all GUI-driven and anyone can do it.

What I want to do is look at something a bit cooler, and something not ‘out of the box’ (what I do best).

Opsview ships with a plugin called check_odw_hostgroup_availability. We can use this to monitor the availability of a specific host group, where a host group is a “group of hosts” (pretty obvious!). Essentially, you group together all of your Linux servers (LinuxServer001, LinuxServer002, etc.) into a host group called “Linux Servers”, as below:

[Image: Host Groups]

And we can click onto the host group to see the health of the hosts, then click into the host and view the health of the service checks.

[Image: Hosts]

[Image: Host Information]

NOTE:

You need to modify the plugin check_odw_hostgroup_availability at the moment (this fix will be committed in time for the next version of Enterprise/Pro). Change the lines at 72/73 from:

my $threshold = $np->set_thresholds(
    warning  => $np->opts->warning,
    critical => $np->opts->critical
);

to

my $threshold = $np->set_thresholds(
    warning  => $np->opts->warning.":100",
    critical => $np->opts->critical.":100"
);

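This turns each threshold into a range (for example, 95 becomes 95:100). In Nagios threshold syntax a bare value N means “alert when the result is outside 0 to N”, so without the change the plugin alerts when availability is above the threshold rather than below it: it works inversely, which is bizarre!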
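The effect of appending ":100" can be illustrated with a small Python sketch of how a Nagios-style threshold range is evaluated. This is a simplified stand-in written for this post, not the plugin's own Perl, and it only handles the "N", "N:" and "N:M" forms (the "~" and "@" forms are left out). The function name is hypothetical.

```python
def outside_range(value, spec):
    """Return True if value should raise an alert for a Nagios-style range.

    Simplified subset: "N" means 0..N, "N:" means N..infinity and
    "N:M" means N..M. An alert is raised when value is outside the range.
    """
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        low = float(low_text)
        high = float(high_text) if high_text else float("inf")
    else:
        # A bare number is an upper bound, with an implied lower bound of 0.
        low, high = 0.0, float(spec)
    return value < low or value > high


# 99% availability against a threshold of 95:
print(outside_range(99.0, "95"))      # bare value: alerts although availability is good
print(outside_range(99.0, "95:100"))  # range: no alert
print(outside_range(90.0, "95:100"))  # range: alerts because availability is too low
```

With a bare threshold, healthy availability figures fall outside 0 to 95 and trigger the alert, which is the inverse behaviour described above.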

Example:

In our example, I want to monitor the host group “Linux Servers” (this could be anything, e.g. a host group called “Tony's Tyres” representing an end customer).

To do this, I need to first create a service check called “HG Availability – Linux Servers” as below (image snipped so it isn't too big):

After creating our check – I’m going to add it, along with a few others for different host groups, to a “dummy host” called “HG-Availability-Checks” as below: 

Next, reload Opsview and we should be able to see our host group availability statistics against our dummy host:



So, we can now see the “SLA %” of each of our host groups, i.e. their availability over the past 7 days. The plugin's options don't mention a way to set a longer period, so we are looking at modifying it so that you can specify a number of days. A quick look at the code shows:

my $end_time =
    DateTime->now->subtract( hours => 1 )->strftime( "%F %H:00:00" );
my $start_time = DateTime->now->subtract(
    days  => 7,
    hours => 1
)->strftime( "%F %H:00:00" );

So one imagines you can modify the “days => ” value, perhaps by adding a “--days” option or something similar?
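As a rough illustration of that idea, here is the same window calculation sketched in Python with the number of days made configurable. This is not the plugin's code (which is Perl) and not an existing Opsview option: the "--days" flag, the function names and the default of 7 are all assumptions made for the example.

```python
import argparse
from datetime import datetime, timedelta, timezone

HOUR_FORMAT = "%Y-%m-%d %H:00:00"  # same shape as the plugin's "%F %H:00:00"


def reporting_window(days, now=None):
    """Return (start, end) strings for a window ending one hour ago.

    Both values are truncated to the top of the hour, mirroring the
    Perl snippet above, but with a configurable number of days.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    end = now - timedelta(hours=1)
    start = now - timedelta(days=days, hours=1)
    return start.strftime(HOUR_FORMAT), end.strftime(HOUR_FORMAT)


def parse_days(argv):
    """Parse a hypothetical --days option, defaulting to the current 7 days."""
    parser = argparse.ArgumentParser(description="Host group availability window")
    parser.add_argument("--days", type=int, default=7,
                        help="number of days to report on (hypothetical option)")
    return parser.parse_args(argv).days


# Example: a 30-day window instead of the hard-coded 7 days.
start, end = reporting_window(parse_days(["--days", "30"]))
print(start, "to", end)
```

Keeping 7 as the default means existing service checks would behave as before, while new ones could ask for a longer period.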

Getting funky

So now that we have our metrics and can see the health of our host groups in terms of availability, we can visualise them in some better ways.

1. Use keywords:

We can use keywords to display the health of our new service checks either ‘at a glance’, i.e. if any of the checks in the keyword are critical, then the keyword itself goes critical - or in a more detailed view:

Top level - At a glance:

Detailed level:

2. Dashboards

Pro or Enterprise customers can also use the dashboards to display this data. I’ve used performance gauges here to show our SLAs for our six host groups:

Conclusion

So there we go: we now have the ability not only to monitor hosts but, using a little-known Opsview plugin, to monitor the SLA availability of a host group over a given time. And, because it returns performance data, we can use it for graphics, reporting, and much more. Very nice!
