SLA monitoring with Nagios and Opsview
This blog is written by Sam Marsh, Product Manager at Opsview.
Availability basically means working out for how many hours in a day, weeks in a year, and so forth, my IT was operational and functional. For example: 'my website was accessible and people could buy products from it for 99.5% of the year', with the other 0.5% accounted for by outages and planned downtime.
In IT, we talk about SLAs (Service Level Agreements). This is where a vendor (hosting company, cloud provider or managed service provider) agrees with the customer, as part of the contract, what the availability of that service will be. In essence, the tighter the SLA, the more you’ll usually pay. For example, a 97.5% SLA may cost you £100 a year (an arbitrary figure), whereas a 99.999% (“five nines”) SLA could cost you upwards of £100,000 a year, as this level of agreement essentially says:
The maximum allowable interruption of service will be 5.26 minutes per year or less.
An MSP agreeing to a 99.999% SLA, and thus committing to a service so resilient that it is interrupted for barely five minutes per year, will require a high contract value to offset this risk.
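The 5.26 minute figure is simple arithmetic, and it can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `allowed_downtime_minutes` is made up for this example, and the year length of 365.25 days is an assumption (a 365-day year gives the same result to two decimal places).

```python
# Illustrative arithmetic only; the SLA percentages are the ones quoted in the text.
# Assumption: an "average" year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget (in minutes) that an SLA leaves in a period."""
    return period_minutes * (100.0 - sla_percent) / 100.0


for sla in (97.5, 99.5, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% SLA: {minutes:.2f} minutes per year ({minutes / 60:.2f} hours)")
```

Seen this way, the gap between the two example contracts is stark: a 97.5% SLA leaves roughly nine days of allowable downtime per year, while five nines leaves barely five minutes.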
Whatever the agreed level of service, availability of key IT infrastructure and services still needs to be monitored. In Opsview, we can use SLA reports for intervals including daily, weekly, monthly and yearly, with these emailed automatically to you or the end customer, showing the performance against SLA of that service over that time period. If you have these in Nagios they are easily migrated to Opsview.
However, that's the easy bit: it's all GUI driven and anyone can do it.
What I want to do is look at something a bit cooler, and something not ‘out of the box’ (what I do best).
Opsview ships with a plugin called check_odw_hostgroup_availability. We can use this to monitor the availability of a specific host group, where a host group is a “group of hosts” (pretty obvious!). Essentially, this means grouping together all of your Linux servers (LinuxServer001, LinuxServer002, etc.) into a group called “Linux Servers”, as below:
And we can click onto the host group to see the health of the hosts, then click into the host and view the health of the service checks.
For now, you need to modify the check_odw_hostgroup_availability plugin (this fix will be committed in time for the next version of Enterprise/Pro). Change the lines at 72/73 from:
my $threshold = $np->set_thresholds(
    warning  => $np->opts->warning,
    critical => $np->opts->critical
);
to:
my $threshold = $np->set_thresholds(
    warning  => $np->opts->warning.":100",
    critical => $np->opts->critical.":100"
);
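This turns the warning and critical values into ranges. In Nagios plugin range syntax a bare threshold N means “alert when the value is outside 0 to N”, so without the “:100” suffix the plugin alerts when availability is high rather than low, which is bizarre!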
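To see why the “:100” suffix matters, here is a minimal Python sketch of the range rules from the Nagios plugin development guidelines. It is not the plugin's own code (which is Perl), and the helper names `parse_range` and `alerts` are hypothetical, chosen just for this illustration.

```python
import math


def parse_range(spec: str):
    """Parse a Nagios-style range string into (start, end, inside).

    "10"     -> alert outside 0..10
    "10:"    -> alert below 10
    "~:10"   -> alert above 10
    "10:20"  -> alert outside 10..20
    "@10:20" -> alert inside 10..20
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # A bare number is the END of the range; the start defaults to 0.
        start_text, end_text = "0", spec
    start = -math.inf if start_text == "~" else float(start_text or 0)
    end = math.inf if end_text == "" else float(end_text)
    return start, end, inside


def alerts(value: float, spec: str) -> bool:
    """Return True when the value should raise an alert for this range."""
    start, end, inside = parse_range(spec)
    within = start <= value <= end
    return within if inside else not within


# A healthy 99.2% availability against a threshold of 95:
print(alerts(99.2, "95"))      # bare threshold: alerts, because 99.2 is outside 0..95
print(alerts(99.2, "95:100"))  # with the fix: no alert, 99.2 is inside 95..100
print(alerts(90.0, "95:100"))  # with the fix: alerts, 90 is outside 95..100
```

With a bare threshold the good values are the ones that alert, which matches the “inverse” behaviour described above.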
In our example, I want to monitor the host group “Linux Servers”. (This could be anything, e.g. a host group called “Tony's Tyres” for an end customer.)
Next, reload Opsview and we should be able to see our host group's availability statistics against our dummy host:
So, we can now see the “SLA %” of each of our hosts and their uptime over the past 7 days. The plugin's options don't mention whether a longer period can be set, so we are looking at modifying it to accept a number of days. A quick look at the code shows:
my $end_time = DateTime->now->subtract( hours => 1 )->strftime( "%F %H:00:00" );
my $start_time = DateTime->now->subtract( days => 7, hours => 1 )->strftime( "%F %H:00:00" );
So one imagines you could make the “days => ” value configurable, by adding a “--days” option or something similar.
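As a rough illustration of that idea, the window calculation with a configurable number of days might look like the Python sketch below. This is not a patch for the plugin, which is written in Perl: the `--days` option does not exist in the plugin today, and `reporting_window` and `parse_args` are hypothetical names used only here.

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv):
    """Parse a hypothetical --days option; the default of 7 matches the current code."""
    parser = argparse.ArgumentParser(description="Availability window sketch")
    parser.add_argument("--days", type=int, default=7,
                        help="number of days to report on (assumed option)")
    return parser.parse_args(argv)


def reporting_window(now: datetime, days: int = 7):
    """Return (start, end) strings, both truncated to the top of the hour.

    Mirrors the two Perl lines: the window ends one hour ago and starts
    `days` days before that.
    """
    fmt = "%Y-%m-%d %H:00:00"  # same layout as Perl's "%F %H:00:00"
    end = now - timedelta(hours=1)
    start = now - timedelta(days=days, hours=1)
    return start.strftime(fmt), end.strftime(fmt)


# Example with a fixed "now" so the result is reproducible.
args = parse_args(["--days", "30"])
start_time, end_time = reporting_window(datetime(2024, 3, 10, 14, 35, 20), args.days)
print(start_time, "to", end_time)
```

Keeping 7 as the default would mean existing service checks behave exactly as before unless the new option is passed.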
So now that we have our metrics and can see the health of our host groups in terms of availability, we can do some better visualisation.
1. Use keywords:
We can use keywords to display the health of our new service checks either ‘at a glance’, i.e. if any of the checks in the keyword are critical, then the keyword itself goes critical - or in a more detailed view:
So there we go: we now have the ability not only to monitor hosts but, using a little-known Opsview plugin, also to monitor the SLA availability of a host group over a given time. And, because it returns performance data, we can use it for graphing, reporting, and much more. Very nice!