Best Monitoring Practices: A Guide on Fixing Frequently Encountered Monitoring Issues
I think it is hard to argue against the need for monitoring when downtime of a host or service can easily lead to revenue loss. So what makes a good monitoring system? Is it the software itself, its users, or a mixture of both?
Software is obviously very important - if it does not have the features you need, you always have the option to switch to something else. But ultimately, it does not matter how advanced the software is; poor use of a monitoring solution can be as useful as having no monitoring at all!
With experience using a variety of monitoring tools over the years and having helped a wide range of customers with questions regarding Opsview Monitor, I have seen some common pitfalls that people fall into. To help you steer the ship of your IT department in the right direction, here is a list of frequently encountered issues and advice to ensure that your monitoring is performing properly.
Too many unhandled problems
We've all done it - we've had a notification come in and thought 'I'll deal with that later'. Before you know it, days have passed, leaving you with an endless list of alerts and no idea which ones are most recent or where to start investigating.
You and your teammates should have a common understanding and list of set procedures for working with your monitoring system. Nothing should be left for more than a few hours without one of the following things happening:
a) acknowledging the problem with an appropriate comment
b) applying downtime if the issue is known to be in effect for a known period of time
c) fixing the problem that has been alerted
d) fixing the alerting levels if they are not appropriate
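The "few hours" rule above can be checked mechanically. Here is a minimal sketch in Python, assuming you can export current problems from your monitoring system into a simple list; the `Problem` class, its fields, and the four-hour default are illustrative assumptions, not part of any Opsview API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Problem:
    """Hypothetical record for one non-OK check (not a real Opsview type)."""
    host: str
    service: str
    since: datetime            # when the check entered a non-OK state
    acknowledged: bool = False
    in_downtime: bool = False


def unhandled_too_long(
    problems: List[Problem],
    now: datetime,
    max_age: timedelta = timedelta(hours=4),  # assumed value for "a few hours"
) -> List[Problem]:
    """Return problems that are neither acknowledged nor in downtime
    and have been open longer than max_age, oldest first."""
    stale = [
        p for p in problems
        if not p.acknowledged and not p.in_downtime and now - p.since > max_age
    ]
    return sorted(stale, key=lambda p: p.since)
```

Anything this returns needs one of the four actions above: acknowledge it, put it into downtime, fix it, or fix its alerting levels.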
If you are going to look into an alert, openly acknowledge it with your team first so they don't start investigating the same issue and waste time in the process. I remember how frustrating it was to spend time looking into a problem only to find that one of my colleagues was doing the same thing, and we kept treading on each other's toes in the process.
Also, if you are going to do work on a host or service and you know it will be affected, put it into downtime before you start so no one will panic when you shut services or hosts down.
Why use a monitoring solution to inform you of infrastructure issues when in reality the alert is meaningless and you use up valuable hours investigating it?
To remedy the frustration of false alerts, it is helpful to create a template and assign it to groups of hosts that are all set up the same way, so they can be monitored in exactly the same way. But there is always one, isn't there: that one server built slightly differently from the others in some small way that you always forget about, and it always catches up with you.
This is where host exceptions come in. You can amend the alerting levels for that one host on a check-by-check basis (exception) and even for specific times of day (timed exception) so you won’t get woken up at 3 am when the backups kick off and disk usage goes through the roof, or CPU gets maxed out when services are restarted at scheduled times.
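To make the precedence concrete, here is a minimal sketch of how template levels, a host exception, and timed exceptions could be resolved for a single check. It illustrates the idea only and is not how Opsview implements it; the class names, function name, and threshold values are all assumptions.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional


@dataclass
class Thresholds:
    warning: float
    critical: float


@dataclass
class TimedException:
    """Hypothetical: alternative thresholds that apply in a daily time window."""
    start: time
    end: time
    thresholds: Thresholds

    def covers(self, t: time) -> bool:
        if self.start <= self.end:
            return self.start <= t < self.end
        # Window wraps past midnight, e.g. 23:00 to 01:00.
        return t >= self.start or t < self.end


def effective_thresholds(
    template: Thresholds,
    exception: Optional[Thresholds],
    timed: List[TimedException],
    at: time,
) -> Thresholds:
    """Most specific rule wins: timed exception, then host exception, then template."""
    for window in timed:
        if window.covers(at):
            return window.thresholds
    return exception if exception is not None else template
```

For example, a disk check with template levels of 80/90 could carry a timed exception of 95/99 between 02:00 and 04:00, so the nightly backup no longer pages anyone while daytime alerting stays strict.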
Unless you are checking the UI every few minutes, there is always a chance of missing important updates, especially when you have gone for that important coffee break to catch up on gossip with your colleagues.
This is where notifications come in. Why keep looking at the monitoring system when it can tell you there is an issue? The method of notification should be up to you, whether that is the very common email or SMS, an app on your phone, or something more proactive such as Nagstamon, which can run as a desktop widget and poll the monitoring system status (giving warning alarms when problems are discovered).
What's most important is not having to remember to check the UI every few moments while you are caught up in a technical issue that you don't want to get distracted from unless it's absolutely urgent.
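The polling approach boils down to comparing the latest status with the previous one and only speaking up when something changes. A minimal sketch of that comparison follows, assuming status arrives as a plain dictionary keyed by host and service; the data shape and function name are illustrative, not taken from Nagstamon or Opsview.

```python
from typing import Dict, List, Tuple

# Hypothetical status snapshot: (host, service) -> state string.
Status = Dict[Tuple[str, str], str]


def changes_to_notify(previous: Status, current: Status) -> List[str]:
    """Return one message per check whose state differs from the last poll.

    Checks missing from the previous snapshot are assumed to have been OK.
    """
    messages = []
    for key, state in sorted(current.items()):
        old = previous.get(key, "OK")
        if state != old:
            host, service = key
            messages.append(f"{host}/{service}: {old} -> {state}")
    return messages
```

Each message would then be handed to whichever channel you prefer (email, SMS, desktop pop-up), so an unchanged problem does not nag you on every poll.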
Inappropriate downtime and/or acknowledgements
You get approached by a user saying there is a problem, but when you look at the monitoring system and see nothing wrong, it is natural to assume that the issue is with the user. Only later do you find out that a host or service was put into downtime a month ago for some reason and never followed up on. This is where regular audits help. All hosts in a non-okay but handled state should be reviewed to make sure any downtime or acknowledgement is still appropriate. If it isn’t appropriate, remove it so problems are not masked or ignored. What ‘regular’ means in your environment is entirely up to you; perhaps weekly, but at least once a month is recommended and proves beneficial.
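Such an audit is easy to script once you can list everything that is currently handled. A minimal sketch follows, assuming each handled item records when it was acknowledged or put into downtime; the `HandledItem` class and the seven-day default are assumptions chosen to match a weekly review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class HandledItem:
    """Hypothetical record of a non-OK host or service that is currently handled."""
    host: str
    kind: str                  # "downtime" or "acknowledgement"
    handled_since: datetime
    comment: str = ""


def needs_review(
    items: List[HandledItem],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # assumed weekly audit interval
) -> List[HandledItem]:
    """Return handled items old enough that someone should confirm
    the downtime or acknowledgement is still appropriate."""
    return [item for item in items if now - item.handled_since >= max_age]
```

Run on a schedule, a report like this surfaces the month-old downtime before a user has to tell you about it.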
These are only a few of the issues you may encounter, and having known processes and procedures in place for using your monitoring system will help significantly in the battle against unplanned downtime. However, you will only get out of your monitoring system what you put in - it needs regular attention to make sure checks are appropriate and valid. And if you reach the point where you cannot trust your monitoring solution and leave it to fester, it will come back to bite you in a big way just when you need it most!