Opsview Monitor Event Handlers
Monitoring software does not serve the purpose of replacing technical talent. Instead, it is there to make the lives of technical resources easier by removing tedious tasks from their day. By freeing up time, these resources can be better utilized by the business to do more development and spend less time maintaining the existing systems.
The end goal of any monitoring project should be maintaining uptime by quickly resolving problems as they come up. Often the question “Have you tried turning it off and on again?” turns out to be the solution, which raises a new question: if a simple, repetitive task is so often the answer, why is human capital being wasted on something a computer can do?
This approach means that unavoidable bugs should no longer cause systemic outages. These bumps in the road can even occur in production and they can be fixed automatically so that they don’t lead to more serious problems. This can amount to less firefighting, less code regression and more time spent on the next thing.
The trick is to adopt automatic resolution into your agile methodology. Each sprint yields new things that can break, so it makes sense to create rules and scripts that automatically fix things when they break. However, automatically restarting a service covers up the problem rather than truly fixing it, so to police this effectively we need to report on how often these corrective steps are taken. The more frequent the automatic resolution, the worse the bug. The best thing about this is that bugs are detected without causing a prolonged interruption of service.
Automation cannot be effectively implemented without a little bit of planning. By rolling out automation hastily, you might exclude tasks that would have been helpful or introduce noise to the system by automating too many things. This section will help you to determine when automation is most appropriate. It will also help you to determine where in the process these methods should be introduced so that it can be most effective without causing more problems than it alleviates.
This whitepaper will answer the following questions:
• Which tasks are better automated?
• Under what conditions should this automation occur?
• How would this be achieved in a tool like Opsview Monitor?
What Kind of Automation Do I Want?
This is going to vary by environment. The best way to determine which tasks should be automated is by consulting your ticketing system.
• What are some of your most common problems?
• What actions are most often taken by the team to fix these common problems?
• Do any patterns emerge?
When patterns emerge, we have found tasks that can potentially be automated. This will make the entire team more effective by taking the tedium out of their day so that they can focus on jobs that require the creativity and intuition that computers lack.
The examples of this paper are going to focus on three different automated tasks:
• Restart a service
• Renice a process
• Kill a process
Each of these corrective actions will be performed under different conditions which we will explore in the next section.
When Should This Occur?
We have established what corrective actions we wish to try. Now it is time to carefully consider when these actions are most appropriate. The first thing that we will consider is the concept of a soft state or a buffer period between an initial failure and when an alert is sent out for human intervention. Next we will introduce dependencies between services as it relates to the ability to monitor. This will not only help to track down root causes of failures more easily but will also prevent excessive automated action which could get quite noisy if introduced incorrectly.
To best leverage automation in your monitoring system it is important to give the automated tasks a chance to try and fix the problem before human intervention is required. This is best accomplished by introducing a time buffer or “soft state” where event handlers run before any notifications are sent out.
There are two different possibilities for introducing a buffer period. You could either introduce a single wait period for a predefined amount of time or you could introduce a retry period of smaller time increments and set your buffer period to be a number of retry intervals.
Opsview Monitor takes the latter approach. Introducing multiple, shorter retry periods brings two benefits.
First, the monitoring tool notices more quickly whether the corrective action has worked. Second, a different corrective action can be attempted at each retry. This prevents your monitoring tool from trying the same solution over and over again and expecting different results.
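To make the buffer period concrete, here is a back-of-envelope calculation. The interval and attempt values below are illustrative assumptions, not Opsview Monitor defaults:

```shell
# Soft-state timing sketch; the values are illustrative assumptions,
# not Opsview Monitor defaults.
check_interval=300   # seconds between checks while the service is OK
retry_interval=60    # seconds between rechecks once a failure is seen
max_attempts=3       # failed attempts before the state goes HARD and notifies

# After the first failed check there are (max_attempts - 1) retries in
# which event handlers can run before anyone is paged.
buffer=$(( retry_interval * (max_attempts - 1) ))
echo "event handlers get ${buffer}s and $(( max_attempts - 1 )) retries to fix the problem"
```

Each of those retries is an opportunity to attempt a different remedy before the failure is escalated to a human.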
In order to take automatic corrective action, you must have a handle on the service-level dependencies at play. For example, there are many Apache service checks that determine the availability and performance of your web server. We do not want each and every one of these service checks to restart the apache2 or httpd service every time it fails; that would result in far too many restarts. Instead, you want a parent-level service check to be responsible for this task when there is a systemic problem affecting all Apache service checks.
It is important to note that dependencies in a monitoring tool must be considered from the point of view of the monitoring server. For example, the figure below indicates many service checks that may be associated with a common web application running a LAMP stack. In order to monitor these end metrics the host must first be up. This is verified with the host check command.
Next it is important to check that the port required for monitoring that service is active. In this example, the next level of dependency is to check the tcp response of ports 5666 (agent), 80 or 443 (HTTP/HTTPS), and 3306 (MySQL Listener) to make sure that there isn’t a network problem like a firewall restriction.
Finally, we verify that the service is running which is commonly checked by verifying that the number of processes for that user is greater than zero. It is important to consider that you may require the agent to monitor these processes. If that is the case, process or service status monitoring must be dependent on the previously checked agent response. Now it is possible to gather the more granular details about the web application.
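The lowest-level “is the service running?” check described above can be sketched as a per-user process count. The www-data user name is an assumption; Apache runs as apache or httpd on some distributions:

```shell
# Count processes owned by a given user; a result of zero means the
# service is down at this dependency level.
count_procs() {
    ps -u "$1" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

# www-data is an assumed user name; adjust for your distribution.
if [ "$(count_procs www-data)" -gt 0 ]; then
    echo "OK: web server processes found"
else
    echo "CRITICAL: no web server processes"
fi
```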
Automatic resolution should take place at the highest parent service that makes sense. For example, killing parent processes when zombie processes get out of control, or renicing runaway processes, should be done at the immediate service check level. However, restarting a service should be done at the process or service status level.
Detecting Bugs without Loss of Service
Now we have effectively maximized uptime by taking some of the tedious tasks out of the DevOps team’s day. But have we actually fixed any of these problems? This is where it is important to run event reports against longer time windows. Frequent restarts or other regular corrective action may maximize uptime, but it may only be treating the symptom of a more serious underlying condition.
To determine how often corrective action is taken you will have to isolate the parent level service checks that run your event handlers. It is then important to note all “SOFT” problem states that never go to a “HARD” state. By noting how many soft state events occur before resolution you will know which corrective action fixed the problem that time.
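As a sketch of how such a count could be derived: the standard Nagios log format tags each service alert line with SOFT or HARD, so soft-state events can be counted directly. The log path in the usage comment is an assumption for your install:

```shell
# Count SOFT service alerts in a Nagios-format log stream. Each line
# looks like: [timestamp] SERVICE ALERT: host;service;STATE;SOFT;attempt;output
count_soft_alerts() {
    grep 'SERVICE ALERT' | grep -c ';SOFT;'
}

# Example usage (log path is an assumption; adjust for your install):
# count_soft_alerts < /usr/local/nagios/var/nagios.log
```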
An example of such a report is shown on the next page. The first section outlines the number of state changes and the severity by day. This gives you a clear way of associating business events like code releases with the number of event handlers that are run. The second section lists a description of all of the events so that you can tell what the problems are and which corrective actions were taken in response.
This section will outline the way that event handlers work in Opsview Monitor. It will identify the steps that Opsview Monitor goes through from event to response. This section will provide some tips about writing your event handlers and remedy scripts that will run on the monitoring server and end device respectively. Finally, it will walk through an actual lab exercise that you may choose to put into practice in your own environment.
How Do Event Handlers Work in Opsview Monitor?
When using Opsview Monitor, understanding the workflow of the event handlers is essential to implementing an effective automation strategy. This section of the article will isolate the process that occurs between detecting a state change and executing a script to potentially correct the condition. Throughout the process there will be environment variables to consider that are vital to understanding what the current conditions are and what action should be taken.
When a state change occurs, there are multiple steps that take place before a script is run on the end device. First we must consider the monitoring server side. The service check or host has an event handler definition: the name and set of parameters for a script located in /usr/local/nagios/libexec/eventhandlers/. These scripts have environment variables available to them, which means an event handler script can be aware of the host, service check, state type, service status and check attempt that were true at the time it was run. In turn, different agent commands can be issued at the event handler script level for Critical events vs. Warning events. The event handler script can also iterate through multiple remedy steps based on the check attempt. Consequently, you can try to restart a service when it first fails and, if that doesn’t work, try to stop and start the service instead before classifying it as a hard failure where a human needs to be involved.
The agent command that is issued to the end device needs to be interpreted. This means that when the agent receives a “check_nrpe -c” command, it needs to understand what the -c parameter means. To do this, the agent consults the file nrpe.cfg, which has a mapping between “-c” entries and local scripts. Once this is achieved, it is just a matter of making sure that the nagios user has sufficient privileges to run the UNIX commands in the script.
Writing Event Handlers
There are a variety of environment variables available at the event handler level. The general rule of thumb is that all Nagios macros that are usually available in event handlers are also available in Opsview Monitor. The rule is to reference $NAGIOS_<MACRO NAME> inside of the script logic. This means that $HOSTADDRESS$ would be named $NAGIOS_HOSTADDRESS in the Opsview Monitor event handler.
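A minimal illustration of the renaming: inside the event handler, the values arrive as ordinary NAGIOS_-prefixed environment variables (they are only populated when Opsview Monitor actually invokes the script; unset values fall back to “unset” below):

```shell
#!/bin/sh
# Print the macros this event handler was invoked with; each one is an
# ordinary environment variable carrying the NAGIOS_ prefix.
show_macros() {
    echo "host: ${NAGIOS_HOSTADDRESS:-unset}"
    echo "state: ${NAGIOS_SERVICESTATE:-unset} (${NAGIOS_SERVICESTATETYPE:-unset})"
    echo "attempt: ${NAGIOS_SERVICEATTEMPT:-unset}"
}
show_macros
```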
It is important to eliminate redundancy inside of event handlers because you don’t want to try the same corrective action more than once under the same conditions. You also want to make sure that some form of corrective action takes place every retry interval.
The most important macros to use in your event handler script (on the monitoring server side) are listed below:
• $NAGIOS_HOSTADDRESS – the IP address or DNS name of the end point.
• $NAGIOS_SERVICESTATE – “OK,” “WARNING,” “CRITICAL” or “UNKNOWN.”
• $NAGIOS_SERVICESTATETYPE – “SOFT” or “HARD.”
• $NAGIOS_SERVICEATTEMPT – the number of check attempts, starting at one.
These macros can be leveraged to create different corrective actions for a “HARD OK” instead of a “SOFT OK.” They can also be used to take a separate corrective action for each value of “Service Attempt.” Each condition should have some logger action and a check_nrpe command issued to the end device to make the agent run a local script. The syntax of a check_nrpe command is listed below:
$ cd /usr/local/nagios/libexec/
$ ./check_nrpe -H $NAGIOS_HOSTADDRESS -c <agent_command> -a "<parameters>"
There are three main considerations when creating a remedy script. The first is where to place the script. Best practice dictates that all remedy scripts should be stored under /usr/local/nagios/libexec/eventhandlers/ just like the event handler scripts on the master. It may be appropriate to follow a naming convention such as eh_* for better organization and improved sorting.
The second consideration is command definition in the agent’s configuration files. Verify that the final line in /usr/local/nagios/etc/nrpe.cfg on the agent side is an uncommented include_dir=/usr/local/nagios/etc/nrpe_local, which points the agent’s configuration to a local directory. The file nrpe.cfg may be overwritten as part of agent upgrades, so all local configurations should be stored under the nrpe_local/ directory. Any file with a “.cfg” extension in nrpe_local or any subdirectory will be included in the agent’s configuration. It is therefore best practice to create a file named “eventhandlers.cfg” under the nrpe_local directory for remedy script command definitions. This file will contain command definitions that map agent commands to local scripts.
The third consideration that must be made about remedy scripts is what the nagios user is allowed to do on the end device. These corrective actions are going to be carried out by the nagios user and so certain privileges may have to be raised for the user. This may mean giving limited sudo access to the nagios user. Ensure that any corrective action that you want to take is possible as the nagios user before deploying to production. It will be your job to determine the trade-offs of corrective action and the potential security risk of raising the nagios user’s privileges at the end device.
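Scripting that verification is straightforward: `sudo -l -U` lists a user’s sudo rules (listing another user’s rules itself requires root), and a simple existence check guards the call. A sketch:

```shell
# Check that the nagios user exists before inspecting its sudo rules.
user_exists() {
    id "$1" >/dev/null 2>&1
}

if user_exists nagios; then
    # List the rules granted to nagios without prompting for a password;
    # this itself requires root (or appropriate sudo rights).
    sudo -n -l -U nagios 2>/dev/null || echo "cannot list sudo rules (run as root)"
else
    echo "nagios user not present on this host"
fi
```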
command[eh_kill]=sudo /usr/local/nagios/libexec/eventhandlers/eh_kill $ARG1$
command[eh_renice]=sudo /usr/local/nagios/libexec/eventhandlers/eh_renice $ARG1$
command[eh_restart]=sudo /usr/local/nagios/libexec/eventhandlers/eh_restart $ARG1$
This is what the -c argument to an agent command will be mapped to. For example, running the following command at the monitoring server:
$ /usr/local/nagios/libexec/check_nrpe -H $NAGIOS_HOSTADDRESS -c eh_restart -a "-s apache2 -r"
Would be equivalent to the nagios user running the following command at the end point:
$ sudo /usr/local/nagios/libexec/eventhandlers/eh_restart -s apache2 -r
Let’s Do It
Putting this into practice is best done in the reverse of the event handler process flow so that each step can be effectively tested. First, we write the scripts that issue UNIX commands like service, renice, or kill, and test these commands as the root user. Next, we make sure that the nagios user has sufficient privileges to issue these commands by granting conditional sudo access in such a way that it cannot be exploited. We then define the mapping between an agent command and the sudo execution of this script and restart the Opsview Monitor agent.
Back on the monitoring server, the event handler logic needs to be written so that separate agent commands can be issued depending on the conditions that exist on the end device. It is then that the event handler can be assigned to the parent-level service check. The detailed steps are as follows:
1. Write remedy scripts that issue UNIX commands.
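The paper does not show a remedy script body, so here is a hypothetical sketch of eh_restart. The -s/-r/-m flags mirror the examples elsewhere in this paper but are assumptions, and for safe illustration the sketch prints the service commands it would run rather than executing them; swap the echo calls for real invocations once tested:

```shell
#!/bin/sh
# eh_restart (sketch) - usage: eh_restart -s <service> [-r|-m]
#   -r : plain restart (default)
#   -m : stop, then start (the fallback tried on a later attempt)
build_commands() {
    OPTIND=1
    service="" mode="restart"
    while getopts "s:rm" opt; do
        case "$opt" in
            s) service="$OPTARG" ;;
            r) mode="restart" ;;
            m) mode="stopstart" ;;
        esac
    done
    [ -n "$service" ] || { echo "usage: eh_restart -s <service> [-r|-m]" >&2; return 3; }
    if [ "$mode" = "restart" ]; then
        # For illustration we print the command; a real script would run it.
        echo "/usr/sbin/service $service restart"
    else
        echo "/usr/sbin/service $service stop"
        echo "/usr/sbin/service $service start"
    fi
}

# Example invocation; in the deployed script this would be: build_commands "$@"
build_commands -s apache2 -r
```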
2. Determine which commands or scripts you need to give sudo access to for the nagios user.
$ whereis service
service: /usr/sbin/service /usr/share/man/man8/service.8.gz
$ whereis renice
renice: /usr/bin/renice /usr/share/man/man1/renice.1.gz
$ whereis kill
kill: /bin/kill /usr/share/man/man1/kill.1.gz
3. Change the permissions of the nagios user to grant the limited sudo access.
$ visudo
nagios ALL=(root) NOPASSWD: /usr/sbin/service, /usr/bin/renice, /bin/kill
Alternatively, this comma-separated list could be the remedy scripts themselves. By running a script under sudo, you are essentially applying sudo to every command inside of it.
$ chown root:root /usr/local/nagios/libexec/eventhandlers/eh_restart_service
$ chown root:root /usr/local/nagios/libexec/eventhandlers/eh_renice_process
$ chown root:root /usr/local/nagios/libexec/eventhandlers/eh_kill_process
$ visudo
nagios ALL=(root) NOPASSWD: /usr/local/nagios/libexec/eventhandlers/eh_restart_service, /usr/local/nagios/libexec/eventhandlers/eh_renice_process, /usr/local/nagios/libexec/eventhandlers/eh_kill_process
Now that these scripts can be executed at the root level by the nagios user it is imperative to change the ownership of these scripts to root so that they cannot be moved or modified for malicious means.
4. Define your agent commands in nrpe.cfg.
$ vi /usr/local/nagios/etc/nrpe_local/eventhandlers.cfg
command[eh_kill]=/usr/local/nagios/libexec/eventhandlers/eh_kill $ARG1$
command[eh_renice]=/usr/local/nagios/libexec/eventhandlers/eh_renice $ARG1$
command[eh_restart]=/usr/local/nagios/libexec/eventhandlers/eh_restart $ARG1$
If you have chosen to give the nagios user sudo access to the script, rather than the commands, make sure that the command definition includes sudo before the script’s path.
5. Don’t forget to restart or reload the opsview-agent service to pick up these new configurations.
$ service opsview-agent restart
6. Test your agent commands on the monitoring server side.
$ cd /usr/local/nagios/libexec
$ ./check_nrpe -H <agent address> -c eh_restart -a "-s <service> -r"
7. Write the event handler scripts on the monitoring server side.
#!/bin/sh
case "$NAGIOS_SERVICESTATE" in
"OK")
    case "$NAGIOS_SERVICESTATETYPE" in
    "SOFT")
        ;;
    "HARD")
        # Recovery action
        ;;
    esac
    ;;
"WARNING")
    ;;
"UNKNOWN")
    ;;
"CRITICAL")
    case "$NAGIOS_SERVICEATTEMPT" in
    1)
        /usr/local/nagios/libexec/check_nrpe -H "$NAGIOS_HOSTADDRESS" -c eh_restart -a "-s $1 -r"
        ;;
    2)
        /usr/local/nagios/libexec/check_nrpe -H "$NAGIOS_HOSTADDRESS" -c eh_restart -a "-s $1 -m"
        ;;
    esac
    ;;
esac
exit 0
8. Assign the event handler to the appropriate service checks.
Now the apache service will be restarted every time that there is a problem connecting to the server on port 80 or 443.