Whitepaper

Technical Guide: Setting Up an On-Call Schedule Using Shared Notification Profiles

In today’s Enterprise IT environments, 24x7 uptime is becoming an increasingly common requirement. Supporting global markets and a constant web presence has meant the need for business continuity has expanded.

To keep up with ‘Always On’ architectures, IT maintenance staff must be available outside of business hours should problems arise. While on-call schedules are very commonly in place, they are often difficult to implement. Opsview Monitor’s Shared Notification Profiles enable IT teams to construct a reliable process for out-of-hours maintenance.

The first requirement when developing an out-of-hours maintenance process for unplanned outages is that the process must be collectively exhaustive, with minimal management overhead when changes need to be made. This technical guide details a strategy that does all filtering in one step, so that it is easy to make sure that all hosts and services can send alerts to the relevant people outside of normal operating hours.

The second requirement of this on-call schedule is to limit notifications to the most appropriate audience so that the team can work effectively while maintaining a desirable work-life balance. The strategy detailed in this guide calls for the first line of alerts to follow the MECE principle, meaning that the alert audiences are mutually exclusive while remaining collectively exhaustive. The audience may then be expanded as an issue escalates.

The third requirement is that changes can be easily made to the plan when hosts and services are added or removed from the system or if there are staff or schedule changes that need to be made.

This guide details a system that automatically incorporates new hosts and services into a business unit’s notifications, which users can easily subscribe to or be assigned to.


Strategy

Opsview Monitor can filter alerts at many different levels. The desired strategy in this guide is to accomplish all filtering in one place so that change management can be kept under control. The first level at which an alert can be filtered is the service check. The screenshot below shows that all of the status types must be checked or else alerts will never be raised past that level. There are also options for entering the time window in which alerts will be sent and the time between escalations, both of which can optionally be inherited from the host level.

The second filter is at the host level. Similar to the screenshot above, there is a ‘Notifications’ tab where statuses must be checked and a time window needs to be selected or alerts will not be raised beyond that point.

According to our desired strategy, it is most appropriate to allow all alerts to be raised through the service check and host level by always checking all boxes and setting the ‘Notification Period’ to 24x7. When this is the case, should any problem status occur, Opsview Monitor will raise an internal alert and log an event. If and only if there is a relevant Notification Profile for this alert will an email or text message be sent. By leveraging these Notification Profiles, a weekly on-call schedule can be created and the shifts can be divided by business unit or team, notification method, escalation level, and time window. These shifts can be subscribed to or assigned to users so that the business can ensure coverage outside of business hours without entire teams being constantly bombarded by emails.

Time Periods

The first step for creating an on-call schedule in Opsview Monitor is to create time period entries for all possible shifts. There are some standard entries for time periods including 24x7, working hours and non-working hours. Verify that the working hours time period exists in your environment and note the 24hr time format. This is the time period that will be used most often in private Notification Profiles.

Next, create time periods for the on-call shifts. These shifts can be as granular and specific as the business requires. Sample time periods include an after-work and next-morning shift and a weekend shift, as in the examples below. Time notation can be comma separated to represent multiple time windows within each day.
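
The comma-separated notation can be illustrated with a short Python sketch. The shift string below and the helper names parse_windows and in_period are assumptions made for illustration; they are not taken from an Opsview Monitor installation, and this is not how Opsview Monitor evaluates time periods internally.

```python
from datetime import time

def parse_windows(spec):
    """Parse 'HH:MM-HH:MM,HH:MM-HH:MM' into (start, end) pairs in minutes since midnight."""
    windows = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # an empty entry means no window
        start, end = part.split("-")
        start_h, start_m = (int(x) for x in start.split(":"))
        end_h, end_m = (int(x) for x in end.split(":"))
        windows.append((start_h * 60 + start_m, end_h * 60 + end_m))
    return windows

def in_period(spec, moment):
    """Return True if the time of day falls inside any window of the day's spec."""
    minutes = moment.hour * 60 + moment.minute
    return any(start <= minutes < end for start, end in parse_windows(spec))

# Hypothetical weekday entry for an after-work and next-morning shift.
AFTER_HOURS = "00:00-09:00,17:00-24:00"

print(in_period(AFTER_HOURS, time(18, 30)))  # evening: inside the shift
print(in_period(AFTER_HOURS, time(12, 0)))   # midday: outside the shift
```

Two windows in a single day let one time period cover both the evening and the following early morning without creating a separate entry for each.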

 

One of these time periods will be selected when a Notification Profile is created. These time periods can be reused across many profiles and may have many other applications in Opsview Monitor such as timed exceptions and check periods for hosts and service checks.

Roles

Roles are an important tool in Opsview Monitor which are used to restrict access to specific objects and actions per user or group. A suggested configuration of the ‘Status Access’ tab of a Role for this tutorial is shown below. The idea is that anyone who is on-call should be able to view objects, acknowledge events, schedule downtime, receive alerts and test service checks on objects they have access to.

The next step is to choose which objects a user can interact with. In this menu, host groups, service groups and Hashtags can be selected. The objects that a user under this role will have access to are those that sit within both the selected host groups and the selected service groups, plus any objects within selected Hashtags.

For example, if the host group Linux Servers was selected and the service group Application – Apache Server was selected, the set of objects accessible to the user is all Apache services running on Linux servers. Only the Linux hosts running Apache will be accessible and only the Apache services will be viewed. The next piece of logic after the host group and service group intersection is the union with all objects in selected Hashtags. Role access by Hashtag is the easiest way to make flexible permissions, since a Hashtag can be made up of any combination of individual hosts and services that the user wants.
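
The intersection-then-union logic described above can be sketched in Python. The inventory, group names, and the role_can_access helper are hypothetical and exist only to illustrate the rule; they do not represent Opsview Monitor's data model or API.

```python
from typing import NamedTuple

class MonitoredService(NamedTuple):
    host: str
    service: str
    host_group: str
    service_group: str
    hashtags: frozenset

def role_can_access(obj, host_groups, service_groups, hashtags):
    """(host group AND service group) OR any selected Hashtag."""
    in_both_groups = obj.host_group in host_groups and obj.service_group in service_groups
    in_a_hashtag = bool(obj.hashtags & hashtags)
    return in_both_groups or in_a_hashtag

# Illustrative inventory (all names are made up for this sketch).
inventory = [
    MonitoredService("web1", "HTTP", "Linux Servers", "Application - Apache Server", frozenset()),
    MonitoredService("web1", "Disk", "Linux Servers", "OS - Base", frozenset()),
    MonitoredService("win1", "HTTP", "Windows Servers", "Application - Apache Server", frozenset()),
    MonitoredService("db1", "MySQL", "Linux Servers", "Database - MySQL", frozenset({"payments"})),
]

visible = [
    (obj.host, obj.service)
    for obj in inventory
    if role_can_access(obj, {"Linux Servers"}, {"Application - Apache Server"}, {"payments"})
]
print(visible)
```

In this sketch the role sees the Apache service on the Linux host through the group intersection and the MySQL service through the Hashtag, while the disk check on web1 and the Apache service on the Windows host stay hidden.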

This is the step where the MECE principle must be followed, as all object-level filtering occurs here. It is therefore important to make sure that at least one role covers every service that needs to be alerted on outside of regular hours, thereby making the set of roles collectively exhaustive. If possible, avoid any overlap within these roles. If there is overlap, multiple people will get the same email. This can cause confusion and unnecessarily disturb an employee during their free time.
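
A planned set of roles can be checked against the MECE principle before it is configured. The following Python sketch assumes the role-to-service mapping has been written down by hand; the role names, service identifiers, and the mece_report helper are hypothetical.

```python
from collections import defaultdict

def mece_report(required_services, role_coverage):
    """Return (uncovered services, services covered by more than one role)."""
    owners = defaultdict(set)
    for role, services in role_coverage.items():
        for service in services:
            owners[service].add(role)
    uncovered = set(required_services) - set(owners)
    overlapping = {s: sorted(roles) for s, roles in owners.items() if len(roles) > 1}
    return uncovered, overlapping

# Hypothetical plan: which on-call role covers which service.
required = {"web1/HTTP", "db1/MySQL", "mail1/SMTP"}
plan = {
    "oncall-web": {"web1/HTTP"},
    "oncall-db": {"db1/MySQL", "web1/HTTP"},
}

uncovered, overlapping = mece_report(required, plan)
print("Not collectively exhaustive, missing:", sorted(uncovered))
print("Not mutually exclusive, shared:", overlapping)
```

An empty result for both values indicates that every required service has exactly one first-line role.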

Shared Notification Profiles are specific to a single role so a role should be created for every business unit or technology area that will require a specialist to be on-call.

Notification Profiles

Notification Profiles are a collection of preferences indicating how a user wishes to receive alerts. Notification Profiles can be private or shared, and users can subscribe to and unsubscribe from shared profiles as necessary. By creating a set of Notification Profiles that are separated by business unit, shift, and alert type, a business can create a custom schedule to meet its unique needs.

Notification Method

The first step in creating a Notification Profile is to select the method by which alerts will be received. All available notification methods are listed with check boxes, and one or many can be selected.


Select Objects

Select which objects the profile should send alerts for. This is selected the same way as when a role is created. It is important to note that the user’s role will further filter these objects after the fact. It is therefore important to make sure that the user’s role includes all objects that the user would want to receive notifications for.

A suggested way to approach this is to select the checkboxes for all objects and allow the role to do all of the filtering rather than the profile. This allows for new hosts and services to be automatically picked up by the Notification Profile so long as the new objects are available to the role. This reduces the management cost associated with making changes to hosts, services, and Hashtags. As long as the role has been modified, the Notification Profile will pick up the change as well.
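
The benefit of leaving the profile unfiltered can be shown with a small Python sketch. It models the profile selection and the role as plain sets, which is a simplification; the ALL sentinel and the notified_objects helper are hypothetical names used only for illustration.

```python
ALL = None  # stands for "every checkbox ticked" in the profile

def notified_objects(role_objects, profile_objects=ALL):
    """Objects that can alert: the profile selection, further filtered by the role."""
    if profile_objects is ALL:
        return set(role_objects)
    return set(role_objects) & set(profile_objects)

role = {"web1/HTTP"}
narrow_profile = {"web1/HTTP"}  # a profile that lists objects explicitly

# A new service is added to the role only.
role.add("web2/HTTP")

print(sorted(notified_objects(role)))                  # picks up web2/HTTP automatically
print(sorted(notified_objects(role, narrow_profile)))  # still web1/HTTP only
```

With the explicit selection, the new service would stay silent until the profile was edited as well, which is the extra management step the suggested approach avoids.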

Select Status

A Notification Profile can filter alerts based on the status of the host or service. For instance, one profile can be created that only sends alerts for critical errors while another may be created for all alert statuses, including recovery and flapping. This is why the recommended strategy of this guide is to select all statuses at the service and host level. If any filtering is required, it can be done at this step without having to worry about whether the alert was lost along the way.

In the case of an on-call schedule, it is encouraged to enable all options in this step as well. On-call teams are likely to be much smaller than peak-hours teams, and such a limited team is more likely to require all of the notifications. If the on-call staff consists of more than one person, it is important to include the ‘Recovery’ status so that everyone on that shift is alerted when problems are resolved. This will further ensure that employees aren’t needlessly called into the office.

Select Time Period

This is where the on-call shift is defined. Any time period that was created in the previous steps can be selected. Time periods are public to all users. A profile should be created for each specific shift, for each group of objects and alert method. This is why this white paper recommends using the 24x7 time period at the host and service level: the time filtering can then be done when defining the on-call shifts in the Notification Profile.


Select Escalation Level

Opsview Monitor allows users to define which step they occupy in an escalation path. Service checks and host checks have a parameter called re-notification time. This is the span of time between the initial alert and each subsequent alert. Each time a problem goes unacknowledged for the re-notification time, the notification number increments, starting from 1. For the purposes of this tutorial, the recommendation is to clone each profile associated with a primary on-call shift and set ‘Send from Alert’ to 2 on the new clone. This is a way to have two levels that follow the MECE principle before getting more people involved. Recipients of notifications that are escalated past the first backup are more likely to oversee multiple teams, so cloning profiles would not be appropriate in those cases, and roles can and should expand.

It is important to note that re-notification only increments if a notification is triggered. If nobody is subscribed to the first alert and the notification never gets sent, Opsview Monitor will never increment to the second notification no matter how much time has passed. Make sure that all primary shifts are properly assigned before creating or assigning backups.
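
The escalation behaviour described above can be modelled with a short Python sketch. It is a simplified model under stated assumptions, not Opsview Monitor's implementation: it assumes the problem is never acknowledged, that a profile receives every notification numbered at or above its ‘Send from Alert’ value, and that the counter advances only when at least one recipient was notified. The function and profile names are hypothetical.

```python
def simulate_notifications(profiles, attempts):
    """Return one (notification_number, recipients) pair per re-notification interval.

    profiles: list of (profile_name, send_from_alert) tuples.
    """
    number = 1
    log = []
    for _ in range(attempts):
        recipients = [name for name, send_from in profiles if number >= send_from]
        log.append((number, recipients))
        if recipients:
            number += 1  # the counter advances only when something was sent
    return log

# Primary shift assigned: the backup joins from the second notification.
print(simulate_notifications([("primary", 1), ("backup", 2)], attempts=3))

# Primary shift left unassigned: the counter never reaches 2, so the backup is never alerted.
print(simulate_notifications([("backup", 2)], attempts=3))
```

The second call shows why primary shifts should be assigned first: in this model, a backup profile on its own never receives anything.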


Shared Notification Profiles

Notification Profiles can be private or shared. Private Notification Profiles are created while editing a user. Shared Notification Profiles are created for a specific role and will only be visible to that role. Shared Notification Profiles minimize overall management of notifications and allow the on-call process to be configured centrally by Opsview Monitor administrators.

Subscribe to Appropriate Profiles

There are two different ways that a Shared Notification Profile can be applied to a user. The first is when editing a user under the “Users and Roles” menu and navigating to the ‘Notifications’ tab. This will bring up the menu shown below. This is where zero, one, or many profiles can be assigned to the user and is the best method for Opsview Monitor administrators to assign profiles to others.

It is also possible for a user to subscribe to further Notification Profiles without having the ‘Configure Profiles’ option in their role. This method does not give the user the ability to make changes to a profile but allows them to subscribe to any shared profiles that are under the domain of their role. This is achieved by navigating through the user tab in the top right of Opsview Monitor. Rather than logging out, a user can select the ‘Access Profile’ menu and arrive on a screen like the one below, where user information can be changed and Notification Profiles can be selected.

Conclusion

The reader of this guide should now be able to set up a schedule for maintenance outside of business hours. To summarize the process, time periods must first be created for all of the shifts that may be required. Roles need to be created for the first line of response and their backups. These roles should ideally follow the MECE principle: mutually exclusive while collectively exhaustive. This will ensure that every host and service will be able to send alerts when appropriate while limiting the audience to those that need to be alerted. Roles can be expanded as the escalation level increases for supervisors and management.

Shared Notification Profiles then need to be created that cover the time period, role, escalation level, and notification method desired. To make the management of this stage easier, it is a good idea to select all objects and all statuses. By selecting all Host Groups, all Service Groups, and all Hashtags, the user’s role can be edited to include new hosts, services, and Hashtags and the notifications will be automatically updated.

By leveraging this feature in Opsview Monitor, an Enterprise can have centralized control over processes that are put in place to maximize uptime in an efficient way. On-call schedules are a necessity in today’s IT climate but it is important to allow employees to have an attractive work-life balance if the business wants to continue attracting and retaining top talent. Through the use of Shared Notification Profiles and the MECE principle, maximum uptime and a happy staff can be achieved.