IT Monitoring: Know Your Audience
Monitoring data, like all operations data, is most valuable when a presentation layer puts it in the proper context for each audience. When Quality of Service (QoS) values such as uptime and throughput make up strict SLA requirements, it becomes essential that the correct metrics and status information reach the right audience.
Complex IT environments use redundancy to create resilience and to improve overall performance. While high availability, disaster recovery, and load balancing clusters provide invaluable peace of mind for stakeholders, they also significantly complicate SLA reporting. How can this qualitative peace of mind be translated into a quantitative and reportable SLA value?
At the same time, it is important to make sure that isolated outages are resolved before they can propagate into a true loss of service. The more efficiently isolated outages are prioritized and resolved, the better the overall SLA report will be in the end. Staying ahead of isolated outages is therefore essential to meeting SLA requirements for redundant and resilient IT offerings.
These two levels of granularity are valuable to two completely separate target audiences. Real-life, end-user availability matters to service consumers such as customers, executive stakeholders, and compliance departments. SLA reporting against individual IT services running on hosts is best used by administrators and team leads. The primary argument of this white paper is that building business rules into a monitoring solution, and reporting appropriately to both application consumers and application administrators, can improve visibility and communication between the two parties and contribute to the overall success of the business.
This guide will go over the use cases for the Opsview Monitor BSM feature and for Opsview Monitor Hashtags and will demonstrate how to get the most out of both features. It will also cover the creation of BSM components and matching Hashtags to be used as a means of drilling into the component for extra information. It will then cover some effective dashboard configurations, notification rules, and reporting practices.
Modeling Your Business in Opsview Monitor
The presentation layer is a vital element of any business intelligence platform. Unless the underlying data can be decoded into the business terms it represents, the information provides little benefit to anyone. Therefore, like any other business intelligence tool, Opsview Monitor must provide a presentation layer in which real architecture rules can be modeled, so that the tool reports accurate end-user availability. When this is done thoroughly and correctly, risk can be identified and resolved before problems affect the service consumer, rather than allowing a costly outage to happen and then tracing the failure back to a root cause.
Business Service Monitoring (BSM) Components
The first essential things to define in Opsview Monitor are the BSM components. A BSM component is a functional grouping of hosts and services, together with an operational zone that is used to determine the overall health and priority of the grouping. Components are commonly used to define clusters, farms, or failovers.
Components simplify an application's complicated architecture rules by approaching them in logical pieces. Defining a component starts with the Host Template feature, the same feature that allows service checks to be applied to hosts in bulk by function. It is similarly able to group check results by function, which makes it the starting point for creating any component.
Selecting the desired Host Template filters the host selection box to display only the hosts that currently use that template. These hosts are all monitored in the same way because they are functionally very similar. It is then the job of the Opsview Monitor administrator to determine which of these hosts contribute to a shared goal, as the nodes of a cluster of Solaris servers would. An arbitrary number of hosts can be grouped together, and an Operational Zone is applied to the newly created BSM component. The Operational Zone is the percentage of the component that must be healthy for the component as a whole to be considered effective. As a result, small failures within a cluster are flagged as a potential impact to service rather than immediately being marked as a failure.
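The Operational Zone rule can be illustrated with a short Python sketch. This is not Opsview Monitor's implementation: the function name, the three state names, and the choice to treat "at least the zone percentage" as still effective are assumptions made for illustration only.

```python
from enum import Enum


class ComponentState(Enum):
    OPERATIONAL = "operational"  # every member host is healthy
    IMPACTED = "impacted"        # some hosts failed, zone still satisfied
    FAILED = "failed"            # healthy share fell below the zone


def component_state(host_ok: list[bool], operational_zone_pct: float) -> ComponentState:
    """Classify a component from per-host health and its operational zone.

    host_ok holds one boolean per member host (True means healthy).
    operational_zone_pct is the percentage of hosts that must be healthy
    for the component to be considered effective (assumed inclusive).
    """
    if not host_ok:
        raise ValueError("a component needs at least one host")
    if all(host_ok):
        return ComponentState.OPERATIONAL
    healthy_pct = 100.0 * sum(host_ok) / len(host_ok)
    if healthy_pct >= operational_zone_pct:
        return ComponentState.IMPACTED
    return ComponentState.FAILED


# Hypothetical four-node cluster with a 50% operational zone.
print(component_state([True, True, True, False], 50))    # one node down
print(component_state([True, False, False, False], 50))  # three nodes down
```

With one node down the cluster is only flagged as impacted; it is reported as failed once fewer than half of its nodes remain healthy.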
Before BSM Components, Hashtags were the only way to group together services to evaluate the impact of a critical event. The figure below shows the process for creating a complementary Hashtag to the BSM component that was made in the previous step. It is best practice to create these complementary Hashtag/component pairs for every reportable component. The reason for this will become clearer in later stages of this exercise.
For a BSM component complement, it is best to create a Hashtag that has the same name as the component and is grouped by service. This way, each of the functions the cluster needs in order to provide a service is accounted for and can be broken down by node.
Configuring this complementary Hashtag to reflect the same hosts and services as the component is a manual process. The hosts are selected first. These are the same hosts that were selected to create the component. Next, there is a check box to “Filter by selected hosts” so that choosing the correct service checks is less daunting. Once the same checks that are included in the host template have been selected, Opsview Monitor has a grouping that is effectively the component minus the resiliency rules.
The next step is to model the consumable service. Examples of these consumables include VoIP phone systems, email, the company website, collaboration portals, and various other applications. The status of these consumables is of particular interest to anyone looking from the outside in, including customers, executive stakeholders, compliance departments, and auditors. To provide an accurate status for these consumables, they must first be properly modeled within the monitoring software. By understanding the anatomy of the application, website, or workflow and representing its uptime needs properly, it is possible to foster a culture that concentrates on proactive troubleshooting rather than fighting fires. This may appear to be an intimidating task, but the difficult part has already been accomplished. These consumable business services are simply groupings of BSM components, which were already assigned priority and itemized SLA requirements when their operational zones were defined.
When creating a BSM service, there will be a section named the “Components Drawer”. This can be filtered with the text box adjacent to it to make components easier to find. To create a BSM service, simply click and drag components from the drawer into the BSM service.
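Because a BSM service is a grouping of components, its status can be thought of as a roll-up of component states. The Python sketch below assumes a worst-state-wins rule. That rule, the state names, and the example component names are illustrative assumptions, not a description of how Opsview Monitor computes service status.

```python
from enum import IntEnum


class State(IntEnum):
    # Ordered by severity so that max() picks the worst state.
    OPERATIONAL = 0
    IMPACTED = 1
    FAILED = 2


def service_state(components: dict[str, State]) -> State:
    """Roll component states up to one service state (worst state wins)."""
    if not components:
        raise ValueError("a service needs at least one component")
    return max(components.values())


def components_needing_attention(components: dict[str, State]) -> list[str]:
    """Name the components an administrator would drill into first."""
    return sorted(
        name for name, state in components.items() if state != State.OPERATIONAL
    )


# Hypothetical "Email" business service built from three components.
email = {
    "mail-frontends": State.IMPACTED,
    "mail-database": State.OPERATIONAL,
    "spam-filters": State.OPERATIONAL,
}
print(service_state(email).name)            # IMPACTED
print(components_needing_attention(email))  # ['mail-frontends']
```

The second function mirrors the drill-down role of the complementary Hashtag: the service view says that something is wrong, and the component list says where to look.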
Suitable Visibility for Separate Target Audiences
Monitoring is a three-phase effort: data collection, presentation, and response. Now that Opsview Monitor has been configured to model consumable business services, both in BSM and in Hashtags, it is time to move on to the second and third phases of the project. First, real-time presentation views will be created. Next, notification rules will be put in place for Hashtags, BSM components, and BSM services according to the recipient. Finally, historical reporting will be defined for administrators and external audiences alike.
Dashboards are valuable for both tactical and strategic audiences. High-level views are often the most important thing for customers or executives to see. Such a view has very little detail and will rarely, if ever, use Hashtags. One of the most powerful tools for executive dashboards is the filtering feature of the BSM Summary dashlet. Placing a BSM Summary dashlet in the dashboard and configuring its settings makes it possible to view only the subset of services that matter to a given audience, and also to filter by status. With the “Operational” box unchecked, the dashlet acts as a traffic light for application-level statuses: if any BSM services appear in it at all, they are either in scheduled maintenance, impacted by an underlying problem, or in full failure. This is an ideal way to see IT operations at the highest possible level.
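The traffic-light behavior amounts to a simple status filter. The following Python sketch shows the idea only. The status strings, service names, and data layout are hypothetical and are not taken from the dashlet's configuration or from any Opsview Monitor API.

```python
def traffic_light(service_status: dict[str, str]) -> dict[str, str]:
    """Keep only the services that are not fully operational.

    An empty result corresponds to an empty dashlet: nothing needs
    executive attention.
    """
    return {
        name: status
        for name, status in service_status.items()
        if status != "operational"
    }


# Hypothetical snapshot of four business services.
snapshot = {
    "Company Website": "operational",
    "Email": "impacted",
    "VoIP": "scheduled maintenance",
    "Collaboration Portal": "operational",
}
print(traffic_light(snapshot))
# {'Email': 'impacted', 'VoIP': 'scheduled maintenance'}
```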
The next task is to create dashboards that show details about the BSM service or BSM component for application owners or administrators respectively. A good practice for dashboards is to create a user for the purpose of holding shared dashboards for other users. For the purpose of displaying Business Service views, the user “BSM” with local authentication should be created. This user should have VIEWALL access, CONFIGUREBSM access, and DASHBOARDEDIT access at a minimum.
In the dashboard tab for this user, the following layout can be created for each BSM service to provide detail about the top-level Business Service.
For larger environments with many applications and services it may be a good idea to create an additional user named “COMPONENTS” to hold component level dashboards to be shared with others. Dashboards for this user may look like the example below.
Each of these users should share their dashboards with the roles that might find value in them. This practice now essentially provides saved dashboards that can be pulled up and deleted as needed by other Opsview Monitor users. This saves dashboard space without having to recreate these views from scratch every time that they are needed. The list of shared dashboards that are available should display BSM: <dashboard name> or Component: <dashboard name> for each choice, making it easy to find.
With the addition of the BSM feature in Opsview Monitor it is now possible to set a notification rule for BSM services or components rather than the previous choices of host groups, service groups, and Hashtags. As with everything else, the audience dictates the alerting requirements. BSM service notifications are ideal for application owners and should then escalate to management levels. BSM component notifications are directed towards team leads of various disciplines such as database, server, and network teams with an escalation to the appropriate architect. This leaves the legacy host group, service group, and Hashtag based notifications for the front lines of monitoring where every new issue should be investigated as quickly as possible.
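The audience mapping described above can be summarized as a small routing table. The Python sketch below is a planning aid only: the scope keys, role labels, and function are hypothetical and do not correspond to Opsview Monitor's notification profile settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    first_contact: str
    escalation: Optional[str]  # None where no escalation path is described


# Who is alerted for each notification scope, per the paragraph above.
ROUTES = {
    "bsm_service": Route("application owner", "management"),
    "bsm_component": Route("team lead", "architect"),
    "host_group": Route("front-line operator", None),
    "service_group": Route("front-line operator", None),
    "hashtag": Route("front-line operator", None),
}


def recipient(scope: str, escalated: bool = False) -> str:
    """Return the role to notify for a scope, honoring escalation if defined."""
    route = ROUTES[scope]
    if escalated and route.escalation is not None:
        return route.escalation
    return route.first_contact


print(recipient("bsm_component"))                  # team lead
print(recipient("bsm_component", escalated=True))  # architect
print(recipient("hashtag", escalated=True))        # front-line operator
```

Writing the table down before configuring notification rules makes gaps visible, such as a scope with no owner or an escalation that points at the wrong audience.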
The ultimate goal of reporting, as it relates to monitoring, is to tell an accurate story. Accuracy can actually be flattering when it comes to SLAs in such complicated architectures. High availability, load balancing, site failovers, and other architecture concepts are implemented because they make for a more stable environment. SLA reporting against a BSM service therefore gives a real depiction of end-user availability. Opsview Monitor now provides SLA reporting against the application or business service so that this information can be sent to management or a customer.
It is important, however, to make sure that this report will be telling a positive story at the end of the day, week, month, or year. An ideal way to make sure that the final, often automated, report is going to satisfy the SLA requirements is to stay in front of application outages by maintaining the components that make them up. To do this, Daily Service Level Reports and Daily Performance Reports can be run against the Hashtag that was created to complement the components that make up the BSM service.
These can be regularly scheduled to be sent to the appropriate administrators so that individual service checks and host failures can be corrected before they exceed the component’s operational zone. By reporting both at the BSM service level and at the component/Hashtag level, different audiences can be provided the correct level of granularity for their individual needs.
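The difference between the two reporting levels can be made concrete with a small availability calculation. The Python sketch below assumes outages are recorded as sets of minute offsets within a reporting window and that the service is up whenever a minimum number of hosts is healthy. The data layout, host names, and function names are hypothetical, and this is not how Opsview Monitor's reports are computed.

```python
def availability_pct(down_minutes: set[int], window: int) -> float:
    """Availability over a window, given the minutes that were down."""
    return 100.0 * (window - len(down_minutes)) / window


def service_down_minutes(
    host_outages: dict[str, set[int]], min_healthy: int, window: int
) -> set[int]:
    """Minutes in which fewer than min_healthy hosts were up."""
    hosts = len(host_outages)
    down = set()
    for minute in range(window):
        unhealthy = sum(minute in outage for outage in host_outages.values())
        if hosts - unhealthy < min_healthy:
            down.add(minute)
    return down


WINDOW = 24 * 60  # one day, in minutes

# Hypothetical two-node cluster whose outages do not overlap.
outages = {
    "node-a": set(range(0, 60)),     # down for the first hour
    "node-b": set(range(600, 630)),  # down for 30 minutes mid-morning
}

for host, down in outages.items():
    print(host, round(availability_pct(down, WINDOW), 2))  # 95.83, then 97.92

service_down = service_down_minutes(outages, min_healthy=1, window=WINDOW)
print("service", round(availability_pct(service_down, WINDOW), 2))  # 100.0
```

The per-host figures are what administrators need in order to act, while the service-level figure is what the end user actually experienced.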
Monitoring is made up of three factors: data collection, presentation, and action. To turn this information into high-value business intelligence, it is important to always keep in mind the question, “Who is my audience?” Monitoring data can be used for an immediate response by an administrator who specializes in networks, servers, or hardware as appropriate. This same data can be rolled up for team leads, architects, and application owners. Finally, at the highest level, transparency should be provided for executive stakeholders and the expected consumers of these applications and business services. This way, risks can be identified and prevented at every stage instead of root causes being determined after outages occur. This promotes a culture of better communication and better relationships between Information Technology and other business units in the organization, leading to better overall operations.