Three Approaches to Reducing IT Alert Storms
Many IT operations teams spend far too much time wading through a sea of alerts (sometimes called an ‘alert storm’), most of which are just symptoms of the true cause of an issue. Of course, experienced, smart IT people have often seen the alert pattern before, so they can venture a fairly good guess as to the real issue. They are right more often than not, so they solve the issue and business continues.
But what if this experienced, smart IT person is needed on other IT projects required to drive the business? Or worse, what if this person leaves the company (maybe because they would like a new challenge)? That sea of alerts now becomes a real problem for the more junior person who takes over. They spend far too much time trying to figure out the issue, and while they do, there is a good chance that an important business service is performing poorly or, worse, is down.
Further, with the increasing complexity and dynamic nature of today’s IT environments, it is no longer reasonable to ask IT teams, no matter how experienced, to rely on experience and history to sort out a sea of red. It is an inefficient use of their time and it puts the business at risk. Fortunately, there is a better way. Three key approaches can go a long way toward reducing the sea of red:
1. Understand parent-child relationships:
Infrastructure is typically architected with core components (called “parents”) and edge devices (called “children”) connected to them. These parent-child relationships can apply to the WAN, LAN, datacenter, cloud services, etc. The important benefit of understanding the parent-child relationship is that alerts can be understood and managed based on these relationships. For example, if a core router has an issue and sends an alert, it is very likely that the devices connected to it will also send alerts, which can begin an alert storm. The alert storm makes it difficult for the IT team to separate the root cause (the core router issue) from the symptoms (the alerts from connected devices). Suppressing the “child” alerts and sending only the root cause “parent” alerts can significantly reduce the number of alerts and allow the IT team to focus on the source of the issue.
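The suppression idea above can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring product implements it: the `PARENTS` topology, the device names, and the `root_cause_alerts` function are all hypothetical, and the sketch assumes each device has a single parent and the topology has no loops.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each device maps to its parent (None for a root).
PARENTS: Dict[str, Optional[str]] = {
    "core-router": None,
    "switch-a": "core-router",
    "switch-b": "core-router",
    "server-1": "switch-a",
    "server-2": "switch-b",
}


def root_cause_alerts(alerting: Set[str],
                      parents: Dict[str, Optional[str]]) -> List[str]:
    """Return only the alerting devices that have no alerting ancestor.

    Any device whose parent, grandparent, etc. is also alerting is
    treated as a symptom and suppressed. Assumes the topology is a tree.
    """
    roots = []
    for device in sorted(alerting):
        ancestor = parents.get(device)
        suppressed = False
        # Walk up the chain of parents looking for another alerting device.
        while ancestor is not None:
            if ancestor in alerting:
                suppressed = True
                break
            ancestor = parents.get(ancestor)
        if not suppressed:
            roots.append(device)
    return roots


storm = {"core-router", "switch-a", "server-1", "server-2"}
print(root_cause_alerts(storm, PARENTS))
```

In this example, four devices are alerting but only the core router is reported, because every other alerting device sits somewhere beneath it.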
More advanced IT monitoring products include some variation of the parent-child capability. These range from the very sophisticated, including things such as dependency discovery and mapping, to the required basics: the flexibility to create relationships across the infrastructure and suppress symptomatic alerts. Generally, the more sophisticated the capability, the more costly the solution. For the largest enterprises and service providers, more sophisticated capabilities may be needed. For most mid-sized enterprises, the required basics will meet the need.
2. Alert management:
Once the alert storm is addressed, it is then important to make sure that the remaining, important alerts are sent to the right members of the IT team by using effective notification methods. Otherwise, IT teams are in danger of having too many people or not the right people trying to resolve the issue.
Good alert management begins with good process. Identifying infrastructure owners and backups, along with who is on call at certain times, provides the basis for deciding how alerts are handled. IT teams then need a product flexible enough to set up alerts and notifications that match their process.
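As a rough illustration of such a process, the sketch below routes an alert to the first owner of a device who is currently on call, and falls back to a shared contact otherwise. Everything here is an assumption made for the example: the `Contact` type, the `OWNERS` table, the contact names, and the `recipients` function do not come from any specific product.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Contact:
    name: str
    method: str  # notification method, e.g. "email" or "text"


# Hypothetical ownership table: primary owner first, then the backup.
OWNERS: Dict[str, List[Contact]] = {
    "core-router": [Contact("network-primary", "text"),
                    Contact("network-backup", "email")],
    "db-server": [Contact("dba-primary", "text"),
                  Contact("dba-backup", "text")],
}

# Hypothetical catch-all used when no owner is on call.
FALLBACK = Contact("ops-duty-desk", "email")


def recipients(device: str, on_call: Set[str],
               owners: Dict[str, List[Contact]],
               fallback: Contact) -> List[Contact]:
    """Pick the first owner of `device` who is on call, else the fallback."""
    for contact in owners.get(device, []):
        if contact.name in on_call:
            return [contact]
    return [fallback]


on_call_now = {"network-backup", "dba-primary"}
print(recipients("core-router", on_call_now, OWNERS, FALLBACK))
```

Notifying one person per alert is a deliberate simplification; a real process might also escalate to the backup if the alert goes unacknowledged.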
Many IT monitoring products will have some alert management capabilities, including selecting the people to be notified and methods of notification (email, text, etc.). More advanced solutions may include built-in integrations with sophisticated notification tools, such as PagerDuty, VictorOps and OnPage. It is important to match these capabilities against your alert management needs and process.
3. Business Service Monitoring (BSM):
BSM takes the parent-child relationship to the business services level by relating IT infrastructure to business services. With BSM, IT teams can monitor and alert on the availability and performance of business services, rather than just infrastructure components. BSM alerts are more meaningful in that they capture the impact to business services, which then allows prioritization of resources, ensuring that the most critical service performance issues or outages are addressed first.
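A minimal sketch of this mapping, under stated assumptions, might look like the following. The `SERVICES` table, its priorities, and the `impacted_services` function are hypothetical, and the sketch treats a service as impacted whenever any one of its components fails, which ignores redundancy that a real deployment would likely account for.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mapping: service -> (priority, components it depends on).
# A lower priority number means a more critical service.
SERVICES: Dict[str, Tuple[int, Set[str]]] = {
    "online-store": (1, {"web-1", "db-server", "core-router"}),
    "email": (2, {"mail-1", "core-router"}),
    "intranet-wiki": (3, {"wiki-1"}),
}


def impacted_services(failed: Set[str],
                      services: Dict[str, Tuple[int, Set[str]]]) -> List[str]:
    """List services touched by failed components, most critical first."""
    hit = [
        (priority, name)
        for name, (priority, components) in services.items()
        if components & failed  # any shared component counts as an impact
    ]
    return [name for _, name in sorted(hit)]


print(impacted_services({"core-router"}, SERVICES))
```

Here a single router failure is reported as an impact on the online store first and email second, which is the ordering a team would use to prioritize its response.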
More advanced monitoring products should have BSM capabilities to help IT organizations see business service level alerts. Capabilities can range from sophisticated pattern recognition to the required basics of relating infrastructure to services. As with the parent-child capability, the more sophisticated capabilities may be needed by very large enterprises and service providers, while the required basics will work for most mid-sized enterprises.
If you find yourself with the “sea of red” problem, Opsview can help. Using the three methods described above, we are helping our customers reduce unnecessary alerts, operate more efficiently, and reduce costly downtime. Try Opsview Monitor for free.