Business Service Monitoring: Impact vs. Outage
It is tempting to create large host templates for applying service checks to a given device. However, the ease of implementation can quickly be outweighed by the lack of flexibility when creating SLA requirements for Opsview’s BSM components. This is because the results of some checks are going to be more important than others.
Database server monitoring is a good example. The host templates provided by Opsview contain relevant checks for the database server, and every service check in a template applies to the host, so at first it makes sense to apply all of the service checks at once. The problem is that the template contains availability, capacity, and performance monitors for the database server, and each of these categories may have different business rules for determining failure conditions. In a single component, the percentage of in-memory sorts would be considered just as important as being able to connect to the database in an acceptable amount of time. That is simply not the case, so separating the host templates into smaller, more flexible BSM components is the better overall approach for modeling an application.
Separating the host templates
This exercise lists three host templates that can be built from the existing “Database – Oracle RDBMS” template. The new templates separate availability, capacity, and performance service checks into different groups so that distinct SLA rules and failure conditions can be attached to each of these three aspects of database health.
- Database – Oracle RDBMS – Availability: Connected Users, Corrupted Blocks in Database, Flash Recovery Area Free, Invalid Objects, Maximum Sessions, RMAN Backup Errors, Time to Connect, Tnsping Check, Used Space in Flash Recovery Area
- Database – Oracle RDBMS – Capacity: Datafiles Possible Maximum Number, Free Space Fragmentation Index, Tablespace Can Allocate Next, Tablespace Free, Tablespace Remaining Time, Tablespace Usage
- Database – Oracle RDBMS – Performance: Datafile IO Traffic, Enqueue Contention, Enqueue Waiting, Latch Contention, Latch Waiting, PGA In-memory Sort Ratio, Redo Buffer Allocation Retries, Redo IO Traffic, All Rollback Segment Checks, All SGA Checks, Soft Parses Percentage, Stale Optimizer Statistics, Time Between Redo Log File Switches, User Object Top 10 Checks
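The grouping above can also be written down as data, which makes it easy to confirm that no check ends up in two templates. The following is only an illustrative sketch: the dictionary layout and the `category_of` helper are hypothetical and are not part of Opsview. The check names are copied from the list above, including the grouped entries such as "All SGA Checks".

```python
# Hypothetical mapping of the three split templates to their checks.
# Names are taken from the list above; the structure is an assumption.
TEMPLATES = {
    "Availability": [
        "Connected Users", "Corrupted Blocks in Database",
        "Flash Recovery Area Free", "Invalid Objects", "Maximum Sessions",
        "RMAN Backup Errors", "Time to Connect", "Tnsping Check",
        "Used Space in Flash Recovery Area",
    ],
    "Capacity": [
        "Datafiles Possible Maximum Number", "Free Space Fragmentation Index",
        "Tablespace Can Allocate Next", "Tablespace Free",
        "Tablespace Remaining Time", "Tablespace Usage",
    ],
    "Performance": [
        "Datafile IO Traffic", "Enqueue Contention", "Enqueue Waiting",
        "Latch Contention", "Latch Waiting", "PGA In-memory Sort Ratio",
        "Redo Buffer Allocation Retries", "Redo IO Traffic",
        "All Rollback Segment Checks", "All SGA Checks",
        "Soft Parses Percentage", "Stale Optimizer Statistics",
        "Time Between Redo Log File Switches", "User Object Top 10 Checks",
    ],
}

# Reverse index: check name -> category. Fails loudly on duplicates.
CHECK_TO_CATEGORY = {}
for category, checks in TEMPLATES.items():
    for check in checks:
        if check in CHECK_TO_CATEGORY:
            raise ValueError(f"{check!r} appears in more than one template")
        CHECK_TO_CATEGORY[check] = category


def category_of(check):
    """Return the template category a check was assigned to."""
    return CHECK_TO_CATEGORY[check]


print(category_of("Time to Connect"))
print(category_of("PGA In-memory Sort Ratio"))
```

Keeping each check in exactly one category is what allows a separate rule to be attached to each group later on.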
The Operational Zone
When creating a business service, the way components are constructed can act as a powerful business rules engine. By splitting checks into separate components, different rules can be implemented and enforced for each one.
For example, consider an architecture with six database servers across three data centers (two per data center) with local clustering. The rule may be that as long as any four servers are operational globally, the application can handle the expected load. That would be an example of a single BSM component with an operational zone of 66.7% (four of six). However, if the rule is that at least one server at each location must be operational for the application to continue functioning, that would be implemented as three BSM components, each with an operational zone of 50% (one of two).
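The difference between the two models can be checked with a little arithmetic. The sketch below is not Opsview code. It assumes a component counts as operational when the share of working hosts is at least the operational zone, and it uses exact fractions so that four of six compares cleanly against the rounded 66.7% figure. The function names and the data center labels are hypothetical.

```python
from fractions import Fraction


def operational_zone(min_hosts_up, total_hosts):
    """Operational zone implied by a 'at least N of M hosts' rule."""
    return Fraction(min_hosts_up, total_hosts)


def is_operational(hosts_up, total_hosts, zone):
    """Assumed rule: operational when the share of hosts up meets the zone."""
    return Fraction(hosts_up, total_hosts) >= zone


global_zone = operational_zone(4, 6)  # any four of six servers
site_zone = operational_zone(1, 2)    # at least one of two per data center

# Hypothetical outage: both servers in one data center are down.
hosts_up_by_site = {"dc1": 2, "dc2": 2, "dc3": 0}

single_component_ok = is_operational(
    sum(hosts_up_by_site.values()), 6, global_zone
)
per_site_ok = all(
    is_operational(up, 2, site_zone) for up in hosts_up_by_site.values()
)

print(f"global zone: {float(global_zone):.1%}")
print(f"single component operational: {single_component_ok}")
print(f"all site components operational: {per_site_ok}")
```

With this outage the single-component model still reports operational, because four of six servers are up, while the per-site model reports a failure in the third data center. The same four healthy servers give two different answers depending on which business rule was encoded.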
For the original exercise, the operational zone is going to be used in different ways to reflect different business rules relating to availability, capacity, and performance. In the example below, availability is considered the ability to connect to and use the database.
The design of the application dictates that as long as one of the two servers is operational, the application will be able to support the load, which corresponds to an operational zone of 50% for the availability component. When considering capacity, if either of the two servers were to run out of disk space it would be considered useless, and that would put the entire application at risk. In this situation, the operational zone is set to 100%, meaning that capacity issues are of the highest priority. Finally, the performance checks have been assigned an operational zone of 0%. This way the performance checks can still be associated with the application, but they can never be used to determine a failure condition for the overall application. Instead, if any of these service checks fails, it only indicates that the business service is impacted, not that it has failed.
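As a rough sketch of how these three rules interact, the following models each component with its operational zone and rolls the results up into one business service state. It is an illustration under stated assumptions, not Opsview's implementation: the `Component` class, the `State` names, and the roll-up rule are hypothetical, a host counts as "ok" only when all of its checks in that component pass, and the 50% zone for availability is inferred from the one-of-two rule.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OPERATIONAL = "operational"
    IMPACTED = "impacted"
    FAILED = "failed"


@dataclass
class Component:
    name: str
    zone_percent: float  # operational zone
    hosts_ok: int        # hosts whose checks in this component all pass
    hosts_total: int

    def state(self):
        ok_percent = 100.0 * self.hosts_ok / self.hosts_total
        if ok_percent < self.zone_percent:
            return State.FAILED        # fell below the operational zone
        if self.hosts_ok < self.hosts_total:
            return State.IMPACTED      # degraded, but still inside the zone
        return State.OPERATIONAL


def service_state(components):
    """Assumed roll-up: the worst component state wins."""
    states = [c.state() for c in components]
    if State.FAILED in states:
        return State.FAILED
    if State.IMPACTED in states:
        return State.IMPACTED
    return State.OPERATIONAL


# Performance checks fail on both servers: impacted, never failed.
perf_only = [
    Component("Availability", 50, 2, 2),
    Component("Capacity", 100, 2, 2),
    Component("Performance", 0, 0, 2),
]

# One server runs out of disk space: the 100% zone is breached.
disk_full = [
    Component("Availability", 50, 2, 2),
    Component("Capacity", 100, 1, 2),
    Component("Performance", 0, 2, 2),
]

print(service_state(perf_only).value)
print(service_state(disk_full).value)
```

The first scenario reports the service as impacted even though every performance check has failed, while a single capacity problem in the second scenario is enough to mark the service as failed.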
By separating groups of service checks or groups of hosts into different BSM components, interesting and powerful business rules can be applied to these architectures with very little effort or management cost. This will result in higher value business rules that can be applied to SLA reporting, notifications, and overall IT decision making.