Blog

Role Optimization In Opsview Monitor 5.2

tvoon's picture

Background

One of the ways Opsview is better than Nagios is by having roles. This allows administrators to group users together to give them the same access to only the hosts and services that they are interested in.

We've been using the NDO Utils software to insert data into our live-up-to-the-second database, which we call Runtime. We have customized this code to work better with Opsview. For example, we changed from using the file2db command to our own database importer back in Opsview 4.2, which gave us a 50% improvement at that time and the flexibility to make future changes. But the schema that comes out of the box for NDO Utils is not optimized for querying, so we developed helper tables to hold the most common lookups, allowing simpler joins for our developers to get the information they want from Runtime.

The helper tables are created after the Opsview reload has finished and Nagios sends a large import file. We measure this time as the "post-reload time".

When we started analyzing some of our customer systems, we found the biggest helper table was opsview_contact_objects. For some of our customers, this was a multi-million row table!

We couldn't just get rid of the table. This table links each user to each object and is fundamental to providing our access control. It is used by all our REST API calls to ensure that users can only see the objects they are allowed to.

But we had a cunning plan - what about moving it to the role level?

So we started designing for a new table, naturally called opsview_role_objects. We estimated this would give us a ten-fold reduction in the amount of data required in this data.

Plan

There were various techniques we wanted to use when we embarked on this project:

Removing unnecessary tables and columns

We had some redundant tables in Runtime. For example, historically, we had two tables, opsview_contact_host and opsview_contact_services, which were amalgamated to create a larger opsview_contact_objects, but some older code was still referencing the other two tables. We would clean up code to only use the single table in the future.

Also, the base NDO Utils schema includes a lot of tables and columns of data that are stored, but Opsview doesn't use them. In the front end, we work on the principle of keeping it simple, so we started to aggressively remove data we no longer needed.

Reducing data sent from Nagios

With this in mind, we planned on changing Nagios to only send the required information. It turns out that some data was being sent twice - once at the start of a process and once at the end - so we simplified these.

Since we use files as the interface between Nagios and our database writer, this would reduce the amount of I/O required for normal operation - the post-reload import file is always particularly large.

Simplifying join tables to create helper tables

Some tables were overly complex, using a combination of primary keys and foreign keys to get the necessary information. For instance, to get a lookup of users to services, you needed to join against five tables:

INSERT INTO opsview_contact_services
SELECT DISTINCT
 c.id, s.service_object_id
FROM
 opsview_contacts c,
 nagios_contactgroup_members cgm,
 nagios_contactgroups cg,
 nagios_service_contactgroups scg,
 nagios_services s
WHERE
 c.contact_object_id = cgm.contact_object_id
 AND cgm.contactgroup_id = cg.contactgroup_id
 AND cg.contactgroup_object_id = scg.contactgroup_object_id
 AND scg.service_id = s.service_id
 AND s.config_type = 1

We redesigned this so we only needed to join two tables:

INSERT INTO opsview_role_objects
SELECT DISTINCT role_id,object_id
FROM
 configdump_role_contactgroups,
 configdump_object_contactgroups
WHERE
 configdump_role_contactgroups.contactgroup_id = configdump_object_contactgroups.contactgroup_id

The trick is to populate those two tables quickly and efficiently, and for that we used....

Multi value inserts

Instead of inserting rows one at a time, we batched up the data so we could insert all at once. This would be much more efficient because it saves the database layer from having to parse each statement and re-index after each insert. Instead of 1000 insert statements, we have one insert statement with 1000 values.

Development

This was a complicated project. It affected our Nagios configuration generator, Nagios parsing, NDO data output, our database writer and all ACL REST API queries, moving along the whole backend stack.

NDO flow diagram

However, with our automated tests, we were able to pick up a lot of issues during development. Our fantastic Quality Assurance team still managed to find four bugs during their testing, which were quickly fixed and more automated tests were added for the future.

Results

And the results were astounding! We based our test system on one of our largest customers and we found:

* The size of the import file was reduced by 60%
* Time to import this data into Runtime database reduced by 60%
* Post reload time reduced by 93% (!)
* Number of rows in the ACL table reduced by 90% (!)
* REST API calls with ACL reduced between 5-10%

Roles optimisation graph

These are incredible numbers! While this all happens behind the scenes, Opsview just feels a lot snappier as its REST API calls are quicker to respond.

Best of all, this enhancement will automatically be available when you update to Opsview 5.2. Our amazing database upgrade scripts will detect your current version of Opsview and apply each database change necessary to get this new feature.

This makes all Opsview systems much more efficient, but will give the biggest benefits to our largest customers.