Opsview's New Monitoring Architecture

Alongside new monitoring capabilities, a main focus of Opsview's development has been improving performance and scalability, so that customers continue to enjoy a great user experience as they expand their use of the product across growing and changing infrastructure environments.

Increasingly, customers are demanding that monitoring tools work at significant scale. Rather than monitoring thousands or tens of thousands of hosts (endpoints), large enterprises and MSPs may now need to monitor hundreds of thousands of hosts on a single system. To meet this expectation while also delivering best-in-class performance to customers operating at smaller scales, Opsview has created an entirely new architecture for its latest release, Opsview 6.

This new architecture has allowed the Nagios® engine to be removed completely while retaining full compatibility with Nagios plugins. Existing Opsview users can keep their current plugins, which assures a smooth upgrade path, and customers can still use any of the thousands of plugins available from the community.

Microservices in the Monitoring Architecture

For maximum flexibility, scalability and performance, Opsview 6’s new monitoring architecture is based on microservices, in which a number of separate components (processes) each perform a discrete task. Instead of linking these microservice processes using sockets or classic inter-process communication (IPC) methods, we connect them via a message bus: an abstract communication system with built-in resiliency features that is performant, robust, and easy to use in highly dynamic environments.

Detailed diagram of Opsview Monitor 6.0 Microservices

Each component is managed with its own local configuration file and can be run alongside other components on a single server, or scaled out horizontally across multiple servers. Components which affect scalability, such as those which are executing plugins to collect monitoring data, can be duplicated as many times as is required in order to deliver greater throughput and better utilize underlying resources. Further performance and predictability gains can be made by grouping components of the same type to create function-dedicated servers.
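The scale-out idea described above can be illustrated with a small sketch. It is not Opsview code: it uses Python's standard library queue in place of the message bus, and the names run_check, executor and run_pool are hypothetical. It shows only that identical workers pulling from a shared queue can be added or removed without changing the producer.

```python
import queue
import threading


def run_check(host):
    # Hypothetical stand-in for executing a monitoring plugin against a host.
    return {"host": host, "status": "OK"}


def executor(jobs, results):
    # Each executor pulls work from the shared queue until it sees the sentinel.
    while True:
        host = jobs.get()
        if host is None:
            break
        results.put(run_check(host))


def run_pool(hosts, n_executors):
    # Start n identical executors; adding more raises throughput
    # without any change to the code that submits the work.
    jobs = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=executor, args=(jobs, results))
        for _ in range(n_executors)
    ]
    for worker in workers:
        worker.start()
    for host in hosts:
        jobs.put(host)
    for _ in workers:
        jobs.put(None)  # one sentinel per executor
    for worker in workers:
        worker.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Calling run_pool(["web01", "web02", "db01"], 2) and run_pool(["web01", "web02", "db01"], 4) returns the same set of results; only the degree of parallelism differs.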

RabbitMQ and CouchDB

RabbitMQ provides reliable, scalable messaging across the system, including end-to-end encryption both in transit and at rest. It was chosen over alternatives such as Kafka mainly for its simplicity of configuration, particularly in clustered environments. An additional benefit of using a queue-based tool for messaging is that integration with third-party systems becomes even easier than in earlier releases of Opsview. A few lines of code can now create a queue listener that passes live monitoring data, such as check results or alerts, directly into a data lake or analysis tool that may already be in place in your environment.
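As a rough sketch of such a listener, the following uses the third-party pika client for RabbitMQ. The queue name, the broker host and the message fields are assumptions made for illustration; the article does not specify the queues or message schema that Opsview actually exposes. The sink here is a plain list standing in for a data lake or analysis tool.

```python
import json


def handle_message(body, sink):
    # Decode one JSON message from the bus and hand it to the sink.
    # The sink is a plain list here; a real integration would write
    # to a data lake or analysis tool instead.
    record = json.loads(body)
    sink.append(record)
    return record


def listen(queue_name, sink, host="localhost"):
    # queue_name and host are hypothetical; use the values for your deployment.
    import pika  # third-party RabbitMQ client, imported only when listening

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()

    def on_message(ch, method, properties, body):
        handle_message(body, sink)
        # Acknowledge only after the record has been handed to the sink.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()  # blocks until the consumer is stopped
```

Keeping the decoding step in handle_message, separate from the broker connection, means the forwarding logic can be exercised without a running RabbitMQ instance.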

CouchDB provides a fast and resilient NoSQL data store, which Opsview uses to hold live monitoring state and other runtime data needed by the various system components. All data exchanged over the message bus is JSON-formatted, so a datastore that speaks JSON natively is an ideal choice, especially as it also ensures fault tolerance in a highly distributed system.

Although the microservices-based monitoring architecture can scale horizontally, a smaller system monitoring only a few thousand hosts could run on perhaps just three or four servers: a database, an orchestrator server (which hosts the web UI), and two collector nodes for running the checks. This is the same layout as would be used with earlier versions of Opsview, which shows that microservices do not necessarily create additional complexity.

In larger environments, scaling to tens of thousands of hosts and beyond is best achieved using several separate servers, which can be bare metal or virtual machines, on-premises or in a public or private cloud. Figure 2 shows an example monitoring architecture for a larger system monitoring 10,000 hosts. There are many variables to consider when architecting a large system, and this diagram is just one possible solution. Adding further collector clusters, schedulers and executors would allow this system to be scaled much further still.

Figure 2. An example deployment of Opsview Monitor 6.0, used for benchmarking.

Deploying and managing all these components, plus additional software, may sound complex, but in Opsview 6, these tasks don’t need to be challenging. Since early in the development of this new architecture, Opsview has used Ansible – a leading open source IT automation framework – for Opsview lifecycle management. It is now possible to deploy and configure the entire system through Ansible playbooks which are provided out of the box. Components are installed, configured and connected to each other without manual intervention, greatly reducing both the learning curve and the risk of manual configuration errors being introduced.

Whatever the size of your IT environment, Opsview 6’s new monitoring architecture can scale to meet your needs. With fast, automated deployment and comprehensive integrations with tools such as configuration management and service desks, getting broad and deep monitoring coverage of your critical business systems is easier than you think.