Each component is managed through its own local configuration file and can run alongside other components on a single server or be scaled out horizontally across multiple servers. Components that affect scalability, such as those that execute plugins to collect monitoring data, can be duplicated as many times as required to deliver greater throughput and make better use of underlying resources. Further gains in performance and predictability can be made by grouping components of the same type onto function-dedicated servers.
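The effect of duplicating plugin-executing components can be illustrated with a small, self-contained sketch. It uses no Opsview Monitor code: run_plugin, executor and run_checks are hypothetical names, and threads pulling from an in-process queue merely stand in for executor components taking work from the message bus.

```python
import queue
import threading


def run_plugin(host):
    """Hypothetical stand-in for executing a monitoring plugin against a host."""
    return {"host": host, "state": "OK"}


def executor(work, results, lock):
    """One executor instance: take hosts from the shared queue until it is empty."""
    while True:
        try:
            host = work.get_nowait()
        except queue.Empty:
            return
        result = run_plugin(host)
        with lock:
            results[host] = result


def run_checks(hosts, executor_count):
    """Spread checks across `executor_count` identical executor instances."""
    work = queue.Queue()
    for host in hosts:
        work.put(host)
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=executor, args=(work, results, lock))
        for _ in range(executor_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every executor pulls from the same queue, changing executor_count alters how the work is shared out but not the results that are produced, which is the property that makes this kind of component safe to duplicate.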
RabbitMQ and CouchDB
RabbitMQ provides reliable, scalable messaging across the system, including end-to-end encryption both in transit and at rest. It was chosen over alternatives such as Kafka mainly for its simplicity of configuration, particularly in clustered environments. An additional benefit of a queue-based messaging tool is that integration with third-party systems becomes even easier than in earlier releases of Opsview Monitor: a few lines of code can create a queue listener that passes live monitoring data, such as check results or alerts, directly into a data lake or analysis tool that may already be in place in your environment.
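A queue listener of this kind might look roughly like the following sketch. It assumes the third-party pika client for RabbitMQ. The queue name, the connection URL and the message fields are placeholders rather than documented Opsview Monitor values, and the list used as a sink stands in for whatever data lake or analysis tool receives the data.

```python
import json


def handle_result(body, sink):
    """Decode one JSON message from the bus and append it to `sink`."""
    message = json.loads(body)
    sink.append(message)
    return message


def listen(amqp_url, queue_name, sink):
    """Consume messages from an existing queue until interrupted.

    `amqp_url` and `queue_name` are assumptions supplied by the caller;
    the queue is expected to exist already. Requires `pip install pika`.
    """
    import pika  # imported lazily so handle_result works without pika

    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    channel = connection.channel()

    def on_message(ch, method, properties, body):
        handle_result(body, sink)
        # Acknowledge only after the message has been handed to the sink.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    try:
        channel.start_consuming()
    finally:
        connection.close()
```

Keeping the decoding step in handle_result, separate from the connection handling, means the forwarding logic can be tested without a running broker.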
CouchDB provides a fast and resilient NoSQL data store in which Opsview Monitor keeps live monitoring state and other runtime data needed by the various system components. All data exchanged over the message bus is JSON-formatted, so a datastore that speaks JSON natively is an ideal choice, especially as it also ensures fault tolerance in a highly distributed system.
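To illustrate why a JSON-native store fits well here, the sketch below builds a request for CouchDB's standard HTTP document API (PUT /{db}/{docid}) from a JSON message without any schema mapping. The database name, document ID and message fields are hypothetical and do not describe how Opsview Monitor lays out its own data.

```python
import json
import urllib.parse
import urllib.request


def build_store_request(base_url, database, doc_id, message):
    """Build (but do not send) a CouchDB `PUT /{db}/{docid}` request."""
    url = "{}/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(database, safe=""),
        urllib.parse.quote(doc_id, safe=""),
    )
    # The message is stored as-is: the JSON document is the record.
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen would send it, which requires a reachable CouchDB instance and, in most deployments, credentials.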
Although the microservices-based architecture makes horizontal scaling possible, a smaller system monitoring only a few thousand hosts could run on perhaps just three or four servers: a database, an orchestrator server (which hosts the web UI), and two collector nodes for running the checks. This is the same architecture as would be used with earlier versions of Opsview Monitor, which highlights the fact that microservices do not necessarily create additional complexity.
In larger environments, scaling to tens of thousands of hosts and beyond is best achieved using several separate servers, which can be bare metal or virtual machines, on-premises or in a public or private cloud. Figure 2 shows an architecture diagram for a larger system monitoring 10,000 hosts. There are many variables to consider when architecting a large system, and this diagram is just one possible solution. Adding more collector clusters, schedulers and executors would allow this system to be scaled much further still.