When we set out to create a new architecture for Opsview Monitor, our original goal was to build a system that could scale to handle at least three times the monitoring workload of our largest customer at the time. We also wanted to ensure that, when existing customers upgraded to our new release, they wouldn't also need to upgrade their hardware.
This posed some interesting challenges. We wanted to design a system that could scale horizontally across a large number of machines, potentially in the cloud, but also run on a single machine if required. Our architectural strategy is detailed in this interview with Alex Burzynski, Opsview's Chief Architect. To summarize, we broke up the big pieces of functionality into a number of simple, scalable, and tunable microservices; made those services communicate using a robust message-brokering solution called RabbitMQ; and, where needed, arranged to persist their state in a resilient NoSQL database called CouchDB.
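To make the message-passing idea concrete, here is a minimal sketch of what a check result might look like as a message envelope before it is published to the bus. The field names and status codes are purely illustrative assumptions, not Opsview's actual schema:

```python
import json
import time

def make_result_message(host, check, status, output):
    """Build a hypothetical check-result envelope for the message bus.

    Field names here are illustrative only; the real Opsview Monitor 6
    message format is not shown in this post and will differ.
    """
    return json.dumps({
        "host": host,
        "check": check,
        "status": status,  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
        "output": output,
        "timestamp": int(time.time()),
    })

# A result like this would be published to RabbitMQ and consumed by a
# Results Processing worker, which decodes it and updates state.
msg = make_result_message("web-01", "http_response", 0, "HTTP OK: 200 in 0.12s")
decoded = json.loads(msg)
```

Because the envelope is self-describing JSON, any number of identical worker processes can consume from the same queue, which is what makes each microservice independently scalable.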
The flexibility of this approach lets us deploy Opsview Monitor 6 on a single machine with minimal hardware requirements. It also lets us scale different parts of the system depending on the particular needs of the customer: the number of hosts or devices they need to monitor, and the complexity of their monitoring requirements.
For more information on the detailed architecture of Opsview Monitor 6, please read Neil Ferguson's blog post on the subject.
Many variables can affect a system’s ability to scale. For monitoring, the key is to quickly and efficiently process the results of service checks so that results can be displayed in the user interface or alerts can be sent promptly, if needed.
To benchmark our new architecture, we wanted to test Opsview Monitor 6’s event processing capacity – a challenge, since our test lab is limited to a few tens of hosts rather than many tens of thousands. The first thing that we needed to do was create a scalable simulator that could be deployed many times in parallel to generate this kind of massive load. We then had to design a deployment of Opsview Monitor 6 that we believed would be able to handle the load we were about to throw at it.
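The shape of such a simulator can be sketched briefly. The version below is a hypothetical illustration, not Opsview's actual tool: it cycles over a simulated estate of hosts and checks, emitting synthetic results with a skewed status distribution, and many copies of it could be run in parallel to build up load:

```python
import itertools
import random

def simulate_results(num_hosts, checks_per_host, batch):
    """Yield `batch` synthetic check results, cycling over a simulated estate.

    An illustrative sketch only; the real load generator used in the
    benchmark is not public and will differ in its details.
    """
    estate = itertools.cycle(
        (h, c) for h in range(num_hosts) for c in range(checks_per_host)
    )
    for _ in range(batch):
        host, check = next(estate)
        yield {
            "host": f"host-{host:05d}",
            "check": f"check-{check}",
            # Mostly OK, occasionally WARNING/CRITICAL, like a real estate
            "status": random.choices([0, 1, 2], weights=[90, 7, 3])[0],
        }

# Each simulator instance would publish a stream like this to the system
# under test; running 20 instances per machine multiplies the rate.
sample = list(simulate_results(num_hosts=50_000, checks_per_host=20, batch=5))
```

Because each instance is independent and stateless, the aggregate load scales linearly with the number of instances deployed, which is exactly the property needed to push a system well past a test lab's physical host count.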
We started with a relatively straightforward setup: separating out the main functional components onto their own machines, with the intention of adding machines to support each function as required. This meant using six separate machines to host the Orchestrator, Database, Results Processing (2 machines), Message Bus, and Datastore functions. The time-series database (InfluxDB) was sharded over 8 individual machines to overcome metric storage limits in InfluxDB. In addition, we had two separate machines running multiple instances (up to 20 on each machine) of Collectors to generate the load.
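Sharding a time-series store typically means routing each host's metrics to one of the shard machines deterministically, so that reads and writes for a given host always land on the same node. A minimal sketch of one common approach, hash-based routing, is below; this is an assumption for illustration, not Opsview's actual sharding scheme:

```python
import hashlib

# Hypothetical shard names for the 8 InfluxDB machines
SHARDS = [f"influxdb-{i:02d}" for i in range(8)]

def shard_for_host(hostname, shards=SHARDS):
    """Pick a shard deterministically by hashing the host name.

    Illustrative only: any stable hash works, as long as the same host
    always maps to the same shard so its series stay together.
    """
    digest = hashlib.md5(hostname.encode()).digest()
    return shards[int.from_bytes(digest[:4], "big") % len(shards)]
```

The trade-off of this simple scheme is that adding a ninth shard would remap most hosts; schemes like consistent hashing reduce that churn, at the cost of extra complexity.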
Figure 1. A six-server deployment of Opsview Monitor 6, used in benchmarking.
Actual testing took place over several days of gradually increasing the load and monitoring the impact on overall system performance. Along the way, we encountered a few database and process-timing conflicts that required small tweaks. Overall, however, the changes required were relatively minor.
Our original design goal was to be able to process 1,000 results per second, more than three times the throughput of our largest customer, which was processing approximately 300 results per second. Opsview Monitor 6 ended up exceeding expectations: the initial six-server setup outlined above easily handled 3,200 results per second, equating to 50,000 hosts and one million individual service checks, all without requiring additional machine resources or showing signs of stress.
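A quick sanity check shows how the results-per-second figure and the service-check count relate. Assuming a 5-minute average check interval, which is a common monitoring default but an assumption here rather than a figure stated in the benchmark, one million checks produce a steady-state rate close to the measured throughput:

```python
# Steady-state result rate implied by the estate size, assuming every
# check runs once per interval. The 5-minute interval is an assumption.
service_checks = 1_000_000
interval_s = 300  # 5 minutes

steady_state_rps = service_checks / interval_s  # ~3,333 results/second
```

That back-of-the-envelope rate of roughly 3,333 results per second is in the same range as the 3,200 results per second the six-server deployment sustained.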
Having smashed the original design goal and achieved ten times the scale currently required by our largest customers, we intend to keep tuning individual components and working out the most efficient physical implementations to meet each new set of customer requirements. We'll be experimenting with platforms as well: comparing the price and performance achievable by scaling up platform specifications (more CPUs, more RAM, etc.) versus using larger numbers of commodity machines.