Scalability: Ensuring Your Monitoring Keeps Up With Your Infrastructure

Overview

When choosing an enterprise monitoring tool, there are many considerations, and scalability is almost always near the top of the list. Picking a tool that does everything you need is critical, but it is just as important that it does not grind to a halt as you expand your monitoring environment.

You can keep deploying more and more standalone monitoring systems so that no single system reaches its limits, but this quickly becomes hard to administer and adds cost in both licensing and hardware. Distributed monitoring offers a solution that keeps your systems easy to administer while still scaling to meet your needs.

As well as allowing your monitoring to scale, distributed monitoring ensures that you can deliver consistent monitoring across multiple sites. It is common to have monitoring responsibility for several locations, and distributed monitoring means that you avoid sending all of your monitoring traffic over the wide area links between those sites.

Distributed Monitoring

What is Distributed Monitoring?

Distributed monitoring in Opsview means having a single “Orchestrator” system that collates all of the monitoring data. The Orchestrator is supported by a network of monitoring systems known as “Collectors”, which run the checks.

Distributed monitoring has two key advantages. First, load is taken off the Orchestrator system, which you use for day-to-day administration and rely on for critical alerts and reports. Second, you can place your Collector systems in the locations that best suit your monitoring needs.

Single Point of Management

Managing enterprise-scale infrastructure can be challenging. For monitoring, having a single point of management is a real advantage. Opsview's distributed architecture provides all the advantages of running the checks at the right places within your IT infrastructure while ensuring that all of the results are handed back to a central system for management. This means that all notifications can be administered from a single location, and all reports can be generated on the same system with all of the data they require. When distributed monitoring is used in conjunction with the Business Service Module, services that consist of a variety of systems spanning multiple physical locations are still handled correctly.

Distributing the Monitoring System Load

When you are monitoring, each individual check is putting some load on the monitoring system. Optimizing these checks will reduce this load, but there will always be a limit to how many checks a single system can handle. Distributing the load across multiple systems is a proven method of ensuring that those limits are not reached.

Most monitoring systems are designed to be as efficient as possible, and Opsview provides detailed guidance on the load considerations for deployments. However, when monitoring large-scale enterprise infrastructure, a single system will eventually be unable to cope, even with modern hardware. Through calculation or measurement we can evaluate the load on a system and, before it becomes a problem, add an extra Collector system to take any future monitoring load.

The following example calculations show how to work out your total Service Checks Per Second. Service Checks Per Second is the key factor when assessing performance, and we recommend a maximum of around 20-25 checks per second per Collector.

Let us look at an example. Say we have 2000 hosts, with 10 checks per host, using a 5-minute interval.
2000 (hosts) * 10 (service checks) / 300 (seconds) = 66 service checks per second (rounded down from 66.7)

This is over our figure of 20-25 checks per second, so we would need to spread the load across 3 Collectors to bring each one down to a rate of about 22 checks per second.

Remembering that each core can handle a separate worker thread, we can also divide our figure of 66 by the total number of cores in our Collector servers. For example, if our Collector servers have 3 dual-core CPUs between them, this brings the number of service checks per second on each core to 11.

66 service checks per second / 6 (number of cores in 3 dual-core CPUs) = 11 service checks per second per core.
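
The arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only, not an Opsview tool: the function names are made up for the example, and using 25 checks per second (the upper end of the recommended range) as the per-Collector ceiling is an assumption. The exact rate is 66.7, which the worked example rounds down to 66.

```python
import math


def checks_per_second(hosts: int, checks_per_host: int, interval_seconds: int) -> float:
    """Total service checks per second generated by the monitored estate."""
    return hosts * checks_per_host / interval_seconds


def collectors_needed(rate: float, max_rate_per_collector: float) -> int:
    """Smallest number of Collectors that keeps each one at or under the ceiling."""
    return math.ceil(rate / max_rate_per_collector)


# Figures from the worked example: 2000 hosts, 10 checks each, 5-minute interval.
rate = checks_per_second(hosts=2000, checks_per_host=10, interval_seconds=300)

# Assumed ceiling: the top of the recommended 20-25 checks per second range.
collectors = collectors_needed(rate, max_rate_per_collector=25)

print(f"Total rate: {rate:.1f} checks/s")                      # 66.7
print(f"Collectors needed: {collectors}")                      # 3
print(f"Per Collector: {rate / collectors:.1f} checks/s")      # 22.2
print(f"Per core (6 cores): {rate / 6:.1f} checks/s")          # 11.1
```

Choosing the lower end of the range (20 checks per second) as the ceiling would give 4 Collectors instead of 3, so it is worth deciding how much headroom you want before sizing a deployment.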

Distributing Monitoring Load on Your Infrastructure

It is an all-too-common story: a badly designed monitoring system, instead of providing valuable insight into the status of systems, puts unnecessary load on the very infrastructure it is meant to monitor and ends up causing more problems than it solves. Monitoring traffic sent over networks can quickly add up, and as problems occur even more data can be generated, so that one issue ends up causing many. Distributed Collectors give you the flexibility to design a monitoring system that fits the needs of your infrastructure.

A common use of Collectors is to optimize the flow of monitoring traffic between datacenters. Networking within a datacenter is likely to have an abundance of spare bandwidth, whereas a tunnel from one datacenter to another is much more likely to be a bottleneck. Placing a Collector in each datacenter means that checks are run only over the local network and just the results are passed over the tunnels between datacenters, greatly reducing the bandwidth used for monitoring.

Handling Network Outages

A networking problem between datacenters can present a problem for monitoring systems. It commonly results in a system being incorrectly reported as down, rather than in the detection of a fault on the network over which the monitoring is run. A distributed architecture has the advantage of providing resilience to issues in the interconnectivity between sites. Should the network between the Opsview Orchestrator and an Opsview Collector go down, the Collector will begin to buffer its data for collection by the Orchestrator when normal networking returns.
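
The buffering behaviour can be illustrated with a small Python sketch. This is not Opsview's implementation: the `BufferingCollector` class, the `send` callback and the `link` flag are all hypothetical names invented for the example, and the direction of transfer is simplified to a push from the Collector. It only shows the general idea of holding results in order while the link is down and delivering them once it returns.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class BufferingCollector:
    """Hypothetical Collector that queues results while the link is down."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send                      # returns True if delivery succeeded
        self._pending: Deque[str] = deque()    # oldest result on the left

    def report(self, result: str) -> None:
        """Queue a new check result, then try to deliver everything queued."""
        self._pending.append(result)
        self.flush()

    def flush(self) -> None:
        """Deliver queued results in order, stopping at the first failure."""
        while self._pending:
            if not self._send(self._pending[0]):
                return                         # link still down; keep the backlog
            self._pending.popleft()

    @property
    def pending(self) -> int:
        return len(self._pending)


# Stand-ins for the Orchestrator and the inter-site link (assumptions).
received: List[str] = []
link: Dict[str, bool] = {"up": False}


def send(result: str) -> bool:
    if link["up"]:
        received.append(result)
    return link["up"]


collector = BufferingCollector(send)
collector.report("web01 HTTP OK")       # link down: buffered
collector.report("db01 MySQL OK")       # link down: buffered
link["up"] = True
collector.report("web01 HTTP OK")       # link back: backlog delivered in order
print(received)
```

Because results are delivered oldest first, the central system sees the same sequence of check results it would have seen without the outage, just later.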

Configuration for Security

Being able to choose by design where you locate your monitoring systems can also significantly reduce the effort required to work around infrastructure security. Firewalls are a common hurdle when deploying monitoring, because different checks use different ports and transports. With centralized monitoring, each of these must be taken into consideration and the firewall opened up accordingly. In some circumstances opening the firewall may not even be an option. With distributed monitoring you can carefully design where checks are run from: the Collector can be located behind the firewall and then pass results back to the Orchestrator.

In Opsview, communication between the Orchestrator and the Collector is secured using a Secure Shell (SSH) tunnel. The connection can be configured to run from the Orchestrator to the Collector or from the Collector to the Orchestrator, making it easier to fit your network security policy.

Ensuring Availability

Clustering Collectors

It is common to implement redundancy in any monitoring system. Opsview has extensive support for both high availability and disaster recovery for the Orchestrator system, so why not apply this approach to the distributed Collector systems as well?

Opsview Collectors fully support clustering, allowing you to deploy two or more Collector systems together. Should one go down for whatever reason, the others will automatically take over its monitoring.
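
As a rough illustration of the takeover idea, the Python sketch below spreads monitored hosts across whichever Collectors in a cluster are currently healthy. It is a simplified model under stated assumptions, not a description of how Opsview assigns work: the `assign_hosts` function, the round-robin strategy and the host and Collector names are all hypothetical.

```python
from typing import Dict, List


def assign_hosts(hosts: List[str], collectors: Dict[str, bool]) -> Dict[str, List[str]]:
    """Spread hosts round-robin across the Collectors currently marked healthy."""
    healthy = sorted(name for name, is_up in collectors.items() if is_up)
    if not healthy:
        raise RuntimeError("no healthy Collector available in the cluster")
    assignment: Dict[str, List[str]] = {name: [] for name in healthy}
    for index, host in enumerate(hosts):
        assignment[healthy[index % len(healthy)]].append(host)
    return assignment


hosts = ["web01", "web02", "db01", "db02"]

# Both Collectors healthy: the hosts are shared between them.
print(assign_hosts(hosts, {"collector-1": True, "collector-2": True}))

# collector-1 goes down: collector-2 takes over every host.
print(assign_hosts(hosts, {"collector-1": False, "collector-2": True}))
```

Note that after a takeover the surviving Collector carries the whole cluster's load, so the checks-per-second ceiling discussed earlier should be sized with a failure in mind.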

Conclusion

Opsview's distributed architecture is an incredibly flexible way to ensure that your monitoring system keeps up with your infrastructure. So take the headache out of monitoring security and deployment across sites and try Opsview Collectors. You can find information on how to set up an Opsview Collector in the Opsview Knowledge Center.