Opsview-proxy

9 replies
awijntje
Certified Opsview Administrator, Opsview Enterprise Subscriber, Opsview Sensei - 5th Dan, Opsview Sensei - 4th Dan, Opsview Sensei - 3rd Dan, Opsview Sensei - 2nd Dan, Opsview Sensei - 1st Dan
Joined: 27 Jun 2010
Posts:
Points: 9070

Hey guys,

Sorry to put in another feature request (especially one as long and complex as this one), but as it solves some issues I'm having (and I expect more people will run into them), I thought I'd submit it anyway.

First off, I'll describe what kind of monitoring we want and how we have currently set up our system.

Then I'll describe how I think Opsview could solve these issues by creating what I call an opsview-proxy.

The issue:

We need to monitor multihomed servers with a private and a public interface.

On the private interface we can connect with NRPE from our private opsview-slave.

The public interface can (for security reasons) only be monitored from our public opsview-slave.

Now let's say we're monitoring an SMTP server.

So we monitor smtpd/sendmail/exim/postfix through the check_procs plugin over NRPE (from our private opsview-slave).

We run check_smtp against the public interface (from our public opsview-slave).

But we also need the host-dependency between check_procs and check_smtp (there's no sense in running check_smtp if check_procs tells me smtpd is not running).

In essence what we want is to monitor a single host from different opsview-slaves.

For now we've implemented the setup as seen in current_setup.

In short we have our private-opsview-slave use NRPE to run the check_smtp from the public opsview slave.

Now this may work, but it has a lot of drawbacks.

1. If the link from the private to the public opsview-slave fails, all checks on our public network will fail (giving false positives).

2. Any reports can be considered tainted, as they do not show the actual situation of the service.

3. Higher load on our private opsview-slave, as it has additional checks.

4. No load balancing or redundancy on our public opsview-slave.

5. Incorrect performance info on our public opsview-slave.

6. Service checks become very complex and hard to maintain.

7. The Nagios running on the public opsview-slave has no knowledge of the checks it's running.

 

So how could Opsview solve this problem?

Well, I've been thinking about this and I think I've sorted out most of the issues on how this should work.

The opsview-proxy is in essence just a slave (or slave cluster) with all the benefits of a regular slave/cluster.

The difference is in the way it's configured.

The proxy should have in its configuration all the checks that are run against the public interface, including passive checks for any host-dependencies (more on this later).

So how do we achieve this configuration and how do we handle host-dependencies...

First off, adding proxies would work similarly to adding slaves and slave clusters (maybe an extra checkbox to define the slave/cluster as a proxy).

Then there should be a checkbox (or similar) in the service check that enables the use of a proxy for that specific check.

In the host configuration it should be possible to select multiple proxies (in the same way you can select parents).

Then, when nagconfgen is run, it should generate the following configuration for the proxy.

Any service check that is proxy-enabled should go to the opsview-proxy (when multiple proxies are selected, it should go to these as well).

Any host-dependencies should generate passive checks on the opsview-proxy, to which the private opsview-slave (or opsview-master) can send the results of the check we depend on.

In our example the proxy actively runs check_smtp and passively receives results for check_procs (this is to prevent check_smtp from running when check_procs is critical).

See the opsview-proxy picture.
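
The proposed split can be sketched roughly as follows. This is Python for illustration only, not nagconfgen's own code; `Check`, `build_configs`, the `proxy` flag and `depends_on` are made-up names, and it assumes every proxy-enabled check goes to each selected proxy together with a passive copy of the check it depends on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    name: str
    proxy: bool = False               # hypothetical "use a proxy" checkbox
    depends_on: Optional[str] = None  # name of the check this one depends on


def build_configs(checks, monitored_by, proxies):
    """Return {server: [(check_name, "active" | "passive"), ...]}."""
    configs = {server: [] for server in [monitored_by, *proxies]}
    for check in checks:
        if not check.proxy:
            # Normal checks stay on the regular monitoring server.
            configs[monitored_by].append((check.name, "active"))
            continue
        for proxy in proxies:
            # The proxy gets a passive copy of the dependency so it can
            # receive results sent by the private slave or the master.
            passive = (check.depends_on, "passive")
            if check.depends_on and passive not in configs[proxy]:
                configs[proxy].append(passive)
            configs[proxy].append((check.name, "active"))
    return configs


checks = [
    Check("check_procs"),
    Check("check_smtp", proxy=True, depends_on="check_procs"),
]
configs = build_configs(checks, "private-slave", ["public-proxy"])
for server, entries in configs.items():
    print(server, entries)
```

With the SMTP example, the private slave keeps check_procs as an active check, while the proxy gets check_smtp as an active check plus a passive check_procs.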

 

Hopefully I was able to convey my thoughts here correctly (and in a way people can understand).

It is my opinion that this could be a major feature for Opsview (and as far as I can tell no other monitoring platform offers it).

And I'm very interested in hearing other people's thoughts on the matter.

Sincerely,

Alan

tonvoon
Opsview Sensei - 1st Dan
Joined: 26 May 2010
Posts:
Points: 65

I can't quite work out what your proposal is without a lot more thought.

So a quick idea to throw in: what if slave clusters were not flat (all "owned" by the master)? What if there was a hierarchy, and results flowed up through "higher up" slaves before reaching the master?

In your scenario, have 1 master with 1 private slave cluster which has 1 public slave cluster.

The other idea is that we have "hosts" with multiple interfaces. You run checks against a "management interface" or the "public interface (default)". This means that Nagios still considers a host as the dependency, but you can run checks from different directions.

awijntje

Hey Ton,

Thanks for the reply. I'll try to explain it more clearly with an example (I like the hierarchy idea though!).

Your second idea actually already exists (and we use it with great success, as I'll describe here).

When we add a host to opsview we add the following information.

Primary hostname/IP: mailserverA.company

HostTitle: mailserverA.company

Other addresses: mailserverA.internal,mailserverA.company

monitoring server = slave1

host-attribute VIA = slave2
 

We then use $ADDRESS1$ to run host-based checks with NRPE.

And we use $ADDRESS2$ to run our service checks.

So we have 2 checks.

1. check_nrpe -H $ADDRESS1$ -c check_procs -a '-c 2:2046 -C exim'

2. check_nrpe -H %VIA% -c check_smtp -a '-H $ADDRESS2$ -4 -w 2 -c 5'

where check2 is dependent on check1.

Now our mailserverA.internal is reachable from our slave1, but due to firewall restrictions our slave1 cannot reach mailserverA.company ($ADDRESS2$).

So instead of running check_smtp from our slave1 (as you would normally do), we use slave2, which does have access to mailserverA.company, to proxy our check2.

The check_nrpe -H %VIA% is actually telling slave1 to connect to the NRPE daemon on our slave2 and have slave2 execute the check_smtp plugin with destination mailserverA.company.

Now this works just fine, but has the drawback that I have no load balancing/redundancy on slave2 (as slave2 is not even aware it's running check2).

So if I could generate one config for slave1 (with check1) and one config for slave2 (with check2 and a passive check1), and then have slave1 inform slave2 of the status of check1 (much like slaves in a cluster sync states), this would keep my dependency intact (and even allow slave1 and slave2 to become slave1-cluster and slave2-cluster if needed, with full redundancy and failover).
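
To make the substitution concrete, here is a small Python illustration of what the two commands look like once the placeholders are filled in. It is only a string-replacement sketch, not how Opsview or Nagios expands macros internally; `expand` is a made-up helper, and it assumes $ADDRESS1$ and $ADDRESS2$ map to the first and second entries of "Other addresses".

```python
def expand(command, addresses, attributes):
    """Fill in $ADDRESSn$ and %NAME% placeholders in a check command."""
    for index, address in enumerate(addresses, start=1):
        command = command.replace(f"$ADDRESS{index}$", address)
    for name, value in attributes.items():
        command = command.replace(f"%{name}%", value)
    return command


# Values taken from the host definition above.
other_addresses = ["mailserverA.internal", "mailserverA.company"]
attributes = {"VIA": "slave2"}

check1 = expand(
    "check_nrpe -H $ADDRESS1$ -c check_procs -a '-c 2:2046 -C exim'",
    other_addresses, attributes)
check2 = expand(
    "check_nrpe -H %VIA% -c check_smtp -a '-H $ADDRESS2$ -4 -w 2 -c 5'",
    other_addresses, attributes)
print(check1)
print(check2)
```

Both commands are started by slave1, but check2 connects to NRPE on slave2, which in turn runs check_smtp against mailserverA.company.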
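
One way slave1 could hand the state of check1 to slave2 is a passive check result. The sketch below only formats the Nagios external command line (PROCESS_SERVICE_CHECK_RESULT); how that line would travel from slave1 to slave2's command file is left open, and the helper name and the sample plugin output are made up for illustration.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    if timestamp is None:
        timestamp = int(time.time())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")


# Example: slave1 saw check1 go CRITICAL (return code 2) and reports it.
line = passive_result_line(
    "mailserverA.company", "check1", 2,
    "PROCS CRITICAL: 0 processes with command name 'exim'",
    timestamp=1300000000,
)
print(line)
```

On slave2, check1 would then exist only as a passive service that receives these results, while check2 stays an active check that depends on it.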

 

I hope this makes things a bit clearer. If not, don't hesitate to let me know and I'll gladly try to explain it more clearly.

Sincerely, Alan

awijntje

So I've been thinking about more ways to explain my feature request.

Instead I've come up with a challenge for the community....

Here's the challenge..

First off, when monitoring a service we provide to our customers, we need to consider these points.

1. How does the customer interact with our service?

2. Which application provides this customer service, and how do I monitor that?

3. What is the relationship between 1 and 2?

4. From a Nagios point of view, how do we describe this relationship?

5. How do I make sure that failures in my monitoring system do not impact my reporting/monitoring?

So let's say I have the website www.opsview.com, which runs on opsview.webserver.co.uk.

So let's answer our 5 questions.

1. The customer uses HTTP to access our site, which means we should run check_http against www.opsview.com to verify that we are offering content (note that we would also like to take the same network path as our customers, as this also tells us if our internet connection is working).

2. Our application is apache2, so we need check_procs to check for running Apache processes on opsview.webserver.co.uk.

3. If our application is not running, then running check_http becomes redundant, so check_http depends on apache.

4. Nagios can only support this dependency when configured in a single host (as check_http needs to know the status of apache), so our checks should be configured in 1 host.

5. We need failover/redundancy for all our checks (so our slaves are always clusters).

Now let's assume we have the following constraints.

We have 2 slave clusters.

slave-cluster1 is in our own datacenter and has access to the internal network and thus to opsview.webserver.co.uk (but not to the internet).

slave-cluster2 is in another datacenter (colocation) and has access to the entire internet (and is reachable by ssh from our master) but does not have access to our internal network.

The opsview-agent is only allowed to listen on the internal network (for security reasons).

So how do we monitor our website within these constraints but still comply with our 5 points?

Now we could add 2 hosts in Nagios, but then we don't comply with points 3 and 4.

We could add 1 host and use the VIA check (as described in my previous posting). This would comply with the first 4 points but not with point 5 (if the slave mentioned in the VIA fails, then even though it's a slave cluster, the remaining slave will not take over the check, as it has no knowledge of the check).

 

So here's the real challenge for the community...

Within the given constraints and setup, and with the 5 points we need to comply with, how would you monitor our service so we can report the following information to our managers?

The availability of www.opsview.com, where we also have to be able to explain why our site was unavailable (hardware/application failure vs. network failures).

Hope everyone is up for this one...

I'm looking forward to some interesting solutions.

Sincerely, Alan

tonvoon

Unstructured answer coming up!

Re: (4). The "same host dependency" is actually an Opsview limitation. We couldn't see a good UI mechanism that "makes sense" for cross-host dependencies.

Re: (3). It shouldn't matter. If you are running check_procs, which complains that apache is missing, then just leave check_http to run and fail. See below for how to reduce your alerts.

Re: (1). You want this "user experience" to be as close to the user's as possible. check_http is good, but Nagios::Plugin::WWW::Mechanize is better (http://search.cpan.org/dist/Nagios-Plugin-WWW-Mechanize/). We write custom plugins that interact with the web site to get a much better approximation of user experience.

You may want a collection of those. I would put those all into a keyword like, say, "opsera-com-user". This is what you report against, because this is the user experience. It doesn't matter if 2 of the 3 web clusters have failed, as long as the user experience stays the same.

Then have a keyword which consists of all components, say "opsview-com-components". This includes your apache process check + anything else you want monitored for that business line. At this point, you can either create a notification profile for this keyword (and get alerts about each individual service in this keyword), or you could create a new service using check_opsview_keyword, which takes the highest level alert within the list of services in this keyword (e.g., it's effectively the same as the viewport summary for that keyword), and get notified on that service only.

So what happens is you don't bother defining all the dependencies between all the different bits of software, but instead have two buckets for each business line: a user experience bucket and a bucket of components that impact that business. So these two viewports should then tell you (A) your users have a problem or (B) your technical staff have some things to look into.
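
The "highest level alert" idea behind a keyword summary can be sketched like this. It is a Python illustration only, not the actual check_opsview_keyword plugin; the severity ordering (CRITICAL above WARNING above UNKNOWN above OK), the empty-keyword behaviour and the service names are assumptions.

```python
# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed ranking for "highest level alert"; the real plugin may differ.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def keyword_state(service_states):
    """Return the most severe state among the services in one keyword."""
    if not service_states:
        return UNKNOWN  # assumption: an empty keyword is reported as UNKNOWN
    return max(service_states.values(), key=SEVERITY.__getitem__)


# Two buckets for one business line (service names are made up).
components = {"apache procs": CRITICAL, "disk": OK, "load": WARNING}
user_experience = {"website login": OK, "website search": OK}

print("components:", keyword_state(components))
print("user experience:", keyword_state(user_experience))
```

Here the components bucket is CRITICAL while the user experience bucket is still OK, which is the "technical staff have some things to look into" case.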

tonvoon

And!

Keywords are not slave-dependent, so you get a cross-site view of your businesses. It shouldn't matter where the hosts and services are actually run - you just need that central view.

I forget how good Opsview can be....

awijntje

Hey Ton,

Hahaha, well, that just shows I still need to learn a lot about what Opsview can do...

I like your approach here and I'll see if and how I can apply this to our situation.

One drawback, as far as I can tell, is that you still need to add a single host as 2 hosts (one for the customer and one for the technical guys).

And I would really like to prevent this doubling of hosts in our system.

Just to be on the safe side, I'm going to see if I can achieve our goals using your advice and whether that resolves my issue (and if not, I'll get back to you on that).

Sincerely,

Alan

awijntje

Hey guys...

Over 40 views and no one has any thoughts on this (well, except for Ton)?

Maybe I need to add a prize or something?

I actually expected that the questions I posted are something most of us would have run into (or will run into sooner or later).

Maybe we should start a separate forum where we can discuss the concepts of monitoring (and less the technical implementations)?

Sincerely,

Alan

awijntje

Hey Ton,

I've just been thinking about what you said (and if I was UK-based I would probably have visited you to give you a big kiss, hahaha).

I was so hung up on getting check_http dependent on apache, while, as you pointed out, it doesn't really matter (well, it does, but not in the way I proposed).

All I now need to figure out is how to make this work from within 1 host.

I'll have a go at hacking some code for this.

All it needs to do is take some checks and place them on the selected monitoring slave (the one selected in the host config).

And take the other checks and place them on the secondary/proxy monitoring slave.

This should then enable us to still configure 1 host but have multiple slaves/clusters monitoring it.

Aah, this is why I love open source: you always learn something new, and if it doesn't meet your needs you just build it.

 

Thanks for all your help!!!

Alan

awijntje

Hey guys,

I hope someone here can help me.

I just figured out what I need to change to at least be able to make a proof of concept.

I want to change 3 things; the problem is that I'm not sure which scripts or files I need to modify (I've been looking in various files but haven't figured it out yet).

What I want to change is:

1. Change the UI for the service checks to add a tickbox which marks the check as a proxy check (updating the DB is implied).

2. Change the host configuration UI so that next to "monitored by" there's another pulldown menu to select the proxy server (again, DB modification is implied).

3. Change nagconfgen so that when generating the config for the "monitored by" server it skips the checks marked as proxy, and when generating the config for the proxy it only adds the checks marked as proxy.

For the first two I have no idea which file I should look at or modify.

For nagconfgen I'm not quite sure how the code works: does it make the config per slave (it looks like it)?

Hope someone can help.

Sincerely,
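
The selection rule in point 3 could look roughly like this. It is a Python sketch of the logic only, not nagconfgen's actual code; the dictionary layout, the `proxy` fields and `checks_for_server` are all assumptions made for illustration, and it assumes the config is generated once per server.

```python
def checks_for_server(server, hosts):
    """Pick the (host, check) pairs one server should run."""
    selected = []
    for host in hosts:
        for check in host["checks"]:
            if check["proxy"]:
                # Proxy checks only go to the host's selected proxy server.
                target = host.get("proxy")
            else:
                # Everything else stays on the "monitored by" server.
                target = host["monitored_by"]
            if target == server:
                selected.append((host["name"], check["name"]))
    return selected


hosts = [{
    "name": "mailserverA.company",
    "monitored_by": "slave1",
    "proxy": "slave2",  # hypothetical new host field from point 2
    "checks": [
        {"name": "check1", "proxy": False},
        {"name": "check2", "proxy": True},  # hypothetical tickbox from point 1
    ],
}]

slave1_checks = checks_for_server("slave1", hosts)
slave2_checks = checks_for_server("slave2", hosts)
print("slave1:", slave1_checks)
print("slave2:", slave2_checks)
```

Note that in this sketch a check marked as proxy on a host without a proxy server selected ends up on no server at all, so a real change would have to decide how to handle that case.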

Alan
