remote check inconsistent

raymondo
remote check inconsistent

Hi All,

I'm pulling my hair out here.

I'm running Opsview-core and calling check_by_ssh, which executes a script on the remote Linux server that checks whether a database process is running.

It all works fine for a few minutes and returns the correct result, then I suddenly start getting "CRITICAL - Plugin timed out after 60 seconds" and "(Return code of 255 is out of bounds)" messages for no apparent reason.

I was originally using check_nrpe but had similar timeout issues, which I thought might have been caused by too many calls to nrpe, but that seems not to be the case.

I have between 300 and 350 of these checks taking place per server. I have increased the sshd_config MaxSessions value to 1000, but I'm still getting these errors at random.

Any assistance will really be appreciated.

Ray

smarsh
Re: remote check inconsistent

300 checks per server? Wow, that is a LOT. How many hosts do you have with 300 checks? To put this in perspective, Opsview recommends no more than 20-25 concurrent service checks at any time - as per the guideline here - http://docs.opsview.com/doku.php?id=opsview4.4:designing-system#service_checks_per_second

If you are doing more than 30-35 concurrently you need to scale this out. As you are using core, you cannot use slaves to do this - so you'll have to have multiple instances of core to distribute the workload.

Sam.

raymondo
Re: remote check inconsistent

I seem to have found a solution.

I have started 20 nrpe client daemons listening on different ports on the server to be monitored, and then split my service checks across those ports.
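
For anyone wanting to try the same approach, here is a rough sketch of how the split across ports could be worked out. It is only an illustration: the port range (5666, the NRPE default, upwards), the host name "dbhost" and the check names are all assumptions, not the actual setup described above.

```python
from collections import defaultdict

BASE_PORT = 5666   # NRPE's default port; using a contiguous range from here is an assumption
NUM_DAEMONS = 20   # number of NRPE daemons, as in the post above


def assign_ports(check_names, base_port=BASE_PORT, num_daemons=NUM_DAEMONS):
    """Assign each check to a port round-robin so the load is spread evenly."""
    return {
        name: base_port + (i % num_daemons)
        for i, name in enumerate(check_names)
    }


def checks_per_port(assignment):
    """Count how many checks ended up on each port."""
    counts = defaultdict(int)
    for port in assignment.values():
        counts[port] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical check names; substitute the real NRPE command names.
    checks = [f"check_db_proc_{n:03d}" for n in range(320)]
    assignment = assign_ports(checks)

    # Print the check_nrpe command line for the first few checks.
    for name in checks[:3]:
        print(f"check_nrpe -H dbhost -p {assignment[name]} -c {name}")

    print("max checks on any one port:", max(checks_per_port(assignment).values()))
```

With 320 checks and 20 daemons, each port ends up handling 16 checks, which is under the 20-25 concurrent checks mentioned earlier in the thread.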