How to differentiate between a "down" caused by a monitoring error and a "down" caused by a target device problem

My admins complain because they sometimes receive notifications about a "down" device that turns out to be a monitoring problem, e.g., an authentication failure on the target device, WMI errors, or locked SSH accounts used for monitoring on the target device.

How can I distinguish between a down due to a "real" issue on the target device, which should be looked into by the device/service admin, and a down due to some monitoring issue, which should be solved by our monitoring team?

Article Comments

Hello there rbarnes.

Unfortunately there's no simple answer, because a monitoring error may be a symptom of something else. An SNMP error when trying to monitor bandwidth can mean that something is genuinely wrong, since the device isn't responding to SNMP requests. Could a firewall be blocking the traffic? Could the SNMP agent on the device have failed? If so, why? Investigation will always be required when dealing with this kind of information.

When your admins receive an alert, for instance a WMI error on a memory sensor that shows as down, they can quickly open the device's overview within PRTG and compare the down sensor with the other sensors. Having sensors that use multiple protocols (WMI, SNMP, ICMP) allows you to better analyze whether this is a "real issue" or a "monitoring issue".
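To illustrate the comparison described above, here is a minimal Python sketch of the reasoning, not a PRTG feature. It assumes you have already collected the current state of one device's sensors per protocol by some means; the function name `classify_down`, the state strings, and the returned labels are all hypothetical. The result is only a first hint, since a single failing protocol can still point to a real fault on the device.

```python
def classify_down(sensor_states):
    """Give a first hint about a down alert on one device.

    sensor_states maps a protocol name (e.g. "ICMP", "SNMP", "WMI")
    to "up" or "down". Names and labels are illustrative assumptions.
    """
    if not sensor_states:
        return "no data"
    down = {proto for proto, state in sensor_states.items() if state == "down"}
    if not down:
        return "ok"
    if len(down) == len(sensor_states):
        # Every protocol fails: the device itself (or the path to it) is suspect.
        return "likely device issue"
    # Some protocols still answer: credentials, agents or the probe are suspect.
    return "likely monitoring issue"


print(classify_down({"ICMP": "up", "SNMP": "up", "WMI": "down"}))
print(classify_down({"ICMP": "down", "SNMP": "down", "WMI": "down"}))
```

A hint of "likely monitoring issue" could be routed to the monitoring team first, and "likely device issue" to the device/service admin.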

A locked SSH account should also not happen "by itself". It means that something changed or was modified in the environment. If someone intends to modify something in the environment that may trigger alerts, people need to be warned first. It's even possible to pause the respective sensors in PRTG to prevent alerts immediately after the change.
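If you want to script the pause before a planned change, the following Python sketch only builds the request URL and does not send anything. It assumes the PRTG HTTP API endpoint `pauseobjectfor.htm` with the parameters `id`, `pausemsg`, `duration` (in minutes), `username` and `passhash`; please verify these against the API documentation of your PRTG version. The helper name `build_pause_url`, the host name, the object ID and the credentials are made-up placeholders.

```python
from urllib.parse import urlencode


def build_pause_url(base_url, object_id, minutes, message, username, passhash):
    """Build a URL that pauses a PRTG object for a number of minutes.

    Assumption: the endpoint and parameter names below match the PRTG
    HTTP API of your installed version. Check before relying on them.
    """
    params = {
        "id": object_id,
        "pausemsg": message,
        "duration": minutes,
        "username": username,
        "passhash": passhash,
    }
    return f"{base_url.rstrip('/')}/api/pauseobjectfor.htm?{urlencode(params)}"


# Placeholder values for illustration only.
print(build_pause_url("https://prtg.example.com", 1234, 30,
                      "Planned change", "monitor", "0000"))
```

The resulting URL can then be requested with whatever HTTP client your change process already uses.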

Some monitoring protocols are also less error-prone. You should always use SNMP or other sensors with a Low or Very Low performance impact; sensors with a higher performance impact tend to be more error-prone. Keeping your PRTG server setup tidy also helps avoid false alarms.

Best Regards,

Feb, 2016 - Permalink

Thanks for the reply.

In our Solaris 10 environment it is common for PRTG to lock the account the sensors use to log in, and we haven't been able to find out why. After 3 unsuccessful logins our Solaris policy locks the account until an admin unlocks it. I have set up the SSH sensor to write a log on failure, but since the sensor runs every 10 minutes, by the time we check the log it has already been overwritten by a later failed login, so we cannot see the exact event that locked the account.
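One possible workaround for the overwritten log, sketched here in Python under stated assumptions: a small script on the probe machine keeps a timestamped copy of the log file every time it changes, so the first failure is not lost. The function names `archive_if_changed` and `watch` are hypothetical, and the location of the sensor's log file is not given here, so you have to supply the path yourself.

```python
import shutil
import time
from pathlib import Path


def archive_if_changed(log_path, archive_dir, last_mtime=None):
    """Copy log_path into archive_dir if its modification time changed.

    Returns the modification time (in nanoseconds) that was seen, so it
    can be passed back in on the next call. Returns last_mtime unchanged
    if the file is missing or has not been modified.
    """
    log_path = Path(log_path)
    if not log_path.exists():
        return last_mtime
    mtime = log_path.stat().st_mtime_ns
    if mtime == last_mtime:
        return last_mtime
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(mtime / 1e9))
    # The raw mtime in the name keeps copies unique within the same second.
    target = archive_dir / f"{log_path.stem}-{stamp}-{mtime}{log_path.suffix}"
    shutil.copy2(log_path, target)
    return mtime


def watch(log_path, archive_dir, poll_seconds=60):
    """Poll forever; poll more often than the sensor's scanning interval."""
    last = None
    while True:
        last = archive_if_changed(log_path, archive_dir, last)
        time.sleep(poll_seconds)
```

Calling `watch(...)` with the sensor's log file and an archive folder, and a polling interval well below the 10 minute scanning interval, should leave one copy per failed scan to compare against the Solaris login records.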

On WMI it is fairly common to see a sensor report that it is not able to log in. However, the Active Directory account is fine and works well on other servers.

Feb, 2016 - Permalink

Hello rbarnes, thank you for your reply.

We're currently re-working the engine we use for SSH-based sensors in PRTG, so you will most likely see some improvements in future releases.

Regarding your WMI sensors, you should consider switching them to SNMP-based sensors, which are much more reliable/stable. Otherwise, please increase the sensors' scanning interval, make sure that the PRTG probe isn't overloaded, and check our WMI troubleshooting KB article: My WMI sensors don't work. What can I do?

Best Regards,

Feb, 2016 - Permalink
