monitoring - Best way to deal with monitor alert overload and desensitization?
We're in the process of adding monitoring for various servers and processes on our network. Currently, various monitors email the development group if something seems amiss: no customer payments on the website in X minutes, web services or a support process unresponsive, a daily automated FTP to a vendor failed, etc. While some of these are informational and only need to be addressed eventually (tomorrow or Monday is fine, for example), others are critical, the result of actual customer outages, and need to be restored as soon as possible.
The problem is that there are so many emails that people are getting desensitized to them and are beginning to ignore the critical ones. Even though we have a point person who changes each week, we still find that critical alerts sit there, unclaimed and unresponded to, sometimes for hours.
What are other people doing to better address these types of monitor and alert situations? Should we have a dashboard or a daily summary email? For critical things, is a group email still the best way to go? I'm curious to see what others are doing to make sure things are addressed quickly while ensuring developers aren't overwhelmed into inaction.
In RHQ (http://rhq-project.org/) we have dampening of events, meaning that, e.g., an email is only sent for every 5 alerts.
It is also possible to have an alert disable its own sending and to have a 2nd one, called a recovery alert, that (if the error situation goes away) re-enables sending for when the next error situation shows up.
See http://www.rhq-project.org/display/jopr2/alerts for more info.
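To make the two ideas concrete, here is a minimal Python sketch of dampening plus a recovery alert. It is not RHQ's API or configuration: the class `DampenedAlert`, its parameters `every_n` and `disable_after_send`, and the `send` callback are all hypothetical names chosen for illustration, and a real monitor would call `failure()` and `recovery()` from its own check loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DampenedAlert:
    """Hypothetical helper (not part of RHQ) that limits how often an alert notifies."""
    name: str
    send: Callable[[str], None]        # e.g. a function that emails the group
    every_n: int = 5                   # dampening: notify once per this many failures
    disable_after_send: bool = False   # stay silent after notifying, until recovery
    _count: int = field(default=0, init=False)
    _enabled: bool = field(default=True, init=False)

    def failure(self) -> bool:
        """Record one failed check; return True if a notification was sent."""
        self._count += 1
        if not self._enabled or self._count % self.every_n != 0:
            return False
        self.send(f"{self.name}: {self._count} consecutive failures")
        if self.disable_after_send:
            self._enabled = False
        return True

    def recovery(self) -> None:
        """Recovery alert: the error went away, so reset and re-enable sending."""
        self._count = 0
        self._enabled = True


# Usage: collect messages in a list instead of sending real email.
sent: List[str] = []
alert = DampenedAlert("payments", sent.append, every_n=5, disable_after_send=True)
for _ in range(12):      # twelve failed checks in a row
    alert.failure()
alert.recovery()         # the check passes again
for _ in range(5):       # a new error situation shows up
    alert.failure()
```

With `disable_after_send=True`, the first run of failures produces a single message and the second run (after recovery) produces one more, rather than one email per failed check.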