2005-07-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
We are currently in the process of distributing a standard set of Nagios monitoring scripts to over 300 client systems. One of the metrics we would like to monitor is the three load averages (or as Dr. Gunther calls them: the LaLaLa triplets).
Since these 300 servers aren't all alike, we are bound to run into systems with one, two, four, eight or more processors. That means there is no nice way of making one standard configuration, since you would have to define separate LA levels for WARN and CRIT for every processor count. Why? Because a quad-processor system can take much more load than a single-processor system.
One way to get around this would be to define separate host groups, based on the number of processors in a system. You could then define a unique check_load command for each CPU host group.
I've gone the other way around though...
My workaround is to replace check_load with check_load2. This script takes no command line parameters and works on the basis of standard multipliers. We are of the opinion that the number of processors multiplied by a certain factor (150%? 200%? and so on) is a good enough way to define these WARN and CRIT levels. These multipliers can easily be modified (at the top of the script) to fit what -you- think is a worrying level of activity.
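To make the arithmetic concrete, here is a minimal sketch of how a factor turns into a threshold. It is not taken from check_load2 itself: the helper names `threshold` and `load_exceeds` are made up for this example, the CPU count of 4 is an assumed value, and it uses awk for the floating-point math where the script uses bc.

```shell
# Hypothetical helper (not part of check_load2): multiply the CPU count
# by a factor and print the resulting load threshold with two decimals.
threshold() {
    awk -v n="$1" -v f="$2" 'BEGIN { printf "%.2f\n", n * f }'
}

# Hypothetical helper: succeeds (status 0) when the measured load
# has reached or passed the given limit.
load_exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value >= limit) }'
}

NUMPROCS=4    # assumed example value
WARN_LEVEL=$(threshold "$NUMPROCS" 1.50)
CRIT_LEVEL=$(threshold "$NUMPROCS" 2.00)
echo "With $NUMPROCS CPUs: WARN at $WARN_LEVEL, CRIT at $CRIT_LEVEL"
```

With factors of 1.50 and 2.00, a four-processor box would warn at a load of 6.00 and go critical at 8.00, while a single-processor box would do so at 1.50 and 2.00.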
This script was tested on Redhat ES3, Solaris 8 and Mac OS X 10.4. It should run on other versions of these OSes as well.
EDIT:
Oh! Just like my other recent Nagios scripts, check_load2 comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
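The debugging pattern boils down to something like the sketch below. The script itself repeats the test inline on every debug line; the `debug` wrapper function and the default file name used here are assumptions made to keep the example short.

```shell
# Assumed defaults for this sketch; the real script sets these at the top.
DEBUG="${DEBUG:-1}"
DEBUGFILE="${DEBUGFILE:-/tmp/check_load2_debug.$$}"

# Hypothetical wrapper: append the message to the debug file,
# but only when DEBUG is larger than zero.
debug() {
    [ "$DEBUG" -gt 0 ] && echo "$*" >> "$DEBUGFILE"
    return 0
}

debug "Numprocs: Number of processors detected is 4."
```

Setting DEBUG to 0 turns every `debug` call into a no-op, so the messages can stay in the script permanently.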
#!/usr/bin/bash
#
# CPU load monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 22-06-2006
#
# Usage: ./check_load2
#
# Description:
# Ethan's original version of the check_load script is very flexible.
# It allows you to specifically set WARN and CRIT levels regarding
# the CPU load of the system you're monitoring.
# However: flexibility is not always a good thing. Say for example that
# you want to monitor the CPU load across a few hundred of systems having
# various CPU configurations. You -could- define host groups for single, dual
# quad (and so on) processor systems and assign unique check_load command
# definitions to each group.
# Or you could write a script which checks the amount of active CPUs and
# then makes an educated guess at the WARN and CRIT levels for the system.
# In most cases this should really be enough.
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# Depending on the levels defined at the top of the script,
# the script returns an OK, WARN or CRIT to Nagios based on CPU load.
#
# Other notes:
# If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
# I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#
# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm $DEBUGFILE

### REQUISITE NAGIOS COMMAND LINE STUFF ###
print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Semi-intelligent CPU load monitor plugin for Nagios"
    echo ""
    echo "This plugin not developped by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

### SETTING UP THE WARN AND CRIT FACTORS ###
# Please be aware that these are -factors- and not real load average values.
# The numbers below will be multiplied by the amount of processors to come
# to the desired WARN and CRIT levels. Feel free to adjust these factors, if
# you feel the need to tweak them.
WARN_1min="2.00"
WARN_5min="1.50"
WARN_15min="1.50"
[ $DEBUG -gt 0 ] && echo "Factors: warning factors are at $WARN_1min, $WARN_5min, $WARN_15min." >> $DEBUGFILE
CRIT_1min="3.00"
CRIT_5min="2.00"
CRIT_15min="2.00"
[ $DEBUG -gt 0 ] && echo "Factors: critical factors are at $CRIT_1min, $CRIT_5min, $CRIT_15min." >> $DEBUGFILE

### DEFINING SUBROUTINES ###
function gather_procs_linux()
{
    NUMPROCS=`cat /proc/cpuinfo | grep ^processor | wc -l`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_sunos()
{
    NUMPROCS=`/usr/bin/mpstat | grep -v CPU | wc -l`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_darwin()
{
    NUMPROCS=`/usr/bin/hostinfo | grep "Default processor set" | awk '{print $8}'`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_load_linux()
{
    REAL_1min=`cat /proc/loadavg | awk '{print $1}'`
    REAL_5min=`cat /proc/loadavg | awk '{print $2}'`
    REAL_15min=`cat /proc/loadavg | awk '{print $3}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_sunos()
{
    REAL_1min=`w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
    REAL_5min=`w | grep "load average" | awk -F, '{print $5}'`
    REAL_15min=`w | grep "load average" | awk -F, '{print $6}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_darwin()
{
    REAL_1min=`sysctl -n vm.loadavg | awk '{print $1}'`
    REAL_5min=`sysctl -n vm.loadavg | awk '{print $2}'`
    REAL_15min=`sysctl -n vm.loadavg | awk '{print $3}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function check_load()
{
    WARN="0"; CRIT="0"
    [ `echo "if(($NUMPROCS * $WARN_1min) > $REAL_1min) 0; if(($NUMPROCS * $WARN_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_5min) > $REAL_5min) 0; if(($NUMPROCS * $WARN_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_15min) > $REAL_15min) 0; if(($NUMPROCS * $WARN_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ $DEBUG -gt 0 ] && echo "Check_load: warning levels are `echo "$NUMPROCS * $WARN_1min"|bc`, `echo "$NUMPROCS * $WARN_5min"|bc`, `echo "$NUMPROCS * $WARN_15min"|bc`," >> $DEBUGFILE
    [ `echo "if(($NUMPROCS * $CRIT_1min) > $REAL_1min) 0; if(($NUMPROCS * $CRIT_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_5min) > $REAL_5min) 0; if(($NUMPROCS * $CRIT_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_15min) > $REAL_15min) 0; if(($NUMPROCS * $CRIT_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ $DEBUG -gt 0 ] && echo "Check_load: critical levels are `echo "$NUMPROCS * $CRIT_1min"|bc`, `echo "$NUMPROCS * $CRIT_5min"|bc`, `echo "$NUMPROCS * $CRIT_15min"|bc`," >> $DEBUGFILE
    [ $WARN -gt 0 ] && (echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_WARNING)
    [ $CRIT -gt 0 ] && (echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_CRITICAL)
}

### FINALLY, THE MAIN ROUTINE ###
NUMPROCS="0"
case `uname` in
    Linux) gather_procs_linux; gather_load_linux; check_load;;
    Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
    SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
    *) echo "OS not supported by this check."; exit 1;;
esac

# Nothing caused us to exit early, so we're okay.
echo "OK - load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
exit $STATE_OK
kilala.nl tags: nagios, unix, programming,
Posted by Elias P.
We've added support for HP-UX to check_load2.
Regards, Elias P.
This is the patch:
--- /root/remote_install/checks/check_load2 2007-10-16 10:14:27.000000000 +0200
+++ /tmp/check_load2 2007-10-16 09:12:16.000000000 +0200
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/bash
#
# CPU load monitor plugin for Nagios
# Written by Thomas Sluyter (nagios@kilala.nl)
@@ -118,12 +118,6 @@
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}
-function gather_procs_hpux()
-{
- NUMPROCS=`/usr/contrib/bin/machinfo | grep "Number of CPUs" | awk '{print $5}'`
-[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
-}
-
function gather_load_linux()
{
REAL_1min=`cat /proc/loadavg | awk '{print $1}'`
@@ -148,14 +142,6 @@
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}
-function gather_load_hpux()
-{
- REAL_1min=`w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
- REAL_5min=`w | grep "load average" | awk -F, '{print $5}'`
- REAL_15min=`w | grep "load average" | awk -F, '{print $6}'`
-[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
-}
-
function check_load()
{
WARN="0"; CRIT="0"
@@ -182,7 +168,6 @@
Linux) gather_procs_linux; gather_load_linux; check_load;;
Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
- HP-UX) gather_procs_hpux; gather_load_hpux; check_load;;
*) echo "OS not supported by this check."; exit 1;;
esac
Posted by Thomas
Thanks for that Elias! I'll update my script @Nagios Exchange as soon as possible.
Posted by Stefan.S (website)
function gather_procs_irix()
{
NUMPROCS=`/usr/sbin/mpadmin -n | wc -l`
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}
function gather_load_irix()
{
REAL_1min=`/usr/bsd/w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
REAL_5min=`/usr/bsd/w | grep "load average" | awk -F, '{print $5}'`
REAL_15min=`/usr/bsd/w | grep "load average" | awk -F, '{print $6}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}
case `uname` in
Linux) gather_procs_linux; gather_load_linux; check_load;;
Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
IRIX64) gather_procs_irix; gather_load_irix; check_load;;
*) echo "OS not supported by this check."; exit 1;;
esac
And I found one issue. The script will never reach the critical state, because it always exits in the warning state first. All you have to do is swap the critical check and the warning check, like this.
function check_load()
{
WARN="0"; CRIT="0"
[ `echo "if(($NUMPROCS * $WARN_1min) > $REAL_1min) 0; if(($NUMPROCS * $WARN_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
[ `echo "if(($NUMPROCS * $WARN_5min) > $REAL_5min) 0; if(($NUMPROCS * $WARN_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
[ `echo "if(($NUMPROCS * $WARN_15min) > $REAL_15min) 0; if(($NUMPROCS * $WARN_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
[ $DEBUG -gt 0 ] && echo "Check_load: warning levels are `echo "$NUMPROCS * $WARN_1min"|bc`, `echo "$NUMPROCS * $WARN_5min"|bc`, `echo "$NUMPROCS * $WARN_15min"|bc`," >> $DEBUGFILE
[ `echo "if(($NUMPROCS * $CRIT_1min) > $REAL_1min) 0; if(($NUMPROCS * $CRIT_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
[ `echo "if(($NUMPROCS * $CRIT_5min) > $REAL_5min) 0; if(($NUMPROCS * $CRIT_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
[ `echo "if(($NUMPROCS * $CRIT_15min) > $REAL_15min) 0; if(($NUMPROCS * $CRIT_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
[ $DEBUG -gt 0 ] && echo "Check_load: critical levels are `echo "$NUMPROCS * $CRIT_1min"|bc`, `echo "$NUMPROCS * $CRIT_5min"|bc`, `echo "$NUMPROCS * $CRIT_15min"|bc`," >> $DEBUGFILE
### here is the change
[ $CRIT -gt 0 ] && (echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_CRITICAL)
[ $WARN -gt 0 ] && (echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_WARNING)
}
After the change, the script checks for the critical state first and only then for the warning state.
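One thing to keep in mind when reordering these checks: in bash, an `exit` inside parentheses only ends the subshell that the parentheses create, not the script around it. The sketch below shows one possible way to test the most severe state first without a subshell. The `evaluate_state` helper is hypothetical and not part of check_load2; the state values are the standard Nagios plugin exit codes that utils.sh normally provides.

```shell
# Standard Nagios plugin exit codes (normally set by utils.sh).
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2

# Hypothetical helper: takes the CRIT and WARN counters and returns the
# matching state. The most severe state is tested first.
evaluate_state() {
    crit_count=$1
    warn_count=$2
    if [ "$crit_count" -gt 0 ]; then
        return $STATE_CRITICAL
    elif [ "$warn_count" -gt 0 ]; then
        return $STATE_WARNING
    fi
    return $STATE_OK
}

# An exit inside ( ... ) only leaves the subshell; the parent keeps going.
( exit $STATE_CRITICAL ) || echo "subshell returned $?, but the script is still running"

# Both counters are non-zero here, so the critical state wins.
state=$STATE_OK
evaluate_state 1 2 || state=$?
echo "state: $state"
```

In a real plugin the main routine would then print its status line and call `exit $state` itself, outside of any parentheses.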
Posted by Stefan Schwiedel (website)
If someone uses nagiosQL, this little script will help you.
You can change the letters for options too :)
Now if you start "./check_load2 -j 2.0", you will override the default (hardcoded) value of WARN_1min in the script with a load of 2.0 per CPU.
Hope that will help somebody in big environments.
You only need to change the first part of check_load2.
# You may have to change this, depending on where you installed your
# Nagios plugins
#
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/Tools/nagios/libexec"
. $LIBEXEC/utils.sh
### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="1"
DEBUGFILE="/tmp/nagios_check_load_debug"
rm $DEBUGFILE
### REQUISITE NAGIOS COMMAND LINE STUFF ###
PROGNAME=$(basename $0)
print_usage() {
echo "Usage: $PROGNAME"
echo "Usage: $PROGNAME --help"
echo " [ -j <warn level 1 min> -k <warn 5 level min> -l <warn level 15 min> ]"
echo " [ -m <critical level 1 min> -n <critical 5 level min> -o <critical level 15 min> ]"
}
print_help() {
echo ""
print_usage
echo ""
echo "Semi-intelligent CPU load monitor plugin for Nagios"
echo ""
echo "This plugin not developped by the Nagios Plugin group."
echo "Please do not e-mail them for support on this plugin, since"
echo "they won't know what you're talking about :P"
echo ""
echo "For contact info, read the plugin itself..."
}
### SETTING UP THE WARN AND CRIT FACTORS ###
# Please be aware that these are -factors- and not real load average values.
# The numbers below will be multiplied by the amount of processors to come
# to the desired WARN and CRIT levels. Feel free to adjust these factors, if
# you feel the need to tweak them.
### this default factor x number of cpus + warning level
WARN_1min="1.20"
WARN_5min="1.10"
WARN_15min="1.00"
### this default factor x number of cpus + critical level
CRIT_1min="2.40"
CRIT_5min="2.20"
CRIT_15min="2.00"
while getopts j:k:l:m:n:o:h OPTION
do
case ${OPTION} in
j) WARN_1min=${OPTARG};;
k) WARN_5min=${OPTARG};;
l) WARN_15min=${OPTARG};;
m) CRIT_1min=${OPTARG};;
n) CRIT_5min=${OPTARG};;
o) CRIT_15min=${OPTARG};;
h) print_help; exit $STATE_OK;;
?) print_usage;
exit 2;;
esac
done
[ $DEBUG -gt 0 ] && echo "Factors: warning factors are at $WARN_1min, $WARN_5min, $WARN_15min." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Factors: critical factors are at $CRIT_1min, $CRIT_5min, $CRIT_15min." >> $DEBUGFILE
### DEFINING SUBROUTINES ###
......
Posted by Hajo Kuras
Hello,
first of all thanks for the good work!
I have fixed the gather_load_sunos() function, because the line returned by "w" contains a variable number of comma-separated fields, depending on the uptime of the system. On a freshly restarted system, the original function does not work.
Best Regards,
Hajo
function gather_load_sunos()
{
REAL_1min=`w -u | sed -e 's/..*average: //g' | awk -F, '{print $1}'`
REAL_5min=`w -u | sed -e 's/..*average: //g' | awk -F, '{print $2}'`
REAL_15min=`w -u | sed -e 's/..*average: //g' | awk -F, '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}
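To illustrate why anchoring on the "load average:" label is more robust than counting comma-separated fields, here is a small sketch. The `parse_load` helper is hypothetical, and the two sample lines are hand-written illustrations of what uptime-style output can look like, not captured output from a real Solaris system.

```shell
# Hypothetical helper: strip everything up to "load average:" and print
# the three values that follow, separated by single spaces.
parse_load() {
    echo "$1" | sed -e 's/.*load average: *//' | awk -F', *' '{print $1, $2, $3}'
}

# Illustrative sample lines (assumed format). The number of commas before
# the load averages differs between the two.
LONG_UPTIME=' 10:14am  up 5 day(s),  3:02,  2 users,  load average: 0.10, 0.12, 0.15'
FRESH_BOOT=' 10:14am  up 3 min(s),  1 user,  load average: 0.50, 0.20, 0.10'

for line in "$LONG_UPTIME" "$FRESH_BOOT"; do
    set -- $(parse_load "$line")
    echo "1min=$1 5min=$2 15min=$3"
done
```

Because the values are located relative to the label, the extra "3:02," field in the first sample does not shift the positions the way it does with fixed field numbers.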
All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.
You are free to use this specific work, to share and distribute it and to adapt it for your own purposes. However, you must attribute this work as mine and you must share all of your alterations.