2010-03-16 21:43:00
In a BoKS infrastructure the master server automatically distributes database updates to its replicas. BoKS provides the admin with a number of ways to verify the proper functioning of these replicas, but none of these is easily hooked into monitoring software.
This script makes use of the following methods to verify infra sanity:
* boksdiag list, to verify if replicas are reachable.
* boksdiag sequence, to verify if a replica's database is up to date.
* dumpbase -tN | wc -l, to verify the actual files on the replicas.
./check_boks_replication [-l LAG] [-h HOST] [-n] [-d -o FILE]

-l LAG      Maximum amount of updates for a replica table to be behind on. Typically this should not be over 50. Default is 30.
-h HOST     Hostname of individual replica to verify.
-x EXCLUDE  Hostname of replica to exclude.
-p          Disable the use of ping in connection testing, in case of firewalls.
-n          Dry-run mode. Will only return an OK status.
-d          Debug mode. Use with dry-run mode to test Tivoli.
-o FILE     Output file for debugging logs. Required when -d is used.

Example: ./check_boks_sequence -l 20 -d -o /tmp/foobar

Multiple -h and -x parameters are allowed.
This script is meant to be called as a Tivoli numeric script. Hence both the output and the exit code are a single digit. Please configure your numeric script calls accordingly:
0 = OK
1 = WARNING
2 = SEVERE
3 = CRITICAL
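As a rough illustration of the single-digit convention, here is a minimal sketch that maps a replica's lag onto a status digit. The function name status_for_lag and the thresholds for WARNING and SEVERE are assumptions made for this example; they are not taken from the actual script.

```shell
#!/bin/sh
# Sketch only: status_for_lag and its thresholds are hypothetical.
# Usage: status_for_lag LAG MAXLAG
# Prints one digit on stdout and returns the same digit as exit status,
# which is the shape a Tivoli numeric script call expects.
status_for_lag() {
    lag=$1      # number of updates the replica is behind
    max=$2      # allowed lag, e.g. the value given with -l
    if [ "$lag" -le "$max" ]; then
        status=0                            # OK
    elif [ "$lag" -le $((max * 2)) ]; then
        status=1                            # WARNING (assumed threshold)
    else
        status=2                            # SEVERE (assumed threshold)
    fi
    # 3 = CRITICAL is left for cases such as an unreachable replica.
    echo "$status"
    return "$status"
}
```

Called as `status_for_lag 45 30`, this prints 1 and returns exit status 1.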
$ wc check_boks_replication.ksh
570 2668 17878 check_boks_replication.ksh
$ cksum check_boks_replication.ksh
4063571181 17878 check_boks_replication.ks
Posted by Thomas
Just fixed a small "bug" in the script. While the theoretical maximum number of BoKS database tables is currently 54, it is not guaranteed that all of these tables are actually active on a server. If you try to dumpbase a nonexistent table it will return the following error:
dumpbase@SERVER Mon Sep 14 07:34:51 2009
FATAL ERROR: Illegal table number, Error 0 (0)
This will of course throw Tivoli for a loop since it only expects a number on stdout/stderr. This is why I've replaced this:
NUM=0
while [[ $NUM -lt 55 ]]
...
with this:
NUM=0
let MAXNUM=$(boksdiag sequence | grep ^T | tail -1 | cut -c 2,3)+1
while [[ $NUM -lt $MAXNUM ]]
...
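The MAXNUM calculation can be tried out without a BoKS server by feeding it canned input. This is a sketch under one assumption: that boksdiag sequence prints one line per table starting with T followed by a two-character table number, which is what the cut -c 2,3 above implies. The function name max_table_num and the sample lines are made up for the example. Stripping spaces and leading zeros is a defensive addition, since some shells read a value like 08 as an invalid octal number.

```shell
#!/bin/sh
# Sketch only: reads boksdiag-sequence-style output on stdin and prints
# the highest table number plus one. The input format is an assumption.
max_table_num() {
    last=$(grep '^T' | tail -1 | cut -c 2,3 | tr -d ' ' | sed 's/^0*//')
    # An empty result (no tables, or table 0 only) counts as 0.
    echo $(( ${last:-0} + 1 ))
}

# Hypothetical usage on a master:
#   MAXNUM=$(boksdiag sequence | max_table_num)
#   NUM=0
#   while [ "$NUM" -lt "$MAXNUM" ]; do ...; NUM=$((NUM + 1)); done
```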
Posted by Thomas
I may add another parameter to the command line to specify the $HOLDOFF. This parameter indicates how long the process on the master should wait before fetching status info from a replica.
T0 = Run commands on replica through cadm
T1 = Sleep
T1 + HOLDOFF = Fetch info from replica
The current default is 10 seconds, but that's not nearly enough if you have a very big database. A dumpbase on such a replica might take longer than that.
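The timeline above could be sketched as follows, with the hold-off taken from an environment variable so it can be raised for large databases. The two helper functions are hypothetical stand-ins for the real cadm calls, and reading HOLDOFF from the environment is only one way the proposed parameter might look.

```shell
#!/bin/sh
# Sketch only: the helpers below are placeholders, not real BoKS commands.
HOLDOFF=${HOLDOFF:-10}   # seconds; 10 is the current default

start_remote_checks()  { echo "started checks on $1"; }
fetch_remote_results() { echo "fetched results from $1"; }

check_replica() {
    start_remote_checks "$1"     # T0: kick off the commands on the replica
    sleep "$HOLDOFF"             # T1: wait for the replica to finish
    fetch_remote_results "$1"    # T1 + HOLDOFF: collect the status info
}
```

Running `HOLDOFF=60 ./script` would then give a slow dumpbase a full minute before the results are fetched.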
Maybe it's another idea to add a parameter that switches off the linecount check on dumpbase output. That way you'll prevent unneeded locking and it'll speed things up. The output would be less reliable though.
Posted by Thomas
Did two things:
1. Added the "no ping" option, by request of $CLIENT. This allows you to do a successful test, despite firewalls blocking a simple ping.
2. Fixed the database table comparison a little bit. Now all tables are checked, instead of simply those that have at least one line.
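To illustrate the second point, here is a minimal sketch of a comparison that covers every table, including those with zero lines. It assumes the line counts for master and replica have already been collected into two files with one "table_number line_count" pair per line. That file format and the function name compare_counts are inventions for this example, not the script's own.

```shell
#!/bin/sh
# Sketch only: compare_counts MASTER_FILE REPLICA_FILE
# Prints the number of tables whose line count differs between the two
# files or that are missing on the replica. Tables with a count of 0 are
# compared like any other table.
compare_counts() {
    awk 'FILENAME == ARGV[1] { master[$1] = $2; next }
         { replica[$1] = $2 }
         END {
             bad = 0
             for (t in master)
                 if (!(t in replica) || replica[t] != master[t])
                     bad++
             print bad
         }' "$1" "$2"
}
```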
All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.
You are free to use this specific work, to share and distribute it and to adapt it for your own purposes. However, you must attribute this work as mine and you must share all of your alterations.