2008-01-01 00:00:00
A PDF version of this document is available. Get it over here.
People have often asked me how one can check whether a newly installed BoKS client is functioning properly. With these three easy steps you too can become a milliona..!!... Oops... Wrong show! These easy steps will show you whether your new client is working like it should. If all three steps go through without error, your system is as healthy as a very healthy good thing... or something.
When one or more of the replicas are out of sync, login attempts by users may fail if the BoKS client on the server in question happens to be talking to an out-of-sync replica. Other nasty stuff may also occur.
Standard procedure is to follow these steps:
All commands are run in a BoKS shell, on the master server unless specified otherwise.
# /opt/boksm/sbin/boksadm -S boksdiag list
Since last pckt
The amount of time (minutes/seconds) since the BoKS master last sent a communication packet to the respective replica server. This should never exceed a couple of minutes.
Since last fail
The amount of time (days/hours/minutes) since the BoKS master was last unable to update the database on the respective replica server. If only a couple of hours are listed, you know that the replica server had a recent failure.
Since last sync
The amount of time (days/hours/minutes) since BoKS last sent a database update to the respective replica server.
Last status
Yes indeed! The last known status of the replica server in question. OK means that the server is running perfectly and that updates are being received. Loading means that the server was just restarted and is still loading the database or any pending updates. Down indicates that the replica server is down, or even dead.
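Put together, the output might look roughly like this. This is a made-up sketch purely for illustration; the exact columns and formatting differ per BoKS version.
Keon> boksdiag list
Host       Since last pckt  Since last fail  Since last sync  Last status
replica01  0:42             12d 04:10        0:42             OK
replica02  0:15             0d 02:37         6:51             Loading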
This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the replicas to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.
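To narrow things down, you can grep the error log for the host in question; a minimal sketch using standard grep and tail:
Keon> grep -i $hostname /var/opt/boksm/boks_errlog | tail -20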
Keon> boksdiag download -force $hostname
This will push a database update to the replica. Perform another boksdiag list to see if it
worked. Re-read the BoKS error log file to see if things have cleared up.
Keon> ps -ef | grep -i drainmast
This should show two drainmast processes running. If there aren't two, you should see errors about this in the error logs and in Tivoli.
Keon> Boot -k
Keon> ps -ef | grep -i boks (kill any remaining BoKS processes)
Keon> Boot
Check to see if the two drainmast processes stay up. Keep checking for at least two minutes. If
one of them crashes again, try the following:
Check that /opt/boksm/lib/boks_drainmast is still linked to boks_drainmast_d, which should be in the same directory. Also check that boks_drainmast_d is still the same file as boks_drainmast_d.nonstripped. If it isn't, copy boks_drainmast_d to boks_drainmast_d.orig and then copy the non-stripped version over boks_drainmast_d. This will allow you to create a core file, which is useful to TFS Technology.
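A sketch of those checks and the copy, assuming the paths mentioned above:
Keon> cd /opt/boksm/lib
Keon> ls -l boks_drainmast boks_drainmast_d (verify the link is still intact)
Keon> cmp boks_drainmast_d boks_drainmast_d.nonstripped (no output means they're identical)
Keon> cp boks_drainmast_d boks_drainmast_d.orig (only needed if cmp reported a difference)
Keon> cp boks_drainmast_d.nonstripped boks_drainmast_d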
Keon> Boot -k
Keon> Boot
Keon> ls -al /core
Check that the core file was just created by boks_drainmast_d.
Keon> Boot -k
Keon> cd /var/opt/boksm/data
Keon> tar -cvf masterspool.tar master_spool
Keon> rm master_spool/*
Keon> Boot
Things should now be back to normal. Send both the tar file and the core file to TFS Technology
(support@tfstech.com).
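If you want to bundle everything up first, something like this would work; a sketch assuming both files were created as described above (the archive name is just an example):
Keon> tar -cvf /var/tmp/drainmast-debug.tar /core /var/opt/boksm/data/masterspool.tar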
Keon> boksdiag fque -master
If any messages are stuck, there is most likely still something wrong with the drainmast processes. You may want to try restarting the BoKS master software. Do NOT reboot the master server itself! Restart only the software, using the Boot command. If that doesn't help, perform the troubleshooting tips from step 4.
Verify that the BoKS communication between the master and the replica itself is up and running.
Keon> cadm -l -f bcastaddr -h $replica
If this doesn’t work, re-check the error logs on the client and proceed with step 7.
On the replica system run:
Keon> hostkey
Take the output from that command and run the following on the master:
Keon> dumpbase | grep $hostkey
If this doesn’t return the configuration for the replica server, the keys have become
unsynchronized. If you make any changes you will need to restart the BoKS processes, using the
Boot command.
Keon> dumpbase | grep RNAME | grep $replica
The TYPE field in the definition of the replica should be set to 261. Anything else is wrong, so you
need to update the configuration in the BoKS database. Either that or have SecOPS do it for you.
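A quick sketch for eyeballing that field; note that the exact layout of dumpbase output varies between BoKS versions, so the TYPE pattern below is an assumption you may have to adjust:
Keon> dumpbase | grep RNAME | grep $replica | grep 'TYPE=261'
If this returns nothing while the previous command did return the definition, the TYPE field is set to something else.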
On the replica system, review the settings in /etc/opt/boksm/ENV.
If all of the above fails you should really get cracking with the debugger. Refer to the appropriate
chapter of this manual for details.
Most obviously we can’t do our work on that particular server and neither can our customers.
Naturally this is something that needs to be fixed quite urgently!
All commands are run in a BoKS shell, on the master server unless specified otherwise.
Keon> cd /var/opt/boksm/data
Keon> grep $user LOG | bkslog -f - -wn
This should give you enough output to ascertain why a certain user cannot login. If there is no
output at all, do the following:
Keon> cd /var/junkyard/bokslogs
Keon> for file in `ls -lrt | tail -5 | awk '{print $9}'`
> do
> grep $user $file | bkslog -f - -wn
> done
If this doesn't provide any output, perform step 2 as well to see if we sysadmins can log in.
Pretty self-explanatory, isn't it? See if you can log in yourself.
Keon> cadm -l -f bcastaddr -h $client
Log in to the client through its console port.
Keon> cat /etc/opt/boksm/bcastaddr
Keon> cat /etc/opt/boksm/bremotever
These two files should match the same files on another working client. Do not use a replica or the master for comparison; the files are different there. If you make any changes, you will need to restart the BoKS processes using the Boot command.
On the client and master run:
Keon> getent services boks
This should return the same value for the BoKS base port on both systems. If it doesn't, check /etc/services or NIS+. If you make any changes, you will need to restart the BoKS processes using the Boot command.
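For reference, a matching lookup might return something like the line below. The port number shown is only an example; the actual base port is site-specific.
Keon> getent services boks
boks 6500/tcp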
On the client system run:
Keon> hostkey
Take the output from that command and run the following on the master:
Keon> dumpbase | grep $hostkey
If this doesn’t return the definition for the client server, the keys have become unsynchronized.
Reset them and restart the BoKS client software. If you make any changes you will need to restart
the BoKS processes using the Boot command.
This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.
If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details (see chapter: SCENARIO: Setting a trace within BoKS).
NOTE: If you need to restart the BoKS software on the client without logging in, try doing so using a remote management tool, like Tivoli.
The whole of BoKS is still up and running and everything's working perfectly. The only client(s) that won't work are the one(s) that have stuck queues. The only way you'll find out about this is by running boksdiag fque -bridge, which reports all of the queues that are stuck.
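Since stuck queues only show up when you go looking for them, a periodic check from a BoKS shell on the master can help; a minimal sketch using a plain shell loop:
Keon> while :; do boksdiag fque -bridge; sleep 60; done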
All commands are run in a BoKS shell, on the master server unless specified otherwise.
Keon> ping $client
Also ask your colleagues to see if they’re working on the system. Maybe they’re performing
maintenance.
Keon> cadm -l -f bcastaddr -h $client
On the client system run:
Keon> hostkey
Take the output from that command and run the following on the master:
Keon> dumpbase | grep $hostkey
If this doesn't return the definition for the client server, the keys have become unsynchronized. Reset them and restart the BoKS client software using the Boot command.
This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.
NOTE: What can we do about it?
If you're really desperate to get rid of the queue, do the following:
Keon> boksdiag fque -bridge -delete $client-ip
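Afterwards, re-run the listing to confirm that the queue for that client is gone:
Keon> boksdiag fque -bridge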
At one point in time we thought it would be wise to manually delete messages from the spool directories. Do not under any circumstances touch the crypt_spool and master_spool directories in /var/opt/boksm. Really: DON'T DO THIS! It is unnecessary and will lead to trouble with BoKS.
We are required to run a BoKS debug trace when either logins are getting rejected, or when TFS Tech support requests one by mail. TFS Tech support will usually ask us to perform a number of traces and to send them the output files.
First off, let me warn you: debug trace log files can grow pretty vast pretty fast! Make sure that
you turn on the trace only right before you’re ready to use the faulty part of BoKS and also be
sure to stop the trace immediately once you’re done.
Now, before you can start a trace you will need to make sure that the BoKS client system performs transactions with only one BoKS server. If you don't, you will have no way of knowing which server to run the trace on.
Login to the client system experiencing problems.
$ su -
# cd /etc/opt/boksm
# cp bcastaddr bcastaddr.orig
# vi bcastaddr
Edit the file so that it points to only one of the available BoKS servers, preferably a BoKS replica. Please refrain from using the BoKS master server.
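For instance, the edited file might contain nothing but the address of the chosen replica. The address below is made up, and the exact format of bcastaddr may differ per BoKS version, so treat this purely as an illustration:
# cat bcastaddr
10.1.2.15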
# /opt/boksm/sbin/boksadm -S Boot -k
# sleep 10; ps -ef | grep -i boks | awk '{print $2}' | xargs kill
# /opt/boksm/sbin/boksadm -S Boot
Now, how you proceed depends on what problems you are experiencing.
If people are having problems logging in:
Log in to the replica server and start BoKS with sx.
# sx /opt/boksm/sbin/boksadm -S
# cd /var/tmp
Now, type the following command, but DO NOT press enter yet.
# bdebug -x 9 bridge_servc_r -f /var/tmp/BR-SERVC.trace
Open a new terminal window, because we will try to log in to the failing client. BEFORE YOU START THE TOOL USED TO LOG IN (SSH, Telnet, FTP, whatever), press enter at the command waiting on the replica server. Attempt to log in as usual. If it fails, you have successfully set a trace.
Switch back to the window on the replica server and run the following command to stop the
trace.
# bdebug -x 0 bridge_servc_r
Repeat the same process once more, but this time around debug the servc process instead of
bridge_servc_r. Send the output to /var/tmp/SERVC.trace.
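Spelled out, that second pass would look like this; the same timing applies, so press enter only right before you attempt the login:
# bdebug -x 9 servc -f /var/tmp/SERVC.trace
(attempt the login from the other window)
# bdebug -x 0 servc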
You can now read through the files /var/tmp/BR-SERVC.trace and /var/tmp/SERVC.trace to troubleshoot the problem yourself, or you can send them to TFS Tech for analysis. If the attempted login did NOT fail, there's something else going on: one of the other replica servers is not working properly! Find out which one it is by changing the client's bcastaddr file, each time using a different BoKS server as the target.
If you are attempting to troubleshoot another kind of problem:
Tracing any other part of BoKS isn't really all that different from tracing the login process.
You prepare in the same way (make bcastaddr point at one BoKS server) and you will probably
have to prepare the trace on bridge_servc_r as well (see the text block above; if you do not have
to trace bridge_servc_r TFS Tech will probably tell you so).
Yet again, BEFORE you start the trace on the master side by running:
# bdebug -x 9 bridge_servc_r -f /var/tmp/BR-SERVC.trace
you will have to go to the client system with the problematic situation and perform the following:
# cd /var/tmp
# bdebug -x 9 $PROG -f /var/tmp/$PROG.trace
$PROG in this case is the name of the BoKS process (bridge_servc_r, drainmast_download) or the
access method (login, su, sshd) that you want to debug.
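For example, if you were asked to trace su, the client-side command would look like this (su is used purely as an illustration):
# bdebug -x 9 su -f /var/tmp/su.trace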
Now, start both traces and attempt to perform the task that is failing. Once it has failed, stop
both traces again using bdebug -x 0 $PROG.
From time to time you may have problems with the BoKS SSH daemon that cannot be explained in any logical way. At such a time, a debug trace of the SSH daemon can be very helpful! This can be done by temporarily starting a second daemon on an unused port.
On the troubled system, login and start a BoKS shell:
# /opt/boksm/sbin/boksadm -S
Keon> boks_sshd -d -d -d -p 24 > /tmp/sshd.out 2>&1
From another system:
$ ssh -l $username -p 24 $target-host
Try logging in; it shouldn’t work :) Now close the SSH session with Ctrl-C, which should also
close the temporary SSH daemon on port 24. /tmp/sshd.out should now contain all of the
debugging information you or TFS Technology could need.