2008-11-22 20:29:00
As I mentioned at the end of example 1, the problem with the seemingly random login denials was caused by a misbehaving replica server. We tracked the problem down to REPHOST, where we discovered that three of the database tables were not in sync with the rest. A number of hosts were being reported as non-existent, which was causing login problems for our users.
Now that we'd figured out which server was giving us problems and what the symptoms were, we needed to find out what was actually causing them.
One of our replica servers had three database tables that were not getting any updates. Their sequence numbers as reported by "boksdiag sequence" were very different from the sequence numbers on the master, indicating nastiness.
Just to be sure that the replica was still malfunctioning, let's check the sequence numbers on both the replica and the master again.
BoKS > boksdiag sequence
...
T7 13678d 8:33:46 5053 (5053)
...
T9 13178d 11:05:23 7919 (7919)
...
T15 13178d 11:03:16 1865 (1865)
...
BoKS > boksdiag sequence
...
T7 13678d 8:33:46 6982 (6982)
...
T9 13178d 11:05:23 10258 (10258)
...
T15 13178d 11:03:16 2043 (2043)
Yup, it's still broken :) You may notice that the sequence numbers on the replica are actually AHEAD of the numbers on the master server.
Because I was not sure what had been done to REPHOST in the past, I wanted to reset it completely without reinstalling the software. I knew that the host had been involved in a disaster recovery test a few months before, so I had a hunch that something'd gone awry in the conversion between the various host states.
Hence I chose to convert the replica back to client status.
BoKS> sysreplace restore
BoKS> convert -v client
Stopping daemons...
Setting BOKSINIT=client in ENV file...
Restarting daemons...
Conversion from replica to client done.
BoKS > cd /var/opt/boksm
BoKS > tail -20 boks_errlog
...
WARNING: Dying on signal SIGTERM
boks_authd Nov 17 14:59:30
INFO: Shutdown by signal SIGTERM
boks_csspd@REPHOST Nov 17 14:59:30
INFO: Shutdown by signal SIGTERM
boks_authd Nov 17 14:59:30
INFO: Min idle workers 32
boks_csspd@REPHOST Nov 17 14:59:30
INFO: Min idle workers 32
BoKS > sysreplace replace
I verified that all the running BoKS processes were newly created and that there were no stragglers from before the restart. I also tried to SSH to the replica, to make sure that I could still log in.
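That check is nothing fancy: a process listing on REPHOST plus a test login over SSH. The testuser account below is just a placeholder.
BoKS > ps -ef | grep -i boks
BoKS > ssh testuser@REPHOST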
The BoKS master server also needed to know that the replica was now a client. In order to do this I had to change the host's TYPE in the database. Initially I tried doing this with the following command.
BoKS> hostadm -a -h REPHOST -t UNIXBOKSHOST
Unfortunately this command refused to work, so I chose to modify the host type through the BoKS web interface instead. Just a matter of a few clicks here and there. Afterwards the BoKS master was aware that the replica was no more.
BoKS > boksdiag list
Server Since last Since last Since last Count Last
REPHOST5 00:49 523d 5:19:20 04:49 1853521 ok
REPHOST4 00:49 136d 22:21:35 04:49 526392 ok
REPHOST3 00:49 04:50 726768 ok
REPHOST2 00:49 107d 5:05:33 04:49 425231 ok
REPHOST 02:59 02:13 11:44 148342 down
It takes a little while for REPHOST's entry to completely disappear from the "boksdiag list" output, so I sped things up a bit by restarting the BoKS master using the "Boot -k" and "Boot" commands.
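For completeness: that restart on the master amounts to nothing more than stopping and starting the BoKS daemons.
BoKS > Boot -k
BoKS > Boot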
Of course I wanted REPHOST to be a replica again, so I changed the host type in the database using the web interface.
I then ran the "convert" command on REPHOST to promote the host again.
BoKS > convert -v replica
Checking to see if a master can be found...
Stopping daemons...
Setting BOKSINIT=replica in ENV file...
Restarting daemons...
Conversion from client to replica done.
BoKS > ps -ef | grep -i boks
root 16543 16529 0 15:14:33 ? 0:00 boks_bridge -xn -s -l servc.s -Q !/etc/opt/boksm!.servc!servc_queue -q /etc/opt
root 16536 16529 0 15:14:33 ? 0:00 boks_servc -p1 -xn -Q !/etc/opt/boksm!.xservc1!xservc_queue
root 16535 16529 0 15:14:33 ? 0:00 boks_servm -xn
root 16529 1 0 15:14:33 ? 0:00 boks_init -f /etc/opt/boksm/boksinit.replica
root 16540 16529 0 15:14:33 ? 0:00 boks_bridge -xn -r -l servc.r -Q /etc/opt/boksm/xservc_queue -P servc -k -K /et
root 16552 16529 0 15:14:33 ? 0:00 boks_csspd -e/var/opt/boksm/boks_errlog -x -f -c -r 600 -l -k -t 32 -i 20 -a 15
root 16533 16529 0 15:14:33 ? 0:00 boks_bridge -xn -s -l master.s -Q /etc/opt/boksm/master_queue -P master -k -K /
...
...
BoKS > cd ..
BoKS > tail boks_errlog
boks_authd Nov 17 14:59:30
INFO: Min idle workers 32
boks_csspd@REPHOST Nov 17 14:59:30
INFO: Min idle workers 32
boks_init@REPHOST Mon Nov 17 15:02:21 2008
WARNING: Respawn process sshd exited, reason: exit(1). Process restarted.
boks_init@REPHOST Mon Nov 17 15:14:31 2008
WARNING: Dying on signal SIGTERM
boks_aced Nov 17 15:14:33
ERROR: Unable to access configuration file /var/ace/sdconf.rec
On the master server I saw that the replica was communicating with the master again.
BoKS > boksdiag list
Server Since last Since last Since last Count Last
REPHOST5 04:35 523d 5:33:41 06:39 1853555 ok
REPHOST4 04:35 136d 22:35:56 06:42 526426 ok
REPHOST3 04:35 06:43 726802 ok
REPHOST2 04:35 107d 5:19:54 06:41 425265 ok
REPHOST 01:45 16:34 26:05 0 new
Oddly enough REPHOST was not receiving any real database updates. I also noticed that the sequence numbers for the local database copy hadn't changed. That was a hint which stuck in the back of my head, but I didn't pursue it at the time. Instead I suspected some problem with the communication bridges between the master and REPHOST.
BoKS > ls -lrt
...
...
-rw-r----- 1 root root 0 Nov 17 15:14 copsable.dat
-rw-r----- 1 root root 0 Nov 17 15:14 cert2user.dat
-rw-r----- 1 root root 0 Nov 17 15:14 cert.dat
-rw-r----- 1 root root 0 Nov 17 15:14 ca.dat
-rw-r----- 1 root root 0 Nov 17 15:14 authenticator.dat
-rw-r----- 1 root root 0 Nov 17 15:14 addr.dat
BoKS >
I was rather confused by now. Because REPHOST wasn't getting database updates I went down the list of obvious things to check.
Everything seemed completely fine! It was time to break out the big guns.
I decided to clear out the whole local copy of the database, to make sure that REPHOST had a clean start.
BoKS > Boot -k
BoKS > cd /var/opt/boksm
BoKS > tar -cvf data.20081117.tar data/*
a data/ 0K
a data/crypt_spool/ 0K
a data/crypt_spool/clntd/ 0K
a data/crypt_spool/clntd/ba_fbuf_LCK 0K
a data/crypt_spool/clntd/ba_fbuf_0000000004 6K
a data/crypt_spool/clntd/ba_fbuf_0000000003 98K
a data/crypt_spool/servc/ 0K
a data/crypt_spool/servm/ 0K
...
BoKS > cd data
BoKS > rm *.dat
BoKS > Boot
Checking the contents of /var/opt/boksm/data immediately afterwards showed that BoKS had re-created the database table files. Some of them were getting updates, but over 90% of the tables remained completely empty.
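A quick listing of the data directory (the same "ls -lrt" as before) was enough to tell them apart: the growing files were being replicated, while the zero-byte ones were not.
BoKS > cd /var/opt/boksm/data
BoKS > ls -lrt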
As explained in this article it's possible to trace the internal workings of just about every BoKS process. This includes the various communications bridges that connect the BoKS hosts.
I decided to use "bdebug" on the "servm_r" and "servm" processes on REPHOST, while also debugging "drainmast" and "drainmast_s" on the master server. The flow of data starts at drainmast, then goes through drainmast_s and servm_r, to finally end up in servm on the replica. Drainmast is what sends data to the replicas and servm is what commits the received changes to the local database copy.
Unfortunately the trace output didn't show anything remarkable, so I won't go over the details.
By now I'd run out of inspiration. I had no clue what was going on and I was one and a half hours into an incident that should've taken half an hour to fix. Since I always say that one should know one's limitations, I decided to call in FoxT tech support. Because it was already 1600 and I wanted to have the issue resolved before I went home, I called their international support number.
I submitted all the requested files to my engineer at FoxT, who was still investigating the case around 1800. Unfortunately things had gone a bit wrong in the handover between the day and night shifts, so my case had gotten lost. I finally got a call back from an engineer in the US at 2000. I talked things over with him and something in our call triggered that little voice stuck in the back of my head: sequence numbers!
The engineer advised me to go ahead and clear the sequence numbers file on REPHOST. At the same time I also deleted the database files again, for a really clean start.
BoKS > Boot -k
BoKS > cd /var/opt/boksm
BoKS > tar -cvf data.20081117-2.tar data/*
...
BoKS > cd data
BoKS > rm *.dat
BoKS > rm sequence
BoKS > Boot
Lo and behold! The database copy on REPHOST was being updated! All of the tables were getting filled again, including the three tables that had been stuck from the beginning.
The engineer informed me that in BoKS 6.5 the "convert" command is supposed to clear out the database and sequence file when demoting a master/replica to client status. Apparently this is NOT done automatically in BoKS versions 6.0 and lower.
We discovered that the host had at one point played the role of master server and that there was still some leftover crap from that time. During REPHOST's stint as the master the sequence numbers for tables 7, 9 and 15 had gotten ahead of the sequence numbers on the real master, which was turned off at the time. This happened because those three tables were edited extensively during the original master's downtime. And because the replica's sequence numbers were ahead, those tables never got updated again once the real master was back.
After the whole mess was fixed we concluded that the following four steps are all you need to restart your replica in a clean state.
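Going by the commands above, those four steps boil down to stopping BoKS, removing the database tables, removing the sequence file and starting BoKS again. Making a tar backup of the data directory beforehand obviously doesn't hurt.
BoKS > Boot -k
BoKS > rm /var/opt/boksm/data/*.dat
BoKS > rm /var/opt/boksm/data/sequence
BoKS > Boot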
I've also asked the folks at FoxT to file a bugfix request with their developers. As I mentioned in step 1, the sequence numbers on the replica were ahead of those on the master. Realistically speaking this should never happen, but BoKS does not currently recognize that situation as a failure.
In the meantime I'll write a monitoring script for Nagios and Tivoli that keeps an eye on the proper replication of the BoKS database.
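Something along the lines of the sketch below should do the trick: compare the output of "boksdiag sequence" on the master against a replica and complain when the numbers drift apart or, worse, when the replica gets ahead. This is only a rough sketch: it assumes the output format shown earlier in this post (table ID in the first column, sequence number in the fourth), that boksdiag is in the PATH, and that the master can SSH to the replica. The replica hostname and lag threshold are placeholders.
#!/bin/sh
# check_boks_sequence -- rough sketch of a replication check.
# Assumes the "boksdiag sequence" output format shown above:
# table ID in column 1, sequence number in column 4.
REPLICA="REPHOST"   # placeholder replica hostname
THRESHOLD=100       # placeholder: allowed lag before warning

MASTER_OUT=$(boksdiag sequence)
REPLICA_OUT=$(ssh "$REPLICA" boksdiag sequence)

STATUS=0
for TABLE in $(echo "$MASTER_OUT" | awk '/^T[0-9]/ {print $1}'); do
    M=$(echo "$MASTER_OUT"  | awk -v t="$TABLE" '$1 == t {print $4}')
    R=$(echo "$REPLICA_OUT" | awk -v t="$TABLE" '$1 == t {print $4}')
    if [ -z "$R" ]; then
        echo "CRITICAL: table $TABLE missing on $REPLICA"
        exit 2
    fi
    if [ "$R" -gt "$M" ]; then
        # A replica that is ahead of the master is exactly the mess
        # described in this post, so treat it as critical.
        echo "CRITICAL: table $TABLE on $REPLICA is ahead of the master ($R > $M)"
        exit 2
    fi
    if [ $((M - R)) -gt "$THRESHOLD" ]; then
        echo "WARNING: table $TABLE on $REPLICA lags $((M - R)) behind the master"
        STATUS=1
    fi
done
if [ "$STATUS" -eq 0 ]; then
    echo "OK: all tables on $REPLICA are in sync with the master"
fi
exit $STATUS
The exit codes follow the usual Nagios plugin conventions (0 = OK, 1 = warning, 2 = critical), which should also make the script easy to wrap for Tivoli.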
kilala.nl tags: boks, sysadmin,