BoKS troubleshooting: corrupt message queues

2010-01-12 20:36:00

Today I ran into a problem I hadn't encountered before: seemingly out of the blue, one of our BoKS client systems would not let anyone log in. The console showed the familiar "No contact with BoKS. Only "root" may login." message. The good thing was that the master could still communicate with the client through the clntd channel, so at least I could do a sysreplace restore through cadm -s.

We were originally alerted to this problem after the client in question had started reporting that its /var partition had reached 100%. After logging in I quickly saw why: for over 24 hours the bridge_servc_s process had been dumping core, leaving hundreds of core dumps in /var/core. This also explained why logging in did not work while master-to-client comms were still OK. /var/adm/messages confirmed these crashes, showing that the boks_bridge process kept restarting and dying on a SIGBUS signal.
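
For what it's worth, tracking down this kind of disk hog takes nothing BoKS-specific; /var/core is simply where our boxes are configured to drop their cores. Something along these lines does the trick:

# Which directories under /var are using the space?
du -sk /var/* | sort -n | tail

# How many core dumps are there, and when did the oldest appear?
ls /var/core | wc -l
ls -lt /var/core | tail

# file(1) reports which binary each core was dumped by
file /var/core/* | head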

The $BOKS_var/boks_errlog file showed these messages between a restart of BoKS and the point where it got killed off again:

boks_init@CLIENT Tue Jan 12 09:52:09 2010
  INFO: Max file descriptors 1024
boks_sshd@CLIENT Tue Jan 12 09:52:09 2010
  WARNING: Could not load host key: /etc/opt/boksm/keys/host.kpg
boks_udsqd@CLIENT Jan 12 09:52:09 [servc_queue]
  WARNING: Failed to connect to any server (0/1). Last attempt to ".servc", errno 146
boks_init@CLIENT Tue Jan 12 09:52:09 2010
  WARNING: Respawn process bridge_servc_s exited, reason: signal SIGBUS. Process restarted.
boks_udsqd@CLIENT Jan 12 09:52:10 [servc_queue]
  WARNING: Dropping packet. Server failed to accept it
boks_init@CLIENT Tue Jan 12 09:52:13 2010
  WARNING: Respawn process bridge_servc_s exited to often, NOT respawned
boks_init@CLIENT Tue Jan 12 09:53:26 2010
  WARNING: Dying on signal SIGTERM

This indicates that none of the replicas were accepting servc requests from the client, which in turn explains why one could not log in, use suexec and so on. Checking the $BOKS_var/boks_errlog file on the replicas explained why the servc requests were being rejected (a quick way to check all replicas at once follows the excerpt):

boks_bridge@REPLICA Mon Jan 11 22:41:16 2010
  ERROR: Got malformed message from 192.168.10.113
boks_bridge@REPLICA Tue Jan 12 01:04:06 2010
  ERROR: Got malformed message from 192.168.10.113
boks_bridge@REPLICA Tue Jan 12 01:07:46 2010
  ERROR: Got malformed message from 192.168.10.113

After some deliberation, FoxT tech support concluded that the client must have had a message in its outgoing servc queue that had gotten damaged. They suggested that I make a backup of $BOKS_var/data/crypt_spool/servc and then remove the files in that directory (roughly the steps sketched below). Normally it's not a good idea to remove these files, as they may contain password-change requests from users, but in this case there wasn't much else we could do. Remember though: leave the crypt_spool directory alone on the master and replicas, because that stuff's even more important!
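
For reference, this is roughly what that boils down to on the client. The tar filename is just my own choice, and /var/opt/boksm is assumed to be what $BOKS_var expands to on this box; adjust both to taste:

# Stop BoKS first if it is still running (your normal stop procedure, not shown)
cd /var/opt/boksm/data/crypt_spool

# Keep a copy of the queued messages before touching anything
tar cvf /var/tmp/servc_spool_backup.tar servc
tar tvf /var/tmp/servc_spool_backup.tar   # sanity check: list what went into the backup

# Clear out the (presumably corrupt) outgoing servc queue
rm servc/*

# Restart BoKS afterwards and try a normal login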

What do you know? After clearing out the message queue, the client worked perfectly. I'm now working with FoxT to find out which of the few dozen messages was the corrupt one, and in the process I'm trying to learn a little about the insides of BoKS. For example, looking at the message files it seems that either they were ALL deformed, or BoKS doesn't actually have a uniform format for them: some contained a smattering of newline characters, while other files were one long line. I'm still waiting for a reply on that question.
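
If you want to poke at the queued files yourself, the backup taken earlier (the /var/tmp/servc_spool_backup.tar from my sketch above) is a safe place to do it, and standard tools are all you need:

# Unpack the backup somewhere harmless and look at the files there
mkdir /var/tmp/servc_inspect && cd /var/tmp/servc_inspect
tar xf /var/tmp/servc_spool_backup.tar

# Sizes and (according to file(1)) types of the queued messages
ls -l servc
file servc/*

# Count newline characters per file; 0 means the file is one long line
wc -l servc/*

# Peek at the raw bytes of the first file in the queue
od -c "$(ls -d servc/* | head -1)" | head -20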


