BoKS troubleshooting: an example of a debugging session

2008-11-21 21:52:00

Recently we ran into a rather perplexing problem: a few of our customers had intermittent login problems. There seemed to be no pattern to this issue, with users from different departments being deing access to their servers at random points in time. Sometimes the problem would go away after a few hours, sometimes it took a few days. It took a few days before the penny dropped and we found out that one of our replica servers was misbehaving.

The paragraphs below outline my diagnosis and troubleshooting procedure.

Original issues

The issues seemed to focus on servers in one specific, physical location.

One of our DBAs created several incidents over the course of a month regarding login issues with user sybase@SYBHOST. Initially this problem was fixed by adding the "ssh_pk" authenticator, but the problem returned with intermittent login denial without an apparent reason.

A number of users from another department indicated intermittent login problems where they were allowed to login one day and denied access the next. My troubleshooting of the problem hadn't given me any real results so far. I'd ran debugging on SSH sessions which didn't clear much up.

For the remainder of this document I will focus on my troubleshooting process for the case involving user sybase.

Symptoms

These denials occur at seemingly random intervals and result in varying BoKS error messages. Most frequent is the rather useless "ERR 223, no authentication" which, as Fox Tech confirms, tells us absolutely nothing. At other times users receive an "ERR 203, no access route" eventhough said user does in fact have the requisite access routes.

Diagnosis

1. Map out the flow of data in this case.

In this case the DBAs attempt to use SSH (with keypair authentication) from sybase@UNIXHOST, to sybase@SYBHOST.

2. Verify the access routes involved in the exchange.

The BoKS database shows that both hosts are part of the hostgroup SYBASE.

BoKS > hgrpadm -l | grep UNIXHOST

...

SYBASE UNIXHOST

TRUSTED UNIXHOST

...



BoKS > hgrpadm -l | grep SYBHOST

...

SYBASE SYBHOST

TRUSTED SYBHOST

...

The BoKS database shows that user sybase is allowed SSH inside hostgroup SYBASE.

BoKS > sx /opt/boksm/sbin/boksadm -S dumpbase -t 2 | grep SYBASE:sybase

RUSER="SYBASE:sybase" ROUTE="ssh*:TRUSTED->SYBASE"

...

RUSER="SYBASE:sybase" ROUTE="ssh*:ANY/SYBASE->SYBASE"

...

3. Verify the authentication methods involved in the exchange.

The BoKS database confirms that sybase is allowed to use SSH keypairs.

BoKS > sx /opt/boksm/sbin/boksadm -S dumpbase -t 31 | grep SYBASE:sybase

RLOGNAME="SYBASE:sybase" TYPE="ssh_pk" VERSION="1.0" FLAGS="1"

4. Check the SSH keypair.

The public key of sybase@UNIXHOST is correctly installed in the authorized_keys file of user sybase@SYBHOST.

sybase@UNIXHOST > cat ~/.ssh/id_dsa.pub

ssh-dss AAAAB3NzaC1kc3MAAACBANSl ... WjUgDlUEIA5g== sybase@UNIXHOST



sybase@SYBHOST > cat ~/.ssh/authorized_keys

ssh-dss AAAAB3NzaC1kc3MAAACBAPd/ ... 8Cbt3Gl9hvTa== sybase@OTHERHOST

ssh-dss AAAAB3NzaC1kc3MAAACBANSl ... WjUgDlUEIA5g== sybase@UNIXHOST

The permissions on the .ssh directory for sybase@SYBHOST are also correct.

sybase@SYBHOST > ls -al ~/.ssh

drwx------ 2 sybase sybase 96 Aug 15 2007 .

drwxr-xr-x 3 sybase sybase 8192 Sep 12 15:58 ..

-rw------- 1 sybase sybase 1210 Oct 27 10:53 authorized_keys

5. Run SSH debug traces.

Because things seem alright so far it's time to check out what's going wrong on the inside of BoKS. The first step to take is to run an additonal debugging SSH daemon. This can be done using the following command. Key in this are the multiple -d flags and "-p 2222".

BoKS > /opt/boksm/lib/boks_sshd -d -d -d -D -g120 -p 2222 >/tmp/Trace.txt 2>&1

The customer is now instructed to attempt a login to port 2222 by adding "-p 2222" to his usual SSH command. This should of course still fail, but this time we can get a trace.

The trace output file gets pretty long because it no only shows the SSH debug information, but also debugging for the BoKS internals. After going through the hostkey exchange, BoKS will start authentication by requesting valid authentication methods.

debug2: userauth-request for user sybase service ssh-connection method none

debug2: input_userauth_request: setting up authctxt for sybase

...

debug2: get_opt_authmethod_from_servc: INSIDE - user = sybase, need_privsep = 0

debug2: boks_servc_call_vec: INSIDE boks_sshd@SYBHOST[6] 14 Nov 11:21:24:026533 in servc_call_str: To server: {FUNC=route-stat-user FROMUSER = sybase ROUTE = SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST = 192.168.0.181}

...

boks_sshd@SYBHOST[6] 14 Nov 11:21:24:264031 in servc_call_str: Return: {FUNC=route-stat-user FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181 $HOSTSYM=SYBHOST $ADDR=192.168.40.165 $SERVCADDR=192.168.23.9 METHODS=ssh_pk $SERVCVER=6.0.3}

debug2: get_opt_authmethod_from_servc: Must use BokS authentication methods: "ssh_pk"

debug2: get_opt_authmethod_from_servc: BokS optional authentication methods: ""

debug2: boks_ssh_restrict_authmethods: INSIDE - orginal authmethods = publickey,keyboard-interactive

debug2: boks_ssh_restrict_authmethods: DONE - returning methods = publickey

debug2: userauth-request for user

This confirms that authentication using SSH keypairs is allowed and is actually enforced. The key is now checked and (after some fidgeting) accepted.

debug2: input_userauth_request: try method publickey

debug1: trying public key file /home/sybase/.ssh/authorized_keys

...

debug2: userauth_pubkey: authenticated 1 pkalg ssh-dss

Accepted publickey for sybase from 192.168.0.181 port 63569 ssh2

Now that the user has been authenticated BoKS will check his access routes. Sadly this returns with ERR 203 (no access route)

boks_sshd@SYBHOST[6] 14 Nov 11:21:24:304336 in servc_call_str: To server: {FUNC=auth FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181 $ssh_pk=ok}

...

boks_sshd@SYBHOST[6] 14 Nov 11:21:24:314704 in servc_call_str: Return: {FUNC=auth FROMUSER=sybase ROUTE=SSH:UNIXHOST->SYBHOST TOHOST=SYBHOST TOUSER=sybase FROMHOST=192.168.0.181 $ssh_pk=ok01$HOSTSYM=SYBHOST $ADDR=192.168.40.165 $SERVCADDR=192.168.23.9 WC=#$*-./?_ UKEY=SYBASE:sybase MOD_CONV=1 SEC_USER=sybase VTYPE=ssh_pk MODLIST=optional_ssh_pk=+1,psw=+1,prompt=-1,timeout=+1,login=+1,verbose=+1 $STATE=6 ERROR=-203 $SERVCVER=6.0.3}

debug3: boks_ssh_do_authorization: Servc auth failed ERROR = -203

6. Force client to use one replica.

Please note that the SSH debug trace above shows that address 192.168.23.9 is used for the servc calls. This indicates that the client is communicating with replica REPHOST. In order to further aid the troubleshooting process it's best to force the client to communicate with just this one replica.

BoKS > cd /etc/opt/boksm

BoKS > vi bcastaddr



DONT_BROADCAST

ADDRESS_LIST

192.168.23.9 REPHOST.domain

~

~

:wq



BoKS > Boot -k

BoKS > Boot

7. Run a trace on the BoKS communications

Just to play it safe we'll need to check that the client's request is sent and received properly. This can be done by running a BoKS debug on the "servc_bridge_[s|r]" process, "s" being on the sending side and "r" on the receiving end.

Once again we'll be asking the customer to SSH to the system. However, right before he executes his command we'll run the following two commands.

Client: bdebug bridge_servc_s -x 9 -f /tmp/servcs.out

Replica: bdebug bridge_servc_r -x 9 -f /tmp/servcr.out

Right after the customer's SSH session is killed again we'll run the following commands.

Client: bdebug bridge_servc_s -x 0

Replica: bdebug bridge_servc_r -x 0

The two resulting files will be rather large and hard to read. Both log should only be given a cursory glance as they only pertain to the BoKS communications itself. In this case the logs indicate no problems at all, though they might have shown problems with hostkeys or network connectivity.

8. Run a trace on the BoKS database processing

Again we will ask the customer to attempt another (failed) login through SSH. This time we will trace another subset of BoKS, the "servc" process which handles the actual database lookup and verification.

Right before the client executes his SSH we'll run the following command.

Replica: bdebug servc -x 9 -f /tmp/servc-trace.out

Right after the customer's SSH session is killed again we'll run the following commands.

Replica: bdebug servc -x 0

The resulting log file will most likely be huge as it will contain all authentication requests handled by the replica during the trace. In order to get to the part of the log that is of interest to us it's best to do a search for the username (sybase). The first entry that we'll find is part of the setup of the authentication request.

servc@REPHOST[3] 14 Nov 11:43:35:660033 in servc_func_1: From client (SYBHOST) {FUNC=route-stat-user FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181}

BoKS will now go through a rather lengthy process of identifying the parties involved, which includes some BoKS-database and DNS voodoo to identify the hosts and their hostgroups. It's important to read all the log entries, searching for errors.

Having ascertained the identity of the parties involved, BoKS will start checking the appropriate access routes for the user. In this case you will see that BoKS will go over the access routes found at step 2 one by one. As part of this list it will also go over the access route that should have given sybase SSH access. However, instead we see the following.

14 Nov 11:43:35:930834 in fetchrec: Reading record from tab 2 at offset 1878504 (688 bytes)

14 Nov 11:43:35:931016 in get_route_key: got "ssh*:ANY/SYBASE->SYBASE"

14 Nov 11:43:35:931150 in am_methodcmp: ssh* == SSH ?

14 Nov 11:43:35:931254 in am_methodcmp: yes

14 Nov 11:43:35:931354 in hosttype_cmp: wild = ANY/SYBASE, host = UNIXHOST

14 Nov 11:43:35:931453 in domexpand: Enter. host="ANY/SYBASE"

...



14 Nov 11:43:35:931863 in domexpand: Return. "ANY/SYBASE.domain"

14 Nov 11:43:35:931963 in domexpand: Enter. host="UNIXHOST"

...

14 Nov 11:43:35:932367 in domexpand: Return. "UNIXHOST.domain"

...

14 Nov 11:43:35:932721 in host_wild_cmp: wild (SYBASE.domain) is a hostgroup

14 Nov 11:43:35:932824 in hostgroup_match_sub: enter

14 Nov 11:43:35:933336 in hostgroup_match_sub: no match

14 Nov 11:43:35:933641 in get_route_key: mismatch

This indicates that BoKS thinks that host UNIXHOST is not part of hostgroup SYBASE, even though we already confirmed that this is in fact the case (see step 2). This would seem to indicate that there are problems with the local copy of the BoKS database on replica REPHOST.

We won't have to continue reading the log file any further.

9. Verify data in database on faulty replica.

Suspecting database problems on the replica we check the following.

BoKS > hgrpadm -l | grep UNIXHOST

...

SYBASE UNIXHOST

TRUSTED UNIXHOST

...

Oddly enough the "hgrpadm" command, which interacts with the database, returns the proper results. However, dumping the local tables shows that we have problems.

BoKS > dumpbase -t 7 | grep UNIXHOST

BoKS > dumpbase -t 9 | grep UNIXHOST

BoKS > dumpbase -t 15 | grep UNIXHOST

10. Verifying the synchronisation status of the database

Run the following command on both the master server and the replica. Compare the figures for each table, looking for any discrepancies. A difference less than ten is alright, but anything in the dozens or higher is a problem. In this case I found the following.

BoKS > boksdiag sequence



Master Replica

...

T7 13678d 8:33:46 5053 (5053)

...

T9 13178d 11:05:23 7919 (7919)

...

T15 13178d 11:03:16 1865 (1865)

...

T7 13678d 8:33:46 6982 (6982)

...

T9 13178d 11:05:23 10258 (10258)

...

T15 13178d 11:03:16 2043 (2043)

This indicates that there are indeed synchronisation problems between this replica server and the master server.

11. Verifying the synch status of other replicas

Now that we've ascertained that there's one replica that's running badly, it's a good idea to check the other replicas as well. Run the "boksdiag sequence" command on the other replicas and verify the figures again.

In this case the figures for the other replicas all look fine, with one exception: REPHOST2 complains about database locking issues. Said error messages also pop up when running "dumpbase" commands on that replica, indicating software errors on that host as well.

boksdiag@REPLICA: INTERNAL DYNDB ERROR in blockbase(): Can't lock database

errno = 28, No space left on device

boksdiag@sREPLICA: INTERNAL DYNDB ERROR in bunlockbase(): Can't unlock database

errno = 28, No space left on device

T0 12549d 6:39:06 94193 (94193)

T1 13907d 7:13:45 637314 (637314)

...

 

In conclusion

In the end the problem was in fact down to REPHOST being out of synch with the rest of the BoKS domain. The troubleshooting continues with example 2.


kilala.nl tags: , ,

View or add comments (curr. 0)