Sometimes clusters do not guarantee high uptime

2007-10-20 13:36:00

Oh me, oh my... Clustering software does not always guarantee high uptime :/

At $CLIENT we've been having some nasty problems with our development SAP box. The box is part of a Veritas cluster and actually runs a bunch of Solaris Zones. The problems originally started about two months ago when we ran into a rare and newly discovered bug in UFS. It took a while for us to get the proper patches, but we finally managed to get that sorted out.

Remco installed the patches on Thursday morning, though he ran into some trouble. As always, patches can give you crap when it comes to cross-dependencies and this time wasn't any different. Around lunch time we thought we had things sorted out and went for the final reboot. All the zones were transferred to the proper boxen and things looked okay.

Until we tried to make a network connection. D:

None of the zones had access to the network, even though their interfaces were up and running. We sought for hours, but couldn't find anything. And like us, Sun was in the dark as well. In the end Remco and Sun worked all night to get an answer. Unfortunately they didn't make it, so I took over in the morning. Lemme tell you, once I was in the middle of all the tech and the phone calls and the managers, I found some more respect for Remco. He did a great job all through Thursday!

Just before lunch both Sun and one of the other guys came up with the solution. That was an awesome coincidence :) Turns out that the problems we were having are caused by timing issues during the boot-up of the Solaris Zones. Because we let Veritas Cluster handle the network interfaces things turned sour. Things would've worked better if we'd let the Zone framework handle things.

The stopgap solution: freeze all cluster resources to prevent fail-over, then manually restart all virtual interfaces for the zones. And presto! It works again!

Happily we went to lunch, only to come back to more crap!

Turns out that the five SAP instances we were running wouldn't fit into the available swap space anymore. Weird! Before yesterday, things would barely fit in the 30GB of swap space. And now all of a sudden SAP would eat about 38GB! o_O WTF?!

A whole bunch of managers wanted us to work through the whole weekend to sort everything out. Naturally we didn't feel to enthused, let alone the fact that the box's SLA doesn't cover weekend work.

In the end we tacked on some temporary swap space, started SAP and left for the weekend. We'll have to take more downtime on Monday for granted. It also leaves us with two big things to fix:

1. Modify the cluster/zone config for the network interfaces.

2. Find out why SAP has grown gluttonous and fix it.


kilala.nl tags: , ,

View or add comments (curr. 1)