Basic diagnostics for the high-availability cluster
Here are some basic diagnostic procedures for fixing problems on the hypatia/orestes high-availability (HA) cluster in the Nevis particle-physics Linux cluster.
The whole point of a high-availability cluster is that when things go wrong, resources are automatically shuffled between the systems in the cluster so that the users never notice a problem. This means that when things go wrong, they go really wrong, and fixing the problem is usually not as simple as the following procedures suggest.
Of course, the usual first step is to contact WilliamSeligman. These instructions assume that I am not available, and you're just trying to get things functional enough to send/receive mail, access the web services, etc.
General tips:
- ChengyiChi has the root password. It's also in an envelope in a secure location; ask AmyGarwood for access.
- Give these procedures time to work! For example:
- if the web server virtual machine is overloaded, it may take more than 20 minutes to restart on its own;
- if the DRBD partitions have to re-sync, it may take 20-30 minutes for other resources to start.
- These are extreme examples. Normally a delay of 5-10 minutes is typical for a serious-but-recoverable problem.
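While you're waiting, it helps to watch the cluster's progress rather than guess. Here's a minimal sketch; crm_mon comes with Pacemaker and should be in /usr/sbin alongside crm:
# Log into either HA node and watch the Pacemaker status; the display
# refreshes by itself, and Ctrl-C exits.
ssh -x root@hypatia
/usr/sbin/crm_mon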
Simple recipes
Restarting the mail server
If you want to reboot the mail server virtual machine, you can do:
ssh -x root@franklin "/sbin/shutdown -r now"
or
ssh -x root@hypatia "/usr/sbin/crm resource restart VM_franklin"
or (if franklin is stuck)
ssh -x root@hypatia "/usr/sbin/crm resource status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy franklin"
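After the "power-cycle", Pacemaker should notice that the virtual machine has died and start it again on its own. A sketch of how to confirm that it came back (give it a few minutes first):
# Check that Pacemaker has restarted the mail server's virtual machine...
ssh -x root@hypatia "/usr/sbin/crm resource status VM_franklin"
# ...and that the guest itself is answering again.
ssh -x root@franklin uptime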
Do NOT reboot hypatia or orestes by itself in the hopes of rebooting the mail server. The most likely outcome is that the rebooted system's UPS will shut down its power.
Restarting the web server
Same as above, replacing franklin with ada.
It's far more common for the web server to become overloaded than the mail server. The problem appears to be web robots (e.g., Google, Yahoo) probing the wiki and getting "stuck". They keep probing the same web address, the wiki software is further delayed with each query, and we have a DoS situation. Usually just restarting apache on the web server is sufficient:
ssh -x root@ada "/sbin/service httpd restart"
Give this a lot of time to complete. I've seen the web server so bogged down with processes that it takes 30 minutes for apache to restart.
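If you want a rough idea of whether it's the overload problem before (or while) restarting apache, these checks are harmless (just a sketch, not a required step):
# A very high load average and a large number of httpd processes usually
# mean the robots-vs-wiki problem described above.
ssh -x root@ada uptime
# Count the httpd processes (the total includes one header line).
ssh -x root@ada "ps -C httpd | wc -l"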
Quick-but-drastic
Cycling the power on both hypatia and orestes at the same time may fix the problem. Or it may cause the cluster to hang indefinitely. It's usually better to try to understand what the problem is (multiple hard drive failures? bad configuration file?), but if you're rushed for time you can give it a try.
The problems with this approach:
- If the cause is a hardware problem or a configuration issue, power-cycling won't solve it.
- This guarantees that the DRBD partition will be rebuilt. It's hard to say how long this will take. You can monitor the status of the rebuild as described below (cat /proc/drbd).
- Cutting the power also "pulls the plug" on all the virtual machines, with all the risks associated with pulling the plug on a physical box. The disk images on the virtual machines can be corrupted, especially if they're halted in the middle of a disk write, which happens frequently on the mail and web servers. There's a restore procedure for a damaged virtual machine given below, but it's better if a virtual machine isn't corrupted in the first place.
Some things to check before taking this step:
- Make sure that the UPSes for both hypatia and orestes are on. If either or both are off, turn them on; that might be the cause of the problem.
- Check that the rack-b-switch is powered on and running.
- The power for rack-b-switch comes from bleeker-ups; make sure that's up as well.
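If you're logged into a node that runs NUT (the UPS-monitoring software behind the fence_nut resources), you can also query the UPSes from the command line. This is only a sketch; the UPS name below is a guess based on the names above, so use upsc -l to see what's actually configured:
# List the UPSes known to the local NUT daemon, then query one of them
# (replace "bleeker-ups" with a name from the list).
upsc -l
upsc bleeker-ups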
If you have no other choice, pray to whatever gods you worship (I prefer Hermes and Hecate), then cycle the power on both hypatia and orestes using the buttons on their front panels.
Pacemaker diagnostics
You can see the state of the resources managed by Pacemaker with the command
crm status
Resources are running normally
If things are working, you'll see something like this:
============
Last updated: Tue Jan 8 17:39:02 2013
Last change: Tue Jan 1 00:44:02 2013 via cibadmin on orestes.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
65 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Masters: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
CronAmbientTemperature (ocf::heartbeat:symlink): Started orestes.nevis.columbia.edu
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Resource Group: DhcpGroup
SymlinkDhcpdConf (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkSysconfigDhcpd (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkDhcpdLeases (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
Dhcpd (lsb:dhcpd): Started hypatia.nevis.columbia.edu
IP_dhcp (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: IPClone [IPGroup] (unique)
Resource Group: IPGroup:0
IP_cluster:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_local:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_sandbox:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
Resource Group: IPGroup:1
IP_cluster:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_local:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_sandbox:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
Clone Set: LibvirtdClone [LibvirtdGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: TftpClone [TftpGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: ExportsClone [ExportsGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: FilesystemClone [FilesystemGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
VM_proxy (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_hogwarts (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_wordpress (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_tango (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
VM_sullivan (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_ada (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
VM_nagios (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
CronBackupVirtualDiskImages (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
VM_franklin (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
You can compare this with the resource sketch elsewhere in this wiki, and with the output from crm configure show. Without going into detail, note that in the above display:
- cloned resources are running on both nodes (exception: both instances of IPClone tend to run on a single node);
- other resources are distributed between the two nodes;
- virtual machine resources such as VM_franklin (the mail server) and VM_ada (the web server) are all running.
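If all you want is a quick yes/no on the virtual machines without reading the whole display, a simple filter works (a sketch; crm_mon comes with Pacemaker):
# Show only the virtual-machine resources and where each one is running.
crm_mon -1 | grep VM_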
Only one node up
If only one node is up, you might see something like this:
============
Last updated: Wed May 16 16:28:21 2012
Last change: Tue May 15 18:43:21 2012 via crmd on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
63 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu ]
OFFLINE: [ orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Masters: [ hypatia.nevis.columbia.edu ]
Stopped: [ AdminDrbd:1 ]
CronAmbientTemperature (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Resource Group: DhcpGroup
SymlinkDhcpdConf (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkSysconfigDhcpd (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkDhcpdLeases (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
Dhcpd (lsb:dhcpd): Started hypatia.nevis.columbia.edu
IP_dhcp (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: IPClone [IPGroup] (unique)
Resource Group: IPGroup:0
IP_cluster:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_local:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_sandbox:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Resource Group: IPGroup:1
IP_cluster:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_local:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_sandbox:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: LibvirtdClone [LibvirtdGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ LibvirtdGroup:0 ]
Clone Set: TftpClone [TftpGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ TftpGroup:1 ]
Clone Set: ExportsClone [ExportsGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ ExportsGroup:0 ]
Clone Set: FilesystemClone [FilesystemGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ FilesystemGroup:1 ]
VM_proxy (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_hogwarts (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_wordpress (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_tango (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_sullivan (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_ada (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_nagios (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
CronBackupVirtualDiskImages (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
VM_franklin (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
Note that all the resources are running on a single node (hypatia in this example) and the other node is off-line.
This is not a disaster; this is what the HA software is supposed to do if there's a problem with one of the nodes. As long as everything is running, there's nothing you have to do. Wait until WilliamSeligman is available to fix whatever is wrong with the off-line node.
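If you're curious whether the off-line node is reachable at all, a harmless check from any other Nevis machine is:
# See whether the off-line node (orestes in this example) answers on the network.
ping -c 3 orestes.nevis.columbia.edu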
Problems with a resource
What if you see that a resource is missing from the list; e.g., you don't see VM_franklin, which means the mail server's virtual machine isn't running? Or what if you see an explicit error message, like this:
============
Last updated: Fri Mar 2 17:17:25 2012
Last change: Fri Mar 2 17:12:44 2012 via crm_shadow on orestes.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, unknown expected votes
41 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Clone Set: ExportsClone [ExportsGroup]
Started: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Failed actions:
ExportMail:0_monitor_0 (node=hypatia.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error
ExportMail:0_monitor_0 (node=orestes.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error
The first step is to try to clear the error message. In this case, the resource displaying the error is ExportMail. The command to try is
crm resource cleanup [resource-name]
For example, if you saw that VM_franklin was missing from the list or appeared in an error message, you could try crm resource cleanup VM_franklin.
In general, try to clean up resources from the "top" down; that is, if more than one resource is missing or showing an error, clean up the ones nearer the top of the above lists. This more-or-less corresponds to the order in which the resources are started by Pacemaker.
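For example, if both the export resource and the mail server's VM were showing errors, a plausible order would be (a sketch using the resource names from the example above):
# Clean up the resource nearer the top of the list first...
crm resource cleanup ExportMail
# ...then the virtual machine further down...
crm resource cleanup VM_franklin
# ...and check whether the errors have cleared.
crm status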
DRBD problems
Suppose you see something like this:
============
Last updated: Tue Mar 27 17:25:36 2012
Last change: Tue Mar 27 17:03:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
58 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
The DRBD resource is running, but both nodes are in the slave state. This means that the DRBD disks on the two nodes are synchronized, but neither node has mounted the disk.
This is not necessarily an emergency; it may just mean a wait of up to 15 minutes. This normally occurs if both systems have come up, and it took a while for the DRBD disks to synchronize between the two systems. You can try to speed things up with the command
crm resource cleanup AdminClone
You can check on the status of DRBD on both systems. The command is
cat /proc/drbd
If everything is normal, the output looks like this:
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0
If one or both systems are DRBD slaves, they'll be labeled as "Secondary"; e.g.:
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0
You may even catch the systems in the act of syncing, if you're recovering from one system being down:
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
ns:2184 nr:0 dw:0 dr:2472 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10064
[=====>..............] sync'ed: 33.4% (10064/12248)K
finish: 0:00:37 speed: 240 (240) K/sec
All of the above are relatively normal. Pacemaker should eventually recover without any further intervention.
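If a re-sync is in progress and you want to follow it without retyping the command, something like this is handy:
# Refresh the DRBD status every 10 seconds; Ctrl-C exits.
watch -n 10 cat /proc/drbd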
However, it's much more serious if the output of crm status looks something like this:
============
Last updated: Mon Mar 5 11:39:44 2012
Last change: Mon Mar 5 11:37:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, unknown expected votes
41 Resources configured.
============
Node orestes.nevis.columbia.edu: UNCLEAN (online)
OFFLINE: [ hypatia.nevis.columbia.edu ]
Here nothing is running. One of the systems (hypatia) has been STONITHed. The other thinks that it can't join the HA cluster. The DRBD output from cat /proc/drbd probably looks something like this:
0: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
Here's what probably happened: Both nodes went down. Only one node has come back up even partially. The drbd service is not sure whether its copy of the disk is up-to-date; it can't tell for certain until it can talk with the other node. With the other node down, drbd can't do that. The result is that cluster resource management hangs indefinitely.
This is the reason why random power-cycling of the nodes is a bad idea.
This problem is supposed to fix itself automatically. But if you see a node with an UNCLEAN status, pacemaker has already tried and failed. The next step is for you to fix the DRBD status manually.
DRBD manual repair
Stop pacemaker
On the UNCLEAN node(s), turn off pacemaker:
/sbin/service pacemaker stop
If pacemaker hangs while stopping
If more than two minutes have elapsed, and the script is still printing out periods, then you'll have to force-quit the pacemaker service:
Ctrl-C # to stop the pacemaker script
killall -v -HUP pacemakerd
# Check that there are no surviving pacemaker processes.
# If there are, use "kill" on them.
ps -elf | grep pacemaker
ps -elf | grep pcmk
Disable HA services at startup
/sbin/chkconfig pacemaker off
/sbin/chkconfig clvmd off
/sbin/chkconfig cman off
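# (Optional sketch) Double-check that the HA services are now disabled;
# each of these should report "off" for runlevels 2 through 5:
/sbin/chkconfig --list pacemaker
/sbin/chkconfig --list clvmd
/sbin/chkconfig --list cman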
# Reboot the system:
/sbin/shutdown -r now
If a node is down, boot it up in single-user mode to disable its HA services
Start drbd manually
On both systems, start drbd. Type the following command on one node, then switch to the other node as quickly as possible and type it again:
/sbin/service drbd start
If one node is still down because of a hardware problem (e.g., two or more RAID drives have failed; the power supply has blown; the motherboard is fried), then you'll get a prompt on the surviving node; answer "yes" to disable waiting.
Fix any problems
On each node, check the DRBD status:
cat /proc/drbd
Hopefully DRBD will be able to resolve the problem on its own; you may see messages about the partitions syncing as shown above. Eventually you'll see the partition(s) have a status of Secondary.
If not, you'll have to do web searches on the DRBD status messages to fix the problem.
The name of the DRBD resource is admin (assigned in /etc/drbd.d/admin.res).
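One situation you may run into after both nodes have been power-cycled is a DRBD "split-brain", where each node thinks its copy of admin is the good one and the two refuse to connect. The sketch below is the generic DRBD split-brain recovery recipe, not a procedure specific to this cluster: you have to decide which node holds the copy of the data you want to keep, and these commands throw away the changes on the other node, so don't run them unless you're sure.
# On the node whose data you are willing to DISCARD:
drbdadm secondary admin
drbdadm -- --discard-my-data connect admin
# (On newer DRBD releases the syntax is: drbdadm connect --discard-my-data admin)

# On the node whose data you want to KEEP:
drbdadm connect admin

# Then watch the re-sync:
cat /proc/drbd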
Once DRBD is working, on both nodes activate the HA resources at startup:
/sbin/chkconfig pacemaker on
/sbin/chkconfig clvmd on
/sbin/chkconfig cman on
Reboot the node(s). After issuing this command on one node, issue it on the other node as quickly as possible:
/sbin/shutdown -r now
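After both nodes are back up, give Pacemaker several minutes and then check that things look like the "Resources are running normally" display above. A quick check on either node:
# DRBD should settle at Primary/Primary and UpToDate/UpToDate...
cat /proc/drbd
# ...and Pacemaker should show both nodes online with the resources started.
crm status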
Virtual machines
As of Jan-2013, these are the virtual machines running on the HA cluster:
| VM | function |
| franklin | mail server |
| ada | web server (including calendars, wiki, ELOG) |
| sullivan | mailing-list server |
| tango | SAMBA server (the admin staff calls this the "shared server") |
| hogwarts | home directories for staff accounts |
| nagios | Linux cluster monitoring |
| proxy | web proxy, occasionally used by Nevis personnel at CERN |
| wordpress | Wordpress server; as of Jan-2013 only used by VERITAS |
The domain name of the virtual machine (what you see if you enter the virsh list command on hypatia or orestes) is the same as the IP name; e.g., if the mail server happens to be running on hypatia and you want to reboot it, you can login to hypatia and type virsh reboot franklin, or you can ssh root@franklin and reboot the system from the command line. The name of the pacemaker resource that controls the virtual machine is the same as the domain name prefixed by VM_; e.g., if you want to use pacemaker to restart the mail server, you type crm resource restart VM_franklin on either HA node, without knowing on which system it's running.
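For example, here's a sketch (using the names above) of locating and rebooting the web server's virtual machine by hand:
# On either HA node: find out where the web server's VM is running.
crm resource status VM_ada
# On the host it reports: list the running domains and reboot the guest.
/usr/sbin/virsh list
/usr/sbin/virsh reboot ada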