Basic diagnostics for the high-availability cluster

Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems which the HA set-up would have prevented, but many hours of downtime due to problems with the HA software. This mailing-list post has some details.

Here are some basic diagnostic procedures for the hypatia/orestes high-availability (HA) cluster in the Nevis particle-physics Linux cluster.

The whole point of a high-availability cluster is that when things go wrong, resources automatically shuffle between the systems in the cluster so that the users never notice a problem. This means that when things go wrong, they go really wrong, and fixing the problem is usually not as simple as the following procedures suggest.

Of course, the usual first step is to contact WilliamSeligman. These instructions assume that I am not available, and you're just trying to get things functional enough to send/receive mail, access the web services, etc.

General tips:

  • You will need the cluster root password. sudo may work, but at minimum you'll have to modify your $PATH:
    export PATH=/usr/sbin:/sbin:${PATH}

  • ChengyiChi has the root password. It's also in an envelope in a secure location; ask AmyGarwood for access.

  • Give these procedures time to work! For example:
    • if the web server virtual machine is overloaded, it may take more than 20 minutes to restart on its own;
    • if the DRBD partitions have to re-sync, it may take 20-30 minutes for other resources to start.

  • These are extreme examples. Normally a delay of 5-10 minutes is typical for a serious-but-recoverable problem.

Simple recipes

Restarting the mail server

If you want to reboot the mail server virtual machine, you can do:

ssh -x root@franklin "/sbin/shutdown -r now"
or
ssh -x root@hypatia "/usr/sbin/crm resource restart VM_franklin"
or (if franklin is stuck)
ssh -x root@hypatia "/usr/sbin/crm resource status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy franklin"
Do NOT reboot hypatia or orestes by itself in the hopes of rebooting the mail server. The most likely outcome is that the rebooted system's UPS will shut down its power.

Restarting the web server

Same as above, replacing franklin with ada.

It's far more common for the web server to become overloaded than the mail server. The problem appears to be web robots (e.g., Google, Yahoo) probing the wiki and getting "stuck". They keep probing the same web address, the wiki software is further delayed with each query, and we have a DoS situation. Usually just restarting apache on the web server is sufficient:

ssh -x root@ada "/sbin/service httpd restart"
Give this a lot of time to complete. I've seen the web server so bogged down with processes that it takes 30 minutes for apache to restart.
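
If you want to confirm that ada really is overloaded before restarting apache, a quick check like the following can help. This is only a sketch, and it assumes ada still accepts ssh logins; a very high load average or hundreds of httpd processes is consistent with the web-robot problem described above:

# Show ada's load average and count its running httpd processes
ssh -x root@ada "uptime; ps -C httpd --no-headers | wc -l"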

Quick-but-drastic

Warning: Before trying this, make sure that the problem is being caused by a failure in the HA cluster software (pacemaker or drbd). Cycling the power on both hypatia and orestes at the same time may fix the problem. Or it may cause the cluster to hang indefinitely. It's usually better to try to understand what the problem is (multiple hard drive failures? bad configuration file?), but if you're rushed for time you can give it a try.

The problems with power-cycling the HA cluster:

  • If the cause is a hardware problem or a configuration issue, power-cycling won't solve it.

  • This guarantees that the DRBD partition will be rebuilt. It's hard to say how long this will take. You can monitor the status of the rebuild as described below (cat /proc/drbd).

  • Cutting the power also "pulls the plug" on all the virtual machines, with all the risks associated with pulling the plug on a physical box. The disk images on the virtual machines can be corrupted, especially if they're halted in the middle of a disk write, which is frequent for the mail and web servers. There's a restore procedure for a damaged virtual machine given below, but it's better if a virtual machine isn't corrupted in the first place.

Some things to check before taking this step:

  • Make sure that the UPSes for both hypatia and orestes are on. If either or both are off, turn them on; that might be the cause of the problem. (A way to query the UPSes from a working node is sketched after this list.)

  • Check that the rack-b-switch is powered on and running.

  • The power for rack-b-switch comes from bleeker-ups; make sure that's up as well.
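
If you'd rather check the UPSes from your desk, they can usually be queried with the NUT tools from a working node (the stonith resources use fence_nut, so NUT is installed). This is only a sketch; the UPS names below are guesses, so look in the NUT configuration (e.g., /etc/ups/ups.conf or upsmon.conf) for the real names and hosts:

# Hypothetical UPS names; substitute the ones from the NUT configuration
upsc hypatia-ups@localhost ups.status
upsc orestes-ups@localhost ups.status
upsc bleeker-ups@localhost ups.status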

If you have no other choice, pray to whatever gods you worship (I prefer Hermes and Hecate), then cycle the power on both hypatia and orestes using the buttons on their front panels.

Pacemaker diagnostics

Log in as root to hypatia or orestes. You can see the state of the resources managed by Pacemaker with the command

crm status

Resources are running normally

If things are working, you'll see something like this:

============
Last updated: Tue Jan  8 17:39:02 2013
Last change: Tue Jan  1 00:44:02 2013 via cibadmin on orestes.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
65 Resources configured.
============

Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]

 Master/Slave Set: AdminClone [AdminDrbd]
     Masters: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
 CronAmbientTemperature   (ocf::heartbeat:symlink):   Started orestes.nevis.columbia.edu
 StonithHypatia   (stonith:fence_nut):   Started orestes.nevis.columbia.edu
 StonithOrestes   (stonith:fence_nut):   Started hypatia.nevis.columbia.edu
 Resource Group: DhcpGroup
     SymlinkDhcpdConf   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     SymlinkSysconfigDhcpd   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     SymlinkDhcpdLeases   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     Dhcpd   (lsb:dhcpd):   Started hypatia.nevis.columbia.edu
     IP_dhcp   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
 Clone Set: IPClone [IPGroup] (unique)
     Resource Group: IPGroup:0
         IP_cluster:0   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
         IP_cluster_local:0   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
         IP_cluster_sandbox:0   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
     Resource Group: IPGroup:1
         IP_cluster:1   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
         IP_cluster_local:1   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
         IP_cluster_sandbox:1   (ocf::heartbeat:IPaddr2):   Started orestes.nevis.columbia.edu
 Clone Set: LibvirtdClone [LibvirtdGroup]
     Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
 Clone Set: TftpClone [TftpGroup]
     Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
 Clone Set: ExportsClone [ExportsGroup]
     Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
 Clone Set: FilesystemClone [FilesystemGroup]
     Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
 VM_proxy   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_hogwarts   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_wordpress   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_tango   (ocf::heartbeat:VirtualDomain):   Started orestes.nevis.columbia.edu
 VM_sullivan   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_ada   (ocf::heartbeat:VirtualDomain):   Started orestes.nevis.columbia.edu
 VM_nagios   (ocf::heartbeat:VirtualDomain):   Started orestes.nevis.columbia.edu
 CronBackupVirtualDiskImages   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
 VM_franklin   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu

You can compare this with the resource sketch elsewhere in this wiki, and with the output of crm configure show; a quick way to scan a listing like this for anything that isn't running is sketched after the list below. Without going into detail, note that in the above display:

  • both nodes are on-line;

  • cloned resources are running on both nodes (exception: both instances of IPClone tend to run on a single node);

  • other resources are distributed between the two nodes;

  • virtual machine resources such as VM_franklin (the mail server) and VM_ada (the web server) are all running.
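
With 65 resources configured, it's easy to overlook a single stopped entry in a listing like the one above. A one-shot scan can help; crm_mon -1 prints essentially the same report as crm status, and the grep flags anything that looks wrong (no output is good news):

# Flag stopped, failed, or unmanaged resources; no output means none were found
/usr/sbin/crm_mon -1 | egrep -i 'stopped|failed|unmanaged'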

Only one node up

If only one node is up, you might see something like this:

============
Last updated: Wed May 16 16:28:21 2012
Last change: Tue May 15 18:43:21 2012 via crmd on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
63 Resources configured.
============

Online: [ hypatia.nevis.columbia.edu ]
OFFLINE: [ orestes.nevis.columbia.edu ]

 Master/Slave Set: AdminClone [AdminDrbd]
     Masters: [ hypatia.nevis.columbia.edu ]
     Stopped: [ AdminDrbd:1 ]
 CronAmbientTemperature   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
 StonithOrestes   (stonith:fence_nut):   Started hypatia.nevis.columbia.edu
 Resource Group: DhcpGroup
     SymlinkDhcpdConf   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     SymlinkSysconfigDhcpd   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     SymlinkDhcpdLeases   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
     Dhcpd   (lsb:dhcpd):   Started hypatia.nevis.columbia.edu
     IP_dhcp   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
 Clone Set: IPClone [IPGroup] (unique)
     Resource Group: IPGroup:0
         IP_cluster:0   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
         IP_cluster_local:0   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
         IP_cluster_sandbox:0   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
     Resource Group: IPGroup:1
         IP_cluster:1   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
         IP_cluster_local:1   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
         IP_cluster_sandbox:1   (ocf::heartbeat:IPaddr2):   Started hypatia.nevis.columbia.edu
 Clone Set: LibvirtdClone [LibvirtdGroup]
     Started: [ hypatia.nevis.columbia.edu ]
     Stopped: [ LibvirtdGroup:0 ]
 Clone Set: TftpClone [TftpGroup]
     Started: [ hypatia.nevis.columbia.edu ]
     Stopped: [ TftpGroup:1 ]
 Clone Set: ExportsClone [ExportsGroup]
     Started: [ hypatia.nevis.columbia.edu ]
     Stopped: [ ExportsGroup:0 ]
 Clone Set: FilesystemClone [FilesystemGroup]
     Started: [ hypatia.nevis.columbia.edu ]
     Stopped: [ FilesystemGroup:1 ]
 VM_proxy   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_hogwarts   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_wordpress   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_tango   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_sullivan   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_ada   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 VM_nagios   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu
 CronBackupVirtualDiskImages   (ocf::heartbeat:symlink):   Started hypatia.nevis.columbia.edu
 VM_franklin   (ocf::heartbeat:VirtualDomain):   Started hypatia.nevis.columbia.edu

Note that all the resources are running on a single node (hypatia in this example) and the other node is off-line.

This is not a disaster; this is what the HA software is supposed to do if there's a problem with one of the nodes. As long as everything is running, there's nothing you have to do. Wait until WilliamSeligman is available to fix whatever is wrong with the off-line node.

Problems with a resource

What if you see that a resource is missing from the list; e.g., you don't see VM_franklin, which means the mail server's virtual machine isn't running? Or what if you see an explicit error message, like this:

    ============
    Last updated: Fri Mar  2 17:17:25 2012
    Last change: Fri Mar  2 17:12:44 2012 via crm_shadow on orestes.nevis.columbia.edu
    Stack: cman
    Current DC: hypatia.nevis.columbia.edu - partition with quorum
    Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
    2 Nodes configured, unknown expected votes
    41 Resources configured.
    ============
     
    Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
     
     Master/Slave Set: AdminClone [AdminDrbd]
         Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
    StonithHypatia  (stonith:fence_nut):    Started orestes.nevis.columbia.edu
    StonithOrestes  (stonith:fence_nut):    Started hypatia.nevis.columbia.edu
     Clone Set: ExportsClone [ExportsGroup]
         Started: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
     
    Failed actions:
        ExportMail:0_monitor_0 (node=hypatia.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error
        ExportMail:0_monitor_0 (node=orestes.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error

The first step is to try to clear the error message. In this case, the resource displaying the error is ExportMail. The command to try is

crm resource cleanup [resource-name]

For example, if you saw that VM_franklin was missing from the list or appeared in an error message, you could try crm resource cleanup VM_franklin.

In general, try to clean up resources from the "top" down; that is, if more than one resource is missing or showing an error, clean up the ones nearer the top of the above lists. This more-or-less corresponds to the order in which the resources are started by Pacemaker.
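
If several resources need a cleanup, a small loop saves typing. This is just a sketch: the resource names are examples taken from the listing above, arranged roughly in start-up order, and the sleep gives Pacemaker time to re-probe each resource before you move on to the next one:

# Clean up a few resources in roughly the order Pacemaker starts them
for r in AdminClone ExportsClone FilesystemClone VM_franklin; do
    /usr/sbin/crm resource cleanup $r
    sleep 60
done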

DRBD problems

Suppose you see something like this:

============
Last updated: Tue Mar 27 17:25:36 2012
Last change: Tue Mar 27 17:03:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu   - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
58 Resources configured.
============

Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]

 Master/Slave Set: AdminClone [AdminDrbd]
     Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
StonithHypatia  (stonith:fence_nut):    Started orestes.nevis.columbia.edu
StonithOrestes  (stonith:fence_nut):    Started hypatia.nevis.columbia.edu

The DRBD resource is running, but both nodes are in the slave state. This means that the DRBD disks on the two nodes are synchronized, but neither node has mounted the disk.

This is not necessarily an emergency; it may just mean a wait of up to 15 minutes. This normally occurs when both systems have just come up and it takes a while for the DRBD disks to synchronize between the two systems. You can try to speed things up with the command

crm resource cleanup AdminClone

You can check on the status of DRBD on both systems. The command is

cat /proc/drbd

If everything is normal, the output looks like this:

 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0

If one or both systems are DRBD slaves, they'll be labeled as "Secondary"; e.g.:

 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0

You may even catch the systems in the act of syncing, if you're recovering from one system being down:

 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:2184 nr:0 dw:0 dr:2472 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10064
        [=====>..............] sync'ed: 33.4% (10064/12248)K
        finish: 0:00:37 speed: 240 (240) K/sec

All of the above are relatively normal. Pacemaker should eventually recover without any further intervention.
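
If a re-sync is under way, it's convenient to watch its progress rather than re-typing the command; hit Ctrl-C when you've seen enough:

# Refresh the DRBD status every 10 seconds
watch -n 10 cat /proc/drbd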

However, it's much more serious if the output of crm status looks something like this:

============
Last updated: Mon Mar  5 11:39:44 2012
Last change: Mon Mar  5 11:37:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, unknown expected votes
41 Resources configured.
============
 
Node orestes.nevis.columbia.edu: UNCLEAN (online)
OFFLINE: [ hypatia.nevis.columbia.edu ]

This is bad. Nothing is running. One of the systems (hypatia) has been STONITHed. The other thinks that it can't join the HA cluster. The DRBD output from cat /proc/drbd probably looks something like this:

0: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

Here's what probably happened: Both nodes went down. Only one node has come back up even partially. The drbd service is not sure whether its copy of the disk is up-to-date; it can't tell for certain until it can talk with the other node. With the other node down, drbd can't do that. The result is that cluster resource management hangs indefinitely.

This is the reason why random power-cycling of the nodes is a bad idea.

This problem is supposed to fix itself automatically. But if you see a node with an UNCLEAN status, pacemaker has already tried and failed. The next step is for you to fix the DRBD status manually.

DRBD manual repair

Stop pacemaker

On the UNCLEAN node(s), turn off pacemaker:

/sbin/service pacemaker stop

If pacemaker hangs while stopping

If more than two minutes have elapsed, and the script is still printing out periods, then you'll have to force-quit the pacemaker service:

Ctrl-C # to stop the pacemaker script
killall -v -HUP pacemakerd
# Check that there are no surviving pacemaker processes.
# If there are, use "kill" on them.
ps -elf | grep pacemaker
ps -elf | grep pcmk
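
If stray processes do survive, something along these lines will get rid of them. This is only a sketch; pkill matches on the full command line, so compare the ps output before and after to make sure you killed only what you intended:

# Forcibly kill anything whose command line mentions pacemaker or pcmk,
# then confirm that nothing is left
pkill -9 -f pacemaker
pkill -9 -f pcmk
ps -elf | egrep 'pacemaker|pcmk' | grep -v egrep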

Disable HA services at startup

/sbin/chkconfig pacemaker off
/sbin/chkconfig clvmd off
/sbin/chkconfig cman off
# Reboot the system:
/sbin/shutdown -r now

If a node is down, boot it up in single-user mode to disable its HA services

  • Turn on its UPS if it's off. This will probably turn on the node. If not, power it on at its front panel.
  • Rush over to its console.
  • At the screen where it says "Booting Scientific Linux..." quickly hit a key; you have only three seconds to do this!
  • Hit "e" to edit the boot commands.
  • Use the down-arrow key to move the cursor to the "kernel" line.
  • Hit "e" to edit the line.
  • Add the word single to the end of the line, with a space to separate it from the word "quiet".
  • Hit RETURN to exit edit mode and go back to the boot screen.
  • Hit "b" to boot the system.
  • The system will come up in single-user mode.
  • Disable HA services:
    /sbin/chkconfig pacemaker off
    /sbin/chkconfig clvmd off
    /sbin/chkconfig cman off
    # Continue with the normal boot process:
    exit
    

Start drbd manually

On both systems, start drbd. Type the following command on one node, then switch to the other node as quickly as possible and type it again:

/sbin/service drbd start

If one node is still down because of a hardware problem (e.g., two or more RAID drives have failed; power supply has blown; motherboard is fried) then you'll get a prompt on the surviving node; answer "yes" to disable waiting.

Fix any problems

On each node, check the DRBD status:

cat /proc/drbd

Hopefully DRBD will be able to resolve the problem on its own; you may see messages about the partitions syncing as shown above. Eventually you'll see the partition(s) have a status of Secondary.

If not, you'll have to do web searches to fix the problem.

The name of the DRBD resource is admin (assigned in /etc/drbd.d/admin.res).
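
Since the resource is named admin, you can also ask the DRBD user tools for its state directly; this reports the same information as /proc/drbd, one item at a time:

# Connection state, local/peer roles, and disk states for the admin resource
/sbin/drbdadm cstate admin
/sbin/drbdadm role admin
/sbin/drbdadm dstate admin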

Once DRBD is working, on both nodes activate the HA resources at startup:

/sbin/chkconfig pacemaker on
/sbin/chkconfig clvmd on
/sbin/chkconfig cman on

Reboot the node(s). After issuing this command on one node, issue it on the other node as quickly as possible:

/sbin/shutdown -r now
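
If you're working from a third machine that has root ssh access to both nodes, a short loop keeps the gap between the two reboots as small as possible. This is only a convenience sketch; typing the command by hand on each node works just as well:

# Reboot both HA nodes back-to-back
for node in hypatia orestes; do
    ssh -x root@$node "/sbin/shutdown -r now" &
done
wait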

Virtual machines

The following tips can help if a virtual machine appears to be hung, or is constantly failing to run on either node.

Note that we've had problems with virtual machines that have had nothing to do with the high-availability software. For example, the mail server has failed due to an error made in the DNS configuration file on hermes; another time it failed due to an overnight upgrade to dovecot, the IMAP server program. These problems would have occurred even if franklin were a separate box, instead of a virtual domain managed by pacemaker. Don't be too quick to blame HA!

Names

As of Jan-2013, these are the virtual machines running on the HA cluster:

VM          Function
franklin    mail server
ada         web server (including calendars, wiki, ELOG)
sullivan    mailing-list server
tango       SAMBA server (the admin staff calls this the "shared server")
hogwarts    home directories for staff accounts
nagios      Linux cluster monitoring
proxy       web proxy, occasionally used by Nevis personnel at CERN
wordpress   Wordpress server; as of Jan-2013 only used by VERITAS

The domain name of the virtual machine (what you see if you enter the virsh list command on hypatia or orestes) is the same as its IP name; e.g., if the mail server happens to be running on hypatia and you want to reboot it, you can log in to hypatia and type virsh reboot franklin, or you can ssh root@franklin and reboot the system from the command line. The name of the pacemaker resource that controls the virtual machine is the same as the domain name prefixed by VM_; e.g., if you want to use pacemaker to restart the mail server, you can type crm resource restart VM_franklin on either HA node, without knowing on which system it's running.
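
Putting the naming convention together, here's what a typical check looks like on hypatia or orestes, using the mail server as the example:

# libvirt's view: the domains running on this node
/usr/sbin/virsh list
# Pacemaker's view: the VM_ resource corresponding to the franklin domain
/usr/sbin/crm resource status VM_franklin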

Restarting a VM

If you want to reboot a virtual machine named XXX, you can do so by logging into it:

ssh -x root@XXX "/sbin/shutdown -r now"

You can tell the HA cluster to reboot it:

ssh -x root@hypatia "/usr/sbin/crm resource restart VM_XXX"

If XXX appears to be hung, you can "pull the plug" on the virtual machine. Pacemaker will automatically "power it on" again when it detects that XXX is down.

# Find out on which node the VM resource is running.
ssh -x root@hypatia "/usr/sbin/crm resource status VM_XXX"
# You'll get a response like:
resource VM_XXX is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy XXX"

Logging in

All the virtual machines have sshd enabled, so one can ssh as root if necessary. But this doesn't let you see the system console, which often displays messages that you can't see via ssh.

If you want to see a virtual machine's console screen in an X-window, there are a couple of ways. The first is to use virt-viewer. For example, to see the mail server console:

# On which node is the VM running?
ssh root@hypatia "/usr/sbin/crm resource status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Login to that system if you're not already there.
ssh root@[system-name]
virt-viewer franklin &

If you reboot the VM, the X-window with the console will probably disappear and virt-viewer will exit. If you try to start it up again immediately, you'll probably get a "domain not found" message, since it takes a few seconds to restart. To get around this, use the --wait option; e.g.,

virt-viewer --wait franklin &
The viewer will wait until the VM starts and then display the console.

The other way is to log in as root to hypatia or orestes from a console and use the virt-manager command. On each node, this has been configured to show the virtual domains on both nodes. Double-click on a domain to see its console. Be careful! Don't accidentally start a virtual domain on both nodes at once; that will create a bigger problem than you're trying to solve.

When using either virt-viewer or virt-manager, when you click on the screen it will "grab" the cursor. To gain control of the mouse again, hit the Ctrl-Alt keys. You'll see messages prefixed with atkbd.c; ignore them.

When you first see the console, it may be asleep (black). Just hit the RETURN key to activate it.

Watching the console error messages as a VM reboots can be the key to diagnosing a problem. The following sections discuss how to repair a corrupted disk image. Another problem I've seen is that a directory was not being correctly exported by a node as the VM started; usually you'd have had to repair an "export" resource as described above. In that case, simply rebooting the VM should fix the problem.

Repairing a virtual disk

The most common HA-related problem with a virtual machine is that the virtual hard disk has become corrupted due to a sudden outage, whether real or virtual. Fortunately, it's easier to repair this problem on a virtual disk than it would be on a real one.

The most frequent sign of disk corruption is that the virtual OS can only mount the disk in read-only mode. If you're watching the console as the VM boots, you'll see this quickly; there'll be a large number of "cannot write to read-only directory" messages. Another way to test this is to log in to the VM:

ssh XXX
touch /tmp/test
If the touch command displays an error message, then the disk has become corrupted.
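
Another quick check from inside the VM is to look at the mount options directly; any filesystem whose options begin with ro has been remounted read-only. A sketch using /proc/mounts:

# List filesystems currently mounted read-only (prints nothing if all is well)
awk '$4 ~ /(^|,)ro(,|$)/ { print $1, $2, $4 }' /proc/mounts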

There are two ways to repair a broken virtual disk:

fsck

Reboot the virtual machine (e.g., /sbin/shutdown -r now) while looking at its console with virt-viewer or virt-manager. As it goes through the boot process, it will detect the disk corruption and prompt for the root password to perform maintenance. Enter it, and you'll be in a special "repair" shell. Try to fix the virtual disk with fsck:

/sbin/fsck -y /dev/vda1
# Reboot the system again when done.
/sbin/shutdown -r now

If this doesn't work, the remaining option is to go to the backup copy.

Restore from backup

Every two months, the HA cluster automatically makes a backup of all the virtual machine images. On hypatia or orestes, look at the sub-directories in /work/saves. The names of the directories correspond to the date that the backup was made. You probably want the most recent version, unless you suspect that the disk corruption took place before the backup was made.

The procedure is:

  • shutdown the virtual machine;
  • copy the virtual machine's disk image from /work/saves/XXXXXXXX to /xen/images;
  • start the virtual machine again and watch the console.

Assume the mail server's disk has become corrupted, fsck hasn't fixed the problem, and the most recent backup is in /work/saves/20130101:

# On hypatia or orestes, tell pacemaker to stop the VM
/usr/sbin/crm resource stop VM_franklin
# Check that franklin has completely stopped
/usr/sbin/crm resource status VM_franklin
ssh -x root@hypatia "virsh list"
ssh -x root@orestes "virsh list"
# Move the existing corrupted disk image out of the way.
# (The part in backticks appends the date to the old name.)
mv /xen/images/franklin-os.img /xen/images/franklin-os.img.`date +%Y%m%d`
# Copy the backup
cp -v /work/saves/20130101/franklin-os.img /xen/images/franklin-os.img
# Start the virtual machine again
/usr/sbin/crm resource start VM_franklin
# Look at the console as it boots.
# Warning: There's no guarantee that it will run on the same 
# node as it did before!
virt-viewer --wait franklin &

The reason why you want to look at the console is that even a disk recovered from backup may still generate "unclean partition" error messages. This is because the disk-image backups can be made while the VM is running, and the disk image might be in an inconsistent state. Running fsck on the restored disk image will fix the problem.

Restore any broken configuration files

After either one of the above recovery methods, there may yet be some files that were not recovered or are out-of-date. The nightly backup includes the important directories from the virtual machines (as well as every other box on the Nevis Linux cluster). Log in to shelley and look at the contents of /backup/README for a guide to finding copies of the backed-up files.
