Basic diagnostics for the high-availability cluster
Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, and many hours of downtime due to problems with the HA software.
This mailing-list post has some details.
Here are some basic diagnostic procedures for the hypatia/orestes high-availability (HA) cluster in the Nevis particle-physics Linux cluster.
The whole point of a high-availability cluster is that when things go wrong, resources automatically shuffle between the systems in the cluster so that the users never notice a problem. This means that when things go wrong, they go really wrong, and fixing the problem is usually not as simple as the following procedures suggest.
Of course, the usual first step is to contact WilliamSeligman. These instructions assume that I am not available, and you're just trying to get things functional enough to send/receive mail, access the web services, etc.
General tips:
- ChengyiChi has the root password. It's also in an envelope in a secure location; ask AmyGarwood for access.
- Give these procedures time to work! For example:
- if the web server virtual machine is overloaded, it may take more than 20 minutes to restart on its own;
- if the DRBD partitions have to re-sync, it may take 20-30 minutes for other resources to start.
- These are extreme examples. Normally a delay of 5-10 minutes is typical for a serious-but-recoverable problem.
Simple recipes
Restarting the mail server
If you want to reboot the mail server virtual machine, you can do:
ssh -x root@franklin "/sbin/shutdown -r now"
or
ssh -x root@hypatia "/usr/sbin/crm resource restart VM_franklin"
or (if franklin is stuck)
ssh -x root@hypatia "/usr/sbin/crm resource status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy franklin"
Do NOT reboot hypatia or orestes by itself in the hopes of rebooting the mail server. The most likely outcome is that the rebooted system's UPS will shut down its power.
Restarting the web server
Same as above, replacing franklin with ada.
It's far more common for the web server to become overloaded than the mail server. The problem appears to be web robots (e.g., Google, Yahoo) probing the wiki and getting "stuck". They keep probing the same web address, the wiki software is further delayed with each query, and we have a DoS situation. Usually just restarting apache on the web server is sufficient:
ssh -x root@ada "/sbin/service httpd restart"
Give this a lot of time to complete. I've seen the web server so bogged down with processes that it takes 30 minutes for apache to restart.
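If you want a rough sense of how bogged down the web server is before (or while) restarting apache, you can count the httpd processes. This is just a sketch; what counts as "too many" is a judgment call:
# Count the apache worker processes on the web server.
ssh -x root@ada "ps -C httpd --no-headers | wc -l"
# A count far above what you normally see suggests the robot-induced overload described above.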
Quick-but-drastic
Warning: Before trying this, make sure that the problem is being caused by a failure in the HA cluster software (pacemaker or drbd). Cycling the power on both hypatia and orestes at the same time may fix the problem. Or it may cause the cluster to hang indefinitely. It's usually better to try to understand what the problem is (multiple hard drive failures? bad configuration file?), but if you're rushed for time you can give it a try.
The problems with power-cycling the HA cluster:
- If the cause is a hardware problem or a configuration issue, power-cycling won't solve it.
- This guarantees that the DRBD partition will be rebuilt. It's hard to say how long this will take. You can monitor the status of the rebuild as described below (cat /proc/drbd).
- Cutting the power also "pulls the plug" on all the virtual machines, with all the risks associated with pulling the plug on a physical box. The disk images on the virtual machines can be corrupted, especially if they're halted in the middle of a disk write, which is frequent for the mail and web servers. There's a restore procedure for a damaged virtual machine given below, but it's better if a virtual machine isn't corrupted in the first place.
Some things to check before taking this step:
- Make sure that the UPSes for both hypatia and orestes are on. If either or both are off, turn them on; that might be the cause of the problem.
- Check that the rack-b-switch is powered on and running.
- The power for rack-b-switch comes from bleeker-ups; make sure that's up as well.
If you have no other choice, pray to whatever gods you worship (I prefer Hermes and Hecate), then cycle the power on both hypatia and orestes using the buttons on their front panels.
Pacemaker diagnostics
Login as root to hypatia or orestes. You can see the state of the resources managed by Pacemaker with the command
crm status
Resources are running normally
If things are working, you'll see something like this:
============
Last updated: Tue Jan 8 17:39:02 2013
Last change: Tue Jan 1 00:44:02 2013 via cibadmin on orestes.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
65 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Masters: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
CronAmbientTemperature (ocf::heartbeat:symlink): Started orestes.nevis.columbia.edu
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Resource Group: DhcpGroup
SymlinkDhcpdConf (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkSysconfigDhcpd (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkDhcpdLeases (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
Dhcpd (lsb:dhcpd): Started hypatia.nevis.columbia.edu
IP_dhcp (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: IPClone [IPGroup] (unique)
Resource Group: IPGroup:0
IP_cluster:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_local:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_sandbox:0 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
Resource Group: IPGroup:1
IP_cluster:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_local:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
IP_cluster_sandbox:1 (ocf::heartbeat:IPaddr2): Started orestes.nevis.columbia.edu
Clone Set: LibvirtdClone [LibvirtdGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: TftpClone [TftpGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: ExportsClone [ExportsGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
Clone Set: FilesystemClone [FilesystemGroup]
Started: [ orestes.nevis.columbia.edu hypatia.nevis.columbia.edu ]
VM_proxy (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_hogwarts (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_wordpress (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_tango (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
VM_sullivan (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_ada (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
VM_nagios (ocf::heartbeat:VirtualDomain): Started orestes.nevis.columbia.edu
CronBackupVirtualDiskImages (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
VM_franklin (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
You can compare this with the resource sketch elsewhere in this wiki, and the output from crm configure show. Without going into detail, note that in the above display:
- cloned resources are running on both nodes (exception: both instances of IPClone tend to run on a single node);
- other resources are distributed between the two nodes;
- virtual machine resources such as VM_franklin (the mail server) and VM_ada (the web server) are all running.
Only one node up
If only one node is up, you might see something like this:
============
Last updated: Wed May 16 16:28:21 2012
Last change: Tue May 15 18:43:21 2012 via crmd on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
63 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu ]
OFFLINE: [ orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Masters: [ hypatia.nevis.columbia.edu ]
Stopped: [ AdminDrbd:1 ]
CronAmbientTemperature (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Resource Group: DhcpGroup
SymlinkDhcpdConf (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkSysconfigDhcpd (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
SymlinkDhcpdLeases (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
Dhcpd (lsb:dhcpd): Started hypatia.nevis.columbia.edu
IP_dhcp (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: IPClone [IPGroup] (unique)
Resource Group: IPGroup:0
IP_cluster:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_local:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_sandbox:0 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Resource Group: IPGroup:1
IP_cluster:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_local:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
IP_cluster_sandbox:1 (ocf::heartbeat:IPaddr2): Started hypatia.nevis.columbia.edu
Clone Set: LibvirtdClone [LibvirtdGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ LibvirtdGroup:0 ]
Clone Set: TftpClone [TftpGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ TftpGroup:1 ]
Clone Set: ExportsClone [ExportsGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ ExportsGroup:0 ]
Clone Set: FilesystemClone [FilesystemGroup]
Started: [ hypatia.nevis.columbia.edu ]
Stopped: [ FilesystemGroup:1 ]
VM_proxy (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_hogwarts (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_wordpress (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_tango (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_sullivan (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_ada (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
VM_nagios (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
CronBackupVirtualDiskImages (ocf::heartbeat:symlink): Started hypatia.nevis.columbia.edu
VM_franklin (ocf::heartbeat:VirtualDomain): Started hypatia.nevis.columbia.edu
Note that all the resources are running on a single node (hypatia in this example) and the other node is off-line.
This is not a disaster; this is what the HA software is supposed to do if there's a problem with one of the nodes. As long as everything is running, there's nothing you have to do. Wait until WilliamSeligman is available to fix whatever is wrong with the off-line node.
Problems with a resource
What if you see that a resource is missing from the list; e.g., you don't see VM_franklin, which means the mail server's virtual machine isn't running? Or what if you see an explicit error message, like this:
============
Last updated: Fri Mar 2 17:17:25 2012
Last change: Fri Mar 2 17:12:44 2012 via crm_shadow on orestes.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, unknown expected votes
41 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
Clone Set: ExportsClone [ExportsGroup]
Started: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Failed actions:
ExportMail:0_monitor_0 (node=hypatia.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error
ExportMail:0_monitor_0 (node=orestes.nevis.columbia.edu, call=29, rc=-2, status=Timed Out): unknown exec error
The first step is to try to clear the error message. In this case, the resource displaying the error is ExportMail. The command to try is
crm resource cleanup [resource-name]
For example, if you saw that VM_franklin was missing from the list or appeared in an error message, you could try crm resource cleanup VM_franklin.
In general, try to clean up resources from the "top" down; that is, if more than one resource is missing or showing an error, clean up the ones nearer the top of the above lists. This more-or-less corresponds to the order in which the resources are started by Pacemaker.
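As a hedged illustration of that top-down order, using resource names from the status listing above (the actual set of failed or missing resources will differ):
# Clean up in roughly the order Pacemaker starts things:
# DRBD first, then exports and filesystems, then the virtual machines.
crm resource cleanup AdminClone
crm resource cleanup ExportsClone
crm resource cleanup FilesystemClone
crm resource cleanup VM_franklin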
DRBD problems
Suppose you see something like this:
============
Last updated: Tue Mar 27 17:25:36 2012
Last change: Tue Mar 27 17:03:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: hypatia.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
58 Resources configured.
============
Online: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
Master/Slave Set: AdminClone [AdminDrbd]
Slaves: [ hypatia.nevis.columbia.edu orestes.nevis.columbia.edu ]
StonithHypatia (stonith:fence_nut): Started orestes.nevis.columbia.edu
StonithOrestes (stonith:fence_nut): Started hypatia.nevis.columbia.edu
The DRBD resource is running, but both nodes are in the slave state. This means that the DRBD disks on the two nodes are synchronized, but neither node has mounted the disk.
This is not necessarily an emergency; it may just mean a wait of up to 15 minutes. This normally occurs if both systems have come up, and it took a while for the DRBD disks to synchronize between the two systems. You can try to speed things up with the command
crm resource cleanup AdminClone
You can check on the status of DRBD on both systems. The command is
cat /proc/drbd
If everything is normal, the output looks like this:
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0
If one or both systems are DRBD secondaries, they'll be labeled as "Secondary"; e.g.:
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:652411770 nr:1207301063 dw:1859712813 dr:191729621 al:34135 bm:1655 lo:5 pe:0 ua:6 ap:0 ep:2 wo:b oos:0
You may even catch the systems in the act of syncing, if you're recovering from one system being down:
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
ns:2184 nr:0 dw:0 dr:2472 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10064
[=====>..............] sync'ed: 33.4% (10064/12248)K
finish: 0:00:37 speed: 240 (240) K/sec
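If you want to follow a re-sync without retyping the command, the standard watch utility works; a minimal sketch:
# Refresh the DRBD status every 10 seconds; Ctrl-C to quit.
watch -n 10 cat /proc/drbd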
All of the above are relatively normal. Pacemaker should eventually recover without any further intervention.
However, it's much more serious if the output of crm status looks something like this:
============
Last updated: Mon Mar 5 11:39:44 2012
Last change: Mon Mar 5 11:37:23 2012 via cibadmin on hypatia.nevis.columbia.edu
Stack: cman
Current DC: orestes.nevis.columbia.edu - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, unknown expected votes
41 Resources configured.
============
Node orestes.nevis.columbia.edu: UNCLEAN (online)
OFFLINE: [ hypatia.nevis.columbia.edu ]
This is bad. Nothing is running. One of the systems (hypatia) has been STONITHed. The other thinks that it can't join the HA cluster. The DRBD output from cat /proc/drbd probably looks something like this:
0: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
Here's what probably happened: Both nodes went down. Only one node has come back up even partially. The drbd service is not sure whether its copy of the disk is up-to-date; it can't tell for certain until it can talk with the other node. With the other node down, drbd can't do that. The result is that cluster resource management hangs indefinitely.
This is the reason why random power-cycling of the nodes is a bad idea.
This problem is supposed to fix itself automatically. But if you see a node with an UNCLEAN status, pacemaker has already tried and failed. The next step is for you to fix the DRBD status manually.
DRBD manual repair
Stop pacemaker
On the UNCLEAN node(s), turn off pacemaker:
/sbin/service pacemaker stop
If pacemaker hangs while stopping
If more than two minutes have elapsed, and the script is still printing out periods, then you'll have to force-quit the pacemaker service:
Ctrl-C # to stop the pacemaker script
killall -v -HUP pacemakerd
# Check that there are no surviving pacemaker processes.
# If there are, use "kill" on them.
ps -elf | grep pacemaker
ps -elf | grep pcmk
Disable HA services at startup
/sbin/chkconfig pacemaker off
/sbin/chkconfig clvmd off
/sbin/chkconfig cman off
# Reboot the system:
/sbin/shutdown -r now
If a node is down, boot it up in single-user mode to disable its HA services; a sketch of what that looks like follows.
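Here's a sketch of that single-user-mode variant, assuming the usual CentOS 6 setup on these nodes (interrupt GRUB at boot, append the word single to the kernel line, and continue booting):
# Once you have a root shell in single-user mode, disable the HA services:
/sbin/chkconfig pacemaker off
/sbin/chkconfig clvmd off
/sbin/chkconfig cman off
# Then reboot into the normal runlevel with the HA services disabled:
/sbin/shutdown -r now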
Start drbd manually
On both systems, start drbd. Type the following command on one node, then switch to the other node as quickly as possible and type it again:
/sbin/service drbd start
If one node is still down because of a hardware problem (e.g., two or more RAID drives have failed; power supply has blown; motherboard is fried) then you'll get a prompt on the surviving node; answer "yes" to disable waiting.
Fix any problems
On each node, check the DRBD status:
cat /proc/drbd
Hopefully DRBD will be able to resolve the problem on its own; you may see messages about the partitions syncing as shown above. Eventually you'll see the partition(s) have a status of Secondary.
If not, you'll have to do web searches to fix the problem. When searching, note that the name of the DRBD resource is admin (assigned in /etc/drbd.d/admin.res).
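One problem those searches often turn up is a DRBD "split brain", in which each node refuses to connect because it thinks its own copy is authoritative. The following is only a sketch of the classic DRBD 8.x recovery, not something specific to this cluster; only use it if you're certain which node holds the good data, because the other node's changes are discarded:
# On the node whose changes you are willing to THROW AWAY:
/sbin/drbdadm secondary admin
/sbin/drbdadm -- --discard-my-data connect admin
# On the node with the good copy of the data:
/sbin/drbdadm connect admin
# Then watch the re-sync:
cat /proc/drbd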
Once DRBD is working, re-enable the HA services at startup on both nodes:
/sbin/chkconfig pacemaker on
/sbin/chkconfig clvmd on
/sbin/chkconfig cman on
Reboot the node(s). After issuing this command on one node, issue it on the other node as quickly as possible:
/sbin/shutdown -r now
Virtual machines
The following tips can help if a virtual machine appears to be hung, or is constantly failing to run on either node.
Note that we've had problems with virtual machines that have had nothing to do with the high-availability software. For example, the mail server has failed due to an error made in the DNS configuration file on hermes; another time it failed due to an overnight upgrade to dovecot, the IMAP server program. These problems would have occurred even if franklin was a separate box, instead of a virtual domain managed by pacemaker. Don't be too quick to blame HA!
Names
As of Jan-2013, these are the virtual machines running on the HA cluster:
VM | function
franklin | mail server
ada | web server (including calendars, wiki, ELOG)
sullivan | mailing-list server
tango | SAMBA server (the admin staff calls this the "shared server")
hogwarts | home directories for staff accounts
nagios | Linux cluster monitoring
proxy | web proxy, occasionally used by Nevis personnel at CERN
wordpress | Wordpress server; as of Jan-2013 only used by VERITAS
The domain name of the virtual machine (what you see if you enter the virsh list command on hypatia or orestes) is the same as the IP name; e.g., if the mail server happens to be running on hypatia and you want to reboot it, you can login to hypatia and type virsh reboot franklin, or you can ssh root@franklin and reboot the system from the command line. The name of the pacemaker resource that controls the virtual machine is the same as the domain name prefixed by VM_; e.g., if you want to use pacemaker to restart the mail server, you can type crm resource restart VM_franklin on either HA node, without knowing on which system it's running.
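To make the naming convention concrete, here are the three equivalent handles for the mail server (any other VM in the table above works the same way):
# libvirt domain name, on whichever node is hosting the VM:
virsh reboot franklin
# IP name, from anywhere on the Nevis network:
ssh root@franklin
# pacemaker resource name, on either HA node:
crm resource restart VM_franklin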
Restarting a VM
If you want to reboot a virtual machine named XXX, you can do so by logging into it:
ssh -x root@XXX "/sbin/shutdown -r now"
Or you can tell the HA cluster to reboot it:
ssh -x root@hypatia "/usr/sbin/crm resource restart VM_XXX"
If XXX appears to be hung, you can "pull the plug" on the virtual machine. Pacemaker will automatically "power it on" again when it detects that XXX is down.
# Find out on which node the VM resource is running.
ssh -x root@hypatia "/usr/sbin/crm resource status VM_XXX"
# You'll get a response like:
resource VM_XXX is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy XXX"
Logging in
All the virtual machines have sshd enabled, so one can ssh as root if necessary. But this doesn't let you see the system console, which often displays messages that you can't see via ssh.
If you want to see a virtual machine's console screen in an X-window, there are a couple of ways. The first is to use virt-viewer. For example, to see the mail server console:
# On which node is the VM running?
ssh root@hypatia "/usr/sbin/crm status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Login to that system if you're not already there.
ssh root@[system-name]
virt-viewer franklin &
If you reboot the VM, the X-window with the console will probably disappear and virt-viewer will exit. If you try to start it up again immediately, you'll probably get a "domain not found" message, since it takes a few seconds for the VM to restart. To get around this, use the --wait option; e.g.,
virt-viewer --wait franklin &
The viewer will wait until the VM starts and then display the console.
The other way is to get a console by logging in as root to hypatia or orestes and using the virt-manager command. On each node, this has been configured to show the virtual domains on both nodes. Double-click on a domain to see its console. Be careful! Don't accidentally start a virtual domain on both nodes at once; that will create a bigger problem than you're trying to solve.
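A minimal sketch of that second route (the -X option forwards the GUI to your local display; it assumes you're sitting at a machine running an X server):
# Log in with X forwarding and start the graphical manager:
ssh -X root@hypatia
virt-manager &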
When using either virt-viewer or virt-manager, clicking on the screen will "grab" the cursor. To gain control of the mouse again, hit the Ctrl-Alt keys. You'll see messages prefixed with atkbd.c; ignore them.
When you first see the console, it may be asleep (black). Just hit the RETURN key to activate it.
Watching the console error messages as a VM reboots can be the key to diagnosing a problem. The following sections discuss how to repair a corrupted disk image. Another problem I've seen is that a directory was not being correctly exported by a node as the VM started; usually you'd have had to repair an "export" resource as described above. In that case, simply rebooting the VM should fix the problem.
Repairing a virtual disk
The most common HA-related problem with a virtual machine is that the virtual hard disk has become corrupted by a sudden outage, whether real or virtual. Fortunately, it's easier to repair this problem on a virtual disk than it would be on a real one.
The most frequent sign of disk corruption is that the virtual OS can only mount the disk in read-only mode. If you're watching the console as the VM boots, you'll see this quickly; there'll be a large number of "cannot write to read-only directory" messages. Another way to test this is to log into the VM:
ssh XXX
touch /tmp/test
If the touch command displays an error message, then the disk has become corrupted.
There are two ways to repair a broken virtual disk:
fsck
Reboot the virtual machine (e.g., /sbin/shutdown -r now) while looking at its console with virt-viewer or virt-manager. As it goes through the boot process, it will detect the disk corruption and prompt for the root password to perform maintenance. Enter it, and you'll be in a special "repair" shell. Try to fix the virtual disk with fsck:
/sbin/fsck -y /dev/vda1
# Reboot the system again when done.
/sbin/shutdown -r now
If this doesn't work, the remaining option is to go to the backup copy.
Restore from backup
Every two months, the HA cluster automatically makes a backup of all the virtual machine images. On hypatia or orestes, look at the sub-directories in /work/saves. The names of the directories correspond to the date that the backup was made. You probably want the most recent version, unless you suspect that the disk corruption took place before the backup was made.
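A quick way to see which backup is the most recent (a sketch; the directory names are dates, as in the 20130101 example below):
# List the backup directories, newest first:
ls -lt /work/saves | head
# List the disk images inside a particular backup:
ls -lh /work/saves/20130101/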
The procedure is:
- shutdown the virtual machine;
- copy the virtual machine's disk image from /work/saves/XXXXXXXX to /xen/images;
- start the virtual machine again and watch the console.
Assume the mail server's disk has become corrupted, fsck hasn't fixed the problem, and the most recent backup is in /work/saves/20130101:
# On hypatia or orestes, tell pacemaker to stop the VM
/usr/sbin/crm resource stop VM_franklin
# Check that franklin has completely stopped
/usr/sbin/crm resource status VM_franklin
ssh -x root@hypatia "virsh list"
ssh -x root@orestes "virsh list"
# Move the existing corrupted disk image out of the way.
# (The part in backticks appends the date to the old name.)
mv /xen/images/franklin-os.img /xen/images/franklin-os.img.`date +%Y%m%d`
# Copy the backup
cp -v /work/saves/20130101/franklin-os.img /xen/images/franklin-os.img
# Start the virtual machine again
/usr/sbin/crm resource start VM_franklin
# Look at the console as it boots.
# Warning: There's no guarantee that it will run on the same
# node as it did before!
virt-viewer --wait franklin &
The reason why you want to look at the console is that even a disk recovered from backup may still generate "unclean partition" error messages. This is because the disk-image backups can be made while the VM is running, and the disk image might be in an inconsistent state. Running fsck on the restored disk image will fix the problem.
Restore any broken configuration files
After either one of the above recovery methods, there may yet be some files that were not recovered or are out-of-date. The nightly backup includes the important directories from the virtual machines (as well as every other box on the Nevis Linux cluster). Login to shelley and look at the contents of /backup/README for a guide on finding the copies of the backed-up files.
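For example (a minimal sketch):
# Log into the backup server and read the restore instructions:
ssh root@shelley
less /backup/README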