Difference: PacemakerDiagnostics (8 vs. 9)

Revision 9 - 2013-01-14 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Basic diagnostics for the high-availability cluster

Line: 8 to 8
  Here are some basic diagnostic procedures on how to fix problems on the hypatia/orestes high-availability (HA) cluster in the Nevis particle-physics Linux cluster.
Changed:
<
<
The whole point of a high-availability cluster is that when things go wrong, resources are automatically shuffled between the systems in the cluster so that the users never notice a problem. This means that when things go wrong, they go really wrong, and fixing the problem is usually not as simple as the following procedures suggest.
>
>
The whole point of a high-availability cluster is that when things go wrong, resources automatically shuffle between the systems in the cluster so that the users never notice a problem. This means that when things go wrong, they go really wrong, and fixing the problem is usually not as simple as the following procedures suggest.
  Of course, the usual first step is to contact WilliamSeligman. These instructions assume that I am not available, and you're just trying to get things functional enough to send/receive mail, access the web services, etc.
Line: 80 to 80
 

Pacemaker diagnostics

Changed:
<
<
You can see the state of the resources managed by Pacemaker with the command
crm status
>
>
Log in as root on hypatia or orestes. You can see the state of the resources managed by Pacemaker with the command
crm status
 

Resources are running normally

Line: 321 to 321
 OFFLINE: [ hypatia.nevis.columbia.edu ]
Changed:
<
<
Here nothing is running. One of the systems (hypatia) has been STONITHed. The other thinks that it can't join the HA cluster. The DRBD output from cat /proc/drbd probably looks something like this:
>
>
This is bad. Nothing is running. One of the systems (hypatia) has been STONITHed. The other thinks that it can't join the HA cluster. The DRBD output from cat /proc/drbd probably looks something like this:
 
0: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
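To read a status line like the one above at a glance, the fields can be split out with a small shell function. This is only a sketch: the name drbd_fields is made up for illustration, and it handles only the cs:, st:, and ds: fields that appear in the example line.

```shell
# Hypothetical helper (not part of DRBD): split one /proc/drbd status
# line into labeled fields.
drbd_fields() {
    # $1 is a single status line; it is word-split on whitespace here.
    for field in $1; do
        case "$field" in
            cs:*) echo "connection: ${field#cs:}" ;;
            st:*) echo "roles (local/peer): ${field#st:}" ;;
            ds:*) echo "disk states (local/peer): ${field#ds:}" ;;
        esac
    done
}

# The example line from above, passed in as a literal string.
drbd_fields "0: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---"
```

For the example line, this reports a connection state of WFConnection (the node is waiting for its peer to appear), a local role of Secondary, and a local disk state of Inconsistent.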
Line: 489 to 489
  When you first see the console, it may be asleep (black). Just hit the RETURN key to activate it.
Changed:
<
<

Repairng a virtual disk

>
>
Watching the console error messages as a VM reboots can be the key to diagnosing a problem. The following sections discuss how to repair a corrupted disk image. Another problem I've seen is that a directory was not being correctly exported by a node as the VM started; usually you'd have had to repair an "export" resource as described above. In that case, simply rebooting the VM should fix the problem.

Repairing a virtual disk

  The most common HA-related problem with a virtual machine is that the virtual hard disk has become corrupted due to a sudden outage, whether real or virtual. Fortunately, it's easier to repair this problem on a virtual machine than it would be on a real one.
Line: 499 to 501
 touch /tmp/test

If the touch command displays an error message, then the disk has become corrupted.
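A minimal sketch of the same test wrapped in a shell function, so that the result is reported explicitly. The function name check_writable is an illustration only; the default path /tmp/test is the one used above.

```shell
# Hypothetical helper: report whether a test file can be created.
check_writable() {
    testfile="${1:-/tmp/test}"
    if touch "$testfile" 2>/dev/null; then
        echo "OK: $testfile could be written"
    else
        echo "ERROR: could not write $testfile; the disk may be corrupted"
        return 1
    fi
}

# Run the check inside the virtual machine you suspect.
check_writable /tmp/test
```

A failed write is only a hint, not proof of corruption; a full disk or a read-only mount for some other reason would produce the same error.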
Added:
>
>
There are two ways to repair a broken virtual disk:

fsck

Reboot the virtual machine (e.g., /sbin/shutdown -r now) while looking at its console with virt-viewer or virt-manager. As it goes through the boot process, it will detect the disk corruption and prompt for the root password to perform maintenance. Enter it, and you'll be in a special "repair" shell. Try to fix the virtual disk with fsck:

/sbin/fsck -y /dev/vda1
# Reboot the system again when done.
/sbin/shutdown -r now

If this doesn't work, the remaining option is to go to the backup copy.

Restore from backup

Every two months, the HA cluster automatically makes a backup of all the virtual machine images. On hypatia or orestes, look at the sub-directories in /work/saves. The names of the directories correspond to the date that the backup was made. You probably want the most recent version, unless you suspect that the disk corruption took place before the backup was made.
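Picking the most recent backup directory can be sketched as below. This assumes that the directory names are eight-digit dates of the form YYYYMMDD (as in the 20130101 example used later), so that sorting the names also sorts them by date. The function name latest_backup is made up for illustration.

```shell
# Hypothetical helper: print the newest date-named sub-directory
# of a backup area (default: /work/saves).
latest_backup() {
    savedir="${1:-/work/saves}"
    # Keep only names made of exactly eight digits (assumed YYYYMMDD),
    # then take the last one in sorted order, which is the latest date.
    ls -1 "$savedir" | grep -E '^[0-9]{8}$' | sort | tail -n 1
}
```

On hypatia or orestes you would call it as latest_backup /work/saves. If you suspect the corruption predates that backup, list the directory by hand and choose an earlier date instead.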

The procedure is:

  • shut down the virtual machine;
  • copy the virtual machine's disk image from /work/saves/XXXXXXXX to /xen/images;
  • start the virtual machine again and watch the console.

Assume the mail server's disk has become corrupted, fsck hasn't fixed the problem, and the most recent backup is in /work/saves/20130101:

# On hypatia or orestes, tell pacemaker to stop the VM
/usr/sbin/crm resource stop VM_franklin
# Check that franklin has completely stopped
/usr/sbin/crm resource status VM_franklin
ssh -x root@hypatia "virsh list"
ssh -x root@orestes "virsh list"
# Move the existing corrupted disk image out of the way.
# (The part in backticks appends the date to the old name.)
mv /xen/images/franklin-os.img /xen/images/franklin-os.img.`date +%Y%m%d`
# Copy the backup
cp -v /work/saves/20130101/franklin-os.img /xen/images/franklin-os.img
# Start the virtual machine again
/usr/sbin/crm resource start VM_franklin
# Look at the console as it boots.
# Warning: There's no guarantee that it will run on the same 
# node as it did before!
virt-viewer --wait franklin &
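The "check that franklin has completely stopped" step above relies on reading the virsh list output by eye. A possible way to automate that check is sketched below. The helper name vm_is_listed and the sample text are illustrations, not real cluster output; the sketch assumes the VM name appears in the second column of the virsh list output.

```shell
# Hypothetical helper: read "virsh list" output on stdin and succeed
# only if the named VM appears in the second (Name) column.
vm_is_listed() {
    awk -v vm="$1" '$2 == vm { found = 1 } END { exit !found }'
}

# Made-up sample input for illustration. On the cluster the input
# would come from something like: ssh -x root@hypatia "virsh list"
sample=' Id    Name                           State
----------------------------------------------------
 1     franklin                       running'

if printf '%s\n' "$sample" | vm_is_listed franklin; then
    echo "franklin is still listed; wait before moving its disk image"
else
    echo "franklin is not listed"
fi
```

Run the check against both hypatia and orestes, since the VM may be running on either node.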

The reason to watch the console is that even a disk recovered from backup may still generate "unclean partition" error messages. This is because the disk-image backups can be made while the VM is running, so the disk image might be in an inconsistent state. Running fsck on the restored disk image will fix the problem.

Restore any broken configuration files

After either one of the above recovery methods, there may still be some files that were not recovered or are out of date. The nightly backup includes the important directories from virtual machines (as well as every other box on the Nevis Linux cluster). Log in to shelley and look at the contents of /backup/README for a guide on finding the copies of the backed-up files.

 