Difference: PacemakerDiagnostics (7 vs. 8)

Revision 8, 2013-01-11 - WilliamSeligman


Basic diagnostics for the high-availability cluster

 

Quick-but-drastic

Warning: Before trying this, make sure that the problem is being caused by a failure in the HA cluster software (pacemaker or drbd). Cycling the power on both hypatia and orestes at the same time may fix the problem, or it may cause the cluster to hang indefinitely. It's usually better to try to understand what the problem is (multiple hard-drive failures? a bad configuration file?), but if you're rushed for time you can give it a try.
 
The problems with power-cycling the HA cluster:
 
  • If the cause is a hardware problem or a configuration issue, power-cycling won't solve it.
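Before resorting to a power cycle, it's worth a minute to check whether pacemaker or drbd is actually the component that failed. A minimal sketch using the standard status tools (run as root on hypatia or orestes; the healthy-state descriptions in the comments are what you'd normally expect, not guaranteed output):

```shell
# One-shot cluster status: look for stopped resources, failed actions,
# or a node reported offline.
crm_mon -1

# DRBD status: in a healthy cluster both resources normally show
# "Connected" with disk states "UpToDate/UpToDate".
cat /proc/drbd

# A fuller view of the cluster configuration and resource placement.
crm status
```

If both of these look clean, the problem is probably elsewhere (hardware, DNS, a service inside a VM) and power-cycling the cluster won't help.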

 

Virtual machines

The following tips can help if a virtual machine appears to be hung, or is constantly failing to run on either node.

Note that we've had problems with virtual machines that have had nothing to do with the high-availability software. For example, the mail server has failed due to an error made in the DNS configuration file on hermes; another time it failed due to an overnight upgrade to dovecot, the IMAP server program. These problems would have occurred even if franklin were a separate box, instead of a virtual domain managed by pacemaker. Don't be too quick to blame HA!

Names

 As of Jan-2013, these are the virtual machines running on the HA cluster:

VM | function
proxy | web proxy, occasionally used by Nevis personnel at CERN
wordpress | Wordpress server; as of Jan-2013 only used by VERITAS
The domain name of the virtual machine (what you see if you enter the virsh list command on hypatia or orestes) is the same as the IP name; e.g., if the mail server happens to be running on hypatia and you want to reboot it, you can login to hypatia and type virsh reboot franklin, or you can ssh root@franklin and reboot the system from the command line. The name of the pacemaker resource that controls the virtual machine is the same as the domain name prefixed by VM_; e.g., if you want to use pacemaker to restart the mail server, you can type crm resource restart VM_franklin on either HA node, without knowing on which system it's running.
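To make the naming convention concrete, here is a short sketch using the mail server as the example (substitute any VM's name; the first two commands must run on the node currently hosting the domain):

```shell
# List the virtual domains on this host; the domain name matches the IP name.
virsh list --all

# Reboot the mail server directly through libvirt...
virsh reboot franklin

# ...or let pacemaker do it from either HA node; the resource name is the
# domain name with the VM_ prefix.
crm resource restart VM_franklin
```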

Restarting a VM

If you want to reboot a virtual machine named XXX, you can do so by logging into it:

ssh -x root@XXX "/sbin/shutdown -r now"

You can tell the HA cluster to reboot it:

ssh -x root@hypatia "/usr/sbin/crm resource restart VM_XXX"

If XXX appears to be hung, you can "pull the plug" on the virtual machine. Pacemaker will automatically "power it on" again when it detects that XXX is down.

# Find out on which node the VM resource is running.
ssh -x root@hypatia "/usr/sbin/crm resource status VM_XXX"
# You'll get a response like:
resource VM_XXX is running on: [system-name]
# Then "power-cycle" the virtual machine on the host on which it's running
ssh -x root@[system-name] "/usr/sbin/virsh destroy XXX"

Logging in

All the virtual machines have sshd enabled, so one can ssh as root if necessary. But this doesn't let you see the system console, which often displays messages that you can't see via ssh.

If you want to see a virtual machine's console screen in an X-window, there are a couple of ways. The first is to use virt-viewer. For example, to see the mail server console:

# On which node is the VM running?
ssh root@hypatia "/usr/sbin/crm resource status VM_franklin"
# You'll get a response like:
resource VM_franklin is running on: [system-name]
# Login to that system if you're not already there.
ssh root@[system-name]
virt-viewer franklin &

If you reboot the VM, the X-window with the console will probably disappear and virt-viewer will exit. If you try to start it again immediately, you'll probably get a "domain not found" message, since it takes a few seconds for the domain to restart. To get around this, use the --wait option; e.g.,

virt-viewer --wait franklin &

The viewer will wait until the VM starts and then display the console.

The other way is to get a console, log in as root to hypatia or orestes, and use the virt-manager command. On each node, this has been configured to show the virtual domains on both nodes. Double-click on a domain to see its console. Be careful! Don't accidentally start a virtual domain on both nodes at once; that will create a bigger problem than the one you're trying to solve.

When using either virt-viewer or virt-manager, clicking on the screen will "grab" the cursor. To regain control of the mouse, hit the Ctrl-Alt keys. You'll see messages prefixed with atkbd.c; ignore them.

When you first see the console, it may be asleep (black). Just hit the RETURN key to activate it.

Repairing a virtual disk

The most common HA-related problem with a virtual machine is that the virtual hard disk has become corrupted due to a sudden outage, whether real or virtual. Fortunately, it's easier to repair this problem on a virtual disk than it would be on a real one.

The most frequent sign of disk corruption is that the virtual OS can only mount the disk in read-only mode. If you're watching the console as the VM boots, you'll see this quickly; there'll be a large number of "cannot write to read-only directory" messages. Another way to test this is to log into the VM:

ssh XXX
touch /tmp/test

If the touch command displays an error message, then the disk has become corrupted.
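A minimal repair sketch from the host side, assuming the guest filesystem sits directly on a backing device visible to the host (the path /dev/mapper/vg-XXX below is hypothetical; use the actual disk source reported by virsh dumpxml):

```shell
# Stop the VM through pacemaker so it isn't restarted mid-repair.
crm resource stop VM_XXX

# Find the VM's backing disk on the host.
virsh dumpxml XXX | grep "source"

# Check and repair the filesystem on the backing store.
# -f forces a check even if the filesystem seems clean; -y answers
# "yes" to all repair prompts.
fsck -f -y /dev/mapper/vg-XXX

# Let pacemaker start the VM again.
crm resource start VM_XXX
```

This is only a sketch of the general approach, not a tested procedure for this cluster; if the guest disk contains its own partition table or LVM volumes, you'd need to map those out (e.g., with kpartx) before running fsck.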
 