Nevis particle-physics administrative cluster configuration
Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, yet we suffered many hours of downtime due to problems with the HA software itself.
This mailing-list post has some details.
This is a reference page. It describes how the high-availability pacemaker configuration was set up on two administrative servers, hypatia and orestes.
Files
Key HA configuration files.
Note: Even in an emergency, there's no reason to edit these files!
/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)
The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://pastebin.com/u/wgseligman and searching for 20130103.
One-time set-up
The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch and the Redhat Cluster Tutorial.
Warning: Do not type any of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.
DRBD set-up
Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:
/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
Then on orestes:
/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
Back to hypatia:
/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd
Keep looking at the contents of /proc/drbd. It will take a while, but eventually the two disks will sync up.
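If you'd rather not keep re-typing the command, the following (not part of the original procedure) re-reads the status every few seconds; exit with Ctrl-C:
watch -n5 cat /proc/drbd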
Back to orestes:
/sbin/drbdadm primary admin
cat /proc/drbd
The result should be something like this:
# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Here's a guide to understanding the contents of /proc/drbd.
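For orientation, a DRBD 8.4 resource file for a dual-primary setup like this one generally looks something like the sketch below. The device, backing partition, addresses, and port shown here are hypothetical; the authoritative admin.res is the pastebin copy listed under Files above.
resource admin {
  # Hypothetical backing device and replication addresses; consult the real
  # /etc/drbd.d/admin.res for the actual values.
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  net {
    protocol C;
    allow-two-primaries yes;   # needed so both nodes can be primary at once
  }
  on hypatia.nevis.columbia.edu {
    address 10.44.7.2:7788;
  }
  on orestes.nevis.columbia.edu {
    address 10.44.7.3:7788;
  }
}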
Clustered LVM setup
Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch and the Redhat Cluster Tutorial for details.
- Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
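For orientation only (the WGS-marked lines in the file itself are the authoritative list of changes), a clustered-LVM configuration of this kind typically sets something like the following in /etc/lvm/lvm.conf:
locking_type = 3                 # cluster-wide locking via clvmd/DLM instead of local file locks
fallback_to_local_locking = 0    # fail rather than silently fall back to single-node locking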
Pacemaker configuration
Commands
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:
crm configure show
To see the status of all the resources:
crm resource status
To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
crm_mon
You can run the above commands via sudo, but you'll have to extend your path; e.g.,
export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon
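Alternatively (a convenience, assuming the tools live in /usr/sbin as the PATH extension above suggests; adjust the path if your install differs), you can give the full path directly:
sudo /usr/sbin/crm_mon
sudo /usr/sbin/crm configure show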
Concepts
This may help as you work your way through the configuration:
crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
cidr_netmask=32 op monitor interval=30s timeout=60s
# Which is composed of
* crm ::= "cluster resource manager", the command we're executing
* primitive ::= The type of resource object that we’re creating.
* MyIPResource ::= Our name for the resource
* IPaddr2 ::= The script to call
* ocf ::= The standard it conforms to
* ip=192.168.85.3 ::= Parameter(s) as name/value pairs
* cidr_netmask ::= netmask; 32 bits means use this exact IP address
* op ::= what follows are options
* monitor interval=30s ::= check every 30 seconds that this resource is working
* timeout ::= how long to wait before you assume an "op" is dead.
How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:
crm ra classes
Based on the result, I looked at:
crm ra list ocf heartbeat
To find out what IPaddr2 parameters I needed, I used:
crm ra meta ocf:heartbeat:IPaddr2
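The same three commands work for any of the resource agents used in the configuration below; for example, to see the parameters accepted by the Filesystem agent that mounts the GFS2 partitions:
crm ra meta ocf:heartbeat:Filesystem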
Initial configuration guide
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/QcxuvfK0.
# The commands ultimately used to configure the high-availability (HA) servers:
# The beginning: make sure pacemaker is running on both hypatia and orestes:
/sbin/service pacemaker status
crm node status
crm resource status
# We'll configure STONITH later (see below)
crm configure property stonith-enabled=false
# Let's continue by entering the crm utility for short sessions. I'm going to
# test groups of commands before I commit them. I omit the "crm configure show"
# and "crm status" commands that I frequently typed in, in order to see that
# everything was correct.
# I also omit some of the standard resource options
# (e.g., '... op monitor interval="20" timeout="40" depth="0" ...') to make the
# commands look simpler. You can see the
# complete list with "crm configure show".
# DRBD is a service that synchronizes the hard drives between two machines.
# When one machine makes any change to the DRBD disk, the other machine
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.
# Start by entering the resource manager.
crm
# Define a "shadow" configuration, to test things without committing them
# to the HA cluster:
cib new drbd
# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res
primitive AdminDrbd ocf:linbit:drbd \
params drbd_resource="admin" \
meta target-role="Master"
# The following resource defines how the DRBD resource (AdminDrbd) is to be
# duplicated ("cloned") among the nodes. The parameters clarify that there are
# two copies, one on each node, and both can be the master.
ms AdminClone AdminDrbd \
meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" \
notify="true" interleave="true"
configure show
# Looks good, so commit the change.
cib commit drbd
quit
# Now define resources that depend on ordering.
crm
cib new disk
# The DRBD disk is available to the system. The next step is to tell LVM
# that the volume group ADMIN exists on the disk.
# To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
# I used the command:
ra classes
# Based on the result, I looked at:
ra list ocf heartbeat
# To find out what LVM parameters I needed, I used:
ra meta ocf:heartbeat:LVM
# All of the above led me to create the following resource configuration:
primitive AdminLvm ocf:heartbeat:LVM \
params volgrpname="ADMIN"
# After I set up the volume group, I want to mount the logical volumes
# (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
# note that I begin all the filesystem resources with FS so they'll be next
# to each other when I type "crm configure show".
primitive FSUsrNevis ocf:heartbeat:Filesystem \
params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" \
options="defaults,noatime,nodiratime"
# I have similar definitions for the other logical volumes in volume group ADMIN:
# /mail, /var/nevis, etc.
# Now I'm going to define a resource group. The following command means:
# - Put all these resources on the same node;
# - Start these resources in the order they're listed;
# - The resources depend on each other in the order they're listed. For example,
# if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.
group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork
# We want these logical volumes (or partitions or filesystems) to be available
# on both nodes. To do this, we define a clone resource.
clone FilesystemClone FilesystemGroup meta interleave="true"
# It's important that we not try to set up the filesystems
# until the DRBD admin resource is running on a node, and has been
# promoted to master.
# A score of "inf:" means "infinity": 'FileSystemClone' must be on a node on which
# 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't
# be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric
# values instead of infinity, in which case these constraints become suggestions
# instead of being mandatory.)
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
cib commit disk
quit
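# (A quick check that was not part of the original transcript: once the "disk"
# changes are committed, the GFS2 partitions should show up as mounted on both nodes.)
mount | grep gfs2
df -h /usr/nevis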
# Once all the filesystems are mounted, we can start other resources. Let's
# define a set of cloned IP addresses that will always point to at least one of the nodes,
# possibly both.
crm
cib new ip
# One address for each network
primitive IP_cluster ocf:heartbeat:IPaddr2 \
params ip="129.236.252.11" cidr_netmask="32" nic="eth0"
primitive IP_cluster_local ocf:heartbeat:IPaddr2 \
params ip="10.44.7.11" cidr_netmask="32" nic="eth2"
primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 \
params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3"
# Group them together
group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox
# The option "globally-unique=true" works with IPTABLES to make
# sure that ethernet connections are not disrupted even if one of the
# nodes goes down; see "Clusters From Scratch" for details.
clone IPClone IPGroup \
meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false"
# Make sure the filesystems are mounted before starting the IP resources.
colocation IP_With_Filesystem inf: IPClone FilesystemClone
order Filesystem_Before_IP inf: FilesystemClone IPClone
cib commit ip
quit
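# (Sanity check, not part of the original transcript: a one-shot status display
# should show the IPClone instances started and spread across the two nodes.)
crm_mon -1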
# We have to export some of the filesystems via NFS before some of the virtual machines
# will be able to run.
crm
cib new exports
# This is an example NFS export resource; I won't list them all here. See
# "crm configure show" for the complete list.
primitive ExportUsrNevis ocf:heartbeat:exportfs \
description="Site-wide applications installed in /usr/nevis" \
params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" \
options="ro,no_root_squash,async"
# Define a group for all the exportfs resources. You can see it's a long list,
# which is why I don't list them all explicitly. I had to be careful
# about the exportfs definitions; despite the locking mechanisms of GFS2,
# we'd get into trouble if two external systems tried to write to the same
# DRBD partition at once via NFS.
group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc \
ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW
# Clone the group so both nodes export the partitions. Make sure the
# filesystems are mounted before we export them.
clone ExportsClone ExportsGroup
colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone
order Filesystem_Before_Exports inf: FilesystemClone ExportsClone
cib commit exports
quit
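# (Another check not in the original transcript: each node's current NFS
# export table should now include the cloned exportfs entries.)
exportfs -v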
# Symlinks: There are some scripts that I want to run under cron. These scripts are
# located in the DRBD /var/nevis file system. For them to run via cron, they have to
# be found in /etc/cron.d somehow. A symlink is the easiest way, and there's a
# symlink pacemaker resource to manage this.
crm
cib new cron
# The ambient-temperature script periodically checks the computer room's
# environment monitor, and shuts down the cluster if the temperature gets
# too high.
primitive CronAmbientTemperature ocf:heartbeat:symlink \
description="Shutdown cluster if A/C stops" \
params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" \
backup_suffix=".original"
# We don't want to clone this resource; I only want one system to run this script
# at any one time.
colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone
order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature
# Every couple of months, make a backup of the virtual machines' disk images.
primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink \
description="Periodically save copies of the virtual machines" \
params link="/etc/cron.d/backup-virtual-disk-images" \
target="/var/nevis/etc/cron.d/backup-virtual-disk-images" \
backup_suffix=".original"
colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone
order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages
cib commit cron
quit
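# (Sanity check, not part of the original transcript: on the node running
# CronAmbientTemperature, the symlink should now be in place.)
ls -l /etc/cron.d/ambient-temperature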
# These are the most important resources on the HA cluster: the virtual
# machines.
crm
cib new vm
# In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means
# "Linux Standard Base", which in turn means any script located in
# /etc/init.d.
primitive Libvirtd lsb:libvirtd
# libvirtd looks for configuration files that define the virtual machines.
# These files are kept in /var/nevis, like the above cron scripts, and are
# "placed" via symlinks.
primitive SymlinkEtcLibvirt ocf:heartbeat:symlink \
params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original"
primitive SymlinkQemuSnapshot ocf:heartbeat:symlink \
params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" \
backup_suffix=".original"
# Again, define a group for these resources, clone the group so they
# run on both nodes, and make sure they don't run unless the
# filesystems are mounted.
group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd
clone LibvirtdClone LibvirtdGroup
colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone
# A tweak: some virtual machines require the directories exported
# by the exportfs resources defined above. Don't start the VMs until
# the exports are complete.
order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone
# The typical definition of a resource that runs a VM. I won't list
# them all, just the one for the mail server. Note that all the
# virtual-machine resource names start with VM_, so they'll show
# up next to each other in the output of "crm configure show".
# VM migration is a neat feature. If pacemaker has the chance to move
# a virtual machine, it can transmit it to another node without stopping it
# on the source node and restarting it at the destination. If a machine
# crashes, migration can't happen, but it can greatly speed up the
# controlled shutdown of a node.
primitive VM_franklin ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/franklin.xml" \
migration_transport="ssh" meta allow-migrate="true"
# We don't want to clone the VMs; it will just confuse things if there are
# two mail servers (with the same IP address!) running at the same time.
colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone
order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin
cib commit vm
quit
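# (Sanity check, not part of the original transcript: on the node hosting the
# mail server, libvirt should list the franklin domain as running.)
virsh list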
# A less-critical resource is tftp. As above, we define the basic xinetd
# resource found in /etc/init.d, put its configuration file in place with a symlink,
# then clone the resource and specify that it can't run until the filesystems
# are mounted.
crm
cib new tftp
primitive Xinetd lsb:xinetd
primitive SymlinkTftp ocf:heartbeat:symlink \
params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" \
backup_suffix=".original"
group TftpGroup SymlinkTftp Xinetd
clone TftpClone TftpGroup
colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone
order Filesystem_Before_Tftp inf: FilesystemClone TftpClone
cib commit tftp
quit
# More important is dhcpd, which assigns IP addresses dynamically.
# Many systems at Nevis require a DHCP server for their IP address,
# including the wireless routers. This follows the same pattern as above,
# except that we don't clone the dhcpd daemon, since we want only
# one DHCP server at Nevis.
crm
cib new dhcp
primitive Dhcpd lsb:dhcpd
# Associate an IP address with the DHCP server. This is a mild
# convenience for the times I update the list of MAC addresses
# to be assigned permanent IP addresses.
primitive IP_dhcp ocf:heartbeat:IPaddr2 \
params ip="10.44.107.11" cidr_netmask="32" nic="eth2"
primitive SymlinkDhcpdConf ocf:heartbeat:symlink \
params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" \
backup_suffix=".original"
primitive SymlinkDhcpdLeases ocf:heartbeat:symlink \
params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" \
backup_suffix=".original"
primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink \
params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd"\
backup_suffix=".original"
group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp
colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone
order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup
cib commit dhcp
quit
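# (Sanity check, not part of the original transcript: on whichever node is
# running DhcpGroup, the dhcpd daemon should now be up.)
/sbin/service dhcpd status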
# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed.
# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.
# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.
# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.
# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power, and the machine will never come back up on its own.)
# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.
# The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links
# to this script from /usr/sbin/fence_nut.
# The following commands implement the STONITH mechanism for our cluster:
crm
cib new stonith
# The STONITH resource that can potentially shut down hypatia.
primitive StonithHypatia stonith:fence_nut \
params stonith-timeout="120s" pcmk_host_check="static-list" \
pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" \
password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
noverifyonoff="1" debug="1"
# The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu
# The STONITH resource that can potentially shut down orestes.
primitive StonithOrestes stonith:fence_nut \
params stonith-timeout="120s" pcmk_host_check="static-list" \
pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" \
password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
noverifyonoff="1" debug="1"
# Again, orestes cannot be the node that runs the above script.
location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu
cib commit stonith
quit
# Now turn the STONITH mechanism on for the cluster.
crm configure property stonith-enabled=true
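# (Sanity check, not part of the original transcript: the two fence_nut
# devices should now be registered with the cluster; a command along these
# lines should list them.)
stonith_admin -L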
Again, the final configuration that results from the above commands is on pastebin.