Nevis particle-physics administrative cluster configuration

Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, and many hours of downtime due to problems with the HA software itself. This mailing-list post has some details.

This is a reference page. It contains a text file that describes how the high-availability pacemaker configuration was set up on two administrative servers, hypatia and orestes.

Files

Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files!

/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)

The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://pastebin.com/u/wgseligman and searching for 20130103.

One-time set-up

The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch and Redhat Cluster Tutorial.

Warning: Do not type any of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.

DRBD set-up

Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res; a rough sketch of what the resource file looks like follows.
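This is not the actual admin.res (that file is on pastebin with the others listed above); it's a minimal sketch of what a dual-primary DRBD 8.4 resource looks like. The backing partitions and IP addresses are placeholders, and the "on" host names must match each node's uname -n.

resource admin {
  device    /dev/drbd0;
  meta-disk internal;

  net {
    protocol C;
    allow-two-primaries yes;   # dual-primary: both nodes may hold the disk as Primary
  }

  startup {
    become-primary-on both;
  }

  # Placeholder backing devices and addresses; the real values are in the pastebin copy.
  on hypatia.nevis.columbia.edu {
    disk    /dev/sda3;
    address 10.44.7.1:7788;
  }
  on orestes.nevis.columbia.edu {
    disk    /dev/sda3;
    address 10.44.7.2:7788;
  }
}

Then on hypatia: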

/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Then on orestes:

/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Back to hypatia:

/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd

Keep looking at the contents of /proc/drbd. It will take a while, but eventually the two disks will sync up.
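
One convenient way to watch the sync (the 10-second interval is arbitrary):

watch -n 10 cat /proc/drbd

While the sync is running, the cs: field shows SyncSource on one node and SyncTarget on the other, along with a progress percentage; when it finishes, both sides report cs:Connected and ds:UpToDate/UpToDate.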

Back to orestes:

/sbin/drbdadm primary admin
cat /proc/drbd

The result should be something like this:

# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Here's a guide to understanding the contents of /proc/drbd.

Clustered LVM setup

Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch and Redhat Cluster Tutorial for details.

  • Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
    • For lvm locking:
      locking_type = 3

  • Edit /etc/sysconfig/cman to disable quorum (because there are only two nodes in the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

  • Create a physical volume and a clustered volume group on the DRBD partition. The name of the DRBD disk is /dev/drbd0; the name of the volume group is ADMIN.
    pvcreate /dev/drbd0
    vgcreate -c y ADMIN /dev/drbd0

  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example, the following creates a logical volume usr within the volume group ADMIN:
    lvcreate -L 200G -n usr ADMIN 
    mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
    Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf.

  • Make sure that the cman, clvmd, and pacemaker daemons will start at boot; on both nodes, do:
    /sbin/chkconfig cman on
    /sbin/chkconfig clvmd on
    /sbin/chkconfig pacemaker on

  • Reboot both nodes.
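
After the reboot, a quick sanity check that clustered LVM came back up; on each node:

/sbin/service clvmd status
vgdisplay ADMIN

The vgdisplay output should show the ADMIN volume group as clustered.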

Pacemaker configuration

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

crm configure show

To see the status of all the resources:

crm resource status

To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):

crm_mon

You can run the above commands via sudo, but you'll have to extend your path; e.g.,

export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon

Concepts

This may help as you work your way through the configuration:

crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
   cidr_netmask=32 op monitor interval=30s timeout=60s

# Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
    * primitive ::= The type of resource object that we’re creating.
    * MyIPResource ::= Our name for the resource
    * IPaddr2 ::= The script to call
    * ocf ::= The standard it conforms to
    * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
    * cidr_netmask ::= netmask; 32-bits means use this exact IP address
    * op ::= what follows are options
    * monitor interval=30s ::= check every 30 seconds that this resource is working
    * timeout ::= how long to wait before you assume an "op" is dead. 

How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

crm ra classes

Based on the result, I looked at:

crm ra list ocf heartbeat

To find out what IPaddr2 parameters I needed, I used:

crm ra meta ocf:heartbeat:IPaddr2

Initial configuration guide

This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/QcxuvfK0.

# The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure pacemaker is running on both hypatia and orestes:

/sbin/service pacemaker status
crm node status
crm resource status

# We'll configure STONITH later (see below)

crm configure property stonith-enabled=false

# Let's continue by entering the crm utility for short sessions. I'm going to 
# test groups of commands before I commit them. I omit the "crm configure show" 
# and "crm status" commands that I frequently typed in, in order to see that 
# everything was correct. 

# I also omit some of the standard resource options
# (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the
# commands look simpler. You can see the
# complete list with "crm configure show".  
   
# DRBD is a service that synchronizes the hard drives between two machines.
# When one machine makes any change to the DRBD disk, the other machine 
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.

# Start by entering the resource manager. 

crm

   # Define a "shadow" configuration, to test things without committing them
   # to the HA cluster:
   cib new drbd
   
   # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res
   
   primitive AdminDrbd ocf:linbit:drbd \
      params drbd_resource="admin" \
      meta target-role="Master"
   
   # The following resource defines how the DRBD resource (AdminDrbd) is to be
   # duplicated ("cloned") among the nodes. The parameters clarify that there are 
   # two copies, one on each node, and both can be the master.
   
   ms AdminClone AdminDrbd \
      meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" \
      notify="true" interleave="true"

   configure show

   # Looks good, so commit the change. 
   cib commit drbd
   quit
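
# A quick check that pacemaker now manages DRBD and promotes it on both nodes
# (the exact output format varies with the pacemaker version):

crm_mon -1 | grep -A 2 AdminClone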

# Now define resources that depend on ordering.
crm
   cib new disk
   
   # The DRBD disk is available to the system. The next step is to tell LVM
   # that the volume group ADMIN exists on the disk. 

   # To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
   # I used the command:
   ra classes
   
   # Based on the result, I looked at:
   
   ra list ocf heartbeat
   
   # To find out what LVM parameters I needed, I used:
   
   ra meta ocf:heartbeat:LVM
   
   # All of the above led me to create the following resource configuration:
   
   primitive AdminLvm ocf:heartbeat:LVM \
      params volgrpname="ADMIN" 

   # After I set up the volume group, I want to mount the logical volumes
   # (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
   # note that I begin all the filesystem resources with FS so they'll be next
   # to each other when I type "crm configure show".

   primitive FSUsrNevis ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" \
      options="defaults,noatime,nodiratime" 
      
   # I have similar definitions for the other logical volumes in volume group ADMIN:
   # /mail, /var/nevis, etc. 
   
   # Now I'm going to define a resource group. The following command means:
   #    - Put all these resources on the same node;
   #    - Start these resources in the order they're listed;
   #    - The resources depend on each other in the order they're listed. For example,
   #       if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.

   group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork

   # We want these logical volumes (or partitions or filesystems) to be available
   # on both nodes. To do this, we define a clone resource. 

   clone FilesystemClone FilesystemGroup meta interleave="true"
   
   # It's important that we not try to set up the filesystems
   # until the DRBD admin resource is running on a node, and has been
   # promoted to master. 
   
   # A score of "inf:" means "infinity": 'FileSystemClone' must be on a node on which
   # 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't
   # be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric
   # values instead of infinity, in which case these constraints become suggestions
   # instead of being mandatory.) 
   
   colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
   order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
   
   cib commit disk
   quit
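
# At this point the GFS2 filesystems should be mounted on both nodes; a quick
# check to run on each node:

mount | grep gfs2
df -h /usr/nevis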

# Once all the filesystems are mounted, we can start other resources. Let's
# define a set of cloned IP addresses that will always point to at least one of the nodes,
# possibly both. 

crm
   cib new ip

   # One address for each network

   primitive IP_cluster ocf:heartbeat:IPaddr2 \
      params ip="129.236.252.11" cidr_netmask="32" nic="eth0"
   primitive IP_cluster_local ocf:heartbeat:IPaddr2 \
      params ip="10.44.7.11" cidr_netmask="32" nic="eth2"
   primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 \
      params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3"

   # Group them together

   group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox

   # The option "globally-unique=true" works with IPTABLES to make
   # sure that ethernet connections are not disrupted even if one of 
   # nodes goes down; see "Clusters From Scratch" for details. 

   clone IPClone IPGroup \
      meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false"

   # Make sure the filesystems are mounted before starting the IP resources.
   colocation IP_With_Filesystem inf: IPClone FilesystemClone
   order Filesystem_Before_IP inf: FilesystemClone IPClone

   cib commit ip
   quit
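
# To see where the cloned IP addresses ended up (normally one instance of the
# IP group runs on each node):

crm_mon -1 | grep IP_cluster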

# We have to export some of the filesystems via NFS before some of the virtual machines
# will be able to run. 

crm
   cib new exports

   # This is an example NFS export resource; I won't list them all here. See
   # "crm configure show" for the complete list.

   primitive ExportUsrNevis ocf:heartbeat:exportfs \
      description="Site-wide applications installed in /usr/nevis" \
      params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" \
      options="ro,no_root_squash,async"

   # Define a group for all the exportfs resources. You can see it's a long list, 
   # which is why I don't list them all explicitly. I had to be careful
   # about the exportfs definitions; despite the locking mechanisms of GFS2, 
   # we'd get into trouble if two external systems tried to write to the same
   # DRBD partition at once via NFS.

   group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc \
      ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW

   # Clone the group so both nodes export the partitions. Make sure the 
   # filesystems are mounted before we export them. 

   clone ExportsClone ExportsGroup
   colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone
   order Filesystem_Before_Exports inf: FilesystemClone ExportsClone

   cib commit exports
   quit
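
# To confirm the NFS exports are active on a node:

exportfs -v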

# Symlinks: There are some scripts that I want to run under cron. These scripts are
# located in the DRBD /var/nevis file system. For them to run via cron, they have to
# be found in /etc/cron.d somehow. A symlink is the easiest way, and there's a
# symlink pacemaker resource to manage this. 

crm
   cib new cron

   # The ambient-temperature script periodically checks the computer room's
   # environment monitor, and shuts down the cluster if the temperature gets
   # too high. 

   primitive CronAmbientTemperature ocf:heartbeat:symlink \
      description="Shutdown cluster if A/C stops" \
      params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" \
      backup_suffix=".original" 

   # We don't want to clone this resource; I only want one system to run this script
   # at any one time.

   colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone
   order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature

   # Every couple of months, make a backup of the virtual machine's disk images.

   primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink \
      description="Periodically save copies of the virtual machines" \
      params link="/etc/cron.d/backup-virtual-disk-images" \
      target="/var/nevis/etc/cron.d/backup-virtual-disk-images" \
      backup_suffix=".original"
   colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone
   order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages

   cib commit cron
   quit
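
# The symlink resources are easy to verify; each managed link should now point
# into the DRBD /var/nevis filesystem:

ls -l /etc/cron.d/ambient-temperature /etc/cron.d/backup-virtual-disk-images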

# These are the most important resources on the HA cluster: the virtual 
# machines.

crm
   cib new vm

   # In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means
   # "Linux Standard Base", which in turn means any script located in 
   # /etc/init.d.

   primitive Libvirtd lsb:libvirtd

   # libvirtd looks for configuration files that define the virtual machines. 
   # These files are kept in /var/nevis, like the above cron scripts, and are
   # "placed" via symlinks.

   primitive SymlinkEtcLibvirt ocf:heartbeat:symlink \
      params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original"
   primitive SymlinkQemuSnapshot ocf:heartbeat:symlink \
      params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" \
      backup_suffix=".original"

   # Again, define a group for these resources, clone the group so they
   # run on both nodes, and make sure they don't run unless the 
   # filesystems are mounted. 

   group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd
   clone LibvirtdClone LibvirtdGroup
   colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone

   # A tweak: some virtual machines require the directories exported
   # by the exportfs resources defined above. Don't start the VMs until
   # the exports are complete.

   order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone

   # The typical definition of a resource that runs a VM. I won't list
   # them all, just the one for the mail server. Note that all the
   # virtual-machine resource names start with VM_, so they'll show
   # up next to each other in the output of "crm configure show".

   # VM migration is a neat feature. If pacemaker has the chance to move
   # a virtual machine, it can transfer it to another node without stopping it
   # on the source node and restarting it at the destination. If a machine
   # crashes, migration can't happen, but it can greatly speed up the 
   # controlled shutdown of a node. 

   primitive VM_franklin ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/franklin.xml" \
      migration_transport="ssh" meta allow-migrate="true"

   # We don't want to clone the VMs; it would just confuse things if there were
   # two mail servers (with the same IP address!) running at the same time.

   colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone
   order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin

   cib commit vm
   quit
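
# With allow-migrate="true" you can also move a running VM by hand; a hedged
# example (subcommand names vary a little between crm shell versions):

crm resource migrate VM_franklin orestes.nevis.columbia.edu

# ...and, when you're done, remove the location constraint that the migration
# created:

crm resource unmigrate VM_franklin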

# A less-critical resource is tftp. As above, we define the basic xinetd
# resource found in /etc/init.d, put its configuration file in place with a symlink, 
# then clone the resource and specify it can't run until the filesystems
# are mounted. 

crm
   cib new tftp

   primitive Xinetd lsb:xinetd
   primitive SymlinkTftp ocf:heartbeat:symlink \
      params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" \
      backup_suffix=".original"

   group TftpGroup SymlinkTftp Xinetd
   clone TftpClone TftpGroup
   colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone
   order Filesystem_Before_Tftp inf: FilesystemClone TftpClone

   cib commit tftp
   quit

# More important is dhcpd, which assigns IP addresses dynamically. 
# Many systems at Nevis require a DHCP server for their IP address,
# including the wireless routers. This follows the same pattern as above,
# except that we don't clone the dhcpd daemon, since we want only
# one DHCP server at Nevis. 

crm
   cib new dhcp
   
   primitive Dhcpd lsb:dhcpd

   # Associate an IP address with the DHCP server. This is a mild
   # convenience for the times I update the list of MAC addresses
   # to be assigned permanent IP addresses. 
   primitive IP_dhcp ocf:heartbeat:IPaddr2 \
      params ip="10.44.107.11" cidr_netmask="32" nic="eth2"

   primitive SymlinkDhcpdConf ocf:heartbeat:symlink \
      params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" \
      backup_suffix=".original"
   primitive SymlinkDhcpdLeases ocf:heartbeat:symlink \
      params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" \
      backup_suffix=".original"
   primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink \
      params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd"\
      backup_suffix=".original"

   group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp
   colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone
   order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup
      
   cib commit dhcp
   quit

# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed. 

# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.

# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.

# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power, and the node will not come back up on its own.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/> 
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links
# to this script from /usr/sbin/fence_nut. 

# The following commands implement the STONITH mechanism for our cluster:

crm
   cib new stonith
   
   # The STONITH resource that can potentially shut down hypatia.
   
   primitive StonithHypatia stonith:fence_nut \
      params stonith-timeout="120s" pcmk_host_check="static-list" \
      pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" \
      password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
      noverifyonoff="1" debug="1"
      
   # The node that runs the above script cannot be hypatia; it's
   # not wise to trust a node to STONITH itself. Note that the score
   # is "negative infinity," which means "never run this resource
   # on the named node."

   location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu

   # The STONITH resource that can potentially shut down orestes.

   primitive StonithOrestes stonith:fence_nut \
      params stonith-timeout="120s" pcmk_host_check="static-list" \
      pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" \
      password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
      noverifyonoff="1" debug="1"

   # Again, orestes cannot be the node that runs the above script.
   
   location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu
   
   cib commit stonith
   quit
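
# Before enabling STONITH cluster-wide, check that each fence resource has
# started, and on the opposite node from the one it fences:

crm_mon -1 | grep -i stonith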

# Now turn the STONITH mechanism on for the cluster.

crm configure property stonith-enabled=true

Again, the final configuration that results from the above commands is on pastebin.
