Nevis particle-physics administrative cluster configuration

This is a reference page: the text below describes how the high-availability Pacemaker configuration was set up on the two administrative servers, hypatia and orestes.

Files

Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files!

/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)

The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://pastebin.com/u/wgseligman and searching for 20130103.

One-time set-up

The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch and Redhat Cluster Tutorial.

Warning: Do not type any of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.

DRBD set-up

Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:

/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Then on orestes:

/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Back to hypatia:

/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd

Keep looking at the contents of /proc/drbd. It will take a while, but eventually the two disks will sync up.
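
If you don't want to keep retyping the command, a simple alternative (assuming the standard watch utility is available) is:

watch -n 10 cat /proc/drbd

The sync is complete when the ds field shows UpToDate/UpToDate.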

Back to orestes:

/sbin/drbdadm primary admin
cat /proc/drbd

The result should be something like this:

# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Here's a guide to understanding the contents of /proc/drbd.
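
As a quick reference, the key fields in the sample output above are:

cs:Connected            the connection state between the two nodes
ro:Primary/Primary      the roles, local node first and then the peer; dual-primary shows Primary/Primary
ds:UpToDate/UpToDate    the disk states, local then peer
oos:0                   amount of storage (in KiB) out of sync; 0 means the disks are fully synchronized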

Clustered LVM setup

Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch and Redhat Cluster Tutorial for details.

  • Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
    • For lvm locking:
      locking_type = 3

  • Edit /etc/sysconfig/cman to disable quorum (because there are only two nodes in the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

  • Create a physical volume and a clustered volume group on the DRBD partition:
    pvcreate /dev/drbd0
    vgcreate -c y ADMIN /dev/drbd0

  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example:
    lvcreate -L 200G -n usr ADMIN # ... and so on
    mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
    Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf.

  • Make sure that the cman, clvmd, and pacemaker daemons will start at boot; on both nodes, do:
    /sbin/chkconfig cman on
    /sbin/chkconfig clvmd on
    /sbin/chkconfig pacemaker on

  • Reboot both nodes.
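
After both nodes are back up, a quick sanity check that clustered LVM came up cleanly might look like this (the "c" in the volume group's attribute bits marks it as clustered):

/sbin/service cman status
/sbin/service clvmd status
vgs ADMIN
lvs ADMIN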

Pacemaker configuration

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

crm configure show

To see the status of all the resources:

crm resource status

To get a constantly-updated display of the resource status, the following command is the cluster equivalent of "top" (use Ctrl-C to exit):

crm_mon

You can run the above commands via sudo, but you'll have to extend your path; e.g.,

export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon

Concepts

This may help as you work your way through the configuration:

crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
   cidr_netmask=32 op monitor interval=30s timeout=60s

# Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
    * primitive ::= The type of resource object that we’re creating.
    * MyIPResource ::= Our name for the resource
    * IPaddr2 ::= The script to call
    * ocf ::= The standard it conforms to
    * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
    * cidr_netmask ::= netmask; 32 bits means use this exact IP address
    * op ::= what follows are operations
    * monitor interval=30s ::= check every 30 seconds that this resource is working
    * timeout ::= how long to wait before you assume an "op" is dead. 

How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

crm ra classes

Based on the result, I looked at:

crm ra list ocf heartbeat

To find out what IPaddr2 parameters I needed, I used:

crm ra meta ocf:heartbeat:IPaddr2

Initial configuration guide

This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

# The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure pacemaker is running on both hypatia and orestes:

/sbin/service pacemaker status
crm node status
crm resource status

# We'll configure STONITH later (see below)

crm configure property stonith-enabled=false

# Let's continue by entering the crm utility for short sessions. I'm going to
# test groups of commands before I commit them. I omit the "crm configure show"
# and "crm status" commands that I frequently typed to check that everything
# was correct.

# I also omit some of the standard resource options
# (e.g., op monitor interval="20" timeout="40" depth="0") to make the
# commands look simpler. You can see the complete list with "crm configure show".
   
# DRBD is a service that synchronizes the hard drives between two machines.
# When one machine makes any change to the DRBD disk, the other machine 
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.

# Start by entering the resource manager. 

crm

   # Define a "shadow" configuration, to test things without committing them
   # to the HA cluster:
   cib new drbd
   
   # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res
   
   primitive AdminDrbd ocf:linbit:drbd \
      params drbd_resource="admin" \
      meta target-role="Master"
   
   # The following resource defines how the DRBD resource (AdminDrbd) is to be
   # duplicated ("cloned") among the nodes. The parameters specify that there are
   # two copies, one on each node, and both can be the master.
   
   ms AdminClone AdminDrbd \
      meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" \
      notify="true" interleave="true"

   configure show

   # Looks good, so commit the change. 
   cib commit drbd
   quit
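
# Before moving on, it's worth confirming that the clone was promoted on both
# nodes; either of these (run as root on either node) will show it:

crm resource status AdminClone
crm_mon -1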

# Now define resources that depend on ordering.
crm
   cib new disk
   
   # The DRBD disk is available to the system. The next step is to tell LVM
   # that the volume group ADMIN exists on the disk. 

   # To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
   # I used the command:
   ra classes
   
   # Based on the result, I looked at:
   
   ra list ocf heartbeat
   
   # To find out what LVM parameters I needed, I used:
   
   ra meta ocf:heartbeat:LVM
   
   # All of the above led me to create the following resource configuration:
   
   primitive AdminLvm ocf:heartbeat:LVM \
      params volgrpname="ADMIN" 

   # After I set up the volume group, I want to mount the logical volumes
   # (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
   # note that I begin all the filesystem resources with FS so they'll be next
   # to each other when I type "crm configure show".

   primitive FSUsrNevis ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" \
      options="defaults,noatime,nodiratime" 
      
   # I have similar definitions for the other logical volumes in volume group ADMIN:
   # /mail, /var/nevis, etc. 
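
   # As an illustration, the /var/nevis filesystem follows the same pattern;
   # the device path below assumes the logical volume is named "var" (check
   # "lvs ADMIN" for the actual name):

   primitive FSVarNevis ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ADMIN-var" directory="/var/nevis" fstype="gfs2" \
      options="defaults,noatime,nodiratime"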
   
   # Now I'm going to define a resource group. The following command means:
   #    - Put all these resources on the same node;
   #    - Start these resources in the order they're listed;
   #    - The resources depend on each other in the order they're listed. For example,
   #       if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.

   group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork

   # We want these logical volumes (or partitions or filesystems) to be available
   # on both nodes. To do this, we define a clone resource. 

   clone FilesystemClone FilesystemGroup meta interleave="true"
   
   # It's important that we not try to set up the filesystems
   # until the DRBD admin resource is running on a node, and has been
   # promoted to master. 
   
   # A score of "inf:" means "infinity": 'FileSystemClone' must be on a node on which
   # 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't
   # be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric
   # values instead of infinity, in which case these constraints become suggestions
   # instead of being mandatory.) 
   
   colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
   order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
   
   cib commit disk
   quit

# Once all the filesystems are mounted, we can start other resources. Let's
# define a set of cloned IP addresses that will always point to at least one of the nodes,
# possibly both. 

crm
   cib new ip

   # One address for each network

   primitive IP_cluster ocf:heartbeat:IPaddr2 \
      params ip="129.236.252.11" cidr_netmask="32" nic="eth0"
   primitive IP_cluster_local ocf:heartbeat:IPaddr2 \
      params ip="10.44.7.11" cidr_netmask="32" nic="eth2"
   primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 \
      params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3"

   # Group them together

   group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox

   # The option "globally-unique=true" works with IPTABLES to make
   # sure that ethernet connections are not disrupted even if one of
   # the nodes goes down; see "Clusters From Scratch" for details.

   clone IPClone IPGroup \
      meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false"

   # Make sure the filesystems are mounted before starting the IP resources.
   colocation IP_With_Filesystem inf: IPClone FilesystemClone
   order Filesystem_Before_IP inf: FilesystemClone IPClone

   cib commit ip
   quit
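
# A quick spot check that the cloned addresses are actually up (run on either
# node; the interface and address are the ones from the primitives above):

ip addr show eth0 | grep 129.236.252.11
crm resource status IPClone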

# We have to export some of the filesystems via NFS before some of the virtual machines
# will be able to run. 

crm
   cib new exports

   # This is an example NFS export resource; I won't list them all here. See
   # "crm configure show" for the complete list.

   primitive ExportUsrNevis ocf:heartbeat:exportfs \
      description="Site-wide applications installed in /usr/nevis" \
      params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" \
      options="ro,no_root_squash,async"

   # Define a group for all the exportfs resources. You can see it's a long list, 
   # which is why I don't list them all explicitly. I had to be careful
   # about the exportfs definitions; despite the locking mechanisms of GFS2,
   # we'd get into trouble if two external systems tried to write to the same
   # DRBD partition at once via NFS.

   group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc \
      ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW

   # Clone the group so both nodes export the partitions. Make sure the 
   # filesystems are mounted before we export them. 

   clone ExportsClone ExportsGroup
   colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone
   order Filesystem_Before_Exports inf: FilesystemClone ExportsClone

   cib commit exports
   quit
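
# To spot-check that the exports are active on a node, the standard exportfs
# tool will list them; for example:

exportfs -v | grep /usr/nevis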

# Symlinks: There are some scripts that I want to run under cron. These scripts are
# located in the DRBD /var/nevis file system. For them to run via cron, they have to
# be found in /etc/cron.d somehow. A symlink is the easiest way, and there's a
# symlink pacemaker resource to manage this. 

crm
   cib new cron

   # The ambient-temperature script periodically checks the computer room's
   # environment monitor, and shuts down the cluster if the temperature gets
   # too high. 

   primitive CronAmbientTemperature ocf:heartbeat:symlink \
      description="Shutdown cluster if A/C stops" \
      params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" \
      backup_suffix=".original" 

   # We don't want to clone this resource; I only want one system to run this script
   # at any one time.

   colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone
   order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature

   # Every couple of months, make a backup of the virtual machine's disk images.

   primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink \
      description="Periodically save copies of the virtual machines" \
      params link="/etc/cron.d/backup-virtual-disk-images" \
      target="/var/nevis/etc/cron.d/backup-virtual-disk-images" \
      backup_suffix=".original"
   colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone
   order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages

   cib commit cron
   quit
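
# A simple way to confirm the symlink resources did their job on the node
# running them:

ls -l /etc/cron.d/ambient-temperature /etc/cron.d/backup-virtual-disk-images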

# These are the most important resources on the HA cluster: the virtual 
# machines.

crm
   cib new vm

   # In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means
   # "Linux Standard Base", which in turn means any script located in 
   # /etc/init.d.

   primitive Libvirtd lsb:libvirtd

   # libvirtd looks for configuration files that define the virtual machines. 
   # These files are kept in /var/nevis, like the above cron scripts, and are
   # "placed" via symlinks.

   primitive SymlinkEtcLibvirt ocf:heartbeat:symlink \
      params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original"
   primitive SymlinkQemuSnapshot ocf:heartbeat:symlink \
      params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" \
      backup_suffix=".original"

   # Again, define a group for these resources, clone the group so they
   # run on both nodes, and make sure they don't run unless the 
   # filesystems are mounted. 

   group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd
   clone LibvirtdClone LibvirtdGroup
   colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone

   # A tweak: some virtual machines require the directories exported
   # by the exportfs resources defined above. Don't start the VMs until
   # the exports are complete.

   order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone

   # The typical definition of a resource that runs a VM. I won't list
   # them all, just the one for the mail server. Note that all the
   # virtual-machine resource names start with VM_, so they'll show
   # up next to each other in the output of "crm configure show".

   # VM migration is a neat feature. If pacemaker has the chance to move
   # a virtual machine, it can transmit it to another node without stopping it
   # on the source node and restarting it at the destination. If a machine
   # crashes, migration can't happen, but it can greatly speed up the 
   # controlled shutdown of a node.

   primitive VM_franklin ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/franklin.xml" \ 
      migration_transport="ssh" meta allow-migrate="true"

   # We don't want to clone the VMs; it would just confuse things if there were
   # two mail servers (with the same IP address!) running at the same time.

   colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone
   order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin

   cib commit vm
   quit
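
# With allow-migrate set, a running VM can also be moved by hand (say, before
# taking one node down for maintenance). Something like the following should
# work; "unmigrate" removes the location constraint that "migrate" creates:

crm resource migrate VM_franklin orestes.nevis.columbia.edu
crm resource unmigrate VM_franklin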

# A less-critical resource is tftp. As above, we define the basic xinetd
# resource found in /etc/init.d, include a configuration file via a symlink,
# then clone the resource and specify that it can't run until the filesystems
# are mounted.

crm
   cib new tftp

   primitive Xinetd lsb:xinetd
   primitive SymlinkTftp ocf:heartbeat:symlink \
      params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" \
      backup_suffix=".original"

   group TftpGroup SymlinkTftp Xinetd
   clone TftpClone TftpGroup
   colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone
   order Filesystem_Before_Tftp inf: FilesystemClone TftpClone

   cib commit tftp
   quit

# More important is dhcpd, which assigns IP addresses dynamically. 
# Many systems at Nevis require a DHCP server for their IP address,
# including the wireless routers. This follows the same pattern as above,
# except that we don't clone the dhcpd daemon, since we want only
# one DHCP server at Nevis. 

crm
   cib new dhcp
   
   primitive Dhcpd lsb:dhcpd

   # Associate an IP address with the DHCP server. This is a mild
   # convenience for the times I update the list of MAC addresses
   # to be assigned permanent IP addresses. 
   primitive IP_dhcp ocf:heartbeat:IPaddr2 \
      params ip="10.44.107.11" cidr_netmask="32" nic="eth2"

   primitive SymlinkDhcpdConf ocf:heartbeat:symlink \
      params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" \
      backup_suffix=".original"
   primitive SymlinkDhcpdLeases ocf:heartbeat:symlink \
      params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" \
      backup_suffix=".original"
   primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink \
      params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd" \
      backup_suffix=".original"

   group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp
   colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone
   order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup
      
   cib commit dhcp
   quit
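
# Since Dhcpd isn't cloned, only one node runs it at any given time. To see
# which one, check the resource status:

crm resource status Dhcpd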

# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed. 

# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.

# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.

# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/> 
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links
# to this script from /usr/sbin/fence_nut. 
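
# (If those links ever need to be recreated, for example after a reinstall,
# a plain symlink on each node should be all it takes; this is a sketch, not
# a transcript of what was originally run:)

ln -s /home/bin/fence_nut.pl /usr/sbin/fence_nut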

# The following commands implement the STONITH mechanism for our cluster:

crm
   cib new stonith
   
   # The STONITH resource that can potentially shut down hypatia.
   
   primitive StonithHypatia stonith:fence_nut \
      params stonith-timeout="120s" pcmk_host_check="static-list" \
      pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" \
      password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
      noverifyonoff="1" debug="1"
      
   # The node that runs the above script cannot be hypatia; it's
   # not wise to trust a node to STONITH itself. Note that the score
   # is "negative infinity," which means "never run this resource
   # on the named node."

   location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu

   # The STONITH resource that can potentially shut down orestes.

   primitive StonithOrestes stonith:fence_nut \
      params stonith-timeout="120s" pcmk_host_check="static-list" \
      pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" \
      password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
      noverifyonoff="1" debug="1"

   # Again, orestes cannot be the node that runs the above script.
   
   location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu
   
   cib commit stonith
   quit

# Now turn the STONITH mechanism on for the cluster.

crm configure property stonith-enabled=true
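
# After enabling STONITH, it's worth letting pacemaker validate the live
# configuration; any errors or warnings in the CIB will be printed:

crm_verify -L -V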

