Nevis particle-physics administrative cluster configuration
Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, and many hours of downtime due to problems with the HA software itself.
This mailing-list post has some details.
This is a reference page. It contains a text file that describes how the high-availability Pacemaker/Corosync configuration was set up on two administrative servers, hypatia and orestes.
Files
Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:
/etc/drbd.conf
/etc/drbd.d/*.res
/etc/lvm/lvm.conf
/etc/corosync/corosync.conf
/home/bin/nut.sh
/home/bin/rsync-config.sh # Daily rsync from hypatia to orestes.
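For reference, the daily sync done by rsync-config.sh amounts to copying the HA configuration files above from hypatia to orestes. A minimal sketch only (this is not the actual script; the destination directory and option list are assumptions):
#!/bin/sh
# Sketch: mirror the HA configuration files to the standby node (assumed paths).
rsync -a /etc/drbd.conf /etc/drbd.d /etc/lvm/lvm.conf /etc/corosync/corosync.conf \
    orestes:/root/hypatia-config/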
Commands
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:
crm configure show
To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
crm_mon
For a GUI, you can use the crm_gui utility shown below. You have to select "Connect" and log in via an account that's a member of the haclient group; you may have to edit /etc/group.
crm_gui &
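If an account isn't yet in the haclient group, adding it with usermod (run as root) has the same effect as editing /etc/group by hand; the username here is only an example:
usermod -a -G haclient jsmith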
You can run the above commands via sudo, but you'll have to extend your path; e.g.,
export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon
Concepts
This may help as you work your way through the configuration:
crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
cidr_netmask=32 op monitor interval=30s
# Which is composed of
* crm ::= "cluster resource manager", the command we're executing
* primitive ::= The type of resource object that we're creating.
* IP ::= Our name for the resource
* IPaddr2 ::= The script to call
* ocf ::= The standard it conforms to
* ip=192.168.85.3 ::= Parameter(s) as name/value pairs
* cidr_netmask ::= netmask; 32-bits means use this exact IP address
* op ::= what follows is an operation definition
* monitor interval=30s ::= check every 30 seconds that this resource is working
# ... timeout = how long to wait before you assume a resource is dead.
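For example, a timeout can be added to the monitor operation of the same primitive; the values here are illustrative, not taken from the actual configuration:
crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
    cidr_netmask=32 op monitor interval=30s timeout=20s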
How to find out which scripts exist, that is, which resources can be controlled by the HA cluster (run these from within the crm shell, or prefix them with "crm" from a regular root shell):
ra classes
Based on the result, I looked at:
ra list ocf heartbeat
To find out what IPaddr2 parameters I needed, I used:
ra meta ocf:heartbeat:IPaddr2
Configuration
This work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
# The commands ultimately used to configure the high-availability (HA) servers:
# The beginning: make sure corosync is running on both hypatia and orestes:
/sbin/service corosync start
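# (Assumption, not a step recorded in the original notes: to have the cluster
# stack come back after a reboot, corosync would normally also be enabled at
# boot on a CentOS/RHEL system of this era.)
/sbin/chkconfig corosync on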
# The following line is needed because we have only two machines in
# the HA cluster.
crm configure property no-quorum-policy=ignore
# We'll configure STONITH later (see below)
crm configure property stonith-enabled=false
# Define IP addresses to be managed by the HA systems.
crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 \
cidr_netmask=32 op monitor interval=30s
crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 \
cidr_netmask=32 op monitor interval=30s
crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 \
cidr_netmask=32 op monitor interval=30s
# Group these together, so they'll all be assigned to the same machine.
# The name of the group is "MainIPGroup".
crm configure group MainIPGroup ClusterIP LocalIP SandboxIP
# Let's continue by entering the crm utility for short sessions. I'm going to
# test groups of commands before I commit them. (I omit the "configure show"
# and "status" commands that I frequently typed in, in order to see that
# everything was correct.)
# DRBD is a service that synchronizes the hard drives between two machines.
# For our cluster, one machine will have access to the "master" copy
# and make all the changes to that copy; the other machine will have the
# "slave" copy and mindlessly duplicate all the changes.
# I previously configured the DRBD resources 'admin' and 'work'. What the
# following commands do is put the maintenance of these resources under
# the control of Pacemaker.
crm
# Define a "shadow" configuration, to test things without committing them
# to the HA cluster:
cib new drbd
# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/
configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s
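# (For orientation only: a DRBD resource definition such as 'admin' in
# /etc/drbd.d/ looks roughly like the sketch below. The device, backing disk,
# and addresses are placeholders, not the values actually used at Nevis.
#
#   resource admin {
#     device    /dev/drbd1;
#     disk      /dev/sda5;
#     meta-disk internal;
#     on hypatia.nevis.columbia.edu { address 10.44.7.101:7788; }
#     on orestes.nevis.columbia.edu { address 10.44.7.102:7788; }
#   }
# )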
# DRBD functions with a "master/slave" setup as described above. The following command
# defines the name of the master disk partition ("Admin"). The remaining parameters
# clarify that there are two copies, but only one can be the master, and
# at most one can be a slave.
configure master Admin AdminDrbd meta master-max=1 master-node-max=1 \
clone-max=2 clone-node-max=1 notify=true globally-unique=false
# The machine that gets the master copy (the one that will make changes to the drive)
# should also be the one with the main IP address.
configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master
# We want to wait before assigning IPs to a node until we know that
# Admin has been promoted to master on that node.
configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup
# I like these commands, so commit them to the running configuration.
cib commit drbd
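# (To confirm the commit at the DRBD level, the device state can be checked
# from a separate root shell, outside of crm, with the standard DRBD tools:
#    cat /proc/drbd
#    drbdadm role admin
# Neither command is part of the Pacemaker configuration itself.)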
# Things look good, so let's add another disk resource. I defined another drbd resource
# with some spare disk space, called "work". The idea is that I can play with alternate
# virtual machines and save them on "work" before I copy them to the more robust "admin".
configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s
configure master Work WorkDrbd meta master-max=1 master-node-max=1 \
clone-max=2 clone-node-max=1 notify=true globally-unique=false
# I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is
# a weighting factor; compare it to "inf:" (for infinity), which is used in most of these commands.
configure colocation WorkPrefersMain 500: Work:Master MainIPGroup
# Given a choice, try to put the Admin:Master on hypatia
configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu
cib commit drbd
quit
# Now try a resource that depends on ordering: On the node that has the master
# resource for "work," mount that disk image as /work.
crm
cib new workdisk
# To find out that there was an "ocf:heartbeat:Filesystem" that I could use,
# I used the command:
ra classes
# Based on the result, I looked at:
ra list ocf heartbeat
# To find out what Filesystem parameters I needed, I used:
ra meta ocf:heartbeat:Filesystem
# All of the above led me to create the following resource configuration:
configure primitive WorkDirectory ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/work" fstype="ext4"
# Note that I had previously created an ext4 filesystem on /dev/drbd2.
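# (Creating that filesystem was a one-time step from a root shell on the node
# where 'work' was primary; in its simplest form, something like:
#    mkfs -t ext4 /dev/drbd2
# This only shows the general form, not the exact options used.)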
# Now specify that we want this to be on the same node as Work:Master:
configure colocation DirectoryWithWork inf: WorkDirectory Work:Master
# One more thing: It's important that we not try to mount the directory
# until after Work has been promoted to master on the node.
# A score of "inf" means "infinity"; if the DRBD resource 'work' can't
# be set up, then don't mount the /work partition.
configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start
cib commit workdisk
quit
# We've made the relatively unimportant DRBD resource 'work' function. Now let's do the same for 'admin'.
# Previously I created some LVM volumes on the admin DRBD master. We need to use a
# resource to activate them, but we can't activate them until after the Admin:Master
# is loaded.
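# (Those LVM volumes were created earlier from a root shell on the node where
# the 'admin' DRBD resource was primary. The general recipe was along these
# lines; the DRBD device name and the sizes are placeholders, not the real values:
#    pvcreate /dev/drbd1
#    vgcreate admin /dev/drbd1
#    lvcreate -L 100G -n usr admin
#    lvcreate -L 100G -n var admin
#    ... and similarly for the mail, xen, and nfs volumes.)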
crm
cib new lvm
# Activate the LVM volumes, but only after DRBD has figured out where
# Admin:Master is located.
configure primitive Lvm ocf:heartbeat:LVM \
params volgrpname="admin"
configure colocation LvmWithAdmin inf: Lvm Admin:Master
configure order AdminBeforeLvm inf: Admin:promote Lvm:start
cib commit lvm
# Go back to the actual, live configuration
cib use live
# See if everything is working
configure show
status
# Go back to the shadow for more commands.
cib use lvm
# We have a whole bunch of filesystems on the "admin" volume group. Let's
# create the commands to mount them.
# The 'timeout="240s"' piece gives a four-minute interval to start
# up the mount. This allows for an "it's been too long, do an fsck" check
# when mounting the filesystem.
# We also allow five minutes for the unmount to complete, just in case
# it's taking a while for some job on the server to let go of the mount.
# It's better for the service switch-over to take a while
# than for the mount to be forcibly terminated.
configure primitive UsrDirectory ocf:heartbeat:Filesystem \
params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="300s"
configure primitive VarDirectory ocf:heartbeat:Filesystem \
params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="300s"
configure primitive MailDirectory ocf:heartbeat:Filesystem \
params device="/dev/admin/mail" directory="/mail" fstype="ext4" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="300s"
configure primitive XenDirectory ocf:heartbeat:Filesystem \
params device="/dev/admin/xen" directory="/xen" fstype="ext4" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="300s"
configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory
# We can't mount any of them until LVM is set up:
configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm
configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup
cib commit lvm
quit
# Some standard Linux services are under corosync's control. They depend on some or
# all of the filesystems being mounted.
# Let's start with a simple one: enable the printing service (cups):
crm
cib new printing
# lsb = "Linux Standard Base." It just means any service which is
# controlled by one of the standard scripts in /etc/init.d.
configure primitive Cups lsb:cups
# Cups stores its spool files in /var/spool/cups. If the cups service
# were to switch to a different server, we want the new server to see
# the spooled files. So create /var/nevis/cups, link it with:
# mv /var/spool/cups /var/spool/cups.ori
# ln -sf /var/nevis/cups /var/spool/cups
# and demand that the cups service only start if /var/nevis (and the other
# high-availability directories) have been mounted.
configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup
# In order to prevent chaos, make sure that the high-availability directories
# have been mounted before we try to start cups.
configure order VarBeforeCups inf: AdminDirectoriesGroup Cups
cib commit printing
quit
# The other services (xinetd, dhcpd) follow the same pattern as above:
# Make sure the services start on the same machine as the admin directories,
# and after the admin directories are successfully mounted.
crm
cib new services
configure primitive Xinetd lsb:xinetd
configure primitive Dhcpd lsb:dhcpd
configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup
configure order VarBeforeXinetd inf: VarDirectory Xinetd
configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup
configure order VarBeforeDhcpd inf: VarDirectory Dhcpd
cib commit services
quit
# The high-availability servers export some of the admin directories to other
# systems, both real and virtual; for example, the /usr/nevis directory is
# exported to all the other machines on the Nevis Linux cluster.
# NFS exporting of a shared directory can be a little tricky. As with CUPS
# spooling, we want to preserve the NFS export state in a way that the
# backup server can pick it up. The safest way to do this is to create a
# small separate LVM partition ("nfs") and mount it as "/var/lib/nfs",
# the NFS directory that contains files that keep track of the NFS state.
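# (The export list itself lives in /etc/exports, which Pacemaker does not manage.
# As a rough illustration only, with a placeholder client list and options,
# an export of the shared directory would look something like:
#    /usr/nevis   129.236.252.0/24(rw,sync)
# )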
crm
cib new nfs
# Define the mount for the NFS state directory /var/lib/nfs
configure primitive NfsStateDirectory ocf:heartbeat:Filesystem \
params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4"
configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory
# Now that the NFS state directory is mounted, we can start nfslockd. Note
# that we're starting NFS lock on both the primary and secondary HA systems;
# by default a "clone" resource is started on all systems in a cluster.
# (Placing nfslockd under the control of Pacemaker turned out to be key to
# successful transfer of cluster services to another node. The nfslockd and
# nfs daemon information stored in /var/lib/nfs have to be consistent.)
configure primitive NfsLockInstance lsb:nfslock
configure clone NfsLock NfsLockInstance
configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock
# Once nfslockd has been set up, we can start NFS. (We say to colocate
# NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd
# is going to be started on both nodes.)
configure primitive Nfs lsb:nfs
configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
configure order NfsLockBeforeNfs inf: NfsLock Nfs
cib commit nfs
quit
# The whole point of the entire setup is to be able to run guest virtual machines
# under the control of the high-availability service. Here is the set-up for one example
# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.
# I duplicated the same procedure for franklin (mail server), ada (web server), and
# so on, but I don't show that here.
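# (A Xen guest configuration such as /xen/configs/hogwarts.cfg is a short file
# describing the virtual machine. The sketch below only shows the general shape;
# every value in it is a placeholder rather than the real hogwarts configuration:
#
#   name       = "hogwarts"
#   memory     = 2048
#   vcpus      = 2
#   disk       = [ "file:/xen/images/hogwarts.img,xvda,w" ]
#   vif        = [ "bridge=xenbr0" ]
#   bootloader = "/usr/bin/pygrub"
# )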
crm
cib new hogwarts
# Give the virtual machine a long stop timeout before flagging an error.
# Sometimes it takes a while for Linux to shut down.
configure primitive Hogwarts ocf:heartbeat:Xen params \
xmfile="/xen/configs/hogwarts.cfg" \
op stop interval="0" timeout="240"
# All the virtual machine files are stored in the /xen partition, which is one
# of the high-availability admin directories. The virtual machine must run on
# the system with this directory.
configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup
# All of the virtual machines depend on NFS-mounting directories which
# are exported by the HA server. The safest thing to do is to make sure
# NFS is running on the HA server before starting the virtual machine.
configure order NfsBeforeHogwarts inf: Nfs Hogwarts
cib commit hogwarts
quit
# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed.
# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.
# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.
# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.
# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)
# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.
# The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links
# to this script from the stonith/external directory.
# By the way, I sent the script to Cluster Labs, who accepted it.
# The next generation of their distribution will include the script.
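# (The "appropriate links" mentioned above are symbolic links from the external
# STONITH plugin directory to the script, made roughly like this; the plugin
# directory path is an assumption and may differ by distribution:
#    ln -s /home/bin/nut.sh /usr/lib64/stonith/plugins/external/nut
# )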
# The following commands implement the STONITH mechanism for our cluster:
crm
cib new stonith
# The STONITH resource that can potentially shut down hypatia.
configure primitive HypatiaStonith stonith:external/nut \
params hostname="hypatia.nevis.columbia.edu" \
ups="hypatia-ups" username="admin" password="acdc"
# The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu
# The STONITH resource that can potentially shut down orestes.
configure primitive OrestesStonith stonith:external/nut \
params hostname="orestes.nevis.columbia.edu" \
ups="orestes-ups" username="admin" password="acdc"
# Again, orestes cannot be the node that runs the above script.
configure location OrestesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu
cib commit stonith
quit
# Now turn the STONITH mechanism on for the cluster.
crm configure property stonith-enabled=true
# At this point, the key elements of the high-availability configuration have
# been set up. There is one non-critical frill: One node (probably hypatia) will be
# running the important services, while the other node (probably orestes) would
# be "twiddling its thumbs." Instead, let's have orestes do something useful: execute
# condor jobs.
# For orestes to do this, it requires the condor service. It also requires that
# library:/usr/nevis is mounted, the same as every other batch machine on the
# Nevis condor cluster. We can't use the automount daemon (amd) to do this for
# us, the way we do on the other batch nodes; we have to make corosync do the
# mounts.
crm
cib new condor
# Mount library:/usr/nevis. A bit of naming confusion here: there's a /work
# partition on the primary node, but the name 'LibraryOnWork' means that
# the nfs-mount of /usr/nevis is located on the secondary or "work" node.
configure primitive LibraryOnWork ocf:heartbeat:Filesystem \
params device="library:/usr/nevis" directory="/usr/nevis" \
fstype="nfs"
# Corosync must not NFS-mount library:/usr/nevis on the system that has already
# mounted /usr/nevis directly as part of AdminDirectoriesGroup
# described above.
# Note that if there's only one node remaining in the high-availability
# cluster, it will be running the resource AdminDirectoriesGroup, and
# LibraryOnWork will never be started. This is fine; if there's only one
# node left, I _don't_ want it running condor jobs.
configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup
# Only NFS-mount library:/usr/nevis _after_ the NFS
# export of /usr/nevis has been set up.
configure order NfsBeforeLibrary inf: Nfs LibraryOnWork
# Define the IPs associated with the backup system, and group them together.
# This is a non-critical definition, and I don't want to assign it until the more important
# "secondary" resources have been set up.
configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 \
cidr_netmask=32 op monitor interval=30s
configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 \
cidr_netmask=32 op monitor interval=30s
configure group AssistantIPGroup Burr BurrLocal
configure colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork
configure order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup
# The standard condor execution service. As with all the batch nodes,
# I've already configured /etc/condor/condor_config.local and created
# scratch directories in /data/condor.
configure primitive Condor lsb:condor
# If we're able to mount library:/usr/nevis, then it's safe to start condor.
# If we can't mount library:/usr/nevis, then condor will never be started.
# (We stated above that AssistantIPGroup won't start until after LibraryOnWork).
configure colocation CondorWithAssistant inf: Condor AssistantIPGroup
configure order AssistantBeforeCondor inf: AssistantIPGroup Condor
cib commit condor
quit